added feature to alumni and follower count #107
base: master
Thanks for the Pull Request! (And sorry for the slow review, I don't have a ton of time these days to dedicate to this). I've added comments on some things I'd like to see change before submission, with some less important comments marked as optional.
if page == "people":
    interval = 2.0
else:
    interval = 0.1

try:
    self.driver.get(f"{self.url}/{page}")
    # people/alumni javascript takes more time to load
    time.sleep(interval)
Instead of a hard-coded sleep, which causes unnecessary delays for people with fast internet and may still be too short for those with slow internet, I'd suggest using `WebDriverWait(self.driver, self.timeout).until(...)`, which you can see an example of in `load_initial`.
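To illustrate why a condition-based wait is preferable to a fixed sleep, here is a minimal pure-Python sketch of the polling idea. It is not Selenium's implementation; `wait_until` and its parameters are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Returns as soon as the condition holds, so fast pages are not delayed,
    and keeps waiting (up to `timeout`) on slow ones.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(poll_interval)
```

With Selenium itself, the same role is played by `WebDriverWait(self.driver, self.timeout).until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))`, where `EC` is `selenium.webdriver.support.expected_conditions` and `selector` would be whichever element signals that the people page has finished loading (an assumption; the right selector depends on the page).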
'.org-grid__content-height-enforcer')
people = text_or_default(content, 'div > div > div > h2')
people = people.replace("employees", "").replace("alumni", "").strip()
return people
All of the other properties return dictionaries of key/value pairs, but this one returns a single string.
Even if only a single key is currently used, this should return a dictionary for consistency with every other property. That will also make it easier to add new fields in the future, if appropriate.
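A rough sketch of what a dictionary-returning version might look like, reusing the string cleanup from the diff. The function name `parse_people_header` and the key `"count"` are hypothetical and not part of the PR.

```python
def parse_people_header(header_text):
    """Turn a header such as '1,234 employees' into a dictionary."""
    # Same cleanup as in the diff: drop the label and surrounding whitespace.
    count = header_text.replace("employees", "").replace("alumni", "").strip()
    # "count" is a hypothetical key name; more fields can be added later.
    return {"count": count}
```

Callers would then read `people["count"]` rather than a bare string, which matches how the other properties are consumed.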
@@ -92,14 +95,20 @@ def overview(self):
overview["name"] = text_or_default(self.overview_soup, "#main h1")
overview['description'] = text_or_default(container, 'section > p')

logo_image_tag = one_or_default(
    banner, '.org-top-card-primary-content__logo')
banner_desp = text_or_default(banner,
I might be missing something, but what is "desp"? Should this be "desc"?
for name in my_company_list:
    sc = scraper.scrape(company=name, people=True)
    overview = sc.overview
    overview['company_name'] = name
optional: The overview already has a `name` field that seems redundant with this. The ID being saved here is more typically referred to as an "id" or "slug", which may be more appropriate field names if you need to save it.
with CompanyScraper() as scraper:
    # Get each company's overview, add to company_data list
    for name in my_company_list:
        sc = scraper.scrape(company=name, people=True)
optional: As a naive reader, it is unclear to me what `sc` is supposed to be. I would suggest `company_info`, or similar.
def scrape(self,
           company,
           org_type="company",
`org_type` is used only to generate a URL, but LinkedIn seems to redirect automatically when you use `/company/school-id`. For example, https://www.linkedin.com/company/harvard-university works (otherwise, your example in `people-to-csv.py` would break).
I think this should be removed as an option for now.
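As a sketch of the simplification, the URL could be built from the company or school ID alone, relying on the redirect described above. The helper name `company_url` and the constant `LINKEDIN_BASE` are hypothetical.

```python
LINKEDIN_BASE = "https://www.linkedin.com"


def company_url(slug, page=""):
    """Build a /company/ URL for a company or school slug.

    Assumes LinkedIn redirects /company/<school-id> to the school page,
    as observed in the review comment.
    """
    url = f"{LINKEDIN_BASE}/company/{slug}"
    # Optionally append a sub-page such as "people".
    return f"{url}/{page}" if page else url
```

For example, `company_url("harvard-university")` yields the URL quoted in the comment, with no `org_type` argument needed.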