forked from eracle/linkedin
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Former-commit-id: ea90323
Showing 13 changed files with 750 additions and 49 deletions.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,44 +1,77 @@ | ||
# Linkedin Scraping | ||
Ubuntu, python3. | ||
|
||
[![built with Selenium](https://img.shields.io/badge/built%20with-Selenium-yellow.svg)](https://github.com/SeleniumHQ/selenium) | ||
[![built with Python3](https://img.shields.io/badge/built%20with-Python3-red.svg)](https://www.python.org/) | ||
|
||
|
||
Scraping software that aims to visit as many LinkedIn user pages as possible. The purpose is to gain visibility, since LinkedIn notifies users when you visit their page. | ||
|
||
Uses: Scrapy, Selenium web driver, Chromium headless and python3. | ||
Uses: Scrapy, Selenium web driver, Chromium headless, docker and python3. | ||
|
||
Tested on Ubuntu 16.04.2 LTS | ||
|
||
|
||
# Install | ||
Docker makes it quick and easy to run the scraper. | ||
|
||
```bash | ||
virtualenv -p python3 .venv | ||
source .venv/bin/activate | ||
### 0. Preparations | ||
|
||
pip install -r requirements.txt | ||
Install docker from the official website [https://www.docker.com/](https://www.docker.com/) | ||
|
||
Install VNC viewer if you do not have one. | ||
For ubuntu, go for vinagre: | ||
|
||
``` | ||
On Ubuntu, sometimes: | ||
```bash | ||
sudo apt-get install python3-dev | ||
sudo apt-get update | ||
sudo apt-get install vinagre | ||
``` | ||
|
||
# Usage: | ||
Rename `conf_template.py` to `conf.py`, fill in your LinkedIn username and password, and type: | ||
Then connect to localhost:5900, password: secret | ||
|
||
### 1. Set your linkedin login and password | ||
|
||
Open `conf.py` and fill the quotes with your credentials. | ||
|
||
### 2. Run and build containers with docker-compose | ||
|
||
Open a terminal, move to the project root folder (usually with the `cd` command), and type: | ||
|
||
```bash | ||
scrapy crawl linkedin | ||
docker-compose up -d --build | ||
``` | ||
|
||
To use headless Chrome instead: | ||
|
||
### 3. See what your bot can do right now | ||
|
||
Run your VNC viewer, and type address and port `localhost:5900`. The password is `secret`. | ||
|
||
### 4. Stop the scraper | ||
|
||
In the same terminal window, type: | ||
```bash | ||
scrapy crawl linkedin -a headless=True | ||
docker-compose down | ||
``` | ||
|
||
|
||
|
||
##### Test: | ||
|
||
Create the selenium server: | ||
```bash | ||
docker run --name selenium -p 4444:4444 -p 5900:5900 --publish-all --shm-size="128M" selenium/standalone-chrome-debug | ||
``` | ||
|
||
python -m unittest selenium_chromium/test.py | ||
|
||
```bash | ||
virtualenv -p python .venv | ||
source .venv/bin/activate | ||
pip install -r requirements.txt | ||
|
||
python -m unittest test.py | ||
|
||
``` | ||
|
||
Stop and delete selenium server: | ||
```bash | ||
docker stop $(docker ps -aq --filter name=selenium) | ||
|
||
docker rm $(docker ps -aq --filter name=selenium) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
version: '3' | ||
services: | ||
web: | ||
command: ["./wait-for-selenium.sh", "http://selenium:4444/wd/hub", "--", "scrapy", "crawl", "linkedin"] | ||
environment: | ||
- PYTHONUNBUFFERED=0 | ||
build: | ||
context: . | ||
dockerfile: ./docker_conf/prod/Dockerfile | ||
depends_on: | ||
- selenium | ||
volumes: | ||
- ./logs:/code/logs | ||
selenium: | ||
container_name: selenium | ||
image: selenium/standalone-chrome-debug | ||
ports: | ||
- "5900:5900" | ||
shm_size: 128M |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
FROM python:3.5 | ||
RUN mkdir /code | ||
RUN mkdir /config | ||
WORKDIR /code | ||
COPY ./requirements.txt /config/ | ||
RUN pip install -r /config/requirements.txt | ||
COPY ./ /code/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
from selenium import webdriver | ||
from selenium.webdriver import DesiredCapabilities | ||
from selenium.webdriver.chrome.options import Options | ||
from selenium.webdriver.common.by import By | ||
from selenium.webdriver.support import expected_conditions as ec | ||
from selenium.webdriver.support.ui import WebDriverWait | ||
|
||
""" | ||
number of seconds used to wait the web page's loading. | ||
""" | ||
WAIT_TIMEOUT = 10 | ||
|
||
|
||
def get_by_xpath(driver, xpath): | ||
""" | ||
Get a web element through the xpath passed by performing a Wait on it. | ||
:param driver: Selenium web driver to use. | ||
:param xpath: xpath to use. | ||
:return: The web element | ||
""" | ||
return WebDriverWait(driver, WAIT_TIMEOUT).until( | ||
ec.presence_of_element_located( | ||
(By.XPATH, xpath) | ||
)) | ||
|
||
|
||
def init_chromium(selenium_host): | ||
selenium_url = 'http://%s:4444/wd/hub' % selenium_host | ||
|
||
print('Initializing chromium, remote url: %s' % selenium_url) | ||
|
||
chrome_options = DesiredCapabilities.CHROME.copy()  # copy so the shared class-level dict is not mutated | ||
# chrome_options.add_argument('--disable-notifications') | ||
|
||
prefs = {"credentials_enable_service": False, "profile.password_manager_enabled": False} | ||
|
||
chrome_options['prefs'] = prefs | ||
|
||
driver = webdriver.Remote(command_executor=selenium_url, | ||
desired_capabilities=chrome_options) | ||
return driver |
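
In `init_chromium`, the `prefs` dictionary is attached at the top level of the capabilities. The following is a minimal sketch, not part of this commit, of one way to build the capabilities without touching a shared dict. It assumes the driver reads Chrome preferences from the `goog:chromeOptions` capability (older ChromeDriver releases used `chromeOptions`). `build_chrome_capabilities` and `BASE_CHROME_CAPABILITIES` are hypothetical names; the latter is only a stand-in for `DesiredCapabilities.CHROME` so that the sketch runs without Selenium installed.

```python
import copy

# Stand-in for selenium's DesiredCapabilities.CHROME (assumption, for illustration only).
BASE_CHROME_CAPABILITIES = {"browserName": "chrome"}


def build_chrome_capabilities(base, prefs):
    """Return a new capabilities dict with Chrome prefs nested under goog:chromeOptions."""
    capabilities = copy.deepcopy(base)  # never mutate the shared base dict
    options = capabilities.setdefault("goog:chromeOptions", {})
    options.setdefault("prefs", {}).update(prefs)
    return capabilities


# Same preferences as in init_chromium: disable the password-saving prompts.
prefs = {
    "credentials_enable_service": False,
    "profile.password_manager_enabled": False,
}
capabilities = build_chrome_capabilities(BASE_CHROME_CAPABILITIES, prefs)
```

The resulting dict could then be passed to `webdriver.Remote` in place of `chrome_options`, provided the Selenium version in use still accepts `desired_capabilities`.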
Empty file.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
import unittest | ||
|
||
from linkedin.selenium_utils import init_chromium | ||
|
||
|
||
class TestChromium(unittest.TestCase): | ||
|
||
def test_init(self): | ||
webdriver = init_chromium('localhost') | ||
self.assertIsNotNone(webdriver) | ||
print("type: %s" % type(webdriver)) | ||
webdriver.quit()  # end the remote session, not just the current window | ||
|
||
if __name__ == '__main__': | ||
unittest.main() |
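
`TestChromium` needs a Selenium server listening on `localhost:4444`, which the README starts with `docker run`. As a sketch under that assumption, the test could be skipped rather than fail when the server is not running. `selenium_reachable` and `TestChromiumGuarded` are hypothetical names that do not appear in this commit.

```python
import socket
import unittest


def selenium_reachable(host, port=4444, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class TestChromiumGuarded(unittest.TestCase):

    def setUp(self):
        # Skip instead of erroring out when no Selenium server is running.
        if not selenium_reachable('localhost'):
            self.skipTest('Selenium server is not listening on localhost:4444')

    def test_server_is_listening(self):
        # A real test body (for example the init_chromium call) would go here.
        self.assertTrue(selenium_reachable('localhost'))
```

The check only proves that the port accepts connections; it does not confirm that the WebDriver endpoint is ready to create sessions.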
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
#!/bin/bash | ||
# wait-for-selenium.sh | ||
|
||
set -e | ||
|
||
url="$1" | ||
shift | ||
cmd="$@" | ||
|
||
until wget -O- "$url"; do | ||
>&2 echo "Selenium is unavailable - sleeping" | ||
sleep 1 | ||
done | ||
|
||
>&2 echo "Selenium is up - executing command" | ||
exec "$@"  # run the remaining arguments with their original word boundaries preserved |
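
The polling loop in `wait-for-selenium.sh` can also be sketched in Python, which makes it easy to test without a running server. This is an illustration rather than a replacement used by the commit: `url_responds` and `wait_until` are hypothetical helpers, and the `max_attempts` limit is an addition that the shell script does not have (it retries forever).

```python
import time
import urllib.error
import urllib.request


def url_responds(url, timeout=2.0):
    """Return True if an HTTP GET on url succeeds, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, interval=1.0, max_attempts=None, sleep=time.sleep):
    """Call probe() until it returns True; return the number of attempts made.

    Raises TimeoutError if max_attempts is reached first.
    """
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError('service still unavailable after %d attempts' % attempts)
        sleep(interval)
```

With the compose file from this commit, a call might look like `wait_until(lambda: url_responds('http://selenium:4444/wd/hub'))`. Passing the probe and the sleep function as arguments keeps the retry logic independent of the network.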