Added Python Scrapy parser #10

Open. Wants to merge 2 commits into master.
25 changes: 14 additions & 11 deletions README.md
@@ -26,6 +26,9 @@ Competitors
 * html5lib
   http://code.google.com/p/html5lib/
   Pure python DOM parser oriented to HTML5.
+* Scrapy
+  http://scrapy.org/
+  High-level screen scraping and web crawling framework.
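The entry above describes Scrapy by the kind of selection it offers. As a rough illustration of XPath-style selection, here is a standard-library sketch on well-formed markup (this is `xml.etree`, not Scrapy's actual selector API, and the sample document is made up):

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup; Scrapy's HtmlXPathSelector applies the same
# XPath idea, but tolerates real-world, possibly malformed HTML.
doc = ET.fromstring("<html><body><p>one</p><p>two</p></body></html>")

# Select every <p> anywhere in the document and pull out its text.
texts = [p.text for p in doc.findall(".//p")]
print(texts)  # ['one', 'two']
```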

### PyPi

@@ -105,17 +108,17 @@ Install OS dependencies python-virtualenv, erlang, pypy, C compiler and libxml2
dev packages

sudo apt-get install ...
-libxml2-dev libxslt1-dev build-essential # common
-python-virtualenv python-lxml # python
-erlang-base # erlang
-pypy # python PyPy
-nodejs npm # NodeJS
-cabal-install libicu-dev # Haskell
-php5-cli php5-tidy # PHP
-golang # Go
-ruby1.9.1 ruby1.9.1-dev rubygems1.9.1 # Ruby
-maven2 default-jdk # Java
-mono-runtime mono-dmcs # Mono
+libxml2-dev libxslt1-dev build-essential # common
+python-virtualenv python-lxml python-scrapy # python
+erlang-base # erlang
+pypy # python PyPy
+nodejs npm # NodeJS
+cabal-install libicu-dev # Haskell
+php5-cli php5-tidy # PHP
+golang # Go
+ruby1.9.1 ruby1.9.1-dev rubygems1.9.1 # Ruby
+maven2 default-jdk # Java
+mono-runtime mono-dmcs # Mono

Then run (it will prepare virtual environments, fetch dependencies, compile sources, etc.)

7 changes: 6 additions & 1 deletion lib.sh
@@ -4,7 +4,12 @@ timeit () {
     # XXX: how to redirect time's output to stdout, but leave command's
     # errors on stderr? -o /dev/tty is ok in general, but causes problems
     # with GNU parallel
-    /usr/bin/time --format="real:%e user:%U sys:%S max RSS:%M" $@ 2>&1
+    if [ "${OSTYPE//[0-9.]/}" == 'darwin' ]; then
+        # --format is not supported by the BSD time shipped with Mac OS
+        /usr/bin/time "$@" 2>&1
+    else
+        /usr/bin/time --format="real:%e user:%U sys:%S max RSS:%M" "$@" 2>&1
+    fi
 }
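The `${OSTYPE//[0-9.]/}` expansion in the branch above deletes every digit and dot, so a version-suffixed value compares equal to plain `darwin`. A quick sketch of that substitution (the sample values are illustrative, not exhaustive):

```shell
#!/bin/bash
# Sample OSTYPE values: bash reports e.g. "darwin12.0" on Mac OS
# and "linux-gnu" on Linux.
for os in darwin12.0 linux-gnu; do
    # ${os//[0-9.]/} deletes every digit and dot from the value.
    echo "$os -> ${os//[0-9.]/}"
done
# prints:
# darwin12.0 -> darwin
# linux-gnu -> linux-gnu
```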

print_header() {
2 changes: 1 addition & 1 deletion python/prepare.sh
@@ -3,4 +3,4 @@
virtualenv --system-site-packages env
source env/bin/activate

-pip install lxml beautifulsoup4 BeautifulSoup html5lib
+pip install lxml beautifulsoup4 BeautifulSoup html5lib scrapy
43 changes: 43 additions & 0 deletions python/scrapy_parser.py
@@ -0,0 +1,43 @@
# -*- coding: utf-8 -*-
'''
Created on 2013-01-30

@author: Pavel Shpilev <[email protected]>
'''
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

import sys
import time
import os


class BenchmarkSpider(BaseSpider):
    def parse(self, response):
        # Building the selector and extracting forces Scrapy to parse the page.
        hxs = HtmlXPathSelector(response)
        yield hxs.extract()
        # Schedule the same URL again; dont_filter bypasses the duplicate
        # filter so the page is fetched and parsed repeatedly.
        yield Request(response.url, callback=self.parse, dont_filter=True)


def main():
    # argv[1]: path to the HTML file to parse, argv[2]: number of iterations.
    do_parse_test(os.path.join('file://127.0.0.1', sys.argv[1]), int(sys.argv[2]))


def do_parse_test(html, n):
    start = time.time()
    spider = BenchmarkSpider(name="benchmark", start_urls=[html])
    # Disable the telnet console so repeated runs do not clash over its port.
    crawler = Crawler(Settings(values={'TELNETCONSOLE_PORT': None}))
    crawler.configure()
    crawler.crawl(spider)
    for i in xrange(n):
        crawler.start()
    crawler.stop()
    stop = time.time()
    print stop - start, "s"


if __name__ == '__main__':
    main()
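The benchmark above takes one wall-clock reading around n full parses. The same timing pattern, sketched with only the standard library's `HTMLParser` standing in for Scrapy so it runs without a Scrapy install (the `TagCounter` and `time_parses` names are illustrative, not part of the benchmark suite):

```python
import time
from html.parser import HTMLParser  # Python 3 stdlib; the benchmark itself targets Python 2


class TagCounter(HTMLParser):
    """Minimal parser that just counts start tags as it consumes the document."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1


def time_parses(html, n):
    # Same shape as do_parse_test: one clock read around n complete parses.
    start = time.time()
    for _ in range(n):
        parser = TagCounter()
        parser.feed(html)
        parser.close()
    return time.time() - start


elapsed = time_parses("<html><body><p>hi</p></body></html>", 10)
print("%g s" % elapsed)
```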