Commit f0f7acb

add warnings on deprecated pipeline classes

57 files changed, +8202 −0 lines changed

.gitignore

+57
@@ -0,0 +1,57 @@
# scrapy stuff
.scrapy
scrapy_proj/setup.py
dbs/
settings.py

.DS_Store
.AppleDouble
.LSOverride
Icon

# Thumbnails
._*

# Files that might appear on external disk
.Spotlight-V100
.Trashes

# virtualenvs
venv

*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
__pycache__

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

CONTRIBUTORS

+14
@@ -0,0 +1,14 @@
Here are people who have contributed code to this project.

Adam M Dutko <https://github.com/StylusEater>
Bedrich Rios <https://github.com/bedrich>
Chris Shiflett <https://github.com/shiflett>
Dan McGowan <https://github.com/dansmcgowan>
Ed Finkler <https://github.com/funkatron>
Eric Leclerc <https://github.com/eleclerc>
Evan Haas <https://github.com/ehaas>
Jonathan Suh <https://github.com/jonsuh>
josefeg <https://github.com/josefeg>
Justin Duke <https://github.com/dukerson>
mickaobrien <https://github.com/mickaobrien>
Tyler Mincey <https://github.com/tmincey>

LICENSE

+13
@@ -0,0 +1,13 @@
Copyright 2013 Fictive Kin, LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

README.md

+125
@@ -0,0 +1,125 @@
# Open Recipes

## About

Open Recipes is an open database of recipe bookmarks.

Our goals are simple:

1. Help publishers make their recipes as discoverable and consumable (get it?) as possible.
2. Prevent good recipes from disappearing when a publisher goes away.

That's pretty much it. We're not trying to save the world. We're just trying to save some recipes.

## Recipe Bookmarks?

The recipes in Open Recipes do not include preparation instructions. This is why we like to think of Open Recipes as a database of recipe bookmarks. We think this database should provide everything you need to *find* a great recipe, but not everything you need to *prepare* a great recipe. For preparation instructions, please link to the source.

## The Database

Regular snapshots of the database will be provided as JSON. The format will mirror the [schema.org Recipe format](http://schema.org/Recipe). We've [posted an example dump of data](http://openrecipes.s3.amazonaws.com/openrecipes.txt) so you can get a feel for it.
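
For illustration only, a single record in such a dump might look roughly like the sketch below. This is an assumption based on the schema.org Recipe fields used by the spiders in this repo (name, image, url, description, prepTime, cookTime, recipeYield, ingredients, datePublished, source); the invented values are placeholders, and the example dump linked above is the authoritative reference.

<pre>
# Hypothetical shape of one recipe bookmark, mirroring schema.org/Recipe.
# Field names follow the RecipeItem fields used in this project; the
# values are made up for illustration.
{
    "name": "Beef Fajitas",
    "url": "http://thepioneerwoman.com/cooking/2013/03/beef-fajitas/",
    "source": "thepioneerwoman",
    "description": "Marinated skirt steak with peppers and onions.",
    "image": "http://example.com/images/beef-fajitas.jpg",
    "prepTime": "PT15M",
    "cookTime": "PT20M",
    "recipeYield": "Serves 6",
    "ingredients": "2 pounds skirt steak\n1 red bell pepper\n...",
    "datePublished": "2013-03-25"
}
</pre>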

## The Story

We're not a bunch of chefs. We're not even good cooks.

When we read about the [acquisition and shutdown of Punchfork](http://punchfork.com/pinterest), we just shook our heads. It was the same ol' story:

> We're excited to share the news that we're gonna be rich! To celebrate, we're shutting down the site and taking all your data down with it. So long, suckers!

This part of the story isn't unique, but it continues. When one of our Studiomates spoke up about her disappointment, we listened. Then, [we acted](https://hugspoon.com/punchfork). What happened next surprised us. The CEO of Punchfork [took issue](https://twitter.com/JeffMiller/status/314899821351821312) with our good deed and demanded that we not save any data, even the data (likes) of users who asked us to save their data.

Here's the thing. None of the recipes belonged to Punchfork. They were scraped from various [publishers](https://github.com/fictivekin/openrecipes/wiki/Publishers) to begin with. But, we don't wanna ruffle any feathers, so we're starting over.

Use the force; seek the source?

## The Work

Wanna help? Fantastic. We knew we liked you.

We're gonna be using [the wiki](https://github.com/fictivekin/openrecipes/wiki) to help organize this effort. Right now, there are two simple ways to help:

1. Add a [publisher](https://github.com/fictivekin/openrecipes/wiki/Publishers). We wanna have the most complete list of recipe publishers. This is the easiest way to contribute. Please also add [an issue](https://github.com/fictivekin/openrecipes/issues) and tag it `publisher`. If you don't have a GitHub account, you can also email us suggestions at [email protected]
2. Claim a publisher.

Claiming a publisher means you are taking responsibility for writing a simple parser for the recipes from this particular publisher. Our tech ([see below](#the-tech)) will store this in an object type based on the [schema.org Recipe format](http://schema.org/Recipe), and can convert it into other formats for easy storage and discovery.

Each publisher is a [GitHub issue](https://github.com/fictivekin/openrecipes/issues), so you can claim a publisher by claiming an issue. Just like a bug, and just as delicious. Just leave a comment on the issue claiming it, and it's all yours.

When you have a working parser (what we call "spiders" below), you contribute it to this project by submitting a [GitHub pull request](https://help.github.com/articles/using-pull-requests). We'll use it to periodically bring recipe data into our database. The database will be available initially as data dumps.

## The Tech

To gather data for Open Recipes, we are building spiders based on [Scrapy](http://scrapy.org), a web scraping framework written in Python. We are using [Scrapy v0.16](http://doc.scrapy.org/en/0.16/) at the moment. To contribute spiders for sites, you should have basic familiarity with:

* Python
* Git
* HTML and/or XML

### Setting up a dev environment

> Note: this is strongly biased towards OS X. Feel free to contribute instructions for other operating systems.

To get things going, you will need the following tools:

1. Python 2.7 (including headers)
1. Git
1. `pip`
1. `virtualenv`

You will probably already have the first two, although you may need to install Python headers on Linux with something like `apt-get install python-dev`.

If you don't have `pip`, follow [the installation instructions in the pip docs](http://www.pip-installer.org/en/latest/installing.html). Then you can [install `virtualenv` using pip](http://www.virtualenv.org/en/latest/#installation).

Once you have `pip` and `virtualenv`, you can clone our repo and install requirements with the following steps:

1. Open a terminal and `cd` to the directory that will contain your repo clone. For these instructions, we'll assume you `cd ~/src`.
2. `git clone https://github.com/fictivekin/openrecipes.git` to clone the repo. This will make a `~/src/openrecipes` directory that contains your local repo.
3. `cd ./openrecipes` to move into the newly-cloned repo.
4. `virtualenv --no-site-packages venv` to create a Python virtual environment inside `~/src/openrecipes/venv`.
5. `source venv/bin/activate` to activate your new Python virtual environment.
6. `pip install -r requirements.txt` to install the required Python libraries, including Scrapy.
7. `scrapy -h` to confirm that the `scrapy` command was installed. You should get a dump of the help docs.
8. `cd scrapy_proj/openrecipes` to move into the Scrapy project directory.
9. `cp settings.py.default settings.py` to set up a working settings module for the project.
10. `scrapy crawl thepioneerwoman.feed` to test the feed spider written for [thepioneerwoman.com](http://thepioneerwoman.com). You should get output like the following:

<pre>
2013-03-30 14:35:37-0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: openrecipes)
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled item pipelines: MakestringsPipeline, DuplicaterecipePipeline
2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Spider opened
2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) <GET http://feeds.feedburner.com/pwcooks> (referer: None)
2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) <GET http://thepioneerwoman.com/cooking/2013/03/beef-fajitas/> (referer: http://feeds.feedburner.com/pwcooks)
...
</pre>

If you do, [*baby you got a stew going!*](http://www.youtube.com/watch?v=5lFZAyZPjV0)

### Writing your own spiders

For now, we recommend looking at the following spider definitions to get a feel for writing them:

* [spiders/thepioneerwoman_spider.py](scrapy_proj/openrecipes/spiders/thepioneerwoman_spider.py)
* [spiders/thepioneerwoman_feedspider.py](scrapy_proj/openrecipes/spiders/thepioneerwoman_feedspider.py)

Both files are extensively documented, and should give you an idea of what's involved. If you have questions, check the [Feedback section](#feedback) and hit us up.
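
If you just want the overall shape before opening those files, here is a rough, simplified sketch of the pattern they follow. The site, URLs, XPath expressions, and class names below are illustrative placeholders, not the contents of the actual files:

<pre>
# Illustrative sketch only -- see the two real spiders listed above
# for complete, working examples.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from openrecipes.items import RecipeItem, RecipeItemLoader


class ExamplesitecrawlSpider(CrawlSpider):

    source = "examplesite"
    name = "examplesite.com"
    allowed_domains = ["examplesite.com"]
    start_urls = ["http://www.examplesite.com/recipes/"]

    # Follow index pages, and hand recipe pages to parse_item().
    rules = (
        Rule(SgmlLinkExtractor(allow=('/recipes/page/\d+/'))),
        Rule(SgmlLinkExtractor(allow=('/recipes/\d{4}/\d{2}/[a-z0-9-]+/')),
             callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        il = RecipeItemLoader(item=RecipeItem())

        il.add_value('source', self.source)
        il.add_value('url', response.url)
        # Site-specific XPaths go here; these are placeholders.
        il.add_value('name', hxs.select('//h1/text()').extract())
        il.add_value('ingredients',
                     hxs.select('//li[@class="ingredient"]/text()').extract())

        return il.load_item()
</pre>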

To generate your own spider, use the included generate.py program. From the scrapy_proj directory, run the following (make sure you are in the correct virtualenv):

`python generate.py SPIDER_NAME START_URL`

This will generate a basic spider for you named SPIDER_NAME that starts crawling at START_URL. All that remains for you to do is to fill in the correct info for scraping the name, image, etc. See `python generate.py --help` for other command line options.
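
As a purely hypothetical example (the `examplesite` name and URLs are placeholders, not a claimed publisher), generating a site spider plus a companion feed spider might look like:

`python generate.py examplesite http://www.examplesite.com/ --with-feed http://feeds.feedburner.com/examplesite`

This would write stub files such as `openrecipes/spiders/examplesite_spider.py` and `openrecipes/spiders/examplesite_feedspider.py` (see `scrapy_proj/generate.py` below for how the templates are filled in), which you then complete with the correct XPath expressions.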

We'll use the ["fork & pull" development model](https://help.github.com/articles/fork-a-repo) for collaboration, so if you plan to contribute, make sure to fork your own repo off of ours. Then you can send us a pull request when you have something to contribute. Please follow ["PEP 8 - Style Guide for Python Code"](http://www.python.org/dev/peps/pep-0008/) for code you write.

## Feedback?

We're just trying to do the right thing, so we value your feedback as we go. You can ping [Ed](https://github.com/funkatron), [Chris](https://github.com/shiflett), [Andreas](https://github.com/andbirkebaek), or anyone from [Fictive Kin](https://github.com/fictivekin). General suggestions and feedback to [[email protected]](mailto:[email protected]) are welcome, too.

We're also gonna be on IRC, so please feel free to join us if you have any questions or comments. We'll be hanging out in #openrecipes on Freenode. See you there!

requirements.txt

+14
@@ -0,0 +1,14 @@
Scrapy==0.16.4
Twisted==12.3.0
bleach==1.2.1
cssselect==0.8
html5lib==0.95
isodate==0.4.9
lxml==3.1.0
nose==1.3.0
pyOpenSSL==0.13
pymongo==2.5
python-dateutil==2.1
w3lib==1.2
wsgiref==0.1.2
zope.interface==4.0.5

scrapy_proj/generate.py

+159
@@ -0,0 +1,159 @@
import argparse
from urlparse import urlparse
import os
import sys

script_dir = os.path.dirname(os.path.realpath(__file__))

# Template for a site crawler spider; the %(...)s placeholders are filled in
# by generate_crawlers() below.
SpiderTemplate = """from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from openrecipes.items import RecipeItem, RecipeItemLoader


class %(crawler_name)sMixin(object):
    source = '%(source)s'

    def parse_item(self, response):

        hxs = HtmlXPathSelector(response)

        base_path = 'TODO'

        recipes_scopes = hxs.select(base_path)

        name_path = 'TODO'
        description_path = 'TODO'
        image_path = 'TODO'
        prepTime_path = 'TODO'
        cookTime_path = 'TODO'
        recipeYield_path = 'TODO'
        ingredients_path = 'TODO'
        datePublished = 'TODO'

        recipes = []

        for r_scope in recipes_scopes:
            il = RecipeItemLoader(item=RecipeItem())

            il.add_value('source', self.source)

            il.add_value('name', r_scope.select(name_path).extract())
            il.add_value('image', r_scope.select(image_path).extract())
            il.add_value('url', response.url)
            il.add_value('description', r_scope.select(description_path).extract())

            il.add_value('prepTime', r_scope.select(prepTime_path).extract())
            il.add_value('cookTime', r_scope.select(cookTime_path).extract())
            il.add_value('recipeYield', r_scope.select(recipeYield_path).extract())

            ingredient_scopes = r_scope.select(ingredients_path)
            ingredients = []
            for i_scope in ingredient_scopes:
                pass  # TODO: extract each ingredient and append it to the list
            il.add_value('ingredients', ingredients)

            il.add_value('datePublished', r_scope.select(datePublished).extract())

            recipes.append(il.load_item())

        return recipes


class %(crawler_name)scrawlSpider(CrawlSpider, %(crawler_name)sMixin):

    name = "%(domain)s"

    allowed_domains = ["%(domain)s"]

    start_urls = [
        "%(start_url)s",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('TODO'))),

        Rule(SgmlLinkExtractor(allow=('TODO')),
             callback='parse_item'),
    )


"""

# Template for an RSS/Atom feed spider that reuses the mixin above.
FeedSpiderTemplate = """from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector
from openrecipes.spiders.%(source)s_spider import %(crawler_name)sMixin


class %(crawler_name)sfeedSpider(BaseSpider, %(crawler_name)sMixin):
    name = "%(name)s.feed"
    allowed_domains = [
        "%(feed_domains)s",
        "feeds.feedburner.com",
        "feedproxy.google.com",
    ]
    start_urls = [
        "%(feed_url)s",
    ]

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        links = xxs.select("TODO").extract()

        return [Request(x, callback=self.parse_item) for x in links]
"""


def parse_url(url):
    if url.startswith('http://') or url.startswith('https://'):
        return urlparse(url)
    else:
        return urlparse('http://' + url)


def generate_crawlers(args):
    parsed_url = parse_url(args.start_url)

    domain = parsed_url.netloc
    name = args.name.lower()

    values = {
        'crawler_name': name.capitalize(),
        'source': name,
        'name': domain,
        'domain': domain,
        'start_url': args.start_url,
    }

    spider_filename = os.path.join(script_dir, 'openrecipes', 'spiders', '%s_spider.py' % name)
    with open(spider_filename, 'w') as f:
        f.write(SpiderTemplate % values)

    if args.with_feed:
        feed_url = args.with_feed[0]
        feed_domain = parse_url(feed_url).netloc
        values['feed_url'] = feed_url
        values['name'] = name
        if feed_domain == domain:
            values['feed_domains'] = domain
        else:
            values['feed_domains'] = '%s",\n        "%s' % (domain, feed_domain)
        feed_filename = os.path.join(script_dir, 'openrecipes', 'spiders', '%s_feedspider.py' % name)
        with open(feed_filename, 'w') as f:
            f.write(FeedSpiderTemplate % values)


epilog = """
Example usage: python generate.py epicurious http://www.epicurious.com/
"""
parser = argparse.ArgumentParser(description='Generate a scrapy spider', epilog=epilog)
parser.add_argument('name', help='Spider name. This will be used to generate the filename')
parser.add_argument('start_url', help='Start URL for crawling')
parser.add_argument('--with-feed', required=False, nargs=1, metavar='feed-url', help='RSS Feed URL')

if len(sys.argv) == 1:
    parser.print_help(sys.stderr)
else:
    args = parser.parse_args()
    generate_crawlers(args)

scrapy_proj/openrecipes/__init__.py

Whitespace-only changes.
