Expansion: XPath Template Store #9

Open
mfan opened this issue Jan 15, 2013 · 0 comments

mfan commented Jan 15, 2013

The XPath Template Server drives both expansion (based on list templates) and content extraction (based on content templates).

Two kinds of templates are kept in the store (Redis):

  • list templates. These templates extract more link URLs from a page; the URLs are then used to crawl deeper into more pages. For example, a list template could be applied to "category listing" pages, "related contents" pages, "most popular items" pages, etc.
  • content templates. These templates extract one or more entities from a page. The extracted data is structured and can be added to or merged into an existing database.
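
A minimal sketch of how the two template kinds might be stored and applied. The key layout (`template:<kind>:<site>`), the JSON template bodies, and all function and field names here are assumptions for illustration; none of them are specified in this issue. A plain dict stands in for Redis, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath and needs well-formed markup) stands in for lxml so the sketch runs without dependencies.

```python
import json
import xml.etree.ElementTree as ET

# Stand-in for Redis: a dict mapping string keys to JSON strings.
store = {}

def save_template(kind, site, template):
    # Hypothetical key layout: template:<kind>:<site>
    store[f"template:{kind}:{site}"] = json.dumps(template)

def load_template(kind, site):
    return json.loads(store[f"template:{kind}:{site}"])

def expand(page, site):
    """Apply the site's list template and return the link URLs it finds."""
    tpl = load_template("list", site)
    root = ET.fromstring(page)
    urls = []
    for node in root.findall(tpl["xpath"]):
        value = node.get(tpl["attr"])
        if value:
            urls.append(value)
    return urls

def extract(page, site):
    """Apply the site's content template and return one dict per entity."""
    tpl = load_template("content", site)
    root = ET.fromstring(page)
    entities = []
    for node in root.findall(tpl["entity"]):
        # Each field path is evaluated relative to the entity node.
        entities.append({
            field: (node.findtext(path) or "").strip()
            for field, path in tpl["fields"].items()
        })
    return entities

# Hypothetical templates for a hypothetical site.
save_template("list", "example.com",
              {"xpath": ".//ul[@class='category']/li/a", "attr": "href"})
save_template("content", "example.com", {
    "entity": ".//div[@class='item']",
    "fields": {"title": "h2", "price": "span[@class='price']"},
})

listing_page = """<html><body><ul class="category">
<li><a href="/items/1">One</a></li><li><a href="/items/2">Two</a></li>
</ul></body></html>"""

item_page = """<html><body><div class="item">
<h2>Blue Widget</h2><span class="price">9.99</span>
</div></body></html>"""

links = expand(listing_page, "example.com")     # URLs to crawl next
entities = extract(item_page, "example.com")    # structured records
```

Keeping both kinds under one key scheme would let the crawler decide per page whether to expand, extract, or do both, but that split is a design guess rather than something stated here.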

TODO:

  • Microformats should be supported as one kind of content template. The lxml library supports parsing microformats. We need to track how many sites use microformats now.
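
As a rough illustration of the microformat idea, the sketch below treats hCard as a fixed content template keyed on class names, so no per-site XPath is needed. It is an assumption about how this could work, not the planned implementation. The helper names are hypothetical, only three hCard properties (`fn`, `org`, `url`) are handled, and the standard library's `xml.etree.ElementTree` is used in place of lxml so the example runs on well-formed markup without extra dependencies.

```python
import xml.etree.ElementTree as ET

def find_by_class(root, name):
    """Return elements under root whose class attribute contains the token."""
    return [el for el in root.iter() if name in el.get("class", "").split()]

def extract_hcards(page):
    """Return one dict per hCard (class="vcard") found in the page."""
    root = ET.fromstring(page)
    cards = []
    for card in find_by_class(root, "vcard"):
        props = {}
        for prop in ("fn", "org", "url"):
            matches = find_by_class(card, prop)
            if not matches:
                continue
            el = matches[0]
            if prop == "url" and el.get("href"):
                # For a link element, the url property is the href value.
                props[prop] = el.get("href")
            else:
                props[prop] = "".join(el.itertext()).strip()
        cards.append(props)
    return cards

page = """<div><div class="vcard">
<a class="url fn" href="http://example.com/jane">Jane Doe</a>
<span class="org">Example Corp</span>
</div></div>"""

cards = extract_hcards(page)
```

Matching on class tokens rather than an exact attribute value matters because microformat elements often carry several classes at once, as the `url fn` link does here.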
@ghost ghost assigned mfan Jan 15, 2013