The XPath Template Server drives both crawl expansion (based on list templates) and content extraction (based on content templates).
Two kinds of templates are kept in the template store (Redis):
List templates. These templates extract further link URLs from a page, and the crawler follows those URLs to go deeper into the site. For example, a list template could be applied to "category listing" pages, "related contents" pages, or "most popular items" pages.
Content templates. These templates extract one or more entities from a page. The extracted data is structured and can be added to or merged into an existing database.
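As a rough illustration of the two template kinds, here is a minimal sketch. It assumes a template is just a small dictionary of XPath expressions; the issue does not specify the real template format. The names `LIST_TEMPLATE`, `CONTENT_TEMPLATE`, `apply_list_template`, and `apply_content_template`, and the sample page, are all hypothetical. To stay self-contained it uses the standard library's `xml.etree.ElementTree`, which only understands a limited XPath subset and requires well-formed markup. A real server would more likely use lxml for full XPath and tolerant HTML parsing.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical template shapes (assumptions, not the server's real format).
LIST_TEMPLATE = {"links": ".//ul[@class='listing']/li/a"}
CONTENT_TEMPLATE = {
    "entity": ".//div[@class='product']",
    "fields": {"name": "h2", "price": "span[@class='price']"},
}


def apply_list_template(root, template, base_url):
    """Return absolute URLs for every anchor matched by the list template."""
    return [
        urljoin(base_url, a.get("href"))
        for a in root.findall(template["links"])
        if a.get("href")
    ]


def apply_content_template(root, template):
    """Return one dict per entity node, with one key per field in the template."""
    entities = []
    for node in root.findall(template["entity"]):
        record = {}
        for field, path in template["fields"].items():
            found = node.find(path)
            record[field] = (found.text or "").strip() if found is not None else None
        entities.append(record)
    return entities


# Made-up, well-formed sample page used only for this sketch.
PAGE = """<html><body>
<ul class="listing">
  <li><a href="/item/1">One</a></li>
  <li><a href="/item/2">Two</a></li>
</ul>
<div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
</body></html>"""

root = ET.fromstring(PAGE)
links = apply_list_template(root, LIST_TEMPLATE, "http://example.com/category")
entities = apply_content_template(root, CONTENT_TEMPLATE)
print(links)
print(entities)
```

In this sketch the list template yields URLs to queue for further crawling, while the content template yields records that could then be added to or merged into a database.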
TODO:
Microformats should be supported as one kind of content template. Parsing of microformats is supported by the lxml library. We need to keep track of how many sites currently use microformats.