HPT: Host property table #4

Open
mfan opened this issue Dec 20, 2012 · 0 comments

mfan commented Dec 20, 2012

Store host-related information to help eliminate duplicate URLs and optimize downloading:

  1. Host normalization: e.g. when a host answers with a 301 redirect, the target host wins and is treated as canonical. This info is used to normalize URLs and remove duplicates.
  2. Host robots.txt info.
  3. Host properties:
    • Friendliness: how the host behaved during previous crawls.
    • Stability: derived from previous crawls; ranking helps here too.
    • Ranking: taken from other sources or assigned.
    • Other info, e.g. a cached IP address.
  4. Timestamp, used to decide record freshness.
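
A minimal sketch of what one record and the URL normalization step could look like. All names here (`HostRecord`, `canonical_host`, `normalize_url`, the field names) are assumptions made for illustration, not a decided schema, and a plain dict stands in for the table:

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import urlsplit, urlunsplit


@dataclass
class HostRecord:
    """One row of the host property table (hypothetical field names)."""
    host: str
    canonical_host: Optional[str] = None  # winner host learned from a 301, if any
    robots_txt: Optional[str] = None      # raw robots.txt body, if fetched
    friendliness: float = 0.0             # derived from previous crawls
    stability: float = 0.0                # derived from previous crawls and ranking
    ranking: float = 0.0                  # from other sources or assigned
    ip_address: Optional[str] = None      # cached IP address
    updated_at: float = field(default_factory=time.time)

    def is_fresh(self, max_age_seconds: float, now: Optional[float] = None) -> bool:
        """Return True if the record is no older than max_age_seconds."""
        now = time.time() if now is None else now
        return (now - self.updated_at) <= max_age_seconds


def normalize_url(url: str, table: Dict[str, HostRecord]) -> str:
    """Rewrite the host of url to its canonical host, if the table knows one.

    URLs whose host has no record (or no canonical host) are returned unchanged.
    Userinfo (user:password@) is dropped in this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    record = table.get(host)
    if record is None or not record.canonical_host:
        return url
    netloc = record.canonical_host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

With a record saying that `example.com` redirects to `www.example.com`, both `http://example.com/a` and `http://www.example.com/a` normalize to the same string, so a simple set lookup is enough to drop the duplicate.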

design:

  • This table is stored in Redis.
  • It shall be kept in sync with the same table in HBase.
  • Records might be loaded on demand from HBase into Redis.
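
A rough sketch of the load-on-demand and sync behavior, assuming a write-through policy (the sync strategy is not decided here). Plain dicts stand in for Redis and HBase so the example runs without either service; `HostPropertyTable`, `get`, and `put` are hypothetical names:

```python
import time
from typing import Dict, Optional


class HostPropertyTable:
    """Cache-in-front-of-store sketch: `cache` plays Redis, `store` plays HBase."""

    def __init__(self, cache: Dict[str, dict], store: Dict[str, dict],
                 max_age_seconds: float):
        self.cache = cache
        self.store = store
        self.max_age_seconds = max_age_seconds

    def get(self, host: str, now: Optional[float] = None) -> Optional[dict]:
        """Return the record for host, loading it into the cache on a miss."""
        now = time.time() if now is None else now
        record = self.cache.get(host)
        if record is not None and now - record["timestamp"] <= self.max_age_seconds:
            return record  # fresh cache hit
        # Cache miss or stale entry: fall back to the backing store.
        record = self.store.get(host)
        if record is None:
            return None
        # Copy so later cache mutations do not silently change the store.
        self.cache[host] = dict(record)
        return self.cache[host]

    def put(self, host: str, properties: dict, now: Optional[float] = None) -> None:
        """Write-through: update cache and store together (assumed policy)."""
        now = time.time() if now is None else now
        record = dict(properties, timestamp=now)
        self.cache[host] = record
        self.store[host] = dict(record)
```

Note that a stale record is still returned when the store has nothing newer; deciding whether to re-crawl in that case is left to the caller.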

The implementation will proceed in stages, and this issue will be split into sub-issues.

@ghost ghost assigned mfan Dec 20, 2012