Stop/resume #2
Yeah, this is really lacking right now. I've been a little torn over how to implement this. In the long term I think it would be cool if there was some sort of admin UI where you could view previous crawl results, start and stop new crawls, and maybe even do a little configuration. That might be a little heavyweight for some people, so having a simple pause/resume from the command line would be nice.

How would this change sound: in the crawler you can configure a queue file:

```js
var crawler = new roboto.Crawler({
  startUrls: [
    "https://news.ycombinator.com/",
  ],
  queueFile: '/var/foo'
});
```

Then, the url frontier and set of seen urls will periodically be serialized and flushed out to the file as JSON.
Well, I had a very similar idea: you configure a queue file, and the crawler periodically serializes the data needed for resuming.
When the crawler finishes, it removes the queueFile so the next run starts from the beginning.
I think I saw it in the roadmap.
It would be nice if you could stop and then resume roboto so it does not start over from the beginning/startUrls. I think it could be achieved via de/serialization, so that when you stop and restart it, it loads its previous state.