Scripts I use to get decent results from Craigslist.
- Download the last N listings from Craigslist
- Find duplicates by exact title match
- Attempt to extract searchable fields (price, bedrooms, etc.) from the listing content and index them in Elasticsearch
- Search based on my criteria
- WIN
Requirements
- Ruby
- Bundler
- PostgreSQL
- Elasticsearch
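The duplicate-detection and field-extraction steps of the workflow can be sketched roughly as follows. This is a minimal illustration, not the code in these scripts: the listing shape (a Hash with `:title` and `:body`), the helper names `dedupe_by_title` and `extract_fields`, and both regular expressions are assumptions made for the example.

```ruby
require "set"

# Assumed patterns for illustration; real listings are messier than this.
PRICE_RE    = /\$\s?(\d{1,3}(?:,\d{3})+|\d+)/
BEDROOMS_RE = /(\d+)\s*(?:br|bd|bed(?:room)?s?)\b/i

# Keep only the first listing seen for each exact title.
# Set#add? returns nil when the title is already present.
def dedupe_by_title(listings)
  seen = Set.new
  listings.select { |listing| seen.add?(listing[:title]) }
end

# Pull price and bedroom count out of the free text.
# A field is nil when the pattern does not match.
def extract_fields(listing)
  text  = "#{listing[:title]} #{listing[:body]}"
  price = text[PRICE_RE, 1]
  beds  = text[BEDROOMS_RE, 1]
  {
    title:    listing[:title],
    price:    price && price.delete(",").to_i,
    bedrooms: beds && beds.to_i
  }
end

# Hypothetical sample data standing in for downloaded listings.
listings = [
  { title: "Sunny 2BR near park", body: "Rent is $2,400 per month." },
  { title: "Sunny 2BR near park", body: "Rent is $2,400 per month." },
  { title: "Studio downtown",     body: "$1500, no pets." }
]

docs = dedupe_by_title(listings).map { |listing| extract_fields(listing) }
```

Each entry in `docs` is a plain Hash, so it could be serialized to JSON and indexed in Elasticsearch; that step is left out here because it depends on the client and index layout in use.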
Things I probably won't do before actually finding an apartment
- When analyzing listings, search for images within a listing, stream their contents, hash the contents, and index those hashes. Use this to compute a "unique images ratio" and treat any listing whose ratio falls below a threshold as a duplicate. For instance, if a listing has 10 images but 9 of them appear in other listings, consider the listing a duplicate and do not index it for search.
- Learn how to create a better search using Elasticsearch: scoring, boosting, etc.
- Make the load/analyze process a single script/class
- Use Twitter Storm to stream and analyze new listings, and email myself the ones that match my criteria
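The "unique images ratio" idea from the list above could look something like the sketch below. Nothing here exists in the scripts yet. The function names, the use of SHA-256, the 0.5 default threshold, and the choice to treat a listing with no images as fully unique are all assumptions made for illustration. Image contents are passed in as byte strings so the example runs without any network access.

```ruby
require "digest"
require "set"

# Fraction of a listing's images whose content hash has not been seen
# in any previously indexed listing.
# Assumption: a listing with no images counts as fully unique (1.0).
def unique_image_ratio(image_blobs, known_hashes)
  return 1.0 if image_blobs.empty?

  hashes = image_blobs.map { |blob| Digest::SHA256.hexdigest(blob) }
  unique = hashes.count { |h| !known_hashes.include?(h) }
  unique.to_f / hashes.size
end

# A listing is treated as a duplicate when its ratio is below the threshold.
# The 0.5 default is an arbitrary placeholder, not a tuned value.
def duplicate_listing?(image_blobs, known_hashes, threshold: 0.5)
  unique_image_ratio(image_blobs, known_hashes) < threshold
end

# Stand-in "images": short strings instead of real image bytes.
known = Set.new(%w[a b c d e f g h i].map { |s| Digest::SHA256.hexdigest(s) })

# 10 images, 9 of which already appear in other listings.
blobs = %w[a b c d e f g h i z]
ratio = unique_image_ratio(blobs, known)
is_duplicate = duplicate_listing?(blobs, known)
```

Exact content hashing only catches byte-identical images; a re-encoded or resized copy would hash differently, so this is a lower bound on how much reuse it detects.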
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.