
BaiduCrawler

A crawler that crawls Baidu search results by constantly rotating proxies.

It extracts the data inside the c-abstract elements of Baidu search result pages and bypasses Baidu's anti-crawler measures by continuously switching proxy IPs, which makes it possible to crawl the Baidu search results for hundreds of thousands of query terms without interruption.
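As a rough illustration of that idea (not the project's own code), the sketch below fetches one result page through a proxy with `requests` and pulls the `c-abstract` blocks out with BeautifulSoup. The URL parameters, headers, and the proxy address are assumptions; only the `c-abstract` class name comes from this README.

```python
import requests
from bs4 import BeautifulSoup


def fetch_abstracts(query, proxy):
    """Fetch one Baidu result page through `proxy` and return the text
    of every c-abstract block found on it."""
    resp = requests.get(
        "https://www.baidu.com/s",            # assumed search endpoint
        params={"wd": query},
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Each result's abstract sits in a div with class "c-abstract".
    return [div.get_text(strip=True) for div in soup.select("div.c-abstract")]


if __name__ == "__main__":
    # Hypothetical proxy address for illustration only.
    print(fetch_abstracts("python", "http://1.2.3.4:8080"))
```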

### Crawling Strategies

There are three strategies (a sketch combining them follows the list):

1. Whenever a download_error occurs, switch to a new IP.
2. After every 200 texts crawled, switch to a new IP.
3. After every 20,000 crawls, refresh the IP pool.
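A minimal sketch of how these three rules could be wired together is shown below. The names `fetch_proxy_pool` and `fetch_abstracts`, and the use of a generic exception class for download errors, are assumptions rather than the repository's actual API.

```python
import random


def crawl_all(queries, fetch_proxy_pool, fetch_abstracts, download_error=Exception):
    """Crawl every query while rotating proxies per the three strategies."""
    pool = fetch_proxy_pool()          # initial IP pool
    proxy = random.choice(pool)
    texts_on_current_ip = 0            # counter for strategy 2
    total_crawls = 0                   # counter for strategy 3
    results = {}

    for query in queries:
        while True:
            try:
                abstracts = fetch_abstracts(query, proxy)
                break
            except download_error:
                # Strategy 1: on any download error, switch to another IP.
                # A real crawler would also cap the number of retries.
                proxy = random.choice(pool)
                texts_on_current_ip = 0

        results[query] = abstracts
        texts_on_current_ip += 1
        total_crawls += 1

        if texts_on_current_ip >= 200:
            # Strategy 2: after 200 texts on one IP, switch to a new IP.
            proxy = random.choice(pool)
            texts_on_current_ip = 0

        if total_crawls % 20000 == 0:
            # Strategy 3: after every 20,000 crawls, refresh the whole pool.
            pool = fetch_proxy_pool()
            proxy = random.choice(pool)
            texts_on_current_ip = 0

    return results
```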
