This guide explains how to use the cloudscraper
Python library to bypass Cloudflare’s protection and handle errors:
- Install Prerequisites
- Write Initial Scraping Code
- Incorporate cloudscraper
- Use Additional cloudscraper Features
- Common cloudscraper Errors
- cloudscraper Alternatives
- Conclusion
Ensure you have Python 3 installed, then install the necessary packages:
pip install tqdm==4.66.5 requests==2.32.3 beautifulsoup4==4.12.3
This guide assumes you're scraping metadata from news articles published on a specific date on the ChannelsTV website. Below is an initial Python script:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from tqdm.auto import tqdm

def extract_article_data(article_source, headers):
    response = requests.get(article_source, headers=headers)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(class_="post-title display-3").text.strip()
    date = soup.find(class_="post-meta_time").text.strip()
    date_object = datetime.strptime(date, 'Updated %B %d, %Y').date()
    categories = [category.text.strip() for category in soup.find('nav', {"aria-label": "breadcrumb"}).find_all('li')]
    tags = [tag.text.strip() for tag in soup.find("div", class_="tags").find_all("a")]
    article_data = {
        'date': date_object,
        'title': title,
        'link': article_source,
        'tags': tags,
        'categories': categories
    }
    return article_data

def process_page(articles, headers):
    page_data = []
    for article in tqdm(articles):
        url = article.find('a', href=True).get('href')
        if "https://" not in url:
            continue
        article_data = extract_article_data(url, headers)
        if article_data:
            page_data.append(article_data)
    return page_data

def scrape_articles_per_day(base_url, headers):
    day_data = []
    page = 1
    while True:
        page_url = f"{base_url}/page/{page}"
        response = requests.get(page_url, headers=headers)
        if not response or response.status_code != 200:
            break
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.find_all('article')
        if not articles:
            break
        page_data = process_page(articles, headers)
        day_data.extend(page_data)
        page += 1
    return day_data

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

URL = "https://www.channelstv.com/2024/08/01/"
scraped_articles = scrape_articles_per_day(URL, headers)
print(f"{len(scraped_articles)} articles were scraped.")
print("Samples:")
print(scraped_articles[:2])
This script defines three key functions for scraping. The extract_article_data function retrieves an article's webpage and extracts metadata such as the title, publication date, tags, and categories into a dictionary. Next, the process_page function iterates through all articles on a given page, extracting metadata with extract_article_data and compiling the results into a list. Finally, the scrape_articles_per_day function navigates through paginated results, incrementing the page number in a while loop until no more pages are found.
To execute the scraper, the script specifies a target URL filtered to August 1, 2024, sets a user-agent header, and calls scrape_articles_per_day with the URL and headers. The total number of scraped articles is printed, along with a preview of the first two results.
However, the script does not work as expected because the ChannelsTV website employs Cloudflare protection, which blocks the direct HTTP requests made in extract_article_data and scrape_articles_per_day.
When running the script, the output typically appears as follows:
0 articles were scraped.
Samples:
[]
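You can confirm that Cloudflare is the blocker by inspecting the raw response before adding any workarounds. Here is a minimal check, reusing the headers defined above (the exact status code and page body can vary):
response = requests.get("https://www.channelstv.com/2024/08/01/", headers=headers)
# Cloudflare typically answers blocked requests with a 403 and a challenge page
print(response.status_code)
print("cloudflare" in response.text.lower())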
Install cloudscraper to bypass Cloudflare:
pip install cloudscraper==1.2.71
Modify the script to use cloudscraper:
import cloudscraper

def fetch_html_content(url, headers):
    try:
        scraper = cloudscraper.create_scraper()
        response = scraper.get(url, headers=headers)
        if response.status_code == 200:
            return response
        else:
            print(f"Failed to fetch URL: {url}. Status code: {response.status_code}")
            return None
    except Exception as e:
        print(f"An error occurred while fetching URL: {url}. Error: {str(e)}")
        return None
This function, fetch_html_content, takes a URL and request headers as input. It attempts to retrieve the webpage using a scraper created with cloudscraper.create_scraper(). If the request succeeds (status code 200), the response is returned; otherwise, an error message is printed and None is returned. If an exception occurs, the error is caught and displayed before returning None.
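As a quick sanity check, you can call the helper on its own before rewiring the rest of the script (again using the headers defined earlier):
page = fetch_html_content("https://www.channelstv.com/2024/08/01/", headers)
if page is not None:
    print(f"Fetched {len(page.content)} bytes")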
With this helper in place, replace every requests.get call with fetch_html_content so each request can pass Cloudflare's checks. Start with the extract_article_data function:
def extract_article_data(article_source, headers):
    response = fetch_html_content(article_source, headers)
After that, replace the requests.get call in your scrape_articles_per_day function like this:
def scrape_articles_per_day(base_url, headers):
    day_data = []
    page = 1
    while True:
        page_url = f"{base_url}/page/{page}"
        response = fetch_html_content(page_url, headers)
With these changes in place, the cloudscraper library handles Cloudflare's challenges for you. When you run the updated code, the output looks like this:
Failed to fetch URL: https://www.channelstv.com/2024/08/01//page/5. Status code: 404
55 articles were scraped.
Samples:
[{'date': datetime.date(2024, 8, 1),
'title': 'Resilience, Tear Gas, Looting, Curfew As #EndBadGovernance Protests Hold',
'link': 'https://www.channelstv.com/2024/08/01/tear-gas-resilience-looting-curfew-as-endbadgovernance-protests-hold/',
'tags': ['Eagle Square', 'Hunger', 'Looting', 'MKO Abiola Park', 'violence'],
'categories': ['Headlines']},
{'date': datetime.date(2024, 8, 1),
'title': "Mother Of Russian Artist Freed In Prisoner Swap Waiting To 'Hug' Her",
'link': 'https://www.channelstv.com/2024/08/01/mother-of-russian-artist-freed-in-prisoner-swap-waiting-to-hug-her/',
'tags': ['Prisoner Swap', 'Russia'],
'categories': ['World News']}]
With cloudscraper, you can define proxies and pass them to your already-created cloudscraper object like this:
scraper = cloudscraper.create_scraper()

proxy = {
    'http': 'http://your-proxy-ip:port',
    'https': 'https://your-proxy-ip:port'
}

response = scraper.get(URL, proxies=proxy)
Here, you start by creating a scraper object with default values. Then, you define a proxy dictionary with http and https entries. Finally, you pass that dictionary to the scraper.get method just as you would with a regular requests.get call.
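Because the scraper object behaves like a Requests session, you can also set the proxies once on the session so every subsequent request uses them. Here is a small sketch with the same placeholder proxy values:
scraper = cloudscraper.create_scraper()
# Session-wide proxies: every scraper.get() call now routes through the proxy
scraper.proxies.update({
    'http': 'http://your-proxy-ip:port',
    'https': 'https://your-proxy-ip:port'
})
response = scraper.get(URL)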
The cloudscraper library can auto-generate user agents, and it lets you specify the JavaScript interpreter and browser profile your scraper uses. Here is some example code:
scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    }
)
The above script sets the interpreter to "nodejs" and passes a dictionary to the browser parameter. The browser is set to Chrome, the platform to "ios", and desktop to False, which tells cloudscraper to generate a mobile user agent.
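If you want to confirm what cloudscraper generated, the chosen user-agent string is stored in the session's default headers:
# The User-Agent should reflect the browser settings above (mobile Chrome on iOS)
print(scraper.headers['User-Agent'])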
The cloudscraper library supports third-party CAPTCHA solvers to bypass reCAPTCHA, hCaptcha, and more. The following snippet shows you how to modify your scraper to handle CAPTCHA:
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': 'capsolver',
        'api_key': 'your_capsolver_api_key'
    }
)
The code uses Capsolver as the CAPTCHA provider along with the Capsolver API key. Both values are stored in a dictionary that is passed to the captcha parameter of the cloudscraper.create_scraper method.
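Capsolver is not the only option; for instance, 2Captcha can be configured the same way (substitute your own API key):
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': '2captcha',
        'api_key': 'your_2captcha_api_key'
    }
)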
If you run into errors with cloudscraper itself, start with the basics. Ensure cloudscraper is installed:
pip install cloudscraper
Then, check that your virtual environment is activated. On Windows:
<venv-name>\Scripts\activate.bat
On Linux or macOS:
source <venv-name>/bin/activate
If cloudscraper is installed but still fails to bypass Cloudflare, you may be running an outdated version. Update the package:
pip install -U cloudscraper
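To confirm which version is active after the update, you can print the package's version string:
import cloudscraper
print(cloudscraper.__version__)  # e.g. 1.2.71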
Bright Data provides robust proxy networks to bypass Cloudflare. Create an account, configure a proxy zone, and get your proxy credentials. Then, use those credentials to access the data at your target URL like this:
import requests

host = 'brd.superproxy.io'
port = 22225
username = 'brd-customer-<customer_id>-zone-<zone_name>'
password = '<zone_password>'

proxy_url = f'http://{username}:{password}@{host}:{port}'

proxies = {
    'http': proxy_url,
    'https': proxy_url
}

URL = "https://www.channelstv.com/2024/08/01/"  # same target as before
response = requests.get(URL, proxies=proxies)
Here, you make a GET request with the Python Requests library, passing the proxies via the proxies parameter.
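The same proxies dictionary also works with cloudscraper, since its scraper object accepts the standard Requests proxies parameter. Combining the two looks like this:
import cloudscraper

scraper = cloudscraper.create_scraper()
response = scraper.get(URL, proxies=proxies)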
While cloudscraper is useful, it has its limits. Consider trying Bright Data's proxy network and Web Unlocker to access Cloudflare-protected sites.
Start with a free trial today!