
Web Scraping With cURL Impersonate


This guide explains how to use cURL Impersonate to mimic browser behavior for web scraping.

What Is cURL Impersonate?

cURL Impersonate is a specialized cURL build designed to mimic major browsers (Chrome, Edge, Safari, and Firefox). This tool performs TLS and HTTP handshakes that closely resemble those of real browsers.

You can use this HTTP client either through the curl-impersonate command-line tool, similar to regular curl, or as a library in Python.

These browsers can be impersonated:

Browser         Simulated OS    Wrapper Script
Chrome 99       Windows 10      curl_chrome99
Chrome 100      Windows 10      curl_chrome100
Chrome 101      Windows 10      curl_chrome101
Chrome 104      Windows 10      curl_chrome104
Chrome 107      Windows 10      curl_chrome107
Chrome 110      Windows 10      curl_chrome110
Chrome 116      Windows 10      curl_chrome116
Chrome 99       Android 12      curl_chrome99_android
Edge 99         Windows 10      curl_edge99
Edge 101        Windows 10      curl_edge101
Firefox 91 ESR  Windows 10      curl_ff91esr
Firefox 95      Windows 10      curl_ff95
Firefox 98      Windows 10      curl_ff98
Firefox 100     Windows 10      curl_ff100
Firefox 102     Windows 10      curl_ff102
Firefox 109     Windows 10      curl_ff109
Firefox 117     Windows 10      curl_ff117
Safari 15.3     macOS Big Sur   curl_safari15_3
Safari 15.5     macOS Monterey  curl_safari15_5

Each supported browser has a specific wrapper script that configures curl-impersonate with the appropriate headers, flags, and settings to simulate that browser.

How curl-impersonate Works

When sending an HTTPS request, a TLS handshake occurs. During this process, details about the HTTP client are shared with the web server, creating a unique TLS fingerprint.

Standard HTTP clients have configurations different from browsers, resulting in a TLS fingerprint that easily reveals automated requests. This allows anti-bot systems to detect and block your scraping attempts.

cURL Impersonate solves this by modifying the standard curl tool to match real browsers' TLS fingerprints through:

  • TLS library modification: For Chrome versions, curl is compiled with BoringSSL, Google's TLS library. For Firefox versions, it uses NSS, Firefox's TLS library.
  • Configuration adjustments: It modifies cURL's TLS extensions and SSL options to mimic browser settings and adds support for browser-specific TLS extensions.
  • HTTP/2 handshake customization: It aligns cURL's HTTP/2 connection settings with real browsers.
  • Non-default flags: It runs curl with specific non-default flags, such as --ciphers and --curves, plus custom headers, to further mimic browser behavior.

This makes curl-impersonate requests appear as if they come from a real browser, helping bypass many bot detection mechanisms.
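
To see the difference this makes, you can fetch a TLS fingerprinting service with and without impersonation and compare what the server reports. The sketch below uses the curl_cffi Python binding covered later in this guide; the endpoint URL is an assumption and its response format may change:

from curl_cffi import requests

# ask a fingerprinting service what TLS fingerprint it observes
# (assumed endpoint; any JA3/TLS echo service works the same way)
response = requests.get("https://tls.browserleaks.com/json", impersonate="chrome")

# the JSON describes the TLS handshake the server saw, which should
# now match a real Chrome browser rather than a stock HTTP client
print(response.json())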

curl-impersonate: Command Line Tutorial

Follow these steps to use cURL Impersonate from the command line.

Note: Multiple installation methods are shown, but you only need one. Docker is recommended.

Installation From Pre-Compiled Binaries

Download pre-compiled binaries for Linux and macOS from the GitHub releases page. Before using them, install:

  • NSS (Network Security Services): Libraries supporting cross-platform security-enabled applications.
  • CA certificates: Digital certificates authenticating server and client identities.

To meet prerequisites on Ubuntu:

sudo apt install libnss3 nss-plugin-pem ca-certificates

On Red Hat, Fedora, or CentOS, run:

sudo yum install nss nss-pem ca-certificates

On Arch Linux, run:

sudo pacman -S nss ca-certificates

On macOS, run:

brew install nss ca-certificates

Also make sure zlib is installed on your system, since the pre-compiled binary packages are gzipped.

Installation through Docker

Docker images with curl-impersonate are available on Docker Hub, based on Alpine Linux and Debian.

Chrome images (*-chrome) can impersonate Chrome, Edge, and Safari. Firefox images (*-ff) can impersonate Firefox.

To download a Docker image:

For Chrome version on Alpine Linux:

docker pull lwthiker/curl-impersonate:0.5-chrome

For Firefox version on Alpine Linux:

docker pull lwthiker/curl-impersonate:0.5-ff

For Chrome version on Debian:

docker pull lwthiker/curl-impersonate:0.5-chrome-slim-buster

For Firefox version on Debian:

docker pull lwthiker/curl-impersonate:0.5-ff-slim-buster

Once downloaded, execute curl-impersonate using a docker run command.

Installation From Distro Packages

On Arch Linux, install through the AUR package curl-impersonate-bin.

On macOS, install the unofficial Homebrew package:

brew tap shakacode/brew

brew install curl-impersonate

Basic Usage

Execute a curl-impersonate command using:

curl-impersonate-wrapper [options] [target-url]

Or with Docker:

docker run --rm lwthiker/curl-impersonate:[curl-impersonate-version] curl-impersonate-wrapper [options] [target-url]

Where:

  • curl-impersonate-wrapper is your chosen wrapper (e.g., curl_chrome116, curl_edge101)
  • options are optional cURL flags
  • target-url is the web page URL

Be cautious when adding custom options, as some flags (such as --ciphers) can change the TLS fingerprint and break the impersonation.

The wrappers automatically set default HTTP headers, which you can customize by modifying the scripts.

Example: Request the Wikipedia homepage using Chrome:

curl_chrome110 https://www.wikipedia.org

With Docker:

docker run --rm lwthiker/curl-impersonate:0.5-chrome curl_chrome110 https://www.wikipedia.org

Result:

<html lang="en" class="no-js">
  <head>
    <meta charset="utf-8">
    <title>Wikipedia</title>
    <meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">
<!-- omitted for brevity... -->

The server returns the HTML as if you were using a browser.

curl-impersonate: Python Tutorial

While the command line is great for quick tests, web scraping projects are typically written in languages like Python.

You can use cURL Impersonate in Python through curl-cffi, a Python binding for curl-impersonate.

Prerequisites:

  • Python 3.8+
  • A Python project with virtual environment setup
  • Optionally, a Python IDE like Visual Studio Code

Installation:

Install via pip:

pip install curl_cffi

Usage:

Typically, you want to use the requests-like API. To do this, import requests from curl_cffi and make a GET request, specifying the browser to impersonate:

from curl_cffi import requests

response = requests.get("https://www.wikipedia.org", impersonate="chrome")

Print the response HTML with:

print(response.text)

Put it all together, and you will get:

from curl_cffi import requests

# make a GET request to the target page with
# the Chrome version of curl-impersonate
response = requests.get("https://www.wikipedia.org", impersonate="chrome")

# print the server response
print(response.text)

Running this script prints:

<html lang="en" class="no-js">
  <head>
    <meta charset="utf-8">
    <title>Wikipedia</title>
    <meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">
<!-- omitted for brevity... -->
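
curl_cffi also lets you pin a specific browser version and reuse connections and cookies across requests through a requests-style Session. Below is a minimal sketch under those assumptions; the target URL and header are just examples:

from curl_cffi import requests

# a session persists cookies and reuses connections across requests
session = requests.Session()

# impersonate a specific Chrome build and add a custom header
response = session.get(
    "https://www.wikipedia.org",
    impersonate="chrome110",
    headers={"Accept-Language": "en-US"},
)
print(response.status_code)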

cURL Impersonate Advanced Usage

Proxy Integration

Browser fingerprint simulation alone might not be enough against sophisticated anti-bot solutions, which also track and block IP addresses. Proxies help by routing your requests through fresh IPs.

To use a proxy with cURL Impersonate from the command line, pass the -x (or --proxy) option to your chosen wrapper:

curl_chrome110 -x http://84.18.12.16:8888 https://httpbin.org/ip

In Python:

from curl_cffi import requests

# route both HTTP and HTTPS traffic through the proxy
proxies = {"http": "http://84.18.12.16:8888", "https": "http://84.18.12.16:8888"}

response = requests.get("https://httpbin.org/ip", impersonate="chrome", proxies=proxies)
print(response.text)  # should report the proxy's IP address

Libcurl Integration

libcurl-impersonate is a compiled version of libcurl with the cURL Impersonate patches applied, plus an extended API for configuring TLS details and header settings.

Install it using the pre-compiled packages. It makes it possible to integrate cURL Impersonate into libraries written in various programming languages.
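
As an illustration, the patched library exposes an extended curl_easy_impersonate() function. The ctypes sketch below is a rough outline of calling it from Python; the shared library name (libcurl-impersonate-chrome.so) and its availability on the loader path are assumptions about your install:

import ctypes

# load the patched libcurl (name assumed; may be versioned, e.g. .so.4)
libcurl = ctypes.CDLL("libcurl-impersonate-chrome.so")

# declare the signatures we rely on so pointers are not truncated
libcurl.curl_easy_init.restype = ctypes.c_void_p
libcurl.curl_easy_setopt.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_char_p]
libcurl.curl_easy_impersonate.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_int]
libcurl.curl_easy_perform.argtypes = [ctypes.c_void_p]
libcurl.curl_easy_cleanup.argtypes = [ctypes.c_void_p]

CURLOPT_URL = 10002  # CURLOPTTYPE_STRINGPOINT + 2, from curl.h

curl = libcurl.curl_easy_init()
libcurl.curl_easy_setopt(curl, CURLOPT_URL, b"https://www.wikipedia.org")

# the extended API call: impersonate Chrome 110 and send its default headers
libcurl.curl_easy_impersonate(curl, b"chrome110", 1)

# with no write callback set, libcurl prints the response body to stdout
libcurl.curl_easy_perform(curl)
libcurl.curl_easy_cleanup(curl)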

Conclusion

Note that advanced anti-bot solutions like Cloudflare may still detect automated requests. For a comprehensive solution, consider Bright Data's Scraper API, which handles browser fingerprinting, CAPTCHA solving, and IP rotation.

Register for a free trial of Bright Data's web scraping infrastructure!