This guide explains how to use the lxml package in Python to parse static and dynamic content, overcome common challenges, and streamline your data extraction process.
- Using lxml for Web Scraping in Python
- Prerequisites
- Parsing Static HTML Content
- Parsing Dynamic HTML Content
- Using lxml with Bright Data Proxy
On the web, structured and hierarchical data can be represented in two formats—HTML and XML:
- XML is a basic structure that does not come with prebuilt tags and styles; the coder creates the structure by defining their own tags. The tags' main purpose is to establish a standard data structure that different systems can understand.
- HTML is a web markup language with predefined tags. These tags come with some styling properties, such as font-size in <h1> tags or display for <img /> tags. HTML's primary function is to structure web pages effectively.
lxml works with both HTML and XML documents.
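As a quick illustration (a minimal sketch, not part of the scripts built later in this guide), the same library exposes parsers for both formats:
from lxml import etree, html

# Parse a small XML document whose tags are defined by its author
xml_root = etree.fromstring("<books><book><title>Sample</title></book></books>")
print(xml_root.xpath("//title/text()"))  # ['Sample']

# Parse an HTML fragment that uses predefined tags
html_root = html.fromstring("<h1>Hello, world</h1>")
print(html_root.text_content())  # Hello, world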
Before you can start web scraping with lxml, you need to install a few libraries on your machine:
pip install lxml requests cssselect
This command installs the following:
- lxml to parse XML and HTML
- requests for fetching web pages
- cssselect, which uses CSS selectors to extract HTML elements
Two main types of web content can be scraped: static and dynamic. Static content is embedded in the HTML document when the web page initially loads, making it easy to scrape. In contrast, dynamic content is loaded continuously or triggered by JavaScript after the initial page load.
To start, open https://books.toscrape.com/ (the site scraped in this example) and use your browser's Dev Tools to identify the relevant HTML elements. Open Dev Tools by right-clicking the web page and selecting the Inspect option or pressing F12 in Chrome.
The right side of the screen displays the code responsible for rendering the page. To locate the specific HTML element that handles each book’s data, search through the code using the hover-to-select option (the arrow in the top-left corner of the screen):
In Dev Tools, you should see the following code snippet:
<article class="product_pod">
<!-- code omitted -->
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<!-- code omitted -->
</div>
</article>
Create a new file named static_scrape.py and input the following code:
import requests
from lxml import html
import json
URL = "https://books.toscrape.com/"
content = requests.get(URL).text
Next, parse the HTML and extract data:
parsed = html.fromstring(content)
all_books = parsed.xpath('//article[@class="product_pod"]')
books = []
This code initializes the parsed variable using html.fromstring(content), which parses the HTML content into a hierarchical tree structure. The all_books variable uses an XPath selector to retrieve all <article> tags with the class product_pod from the web page. The [@class="product_pod"] attribute syntax is specific to XPath expressions.
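For comparison only (this line is not part of the script), the same elements could be selected with the cssselect package installed earlier. Note that the CSS class selector matches any element whose class attribute contains product_pod, whereas the XPath expression above requires an exact match:
all_books = parsed.cssselect("article.product_pod")  # CSS equivalent of the XPath query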
Next, iterate through the books and extract titles and prices:
for book in all_books:
    book_title = book.xpath('.//h3/a/@title')[0]
    price = book.cssselect("p.price_color")[0].text_content()
    books.append({"title": book_title, "price": price})
The book_title variable is defined using an XPath selector that retrieves the title attribute from an <a> tag within an <h3> tag. The dot (.) at the beginning of the XPath expression specifies that the search should start from the current <article> tag rather than the document root. Because xpath() returns a list of matches, indexing ([0]) takes the single title found inside that article.
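To see the difference the leading dot makes, compare the two forms on the same element (an illustrative snippet, not part of the script):
book.xpath('.//h3/a/@title')  # relative: searches only inside this <article> element
book.xpath('//h3/a/@title')   # absolute: ignores the context node and searches the whole document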
The next line uses the cssselect method to extract the price from a <p> tag with the class price_color. Since cssselect also returns a list, indexing ([0]) accesses the first element, and text_content() retrieves the text inside the element.
Each extracted title and price pair is then appended to the books list as a dictionary, which can be easily stored in a JSON file.
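If you want the script to tolerate listings that are missing a title or a price, a slightly more defensive version of the loop (a sketch, not part of the original script) checks the result lists before indexing into them:
for book in all_books:
    titles = book.xpath('.//h3/a/@title')
    prices = book.cssselect("p.price_color")
    if not titles or not prices:
        continue  # skip incomplete listings instead of raising an IndexError
    books.append({"title": titles[0], "price": prices[0].text_content()})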
Now, save the extracted data as a JSON file:
with open("books.json", "w", encoding="utf-8") as file:
json.dump(books, file)
Run the script:
python static_scrape.py
This command generates a new file named books.json in your directory.
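Based on the product snippet inspected earlier, the first entry in that file should look roughly like this (a sketch of the expected output; json.dump escapes the pound sign by default, so pass ensure_ascii=False if you prefer the raw character):
[{"title": "A Light in the Attic", "price": "\u00a351.77"}, ...]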
All the code for this script is available on GitHub.
To scrape dynamic content, install Selenium:
pip install selenium
YouTube is a great example of content rendered using JavaScript. Let's scrape data for the top hundred videos from the freeCodeCamp.org YouTube channel by emulating keyboard presses to scroll the page.
To begin, inspect the HTML code of the web page with Dev Tools:
The following code identifies the elements responsible for displaying the video title and link:
<a id="video-title-link" class="yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media" href="/watch?v=i740xlsqxEM">
<yt-formatted-string id="video-title" class="style-scope ytd-rich-grid-media">GitHub Advanced Security Certification – Pass the Exam!
</yt-formatted-string></a>
The video title is within the yt-formatted-string tag with the ID video-title, and the video link is located in the href attribute of the a tag with the ID video-title-link.
Create dynamic_scrape.py and import the required modules:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from lxml import html
from time import sleep
import json
Define the browser driver:
URL = "https://www.youtube.com/@freecodecamp/videos"
videos = []
driver = webdriver.Chrome()
driver.get(URL)
sleep(3)
Similar to the previous script, you declare a URL variable containing the web URL that you want to scrape and a videos variable that stores all the data as a list.

Next, a driver variable is declared (i.e., a Chrome instance) that you use to interact with the browser. The get() function opens the browser instance and sends a request to the specified URL.

After that, you call the sleep function to wait for three seconds before accessing any element on the web page to ensure all the HTML code is loaded in the browser.
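If you'd rather not rely on a fixed delay, Selenium's explicit waits are a common alternative; a minimal sketch, assuming the video links render under the same selector used later in this script:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the first video link to appear instead of sleeping
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a#video-title-link"))
)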
Now, emulate scrolling down to load more videos:
parent = driver.find_element(By.TAG_NAME, 'html')
for i in range(4):
    parent.send_keys(Keys.END)
    sleep(3)
The send_keys method simulates pressing the END key to scroll to the bottom of the page, triggering more videos to load. This action is repeated four times within a for loop to ensure enough videos are loaded. The sleep function pauses for three seconds after each scroll to allow the videos to load before scrolling again.
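If sending key presses proves unreliable (for example, when the page body doesn't have keyboard focus), scrolling via JavaScript is a common alternative; a minimal sketch:
for i in range(4):
    # Jump to the current bottom of the page, then wait for new videos to load
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    sleep(3)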
Next, extract video titles and links:
html_data = html.fromstring(driver.page_source)
videos_html = html_data.cssselect("a#video-title-link")
for video in videos_html:
    title = video.text_content()
    link = "https://www.youtube.com" + video.get("href")
    videos.append({"title": title, "link": link})
In this code, you pass the HTML content from the driver's page_source attribute to the fromstring method, which builds a hierarchical tree of the HTML.
Then, you select all <a> tags with the ID video-title-link using CSS selectors, where the # sign indicates selection by the tag's ID. This selection returns a list of elements that satisfy the given criteria.
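The same selection can also be written as an XPath query if you prefer to stay with one query language (shown for comparison only, not part of the script):
videos_html = html_data.xpath('//a[@id="video-title-link"]')  # XPath equivalent of "a#video-title-link"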
The code then iterates over each element to extract the title and link. The text_content method retrieves the inner text (the video title), while the get method fetches the href attribute value (the video link).

Finally, the data is stored in a list called videos.
Now, save the data as a JSON file and close the driver:
with open('videos.json', 'w') as file:
    json.dump(videos, file)
driver.close()
Run the script:
python dynamic_scrape.py
After running the script, a new file named videos.json is created in your directory.
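Based on the element inspected earlier, each entry in videos.json should look roughly like this (a sketch of the expected output; non-ASCII characters such as the dash in the title are escaped by default unless you pass ensure_ascii=False to json.dump):
{"title": "GitHub Advanced Security Certification – Pass the Exam!", "link": "https://www.youtube.com/watch?v=i740xlsqxEM"}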
All the code for this script is also available on GitHub.
Web scraping often runs into challenges such as anti-scraping tools and rate limits. Proxy servers help by masking the user's IP address, and Bright Data provides reliable proxy services for this purpose.
To start, obtain proxies from Bright Data by signing up for a free trial. After creating a Bright Data account, you’ll see the following dashboard:
Navigate to the My Zones option and create a new residential proxy. This will reveal your proxy username, password, and host, which you need in the next step.
Next, modify static_scrape.py by adding the following code below the URL variable:
URL = "https://books.toscrape.com/"
# new
username = ""
password = ""
hostname = ""
proxies = {
    "http": f"https://{username}:{password}@{hostname}",
    "https": f"https://{username}:{password}@{hostname}",
}
content = requests.get(URL, proxies=proxies).text
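Before running the full scraper, you can optionally confirm that requests are actually routed through the proxy; a quick sanity check (httpbin.org is used here purely as an illustrative endpoint and is not part of the original script):
# Should print the proxy's IP address rather than your own
print(requests.get("https://httpbin.org/ip", proxies=proxies).json())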
Replace placeholders with your Bright Data credentials and run the script:
python static_scrape.py
After running this script, you’ll see a similar output to what you received in the previous example.
You can view this entire script on GitHub.
Using lxml with Python enables efficient web scraping, but building and maintaining scrapers can be time-consuming. Bright Data offers an efficient alternative with its ready-to-use datasets and Web Scraper API.
Try Bright Data for free!