This guide explains how to use regular expressions in Python for web scraping:
- Understanding Regular Expressions
- Implementing Regex in Python for Web Scraping
- Constraints of Using Regex for Web Scraping
Regular expressions (regex) are robust pattern-matching formulas for extracting information from text, making them valuable tools for web scraping. These expressions define specific patterns to match within texts, allowing for precise information extraction.
In Python, regular expressions use tokens to match particular patterns. While covering all tokens is beyond this article's scope, here are some frequently used ones you'll encounter:
| Token | Matches |
| --- | --- |
| Any non-special character | The character itself |
| `^` | Start of a string |
| `$` | End of a string |
| `.` | Any character except `\n` |
| `*` | Zero or more occurrences of the previous element |
| `?` | Zero or one occurrence of the previous element |
| `+` | One or more occurrences of the previous element |
| `{n}` | Exactly n occurrences of the previous element |
| `\d` | Any digit |
| `\s` | Any whitespace character |
| `\w` | Any word character |
| `\D` | Inverse of `\d` |
| `\S` | Inverse of `\s` |
| `\W` | Inverse of `\w` |
For hands-on experience and to learn more about regex, visit regexr.com. Additionally, this article provides essential tips for optimizing your regex performance.
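To get a feel for how these tokens behave before applying them to HTML, here's a quick standalone example (not part of the scraper) using Python's built-in `re` module:

```python
import re

text = "Order #42 shipped on 2023-05-17 for $19.99"

# \d+ matches one or more digits anywhere in the string
print(re.findall(r"\d+", text))  # ['42', '2023', '05', '17', '19', '99']

# {n} repeats the previous element an exact number of times
print(re.search(r"\d{4}-\d{2}-\d{2}", text).group())  # 2023-05-17

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.match(r"^Order.*\$\d+\.\d{2}$", text)))  # True
```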
In this section, we'll develop a basic web scraper in Python using regex to extract data from various websites.
First, create a project directory:
mkdir web_scraping_with_regex
cd web_scraping_with_regex
Then create a Python virtual environment:
python -m venv venv
Activate it:
source ./venv/bin/activate
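On Windows, run `venv\Scripts\activate` instead.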
For this web scraper, you'll need two libraries:
- `requests` for fetching web pages
- `beautifulsoup4` for parsing the HTML content and locating elements
Install the libraries:
pip install beautifulsoup4 requests
Note: Always check a website's terms and conditions before scraping to ensure it's permitted. Avoid scraping if explicitly forbidden.
Let's build a scraper that extracts the titles and prices of the books on the first page of a dummy e-commerce site.
Create a file named `scraper.py` and import the required modules:
import requests
from bs4 import BeautifulSoup
import re
Note: The `re` module is Python's built-in regular expression module.
Next, make a GET request to fetch the web page's HTML content:
page = requests.get('https://books.toscrape.com/')
Feed this data to Beautiful Soup to parse the HTML structure:
soup = BeautifulSoup(page.content, 'html.parser')
To figure out how the elements are structured in the HTML, use the Inspect Element tool: open the web page in your browser and press Ctrl + Shift + I to open the Inspector. The products are stored in `li` elements with the class `col-xs-6 col-sm-4 col-md-3 col-lg-3`. The book title can be read from the `title` attribute of the `a` elements, and the prices are stored in `p` elements with the class `price_color`.
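In simplified form (with the markup you don't need elided), each product entry looks something like this:

```html
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
  ...
  <a href="..." title="A Light in the Attic">A Light in the ...</a>
  ...
  <p class="price_color">£51.77</p>
  ...
</li>
```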
Use the `find_all` method of Beautiful Soup to find all the `li` elements with the class `col-xs-6 col-sm-4 col-md-3 col-lg-3`:
books = soup.find_all("li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
content = str(books)
The `content` variable now holds the HTML text of the `li` elements, and you can use regex to extract the titles and prices.
First, create regex patterns that match book titles and prices by examining the HTML structure again.
The book titles are found in the `title` attribute of `a` elements, which look like this:
<a href="..." title="...">
To capture the content within the double quotes after `title`, use the `.*?` regex pattern. Here, `.` matches any character, `*` matches zero or more occurrences of the preceding element, and `?` makes the match non-greedy. The complete expression is:
<a href=".*?" title="(.*?)"
The parentheses around the `.*?` create a capturing group, which stores the matched content for later use.
For prices, use a similar approach. Since prices appear in `p` elements with the class `price_color`, the regex pattern is `<p class="price_color">(.*?)</p>`.
Define both patterns:
re_book_title = r'<a href=".*?" title="(.*?)"'
re_prices = r'<p class="price_color">(.*?)</p>'
Note: In case you're wondering why the `?` is needed after `.*`, this Stack Overflow answer explains the role of `?` well.
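If you want to see the difference the `?` makes for yourself, here's a small standalone comparison of the greedy `.*` and the non-greedy `.*?` on a short HTML fragment:

```python
import re

html = '<a href="/page1" title="First"> ... <a href="/page2" title="Second">'

# Greedy: .* consumes as much as possible, so the match runs to the LAST closing quote
print(re.findall(r'title="(.*)"', html))
# ['First"> ... <a href="/page2" title="Second']

# Non-greedy: .*? stops at the first closing quote, giving one clean match per title
print(re.findall(r'title="(.*?)"', html))
# ['First', 'Second']
```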
Now, use `re.findall()` to find all regex matches in the HTML string:
titles = re.findall(re_book_title, content)
prices = re.findall(re_prices, content)
Finally, iterate over the matches and print the results:
for i in zip(titles, prices):
    print(f"{i[0]}: {i[1]}")
Run this code with `python scraper.py`. Here is the expected output:
A Light in the Attic: £51.77
Tipping the Velvet: £53.74
Soumission: £50.10
Sharp Objects: £47.82
Sapiens: A Brief History of Humankind: £54.23
The Requiem Red: £22.65
The Dirty Little Secrets of Getting Your Dream Job: £33.34
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull: £17.93
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics: £22.60
The Black Maria: £52.15
Starving Hearts (Triangular Trade Trilogy, #1): £13.99
Shakespeare's Sonnets: £20.66
Set Me Free: £17.46
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1): £52.29
Rip it Up and Start Again: £35.02
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991: £57.25
Olio: £23.88
Mesaerion: The Best Science Fiction Stories 1800-1849: £37.59
Libertarianism for Beginners: £51.33
It's Only the Himalayas: £45.17
Now, let’s build a scraper that can scrape a Wikipedia page and extract information about all the links.
Create a new file named `wiki_scraper.py`. As before, import the required libraries, make a GET request, and parse the content:
import requests
from bs4 import BeautifulSoup
import re
page = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
soup = BeautifulSoup(page.content, 'html.parser')
Use the `find_all()` method to find all the links:
links = soup.find_all("a")
content = str(links)
Since the link text appears in the `title` attribute and the URL in the `href` attribute, you can use the `(.*?)` pattern to capture both. The complete expression is:
<a href="(.*?)" title="(.*?)">.*?</a>
Note that the third `.*?` is not in a capturing group because you aren't interested in the content of the `a` tags.
As before, use `findall()` to find all the matches and print the result. Because the pattern contains two capturing groups, each match is returned as an `(href, title)` tuple:
re_links = r'<a href="(.*?)" title="(.*?)">.*?</a>'
links = re.findall(re_links, content)
for i in links:
    print(f"{i[0]} => {i[1]}")
When you run this with `python wiki_scraper.py`, you get the following output:
OUTPUT TRUNCATED FOR BREVITY
/wiki/Category:Web_scraping => Category:Web scraping
/wiki/Category:CS1_maint:_multiple_names:_authors_list => Category:CS1 maint: multiple names: authors list
/wiki/Category:CS1_Danish-language_sources_(da) => Category:CS1 Danish-language sources (da)
/wiki/Category:CS1_French-language_sources_(fr) => Category:CS1 French-language sources (fr)
/wiki/Category:Articles_with_short_description => Category:Articles with short description
/wiki/Category:Short_description_matches_Wikidata => Category:Short description matches Wikidata
/wiki/Category:Articles_needing_additional_references_from_April_2023 => Category:Articles needing additional references from April 2023
/wiki/Category:All_articles_needing_additional_references => Category:All articles needing additional references
/wiki/Category:Articles_with_limited_geographic_scope_from_October_2015 => Category:Articles with limited geographic scope from October 2015
/wiki/Category:United_States-centric => Category:United States-centric
/wiki/Category:All_articles_with_unsourced_statements => Category:All articles with unsourced statements
/wiki/Category:Articles_with_unsourced_statements_from_April_2023 => Category:Articles with unsourced statements from April 2023
The previous examples involved static web pages. Scraping dynamic pages requires browser automation tools like Selenium. Here's an example of extracting the current temperature from OpenWeatherMap using Selenium and regex:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
driver = webdriver.Firefox()
driver.get("https://openweathermap.org/city/2643743")
elem = WebDriverWait(driver, 10).until( EC.presence_of_element_located((By.CSS_SELECTOR, ".current-temp")))
content = elem.get_attribute('innerHTML')
re_temp = r'<span .*?>(.*?)</span>'
temp = re.findall(re_temp, content)
print(repr(temp))
driver.close()
This code launches Firefox with Selenium, selects the element containing the current temperature using a CSS selector, and then extracts the temperature using the regex `<span .*?>(.*?)</span>`.
For more information on scraping dynamic web pages with Selenium, check out this tutorial.
While regular expressions are powerful for pattern matching and data extraction, they have significant limitations for web scraping. Regex operates on text with no understanding of HTML structure, making results highly dependent on the HTML's formatting.
For instance, in the Wikipedia example, some links weren't correctly extracted. If you edit the Python code and add `print(content)` to print the HTML string returned by Beautiful Soup, you'll see that the culprit `a` tag looks like this:
<a href="#cite_ref-9">^</a>
This tag lacks a `title` attribute, yet the regex pattern assumes the structure `<a href="(.*?)" title="(.*?)">.*?</a>`. Because regex doesn't understand HTML elements, instead of failing to match, the `.*?` pattern keeps consuming characters until it finds something that completes the pattern, often incorrectly capturing content that spans multiple tags.
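You can reproduce the problem with a small standalone snippet: a `title`-less link followed by a normal one makes the pattern capture garbage that spans both tags:

```python
import re

content = (
    '<a href="#cite_ref-9">^</a> '
    '<a href="/wiki/Data_scraping" title="Data scraping">Data scraping</a>'
)

re_links = r'<a href="(.*?)" title="(.*?)">.*?</a>'
print(re.findall(re_links, content))
# [('#cite_ref-9">^</a> <a href="/wiki/Data_scraping', 'Data scraping')]
```

The extracted "URL" swallows the entire first tag, and the `title`-less link is lost entirely.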
Furthermore, HTML isn't a regular language, meaning regex alone can't reliably parse arbitrary HTML. However, regex can be useful in specific scenarios. If you're working with a limited, known HTML structure, regex can extract information effectively. Still, a more robust approach is using an HTML parser like Beautiful Soup to find elements and then apply regex to process the extracted text.
Here's an improved version of the Wikipedia scraper that uses Beautiful Soup for initial extraction and regex for filtering:
import requests
from bs4 import BeautifulSoup
import re
page = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all("a")
for link in links:
    href = link.get('href')
    title = link.get('title')
    if title is None:
        title = link.string
    if title is None:
        continue
    pattern = r"[a-zA-Z0-9]"
    if re.match(pattern, title):
        print(f"{href} => {title}")
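In this version, Beautiful Soup handles the HTML structure (including links that have no `title` attribute), and the regex only performs a simple text check: `re.match(r"[a-zA-Z0-9]", title)` keeps links whose text starts with an alphanumeric character and drops entries such as the `^` citation backlinks.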
Regular expressions are valuable tools for finding patterns in text data. However, web scraping presents numerous challenges beyond regex capabilities. Frequent scraping can lead to IP blocking, and CAPTCHAs can disrupt your scraper's functionality. Bright Data offers powerful proxies that can help overcome IP restrictions.
Start a free trial today!