I'm trying to scrape a list of articles, so it's a two-step process.
In the first step I provide the initial target URL, and that works great: I get a list of all the article URLs.
In the second step I need to open each page and scrape just ONE thing: the article.
I tried it, and it gave me a list with just one string: the copyright notice at the end of the page.
# /// script
# dependencies = [
# "autoscraper",
# ]
# ///
# If you want to automatically scrape
# a website with Python, use 'autoscraper'
# pip install autoscraper
from autoscraper import AutoScraper
# Define the listing URL and the wanted data
# (one example article link copied from that page, for the scraper to learn from)
url = "https://example-blog.com"
wanted_list = [
    "https://example-blog.com/articles/4415101547419-first-article",
]
# Create an instance and build the scraper model
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
# Testing the model
print(result)
# Save result to file, one link per line
with open("links.txt", "w") as f:
    for link in result:
        f.write(link + "\n")
sample_article = """Article Title
Article first paragraph.
Article second paragraph.
Last Paragraph."""
for link in result:
    article = scraper.build(link, sample_article)
    print(article)
    break  # I break on the first article just to see if it got what I needed
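For reference, here is a minimal sketch of how I understand the second step is meant to work, in case it helps narrow things down. As far as I can tell, `build` expects `wanted_list` to be a list of strings, each copied exactly from the page, whereas the loop above passes one multi-line string. Calling `build` again on the same instance also replaces the rules learned in the first step unless `update=True` is passed. The sketch trains a separate scraper once on a sample article and then reuses it with `get_result_similar`. The names `scrape_articles` and `main` and the sample paragraphs are hypothetical placeholders, not part of AutoScraper.

```python
def scrape_articles(article_scraper, sample_url, sample_paragraphs, links):
    """Learn rules from one sample article, then reuse them on every link."""
    # wanted_list is passed as a list of strings; each one should be
    # text copied exactly from the sample article page.
    article_scraper.build(sample_url, wanted_list=list(sample_paragraphs))

    articles = {}
    for link in links:
        # Apply the learned rules to each page without rebuilding them.
        articles[link] = article_scraper.get_result_similar(link)
    return articles


def main():
    # Imported here so the helper above can be exercised without the package.
    from autoscraper import AutoScraper

    # links.txt is the file written in the first step, one link per line.
    with open("links.txt") as f:
        links = [line.strip() for line in f if line.strip()]

    # Placeholder strings (assumption): replace with real text from links[0].
    sample_paragraphs = ["Article Title", "Article first paragraph."]

    # A fresh AutoScraper instance keeps the link-list rules untouched.
    articles = scrape_articles(AutoScraper(), links[0], sample_paragraphs, links)
    for link, parts in articles.items():
        print(link, parts)
```

Calling `main()` after the first step has written `links.txt` should run the whole second step. Whether `get_result_similar` returns the full article body will depend on how consistently the site marks up its articles.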