How to use it if the examples are from different pages? #103

Open
brenorb opened this issue Jan 27, 2025 · 1 comment
brenorb commented Jan 27, 2025

Hi,

I'm trying to scrape a list of articles, which is a two-step process.
In the first step I provide the first target URL. That works great: I get a list of all the article URLs.

In the second step I need to open each page and scrape just ONE thing: the article.

When I tried that, it returned a list with just one string: the copyright notice at the end of the page.

# /// script
# dependencies = [
#   "autoscraper",
# ]
# ///

# If you want to automatically scrape
# a website with Python, use 'autoscraper'
# pip install autoscraper

from autoscraper import AutoScraper

# Define the listing URL and the wanted data
# (one example article link for the scraper to learn from)
url = "https://example-blog.com"
wanted_list = [
    "https://example-blog.com/articles/4415101547419-first-article",
]

# Create an instance and build the scraper model
scraper = AutoScraper()
result = scraper.build(url, wanted_list)

# Testing the model
print(result)

# Save result to file, one link per line
with open("links.txt", "w") as f:
    for link in result:
        f.write(link + "\n")

sample_article = """Article Title
Article first paragraph. 

Article second paragraph.

Last Paragraph."""

for link in result:
    article = scraper.build(link, sample_article)
    print(article)
    break  # stop after the first article just to check whether it got what I needed
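
One possible direction, offered as a sketch rather than a confirmed fix: `build` expects `wanted_list` to be a list of strings, so passing the whole `sample_article` string is likely iterated character by character, and a multi-paragraph sample probably does not match the text of any single element anyway. The sketch below splits the sample into one string per paragraph, trains a second scraper on one article page, and reuses the learned rules on the other pages with `get_result_similar`. The helper names (`paragraphs_as_wanted_list`, `build_article_scraper`, `scrape_articles`) are hypothetical, and it assumes every article page shares the same layout and that the sample text is copied exactly from the page used for training.

```python
def paragraphs_as_wanted_list(sample_text):
    """Turn a pasted multi-line sample into one stripped string per non-empty line."""
    return [line.strip() for line in sample_text.splitlines() if line.strip()]


def build_article_scraper(sample_url, sample_text):
    """Train a dedicated scraper on ONE article page (hypothetical helper)."""
    # Imported here so the pure helper above works without the library installed.
    from autoscraper import AutoScraper

    # A separate instance, so the rules learned for the link list are not overwritten.
    article_scraper = AutoScraper()
    article_scraper.build(sample_url, wanted_list=paragraphs_as_wanted_list(sample_text))
    return article_scraper


def scrape_articles(article_scraper, links):
    """Apply the learned rules to each link without rebuilding the model."""
    return {link: article_scraper.get_result_similar(link) for link in links}
```

With the script above, this would be used roughly as `article_scraper = build_article_scraper(result[0], sample_article)` followed by `articles = scrape_articles(article_scraper, result)`, where `sample_article` holds text copied from the page at `result[0]`.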

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the Stale label Feb 27, 2025