Turn any webpage into LLM-friendly input text. Similar to the Firecrawl and Jina Reader APIs. Makes RAG, AI web scraping, and extraction of image and webpage links easy.

m92vyas/llm-reader

Hi, apart from this open source solution, if you need to crawl websites, avoid getting blocked, or scrape dynamic websites while converting webpages to LLM-ready input text, you can try the paid API service ParseExtract ($0.005 per URL).

The library below scrapes a page and gives you LLM-ready text, but without any anti-blocking service.

ParseExtract also provides APIs for parsing PDF, DOCX, and image files for RAG, OCR, and other LLM applications.

You can also extract structured data and tables using the same API.

Now, back to our open source solution.

Webpage to LLM Ready Input Text

Pre-processing a webpage before giving it to an LLM as input improves extraction/scraping accuracy, especially if you want to extract the webpage and image links required for most scraping operations, such as scraping an e-commerce website.
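To make the idea concrete, here is a minimal sketch of this kind of pre-processing using only the Python standard library. It is not this library's implementation; the class `LinkPreservingText` and the function `html_to_llm_text` are hypothetical names made up for illustration. The sketch drops script and style content and keeps link and image URLs inline next to the visible text, so an LLM can still see them.

```python
from html.parser import HTMLParser


class LinkPreservingText(HTMLParser):
    """Hypothetical helper: collect visible text, keeping link and image URLs inline."""

    SKIP = {"script", "style"}  # tags whose content is never shown to the reader

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0
        self._href_stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            # Remember the target so it can be emitted after the link text.
            self._href_stack.append(attrs.get("href"))
        elif tag == "img" and attrs.get("src"):
            alt = attrs.get("alt") or ""
            self.parts.append(f"[image: {alt}]({attrs['src']})")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href_stack:
            href = self._href_stack.pop()
            if href:
                self.parts.append(f"({href})")

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_llm_text(html: str) -> str:
    """Return a single line of text with URLs preserved next to their labels."""
    parser = LinkPreservingText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = '<p>Buy <a href="/p/1">Red Shoe</a></p><img src="/i/1.png" alt="shoe"><script>x()</script>'
print(html_to_llm_text(sample))
```

A real page needs more than this (rendering dynamic content, resolving relative URLs, trimming navigation), which is the work a dedicated library takes off your hands.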

Use this library to turn any webpage/URL into LLM-friendly text. It is a fully open source alternative to the Jina Reader API and the Firecrawl API.

You can also refer to my other repo, AI-web_scraper, for direct scraping tools that run a web search and scrape multiple links from a single simple query. It supports multiple LLMs and web search, and extracts data according to your written instructions.

Install:

pip install git+https://github.com/m92vyas/llm-reader.git

Get LLM input text:

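import asyncio

from url_to_llm_text.get_llm_ready_text import url_to_llm_text

url = "url_to_scrape"

# url_to_llm_text is a coroutine. In a plain script, run it with asyncio.run();
# in a notebook (where an event loop is already running) use instead:
#     llm_text = await url_to_llm_text(url)
llm_text = asyncio.run(url_to_llm_text(url))

print(llm_text)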
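Since `url_to_llm_text` is awaited, several pages can be converted concurrently. The sketch below is one possible way to do that, not part of this library: `fetch_many` and its `max_concurrency` parameter are hypothetical names, and the converter is passed in as an argument so the helper works with any coroutine that takes a URL and returns text. A URL whose conversion raises is mapped to `None` instead of aborting the whole batch.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional


async def fetch_many(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,
) -> Dict[str, Optional[str]]:
    """Run `fetch` over all URLs, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception:
                # Keep going if one page fails; the caller can retry the None entries.
                return url, None

    pairs = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(pairs)
```

Assuming the import shown above, a script could call it as `results = asyncio.run(fetch_many(["url_1", "url_2"], url_to_llm_text))`. Keeping the concurrency limit small is a deliberate choice here: the library has no anti-blocking service, so many simultaneous requests to one site are more likely to get blocked.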

Documentation:

https://github.com/m92vyas/llm-reader/wiki/Documentation

To Scrape and Crawl without getting Blocked:

Support & Feedback:
