Documentation Scraper
An advanced web scraping application for documentation websites, built with TypeScript and Node.js.
The Documentation Scraper is designed to crawl documentation websites and extract content into a single consolidated Markdown file. This tool is especially useful for creating local copies of documentation for offline access or for processing with Large Language Models (LLMs) and AI Agents.
- Crawls documentation websites starting from a specified URL
- Consolidates all scraped content into a single Markdown file
- Converts HTML to Markdown, preserving:
  - Text
  - Code blocks
  - Images
  - Tables
  - Lists
- Uses realistic user agents to avoid detection
- Implements rate limiting and politeness delays to respect server resources
- Recursively crawls linked pages to gather all relevant content
- Adheres to the website's robots.txt file
- Handles authentication and authorization
- Customizable content extraction and output file structure
- Outputs content into a structured, organized folder and file system
- Node.js (v18.16.0 or later)
- npm (v9.5.1 or later)
- Clone the repository:
  git clone https://github.com/yourusername/documentationscraper.git
  cd documentationscraper
- Install dependencies:
  npm install
Customize the scraper's behavior by editing the src/config/config.ts file:
export const config = {
maxDepth: 5, // Maximum depth of links to follow
rateLimit: 2000, // Delay between requests in milliseconds
outputDir: './output', // Directory to save scraped content
proxyUrl: undefined, // URL of proxy server (optional)
};
- maxDepth: Controls how deep the scraper will navigate the link structure.
- rateLimit: Sets the delay between requests to avoid overwhelming the server.
- outputDir: Specifies where the output file will be saved.
- proxyUrl: If needed, set the proxy URL; otherwise, leave as undefined.
- Set the starting URL in src/index.ts:
  const startUrl = 'https://example.com/docs';
  Replace 'https://example.com/docs' with the URL of the documentation you wish to scrape.
- Run the scraper:
  npm start
The scraper will launch and begin crawling from the startUrl, collecting content from each page and appending it to a file named combined_documentation.md in the output directory.
The output directory will contain:
output/
└── combined_documentation.md
- combined_documentation.md: A single Markdown file containing the scraped content from all visited pages, organized with headers indicating the source page titles.
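The layout of the combined file can be pictured with a short sketch. This is not the project's code: the `ScrapedPage` shape, the `formatSection` and `combinePages` helpers, the "Source:" line, and the horizontal-rule separator are all assumptions made for illustration.

```typescript
// Hypothetical shape for one scraped page; not taken from the project source.
interface ScrapedPage {
  title: string;
  url: string;
  markdown: string;
}

// Render one page as a section headed by its title.
// Falls back to the URL when the page has no usable title.
function formatSection(page: ScrapedPage): string {
  const title = page.title.trim() || page.url;
  return `# ${title}\n\nSource: ${page.url}\n\n${page.markdown.trim()}\n`;
}

// Join all sections, separated by a horizontal rule (an assumed convention).
function combinePages(pages: ScrapedPage[]): string {
  return pages.map(formatSection).join("\n---\n\n");
}

const combined = combinePages([
  { title: "Getting Started", url: "https://example.com/docs/start", markdown: "Install the package." },
  { title: "  ", url: "https://example.com/docs/api", markdown: "## Methods\n" },
]);
console.log(combined);
```

Keeping the formatting in a pure function like this makes it easy to change the header style without touching the crawl logic.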
Modify src/utils/contentExtractor.ts to change how content is extracted and processed. By default, it:
- Removes unwanted HTML elements (e.g., scripts, styles)
- Converts HTML content to Markdown using TurndownService
Adjust src/scraper/scraper.ts to modify the core scraping process. Key aspects include:
- URL Normalization: Ensures each page is visited only once.
- Content Accumulation: Appends content to combined_documentation.md.
- Link Extraction: Gathers internal links for recursive scraping.
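As a rough illustration of URL normalization and internal-link filtering, the sketch below uses only the standard `URL` class. The function names `normalizeUrl` and `internalLinks` are hypothetical, and the rules chosen here (drop the fragment, drop a trailing slash, keep same-origin links only) are assumptions. The actual logic in src/scraper/scraper.ts may differ.

```typescript
// Normalize a URL so that trivially different spellings map to one key.
// Assumed rules: resolve against the page, drop the fragment, drop a trailing slash.
function normalizeUrl(raw: string, base?: string): string {
  const url = new URL(raw, base);
  url.hash = ""; // fragments point into the same page
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

// Keep only links on the same origin as the page they came from,
// de-duplicated after normalization.
function internalLinks(hrefs: string[], pageUrl: string): string[] {
  const origin = new URL(pageUrl).origin;
  const seen = new Set<string>();
  for (const href of hrefs) {
    let normalized: string;
    try {
      normalized = normalizeUrl(href, pageUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (new URL(normalized).origin === origin) {
      seen.add(normalized);
    }
  }
  return Array.from(seen);
}

const links = internalLinks(
  ["/docs/setup/", "setup#install", "https://other.example.org/x", "mailto:team@example.com"],
  "https://example.com/docs/intro",
);
console.log(links);
```

Because every link is normalized before it enters the set, `/docs/setup/` and `setup#install` collapse to the same entry and the page would be visited once.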
Update src/utils/markdownSaver.ts or the relevant utility file to customize the user agent string used during scraping. This can help mimic different browsers or devices.
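One possible way to vary the user agent is to cycle through a small list. The sketch below is an assumption about how that could look, not the project's implementation: `createUserAgentPicker` is a hypothetical helper and the strings are example values only.

```typescript
// Example user agent strings; illustrative values, not the project's defaults.
const exampleUserAgents: string[] = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
];

// Return a function that hands out the agents in round-robin order.
function createUserAgentPicker(agents: string[]): () => string {
  if (agents.length === 0) {
    throw new Error("at least one user agent is required");
  }
  let index = 0;
  return () => {
    const agent = agents[index];
    index = (index + 1) % agents.length;
    return agent;
  };
}

const nextUserAgent = createUserAgentPicker(exampleUserAgents);
const firstAgent = nextUserAgent();
const secondAgent = nextUserAgent();
const thirdAgent = nextUserAgent(); // wraps around to the first entry
```

The returned string would then be passed to whatever call sets the user agent for a request or page.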
To route requests through a proxy, set the proxyUrl in the configuration:
export const config = {
// ... other settings ...
proxyUrl: 'http://proxy.example.com:8080',
};
If the target website requires authentication, you'll need to modify src/scraper/scraper.ts to include login steps, such as:
- Navigating to the login page
- Entering credentials
- Waiting for the login process to complete before starting the scraping
The maxDepth parameter controls how many levels of links the scraper will follow. Set this to a higher number to scrape deeper into the site's link structure.
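To make the effect of maxDepth concrete, here is a minimal breadth-first sketch over an in-memory link graph. It assumes the start page counts as depth 0, which the configuration comments do not state, and the `LinkGraph` type and `crawlOrder` function are hypothetical stand-ins for the real page fetching.

```typescript
// Hypothetical in-memory link graph standing in for real pages:
// each key is a page, each value is the list of pages it links to.
type LinkGraph = Record<string, string[]>;

// Return the pages a breadth-first crawl would visit, following links
// at most maxDepth levels away from the start page (start page = depth 0).
function crawlOrder(graph: LinkGraph, startUrl: string, maxDepth: number): string[] {
  const visited = new Set<string>([startUrl]);
  const queue: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    order.push(url);
    if (depth >= maxDepth) continue; // do not follow links past the limit
    for (const next of graph[url] ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return order;
}

const exampleGraph: LinkGraph = {
  "/docs": ["/docs/a", "/docs/b"],
  "/docs/a": ["/docs/a/1"],
  "/docs/a/1": ["/docs/a/1/x"],
  "/docs/b": ["/docs"], // link back to the start; already visited
};

console.log(crawlOrder(exampleGraph, "/docs", 1)); // start page plus its direct links
console.log(crawlOrder(exampleGraph, "/docs", 2)); // one level deeper
```

With this graph, raising the limit from 1 to 2 adds `/docs/a/1` to the result, while `/docs/a/1/x` stays out of reach until the limit is 3.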
- Respect Robots.txt: Always check the website's robots.txt file to ensure you are allowed to scrape the content.
- Politeness Delays: Use the rateLimit setting to avoid sending too many requests in a short period.
- Legal Compliance: Ensure you have permission to scrape and use the content from the target website.
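A manual robots.txt check can be approximated in a few lines. The sketch below is deliberately simplified and is not the scraper's own logic: it reads only the Disallow rules in groups that apply to `User-agent: *`, matches them as plain path prefixes, and ignores Allow rules, wildcards, and longest-match precedence. The names `disallowedPaths` and `isPathAllowed` are hypothetical.

```typescript
// Collect Disallow values from groups that apply to every crawler ("*").
function disallowedPaths(robotsTxt: string): string[] {
  const rules: string[] = [];
  let appliesToAll = false;
  let lastWasUserAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasUserAgent) appliesToAll = false;
      if (value === "*") appliesToAll = true;
      lastWasUserAgent = true;
    } else {
      lastWasUserAgent = false;
      if (field === "disallow" && appliesToAll && value !== "") {
        rules.push(value);
      }
    }
  }
  return rules;
}

// Prefix match only; a real parser would also honor Allow and wildcards.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  return !disallowedPaths(robotsTxt).some((rule) => path.startsWith(rule));
}

const exampleRobots = [
  "# example file",
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /drafts   # unfinished pages",
  "",
  "User-agent: BadBot",
  "Disallow: /",
].join("\n");

console.log(isPathAllowed(exampleRobots, "/docs/intro"));
console.log(isPathAllowed(exampleRobots, "/private/keys"));
```

For production use, a maintained robots.txt parser is a safer choice than this prefix check.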
- No Content Extracted: If the scraper reports that no content was extracted from a page, verify that the selectors in contentExtractor.ts match the structure of the target website.
- Blocked Requests: If you encounter HTTP errors or blocks, consider adjusting the user agent or implementing proxy rotation.
- Unexpected Errors: Check the console output for error messages and stack traces to identify and fix issues.
Contributions are welcome! To contribute:
- Fork the repository
- Create a feature branch (git checkout -b feature/YourFeature)
- Commit your changes (git commit -m 'Add YourFeature')
- Push to the branch (git push origin feature/YourFeature)
- Open a pull request
This project is licensed under the MIT License. See the LICENSE file for details.
Web scraping can be subject to legal and ethical considerations. Ensure you have the right to scrape and use the content from the target website. Use this tool responsibly and comply with all applicable laws and regulations.
- Cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
- Puppeteer - Node.js library which provides a high-level API to control Chromium over the DevTools Protocol.
- Turndown - A customizable HTML to Markdown converter written in JavaScript.
If you have any questions or need assistance, please open an issue on GitHub.
Created with ❤️ by Builderby, 2024