Usage Instructions for Documentation Scraper

 ____                                        _        _   _             
|  _ \  ___   ___ _   _ _ __ ___   ___ _ __ | |_ __ _| |_(_) ___  _ __  
| | | |/ _ \ / __| | | | '_ ` _ \ / _ \ '_ \| __/ _` | __| |/ _ \| '_ \ 
| |_| | (_) | (__| |_| | | | | | |  __/ | | | || (_| | |_| | (_) | | | |
|____/ \___/ \___|\__,_|_| |_| |_|\___|_| |_|\__\__,_|\__|_|\___/|_| |_|
 ____                                
/ ___|  ___ _ __ __ _ _ __   ___ _ __ 
\___ \ / __| '__/ _` | '_ \ / _ \ '__|
 ___) | (__| | | (_| | |_) |  __/ |   
|____/ \___|_|  \__,_| .__/ \___|_|   
                     |_|

An advanced web scraping application for documentation websites, built with TypeScript and Node.js.

Overview

The Documentation Scraper is designed to crawl documentation websites and extract content into a single consolidated Markdown file. This tool is especially useful for creating local copies of documentation for offline access or for processing with Large Language Models (LLMs) and AI Agents.

Features

Crawls documentation websites starting from a specified URL
Consolidates all scraped content into a single Markdown file
Converts HTML to Markdown, preserving:
- Text
- Code blocks
- Images
- Tables
- Lists
Uses realistic user agents to avoid detection
Implements rate limiting and politeness delays to respect server resources
Recursively crawls linked pages to gather all relevant content
Adheres to the website's robots.txt file
Handles authentication and authorization
Customizable content extraction and output file structure
Outputs content into a structured, organized folder and file system

Prerequisites

Node.js (v18.16.0 or later)
npm (v9.5.1 or later)

Installation

Clone the repository:

git clone https://github.com/yourusername/documentationscraper.git
cd documentationscraper

Install dependencies:
```
npm install
```

Configuration

Customize the scraper's behavior by editing the src/config/config.ts file:

export const config = {
  maxDepth: 5, // Maximum depth of links to follow
  rateLimit: 2000, // Delay between requests in milliseconds
  outputDir: './output', // Directory to save scraped content
  proxyUrl: undefined, // URL of proxy server (optional)
};

maxDepth: Controls how deep the scraper will navigate the link structure.
rateLimit: Sets the delay between requests to avoid overwhelming the server.
outputDir: Specifies where the output file will be saved.
proxyUrl: If needed, set the proxy URL; otherwise, leave as undefined.

Usage

Set the starting URL in src/index.ts:
```
const startUrl = 'https://example.com/docs';
```
Replace 'https://example.com/docs' with the URL of the documentation you wish to scrape.
Run the scraper:
```
npm start
```
The scraper will launch and begin crawling from the startUrl, collecting content from each page and appending it to a file named combined_documentation.md in the output directory.

Output

The output directory will contain:

output/
└── combined_documentation.md

combined_documentation.md: A single Markdown file containing the scraped content from all visited pages, organized with headers indicating the source page titles.

Customization

Content Extraction

Modify src/utils/contentExtractor.ts to change how content is extracted and processed. By default, it:

Removes unwanted HTML elements (e.g., scripts, styles)
Converts HTML content to Markdown using TurndownService

Scraping Logic

Adjust src/scraper/scraper.ts to modify the core scraping process. Key aspects include:

URL Normalization: Ensures each page is visited only once.
Content Accumulation: Appends content to combined_documentation.md.
Link Extraction: Gathers internal links for recursive scraping.

User Agent Configuration

Update src/utils/markdownSaver.ts or the relevant utility file to customize the user agent string used during scraping. This can help mimic different browsers or devices.

Advanced Usage

Using a Proxy

To route requests through a proxy, set the proxyUrl in the configuration:

export const config = {
  // ... other settings ...
  proxyUrl: 'http://proxy.example.com:8080',
};

Handling Authentication

If the target website requires authentication, you'll need to modify src/scraper/scraper.ts to include login steps, such as:

Navigating to the login page
Entering credentials
Waiting for the login process to complete before starting the scraping

Adjusting Depth of Scraping

The maxDepth parameter controls how many levels of links the scraper will follow. Set this to a higher number to scrape deeper into the site's link structure.

Best Practices

Respect Robots.txt: Always check the website's robots.txt file to ensure you are allowed to scrape the content.
Politeness Delays: Use the rateLimit setting to avoid sending too many requests in a short period.
Legal Compliance: Ensure you have permission to scrape and use the content from the target website.

Troubleshooting

No Content Extracted: If the scraper reports that no content was extracted from a page, verify that the selectors in contentExtractor.ts match the structure of the target website.
Blocked Requests: If you encounter HTTP errors or blocks, consider adjusting the user agent or implementing proxy rotation.
Unexpected Errors: Check the console output for error messages and stack traces to identify and fix issues.

Contributing

Contributions are welcome! To contribute:

Fork the repository
Create a feature branch (git checkout -b feature/YourFeature)
Commit your changes (git commit -m 'Add YourFeature')
Push to the branch (git push origin feature/YourFeature)
Open a pull request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Disclaimer

Web scraping can be subject to legal and ethical considerations. Ensure you have the right to scrape and use the content from the target website. Use this tool responsibly and comply with all applicable laws and regulations.

Acknowledgments

Cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
Puppeteer - Node.js library which provides a high-level API to control Chromium over the DevTools Protocol.
Turndown - A customizable HTML to Markdown converter written in JavaScript.

Contact

If you have any questions or need assistance, please open an issue on GitHub.

Created with ❤️ by Builderby, 2024

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Usage Instructions for Documentation Scraper

Overview

Features

Prerequisites

Installation

Configuration

Usage

Output

Customization

Content Extraction

Scraping Logic

User Agent Configuration

Advanced Usage

Using a Proxy

Handling Authentication

Adjusting Depth of Scraping

Best Practices

Troubleshooting

Contributing

License

Disclaimer

Acknowledgments

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Usage Instructions for Documentation Scraper

Overview

Features

Prerequisites

Installation

Configuration

Usage

Output

Customization

Content Extraction

Scraping Logic

User Agent Configuration

Advanced Usage

Using a Proxy

Handling Authentication

Adjusting Depth of Scraping

Best Practices

Troubleshooting

Contributing

License

Disclaimer

Acknowledgments

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages