
Commit 1557332

Resolve merge conflicts: integrate schema generation and scheduled jobs features
- Merged AI-Powered Schema Generation section with Available Functions section in JS README
- Combined schema generation exports with scheduled jobs exports in JS index.js
- Integrated schema generation methods with scheduled jobs methods in Python async_client.py and client.py
- Moved generate schema example files to correct smartscraper directory structure
- All conflicts resolved successfully
2 parents: c260512 + 43b8250

150 files changed: +13940 −836 lines


README.md

Lines changed: 23 additions & 0 deletions
@@ -15,6 +15,11 @@ Get your [API key](https://scrapegraphai.com)!
- 🔍 **SearchScraper**: AI-powered web search with structured results and reference URLs
- 📝 **Markdownify**: Convert any webpage into clean, formatted markdown
- 🕷️ **SmartCrawler**: Intelligently crawl and extract data from multiple pages
- 🤖 **AgenticScraper**: Perform automated browser actions with AI-powered session management
- 📄 **Scrape**: Convert webpages to HTML with JavaScript rendering and custom headers
- ⏰ **Scheduled Jobs**: Create and manage automated scraping workflows with cron scheduling
- 💳 **Credits Management**: Monitor API usage and credit balance
- 💬 **Feedback System**: Provide ratings and feedback to improve service quality

## 🚀 Quick Links

ScrapeGraphAI offers seamless integration with popular frameworks and tools to enhance your scraping capabilities. Whether you're building with Python or Node.js, using LLM frameworks, or working with no-code platforms, we've got you covered with our comprehensive integration options.
@@ -60,6 +65,24 @@ Perform AI-powered web searches with structured results and reference URLs.
### 📝 Markdownify

Convert any webpage into clean, formatted markdown.

### 🕷️ SmartCrawler

Intelligently crawl and extract data from multiple pages with configurable depth and batch processing.

### 🤖 AgenticScraper

Perform automated browser actions on webpages using AI-powered agentic scraping with session management.

### 📄 Scrape

Convert webpages into HTML format with optional JavaScript rendering and custom headers.

### ⏰ Scheduled Jobs

Create, manage, and monitor scheduled scraping jobs with cron expressions and execution history.

### 💳 Credits

Check your API credit balance and usage.

### 💬 Feedback

Send feedback and ratings for scraping requests to help improve the service.

## 🌟 Key Benefits

- 📝 **Natural Language Queries**: No complex selectors or XPath needed

scrapegraph-js/README.md

Lines changed: 207 additions & 1 deletion
@@ -56,6 +56,91 @@ const prompt = 'What does the company do?';
## 🎯 Examples

### Scrape - Get HTML Content

#### Basic Scrape

```javascript
import { scrape } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com';

(async () => {
  try {
    const response = await scrape(apiKey, url);
    console.log('HTML content:', response.html);
    console.log('Status:', response.status);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```

#### Scrape with Heavy JavaScript Rendering

```javascript
import { scrape } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com';

(async () => {
  try {
    const response = await scrape(apiKey, url, {
      renderHeavyJs: true
    });
    console.log('HTML content with JS rendering:', response.html);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```

#### Scrape with Custom Headers

```javascript
import { scrape } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com';

(async () => {
  try {
    const response = await scrape(apiKey, url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Cookie': 'session=123'
      }
    });
    console.log('HTML content with custom headers:', response.html);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```

#### Get Scrape Request Status

```javascript
import { getScrapeRequest } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const requestId = 'your-request-id';

(async () => {
  try {
    const response = await getScrapeRequest(apiKey, requestId);
    console.log('Request status:', response.status);
    if (response.status === 'completed') {
      console.log('HTML content:', response.html);
    }
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
### Scraping Websites

#### Basic Scraping

@@ -279,6 +364,7 @@ const schema = {

```javascript
  depth: 2,
  maxPages: 2,
  sameDomainOnly: true,
  sitemap: true, // Use sitemap for better page discovery
  batchSize: 1,
});
console.log('Crawl job started. Response:', crawlResponse);
```

@@ -308,7 +394,13 @@

```javascript
})();
```

You can use a plain JSON schema or a [Zod](https://www.npmjs.com/package/zod) schema for the `schema` parameter. The crawl API supports options for crawl depth, max pages, domain restriction, sitemap discovery, and batch size.

**Sitemap Benefits:**
- Better page discovery using sitemap.xml
- More comprehensive website coverage
- Efficient crawling of structured websites
- Perfect for e-commerce, news sites, and content-heavy websites

### Scraping local HTML

@@ -498,6 +590,120 @@ const requestId = '123e4567-e89b-12d3-a456-426614174000';
```javascript
})();
```

## 🔧 Available Functions

### Scrape

#### `scrape(apiKey, url, options)`

Converts a webpage into HTML format with optional JavaScript rendering.

**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `url` (string): The URL of the webpage to convert
- `options` (object, optional): Configuration options
  - `renderHeavyJs` (boolean, optional): Whether to render heavy JavaScript (default: false)
  - `headers` (object, optional): Custom headers to send with the request

**Returns:** Promise that resolves to an object containing:
- `html`: The HTML content of the webpage
- `status`: Request status ('completed', 'processing', 'failed')
- `scrape_request_id`: Unique identifier for the request
- `error`: Error message if the request failed

**Example:**
```javascript
const response = await scrape(apiKey, 'https://example.com', {
  renderHeavyJs: true,
  headers: { 'User-Agent': 'Custom Agent' }
});
```

#### `getScrapeRequest(apiKey, requestId)`

Retrieves the status or result of a previous scrape request.

**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `requestId` (string): The unique identifier for the scrape request

**Returns:** Promise that resolves to the request result object.

**Example:**
```javascript
const result = await getScrapeRequest(apiKey, 'request-id-here');
```

### Smart Scraper

#### `smartScraper(apiKey, url, prompt, schema, numberOfScrolls, totalPages, cookies)`

Extracts structured data from websites using AI-powered scraping.

**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `url` (string): The URL of the website to scrape
- `prompt` (string): Natural language prompt describing what to extract
- `schema` (object, optional): Zod schema for structured output
- `numberOfScrolls` (number, optional): Number of scrolls for infinite scroll pages
- `totalPages` (number, optional): Number of pages to scrape
- `cookies` (object, optional): Cookies for authentication
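
**Example** (a minimal sketch following the signature above; the URL, prompt, and Zod schema are illustrative placeholders, not values from the library's docs):
```javascript
import { smartScraper } from 'scrapegraph-js';
import { z } from 'zod';

const apiKey = 'your-api-key';

// Illustrative schema; define whatever fields you need.
const schema = z.object({
  title: z.string().describe('The page title'),
  description: z.string().describe('The page description'),
});

const response = await smartScraper(
  apiKey,
  'https://example.com',
  'Extract the page title and description',
  schema
);
console.log(response);
```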

### Search Scraper

#### `searchScraper(apiKey, prompt, url, numResults, headers, outputSchema)`

Searches and extracts information from multiple web sources using AI.
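
**Example** (a hedged sketch: only the API key and prompt are passed, on the assumption that the trailing parameters in the signature above are optional):
```javascript
import { searchScraper } from 'scrapegraph-js';

const apiKey = 'your-api-key';

// url, numResults, headers, and outputSchema are omitted here.
const response = await searchScraper(
  apiKey,
  'What are the key features of ScrapeGraphAI?'
);
console.log(response);
```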
### Crawl API

#### `crawl(apiKey, url, prompt, dataSchema, options)`

Starts a crawl job to extract structured data from a website and its linked pages.

**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `url` (string): The starting URL for the crawl
- `prompt` (string): AI prompt to guide data extraction (required for AI mode)
- `dataSchema` (object): JSON schema defining extracted data structure (required for AI mode)
- `options` (object): Optional crawl parameters
  - `extractionMode` (boolean, default: true): true for AI extraction, false for markdown conversion
  - `cacheWebsite` (boolean, default: true): Whether to cache website content
  - `depth` (number, default: 2): Maximum crawl depth (1-10)
  - `maxPages` (number, default: 2): Maximum pages to crawl (1-100)
  - `sameDomainOnly` (boolean, default: true): Only crawl pages from the same domain
  - `sitemap` (boolean, default: false): Use sitemap.xml for better page discovery
  - `batchSize` (number, default: 1): Batch size for processing pages (1-10)
  - `renderHeavyJs` (boolean, default: false): Whether to render heavy JavaScript

**Sitemap Benefits:**
- Better page discovery using sitemap.xml
- More comprehensive website coverage
- Efficient crawling of structured websites
- Perfect for e-commerce, news sites, and content-heavy websites
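
**Example** (a sketch assembled from the parameters above and the crawl snippet earlier in this README; the prompt and JSON schema are illustrative):
```javascript
import { crawl } from 'scrapegraph-js';

const apiKey = 'your-api-key';

// Illustrative JSON schema describing the data to extract.
const dataSchema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    summary: { type: 'string' },
  },
};

const crawlResponse = await crawl(
  apiKey,
  'https://example.com',
  'Extract the title and a short summary from each page',
  dataSchema,
  {
    depth: 2,
    maxPages: 2,
    sameDomainOnly: true,
    sitemap: true, // use sitemap.xml for page discovery
    batchSize: 1,
  }
);
console.log('Crawl job started. Response:', crawlResponse);
```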
### Markdownify

#### `markdownify(apiKey, url, headers)`

Converts a webpage into clean, well-structured markdown format.
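
**Example** (a minimal sketch; the optional `headers` argument is omitted, and the response is logged whole since its exact shape isn't documented in this section):
```javascript
import { markdownify } from 'scrapegraph-js';

const apiKey = 'your-api-key';

const response = await markdownify(apiKey, 'https://example.com');
console.log(response);
```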
### Agentic Scraper

#### `agenticScraper(apiKey, url, steps, useSession, userPrompt, outputSchema, aiExtraction)`

Performs automated actions on webpages using step-by-step instructions.
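
**Example** (a hedged sketch: the signature is from above, but the assumption that `steps` is an array of natural-language action strings comes from the feature description, not from a documented example):
```javascript
import { agenticScraper } from 'scrapegraph-js';

const apiKey = 'your-api-key';

// Assumed step format: natural-language browser actions.
const steps = [
  'Type user@example.com in the email field',
  'Click the login button',
];

// useSession = true keeps the browser session alive across steps.
const response = await agenticScraper(apiKey, 'https://example.com/login', steps, true);
console.log(response);
```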
### Utility Functions

#### `getCredits(apiKey)`

Retrieves your current credit balance and usage statistics.
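
**Example:**
```javascript
import { getCredits } from 'scrapegraph-js';

const credits = await getCredits('your-api-key');
console.log('Credits:', credits);
```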
#### `sendFeedback(apiKey, requestId, rating, feedbackText)`

Submits feedback for a specific request.
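
**Example** (a sketch assuming a numeric rating, e.g. 1-5; the exact scale isn't stated in this section):
```javascript
import { sendFeedback } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const requestId = '123e4567-e89b-12d3-a456-426614174000';

// Assumes rating is numeric with an optional free-text comment.
const response = await sendFeedback(apiKey, requestId, 5, 'Accurate extraction, fast response.');
console.log(response);
```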
## 📚 Documentation

For detailed documentation, visit [docs.scrapegraphai.com](https://docs.scrapegraphai.com)
