I am trying to run SmartScraperGraph() using ollama with llama3.2 model but i am getting this warning that "Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024)." and the whole website is not being scraped. #853

Closed
GODCREATOR333 opened this issue Dec 27, 2024 · 3 comments

Comments

@GODCREATOR333

import json
from scrapegraphai.graphs import SmartScraperGraph
from ollama import Client

# Created here but never passed to the graph below
ollama_client = Client(host='http://localhost:11434')

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0.0,
        "format": "json",
        "model_tokens": 4096,
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "nomic-embed-text",
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me all the news from the website along with headlines",
    source="https://www.bbc.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

Output:

from langchain_community.callbacks.manager import get_openai_callback
You can use the langchain cli to automatically upgrade many imports. Please see documentation here https://python.langchain.com/docs/versions/v0_2/
from langchain.callbacks import get_openai_callback
Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024). Running this sequence through the model will result in indexing errors
{
    "headlines": [
        "Life is not easy - Haaland penalty miss sums up Man City crisis",
        "How a 1990s Swan Lake changed dance forever"
    ],
    "articles": [
        {
            "title": "BBC News",
            "url": "https://www.bbc.com/news/world-europe-63711133"
        },
        {
            "title": "Matthew Bourne on his male Swan Lake - the show that shook up the dance world",
            "url": "https://www.bbc.com/culture/article/20241126-matthew-bourne-on-his-male-swan-lake-the-show-that-shook-up-the-dance-world-forever"
        }
    ]
}

Even after setting "model_tokens": 4096, the maximum sequence length stays at 1024. How can I increase it? How can I chunk the website into pieces no longer than the maximum sequence length so that I can scrape the whole site?
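
As a rough illustration of the chunking idea asked about here, the sketch below splits a page's text into windows that each stay under a token budget. It is not how ScrapeGraphAI chunks internally: the names `count_tokens_approx` and `chunk_text` are hypothetical, and whitespace-separated words are only an approximation of model tokens, so a real tokenizer would need to replace the counting function.

```python
from typing import List


def count_tokens_approx(text: str) -> int:
    """Rough stand-in for a tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def chunk_text(text: str, max_tokens: int, overlap: int = 0) -> List[str]:
    """Split text into word windows of at most max_tokens words each.

    Consecutive windows share `overlap` words so that content cut at a
    boundary still appears whole in one of the chunks.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    words = text.split()
    step = max_tokens - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once a window has reached the end of the text.
        if start + max_tokens >= len(words):
            break
    return chunks
```

With the numbers from the warning above, a 7678-word text and a budget of 1024 would give eight chunks, each of which could be sent to the model separately and the partial results merged afterwards.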

PS: An option to crawl the links further and scrape the pages they lead to would also be great. Thanks.
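
One possible way to approach the crawling part by hand is to collect the links from a fetched page and scrape each one in turn. The sketch below uses only the Python standard library; `LinkCollector` and `extract_links` are hypothetical names, and restricting the result to links on the same host is an assumption made to keep the crawl bounded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links on the same host as base_url."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        parsed = urlparse(absolute)
        same_host = parsed.netloc == self.host
        if parsed.scheme in ("http", "https") and same_host and absolute not in self.links:
            self.links.append(absolute)


def extract_links(html: str, base_url: str) -> list:
    """Return de-duplicated same-host links in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Each returned URL could then be passed as `source` to a separate `SmartScraperGraph`, in the same way as the single-page call above.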

Ubuntu 22.04 LTS
GPU : RTX 4070 12GB VRAM
RAM : 16GB DDR5

Ollama/Llama3.2:3B model

@VinciGit00
Collaborator

can you try with openai please?

@Qunlexie

Qunlexie commented Jan 2, 2025

I am having the same issue. I don't think switching to OpenAI is a good resolution. In my experience OpenAI may give better results, given that it is a proprietary model, but it would be good to get this working with open-source local Llama models. I would greatly appreciate your help on this as well.

> can you try with openai please?

github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.35.0-beta.4](v1.35.0-beta.3...v1.35.0-beta.4) (2025-01-06)

### Features

* ⏰added graph timeout and fixed model_tokens param ([#810](#810) [#856](#856) [#853](#853)) ([01a331a](01a331a))
github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.35.0](v1.34.2...v1.35.0) (2025-01-06)

### Features

* ⏰added graph timeout and fixed model_tokens param ([#810](#810) [#856](#856) [#853](#853)) ([01a331a](01a331a))
* ⛏️ enhanced contribution and precommit added ([fcbfe78](fcbfe78))
* add codequality workflow ([4380afb](4380afb))
* add timeout and retry_limit in loader_kwargs ([#865](#865) [#831](#831)) ([21147c4](21147c4))
* serper api search ([1c0141f](1c0141f))

### Bug Fixes

* browserbase integration ([752a885](752a885))
* local html handling ([2a15581](2a15581))

### CI

* **release:** 1.34.2-beta.1 [skip ci] ([f383e72](f383e72)), closes [#861](#861) [#861](#861)
* **release:** 1.34.2-beta.2 [skip ci] ([93fd9d2](93fd9d2))
* **release:** 1.34.3-beta.1 [skip ci] ([013a196](013a196)), closes [#861](#861) [#861](#861)
* **release:** 1.35.0-beta.1 [skip ci] ([c5630ce](c5630ce)), closes [#865](#865) [#831](#831)
* **release:** 1.35.0-beta.2 [skip ci] ([f21c586](f21c586))
* **release:** 1.35.0-beta.3 [skip ci] ([cb54d5b](cb54d5b))
* **release:** 1.35.0-beta.4 [skip ci] ([6e375f5](6e375f5)), closes [#810](#810) [#856](#856) [#853](#853)
@PeriniM
Collaborator

PeriniM commented Jan 6, 2025

Closing in favor of #856
