Replies: 2 comments 1 reply
The 100K–200K requests sound plausible for consolidation here. Hindsight’s retain/consolidation path does its own internal recall and source-fact hydration, so it can fan out far beyond the few thousand tokens in the conversation itself. The repo’s current recommendation is to keep that path bounded first, not to rely on a larger context window. I’d start with:
If it’s still too heavy, the next knobs are
Also, if you’re running the full image plus local models on the same host, the deployment-footprint guide says that stack is the expensive part. Slim image + external embeddings/reranker will usually buy you more headroom than pushing the context window harder.
Thank you! I already have
I'll set HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET=low and see if it does anything. I'm running the slim image; all inference should happen on my local oMLX runtime. Does HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA have any impact in that case? But even if I manage to slash it in (say) half, it would still be too heavy for most people who want to run it locally. Is there something I could ask for as an RFE that would help on either the hindsight or the hermes side? Some sort of non-LLM-based filtering of the conversations before they are consolidated? Or just better batching? I see the consolidation steps creeping up in size as it progresses; can we somehow slash that? I am thinking of knobs in hermes that would
Or can I perhaps tune the bank missions to retain fewer facts in the first place, caveman style maybe? Every token helps here. ATM memories are very human-readable and rich, which isn't needed at all, I think.
Hello,
I am running Hermes Agent + Hindsight (local-external via containers) on M5 Max 128GB... and I'm running out of memory :)
TBH I don't really understand the architecture and how Hindsight works, at least not yet.
My LLM runtime is oMLX, my main LLM of choice is Qwen3.6-35B-A3B.
oMLX is running:
bge-m3-mlx-fp16 for embeddings (~1GB loaded)
bge-reranker-v2-m3 as reranker (~2GB loaded)
Qwen3.6-35B-A3B-oQ8-mtp (~38GB loaded) or Qwen3.6-35B-A3B-PARO (just 20GB, but the jury is still out on fidelity: it looks unrealistically smart for how packed it is, and it also loads very slowly into oMLX)
gemma-4-E4B-it-oQ8 (~8GB in memory) - LLM for hindsight.
Rest of the oMLX memory budget (~80GB) is for caching and prefill, and this is the main issue.
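A rough sanity check of the figures above, as a Python sketch. The numbers alone do not say whether ~80GB is oMLX's whole budget or only the cache and prefill share on top of the weights, so both readings are computed. The function names are illustrative, and the sizes are the approximate ones listed above.

```python
# Approximate resident sizes in GB, taken from the list above.
MODELS_GB = {
    "bge-m3-mlx-fp16": 1,
    "bge-reranker-v2-m3": 2,
    "Qwen3.6-35B-A3B-oQ8-mtp": 38,
    "gemma-4-E4B-it-oQ8": 8,
}


def cache_room_gb(total_budget_gb, models):
    """Reading A: the budget covers weights too; return what is left for cache."""
    return total_budget_gb - sum(models.values())


def total_footprint_gb(cache_gb, models):
    """Reading B: the budget is cache only; return weights plus cache."""
    return cache_gb + sum(models.values())


print(sum(MODELS_GB.values()))            # 49 GB of weights
print(cache_room_gb(80, MODELS_GB))       # 31 GB left for cache and prefill
print(total_footprint_gb(80, MODELS_GB))  # 129 GB, above the 128GB of RAM
```

Under either reading there is little slack: a single 200K-token request has to fit in roughly 31GB of cache, or the whole stack already sits at the machine's physical limit.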
When retain and consolidation tasks kick in, everything slows down to a crawl. I expected both to just work on the conversation contents over a number of overlapping turns (usually a few thousand tokens), which should be blazingly fast. Practice showed that it usually isn't that fast when Qwen is also thinking (and there is some weird regression on memory pressure), but for retain that is usually fine.
Problem: But then (during consolidation phase?) I see 100K or even 200K+ context length requests going to the Hindsight LLM and I'm not exactly sure if something is wrong with my setup or if that is expected.
This is made much worse if I switch my agent momentarily to a cloud/frontier model with a much larger context length. I'm not sure if the compressor is leaving too many tokens in compared to local Qwen, or if this workload just produces more tokens overall (likely). And it gets even worse if it kicks off automatic skill creation or a similar maintenance task that also requires large contexts (but this I can tune at least.. somewhat). Luckily I'm not a coder, but any heavier scripting job with lots of tool calls and iterations makes the hindsight maintenance workload explode.
In either case, the result is that I'm waiting for Hindsight to finish retain and consolidation before I can query Qwen again, simply because I start running out of memory (oMLX has ~80GB budget). These tasks run for longer than the actual work took on the wall clock. Also, it makes the MacBook pretty much tied to the wall socket when left in high power mode...
I tried switching to gemma4-E2B-it (which is actually mentioned/recommended/default in the docs) and I found it a bit lacking when some of the work I do is in Czech, and it sometimes doesn't produce valid JSON (but I believe I now know why that is, see below). I tried LFM2.5, which is extremely fast, but it also frequently fails to produce valid JSON and it also has a 128K context length limit.
Both of these models (E2B and E4B) are "officially" limited to a 128K context window, which doesn't seem to be enough for hindsight, so it ends up finishing early, breaking JSON, triggering the same or a similar query and failing again. Sometimes it then magically finishes (is there some logic in hindsight for that?) or I momentarily route it to Qwen, which handles it.
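To tell a response that was cut off at the context limit from one that is simply malformed, a bracket-balance check on the raw output can help. The following is a minimal sketch of that idea, not how Hindsight itself handles retries; `looks_truncated` and `classify` are hypothetical helper names.

```python
import json


def looks_truncated(text: str) -> bool:
    """True if the text ends inside a string or with unclosed brackets."""
    depth = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
    return in_string or depth > 0


def classify(text: str) -> str:
    """Return 'valid', 'truncated', or 'malformed' for a model response."""
    try:
        json.loads(text)
        return "valid"
    except json.JSONDecodeError:
        return "truncated" if looks_truncated(text) else "malformed"


print(classify('{"facts": ["a", "b"]}'))  # valid
print(classify('{"facts": ["a", "b'))     # truncated
print(classify('{"facts": ["a" "b"]}'))   # malformed
```

If most failures come back as "truncated", the model is running out of room rather than misunderstanding the schema, which would point at request size instead of model quality.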
I found out that setting a higher max context size in oMLX than what is officially supported in the model cards allows me to use ~280K context with these models, but I have no idea what the implications of doing this are. I asked my agent to devise a needle-in-a-haystack test and do a few other validations, and the models usually score 100% up to 280K context, sometimes up to 360K, but the wall clock time still isn't great.
I am looking for recommendations on what to do: can consolidation be broken into smaller chunks? Every tunable I found is related to retain, but consolidation runs over multiple memories/facts/whatever and it only seems to grow over time.
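For reference, a needle-in-a-haystack check of this kind can be sketched in a few lines of Python. This is an illustration of the general technique, not the exact test the agent produced: the filler text, the prompt wording, and the `ask_model` callable (any function that takes a prompt string and returns the model's answer) are all assumptions, and the length is counted in sentences rather than tokens.

```python
import random

FILLER = "The quick brown fox jumps over the lazy dog."
QUESTION = "What is the secret code? Answer with the code only."


def build_haystack(n_sentences, depth, secret):
    """Build a prompt with one needle placed at a relative depth (0.0 to 1.0)."""
    sentences = [FILLER] * n_sentences
    needle = f"The secret code is {secret}."
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences) + "\n\n" + QUESTION


def score(ask_model, n_sentences, depths, seed=0):
    """Fraction of depths at which the model's answer contains the secret."""
    rng = random.Random(seed)
    hits = 0
    for depth in depths:
        secret = str(rng.randint(100000, 999999))
        answer = ask_model(build_haystack(n_sentences, depth, secret))
        hits += secret in answer
    return hits / len(depths)
```

A passing score only shows that single-fact retrieval survives at that length. It says little about whether the model can still produce a long, well-formed JSON consolidation over the same context.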
Please help :-)
I'd be more than happy to distill this into some sort of recommendation for local-first setup in the end, the documentation is rather sparse in this regard.
(btw storage used by hindsight increases by ~170MB/week which is fine by itself but might be a bit too much with my light Hermes use, this also makes me think hindsight is ingesting too much noise?)
Thanks.
Some of my settings (mostly defaults, I got recommendation to switch from reflect, but reflect isn't the costly thing there IMO)
hermes/config.yaml
hindsight.json:
docker compose ENV: