Local inference setup - recommendations #2199

zviratko · 2026-06-15T07:26:35Z

Hello,
I am running Hermes Agent + Hindsight (local-external via containers) on M5 Max 128GB... and I'm running out of memory :)
TBH I don't really understand the architecture and how Hindsight works, at least not yet.

My LLM runtime is oMLX, my main LLM of choice is Qwen3.6-35B-A3B.

oMLX is running:
bge-m3-mlx-fp16 for embeddings (~1GB loaded)
bge-reranker-v2-m3 as reranker (~2GB loaded)
Qwen3.6-35B-A3B-oQ8-mtp (~38GB loaded) or Qwen3.6-35B-A3B-PARO (just 20GB, but the jury is still out on fidelity - it looks unrealistically smart for how packed it is, also it loads very slowly into oMLX)
gemma-4-E4B-it-oQ8 (~8GB in memory) - LLM for hindsight.

The rest of the oMLX memory budget (~80GB) is for caching and prefill, and this is the main issue.

When retain and consolidation tasks kick in, everything slows down to a crawl. I expected both to just work on the conversation contents over a number of overlapping turns (usually a few thousand tokens), which should be blazingly fast. In practice it usually isn't that fast when Qwen is also thinking (and there is some weird regression under memory pressure), but for retain that is usually fine.

Problem: But then (during the consolidation phase?) I see 100K or even 200K+ context length requests going to the Hindsight LLM, and I'm not sure whether something is wrong with my setup or whether that is expected.
This gets much worse if I momentarily switch my agent to a cloud/frontier model with a much larger context length - not sure if the compressor is leaving too many tokens in compared to local Qwen, or if this workload just produces more tokens overall (likely). And it gets even worse if it kicks off automatic skill creation or a similar maintenance task that also requires large contexts (but this I can tune at least.. somewhat). Luckily I'm not a coder, but any heavier scripting job with lots of tool calls and iterations makes the hindsight maintenance workload explode.

In either case, the result is that I'm waiting for Hindsight to finish retain and consolidation before I can query Qwen again, simply because I start running out of memory (oMLX has ~80GB budget). These tasks run for longer on the wall clock than the actual work took. Also, it keeps the MacBook pretty much tied to the wall socket when left in high power mode...

I tried switching to gemma4-E2B-it (which is actually mentioned/recommended/default in the docs) and I found it a bit lacking when some of the work I do is in Czech, and it sometimes doesn't produce valid JSON (but I believe I now know why that is, see below). I tried LFM2.5 - extremely fast, but it also frequently fails to produce valid JSON and it also has a 128K context length limit.

Both of these models (E2B and E4B) are "officially" limited to a 128K context window, which doesn't seem to be enough for hindsight, so it ends up finishing early, breaking JSON, triggering the same or a similar query and failing again. Sometimes it then magically finishes (is there some logic in hindsight for that?) or I momentarily route it to Qwen, which handles it.
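To tell the "finished early" case apart from plain malformed JSON, a check along these lines may help. It is only a sketch: it assumes the runtime returns an OpenAI-style chat completion body and sets `finish_reason` to `"length"` when generation is cut off, which I have not verified for oMLX. `classify_completion` is a hypothetical helper, not part of Hindsight or oMLX.

```python
import json


def classify_completion(response: dict) -> str:
    """Classify an OpenAI-style chat completion as 'ok', 'truncated' or 'invalid_json'.

    Assumption: the server reports finish_reason == "length" when the output
    was cut off by a token or context limit.
    """
    choice = response["choices"][0]
    text = choice["message"].get("content") or ""

    # A cut-off response is reported separately, because retrying the same
    # oversized request is unlikely to help.
    if choice.get("finish_reason") == "length":
        return "truncated"

    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "invalid_json"
    return "ok"
```

A "truncated" result points at the context or output limit, while "invalid_json" points at the model itself, so the two call for different fixes.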

I found out that setting a higher max context size in oMLX than what is officially supported in the model cards allows me to use ~280K context with these models, but I have no idea what the implications of doing this are. I asked my agent to devise a needle-in-a-haystack test and do a few other validations, and the models usually score 100% up to 280K context, sometimes they can go up to 360K, but the wall clock time still isn't great.
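For anyone who wants to repeat that kind of check, here is a minimal needle-in-a-haystack sketch. It is not the exact test my agent wrote. The filler text, the prompt wording and all function names are made up for illustration, and `ask_model` is whatever callable sends a prompt to your runtime and returns the reply text.

```python
import random
from typing import Callable, Sequence

FILLER = "The quick brown fox jumps over the lazy dog. "


def build_haystack(needle: str, total_sentences: int, depth: float) -> str:
    """Insert `needle` into filler text at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [FILLER] * total_sentences
    sentences.insert(round(depth * total_sentences), needle + " ")
    return "".join(sentences)


def run_trial(ask_model: Callable[[str], str], secret: str,
              total_sentences: int, depth: float) -> bool:
    """Return True if the model's reply contains the hidden secret."""
    needle = f"The secret code is {secret}."
    prompt = (build_haystack(needle, total_sentences, depth)
              + "\n\nWhat is the secret code? Answer with the code only.")
    return secret in ask_model(prompt)


def score(ask_model: Callable[[str], str], total_sentences: int,
          depths: Sequence[float], seed: int = 0) -> float:
    """Fraction of depths at which the needle was retrieved."""
    rng = random.Random(seed)
    hits = 0
    for depth in depths:
        secret = str(rng.randint(100000, 999999))  # fresh secret per trial
        hits += run_trial(ask_model, secret, total_sentences, depth)
    return hits / len(depths)
```

Raising `total_sentences` until the score drops gives a rough idea of the usable context length. Note that this only measures retrieval, not whether the model still reasons well over a context that long.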

I am looking for recommendations on what to do - can consolidation be broken into smaller chunks? Every tunable I found is related to retain, but consolidation runs over multiple memories/facts/whatever and it only seems to grow over time.

Please help :-)

I'd be more than happy to distill this into some sort of recommendation for local-first setup in the end, the documentation is rather sparse in this regard.
(btw storage used by hindsight increases by ~170MB/week which is fine by itself but might be a bit too much with my light Hermes use, this also makes me think hindsight is ingesting too much noise?)
Thanks.

Some of my settings (mostly defaults; I got a recommendation to switch from reflect, but reflect isn't the costly part here IMO):

hermes/config.yaml

memory:
  flush_min_turns: 2
  hindsight:
    retain_every_n_turns: 6
  memory_char_limit: 2200
  memory_enabled: true
  nudge_interval: 10
  provider: hindsight
  user_char_limit: 1375
  user_profile_enabled: true
  write_approval: false

hindsight.json:

{
  "mode": "local_external",
  "apiKey": "",
  "timeout": 300,
  "idle_timeout": 0,
  "recall_prefetch_method": "reflect",
  "retain_tags": "",
  "retain_source": "",
  "retain_user_prefix": "User",
  "retain_assistant_prefix": "Assistant",
  "banks": {
    "hermes": {
      "bankId": "hermes",
      "budget": "mid",
      "enabled": true
    }
  },
  "api_url": "http://127.0.0.1:8888",
  "bank_id": "hermes",
  "recall_budget": "mid"
}

docker compose ENV:

HINDSIGHT_API_LOG_LEVEL=info
HINDSIGHT_DAEMON_IDLE_TIMEOUT=0
HINDSIGHT_API_LLM_MAX_CONCURRENT=1
HINDSIGHT_API_LLM_TIMEOUT=300
HINDSIGHT_API_CONSOLIDATION_LLM_BATCH_SIZE=1
HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=0
HINDSIGHT_IDLE_TIMEOUT=0
#HINDSIGHT_API_LLM_EXTRA_BODY=future

HINDSIGHT_API_LLM_PROVIDER=openai
HINDSIGHT_API_LLM_BASE_URL=http://host.containers.internal:11435/v1
HINDSIGHT_API_LLM_API_KEY=somethingsomething
HINDSIGHT_API_LLM_MODEL=gemma-4-E4B-it-oQ8
HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai
HINDSIGHT_API_EMBEDDINGS_OPENAI_BASE_URL=http://host.containers.internal:11435/v1
HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY=somethingsomething
HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=bge-m3-mlx-fp16
HINDSIGHT_API_EMBEDDINGS_OPENAI_BATCH_SIZE=32
HINDSIGHT_API_RERANKER_PROVIDER=cohere
HINDSIGHT_API_RERANKER_COHERE_BASE_URL=http://host.containers.internal:11435/v1/rerank
HINDSIGHT_API_RERANKER_COHERE_API_KEY=somethingsomething
HINDSIGHT_API_RERANKER_COHERE_MODEL=bge-reranker-v2-m3
HINDSIGHT_API_RERANKER_LOCAL_BUCKET_BATCHING=true
#HINDSIGHT_API_RETAIN_BATCH_ENABLED=true
HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE=30
HINDSIGHT_API_CONSOLIDATION_MAX_MEMORIES_PER_ROUND=30

starkmarkus · 2026-06-18T22:06:24Z

The 100K–200K requests sound plausible for consolidation here. Hindsight’s retain/consolidation path does its own internal recall and source-fact hydration, so it can fan out far beyond the few thousand tokens in the conversation itself.

The repo’s current recommendation is to keep that path bounded first, not to rely on a larger context window. I’d start with:

HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET=low
HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS=4096
HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA=false

If it’s still too heavy, the next knobs are HINDSIGHT_API_CONSOLIDATION_MAX_MEMORIES_PER_ROUND and HINDSIGHT_API_CONSOLIDATION_LLM_BATCH_SIZE. In practice, that means yes, you can make consolidation run in smaller rounds; I wouldn’t treat the 280K context setting as the main fix.

Also, if you’re running the full image plus local models on the same host, the deployment-footprint guide says that stack is the expensive part. Slim image + external embeddings/reranker will usually buy you more headroom than pushing the context window harder.

zviratko · 2026-06-19T04:13:15Z

Thank you!

I already have
HINDSIGHT_API_CONSOLIDATION_MAX_MEMORIES_PER_ROUND=30 (default 100, what's the quality impact of going even lower?)
HINDSIGHT_API_CONSOLIDATION_LLM_BATCH_SIZE=1

I'll set HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET=low and see if it does anything.

I'm running the slim image, all inference should happen on my local oMLX runtime. Does HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA have any impact in that case?

But even if I manage to slash it in (say) half, it would still be too heavy for most people who want to run it locally. Is there something I could ask for as an RFE that would help on either the hindsight or the hermes side? Some sort of non-LLM based filtering of the conversations before they are consolidated? Or just better batching? I see the consolidation steps creeping up in size as it progresses, can we somehow slash that?

I am thinking of knobs in hermes that would:

trigger retain after compression (the model already has everything in context and it has to run anyway) - replacing the partial retains from turns
always ask for compression after conversation is ended - same effect
filter the output to not contain the results from MCP servers, tool calls - I see facts fetched from MCP in my memories, that's completely superfluous

Or can I perhaps tune the bank missions to retain fewer facts in the first place, caveman style maybe? Every token helps here. ATM memories are very human readable and rich - that's not needed at all I think.

starkmarkus Jun 19, 2026

Lowering HINDSIGHT_API_CONSOLIDATION_MAX_MEMORIES_PER_ROUND further is a reasonable way to cut peak memory, but it is a throughput trade-off, not a free win. The repo guidance is to widen that knob only when low recall is clearly missing useful related observations.

HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA=false can still matter on a slim setup if FlashRank is actually in the loop. If your reranker path is active, keeping it off helps avoid RSS that lingers after consolidation work.

For an RFE, I would ask for less noise entering retain and consolidation in the first place. The best candidates are compressing before retain so partial turn retains get replaced by the compressed summary, running a final retain after the conversation ends, and filtering out tool or MCP output before it reaches memory unless it is genuinely useful later.

That is the bigger lever. If you reduce what gets retained, consolidation has less to fan out over and the whole pipeline gets cheaper, which matches the repo guidance to keep the expensive path narrow first.
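
As a rough illustration of the tool/MCP filtering idea, here is a minimal sketch. It assumes the transcript is available as OpenAI-style chat messages (`role`, `content`, optional `tool_calls`); whether Hermes exposes the conversation in that shape before retain is an assumption, and `strip_tool_output` is a hypothetical name, not an existing Hermes or Hindsight function.

```python
def strip_tool_output(messages: list[dict]) -> list[dict]:
    """Drop tool results and tool-call-only assistant turns from a transcript.

    Assumes OpenAI-style message dicts; keeps only role and text content.
    """
    kept = []
    for msg in messages:
        role = msg.get("role")
        content = msg.get("content") or ""

        # Tool / MCP results are the noisy part: skip them entirely.
        if role == "tool":
            continue
        # Assistant turns that only request a tool call carry no text worth retaining.
        if role == "assistant" and msg.get("tool_calls") and not content:
            continue

        kept.append({"role": role, "content": content})
    return kept
```

This drops everything from tools unconditionally. A real implementation would probably need an allow-list for tool output that is genuinely useful later, as noted above.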
