Question: could "long-horizon tension crash tests" be a complementary workload for RULER? #103

@onestardao

Hi, and thanks for releasing RULER and the accompanying paper.
It is one of the clearest pieces of work I have seen on long-context evaluation, especially the way you separate task types instead of only testing recall.

On a different front, I have been working on something I call a long-horizon “tension crash test” for LLMs.
The idea is not just to ask a single long-context question, but to keep a model inside a sequence of high-tension problems and watch when its internal state quietly drifts or collapses.
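As a rough illustration of that loop (a minimal sketch only, not the actual WFGY procedure and not part of RULER), a harness could feed the prompts one by one in a single conversation and record the first turn where a quality score drops. The names `run_crash_test`, `ask_model`, `score_reply`, and the `threshold` value are all hypothetical placeholders, and the toy model and scorer exist only to make the example runnable.

```python
from typing import Callable, List, Optional, Tuple

Message = Tuple[str, str]  # (role, text); hypothetical message format


def run_crash_test(
    prompts: List[str],
    ask_model: Callable[[List[Message]], str],   # placeholder for a real LLM call
    score_reply: Callable[[str, str], float],    # placeholder quality metric
    threshold: float = 0.5,                      # assumed cut-off for "collapse"
) -> Tuple[List[float], Optional[int]]:
    """Feed prompts sequentially in one conversation.

    Returns the per-turn scores and the index of the first turn whose
    score falls below `threshold` (None if no turn does).
    """
    history: List[Message] = []
    scores: List[float] = []
    first_failure: Optional[int] = None
    for turn, prompt in enumerate(prompts):
        history.append(("user", prompt))
        reply = ask_model(history)          # model sees the full history so far
        history.append(("assistant", reply))
        score = score_reply(prompt, reply)
        scores.append(score)
        if first_failure is None and score < threshold:
            first_failure = turn
    return scores, first_failure


# Toy stand-ins: this fake model stays on topic for three turns, then drifts.
def toy_model(history: List[Message]) -> str:
    user_turns = sum(1 for role, _ in history if role == "user")
    return "on-topic answer" if user_turns <= 3 else "unrelated rambling"


def toy_score(prompt: str, reply: str) -> float:
    return 1.0 if reply.startswith("on-topic") else 0.0


scores, first_failure = run_crash_test(
    [f"Q{i}" for i in range(5)], toy_model, toy_score
)
print(scores, first_failure)
```

In a real run, `ask_model` would wrap whatever chat API is being tested, and the hard part is choosing a `score_reply` that can be applied consistently across mixed domains.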

Concretely, I maintain an open TXT pack called “WFGY 3.0 · Singularity Demo (BlackHole-131)”.
It is a plain-text universe of 131 “S-class” questions across topics such as alignment, extreme physics, and long-horizon decision making.
Any LLM that supports file input can read it and be stress-tested on high-tension reasoning, entirely from text. The pack contains no hidden code and makes no external calls, and the repo is MIT-licensed with ~1.4k stars.

My question for you is very simple:

From the perspective of RULER, do you see this kind of long-horizon, high-tension TXT pack as:

1. potentially useful as an external workload to complement RULER’s synthetic tasks,
2. clearly out of scope for RULER (for example because it mixes multiple domains and is hard to score), or
3. something that would only make sense if it were formalized as a proper RULER task category?

I am not asking you to adopt anything.
I mainly want to understand, from your design point of view, whether a “semantic tension / long-horizon crash test” dimension belongs in the same family of evaluations as RULER, or if you consider it a different kind of object entirely.

If this is even slightly interesting, I am happy to share a very small subset of the TXT plus example traces, so you can see concretely what happens to different models after 30–50 S-class prompts in a row.

Thanks for any guidance, and also thanks again for making RULER public.
