ClueEval is a project designed to evaluate the reasoning capabilities of Large Language Models (LLMs) by challenging them to solve generated mystery stories.
⭐ If you find this tool useful, please consider starring the repository on GitHub! ⭐
ClueEval creates mystery stories intended to test deductive reasoning. The pipeline runs in five stages (a rough sketch of the resulting data shape follows the list):
- Story Generation: The system randomly generates a basic mystery, including a killer and a victim, in `story/random_details.py`. Then `story/writer.py` uses an LLM to create a unique murder mystery, giving each character their own story and perspective.
- Narrative Creation: A detailed narrative is generated, including both the true events and misleading information.
- Clue Assembly: The system compiles a set of clues, some relevant to solving the mystery and others serving as red herrings.
- Prose: The set of clues is turned into prose.
- Evaluation: Whodunnit? The clues contained in the prose should be enough to figure it out: these are fair-play mysteries.
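The real data structures live in `story/`; purely as an illustration of what the pipeline produces, a generated mystery might look something like this (all names below are hypothetical, not ClueEval's actual API):

```python
from dataclasses import dataclass, field

# Hypothetical shapes for illustration only; the actual classes are defined
# in story/ and may differ.
@dataclass
class Clue:
    text: str
    is_red_herring: bool = False  # misleading clues are mixed in deliberately

@dataclass
class Mystery:
    victim: str
    killer: str                       # ground truth, hidden from the solver
    suspects: list[str] = field(default_factory=list)
    clues: list[Clue] = field(default_factory=list)
    prose: str = ""                   # the clues rendered as a readable story

def is_fair_play(mystery: Mystery) -> bool:
    # A fair-play mystery needs at least one genuine clue so the killer
    # can be deduced from the prose alone.
    return any(not c.is_red_herring for c in mystery.clues)
```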
- Ensure you have Python 3.10+ installed on your system.
- Clone this repository:

  ```bash
  git clone https://github.com/yourusername/ClueEval.git
  cd ClueEval
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your OpenAI API key as an environment variable (a quick sanity check follows this list):

  ```bash
  export OPENAI_API_KEY='your-api-key-here'
  ```

- Run the main script (interactive mode):

  ```bash
  python main.py --interactive_mode
  ```

  This will give you a mystery to solve. Read it and decide who you think is the killer!
- Run the main script (generation mode):

  ```bash
  python main.py 10
  ```

  This will generate 10 mysteries and store them in `generated_questions`. Beware that each generation takes a couple of minutes, as there is a lot of back and forth with an LLM. (A sketch for inspecting the output also follows this list.)
- Run lm_eval: For information on running the CLUE evaluation task using lm_eval, please refer to the README in the `lm_eval/tasks/clue_eval/` directory.
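If the environment variable step gives you trouble, here is a quick sanity check (generic Python, not part of ClueEval) to confirm the key is visible before running `main.py`:

```python
import os

# Confirm the OpenAI key is exported in the current shell.
key = os.environ.get("OPENAI_API_KEY")
if not key:
    raise SystemExit("OPENAI_API_KEY is not set; run the export command above.")
print(f"OPENAI_API_KEY found (ends in ...{key[-4:]})")
```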
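Once generation mode finishes, you can poke at the output programmatically. The sketch below assumes one JSON file per mystery under `generated_questions/`; that format is a guess, so check the directory's actual contents first:

```python
import json
from pathlib import Path

# Illustrative only: assumes each generated mystery is a JSON file.
for path in sorted(Path("generated_questions").glob("*.json")):
    with path.open() as f:
        mystery = json.load(f)
    # Top-level keys are whatever the generator emits; print them to explore.
    print(f"{path.name}: {sorted(mystery)}")
```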
- `story/`: Contains the core logic for story generation and processing.
- `utils/`: Utility functions, including GPT API interactions.
- `config/`: Configuration files, including prompts and element lists.
- `lm_eval/`: Contains the CLUE evaluation task and results. See the README in `lm_eval/tasks/clue_eval/` for detailed information on running the evaluation.
We welcome contributions to ClueEval! Please feel free to submit issues, feature requests, or pull requests.
- Inspired by golden age mystery authors.
- Narrative generation uses OpenAI's GPT models.
- Anthropic Claude wrote most of the code, although I did some of the work too. I take responsibility for all the bugs.
Happy mystery solving!