ClueEval is a project designed to evaluate the reasoning capabilities of Large Language Models (LLMs) by challenging them to solve generated mystery stories.
The generated stories are intended to test deductive reasoning: the solver must work out who the killer is from the clues given.
- Story Generation: The system randomly generates a basic mystery, including a killer and victim, in story/random_details.py. Then, story/writer.py uses an LLM to create a unique murder mystery, giving each character their own story and perspective.
- Narrative Creation: A detailed narrative is generated, including both the true events and misleading information.
- Clue Assembly: The system compiles a set of clues, some relevant to solving the mystery and others serving as red herrings.
- Prose: The set of clues is turned into prose.
- Evaluation: Whodunnit? The clues contained in the prose should be enough to figure it out. These are fair play mysteries.
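The steps above can be illustrated with a small, self-contained sketch. This is not the code in story/random_details.py or story/writer.py: the element lists, the `Mystery` class, and the `generate_mystery` and `is_correct` functions are hypothetical names made up for illustration, and the LLM-driven narrative and prose steps are left out entirely. It only shows the general idea of picking a killer and victim at random, mixing relevant clues with a red herring, and checking a solver's answer.

```python
import random
from dataclasses import dataclass, field

# Hypothetical element lists; the real ones live in config/ and may differ.
SUSPECTS = ["Colonel Ash", "Dr. Birch", "Lady Cedar", "Mr. Dale"]
WEAPONS = ["candlestick", "rope", "poison"]
ROOMS = ["library", "kitchen", "conservatory"]


@dataclass
class Mystery:
    victim: str
    killer: str
    weapon: str
    room: str
    clues: list[str] = field(default_factory=list)


def generate_mystery(rng: random.Random) -> Mystery:
    """Pick the basic facts, then assemble relevant clues and a red herring."""
    victim, killer = rng.sample(SUSPECTS, 2)  # two distinct characters
    weapon = rng.choice(WEAPONS)
    room = rng.choice(ROOMS)

    relevant = [
        f"{killer} was seen near the {room} shortly before the murder.",
        f"The {weapon} was found hidden among {killer}'s belongings.",
    ]
    # The red herring points at a character who is neither victim nor killer.
    innocents = [s for s in SUSPECTS if s not in (victim, killer)]
    decoy = rng.choice(innocents)
    red_herrings = [f"{decoy} had argued loudly with {victim} the day before."]

    clues = relevant + red_herrings
    rng.shuffle(clues)  # do not reveal which clues matter by their order
    return Mystery(victim, killer, weapon, room, clues)


def is_correct(mystery: Mystery, answer: str) -> bool:
    """Evaluation step: compare the named suspect with the true killer."""
    return answer.strip().lower() == mystery.killer.lower()


if __name__ == "__main__":
    m = generate_mystery(random.Random(0))
    print(f"Victim: {m.victim}")
    for clue in m.clues:
        print("-", clue)
```

Passing a seeded `random.Random` keeps a generated mystery reproducible, which is convenient when comparing several models on the same set of stories.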
- Ensure you have Python 3.10+ installed on your system.
- Clone this repository:
    git clone https://github.com/yourusername/ClueEval.git
    cd ClueEval
- Install the required dependencies:
    pip install -r requirements.txt
- Set up your OpenAI API key as an environment variable:
    export OPENAI_API_KEY='your-api-key-here'
- Run the main script (interactive mode):
    python main.py --interactive_mode
  This will give you a mystery to solve. Read it and decide who you think is the killer!
- Run the main script (generation mode):
    python main.py 10
  This will generate 10 mysteries, and store them in generated_questions. Beware that each generation will take a couple of minutes, as there is a lot of back and forth with an LLM.
- Run lm_eval: For information on running the CLUE evaluation task using lm_eval, please refer to the README in the lm_eval/tasks/clue_eval/ directory.
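Since each generation takes a couple of minutes, it can help to confirm the prerequisites listed above before starting a run. The following is a minimal sketch of such a check, assuming only what the steps state (Python 3.10+ and the OPENAI_API_KEY environment variable). The `check_prerequisites` function is a hypothetical helper and is not part of the repository.

```python
import os
import sys


def check_prerequisites(env=None, version=None) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 10):
        problems.append(
            f"Python 3.10+ is required, found {version[0]}.{version[1]}"
        )
    # An empty string counts as unset, since it cannot be a usable key.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for issue in check_prerequisites():
        print("Problem:", issue)
```

The `env` and `version` parameters exist only so the check can be exercised without changing the real environment; called with no arguments, it inspects the running interpreter and `os.environ`.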
- story/: Contains the core logic for story generation and processing.
- utils/: Utility functions, including GPT API interactions.
- config/: Configuration files, including prompts and element lists.
- lm_eval/: Contains the CLUE evaluation task and results. See the README in lm_eval/tasks/clue_eval/ for detailed information on running the evaluation.
We welcome contributions to ClueEval! Please feel free to submit issues, feature requests, or pull requests.
- Inspired by golden age mystery authors.
- Narrative generation uses OpenAI's GPT models.
- Anthropic Claude wrote most of the code, although I did some of the work too. I take responsibility for all the bugs.
Happy mystery solving!