This is a Python application for evaluating how well LLMs handle the order dependency problem in multiple-choice questions, drawing heavily on the paper Large Language Models Are Not Robust Multiple Choice Selectors by Zheng et al. It is structured so that additional models can be supported with little extra work.
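As a rough illustration of the problem being measured (this is a sketch, not the project's actual code), the snippet below asks the same question under every ordering of its options and reports whether the model's selected answer content changes. The ask callable and the helper names are hypothetical placeholders.

```python
# Illustrative sketch: probe order dependency by re-asking the same
# multiple-choice question with its options permuted.
from itertools import permutations

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question with options labeled A, B, C, ..."""
    labels = "ABCDEFGH"
    lines = [question] + [f"{labels[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def is_order_dependent(ask, question: str, options: list[str]) -> bool:
    """Return True if the model's chosen *content* changes across orderings.

    `ask` is any callable that maps a prompt string to a single-letter answer.
    """
    picks = set()
    for perm in permutations(options):
        letter = ask(build_prompt(question, list(perm))).strip()[0].upper()
        picks.add(perm["ABCDEFGH".index(letter)])  # map the letter back to its content
    return len(picks) > 1
```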
The easiest way to install the project is to use pip or uv to install the wheel file in dist/. Alternatively, you can clone the repo, use uv or any other pyproject.toml-compatible tool to install the dependencies into a local virtual environment, and run it from there.
Once installed, you can run the application with either uv run run_analysis or run_analysis. Outputs are saved to the outputs directory.
Usage: run_analysis [OPTIONS]

Options:
  --model_name TEXT     Model name
  --data_limit INTEGER  Number of questions to use
  --random BOOLEAN      Whether to use random questions
  --help                Show this message and exit.
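For example, run_analysis --model_name gpt-4o-mini --data_limit 50 --random true evaluates 50 randomly sampled questions (the model name and values here are purely illustrative).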
To access the Hugging Face dataset, authenticate your account with the Hugging Face CLI (huggingface-cli login).
To access the OpenAI API, put your OpenAI API key in a local .env file.
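Since the key is read from a .env file, the application presumably loads it with something like python-dotenv; the variable name OPENAI_API_KEY and the code below are assumptions, not the project's actual source.

```python
# Sketch of the expected setup (the OPENAI_API_KEY variable name is an assumption).
import os
from dotenv import load_dotenv   # provided by the python-dotenv package
from openai import OpenAI

load_dotenv()                    # reads key=value pairs from .env into the environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])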
This project could be extended in many ways: supporting more models, more MCQ datasets, and more evaluation metrics, such as the standard deviation of per-option recall as a measure of recall balance. It could also be extended to handle more complex questions, such as those that require multiple reasoning steps to solve, and to evaluate chain-of-thought prompting on them.
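As a sketch of what such a metric could look like, the function below computes the standard deviation of per-option recall (RStd in Zheng et al.) from (gold, predicted) option-ID pairs; the function name and data layout are illustrative, not part of the project.

```python
# Illustrative metric: standard deviation of per-option recall.
# A low value means recall is balanced across option IDs (A, B, C, ...).
from statistics import pstdev

def recall_std(records: list[tuple[str, str]]) -> float:
    """records: (gold_option_id, predicted_option_id) pairs, e.g. ("A", "B")."""
    option_ids = sorted({gold for gold, _ in records})
    recalls = []
    for oid in option_ids:
        relevant = [(g, p) for g, p in records if g == oid]
        hits = sum(1 for g, p in relevant if p == g)
        recalls.append(hits / len(relevant))
    return pstdev(recalls)
```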
Adding tests, which the project currently lacks, would also be a valuable improvement.