This project is under heavy development.
Extract information from labels on images of museum specimens.
OCRed text from the label on the lower right of the sheet.
This is clearly from another sheet and label. The colors indicate text matched to fields.
The text is formatted and placed into named fields using the Darwin Core format.
You will need git and uv, the Python package and environment manager.
git clone https://github.com/rafelafrance/LabelLlama.git
cd LabelLlama
git checkout v0.1.1
This project is under heavy development, so the v0.1.1 tag pins the code to a known state.
uv sync
LM-Studio is a wrapper and GUI around the llama.cpp library.
The GUI is convenient for downloading and running models locally.
Note that you may run LM-Studio headless with the lms daemon.
Of course, you don't have to run any models locally.
I use local models to OCR text on images of specimens and to clean up some fields in the LM output.
You can get the LM-Studio GUI and daemon here
I run a local model to OCR images. I use (for now) Chandra-OCR on the LM-Studio daemon.
lms daemon up
lms server start
The important arguments for running the OCR script, get_text.py,
are shown in the demo/get_text_demo.bash script.
- --image-dir: A directory full of museum specimen images that you want to OCR.
- --docs: Put the OCRed text into this file.
- --model: Use this model for the OCR.
Note that the script appends to the --docs file, so you may rerun the script after an error or combine the output from multiple runs into a single file.
The output CSV has 3 columns.
- The "source" path, which is the image path in this case.
- How long it took the OCR to run, "elapsed".
- The OCR text itself, "text".
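The append behavior and the three-column layout described above could be sketched roughly as follows. This is only an illustration, assuming a plain comma-separated file with a header row; the helper names `append_ocr_rows` and `read_ocr_rows` are hypothetical and are not taken from get_text.py.

```python
import csv
from pathlib import Path

# Column names as described in the README.
FIELDS = ["source", "elapsed", "text"]


def append_ocr_rows(docs_path, rows):
    """Append OCR rows to a CSV, writing the header only for a new or empty file."""
    path = Path(docs_path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def read_ocr_rows(docs_path):
    """Read every row back as a dict keyed by column name."""
    with Path(docs_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Because the header is written only once, rows from a rerun or from a second image directory land in the same file and can be read back together.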
run_lm.py gets the raw field extracts from a Large Language Model (LLM).
The fields will contain hallucinations, interpretations, and odd notations,
all of which I try to fix in the next step.
The important arguments for this process are shown in the demo/run_lm_demo.bash script.
- --docs: The output from the OCR script (demo/ocr_demo_docs.csv in the demo).
- --out-file: Where to put the extraction output (demo/lm_extracts.csv in the demo).
- --model: The LLM to use ("openai/gpt-5-nano" in the demo).
- --api-key: The key required to run the LLM.
- --threads: How many threads to use (5 in the demo). You can parallelize the process, which speeds things up.
- --signature: The fields to extract from the OCRed text (herbarium in the demo).
You will need to set up an account with payment information to get the API key. In this case it would be with OpenAI.
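To illustrate what the --threads option buys you, here is a minimal sketch of sending several label texts to an LLM in parallel with a thread pool. It is an assumption about the general approach, not a description of how run_lm.py is written; `extract_all` and the `extract_one` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all(texts, extract_one, threads=5):
    """Run extract_one on every text using a pool of worker threads.

    extract_one is a stand-in for whatever function calls the LLM.
    Results come back in the same order as the input texts.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(extract_one, texts))
```

Threads suit this kind of work because each call spends most of its time waiting on the network rather than using the CPU.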
Note that the "text" column in the output file may differ from the raw input in the OCR file. I run a text preprocessing step before sending the text to the LLM, and I want to record the exact text given to the LLM. I remove obvious headers and footers. I also join lines of text if there is only one line break (return) between them. Labels have limited horizontal space, so sentences are often split across multiple lines, but the models tend to do better when a sentence contains no line breaks. Two or more line breaks in a row are likely to have semantic meaning, whereas a single break probably does not.
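The line-joining rule can be sketched with a single regular expression. This is a minimal illustration, assuming a lone newline is simply replaced by a space; `join_wrapped_lines` is a hypothetical name, and the header and footer removal is not shown because its rules are not described here.

```python
import re


def join_wrapped_lines(text):
    """Join lines separated by one line break; keep runs of two or more."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline with no newline directly before or after it is a wrapped line.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)
```

For example, `join_wrapped_lines("Collected along the\nriver bank.\n\nElev. 1200 ft")` keeps the blank line before the elevation but merges the wrapped sentence into one line.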
This is where I clean up the oddities in the output from the LLM and put it into a usable format. The important arguments for this script are:
- --in-file: The output from the run_lm.py script (demo/lm_extracts.csv in the demo).
- --out-file: Output the cleaned results to this file (demo/lm_extracts_post.csv in the demo).
- --run-field-models: Explanation below.
--run-field-models: Some output fields are fairly complex and need to be broken
down into subfields. For example, a "TRS" field has the "township", "range",
"section", and "quad" subfields. I use a small local model like gemma-4 to find the subfields.
This option runs those models on, currently, the "TRS", "UTM", and "Elevation" extracts.
The field models are only run if data is missing from the LLM extract, so if the
LLM extracted the "northing", "easting", and "zone" from a UTM then the local model is
skipped for that particular "UTM".
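The skip rule can be sketched as follows. This is a rough illustration, assuming each extract is a dict of subfield names to strings; the `REQUIRED_SUBFIELDS` table, the function names, and the `run_model` callable are all hypothetical. "Elevation" is left out because its subfields are not listed above.

```python
# Hypothetical subfield table, built from the names mentioned in the README.
REQUIRED_SUBFIELDS = {
    "TRS": ("township", "range", "section", "quad"),
    "UTM": ("northing", "easting", "zone"),
}


def _is_blank(value):
    return not str(value or "").strip()


def needs_field_model(field, extract):
    """True if any expected subfield is missing or empty in the LLM extract."""
    required = REQUIRED_SUBFIELDS.get(field, ())
    return any(_is_blank(extract.get(name)) for name in required)


def fill_subfields(field, extract, run_model):
    """Call the local model only when the LLM extract is incomplete."""
    if not needs_field_model(field, extract):
        return dict(extract)  # complete: the local model is skipped
    filled = dict(extract)
    # run_model stands in for the small local model; keep the LLM's values
    # and fill only the blanks.
    for name, value in run_model(field, extract).items():
        if _is_blank(filled.get(name)):
            filled[name] = value
    return filled
```

Checking before calling keeps the local model idle for extracts that are already complete, which is the saving described above.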



