CONcordance of Curated & Original Raw Descriptions In Annotations
A toolkit for annotation concordance and entity relationship classification using embeddings and LLMs.
- gateway-check: Argo Gateway API connectivity check on startup with prod/dev endpoint fallback
- local: PubMedBERT embeddings → cosine similarity → heuristic labels
- zero-shot: Single LLM call with optional similarity hints
- vote: Multiple LLM calls with majority vote (with vote tracking)
- rac (Beta): Retrieval-Augmented Classification with example memory
- fallback: Safe local fallback on errors
- Template-driven prompt management with versioned external templates (v1.x, v2, v2.1, v3.0, v3.1)
- Ad-hoc mode for quick two-sentence comparisons (without requiring a CSV file)
- list-templates: List available prompt templates
- batch processing: Control both file chunking and LLM batch sizes
- verbose: Show detailed evidence and explanations
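The vote mode listed above aggregates several LLM calls (run at different temperatures, 0.8, 0.2, and 0.0 by default; see the configuration keys below) into a single label by majority vote. A minimal sketch of that aggregation step, with hypothetical label names rather than the toolkit's actual label set:

from collections import Counter

def majority_vote(labels):
    """Return the winning label plus the full tally (useful for vote tracking)."""
    tally = Counter(labels)
    winner, _ = tally.most_common(1)[0]
    return winner, dict(tally)

# e.g. three calls at different temperatures returned these (hypothetical) labels
winner, votes = majority_vote(["exact", "exact", "related"])
print(winner, votes)  # exact {'exact': 2, 'related': 1}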
git clone https://github.com/you/concordia.git
cd concordia
poetry install # install dependencies & CLI entry-point
poetry shell # activate the virtual environment

Alternatively, install with pip:

pip install concordia

If you've installed additional Python packages in your environment, you can compare them with Poetry-managed dependencies:
# export current environment packages
pip freeze > env-requirements.txt
# export Poetry-managed requirements
poetry export -f requirements.txt --without-hashes > poetry-requirements.txt
# view differences
diff env-requirements.txt poetry-requirements.txt

Manually add any missing packages to pyproject.toml under [tool.poetry.dependencies] and run poetry update.
CLI
# Simplified command structure (single invocation)
concord example_data/annotations_test.csv --mode zero-shot --llm-model gpt4o
concord example_data/annotations_test.csv --mode local --output local.csv
concord example_data/annotations_test.csv --mode vote --output results_vote.csv
concord example_data/annotations_test.csv --mode rac --output results_rac.csv
# Direct text comparison (no CSV required)
concord --text-a "Entity A" --text-b "Entity B" --mode zero-shot
# List available templates
concord --list-templates
# Control batch processing
concord example_data/annotations_test.csv --batch-size 32 --llm-batch-size 12

Python
from concord.pipeline import run_pair, run_file
label, sim, evidence = run_pair("Entity A", "Entity B", "config.yaml")
print(label, sim, evidence)
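If you need to score many pairs programmatically, run_pair can simply be called in a loop. A minimal sketch, assuming a CSV with hypothetical text_a and text_b columns (adjust the column names to your file; for file inputs, run_file and the CLI already handle this):

import csv

from concord.pipeline import run_pair

with open("pairs.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        label, sim, evidence = run_pair(row["text_a"], row["text_b"], "config.yaml")
        print(row["text_a"], row["text_b"], label, round(sim, 3))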
After generating predictions (e.g., from a benchmark run), evaluate them against the gold standard using eval/evaluate_suite.py.
For detailed instructions on running benchmark suites and evaluation, see the Benchmarking Workflow.
Example evaluation command:
python eval/evaluate_suite.py \
--gold eval/datasets/Benchmark_subset__200_pairs_v1.csv \
--pred-dir eval/results/your_benchmark_run_timestamp_dir \
--pattern "**/*.csv" \
--out eval/results/your_benchmark_run_timestamp_dir/evaluation_output \
--plot

Replace your_benchmark_run_timestamp_dir with the specific output directory of your benchmark run.
engine:
  mode: zero-shot                # local | zero-shot | vote | rac
  sim_hint: false                # Optional: prefix similarity hint to prompts
llm:
  model: gpt4o                   # use without hyphens
  stream: false
  user: ${ARGO_USER}
local:
  model_id: NeuML/pubmedbert-base-embeddings
  device: cpu                    # cpu or cuda
# RAC mode settings (Beta)
rac:
  example_limit: 3               # Number of examples to include in prompts
  similarity_threshold: 0.6      # Minimum similarity to include example
  auto_store: true               # Auto-save classifications to vector store
data_dir: "./data"               # Where to store the vector database

- engine.mode: select mode (local, zero-shot, vote, rac)
- engine.sim_hint: boolean flag to prefix a cosine similarity hint to LLM prompts (default: false)
- engine.sim_threshold: similarity threshold for local mode (default: 0.98)
- engine.vote_temps: list of temperatures for vote mode LLM calls (default: [0.8, 0.2, 0.0])
- llm.model: Gateway model name (e.g. gpt4o, gpt35, gpto3mini)
- llm.stream: true to use the streaming /streamchat/ endpoint
- llm.user: Argo Gateway username (via ARGO_USER)
- llm.api_key: Argo Gateway API key (via ARGO_API_KEY)
- prompt_ver: explicit prompt version to use (overrides config prompt_ver and bucket routing)
- local.model_id: embedding model ID (PubMedBERT or SPECTER2)
- local.device: device for embeddings (cpu or cuda)
- local.batch_size: batch size for file processing
- rac.example_limit: number of similar examples to retrieve (for RAC mode)
- rac.similarity_threshold: minimum similarity score for examples (0-1)
- rac.auto_store: whether to automatically store successful classifications
- data_dir: directory for storing the vector database and other data
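For reference, local mode boils down to embedding both texts and thresholding their cosine similarity (engine.sim_threshold above). A minimal sketch of that idea, assuming the sentence-transformers package is installed; the label names are illustrative, not the toolkit's exact output:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")   # local.model_id
emb = model.encode(["Entity A description", "Entity B description"])
sim = float(util.cos_sim(emb[0], emb[1]))                         # cosine similarity

# heuristic label around engine.sim_threshold (default 0.98); labels here are illustrative
label = "identical" if sim >= 0.98 else "different"
print(label, round(sim, 3))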
The Retrieval-Augmented Classification (RAC) mode is currently in beta development. This mode enhances classification by retrieving similar previously classified examples and including them in the prompt for context.
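Conceptually, the retrieval step is a small nearest-neighbour lookup over previously stored (embedding, label, text) examples, filtered by rac.similarity_threshold and capped at rac.example_limit. A minimal, self-contained sketch of that idea using plain NumPy and hypothetical data, not concord's actual vector store:

import numpy as np

# hypothetical store of previously classified examples: (embedding, label, text)
store = [
    (np.array([0.9, 0.1]), "exact", "mitochondrial ATP synthase"),
    (np.array([0.2, 0.8]), "related", "ATP-binding cassette transporter"),
]

def retrieve(query_emb, limit=3, threshold=0.6):
    """Return up to `limit` stored examples whose cosine similarity beats `threshold`."""
    scored = []
    for emb, label, text in store:
        sim = float(np.dot(query_emb, emb) / (np.linalg.norm(query_emb) * np.linalg.norm(emb)))
        if sim >= threshold:
            scored.append((sim, label, text))
    return sorted(scored, reverse=True)[:limit]

# retrieved examples would be prepended to the LLM prompt as few-shot context
for sim, label, text in retrieve(np.array([0.8, 0.3])):
    print(f"{sim:.2f}  {label}  {text}")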
RAC mode currently has several limitations being actively worked on:
- All Classifications Get Stored: Currently, all successful LLM classifications are stored in the vector database if auto_store is enabled, regardless of quality or accuracy.
- Planned Improvements:
  - Human validation before storing examples
  - Confidence thresholds from the LLM responses
  - Selective storage based on specific characteristics or patterns
  - Improved embedding methods for better similarity matching
# First time setup - create data directory
mkdir -p data
# Run with RAC mode (will build up examples over time)
concord example_data/annotations_test.csv --mode rac --output results_rac.csv

Serve the documentation locally with:

mkdocs serve

Published site: https://.github.io/concordia/
- ARGO_USER: ANL login for Argo Gateway (required)
- ARGO_API_KEY: API key for private Argo Gateway (optional)
See CONTRIBUTING.md for guidelines.
Run all tests via pytest:
pytest

We enforce formatting and linting with pre-commit hooks:
pip install pre-commit
pre-commit install
pre-commit run --all-files

License: Apache-2.0