This repository does not contain all of the code. It was created from my private repository for public viewing.
A powerful Streamlit application that uses Large Language Models (OpenAI GPT or Anthropic Claude) to automatically annotate scientific text with custom tag definitions. The app provides a complete workflow from annotation to validation, editing, and evaluation.
- Multi-LLM Support: Choose between OpenAI (GPT-3.5, GPT-4, GPT-4o) or Anthropic Claude models
- Custom Tag Definitions: Upload your own tag set with definitions and examples via CSV
- Chunked Processing: Handles large texts by breaking them into configurable chunks
- Visual Annotation Display: Interactive highlighting of annotated entities in the text
- Comprehensive Export: Download annotations with complete metadata in JSON format
- Interactive Editing: Edit, add, or remove annotations through a built-in data editor
- Position Validation: Validate that annotation positions match the actual text
- Automatic Position Fixing: Automatically correct misaligned annotation positions
- LLM Evaluation: Use an LLM to evaluate annotation quality and get improvement suggestions
- Batch Recommendation Application: Apply multiple LLM suggestions at once
- Python 3.9+
- Streamlit
- OpenAI API key or Anthropic Claude API key
git clone <repository-url>
cd scientific-text-annotator
pip install -r requirements.txt
streamlit run app_v3_agent_optimization.py
- Configure API: Enter your OpenAI or Claude API key in the sidebar
- Select Model: Choose your preferred LLM provider and model
- Adjust Parameters: Configure temperature, chunk size, and max tokens
- Upload Text:
  - Upload a `.txt` file, or
  - Paste text directly into the text area
- Upload Tag Set: Upload a CSV file with columns:
  - `tag_name`: The name of the annotation tag
  - `definition`: Clear definition of what the tag represents
  - `examples`: Examples of text that should be tagged
- Click "Run Annotation" to start the process
- Monitor progress as the app processes text chunks
- View annotation statistics and distribution
- Visual Preview: See highlighted annotations in the text
- Edit Annotations: Use the interactive table to:
- Modify annotation text, positions, or labels
- Add new annotations
- Delete unwanted annotations
- Validation: Check if annotation positions match the text
- Auto-Fix: Automatically correct position misalignments
- Run Evaluation: Let the LLM assess annotation quality
- Review Suggestions: See recommendations for label changes or deletions
- Apply Recommendations: Selectively apply LLM suggestions
Download your annotations in JSON format with complete metadata
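As a rough illustration of what an export with metadata could look like, here is a minimal sketch. The function name `build_export` and every metadata field name are assumptions made for this example; the app's actual JSON layout is not shown in this repository.

```python
import json
from datetime import datetime, timezone


def build_export(text, annotations, model, temperature, chunk_size):
    """Serialize annotations plus run settings into one JSON string.

    Hypothetical layout: the real export schema may differ.
    """
    payload = {
        "metadata": {
            "exported_at": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "temperature": temperature,
            "chunk_size": chunk_size,
            "text_length": len(text),
            "annotation_count": len(annotations),
        },
        "annotations": annotations,
    }
    # ensure_ascii=False keeps non-ASCII scientific symbols readable
    return json.dumps(payload, indent=2, ensure_ascii=False)
```

In a Streamlit app, a string like this can be offered to the user through `st.download_button`.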
- Temperature (0.0-1.0): Controls LLM creativity/consistency
- Chunk Size (200-4000 chars): Size of text segments processed individually
- Max Tokens: Maximum tokens per LLM response
- Clean Text: Remove non-printable characters from input
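To make the Chunk Size and Clean Text options more concrete, here is a minimal sketch. The names `clean_text` and `chunk_text_sketch` are hypothetical, and fixed-width splitting is an assumption: the app's own `chunk_text()` may split differently (for example, on sentence boundaries).

```python
def clean_text(text: str) -> str:
    """Drop non-printable characters, keeping newlines and tabs."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def chunk_text_sketch(text: str, chunk_size: int = 1000) -> list[tuple[int, str]]:
    """Split text into fixed-size chunks as (start_offset, chunk) pairs.

    Keeping the start offset lets chunk-relative annotation positions
    be mapped back to positions in the full text later.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, text[start:start + chunk_size])
        for start in range(0, len(text), chunk_size)
    ]
```

Note that cleaning changes character offsets, so it should happen before annotation rather than after; otherwise positions will no longer match the text.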
OpenAI Models:
- `gpt-4o-mini` (recommended for cost-effectiveness)
- `gpt-4o` (best performance)
- `gpt-4`
- `gpt-3.5-turbo`

Claude Models:
- `claude-3-5-sonnet-20250219`
- `claude-3-5-haiku-20241022`
Your tag definition CSV must include these columns:
tag_name,definition,examples
GENE,"A DNA sequence that codes for a protein","TP53, BRCA1, insulin gene"
PROTEIN,"A biological molecule made of amino acids","hemoglobin, antibody, enzyme"
DISEASE,"A medical condition or disorder","cancer, diabetes, Alzheimer's disease"
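Before uploading, it can help to check that a tag file has the required columns. The following is a minimal sketch using only the standard library; `load_tag_set` is a hypothetical helper, not a function from this app.

```python
import csv
import io

REQUIRED_COLUMNS = {"tag_name", "definition", "examples"}


def load_tag_set(csv_text: str) -> list[dict[str, str]]:
    """Parse tag definitions and fail early if a required column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return [
        {col: (row[col] or "").strip() for col in REQUIRED_COLUMNS}
        for row in reader
        if (row["tag_name"] or "").strip()  # skip rows without a tag name
    ]
```

Quoting the `definition` and `examples` fields, as in the sample above, keeps commas inside them from being read as column separators.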
- Checks if annotation positions correctly correspond to the tagged text
- Identifies misaligned annotations caused by text processing
- Reports overlapping annotations and zero-length annotations
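The checks above can be sketched as follows. This is an illustration only: it assumes each annotation is a dict with `text`, `start`, `end`, and `label` keys, and `validate_positions` is a hypothetical name rather than the app's `validate_annotations_streamlit()`.

```python
def validate_positions(text, annotations):
    """Return (index, issue) pairs for mismatched, zero-length, or overlapping spans."""
    issues = []
    for i, ann in enumerate(annotations):
        start, end = ann["start"], ann["end"]
        if end <= start:
            issues.append((i, "zero_length"))
        elif start < 0 or end > len(text) or text[start:end] != ann["text"]:
            issues.append((i, "mismatch"))

    # Overlap check: walk non-empty spans in start order and remember the
    # furthest end seen so far, so nested spans are caught as well.
    spans = sorted(
        (ann["start"], ann["end"], i)
        for i, ann in enumerate(annotations)
        if ann["end"] > ann["start"]
    )
    max_end = float("-inf")
    for start, end, i in spans:
        if start < max_end:
            issues.append((i, "overlap"))
        max_end = max(max_end, end)
    return issues
```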
- First Strategy: Uses the first occurrence of the text found
- Closest Strategy: Chooses the position closest to the original annotation
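The two strategies can be sketched like this. The annotation dict layout (`text`, `start`, `end`, `label`) and the function names are assumptions for illustration, not the code behind `fix_annotation_positions_streamlit()`.

```python
def find_occurrences(text, needle):
    """Return every start index at which needle occurs in text."""
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + 1)
    return positions


def fix_position(text, ann, strategy="closest"):
    """Return a copy of ann with corrected offsets, or None if its text is absent."""
    needle = ann["text"]
    if not needle:
        return None
    positions = find_occurrences(text, needle)
    if not positions:
        return None
    if strategy == "first":
        start = positions[0]
    elif strategy == "closest":
        # Pick the occurrence nearest to the originally reported start
        start = min(positions, key=lambda p: abs(p - ann["start"]))
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return {**ann, "start": start, "end": start + len(needle)}
```

When the same entity appears several times in a text, the closest strategy is usually the safer choice, since the first strategy would move every misaligned mention onto the first occurrence.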
- Assesses whether annotations match their tag definitions
- Provides confidence scores and detailed reasoning
- Suggests label changes or entity deletions
- Tracks which recommendations have been applied
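One way to apply such suggestions selectively is to filter them by confidence. The sketch below is hypothetical: the recommendation fields (`index`, `action`, `new_label`, `confidence`) and the threshold are assumptions, since the format returned by `evaluate_annotations_with_llm()` is not documented here.

```python
def apply_recommendations(annotations, recommendations, min_confidence=0.8):
    """Apply relabel/delete suggestions that meet the confidence threshold.

    Returns the updated annotation list and the recommendations that were applied.
    """
    updated = [dict(ann) for ann in annotations]  # leave the input untouched
    to_delete = set()
    applied = []
    for rec in recommendations:
        idx = rec["index"]
        if rec.get("confidence", 0.0) < min_confidence:
            continue
        if not 0 <= idx < len(updated):
            continue
        if rec["action"] == "relabel":
            updated[idx]["label"] = rec["new_label"]
        elif rec["action"] == "delete":
            to_delete.add(idx)
        else:
            continue
        applied.append(rec)
    # Delete after the loop so indices stay valid while iterating
    kept = [ann for i, ann in enumerate(updated) if i not in to_delete]
    return kept, applied
```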
scientific-text-annotator/
├── app_v3_agent_optimization.py # Main Streamlit application
├── prompts_flat.py # LLM prompt templates
├── helper.py # Utility functions
├── llm_clients.py # LLM client implementations
├── requirements.txt # Python dependencies
└── README.md # This file
- `run_annotation_pipeline()`: Main annotation workflow
- `chunk_text()`: Splits text into processable chunks
- `aggregate_entities()`: Combines annotations from all chunks
- `validate_annotations_streamlit()`: Checks annotation correctness
- `fix_annotation_positions_streamlit()`: Corrects position errors
- `evaluate_annotations_with_llm()`: LLM-based quality assessment
- `apply_evaluation_recommendations()`: Applies LLM suggestions
- `display_annotated_entities()`: Visual text highlighting
- `display_processing_summary()`: Shows processing statistics
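To illustrate what combining annotations from all chunks involves, here is a minimal sketch. It assumes each chunk's annotations carry chunk-relative `start` and `end` offsets and that the chunk's start offset in the full text is known; the name `aggregate_entities_sketch` and this data layout are hypothetical and may not match the real `aggregate_entities()`.

```python
def aggregate_entities_sketch(chunk_results):
    """Merge per-chunk annotations into document-level annotations.

    chunk_results: list of (chunk_start, annotations) pairs, where each
    annotation's start/end are relative to its chunk.
    """
    merged = []
    seen = set()
    for chunk_start, annotations in chunk_results:
        for ann in annotations:
            start = ann["start"] + chunk_start
            end = ann["end"] + chunk_start
            key = (start, end, ann["label"])
            if key in seen:  # drop exact duplicates
                continue
            seen.add(key)
            merged.append({**ann, "start": start, "end": end})
    return sorted(merged, key=lambda a: (a["start"], a["end"]))
```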
- Clear Tag Definitions: Write precise, unambiguous definitions
- Good Examples: Provide diverse, representative examples
- Appropriate Chunk Size: Balance context vs. processing time
- Multiple Passes: Run evaluation and apply suggestions iteratively
- Optimize Chunk Size: Larger chunks provide more context but are slower to process
- Monitor Token Usage: Adjust max_tokens based on chunk size
- Use Efficient Models: Consider `gpt-4o-mini` for cost-effectiveness
- Always Validate: Check position accuracy after annotation
- Review LLM Suggestions: Don't blindly apply all recommendations
- Manual Review: Spot-check annotations for accuracy
- Iterative Improvement: Refine tag definitions based on results
- API Rate Limits: Large texts may hit API rate limits
- Context Windows: Very long texts are processed in chunks, potentially losing global context
- Model Dependence: Quality depends on the chosen LLM's capabilities
- Position Sensitivity: Text preprocessing may affect annotation positions
- "API key missing": Ensure you've entered a valid API key
- Position mismatches: Run validation and use the auto-fix feature
- Empty annotations: Check your tag definitions and examples
- Memory issues: Reduce chunk size or max tokens for large texts
- Use smaller chunk sizes for faster processing
- Choose appropriate models based on your accuracy/cost requirements
- Clean text input to avoid processing issues
Happy Annotating!