
WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

License: MIT · Dataset on HuggingFace

This repository contains constructed datasets and evaluation frameworks for WeatherArchive-Bench. It comprises two tasks: WeatherArchive-Retrieval, which measures a system’s ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives.

📁 Project Structure

WXImpactRAG/
├── 📁 constant/                      # Configuration and constants
│   ├── climate_framework.py          # IPCC vulnerability framework definitions
│   └── constants.py                  # File paths and model configurations
│
├── 📁 embedding_loaders/             # Data preprocessing and embedding
│   ├── concat.py                     # Text concatenation utilities
│   └── raw_csv/                      # Historical weather data corpus
│       ├── blizzard_English_*.csv    # Blizzard-related documents
│       ├── cold_English_*.csv        # Cold weather documents
│       ├── heat_English_*.csv        # Heat-related documents
│       ├── storm_English_*.csv       # Storm documents
│       └── ...                       # Other weather phenomena
│
├── 📁 data/                   # Ground truth datasets
│   ├── ground_truth_climate.csv      # Climate assessment ground truth
│   ├── QACandidate_Pool.csv          # Question-answer candidate pool
│   └── QACorrect_Passages.csv        # Correct passage annotations
│
├── 📁 WeatherArchive_Retrieval/      # Retrieval evaluation framework
│   ├── output/                       # Retrieval results
│   │   ├── overall.csv               # Comprehensive retrieval metrics
│   │   ├── raw_BM25*.csv             # BM25 variant results
│   │   ├── raw_model_result_*.csv    # Dense retrieval results
│   │   └── ...                       # Other retrieval outputs
│   ├── retriever_eval_*.py           # Retrieval evaluation scripts
│   ├── overall.py                    # Overall evaluation metrics
│   ├── utils.py                      # Utility functions
│   └── README.md                     # Retrieval framework documentation
│
└── 📁 WeatherArchive_Assessment/     # Climate impact assessment
    ├── output/                        # Assessment results
    │   ├── gpt-4o-results.csv        # GPT-4o assessment results
    │   ├── gpt-3.5-turbo-results.csv # GPT-3.5-turbo results
    │   ├── Qwen2.5-*.csv             # Qwen model results
    │   └── ...                       # Other model outputs
    └── src/                          # Assessment source code
        ├── climate_eval.py           # Climate impact evaluation
        ├── MCQ_metrics.py            # Multiple choice metrics
        ├── QA_metrics.py             # Question-answering metrics
        └── rag_eval.py               # RAG evaluation framework

🔬 Experiments and Evaluation

WeatherArchive-Retrieval

Link to PDF

Objective: Evaluate the effectiveness of various retrieval methods for historical weather data.

WeatherArchive-Assessment

Link to PDF

Objective: Evaluate LLM performance in societal vulnerability and resilience assessment related to extreme weather events based on a well-crafted framework referenced from prior meteorological research.

📊 Key Results Summary

Retrieval Performance Highlights

| Model | Recall@100 | nDCG@100 | MRR@100 | BLEU@1 |
|---|---|---|---|---|
| Gemini Embedding 001 | 95.8% | 58.8% | 48.7% | 51.7% |
| Arctic Embed 2.0 | 91.0% | 54.2% | 44.5% | 44.2% |
| BM25Okapi + CE | 83.0% | 52.5% | 44.0% | 56.5% |
| OpenAI-3-large | 89.6% | 57.1% | 47.1% | 50.2% |
| ANCE | 86.6% | 40.8% | 29.3% | 27.6% |
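To make the table's metrics concrete, here is a minimal sketch of how Recall@k and MRR@k are computed for a single query from a ranked list of retrieved passage IDs. The function names are illustrative; the repository's own implementation (e.g. in `overall.py`) may differ.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant passages that appear in the top-k results."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage within the top-k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d9"}
print(recall_at_k(ranked, relevant, 3))  # 0.5 (d1 found in top-3, d9 missed)
print(mrr_at_k(ranked, relevant, 3))     # 0.333... (first hit at rank 3)
```

Per-query values are then averaged over the full query set to produce the benchmark numbers above.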

🚀 Getting Started

Prerequisites

pip install -r requirements.txt

Running WeatherArchive-Retrieval

# BM25 variants with cross-encoder reranking
python -m WeatherArchive_Retrieval.retriever_eval_1

# Dense retrieval models
python -m WeatherArchive_Retrieval.retriever_eval_2  # SBERT, SPLADE
python -m WeatherArchive_Retrieval.retriever_eval_3  # ANCE, UniCoil
python -m WeatherArchive_Retrieval.retriever_eval_4  # Qwen models
python -m WeatherArchive_Retrieval.retriever_eval_5  # OpenAI models
python -m WeatherArchive_Retrieval.retriever_eval_6  # Arctic, Granite
python -m WeatherArchive_Retrieval.retriever_eval_7  # Gemini models

# Generate overall evaluation metrics
python -m WeatherArchive_Retrieval.overall
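For intuition about what the BM25 variants score before cross-encoder reranking, here is a self-contained BM25 (Okapi) sketch. The parameter defaults (`k1=1.5`, `b=0.75`) are conventional choices, not necessarily those used in `retriever_eval_1`.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in the corpus against the query."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    # Document frequency per term (counted once per document).
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["heavy", "storm", "damage"], ["heat", "wave", "records"], ["storm", "surge"]]
print(bm25_scores(["storm"], docs))  # nonzero for docs 0 and 2; doc 2 ranks higher
```

In the full pipeline, the top-ranked passages from a scorer like this are re-ordered by a cross-encoder before the metrics are computed.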

Running WeatherArchive-Assessment

# Societal Vulnerability and Resilience Indicator Classification
python -m WeatherArchive_Assessment.src.climate_eval
# Analyze classification results
python -m WeatherArchive_Assessment.src.classification_metrics

# Free-form Question Answering 
python -m WeatherArchive_Assessment.src.rag_eval
# Analyze QA results
python -m WeatherArchive_Assessment.src.QA_metrics
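The metrics scripts read per-model results CSVs from `output/`. As a hedged illustration of the simplest such metric, the sketch below computes exact-match accuracy from a results table; the column names (`prediction`, `label`) are assumptions, not necessarily the repository's schema.

```python
import csv
import io

def accuracy_from_csv(csv_text, pred_col="prediction", gold_col="label"):
    """Exact-match accuracy over rows of a results CSV given as a string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    correct = sum(row[pred_col] == row[gold_col] for row in rows)
    return correct / len(rows)

sample = "prediction,label\nhigh,high\nlow,medium\nhigh,high\n"
print(accuracy_from_csv(sample))  # 2 of 3 rows correct
```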

📝 Data Requirements

  • Input Data: Historical weather documents in CSV format with a 'Text' column
  • Queries: Question dataset with a 'query' column
  • Ground Truth: Correct passages for evaluation
  • API Keys: OpenAI, Google, HuggingFace (for respective models)
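A minimal sketch of loading those inputs, assuming only the column names stated above (file paths below are commented-out placeholders, not actual repository files):

```python
import csv

def load_column(path, column):
    """Read one named column from a CSV file into a list of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f)]

# Example usage (paths are hypothetical):
# passages = load_column("path/to/corpus.csv", "Text")
# queries = load_column("path/to/queries.csv", "query")
```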

🔧 Configuration

  • Model configurations in constant/constants.py
  • Climate framework definitions in constant/climate_framework.py
  • File paths and evaluation parameters are customizable

This repository contains the complete implementation and evaluation framework for WeatherArchive-Bench.
