LongContext-ICL-Annotation #176
# LLM Automatic Data Annotation Challenge under Ultra-Long Context Scenarios

---

## News
<!-- BEGIN NEWS -->
- **[2026-01-20] `Release`:** The competition is now officially live on **Kaggle**. See details: [FlagOS Open Computing Global Challenge](https://www.kaggle.com/competitions/flag-os-open-computing-global-challenge).
- **[2026-01-06] `Release`:** The comprehensive competition **FlagOS Open Computing Global Challenge**, jointly hosted by the **FlagOS Community**, the **Beijing Academy of Artificial Intelligence (BAAI)**, and **CCF ODTC**, was officially announced. See details:
[FlagOS开放计算全球挑战赛- AI赛事通 | 数据算法赛](https://www.competehub.dev/zh/competitions/modelscope180)
<!-- END NEWS -->

---

## Introduction
The LLM Automatic Data Annotation Challenge under Long-Context Scenarios is built on the Qwen3-4B large language model and adopts the In-context Learning (ICL) paradigm to study automated data annotation. Participating teams must use the datasets provided by the Organizing Committee, design effective ICL annotation solutions for ultra-long context scenarios, and run inference and evaluation on the unified test set. The Organizing Committee will assess all solutions against standardized evaluation results and determine the final rankings.

### Objectives
This challenge takes Large Language Models (LLMs) as its core driver and targets automated data annotation under ultra-long context conditions, exploring new technical paradigms that balance efficiency and accuracy. It focuses on three key scientific and engineering questions:
1. In ultra-long context scenarios, how can effective model instructions and prompt strategies be designed to guide an LLM toward stable, high-quality data annotation?
2. When the number of available annotation examples far exceeds the model's context capacity, how can information-dense, well-structured ultra-long context inputs be constructed for the data to be annotated?
3. In automated multi-turn dialogue or continuous interaction settings, how can ultra-long contexts be used efficiently to achieve annotation that is both consistent and scalable?

### Challenge Details
- Each team designs a complete LLM-based automatic data annotation solution and validates it under the unified dataset and evaluation settings. Evaluation scores and rankings will be published on a standardized leaderboard.
- In addition to evaluation results, teams must submit a technical report and reproducible source code as required by the competition. The Organizing Committee will reproduce the submitted solutions and review the technical designs. The final score is a weighted combination of the prediction score and the technical report score, according to the rules below.
- Teams must submit their technical report and complete code to the official OpenSeek open-source project on GitHub, as required by the competition rules.
- For more details, see the [competition platform](https://flagos.io/RaceDetail?id=296fmsd8&lang=cn).
- All competition information is subject to the announcements on the official platform.

---

## Quick Start
### 1. Environment

The following dependencies are required:

```bash
openai
torch
flagScale
```

On NVIDIA platforms, we recommend creating the environment with `cd src && bash create_env_nvidia.sh`.

### 2. Download Model Weights
```bash
hf download Qwen/Qwen3-4B --local-dir Qwen3-4B
# or
modelscope download --model Qwen/Qwen3-4B
```

### 3. Long-Context Configuration
In `Qwen3-4B/config.json`, replace the original configuration with:
```json
"rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
}
```

### 4. Model Deployment

Configure `llm_config.yaml` according to your needs, then start the service with:

```bash
cd FlagScale
python run.py --config-path .. --config-name llm_config action=run
```

After the model service starts, test the local API with:

```bash
python api_test.py
```

To stop the service, run:

```bash
python run.py --config-path .. --config-name llm_config action=stop
```

### 5. Run or Improve the Baseline Method

Start model annotation with:
```bash
python main.py
```

To implement a new annotation method, modify `method.py`. In this file, you can:
* Define new instruction templates
* Define new in-context example selection strategies
* Define new model inference and annotation schemes
* Add custom post-processing logic
# LongContext-ICL-Annotation

Large Language Models Automatic Data Annotation under Long-Context Scenarios.

## News
<!-- BEGIN NEWS -->
- **[2026-01-20] `Release`:** The competition is now officially live on **Kaggle**. See details: [FlagOS Open Computing Global Challenge](https://www.kaggle.com/competitions/flag-os-open-computing-global-challenge).
- **[2026-01-06] `Release`:** The comprehensive competition **FlagOS Open Computing Global Challenge** was officially announced, co-hosted by the **FlagOS Community**, the **Beijing Academy of Artificial Intelligence (BAAI)**, and **CCF ODTC**. See details:
[FlagOS开放计算全球挑战赛- AI赛事通 | 数据算法赛](https://www.competehub.dev/zh/competitions/modelscope180)
<!-- END NEWS -->

## Introduction

The LongContext-ICL-Annotation Challenge focuses on automatic data annotation under long-context settings using Large Language Models (LLMs). The competition is built upon the Qwen3-4B model and adopts the In-context Learning (ICL) paradigm to investigate scalable, high-quality automated annotation methods.

Participating teams are required to use the officially provided datasets and design effective ICL-based annotation solutions tailored to ultra-long context scenarios. All submissions will be evaluated on a unified benchmark dataset. The Organizing Committee will conduct standardized evaluations and determine the final rankings based on the official evaluation results.

## Objectives

This challenge takes Large Language Models (LLMs) as the core technical foundation and targets automated data annotation under ultra-long context constraints, aiming to explore novel paradigms that balance annotation efficiency and accuracy. The competition focuses on the following key scientific and engineering challenges:

1. **Instruction and Prompt Design:** How can effective model instructions and prompt strategies be designed in ultra-long context scenarios to guide LLMs toward stable, high-quality data annotation?
2. **Ultra-Long Context Construction:** When the number of available annotation examples significantly exceeds the model's context capacity, how can information-dense, structurally coherent ultra-long context inputs be constructed for the data to be annotated?
3. **Multi-Turn and Continuous Annotation:** In automated multi-round dialogue or continuous interaction settings, how can ultra-long contexts be leveraged efficiently to achieve both consistency and scalability in data annotation?

## Challenge Details

- Participating teams are expected to independently design a complete LLM-based automatic data annotation pipeline and validate it under a unified dataset and evaluation protocol. Evaluation scores and rankings will be published on a standardized leaderboard.

- In addition to prediction results, teams must submit a technical report and fully reproducible source code in accordance with the competition requirements. The Organizing Committee will reproduce the submitted solutions and review the technical design. The final score will be a weighted combination of prediction performance and the technical-solution review, with detailed rules specified by the competition.

- Teams are required to submit their technical reports and complete source code to the official OpenSeek GitHub repository designated by the competition.

- For additional details, please refer to the [FlagOS platform](https://flagos.io/RaceDetail?id=296fmsd8&lang=en). All competition-related information is subject to the announcements published on the official platform.

## Quick Start

### 1. Environment Setup

The following dependencies are required:

```bash
openai
torch
flagScale
```

On NVIDIA platforms, it is recommended to create the environment using: `cd src && bash create_env_nvidia.sh`

### 2. Download Model Weights

```bash
hf download Qwen/Qwen3-4B --local-dir Qwen3-4B
# or
modelscope download --model Qwen/Qwen3-4B
```

### 3. Long-Context Configuration

In `Qwen3-4B/config.json`, replace the original configuration with the following settings:

```json
"rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
}
```
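With YaRN scaling, the effective context window is roughly `factor × original_max_position_embeddings`. A quick sketch (plain arithmetic, not part of the competition code) to sanity-check the resulting limit:

```python
# Sanity-check the extended context window implied by the rope_scaling
# configuration above: factor * original_max_position_embeddings.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

extended_context = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(extended_context)  # 131072
```

This is why 30K-token ICL prompts remain comfortably within the deployed model's limit after the configuration change.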
### 4. Model Deployment

Configure the `llm_config.yaml` file according to your actual requirements. Then start the service with:

```bash
cd FlagScale
python run.py --config-path .. --config-name llm_config action=run
```

After the model service is launched, you can test the local API using:

```bash
python api_test.py
```

To stop the service, run:

```bash
python run.py --config-path .. --config-name llm_config action=stop
```

### 5. Run or Extend the Baseline Method

Start the baseline annotation pipeline with:

```bash
python main.py
```

To implement a new annotation method, modify the `method.py` file. Within this file, you may:

- Define new instruction or prompt templates
- Design new context example selection strategies
- Implement alternative model inference and annotation pipelines
- Add custom post-processing logic
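As an illustration of one of these extension points, here is a minimal sketch of a context-example selection strategy. The function name `select_examples`, the example dictionaries, and the word-count token estimate are all hypothetical, not the actual `method.py` interface:

```python
# Hypothetical sketch: greedily pack the demonstration examples whose input
# shares the most vocabulary with the test sample, stopping at a rough token
# budget. Names and the crude one-token-per-word estimate are illustrative.
def select_examples(test_input, examples, token_budget=30_000):
    def overlap(example):
        # Number of words shared between the test input and the example input.
        return len(set(test_input.split()) & set(example["input"].split()))

    selected, used = [], 0
    for ex in sorted(examples, key=overlap, reverse=True):
        cost = len(ex["input"].split()) + len(ex["output"].split())
        if used + cost > token_budget:
            break
        selected.append(ex)
        used += cost
    return selected

examples = [
    {"input": "add 2 and 3", "output": "5"},
    {"input": "paint the fence", "output": "done"},
]
print(select_examples("add 4 and 5", examples, token_budget=6))
```

A real strategy would replace the word-count heuristic with the tokenizer actually used for Qwen3-4B, but the packing loop stays the same shape.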
# Datasets

This repository provides the official datasets for the **LLM Automatic Data Annotation** challenge.
The datasets are specifically designed to evaluate the capability of Large Language Models (LLMs) to perform **automatic data annotation under ultra-long context settings** using the In-context Learning (ICL) paradigm.

## Overview

- Most tasks require a **minimum ICL context length of 30K tokens**, deliberately exceeding standard context limits to evaluate long-context understanding, prompt engineering, and example selection strategies.
- Task **openseek-8** is configured with a **shorter minimum context length (15K tokens)** and a **smaller test set**, reflecting the unique challenges of **kernel generation**.
- All datasets are released with **fixed and standardized test splits** to ensure fair comparison and reproducibility across submissions.
- The task suite covers a **diverse range of domains and reasoning types**, including symbolic reasoning, linguistic analysis, natural language inference, code-related tasks, and open-ended generation.

| Task ID | Task Name | Minimum ICL Context | Test Samples |
| --- | --- | --- | --- |
| openseek-1 | closest_integers | 30K | 500 |
| openseek-2 | count_nouns_verbs | 30K | 500 |
| openseek-3 | collatz_conjecture | 30K | 500 |
| openseek-4 | conala_concat_strings | 30K | 500 |
| openseek-5 | semeval_2018_task1_tweet_sadness_detection | 30K | 500 |
| openseek-6 | mnli_same_genre_classification | 30K | 500 |
| openseek-7 | jeopardy_answer_generation_all | 30K | 500 |
| openseek-8 | kernel_generation | 15K | 166 |

## Data Structure
The datasets are organized in JSON format, with each task having its own JSON file. Here is a brief overview of the data structure:

- `task_id`: A unique identifier for the task.
- `task_name`: A short human-readable name for the task.
- `Definition`: A detailed description of what the model should do.
- `examples`: Demonstration samples intended for understanding the task format (not necessarily used for scoring). Each example typically includes: `id`, `input`, and `output`.
- `test_samples`: The samples to be predicted by participants. Labels/ground truth are hidden. Each test sample typically includes: `id` and `input`.
- `License`: The dataset license name and/or a URL to the license text, describing allowed use and redistribution.
## Usage Notes

- Participants must use the **official datasets as provided**, without altering test splits or labels, for leaderboard evaluation.
- Any preprocessing steps, context construction strategies, or example selection mechanisms should be clearly described in the accompanying technical report.
- All experimental results must be **fully reproducible** using the datasets in this repository.
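A minimal sketch of loading one task file and turning its test samples into prediction records; the inlined JSON, file naming, and field values are illustrative placeholders following the documented schema, not actual competition data:

```python
import json

# Hypothetical task file content matching the documented schema; in practice
# you would read it from disk, e.g. json.load(open("openseek-1.json")).
task = json.loads("""
{
  "task_id": "openseek-1",
  "task_name": "closest_integers",
  "Definition": "Given a list of integers, return the pair with the smallest difference.",
  "examples": [{"id": "ex-0", "input": "3 8 5", "output": "3 5"}],
  "test_samples": [{"id": "openseek-1-0001", "input": "12 7 9"}]
}
""")

# One prediction record per test sample; "..." stands in for a model output.
predictions = [
    {"test_sample_id": s["id"], "prediction": "..."}
    for s in task["test_samples"]
]
print(predictions[0]["test_sample_id"])
```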
# Dataset Description

This repository provides the official datasets for **LLM Automatic Data Annotation**.

These datasets are specifically designed to evaluate the ability of Large Language Models (LLMs) to perform automatic data annotation under ultra-long context settings using the In-context Learning (ICL) paradigm.

---

## Overview

- Most tasks require a **minimum ICL context length of 30K tokens**, close to the standard context limit of **Qwen3-4B**, to evaluate long-context understanding, prompt engineering, and example selection strategies.
- Task **openseek-8** is configured with a **shorter minimum context length (15K tokens)** and a **smaller test set**, reflecting the unique challenges of **kernel generation**.
- All datasets are released with **fixed and standardized test splits** to ensure fair comparison across submissions and reproducibility.
- The task suite covers a **diverse range of domains and reasoning types**, including symbolic reasoning, linguistic analysis, natural language inference, code-related tasks, and open-ended generation.

| Task ID | Task Name | Minimum ICL Context | Test Samples |
| --- | --- | --- | --- |
| openseek-1 | closest_integers | 30K | 500 |
| openseek-2 | count_nouns_verbs | 30K | 500 |
| openseek-3 | collatz_conjecture | 30K | 500 |
| openseek-4 | conala_concat_strings | 30K | 500 |
| openseek-5 | semeval_2018_task1_tweet_sadness_detection | 30K | 500 |
| openseek-6 | mnli_same_genre_classification | 30K | 500 |
| openseek-7 | jeopardy_answer_generation_all | 30K | 500 |
| openseek-8 | kernel_generation | 15K | 166 |

---

## Data Structure

The datasets are organized in `JSON` format, with each task in its own `.json` file. A brief overview of the data structure:

- `task_id`: A unique identifier for the task.
- `task_name`: A short, human-readable name for the task.
- `Definition`: A detailed description of what the model should do.
- `examples`: Demonstration samples for understanding the task format (not necessarily used for scoring). Each example includes: `id`, `input`, and `output`.
- `test_samples`: The samples participants must predict. Labels/ground truth are hidden. Each test sample includes: `id` and `input`.
- `License`: The dataset license name and/or a URL to the license text, describing permitted use and redistribution.

---

## Usage Notes

- For leaderboard evaluation, participants must use the **official datasets as provided**, without altering test splits or labels.
- Any preprocessing steps, context construction strategies, or example selection mechanisms should be clearly described in the accompanying technical report.
- All experimental results must be **fully reproducible** using the datasets in this repository.
## Submission Format (JSONL + ZIP)

Below is the standard format for submitting model predictions. Please save your predictions into **eight `.jsonl` files** (one per task), then **package them into a single `.zip` archive** and upload it to the FlagOS platform for automatic evaluation.

---

### 1) JSONL File Content

Each `.jsonl` file consists of multiple JSON objects (**one prediction per line**).
Each prediction must contain the following two fields:

- `test_sample_id`: corresponds to the sample `id` in the competition dataset.
- `prediction`: the model's predicted result for that sample.

**Single-line example:**
```json
{"test_sample_id":"openseek-1-ed5ac69191204cd4bfb0ca41bc7f197f","prediction":"..."}
```

### 2) ZIP Archive Requirements (Mandatory)
Each submission must upload **one** `.zip` file, and the archive must contain **8** prediction files:

- Each filename must start with `openseek-[id]` (e.g., `openseek-1*.jsonl`)
- All **8** tasks correspond to **8** `.jsonl` files for automated scoring.

> Recommendation: Make sure the `.zip` archive contains these **8** `.jsonl` files directly (no nested folders), and avoid including any unrelated extra files to prevent evaluation parsing issues.
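The packaging rules above can be sketched as a small script. The output filename `submission.zip`, the per-task filenames, and the placeholder predictions are illustrative assumptions, not mandated by the competition:

```python
import json
import zipfile

# Hypothetical predictions: one record per task here; a real submission would
# contain one record per test sample for each of the 8 tasks.
predictions_by_task = {
    f"openseek-{i}": [
        {"test_sample_id": f"openseek-{i}-0001", "prediction": "..."}
    ]
    for i in range(1, 9)
}

# Write one .jsonl file per task (one JSON object per line), then pack all 8
# files directly into the archive root, as the submission rules recommend.
with zipfile.ZipFile("submission.zip", "w") as zf:
    for task_id, preds in predictions_by_task.items():
        fname = f"{task_id}.jsonl"
        with open(fname, "w", encoding="utf-8") as f:
            for p in preds:
                f.write(json.dumps(p, ensure_ascii=False) + "\n")
        zf.write(fname)

print(sorted(zipfile.ZipFile("submission.zip").namelist()))
```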
The `api_test.py` script referenced in the Quick Start sends a few test prompts to the local completion endpoint:

```python
import requests

# Completion endpoint exposed by the FlagScale service started via run.py.
url = "http://0.0.0.0:2026/v1/completions"
prompts = [
    "Hello, FlagScale + vLLM!",
    "Translate 'Hello World' to Chinese.",
    "Write a short poem about autumn."
    # '用中文写一首短诗,诗句开头用<label>,结尾用</label>包裹起来'
    # (commented-out prompt: write a short Chinese poem wrapped in <label></label>)
]

for prompt in prompts:
    data = {
        "model": "../Qwen3-4B",
        "prompt": prompt,
        "max_tokens": 1000
    }

    resp = requests.post(url, json=data)
    print(f"Prompt: {prompt}")
    print("Response:", resp.json(), "\n")

print("*" * 50)
```