Signed-off-by: Haiming94 <931284009@qq.com>
Summary of Changes

Hello @Haiming94, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request establishes the foundational elements for the 'LongContext-ICL-Annotation' challenge, focusing on automated data annotation by Large Language Models in ultra-long-context settings. It provides extensive documentation in both English and Chinese, outlining the competition's objectives, rules, and technical requirements. Furthermore, it includes a complete set of baseline code for environment setup, model deployment, and an evaluation pipeline, enabling participants to quickly engage with the challenge and develop their solutions for effective in-context learning strategies.
Code Review
This pull request introduces a comprehensive setup for a machine learning competition, including documentation, environment setup scripts, and baseline code. While the overall structure is good, there are several critical issues that need to be addressed to ensure the code is usable, reproducible, and maintainable. I've identified critical bugs in the environment setup script and hardcoded user-specific paths in the Python code that will prevent others from running it. Additionally, there are opportunities to improve code quality by removing dead code, making configurations more flexible, and improving error handling. I've also noted a recurring typo across multiple files.
```shell
flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
# Recommend to download the wheel handly, for example flash_attn-2.8.3+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64
pip install flash_attn-2.8.3+cu124torch2.6-cp311-cp311-linux_x86_64.whl
```
There's a critical bug in the flash_attn installation steps.
- Line 26 is just a filename and will cause a "command not found" error.
- Line 28 uses a hardcoded wheel filename, which may not match the version downloaded by the preceding `wget` command, especially on different systems.

This will likely break the environment setup. You should use the variables defined earlier to install the downloaded wheel and remove the erroneous line.
```diff
-flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
 # Recommend to download the wheel handly, for example flash_attn-2.8.3+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64
-pip install flash_attn-2.8.3+cu124torch2.6-cp311-cp311-linux_x86_64.whl
+pip install flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
```
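If it helps, the fixed step can be sanity-checked by constructing the wheel filename once from the same variables the download uses; the concrete values below are assumptions taken from the example filename in this diff, not part of the PR:

```shell
# Sketch: derive the wheel filename from the same variables the wget step uses,
# so the install can never reference a different file than the one downloaded.
# The version/ABI values below are assumptions based on the example in this diff.
flash_attn_version=2.8.3
cu=124
torch=2.6
cp=311
wheel="flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl"
echo "${wheel}"
# pip install "${wheel}"   # run once the wheel has actually been downloaded
```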
```python
def select_examples(all_examples: list[dict], task_description: str, text2annotate: str) -> str:
    """
    Select examples from all_examples to fit into the target context length (token counting adapted to Qwen3-4B).
    all_examples:
        A list of examples, where each example is a dict with keys 'input' and 'output' (no 'length' needed).
        For example, ``{"input": "The material is good and looks great.", "output": "Good Review"}``,
    task_description:
        The description of the annotation task which may be used for example evaluation.
    text2annotate:
        The text that needs to be annotated which may be used for example retrieval.
    """
    # Initialize the Qwen3-4B tokenizer (downloads/loads it automatically)
    # If the model is already downloaded locally, replace this with a local path such as "./qwen3-4b"
    tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True)
```
The select_examples function hardcodes a user-specific absolute path to the tokenizer and re-initializes it on every call. This is inefficient and will cause the script to fail for other users.
To fix this, you should:
- Modify the function signature to accept a `tokenizer` object.
- Remove the hardcoded `AutoTokenizer.from_pretrained(...)` call.
- Update the call site in `main.py` to pass the already initialized `qwen_tokenizer` to this function.
```diff
-def select_examples(all_examples: list[dict], task_description: str, text2annotate: str) -> str:
+def select_examples(tokenizer: AutoTokenizer, all_examples: list[dict], task_description: str, text2annotate: str) -> str:
     """
     Select examples from all_examples to fit into the target context length (token counting adapted to Qwen3-4B).
     all_examples:
         A list of examples, where each example is a dict with keys 'input' and 'output' (no 'length' needed).
         For example, ``{"input": "The material is good and looks great.", "output": "Good Review"}``,
     task_description:
         The description of the annotation task which may be used for example evaluation.
     text2annotate:
         The text that needs to be annotated which may be used for example retrieval.
     """
-    # Initialize the Qwen3-4B tokenizer (downloads/loads it automatically)
-    # If the model is already downloaded locally, replace this with a local path such as "./qwen3-4b"
-    tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True)
```
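As a self-contained illustration of the pattern this review describes (initialize the tokenizer once at startup, then pass it down), here is a minimal sketch; `DummyTokenizer`, `budget`, and the greedy selection logic are illustrative placeholders, not the PR's actual implementation:

```python
# Sketch of dependency injection for the tokenizer: create it once and pass it
# into select_examples instead of re-loading it on every call. DummyTokenizer
# stands in for transformers.AutoTokenizer so the sketch stays self-contained.

class DummyTokenizer:
    """Stand-in for AutoTokenizer.from_pretrained(...)."""
    def __call__(self, text: str) -> dict:
        return {"input_ids": text.split()}  # crude whitespace "tokenization"

def select_examples(tokenizer, all_examples, task_description, text2annotate, budget=50):
    """Greedily add examples while the (approximate) token budget allows."""
    selected, used = [], 0
    for ex in all_examples:
        cost = len(tokenizer(ex["input"] + " " + ex["output"])["input_ids"])
        if used + cost > budget:
            break
        selected.append(ex)
        used += cost
    return "\n".join(f"# {ex['input']} <label> {ex['output']} </label>" for ex in selected)

# Call site (as main.py would do it): the tokenizer is created once and shared.
qwen_tokenizer = DummyTokenizer()
examples = [{"input": "Great quality.", "output": "Good Review"},
            {"input": "Broke in a day.", "output": "Bad Review"}]
print(select_examples(qwen_tokenizer, examples, "Classify reviews.", "Nice item."))
```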
```python
                        default='../outputs/',
                        help='Prefix path to save the evaluation logs.')
    parser.add_argument('--tokenizer_path', type=str,
                        default='/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B')
```
The default path for the tokenizer is a hardcoded, user-specific absolute path. This will cause the script to fail for any other user. Please change this to a relative path or a more generic placeholder that instructs the user to provide their own path.
```diff
-                        default='/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B')
+                        default='../Qwen3-4B')
```
```python
    try:
        resp = requests.post(URL, json=data)
        whole_result = resp.json()["choices"][0]["text"]
    except Exception as e:
        whole_result = "None"
```
Catching a broad `Exception` is generally not recommended as it can hide unexpected errors. It's better to catch a more specific exception, like `requests.exceptions.RequestException`. Additionally, the error `e` is completely swallowed, which makes debugging very difficult. You should at least log or print the error.
```diff
-    try:
-        resp = requests.post(URL, json=data)
-        whole_result = resp.json()["choices"][0]["text"]
-    except Exception as e:
-        whole_result = "None"
+    try:
+        resp = requests.post(URL, json=data, timeout=60)  # Add a timeout
+        resp.raise_for_status()  # Raise an exception for bad status codes
+        whole_result = resp.json()["choices"][0]["text"]
+    except requests.exceptions.RequestException as e:
+        print(f"API request failed: {e}")
+        whole_result = "None"
```
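Beyond narrowing the exception type, transient network failures could also be retried before falling back to "None". A minimal sketch of that pattern follows; the helper name and retry parameters are assumptions, not part of the PR:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(); on failure, back off exponentially and retry.

    In the real script, fn would wrap the requests.post(...) call and the
    except clause would narrow to requests.exceptions.RequestException.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:
            if attempt == attempts - 1:
                print(f"API request failed after {attempts} attempts: {e}")
                return "None"
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```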
```shell
git clone https://github.com/FlagOpen/FlagScale.git
```
The script clones the main branch of the FlagScale repository. This can lead to non-reproducible environment setups if the main branch changes. For reproducibility, it's crucial to pin this to a specific commit hash or a release tag.
```diff
-git clone https://github.com/FlagOpen/FlagScale.git
+git clone https://github.com/FlagOpen/FlagScale.git --branch <tag_or_commit_hash>
```
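Note that `git clone --branch` accepts branch or tag names but not arbitrary commit hashes. If pinning to an exact commit is preferred, a clone-then-checkout sketch (the hash is a placeholder to be filled in by the authors) would be:

```shell
git clone https://github.com/FlagOpen/FlagScale.git
cd FlagScale
git checkout <commit_hash>   # the revision the baseline was tested against
```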
```python
url = "http://0.0.0.0:2026/v1/completions"
prompts = [
    "Hello, FlagScale + vLLM!",
    "Translate 'Hello World' to Chinese.",
    "Write a short poem about autumn."
    # '用中文写一首短诗,诗句开头用<label>,结尾用</label>包裹起来'
]

for prompt in prompts:
    data = {
        "model": "../Qwen3-4B",
        "prompt": prompt,
        "max_tokens": 1000
    }
```
```python
# tokenized_input = qwen_tokenizer(input_prompt, return_tensors="pt")
# if tokenized_input['input_ids'].shape[1] > max_input_length:
#     test_record['prediction'] = None
# else:
#     prediction = annotate(input_prompt)
#     test_record['prediction'] = prediction
```
```python
def build_prompt____(task_description: str, text2annotate: str) -> str:
    """
    Build a high-precision English prompt for long-context data annotation (optimized for Qwen3-4B).
    Core requirement: Final answer MUST be wrapped in <label> tags (no extra content outside tags).
    """
    prompt = (
        "### Role Definition\n"
        "You are a professional data annotation expert specializing in long-context text labeling. "
        "Your work must strictly comply with the following rules, with the highest priority given to output format accuracy.\n\n"

        "### Core Annotation Task\n"
        f"{task_description}\n\n"

        "### Non-Negotiable Annotation Rules (Highest Priority)\n"
        "1. **Final Output Mandate**: Your annotation result MUST be wrapped in <label> tags — NO text, symbols, spaces, or explanations are allowed outside the tags.\n"
        "2. **Internal Reasoning Permission**: You may perform logical reasoning, text analysis, or context comprehension internally (in your thought process), but NONE of these thoughts may appear in the final output.\n"
        "3. **Label Format Strictness**: <label> is the opening tag and </label> is the closing tag — they must appear in pairs, with NO extra spaces or characters inside the tags (e.g., <label> Good Review </label> is invalid).\n"
        "4. **Prohibited Outputs**: \n"
        "   - ❌ Prohibited: 'After analysis, this is a positive review: <label>Good Review</label>' (extra text outside tags)\n"
        "   - ❌ Prohibited: 'Bad Review' (missing <label> tags entirely)\n"
        "   - ❌ Prohibited: '<label>Bad Review' (unpaired/closing tag missing)\n\n"

        "### Correct vs. Incorrect Examples\n"
        "✅ Correct Example 1: <label>answer</label>\n"
        "✅ Correct Example 2: <label>Bad Review</label>\n"
        "❌ Incorrect Example 1: I think this review is negative → <label>Bad Review</label>\n"
        "❌ Incorrect Example 2: <label> Neutral Review </label> (extra spaces inside tags)\n"
        "❌ Incorrect Example 3: Neutral Review (no label tags)\n\n"

        "### Reference Annotation Examples\n"
        "{EXAMPLES}\n\n"

        "### Text to Annotate\n"
        f"{text2annotate}\n\n"

        "### Final Output Command (Re-emphasized)\n"
        "You may complete any internal reasoning process, but your FINAL OUTPUT MUST consist solely of the annotation result wrapped in <label> tags (no other content whatsoever).\n"
        "Annotation Result: "
    )
    return prompt


def build_prompt(task_description: str, text2annotate: str) -> str:
    """
    Construct a high-precision prompt for long-context data annotation (optimized for Qwen3-4B).
    task_description: Clear description of the annotation task (e.g., "Classify English product reviews as Good Review/Bad Review").
    text2annotate: The text to be annotated (single text or batch texts).
    """
    prompt = (
        "### Role Definition\n"
        "You are a professional data annotation expert specialized in long-context text labeling. "
        "Your work must strictly follow the task rules, fully learn from the provided examples, and ensure the final annotation result is 100% enclosed in <label> tags.\n\n"

        "### Core Task\n"
        f"{task_description}\n\n"

        "### Critical Annotation Guidelines\n"
        "1. **Example Learning Requirement**: Thoroughly analyze and fully learn from the annotation logic, format, and criteria in the Examples section. "
        "Your annotation must align with the style, judgment standards, and tag usage shown in the examples.\n"
        "2. **Thinking Process**: You may (and are encouraged to) explain your annotation reasoning step by step (e.g., key information extraction, judgment basis, rule matching).\n"
        "3. **Mandatory Output Rule**: Regardless of any thinking process you provide, your final annotation result MUST be enclosed in <label> tags (this is non-negotiable).\n"
        "   - Correct example: \n"
        "     Reasoning: This review mentions 'excellent quality' and 'very satisfied', which meets the criteria for a Good Review.\n"
        "     <label>Good Review</label>\n"
        "   - Wrong example 1 (missing tags): This review is negative.\n"
        "   - Wrong example 2 (incomplete tags): Bad Review</label>\n"
        "4. **Length Adaptation**: For long texts, maintain complete thinking process and ensure the final <label> tags contain the accurate annotation result (no truncation).\n\n"

        "### Examples (Must Be Fully Followed)\n"
        "[[EXAMPLES]]\n\n"

        "### Text to Annotate\n"
        f"{text2annotate}\n\n"

        "### Final Requirement Summary\n"
        "1. You can (and should) provide clear thinking process for your annotation.\n"
        "2. The final annotation result MUST be wrapped in <label> tags (no exceptions).\n"
        "3. All annotation logic must strictly follow the examples provided above.\n"
    )
    return prompt


def build_prompt_backup(task_description:str, text2annotate:str)->str:
    """
    Construct the prompt for annotation based on the task description.
    task_description:
        The description of the annotation task.
        For example, ``Given an English language product review,
        determine if it is a Good Review or a Bad Review.``
    text2annotate:
        The text that needs to be annotated.
        For example, ``My son received this book as a gift. I was extremely disappointed.``
    """
    prompt = (
        "You are a data annotation assistant. "
        "Your task is to label the given texts according to the task description "
        "and annotation guidelines provided below.\n\n"
        f"[Task Description]\n {task_description}\n\n"
        "[Examples]\n {EXAMPLES}\n\n"
        "Please follow these instructions when labeling:\n"
        "1. **Output Format**: Annotate the text directly by wrapping each labeled "
        "span with <label> tags in the following format: <label> annotation result </label>.\n"
        # "2. Do not add any extra text, explanations, or commentary in the labeled spans.\n\n"
        f"[Task Description (repeat)] \n {task_description}\n\n"
        f"[Input Texts]\n {text2annotate}\n\n"
        "Please output the annotation results: "
    )
    return prompt


def select_examples_backup(all_examples:list[dict], task_description:str, text2annotate:str)->str:
    """
    Select examples from all_examples to fit into the target context length.
    all_examples:
        A list of examples, where each example is a dict with keys 'input', 'output', and 'length'.
        For example, ``{"input": "The material is good and looks great.", "output": "Good Review", "length": 79}``,
    task_description:
        The description of the annotation task which may be used for example evaluation.
        For example, ``Given an English language product review,
        determine if it is a Good Review or a Bad Review.``
    text2annotate:
        The text that needs to be annotated which may be used for example retrieval.
        For example, ``My son received this book as a gift. I was extremely disappointed.``
    """
    # Notice that the maximum context length is restricted.
    target_length = 10_000

    input_list = [example['input'] for example in all_examples]
    output_list = [example['output'][0] for example in all_examples]
    length_list = [example['length'] for example in all_examples]

    # <label> have 2 tokens; </label> have 3 tokens; \n have 1 token; # have 1 token.
    examples_str, token_num = "", 0
    for i, (input_text, output_text, length) in enumerate(zip(input_list, output_list, length_list)):
        if length + token_num <= target_length:
            token_num += (length + 2 + 3 + 1 + 1)
            example_str = f"# {input_text} <label> {output_text} </label>\n"
            examples_str += example_str
        else:
            return examples_str, i
    return examples_str
```
| openseek-5 | semeval_2018_task1_tweet_sadness_detection | 30K | 500 |
| openseek-6 | mnli_same_genre_classification | 30K | 500 |
| openseek-7 | jeopardy_answer_generation_all | 30K | 500 |
| openseek-8 | kernel_genernation | 15K | 166 |
- `task_id`: A unique identifier for the task.
- "task_name": A short human-readable name of the task.
- `Definition`: A detailed description of what the model should do.
The formatting for the field descriptions in the Data Structure section is inconsistent. "task_name" is enclosed in double quotes, while task_id and Definition are not. For consistency and clarity, I recommend removing the quotes from task_name or applying them to all field names.
```diff
 - `task_id`: A unique identifier for the task.
-- "task_name": A short human-readable name of the task.
+- `task_name`: A short human-readable name of the task.
 - `Definition`: A detailed description of what the model should do.
```