
LongContext-ICL-Annotation#176

Merged
Haiming94 merged 1 commit into main from LongContext-ICLAnnotation
Jan 22, 2026
Conversation

@Haiming94 (Collaborator)

Signed-off-by: Haiming94 <931284009@qq.com>
@gemini-code-assist

Summary of Changes

Hello @Haiming94, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the foundational elements for the 'LongContext-ICL-Annotation' challenge, focusing on automated data annotation by Large Language Models within ultra-long context settings. It provides extensive documentation in both English and Chinese, outlining the competition's objectives, rules, and technical requirements. Furthermore, it includes a complete set of baseline code for environment setup, model deployment, and an evaluation pipeline, enabling participants to quickly engage with the challenge and develop their solutions for effective in-context learning strategies.

Highlights

  • Competition Setup: Comprehensive documentation (English and Chinese) has been added for a new competition focused on LLM automatic data annotation in long-context scenarios, including objectives, rules, and technical requirements.
  • Baseline Implementation: A complete baseline Python code structure is introduced, covering prompt construction, in-context example selection, and annotation using a local LLM API (Qwen3-4B).
  • Environment and Deployment: New scripts are provided for setting up the development environment on NVIDIA platforms and deploying the Qwen3-4B model using vLLM for local inference.
  • Dataset and Submission Guidelines: Detailed descriptions of the competition datasets, including context length requirements and data structure, are now available, along with clear instructions for the submission format (JSONL + ZIP).


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Haiming94 Haiming94 merged commit 29d21e7 into main Jan 22, 2026
1 of 4 checks passed

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a comprehensive setup for a machine learning competition, including documentation, environment setup scripts, and baseline code. While the overall structure is good, there are several critical issues that need to be addressed to ensure the code is usable, reproducible, and maintainable. I've identified critical bugs in the environment setup script and hardcoded user-specific paths in the Python code that will prevent others from running it. Additionally, there are opportunities to improve code quality by removing dead code, making configurations more flexible, and improving error handling. I've also noted a recurring typo across multiple files.

Comment on lines +26 to +28
flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
# Recommend to download the wheel manually, for example flash_attn-2.8.3+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64
pip install flash_attn-2.8.3+cu124torch2.6-cp311-cp311-linux_x86_64.whl


Severity: critical

There's a critical bug in the flash_attn installation steps.

  1. Line 26 is just a filename and will cause a "command not found" error.
  2. Line 28 uses a hardcoded wheel filename, which may not match the version downloaded in the preceding wget command, especially on different systems.

This will likely break the environment setup. You should use the variables defined earlier to install the downloaded wheel and remove the erroneous line.

Suggested change
- flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
- # Recommend to download the wheel manually, for example flash_attn-2.8.3+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64
- pip install flash_attn-2.8.3+cu124torch2.6-cp311-cp311-linux_x86_64.whl
+ # Recommend to download the wheel manually, for example flash_attn-2.8.3+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64
+ pip install flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
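To illustrate why the hardcoded name can drift from the downloaded file, the expected wheel filename can be derived from the same variables the setup script defines. This is a sketch only; the concrete version values below are illustrative assumptions, not versions pinned by the project.

```python
# Sketch: build the flash-attn wheel filename from the version variables the
# setup script already defines, rather than hardcoding one.
# NOTE: these values are illustrative assumptions, not project-pinned versions.
flash_attn_version = "2.8.3"
cu = "12"      # CUDA tag used in the wheel name
torch = "2.6"  # torch tag used in the wheel name
cp = "311"     # CPython ABI tag (cp311)

wheel = (
    f"flash_attn-{flash_attn_version}+cu{cu}torch{torch}"
    f"-cp{cp}-cp{cp}-linux_x86_64.whl"
)
print(wheel)  # → flash_attn-2.8.3+cu12torch2.6-cp311-cp311-linux_x86_64.whl
```

With the filename computed this way, the `pip install` step always targets the same wheel that the preceding `wget` fetched.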

Comment on lines +148 to +161
def select_examples(all_examples: list[dict], task_description: str, text2annotate: str) -> str:
    """
    Select examples from all_examples to fit into the target context length (adapted for Qwen3-4B token counting).
    all_examples:
        A list of examples, where each example is a dict with keys 'input' and 'output' (no 'length' needed).
        For example, ``{"input": "The material is good and looks great.", "output": "Good Review"}``,
    task_description:
        The description of the annotation task which may be used for example evaluation.
    text2annotate:
        The text that needs to be annotated which may be used for example retrieval.
    """
    # Initialize the Qwen3-4B tokenizer (auto-downloads/loads the Qwen3-4B tokenizer)
    # If the model is already downloaded locally, a local path such as "./qwen3-4b" can be used instead
    tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True)


Severity: critical

The select_examples function hardcodes a user-specific absolute path to the tokenizer and re-initializes it on every call. This is inefficient and will cause the script to fail for other users.

To fix this, you should:

  1. Modify the function signature to accept a tokenizer object.
  2. Remove the hardcoded AutoTokenizer.from_pretrained(...) call.
  3. Update the call site in main.py to pass the already initialized qwen_tokenizer to this function.
Suggested change
- def select_examples(all_examples: list[dict], task_description: str, text2annotate: str) -> str:
+ def select_examples(tokenizer: AutoTokenizer, all_examples: list[dict], task_description: str, text2annotate: str) -> str:
      """
      Select examples from all_examples to fit into the target context length (adapted for Qwen3-4B token counting).
      all_examples:
          A list of examples, where each example is a dict with keys 'input' and 'output' (no 'length' needed).
          For example, ``{"input": "The material is good and looks great.", "output": "Good Review"}``,
      task_description:
          The description of the annotation task which may be used for example evaluation.
      text2annotate:
          The text that needs to be annotated which may be used for example retrieval.
      """
-     # Initialize the Qwen3-4B tokenizer (auto-downloads/loads the Qwen3-4B tokenizer)
-     # If the model is already downloaded locally, a local path such as "./qwen3-4b" can be used instead
-     tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True)

                        default='../outputs/',
                        help='Prefix path to save the evaluation logs.')
    parser.add_argument('--tokenizer_path', type=str,
                        default='/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B')


Severity: critical

The default path for the tokenizer is a hardcoded, user-specific absolute path. This will cause the script to fail for any other user. Please change this to a relative path or a more generic placeholder that instructs the user to provide their own path.

Suggested change
- default='/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B')
+ default='../Qwen3-4B')
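Putting the two fixes together, the wiring could look like the sketch below: the tokenizer path becomes a CLI flag with a portable default, the tokenizer is loaded once in `main`, and is then passed into `select_examples` as the review proposes. Function names and the `build_parser` helper are illustrative, not part of the baseline.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # CLI flags for the baseline runner; defaults are portable relative paths.
    parser = argparse.ArgumentParser(description="LongContext-ICL-Annotation baseline")
    parser.add_argument('--output_dir', type=str, default='../outputs/',
                        help='Prefix path to save the evaluation logs.')
    parser.add_argument('--tokenizer_path', type=str, default='../Qwen3-4B',
                        help='Local path or hub id of the Qwen3-4B tokenizer.')
    return parser

def main() -> None:
    args = build_parser().parse_args()
    # Load the tokenizer exactly once, then thread it through every call.
    # (Lazy import so the CLI can be inspected without transformers installed.)
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path, trust_remote_code=True)
    # examples = select_examples(tokenizer, all_examples, task_description, text2annotate)
```

This avoids both the hardcoded absolute path and the per-call tokenizer re-initialization.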

Comment on lines +241 to +245
try:
    resp = requests.post(URL, json=data)
    whole_result = resp.json()["choices"][0]["text"]
except Exception as e:
    whole_result = "None"


Severity: high

Catching a broad Exception is generally not recommended as it can hide unexpected errors. It's better to catch a more specific exception, like requests.exceptions.RequestException. Additionally, the error e is completely swallowed, which makes debugging very difficult. You should at least log or print the error.

Suggested change
- try:
-     resp = requests.post(URL, json=data)
-     whole_result = resp.json()["choices"][0]["text"]
- except Exception as e:
-     whole_result = "None"
+ try:
+     resp = requests.post(URL, json=data, timeout=60)  # Add a timeout
+     resp.raise_for_status()  # Raise an exception for bad status codes
+     whole_result = resp.json()["choices"][0]["text"]
+ except requests.exceptions.RequestException as e:
+     print(f"API request failed: {e}")
+     whole_result = "None"
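Going one step further than the suggestion, transient failures against a local vLLM server can also be retried before falling back to the baseline's `"None"` sentinel. A sketch, assuming the `requests` library; the helper name `annotate_with_retry` and the retry/backoff parameters are illustrative:

```python
import time
import requests

def annotate_with_retry(url: str, data: dict, retries: int = 3,
                        timeout: float = 60.0, backoff: float = 1.0) -> str:
    # Retry transient failures with linear backoff; return the baseline's
    # "None" sentinel only after all attempts are exhausted.
    for attempt in range(1, retries + 1):
        try:
            resp = requests.post(url, json=data, timeout=timeout)
            resp.raise_for_status()
            return resp.json()["choices"][0]["text"]
        except requests.exceptions.RequestException as e:
            print(f"API request failed (attempt {attempt}/{retries}): {e}")
            if attempt < retries:
                time.sleep(backoff * attempt)
    return "None"
```

Keeping the exception type narrowed to `RequestException` means genuine bugs (e.g. a `KeyError` from an unexpected response schema) still surface instead of being silently swallowed.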

@@ -0,0 +1,46 @@

git clone https://github.com/FlagOpen/FlagScale.git


Severity: high

The script clones the main branch of the FlagScale repository. This can lead to non-reproducible environment setups if the main branch changes. For reproducibility, it's crucial to pin this to a specific commit hash or a release tag.

Suggested change
- git clone https://github.com/FlagOpen/FlagScale.git
+ git clone https://github.com/FlagOpen/FlagScale.git --branch <tag>
+ # Note: --branch accepts a branch or tag only; to pin an exact commit,
+ # clone first and then run: git -C FlagScale checkout <commit_hash>

Comment on lines +3 to +16
url = "http://0.0.0.0:2026/v1/completions"
prompts = [
    "Hello, FlagScale + vLLM!",
    "Translate 'Hello World' to Chinese.",
    "Write a short poem about autumn."
    # 'Write a short poem in Chinese, with each line wrapped in <label> at the start and </label> at the end'
]

for prompt in prompts:
    data = {
        "model": "../Qwen3-4B",
        "prompt": prompt,
        "max_tokens": 1000
    }


Severity: medium

The URL and model path are hardcoded in this test script. This makes it less flexible and harder to maintain. Consider using command-line arguments (e.g., with argparse) or environment variables to make these values configurable.
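For a throwaway test script, environment variables with sensible defaults are a lightweight alternative to argparse. A sketch; the variable names `FLAGSCALE_URL` and `MODEL_PATH` are assumptions, not names the project defines:

```python
import os

# Fall back to the script's current hardcoded values when the variables are
# unset, so the existing workflow keeps working while other users can
# override per machine without editing the script.
url = os.environ.get("FLAGSCALE_URL", "http://0.0.0.0:2026/v1/completions")
model_path = os.environ.get("MODEL_PATH", "../Qwen3-4B")

data_template = {"model": model_path, "max_tokens": 1000}
```

Usage would then be e.g. `MODEL_PATH=/models/Qwen3-4B python test_api.py`.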

Comment on lines +78 to +83
# tokenized_input = qwen_tokenizer(input_prompt, return_tensors="pt")
# if tokenized_input['input_ids'].shape[1] > max_input_length:
#     test_record['prediction'] = None
# else:
#     prediction = annotate(input_prompt)
#     test_record['prediction'] = prediction


Severity: medium

This block of code is commented out and appears to be dead code. It should be removed to improve code clarity and maintainability.

Comment on lines +8 to +146
def build_prompt____(task_description: str, text2annotate: str) -> str:
    """
    Build a high-precision English prompt for long-context data annotation (optimized for Qwen3-4B).
    Core requirement: Final answer MUST be wrapped in <label> tags (no extra content outside tags).
    """
    prompt = (
        "### Role Definition\n"
        "You are a professional data annotation expert specializing in long-context text labeling. "
        "Your work must strictly comply with the following rules, with the highest priority given to output format accuracy.\n\n"

        "### Core Annotation Task\n"
        f"{task_description}\n\n"

        "### Non-Negotiable Annotation Rules (Highest Priority)\n"
        "1. **Final Output Mandate**: Your annotation result MUST be wrapped in <label> tags — NO text, symbols, spaces, or explanations are allowed outside the tags.\n"
        "2. **Internal Reasoning Permission**: You may perform logical reasoning, text analysis, or context comprehension internally (in your thought process), but NONE of these thoughts may appear in the final output.\n"
        "3. **Label Format Strictness**: <label> is the opening tag and </label> is the closing tag — they must appear in pairs, with NO extra spaces or characters inside the tags (e.g., <label> Good Review </label> is invalid).\n"
        "4. **Prohibited Outputs**: \n"
        "   - ❌ Prohibited: 'After analysis, this is a positive review: <label>Good Review</label>' (extra text outside tags)\n"
        "   - ❌ Prohibited: 'Bad Review' (missing <label> tags entirely)\n"
        "   - ❌ Prohibited: '<label>Bad Review' (unpaired/closing tag missing)\n\n"

        "### Correct vs. Incorrect Examples\n"
        "✅ Correct Example 1: <label>answer</label>\n"
        "✅ Correct Example 2: <label>Bad Review</label>\n"
        "❌ Incorrect Example 1: I think this review is negative → <label>Bad Review</label>\n"
        "❌ Incorrect Example 2: <label> Neutral Review </label> (extra spaces inside tags)\n"
        "❌ Incorrect Example 3: Neutral Review (no label tags)\n\n"

        "### Reference Annotation Examples\n"
        "{EXAMPLES}\n\n"

        "### Text to Annotate\n"
        f"{text2annotate}\n\n"

        "### Final Output Command (Re-emphasized)\n"
        "You may complete any internal reasoning process, but your FINAL OUTPUT MUST consist solely of the annotation result wrapped in <label> tags (no other content whatsoever).\n"
        "Annotation Result: "
    )
    return prompt

def build_prompt(task_description: str, text2annotate: str) -> str:
    """
    Construct a high-precision prompt for long-context data annotation (optimized for Qwen3-4B).
    task_description: Clear description of the annotation task (e.g., "Classify English product reviews as Good Review/Bad Review").
    text2annotate: The text to be annotated (single text or batch texts).
    """
    prompt = (
        "### Role Definition\n"
        "You are a professional data annotation expert specialized in long-context text labeling. "
        "Your work must strictly follow the task rules, fully learn from the provided examples, and ensure the final annotation result is 100% enclosed in <label> tags.\n\n"

        "### Core Task\n"
        f"{task_description}\n\n"

        "### Critical Annotation Guidelines\n"
        "1. **Example Learning Requirement**: Thoroughly analyze and fully learn from the annotation logic, format, and criteria in the Examples section. "
        "Your annotation must align with the style, judgment standards, and tag usage shown in the examples.\n"
        "2. **Thinking Process**: You may (and are encouraged to) explain your annotation reasoning step by step (e.g., key information extraction, judgment basis, rule matching).\n"
        "3. **Mandatory Output Rule**: Regardless of any thinking process you provide, your final annotation result MUST be enclosed in <label> tags (this is non-negotiable).\n"
        "   - Correct example: \n"
        "     Reasoning: This review mentions 'excellent quality' and 'very satisfied', which meets the criteria for a Good Review.\n"
        "     <label>Good Review</label>\n"
        "   - Wrong example 1 (missing tags): This review is negative.\n"
        "   - Wrong example 2 (incomplete tags): Bad Review</label>\n"
        "4. **Length Adaptation**: For long texts, maintain complete thinking process and ensure the final <label> tags contain the accurate annotation result (no truncation).\n\n"

        "### Examples (Must Be Fully Followed)\n"
        "[[EXAMPLES]]\n\n"

        "### Text to Annotate\n"
        f"{text2annotate}\n\n"

        "### Final Requirement Summary\n"
        "1. You can (and should) provide clear thinking process for your annotation.\n"
        "2. The final annotation result MUST be wrapped in <label> tags (no exceptions).\n"
        "3. All annotation logic must strictly follow the examples provided above.\n"
    )
    return prompt

def build_prompt_backup(task_description: str, text2annotate: str) -> str:
    """
    Construct the prompt for annotation based on the task description.
    task_description:
        The description of the annotation task.
        For example, ``Given an English language product review,
        determine if it is a Good Review or a Bad Review.``
    text2annotate:
        The text that needs to be annotated.
        For example, ``My son received this book as a gift. I was extremely disappointed.``
    """
    prompt = (
        "You are a data annotation assistant. "
        "Your task is to label the given texts according to the task description "
        "and annotation guidelines provided below.\n\n"
        f"[Task Description]\n {task_description}\n\n"
        "[Examples]\n {EXAMPLES}\n\n"
        "Please follow these instructions when labeling:\n"
        "1. **Output Format**: Annotate the text directly by wrapping each labeled "
        "span with <label> tags in the following format: <label> annotation result </label>.\n"
        # "2. Do not add any extra text, explanations, or commentary in the labeled spans.\n\n"
        f"[Task Description (repeat)] \n {task_description}\n\n"
        f"[Input Texts]\n {text2annotate}\n\n"
        "Please output the annotation results: "
    )
    return prompt

def select_examples_backup(all_examples: list[dict], task_description: str, text2annotate: str) -> str:
    """
    Select examples from all_examples to fit into the target context length.
    all_examples:
        A list of examples, where each example is a dict with keys 'input', 'output', and 'length'.
        For example, ``{"input": "The material is good and looks great.", "output": "Good Review", "length": 79}``,
    task_description:
        The description of the annotation task which may be used for example evaluation.
        For example, ``Given an English language product review,
        determine if it is a Good Review or a Bad Review.``
    text2annotate:
        The text that needs to be annotated which may be used for example retrieval.
        For example, ``My son received this book as a gift. I was extremely disappointed.``
    """
    # Notice that the maximum context length is restricted.
    target_length = 10_000

    input_list = [example['input'] for example in all_examples]
    output_list = [example['output'][0] for example in all_examples]
    length_list = [example['length'] for example in all_examples]

    # <label> has 2 tokens; </label> has 3 tokens; \n has 1 token; # has 1 token.
    examples_str, token_num = "", 0
    for i, (input_text, output_text, length) in enumerate(zip(input_list, output_list, length_list)):
        if length + token_num <= target_length:
            token_num += (length + 2 + 3 + 1 + 1)
            example_str = f"# {input_text} <label> {output_text} </label>\n"
            examples_str += example_str
        else:
            return examples_str, i
    return examples_str


Severity: medium

The functions build_prompt____, build_prompt_backup, and select_examples_backup appear to be unused backup or old versions of functions. They should be removed to clean up the code and avoid confusion.
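For readers keeping only the live `select_examples` path, its core greedy budget loop can be sketched independently of any concrete tokenizer. In the sketch below, `count_tokens` stands in for a real tokenizer's encode-and-count step, and the function name and 10_000 budget are illustrative assumptions mirroring the backup implementation quoted above:

```python
def select_examples_sketch(all_examples: list[dict], count_tokens, target_length: int = 10_000) -> str:
    # Greedily append formatted examples until the token budget is exhausted.
    examples_str, token_num = "", 0
    for example in all_examples:
        example_str = f"# {example['input']} <label> {example['output']} </label>\n"
        cost = count_tokens(example_str)
        if token_num + cost > target_length:
            break  # budget exceeded; stop instead of overflowing the context
        token_num += cost
        examples_str += example_str
    return examples_str

# Crude whitespace counter as a placeholder for a real tokenizer:
demo = select_examples_sketch(
    [{"input": "The material is good and looks great.", "output": "Good Review"}],
    count_tokens=lambda s: len(s.split()),
)
```

Measuring the cost of the fully formatted example string (rather than adding fixed per-tag constants) keeps the budget accurate for any tokenizer passed in.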

| openseek-5 | semeval_2018_task1_tweet_sadness_detection | 30K | 500 |
| openseek-6 | mnli_same_genre_classification | 30K | 500 |
| openseek-7 | jeopardy_answer_generation_all | 30K | 500 |
| openseek-8 | kernel_genernation | 15K | 166 |


Severity: medium

There is a typo in the task name kernel_genernation. It should be kernel_generation. This typo should be corrected for consistency across the project.
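Since the typo recurs across multiple files, the rename could be applied mechanically. A sketch using only the standard library; the helper name and glob pattern are assumptions:

```python
from pathlib import Path

def replace_in_tree(root: str, old: str, new: str, pattern: str = "*") -> int:
    # Rewrite every readable text file under root containing `old`;
    # return the number of files changed.
    changed = 0
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed += 1
    return changed

# e.g. replace_in_tree(".", "kernel_genernation", "kernel_generation")
```

Running this once at the repository root (and reviewing the diff before committing) keeps the task name consistent across docs, code, and dataset tables.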

Comment on lines +30 to +32
- `task_id`: A unique identifier for the task.
- "task_name": A short human-readable name of the task.
- `Definition`: A detailed description of what the model should do.


Severity: medium

The formatting for the field descriptions in the Data Structure section is inconsistent. "task_name" is enclosed in double quotes, while task_id and Definition are not. For consistency and clarity, I recommend removing the quotes from task_name or applying them to all field names.

Suggested change
  - `task_id`: A unique identifier for the task.
- - "task_name": A short human-readable name of the task.
+ - `task_name`: A short human-readable name of the task.
  - `Definition`: A detailed description of what the model should do.

@Haiming94 Haiming94 deleted the LongContext-ICLAnnotation branch January 22, 2026 08:34
