Signed-off-by: Haiming94 <931284009@qq.com>
Summary of Changes

Hello @Haiming94, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request establishes the foundational elements for the 'LongContext-ICL-Annotation' challenge, focusing on automated data annotation by Large Language Models in ultra-long-context settings. It provides extensive documentation in both English and Chinese, outlining the competition's objectives, rules, and technical requirements. Furthermore, it includes a complete set of baseline code for environment setup, model deployment, and an evaluation pipeline, enabling participants to quickly engage with the challenge and develop their solutions for effective in-context learning strategies.
Code Review
This pull request introduces a comprehensive setup for a machine learning competition, including documentation, environment setup scripts, and baseline code. While the overall structure is good, there are several critical issues that need to be addressed to ensure the code is usable, reproducible, and maintainable. I've identified critical bugs in the environment setup script and hardcoded user-specific paths in the Python code that will prevent others from running it. Additionally, there are opportunities to improve code quality by removing dead code, making configurations more flexible, and improving error handling. I've also noted a recurring typo across multiple files.
```shell
flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
# Recommend to download the wheel handly, for example flash_attn-2.8.3+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64
pip install flash_attn-2.8.3+cu124torch2.6-cp311-cp311-linux_x86_64.whl
```
There's a critical bug in the flash_attn installation steps.
- Line 26 is just a filename and will cause a "command not found" error.
- Line 28 uses a hardcoded wheel filename, which may not match the version downloaded by the preceding `wget` command, especially on different systems.

This will likely break the environment setup. You should use the variables defined earlier to install the downloaded wheel and remove the erroneous line.
```diff
-flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
 # Recommend to download the wheel handly, for example flash_attn-2.8.3+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64
-pip install flash_attn-2.8.3+cu124torch2.6-cp311-cp311-linux_x86_64.whl
+pip install flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl
```
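If it helps, the fixed step can be sanity-checked by constructing the wheel filename once from the same variables the download uses; the concrete values below are assumptions taken from the example filename in this diff, not part of the PR:

```shell
# Sketch: derive the wheel filename from the same variables the wget step uses,
# so the install can never reference a different file than the one downloaded.
# The version/ABI values below are assumptions based on the example in this diff.
flash_attn_version=2.8.3
cu=124
torch=2.6
cp=311
wheel="flash_attn-${flash_attn_version}+cu${cu}torch${torch}-cp${cp}-cp${cp}-linux_x86_64.whl"
echo "${wheel}"
# pip install "${wheel}"   # run once the wheel has actually been downloaded
```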
```python
def select_examples(all_examples: list[dict], task_description: str, text2annotate: str) -> str:
    """
    Select examples from all_examples to fit into the target context length (token counting adapted to Qwen3-4B).
    all_examples:
        A list of examples, where each example is a dict with keys 'input' and 'output' (no 'length' needed).
        For example, ``{"input": "The material is good and looks great.", "output": "Good Review"}``,
    task_description:
        The description of the annotation task which may be used for example evaluation.
    text2annotate:
        The text that needs to be annotated which may be used for example retrieval.
    """
    # Initialize the Qwen3-4B tokenizer (downloads/loads it automatically)
    # If the model is already downloaded locally, replace this with a local path such as "./qwen3-4b"
    tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True)
```
The select_examples function hardcodes a user-specific absolute path to the tokenizer and re-initializes it on every call. This is inefficient and will cause the script to fail for other users.
To fix this, you should:
- Modify the function signature to accept a `tokenizer` object.
- Remove the hardcoded `AutoTokenizer.from_pretrained(...)` call.
- Update the call site in `main.py` to pass the already initialized `qwen_tokenizer` to this function.
```diff
-def select_examples(all_examples: list[dict], task_description: str, text2annotate: str) -> str:
+def select_examples(tokenizer: AutoTokenizer, all_examples: list[dict], task_description: str, text2annotate: str) -> str:
     """
     Select examples from all_examples to fit into the target context length (token counting adapted to Qwen3-4B).
     all_examples:
         A list of examples, where each example is a dict with keys 'input' and 'output' (no 'length' needed).
         For example, ``{"input": "The material is good and looks great.", "output": "Good Review"}``,
     task_description:
         The description of the annotation task which may be used for example evaluation.
     text2annotate:
         The text that needs to be annotated which may be used for example retrieval.
     """
-    # Initialize the Qwen3-4B tokenizer (downloads/loads it automatically)
-    # If the model is already downloaded locally, replace this with a local path such as "./qwen3-4b"
-    tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True)
```
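As a self-contained illustration of the pattern this review describes (initialize the tokenizer once at startup, then pass it down), here is a minimal sketch; `DummyTokenizer`, `budget`, and the greedy selection logic are illustrative placeholders, not the PR's actual implementation:

```python
# Sketch of dependency injection for the tokenizer: create it once and pass it
# into select_examples instead of re-loading it on every call. DummyTokenizer
# stands in for transformers.AutoTokenizer so the sketch stays self-contained.

class DummyTokenizer:
    """Stand-in for AutoTokenizer.from_pretrained(...)."""
    def __call__(self, text: str) -> dict:
        return {"input_ids": text.split()}  # crude whitespace "tokenization"

def select_examples(tokenizer, all_examples, task_description, text2annotate, budget=50):
    """Greedily add examples while the (approximate) token budget allows."""
    selected, used = [], 0
    for ex in all_examples:
        cost = len(tokenizer(ex["input"] + " " + ex["output"])["input_ids"])
        if used + cost > budget:
            break
        selected.append(ex)
        used += cost
    return "\n".join(f"# {ex['input']} <label> {ex['output']} </label>" for ex in selected)

# Call site (as main.py would do it): the tokenizer is created once and shared.
qwen_tokenizer = DummyTokenizer()
examples = [{"input": "Great quality.", "output": "Good Review"},
            {"input": "Broke in a day.", "output": "Bad Review"}]
print(select_examples(qwen_tokenizer, examples, "Classify reviews.", "Nice item."))
```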
```python
                        default='../outputs/',
                        help='Prefix path to save the evaluation logs.')
    parser.add_argument('--tokenizer_path', type=str,
                        default='/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B')
```
The default path for the tokenizer is a hardcoded, user-specific absolute path. This will cause the script to fail for any other user. Please change this to a relative path or a more generic placeholder that instructs the user to provide their own path.
```diff
-                        default='/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B')
+                        default='../Qwen3-4B')
```
```python
    try:
        resp = requests.post(URL, json=data)
        whole_result = resp.json()["choices"][0]["text"]
    except Exception as e:
        whole_result = "None"
```
Catching a broad `Exception` is generally not recommended as it can hide unexpected errors. It's better to catch a more specific exception, like `requests.exceptions.RequestException`. Additionally, the error `e` is completely swallowed, which makes debugging very difficult. You should at least log or print the error.
```diff
-    try:
-        resp = requests.post(URL, json=data)
-        whole_result = resp.json()["choices"][0]["text"]
-    except Exception as e:
-        whole_result = "None"
+    try:
+        resp = requests.post(URL, json=data, timeout=60)  # Add a timeout
+        resp.raise_for_status()  # Raise an exception for bad status codes
+        whole_result = resp.json()["choices"][0]["text"]
+    except requests.exceptions.RequestException as e:
+        print(f"API request failed: {e}")
+        whole_result = "None"
```
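Beyond narrowing the exception type, transient network failures could also be retried before falling back to "None". A minimal sketch of that pattern follows; the helper name and retry parameters are assumptions, not part of the PR:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(); on failure, back off exponentially and retry.

    In the real script, fn would wrap the requests.post(...) call and the
    except clause would narrow to requests.exceptions.RequestException.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:
            if attempt == attempts - 1:
                print(f"API request failed after {attempts} attempts: {e}")
                return "None"
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```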
```shell
git clone https://github.com/FlagOpen/FlagScale.git
```
The script clones the main branch of the FlagScale repository. This can lead to non-reproducible environment setups if the main branch changes. For reproducibility, it's crucial to pin this to a specific commit hash or a release tag.
```diff
-git clone https://github.com/FlagOpen/FlagScale.git
+git clone https://github.com/FlagOpen/FlagScale.git --branch <tag_or_commit_hash>
```
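Note that `git clone --branch` accepts branch or tag names but not arbitrary commit hashes. If pinning to an exact commit is preferred, a clone-then-checkout sketch (the hash is a placeholder to be filled in by the authors) would be:

```shell
git clone https://github.com/FlagOpen/FlagScale.git
cd FlagScale
git checkout <commit_hash>   # the revision the baseline was tested against
```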
```python
url = "http://0.0.0.0:2026/v1/completions"
prompts = [
    "Hello, FlagScale + vLLM!",
    "Translate 'Hello World' to Chinese.",
    "Write a short poem about autumn."
    # '用中文写一首短诗,诗句开头用<label>,结尾用</label>包裹起来'
]

for prompt in prompts:
    data = {
        "model": "../Qwen3-4B",
        "prompt": prompt,
        "max_tokens": 1000
    }
```
```python
# tokenized_input = qwen_tokenizer(input_prompt, return_tensors="pt")
# if tokenized_input['input_ids'].shape[1] > max_input_length:
#     test_record['prediction'] = None
# else:
#     prediction = annotate(input_prompt)
#     test_record['prediction'] = prediction
```
```python
def build_prompt____(task_description: str, text2annotate: str) -> str:
    """
    Build a high-precision English prompt for long-context data annotation (optimized for Qwen3-4B).
    Core requirement: Final answer MUST be wrapped in <label> tags (no extra content outside tags).
    """
    prompt = (
        "### Role Definition\n"
        "You are a professional data annotation expert specializing in long-context text labeling. "
        "Your work must strictly comply with the following rules, with the highest priority given to output format accuracy.\n\n"

        "### Core Annotation Task\n"
        f"{task_description}\n\n"

        "### Non-Negotiable Annotation Rules (Highest Priority)\n"
        "1. **Final Output Mandate**: Your annotation result MUST be wrapped in <label> tags — NO text, symbols, spaces, or explanations are allowed outside the tags.\n"
        "2. **Internal Reasoning Permission**: You may perform logical reasoning, text analysis, or context comprehension internally (in your thought process), but NONE of these thoughts may appear in the final output.\n"
        "3. **Label Format Strictness**: <label> is the opening tag and </label> is the closing tag — they must appear in pairs, with NO extra spaces or characters inside the tags (e.g., <label> Good Review </label> is invalid).\n"
        "4. **Prohibited Outputs**: \n"
        "   - ❌ Prohibited: 'After analysis, this is a positive review: <label>Good Review</label>' (extra text outside tags)\n"
        "   - ❌ Prohibited: 'Bad Review' (missing <label> tags entirely)\n"
        "   - ❌ Prohibited: '<label>Bad Review' (unpaired/closing tag missing)\n\n"

        "### Correct vs. Incorrect Examples\n"
        "✅ Correct Example 1: <label>answer</label>\n"
        "✅ Correct Example 2: <label>Bad Review</label>\n"
        "❌ Incorrect Example 1: I think this review is negative → <label>Bad Review</label>\n"
        "❌ Incorrect Example 2: <label> Neutral Review </label> (extra spaces inside tags)\n"
        "❌ Incorrect Example 3: Neutral Review (no label tags)\n\n"

        "### Reference Annotation Examples\n"
        "{EXAMPLES}\n\n"

        "### Text to Annotate\n"
        f"{text2annotate}\n\n"

        "### Final Output Command (Re-emphasized)\n"
        "You may complete any internal reasoning process, but your FINAL OUTPUT MUST consist solely of the annotation result wrapped in <label> tags (no other content whatsoever).\n"
        "Annotation Result: "
    )
    return prompt


def build_prompt(task_description: str, text2annotate: str) -> str:
    """
    Construct a high-precision prompt for long-context data annotation (optimized for Qwen3-4B).
    task_description: Clear description of the annotation task (e.g., "Classify English product reviews as Good Review/Bad Review").
    text2annotate: The text to be annotated (single text or batch texts).
    """
    prompt = (
        "### Role Definition\n"
        "You are a professional data annotation expert specialized in long-context text labeling. "
        "Your work must strictly follow the task rules, fully learn from the provided examples, and ensure the final annotation result is 100% enclosed in <label> tags.\n\n"

        "### Core Task\n"
        f"{task_description}\n\n"

        "### Critical Annotation Guidelines\n"
        "1. **Example Learning Requirement**: Thoroughly analyze and fully learn from the annotation logic, format, and criteria in the Examples section. "
        "Your annotation must align with the style, judgment standards, and tag usage shown in the examples.\n"
        "2. **Thinking Process**: You may (and are encouraged to) explain your annotation reasoning step by step (e.g., key information extraction, judgment basis, rule matching).\n"
        "3. **Mandatory Output Rule**: Regardless of any thinking process you provide, your final annotation result MUST be enclosed in <label> tags (this is non-negotiable).\n"
        "   - Correct example: \n"
        "     Reasoning: This review mentions 'excellent quality' and 'very satisfied', which meets the criteria for a Good Review.\n"
        "     <label>Good Review</label>\n"
        "   - Wrong example 1 (missing tags): This review is negative.\n"
        "   - Wrong example 2 (incomplete tags): Bad Review</label>\n"
        "4. **Length Adaptation**: For long texts, maintain complete thinking process and ensure the final <label> tags contain the accurate annotation result (no truncation).\n\n"

        "### Examples (Must Be Fully Followed)\n"
        "[[EXAMPLES]]\n\n"

        "### Text to Annotate\n"
        f"{text2annotate}\n\n"

        "### Final Requirement Summary\n"
        "1. You can (and should) provide clear thinking process for your annotation.\n"
        "2. The final annotation result MUST be wrapped in <label> tags (no exceptions).\n"
        "3. All annotation logic must strictly follow the examples provided above.\n"
    )
    return prompt


def build_prompt_backup(task_description:str, text2annotate:str)->str:
    """
    Construct the prompt for annotation based on the task description.
    task_description:
        The description of the annotation task.
        For example, ``Given an English language product review,
        determine if it is a Good Review or a Bad Review.``
    text2annotate:
        The text that needs to be annotated.
        For example, ``My son received this book as a gift. I was extremely disappointed.``
    """
    prompt = (
        "You are a data annotation assistant. "
        "Your task is to label the given texts according to the task description "
        "and annotation guidelines provided below.\n\n"
        f"[Task Description]\n {task_description}\n\n"
        "[Examples]\n {EXAMPLES}\n\n"
        "Please follow these instructions when labeling:\n"
        "1. **Output Format**: Annotate the text directly by wrapping each labeled "
        "span with <label> tags in the following format: <label> annotation result </label>.\n"
        # "2. Do not add any extra text, explanations, or commentary in the labeled spans.\n\n"
        f"[Task Description (repeat)] \n {task_description}\n\n"
        f"[Input Texts]\n {text2annotate}\n\n"
        "Please output the annotation results: "
    )
    return prompt


def select_examples_backup(all_examples:list[dict], task_description:str, text2annotate:str)->str:
    """
    Select examples from all_examples to fit into the target context length.
    all_examples:
        A list of examples, where each example is a dict with keys 'input', 'output', and 'length'.
        For example, ``{"input": "The material is good and looks great.", "output": "Good Review", "length": 79}``,
    task_description:
        The description of the annotation task which may be used for example evaluation.
        For example, ``Given an English language product review,
        determine if it is a Good Review or a Bad Review.``
    text2annotate:
        The text that needs to be annotated which may be used for example retrieval.
        For example, ``My son received this book as a gift. I was extremely disappointed.``
    """
    # Notice that the maximum context length is restricted.
    target_length = 10_000

    input_list = [example['input'] for example in all_examples]
    output_list = [example['output'][0] for example in all_examples]
    length_list = [example['length'] for example in all_examples]

    # <label> have 2 tokens; </label> have 3 tokens; \n have 1 token; # have 1 token.
    examples_str, token_num = "", 0
    for i, (input_text, output_text, length) in enumerate(zip(input_list, output_list, length_list)):
        if length + token_num <= target_length:
            token_num += (length + 2 + 3 + 1 + 1)
            example_str = f"# {input_text} <label> {output_text} </label>\n"
            examples_str += example_str
        else:
            return examples_str, i
    return examples_str
```
| openseek-5 | semeval_2018_task1_tweet_sadness_detection | 30K | 500 |
| openseek-6 | mnli_same_genre_classification | 30K | 500 |
| openseek-7 | jeopardy_answer_generation_all | 30K | 500 |
| openseek-8 | kernel_genernation | 15K | 166 |
- `task_id`: A unique identifier for the task.
- "task_name": A short human-readable name of the task.
- `Definition`: A detailed description of what the model should do.
The formatting for the field descriptions in the Data Structure section is inconsistent. "task_name" is enclosed in double quotes, while task_id and Definition are not. For consistency and clarity, I recommend removing the quotes from task_name or applying them to all field names.
```diff
 - `task_id`: A unique identifier for the task.
-- "task_name": A short human-readable name of the task.
+- `task_name`: A short human-readable name of the task.
 - `Definition`: A detailed description of what the model should do.
```