163 changes: 163 additions & 0 deletions coda_lm_eval/README.md
# CODA-LM
[![arXiv](https://img.shields.io/badge/arXiv-2404.10595-b31b1b.svg?style=plastic)](https://arxiv.org/abs/2404.10595) [![arXiv](https://img.shields.io/badge/Web-CODA_LM-blue.svg?style=plastic)](https://coda-dataset.github.io/coda-lm/)

This repository contains the implementation of the paper:

> Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases <br>
> [Kai Chen](https://kaichen1998.github.io)\*, [Yanze Li]()\*, [Wenhua Zhang]()\*, [Yanxin Liu](), [Pengxiang Li](https://scholar.google.com/citations?user=rUp_4RgAAAAJ&hl=en), [Ruiyuan Gao](https://gaoruiyuan.com/), [Lanqing Hong](https://scholar.google.com.sg/citations?user=2p7x6OUAAAAJ&hl=en)†, [Meng Tian](), [Xinhai Zhao](), [Zhenguo Li](https://scholar.google.com/citations?user=XboZC1AAAAAJ&hl=en&oi=ao), [Dit-Yan Yeung](https://sites.google.com/view/dyyeung), [Huchuan Lu](https://scholar.google.com/citations?user=D3nE0agAAAAJ&hl=en), [Xu Jia](https://stephenjia.github.io/)† <br>
> *Equal Contribution †Corresponding Author

<p align="center">
<img src="./images/overview.png" style="width: 65%; margin: 0 auto; text-align: center"/>
</p>

## Data Preparation

Download CODA-LM as follows:

1. Download the image files following the official CODA instructions [here](https://coda-dataset.github.io/download.html#instructions).
2. Download the CODA-LM annotation files and decompress them in the same root directory.

| Split | Size | Image Source | Download |
| :---: | :--: | :-----------: | :--------: |
| Train | 4884 | CODA2022 val | [HF Hub](https://huggingface.co/datasets/KaiChen1998/coda-lm) |
| Val | 4384 | CODA2022 test | [HF Hub](https://huggingface.co/datasets/KaiChen1998/coda-lm) |
| Test | 500 | CODA2022 test | [HF Hub](https://huggingface.co/datasets/KaiChen1998/coda-lm) |
| Mini | 50 | CODA2022 test | [HF Hub](https://huggingface.co/datasets/KaiChen1998/coda-lm) |

Note that:

1. Images of **CODA-LM train** set come from **CODA2022 val** set, while images of **CODA-LM val and test** sets come from **CODA2022 test** set.
2. **CODA-LM mini** set is a 50-image subset of **CODA-LM val** set for demonstration.

After decompression, the data is organized as follows:

```
├── val                      -- CODA2022 val (we only use images)
│   └── images
│       └── *.jpg
├── test                     -- CODA2022 test (we only use images)
│   └── images
│       └── *.jpg
├── CODA-LM
│   ├── Train                -- CODA-LM train (4884 images from CODA2022 val)
│   │   └── val_*.json
│   ├── Val                  -- CODA-LM val (4384 images from CODA2022 test)
│   │   └── test_*.json
│   ├── Test                 -- CODA-LM test (500 images from CODA2022 test)
│   │   └── test_*.json
│   └── Mini                 -- CODA-LM mini (a 50-image subset of CODA-LM val)
│       └── test_*.json
```
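Before running any scripts, it can help to sanity-check the layout above. A minimal sketch (the `missing_dirs` helper is ours, not part of the repository):

```python
from pathlib import Path

# Sub-directories expected after decompression, per the layout above
EXPECTED = [
    "val/images",
    "test/images",
    "CODA-LM/Train",
    "CODA-LM/Val",
    "CODA-LM/Test",
    "CODA-LM/Mini",
]

def missing_dirs(coda_root):
    """Return the expected sub-directories that are absent under coda_root."""
    root = Path(coda_root)
    return [d for d in EXPECTED if not (root / d).is_dir()]

if __name__ == "__main__":
    missing = missing_dirs(".")
    print("Missing:", ", ".join(missing) if missing else "none")
```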



## Data Format

Each annotation file contains question-answering pairs for all three tasks, as follows:

```
{
    "general_perception": {
        "vehicles": [                                 -- list of information on all vehicles
            {
                "description": <str>,                 -- description of a single vehicle
                "explanation": <str>                  -- explanation of why it affects the ego car
            },
            ...
        ],
        "vulnerable_road_users": [...],               -- list of information on all VRUs
        "traffic signs": [...],                       -- list of information on all traffic signs
        "traffic lights": [...],                      -- list of information on all traffic lights
        "traffic cones": [...],                       -- list of information on all traffic cones
        "barriers": [...],                            -- list of information on all barriers
        "other objects": [...],                       -- list of information on all other objects
        "description and explanation": <str>          -- summary of information across all categories
    },
    "region_perception": {
        "1": {                                        -- region index
            "description and explanation": <str>,     -- description of road users in this region, with an explanation of why they affect the ego car
            "box": <list of float>,                   -- xywh coordinates
            "category_name": <str>                    -- object category
        },
        "2": {...},
        "3": {...}
    },
    "driving_suggestion": <str>
}
```
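For illustration, the fields above can be read with a few lines of standard-library code (a sketch; `summarize_annotation` is a hypothetical helper, not part of the repository):

```python
import json

def summarize_annotation(path):
    """Load one CODA-LM annotation file and pull out the three task answers."""
    with open(path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    general = ann["general_perception"]["description and explanation"]
    # region index -> (category, xywh box)
    regions = {idx: (r["category_name"], r["box"])
               for idx, r in ann["region_perception"].items()}
    suggestion = ann["driving_suggestion"]
    return general, regions, suggestion
```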

## Data Usage
To help users better understand the structure of CODA-LM, we provide a Python script that converts our annotations into a basic VQA format, as follows:

1. Download the data and make sure the directory organization follows [Data Preparation](https://github.com/DLUT-LYZ/CODA-LM?tab=readme-ov-file#data-preparation).

2. Run `convert2vqa.py` as follows:

```bash
# English
python convert2vqa.py --coda_root $CODA_ROOT --codalm_ann_name CODA-LM

# Chinese
python convert2vqa.py --coda_root $CODA_ROOT --codalm_ann_name CODA-LM-chinese
```

3. After that, the data organization will be:

```
├── val
│   ├── images
│   └── images_w_boxes       -- images with bboxes drawn for region perception
│       └── *.jpg
├── test
│   ├── images
│   └── images_w_boxes       -- images with bboxes drawn for region perception
│       └── *.jpg
├── CODA-LM
│   ├── Train
│   │   └── vqa_anno
│   │       ├── general_perception.jsonl     -- VQA annotations for general perception
│   │       ├── region_perception.jsonl      -- VQA annotations for region perception
│   │       └── driving_suggestion.jsonl     -- VQA annotations for driving suggestion
│   ├── Val
│   │   └── vqa_anno
│   ├── Test
│   │   └── vqa_anno
│   └── Mini
│       └── vqa_anno
```

4. The basic VQA format stores each data sample as a dictionary containing `question_id`, `image`, `question`, and `answer`, as follows:

```
{"question_id": 0, "image": val/images/0001.jpg, "question": <str>, "answer": <str>}
{"question_id": 1, "image": val/images/0002.jpg, "question": <str>, "answer": <str>}
{"question_id": 2, "image": val/images/0003.jpg, "question": <str>, "answer": <str>}
...
```

5. Note that for region perception, there are several possible ways to utilize the bbox annotations. Here we provide a simple implementation that draws each bbox as a red rectangle on the image; the results are saved in the `images_w_boxes` directory.

![img](images/visual.png)
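The JSONL files produced above can be consumed with a few lines of standard-library code (a sketch; the `load_vqa_jsonl` function name is ours):

```python
import json

def load_vqa_jsonl(path):
    """Read one converted VQA file: one JSON object per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage (path assumes the layout above):
# samples = load_vqa_jsonl("CODA-LM/Val/vqa_anno/general_perception.jsonl")
# print(samples[0]["question"], samples[0].get("answer"))
```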


## Data Annotation

Check [CODA-LM Annotation Tool](./labeling) for more details.


## Evaluation

Check [CODA-LM Evaluation](./evaluation) for more details.



## Citation

```bibtex
@article{li2024automated,
title={Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases},
author={Li, Yanze and Zhang, Wenhua and Chen, Kai and Liu, Yanxin and Li, Pengxiang and Gao, Ruiyuan and Hong, Lanqing and Tian, Meng and Zhao, Xinhai and Li, Zhenguo and others},
journal={arXiv preprint arXiv:2404.10595},
year={2024}
}
```

115 changes: 115 additions & 0 deletions coda_lm_eval/convert2vqa.py
from PIL import Image, ImageDraw
import os
import numpy as np
import json
import argparse
from tqdm import tqdm

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--coda_root", type=str, default="./")
    parser.add_argument("--codalm_ann_name", type=str, default="CODA-LM", choices=["CODA-LM", "CODA-LM-chinese"])
    args = parser.parse_args()

    # user notice for directory structure
    print("=======================\nBefore starting, please make sure your directory has been organized as in https://github.com/DLUT-LYZ/CODA-LM?tab=readme-ov-file#data-preparation\n=======================\n")

    ########################
    # Start pre-processing
    ########################
    ann_root = os.path.join(args.coda_root, args.codalm_ann_name)
    for split in os.listdir(ann_root):
        assert split in ['Train', 'Val', 'Test', 'Mini']
        split_root = os.path.join(ann_root, split)
        json_list = sorted([each for each in os.listdir(split_root) if each.endswith('.json')])

        stage1_index = -1
        stage2_index = -1
        stage3_index = -1
        stage1_all_data = []
        stage2_all_data = []
        stage3_all_data = []
        for json_name in tqdm(json_list, desc=split):
            with open(os.path.join(split_root, json_name), 'r', encoding='utf-8') as f:
                json_data = json.load(f)
            img_dir = json_name.split("_")[0]
            img_name = json_name.split("_")[1][:-5] + '.jpg'

            ########################
            # Step 1: Prepare general perception data
            ########################
            stage1_index += 1
            stage1_data = dict(
                question_id=stage1_index,
                image=os.path.join(img_dir, 'images', img_name),
                question="There is an image of traffic captured from the perspective of the ego car. Focus on objects influencing the ego car's driving behavior: vehicles (cars, trucks, buses, etc.), vulnerable road users (pedestrians, cyclists, motorcyclists), traffic signs (no parking, warning, directional, etc.), traffic lights (red, green, yellow), traffic cones, barriers, miscellaneous(debris, dustbin, animals, etc.). You must not discuss any objects beyond the seven categories above. Please describe each object's appearance, position, direction, and explain why it affects the ego car's behavior."
            )
            if split != 'Test':
                stage1_data['answer'] = json_data['general_perception']['description and explanation']
            stage1_all_data.append(stage1_data)

            ########################
            # Step 2: Prepare driving suggestion data
            ########################
            stage2_index += 1
            stage2_data = dict(
                question_id=stage2_index,
                image=os.path.join(img_dir, 'images', img_name),
                question="There is an image of traffic captured from the perspective of the ego car. Focus on objects influencing the ego car's driving behavior: vehicles (cars, trucks, buses, etc.), vulnerable road users (pedestrians, cyclists, motorcyclists), traffic signs (no parking, warning, directional, etc.), traffic lights (red, green, yellow), traffic cones, barriers, miscellaneous(debris, dustbin, animals, etc.). You must not discuss any objects beyond the seven categories above. Please provide driving suggestions for the ego car based on the current scene."
            )
            if split != 'Test':
                stage2_data['answer'] = json_data['driving_suggestion']
            stage2_all_data.append(stage2_data)

            ########################
            # Step 3: Prepare region perception data
            ########################
            regional_data = json_data["region_perception"]
            for key, value in regional_data.items():
                # prepare image paths
                output_root = os.path.join(args.coda_root, img_dir, 'images_w_boxes')
                output_path = os.path.join(output_root, "{}_object_{}.jpg".format(json_name.split("_")[1][:-5], key))
                os.makedirs(output_root, exist_ok=True)

                # prepare images
                if not os.path.exists(output_path):
                    img = Image.open(os.path.join(args.coda_root, img_dir, 'images', img_name))
                    draw = ImageDraw.Draw(img)
                    rect = [value['box'][0], value['box'][1], value['box'][0] + value['box'][2], value['box'][1] + value['box'][3]]
                    draw.rectangle(rect, outline="red", width=2)
                    img.save(output_path)

                # prepare annotation
                stage3_index += 1
                stage3_data = dict(
                    question_id=stage3_index,
                    image=os.path.join(img_dir, 'images_w_boxes', "{}_object_{}.jpg".format(json_name.split("_")[1][:-5], key)),
                    question="Please describe the object inside the red rectangle in the image and explain why it affect ego car driving."
                )
                if split != 'Test':
                    stage3_data['answer'] = value['description and explanation']
                stage3_all_data.append(stage3_data)

        ########################
        # Step 4: Save annotation
        ########################
        save_root = os.path.join(split_root, 'vqa_anno')
        os.makedirs(save_root, exist_ok=True)

        # save stage1
        with open(os.path.join(save_root, 'general_perception.jsonl'), 'w') as file:
            for entry in stage1_all_data:
                json_str = json.dumps(entry)
                file.write(json_str + '\n')

        # save stage2
        with open(os.path.join(save_root, 'driving_suggestion.jsonl'), 'w') as file:
            for entry in stage2_all_data:
                json_str = json.dumps(entry)
                file.write(json_str + '\n')

        # save stage3
        with open(os.path.join(save_root, 'region_perception.jsonl'), 'w') as file:
            for entry in stage3_all_data:
                json_str = json.dumps(entry)
                file.write(json_str + '\n')
50 changes: 50 additions & 0 deletions coda_lm_eval/evaluation/README.md
# CODA-LM Evaluation

## Data Preparation

1. First, follow the instructions on [CODA-LM submission](https://coda-dataset.github.io/w-coda2024/track1/index.html#submission) to run inference with your LVLM and save the results as requested: three `jsonl` files, one per task.
2. If you are evaluating on a subset of the CODA-LM validation set (e.g., the CODA-LM Mini set), copy the corresponding ground-truth JSON files into a separate `$ROOT_TO_GT` directory such as `./CODA-LM/Mini`. Setting `$ROOT_TO_GT=./CODA-LM/Val` without running inference on the whole validation set will cause a runtime error.
3. Now the data organization will be like:

```
├── CODA-LM
│   ├── Train
│   ├── Val
│   ├── Test
│   └── $ROOT_TO_GT
│       └── test_*.json
├── $ROOT_TO_RESULTS
│   ├── general_perception_answer.jsonl
│   ├── region_perception_answer.jsonl
│   └── driving_suggestion_answer.jsonl
```
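Before launching the GPT-based evaluation, it is worth checking that all three result files exist and parse cleanly. A minimal sketch (file names follow the layout above; `check_results` is our own helper, not part of the evaluation scripts):

```python
import json
from pathlib import Path

TASKS = ("general_perception", "region_perception", "driving_suggestion")

def check_results(results_root):
    """Return a list of problems found in the three *_answer.jsonl files."""
    problems = []
    for task in TASKS:
        path = Path(results_root) / f"{task}_answer.jsonl"
        if not path.is_file():
            problems.append(f"missing file: {path.name}")
            continue
        for i, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"{path.name}: line {i} is not valid JSON")
    return problems
```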



## Instructions

0. Install additional dependencies for evaluation via pip.

```bash
pip install openai
```

1. Insert the ground-truth `label_name` of each corner case object collected in region perception.

```bash
# Results will be saved in $ROOT_TO_RESULTS/region_perception_answer_w_label.jsonl
python convert2eval.py --reference_path ./CODA-LM/$ROOT_TO_GT --prediction_path $ROOT_TO_RESULTS/region_perception_answer.jsonl
```

2. Run evaluation for each task separately. By default, we prompt `gpt-4o-2024-05-13` for evaluation.

```bash
# General perception
python stage1_eval_batch.py --reference_path ./CODA-LM/$ROOT_TO_GT --prediction_path $ROOT_TO_RESULTS/general_perception_answer.jsonl --save_path eval/general_perception_answer --model_name gpt-4o-2024-05-13 --api_key $OPENAI_KEY

# Driving suggestion
python stage2_eval_batch.py --reference_path ./CODA-LM/$ROOT_TO_GT --prediction_path $ROOT_TO_RESULTS/driving_suggestion_answer.jsonl --save_path eval/driving_suggestion_answer --model_name gpt-4o-2024-05-13 --api_key $OPENAI_KEY

# Regional perception
python stage3_eval_batch.py --reference_path ./CODA-LM/$ROOT_TO_GT --prediction_path $ROOT_TO_RESULTS/region_perception_answer_w_label.jsonl --save_path eval/region_perception_answer_w_label --model_name gpt-4o-2024-05-13 --api_key $OPENAI_KEY
```
39 changes: 39 additions & 0 deletions coda_lm_eval/evaluation/convert2eval.py
import os
import json
import argparse
from tqdm import tqdm

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--reference_path", type=str, default="./CODA-LM/Test")
    parser.add_argument("--prediction_path", type=str, default="./region_perception_answer.jsonl")
    args = parser.parse_args()

    ######################
    # load ground truth
    ######################
    reference_data = {}
    for each in tqdm(os.listdir(args.reference_path), desc='Loading GT'):
        if not each.endswith('.json'):
            continue
        with open(os.path.join(args.reference_path, each), "r", encoding='utf-8') as f:
            each_data = json.load(f)
        reference_data[each[:-5]] = each_data

    ######################
    # load prediction
    ######################
    prediction_data = [json.loads(q) for q in open(os.path.expanduser(args.prediction_path), "r", encoding='utf-8')]
    for each in tqdm(prediction_data, desc='Processing prediction'):
        image_name = each['image'].split('/')[-1]
        json_name = 'test_' + image_name.split('_')[0]
        object_id = image_name.split('_')[-1][:-4]
        assert object_id.isdigit()
        each['label_name'] = reference_data[json_name]['region_perception'][object_id]['category_name']

    ######################
    # save converted prediction
    ######################
    with open(args.prediction_path[:-6] + '_w_label.jsonl', "w", encoding='utf-8') as file:
        for each in prediction_data:
            file.write(json.dumps(each) + "\n")
5 changes: 5 additions & 0 deletions coda_lm_eval/evaluation/scene_few_shot/high.json
{
"reference": "There is a black SUV in the opposite direction to the ego car in the left lane and a white van in front of the ego car in the ego car lane. A motorcyclist wearing yelow hat is visible on the right side of the ego car, moving through the intersection. Motorcyclists are vulnerable, potentially leading to safety-critical situations. A directional sign indicating lanes for straight travel, assisting the ego car in navigating its route. A green traffic light is visible straight ahead, allowing traffic to proceed through the intersection. On the side of the road, the orange barrier marks a risky area, alerting the ego car to potential hazards and the need for caution.",
"prediction": "The green traffic light ahead informs the ego car that it has the right of way to proceed. The presence of a pedestrian crossing indicates that drivers should be alert for pedestrians. A black sedan and a sliver car in front of th ego car directly influences the ego car's driving behavior, requiring the driver to adjust speed and maintain a safe following distance. To the right of the image, there is a person riding a scooter. This is a vulnerable road user who may behave unpredictably. There are barriers set up along the edge of the road on the right side, indicating some form of road work or a closed-off area. The ego car should stay clear of these barriers to avoid entering a potentially unsafe area.",
"response": "The prediction text correctly identifies the green traffic light, barriers, two vehicles and the person riding a scooter. Although the colors and types of vehicles are slightly different, silver and white colors are similar, both the reference text and the prediction text describe small vehicles (SUV, sedan, van, and car are all equivalent), so SUV is predicted to be sedan, and van is predicted to be car. While there is no explicit mention of crosswalk markings in the reference text, they usually appear at intersections, so this is not an illusion. However, it did not mention the directional sign indicating straight line travel and the relative position of the vehicle was incorrect. So, Rating: [[8]]."
}