The results in inference mode are relatively inaccurate. #28

Open
Fusica opened this issue Jan 13, 2025 · 7 comments

Fusica commented Jan 13, 2025

I loaded the pre-trained weights you provided and ran inference on an image from the training data, but I found that the returned result has a significant gap compared to the ground truth in the dataset. I would like to know whether this gap is normal or whether there might be an issue somewhere, for example in my evaluation process. Do the authors have performance test results? I hope the authors can provide clarification. Thank you very much.


Fusica commented Jan 13, 2025

import numpy as np
import json
import torch
import torch.nn.functional as F


def rotation_error(gt_R, pred_R):
    """
    Calculate rotation error (in degrees)
    gt_R: ground truth rotation matrix (3x3)
    pred_R: predicted rotation matrix (3x3)
    """
    # Ensure inputs have correct shape
    gt_R = np.array(gt_R).reshape(3, 3)
    pred_R = np.array(pred_R).reshape(3, 3)

    # Calculate relative rotation
    R_diff = np.dot(pred_R.T, gt_R)

    # Calculate rotation angle (in radians)
    trace = np.trace(R_diff)
    cos_angle = min(1.0, max(-1.0, (trace - 1) / 2.0))  # Clamp to [-1, 1] before arccos
    angle = np.arccos(cos_angle)

    # Convert to degrees
    return np.rad2deg(angle)


def translation_error(t1, t2):
    """
    Calculate Euclidean distance between two translation vectors (in meters)
    """
    return np.linalg.norm(t1 - t2)


def load_gt_data(scene_id, img_id="61"):
    """Load ground truth data for a specific scene"""
    base_path = "/data/lmo/test_all/test/000002"

    with open(f"{base_path}/scene_gt.json", "r") as f:
        scene_gt = json.load(f)

    with open(f"{base_path}/scene_gt_info.json", "r") as f:
        scene_gt_info = json.load(f)

    # Use image ID as key
    return scene_gt[img_id], scene_gt_info[img_id]


def visualize_boxes(img_path, boxes, save_path="debug_vis.jpg"):
    """Visualize bounding boxes on image"""
    import cv2

    img = cv2.imread(img_path)

    for box in boxes:
        x, y, w, h = [int(v) for v in box]
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imwrite(save_path, img)


def load_pred_results():
    with open("infer_out/results.json", "r") as f:
        return json.load(f)


def rotation_6d_to_matrix(rot_6d):
    """Convert 6D rotation to rotation matrix"""
    # Ensure input has correct shape and type
    rot_6d = torch.tensor(rot_6d, dtype=torch.float32).view(-1, 6)

    # Split first 3 and last 3 dimensions
    m1 = rot_6d[:, 0:3]
    m2 = rot_6d[:, 3:6]

    # Gram-Schmidt orthogonalization
    x = F.normalize(m1, p=2, dim=1)
    z = torch.cross(x, m2, dim=1)
    z = F.normalize(z, p=2, dim=1)
    y = torch.cross(z, x, dim=1)

    # Assemble rotation matrix
    rot_matrix = torch.cat((x.view(-1, 3, 1), y.view(-1, 3, 1), z.view(-1, 3, 1)), 2)
    return rot_matrix.squeeze().numpy()  # Ensure return of 3x3 matrix


# Load GT data for image 61
scene_id = "2"
img_id = "61"
gt_poses, gt_boxes = load_gt_data(scene_id, img_id)

# Visualize GT boxes
img_path = "/data/lmo/test_all/test/000002/rgb/000061.png"
boxes = [info["bbox_obj"] for info in gt_boxes]
visualize_boxes(img_path, boxes)

# YOLO class to LMO ID mapping
yolo_to_lmo = {1: 1, 2: 5, 3: 6, 4: 8, 5: 9, 6: 10, 7: 11, 8: 12}

# Compare predictions with GT
pred_results = load_pred_results()
img_results = pred_results["000061"]

print("GT objects:", [(i, gt["obj_id"]) for i, gt in enumerate(gt_poses)])
print("Pred objects (before mapping):", [(k, pred["class"]) for k, pred in img_results.items()])

for i, gt in enumerate(gt_poses):
    gt_R = np.array(gt["cam_R_m2c"]).reshape(3, 3)
    gt_t = np.array(gt["cam_t_m2c"]) / 1000.0
    gt_obj_id = gt["obj_id"]

    # Collect all matching predictions
    matching_preds = []
    for pred_idx, pred in img_results.items():
        pred_lmo_id = yolo_to_lmo[pred["class"]]
        if pred_lmo_id == gt_obj_id:
            pred_t = np.array(pred["t"])
            pred_R = rotation_6d_to_matrix(pred["rot"][:6])

            t_err = translation_error(gt_t, pred_t)
            r_err = rotation_error(gt_R, pred_R)
            matching_preds.append((t_err, r_err, pred))

    # Select best prediction
    if matching_preds:
        best_pred = min(matching_preds, key=lambda x: x[0])
        t_err, r_err, pred = best_pred

        print(f"\nObject {gt_obj_id} (from YOLO class {pred['class']}):")
        print(f"Found {len(matching_preds)} matching predictions, using best one:")
        print(f"Translation error: {t_err:.3f}m")
        print(f"Rotation error: {r_err:.1f}°")
        print("---")

The above is the evaluation code I used, and I am not sure whether it is correct. Below are my prediction results; the differences from the ground truth are very significant.

Object 1 (from YOLO class 1):
Found 1 matching predictions, using best one:
Translation error: 0.656m
Rotation error: 164.3°
---

Object 5 (from YOLO class 2):
Found 1 matching predictions, using best one:
Translation error: 0.819m
Rotation error: 60.3°
---

Object 6 (from YOLO class 3):
Found 2 matching predictions, using best one:
Translation error: 0.426m
Rotation error: 125.4°
---

Object 8 (from YOLO class 4):
Found 1 matching predictions, using best one:
Translation error: 0.243m
Rotation error: 30.7°
---

Object 10 (from YOLO class 6):
Found 1 matching predictions, using best one:
Translation error: 0.435m
Rotation error: 134.0°
---

Object 11 (from YOLO class 7):
Found 2 matching predictions, using best one:
Translation error: 0.331m
Rotation error: 98.3°


Fusica commented Jan 14, 2025

I wonder if it is because the GT is used as the source for the reference points and query embeddings during the training and validation phases, whereas only the backbone output can be used during inference.

tgjantos (Member) commented

Hi Fusica,

For which dataset are you testing this, and which pre-trained weights are you using?

Best,
Thomas


Fusica commented Jan 14, 2025

I use LM-O to run inference; the pre-trained model is poet_lmo_maskrcnn.pth.

tgjantos (Member) commented

You are not doing anything wrong. It has already been reported that PoET does not perform as expected on the LM-O dataset. We have a fix for this, but we need to wait for the outcome of a review process before we can release the code and the new model. I am sorry for that.

Best,
Thomas


Fusica commented Jan 14, 2025

However, why can't I at least achieve results consistent with the validation set during inference? I found that the prediction results for the same image differ between validation and inference, so I suspect it is because the bbox_mode differs between validation and inference.
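
If that is the case, a quick way to confirm it is to diff two results files for the same image, one written during validation (GT boxes) and one during inference (backbone boxes). A minimal sketch, assuming both runs write a results.json in the same format as used by the evaluation script above; the val_out path is hypothetical:

import json

import numpy as np

with open("val_out/results.json", "r") as f:
    val_results = json.load(f)
with open("infer_out/results.json", "r") as f:
    inf_results = json.load(f)

img_id = "000061"
for idx, val_pred in val_results[img_id].items():
    inf_pred = inf_results[img_id].get(idx)
    if inf_pred is None or inf_pred["class"] != val_pred["class"]:
        continue  # only compare predictions for the same detection slot and class
    dt = np.linalg.norm(np.array(val_pred["t"]) - np.array(inf_pred["t"]))
    print(f"class {val_pred['class']}: translation difference {dt:.3f}")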

tgjantos (Member) commented

That is 100% the reason. I kindly refer you to the supplementary material of our paper: https://www.aau.at/wp-content/uploads/2022/09/jantos_poet.pdf

We investigated how the quality of the object detector influences the pose estimator's performance. Hence, the difference can be quite drastic when actual network predictions are used instead of GT information.
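
Since the pose error grows with the box error, comparing the detector's boxes against the GT boxes gives a rough idea of how much of the gap comes from the detector alone. A minimal sketch, assuming BOP-style [x, y, w, h] boxes as in scene_gt_info; the detector box below is a placeholder, not an actual network output:

def iou_xywh(box_a, box_b):
    """Intersection over union of two [x, y, w, h] boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    x1, y1 = max(xa, xb), max(ya, yb)
    x2, y2 = min(xa + wa, xb + wb), min(ya + ha, yb + hb)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0


gt_box = [200, 150, 80, 60]    # e.g. a "bbox_obj" entry from scene_gt_info
det_box = [210, 160, 70, 55]   # hypothetical detector output
print(f"IoU: {iou_xywh(gt_box, det_box):.2f}")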

Best,
Thomas
