The results in inference mode are relatively inaccurate. #28

Open
Fusica opened this issue Jan 13, 2025 · 7 comments

Fusica commented Jan 13, 2025

I loaded the pre-trained weights you provided and ran inference on an image from the training data, but I found that the returned result has a significant gap compared to the ground truth in the dataset. I would like to know whether this gap is normal or whether there might be an issue somewhere, for example in my evaluation process. Do the authors have performance test results? I hope the authors can provide clarification. Thank you very much.


Fusica commented Jan 13, 2025

import numpy as np
import json
import torch
import torch.nn.functional as F


def rotation_error(gt_R, pred_R):
    """
    Calculate rotation error (in degrees)
    gt_R: ground truth rotation matrix (3x3)
    pred_R: predicted rotation matrix (3x3)
    """
    # Ensure inputs have correct shape
    gt_R = np.array(gt_R).reshape(3, 3)
    pred_R = np.array(pred_R).reshape(3, 3)

    # Calculate relative rotation
    R_diff = np.dot(pred_R.T, gt_R)

    # Calculate rotation angle (in radians)
    trace = np.trace(R_diff)
    cos_angle = min(1.0, max(-1.0, (trace - 1) / 2.0))  # Clamp to [-1, 1] before arccos
    angle = np.arccos(cos_angle)

    # Convert to degrees
    return np.rad2deg(angle)


def translation_error(t1, t2):
    """
    Calculate Euclidean distance between two translation vectors (in meters)
    """
    return np.linalg.norm(t1 - t2)


def load_gt_data(scene_id, img_id="61"):
    """Load ground truth data for a specific scene"""
    base_path = "/data/lmo/test_all/test/000002"

    with open(f"{base_path}/scene_gt.json", "r") as f:
        scene_gt = json.load(f)

    with open(f"{base_path}/scene_gt_info.json", "r") as f:
        scene_gt_info = json.load(f)

    # Use image ID as key
    return scene_gt[img_id], scene_gt_info[img_id]


def visualize_boxes(img_path, boxes, save_path="debug_vis.jpg"):
    """Visualize bounding boxes on image"""
    import cv2

    img = cv2.imread(img_path)

    for box in boxes:
        x, y, w, h = [int(v) for v in box]
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imwrite(save_path, img)


def load_pred_results():
    with open("infer_out/results.json", "r") as f:
        return json.load(f)


def rotation_6d_to_matrix(rot_6d):
    """Convert 6D rotation to rotation matrix"""
    # Ensure input has correct shape and type
    rot_6d = torch.tensor(rot_6d, dtype=torch.float32).view(-1, 6)

    # Split first 3 and last 3 dimensions
    m1 = rot_6d[:, 0:3]
    m2 = rot_6d[:, 3:6]

    # Gram-Schmidt orthogonalization
    x = F.normalize(m1, p=2, dim=1)
    z = torch.cross(x, m2, dim=1)
    z = F.normalize(z, p=2, dim=1)
    y = torch.cross(z, x, dim=1)

    # Assemble rotation matrix
    rot_matrix = torch.cat((x.view(-1, 3, 1), y.view(-1, 3, 1), z.view(-1, 3, 1)), 2)
    return rot_matrix.squeeze().numpy()  # Ensure return of 3x3 matrix


# Load GT data for image 61
scene_id = "2"
img_id = "61"
gt_poses, gt_boxes = load_gt_data(scene_id, img_id)

# Visualize GT boxes
img_path = "/data/lmo/test_all/test/000002/rgb/000061.png"
boxes = [info["bbox_obj"] for info in gt_boxes]
visualize_boxes(img_path, boxes)

# YOLO class to LMO ID mapping
yolo_to_lmo = {1: 1, 2: 5, 3: 6, 4: 8, 5: 9, 6: 10, 7: 11, 8: 12}

# Compare predictions with GT
pred_results = load_pred_results()
img_results = pred_results["000061"]

print("GT objects:", [(i, gt["obj_id"]) for i, gt in enumerate(gt_poses)])
print("Pred objects (before mapping):", [(k, pred["class"]) for k, pred in img_results.items()])

for i, gt in enumerate(gt_poses):
    gt_R = np.array(gt["cam_R_m2c"]).reshape(3, 3)
    gt_t = np.array(gt["cam_t_m2c"]) / 1000.0
    gt_obj_id = gt["obj_id"]

    # Collect all matching predictions
    matching_preds = []
    for pred_idx, pred in img_results.items():
        pred_lmo_id = yolo_to_lmo[pred["class"]]
        if pred_lmo_id == gt_obj_id:
            pred_t = np.array(pred["t"])
            pred_R = rotation_6d_to_matrix(pred["rot"][:6])

            t_err = translation_error(gt_t, pred_t)
            r_err = rotation_error(gt_R, pred_R)
            matching_preds.append((t_err, r_err, pred))

    # Select best prediction
    if matching_preds:
        best_pred = min(matching_preds, key=lambda x: x[0])
        t_err, r_err, pred = best_pred

        print(f"\nObject {gt_obj_id} (from YOLO class {pred['class']}):")
        print(f"Found {len(matching_preds)} matching predictions, using best one:")
        print(f"Translation error: {t_err:.3f}m")
        print(f"Rotation error: {r_err:.1f}°")
        print("---")

The above is the evaluation code I used, and I am not sure whether it is correct. Below are my prediction results; the differences from the ground truth are very significant.

Object 1 (from YOLO class 1):
Found 1 matching predictions, using best one:
Translation error: 0.656m
Rotation error: 164.3°
---

Object 5 (from YOLO class 2):
Found 1 matching predictions, using best one:
Translation error: 0.819m
Rotation error: 60.3°
---

Object 6 (from YOLO class 3):
Found 2 matching predictions, using best one:
Translation error: 0.426m
Rotation error: 125.4°
---

Object 8 (from YOLO class 4):
Found 1 matching predictions, using best one:
Translation error: 0.243m
Rotation error: 30.7°
---

Object 10 (from YOLO class 6):
Found 1 matching predictions, using best one:
Translation error: 0.435m
Rotation error: 134.0°
---

Object 11 (from YOLO class 7):
Found 2 matching predictions, using best one:
Translation error: 0.331m
Rotation error: 98.3°


Fusica commented Jan 14, 2025

I wonder if it is because the GT is used as the source for the reference points and query embeddings during the training and validation phases, whereas only the backbone output can be used during inference.

tgjantos (Member) commented

Hi Fusica,

For which dataset are you testing this, and which pre-trained weights are you using?

Best,
Thomas


Fusica commented Jan 14, 2025

I use LM-O to run inference; the pre-trained model is poet_lmo_maskrcnn.pth.

tgjantos (Member) commented

You are not doing anything wrong. It has already been reported that PoET does not perform as expected on the LM-O dataset. We have a fix for this, but we need to wait for the outcome of a review process before we can release the code and the new model. I am sorry for that.

Best,
Thomas


Fusica commented Jan 14, 2025

However, why can't I at least achieve results consistent with the validation set during inference? I found that the prediction results for the same image differ between validation and inference, so I suspect it is because the bbox_mode differs between validation and inference.
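
If that is the case, a quick way to confirm it is to diff two results files for the same image, one written during validation (GT boxes) and one during inference (backbone boxes). A minimal sketch, assuming both runs write a results.json in the same format as used by the evaluation script above; the val_out path is hypothetical:

import json

import numpy as np

with open("val_out/results.json", "r") as f:
    val_results = json.load(f)
with open("infer_out/results.json", "r") as f:
    inf_results = json.load(f)

img_id = "000061"
for idx, val_pred in val_results[img_id].items():
    inf_pred = inf_results[img_id].get(idx)
    if inf_pred is None or inf_pred["class"] != val_pred["class"]:
        continue  # only compare predictions for the same detection slot and class
    dt = np.linalg.norm(np.array(val_pred["t"]) - np.array(inf_pred["t"]))
    print(f"class {val_pred['class']}: translation difference {dt:.3f}")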

tgjantos (Member) commented

That is 100% the reason. I kindly refer you to the supplementary material of our paper: https://www.aau.at/wp-content/uploads/2022/09/jantos_poet.pdf

We investigated how the quality of the object detector influences the pose estimator's performance. Hence, the difference can be quite drastic when actual network predictions are used instead of GT information.
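
Since the pose error grows with the box error, comparing the detector's boxes against the GT boxes gives a rough idea of how much of the gap comes from the detector alone. A minimal sketch, assuming BOP-style [x, y, w, h] boxes as in scene_gt_info; the detector box below is a placeholder, not an actual network output:

def iou_xywh(box_a, box_b):
    """Intersection over union of two [x, y, w, h] boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    x1, y1 = max(xa, xb), max(ya, yb)
    x2, y2 = min(xa + wa, xb + wb), min(ya + ha, yb + hb)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0


gt_box = [200, 150, 80, 60]    # e.g. a "bbox_obj" entry from scene_gt_info
det_box = [210, 160, 70, 55]   # hypothetical detector output
print(f"IoU: {iou_xywh(gt_box, det_box):.2f}")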

Best,
Thomas
