Incorrect Keypoint and Bounding Box Outputs with RetinaFace Custom Parser in DeepStream 6.3 #36
Comments
With RetinaFace ResNet50 I am getting correct bounding boxes. Can you tell me how you generated the engine file from https://github.com/biubug6/Pytorch_Retinaface?
@Athuliva I used this script for ONNX conversion: https://github.com/biubug6/Pytorch_Retinaface/blob/master/convert_to_onnx.py, and for engine conversion:
@Athuliva When I used an engine file generated this way (https://github.com/wang-xinyu/tensorrtx/tree/master/retinaface), I got an error when running it with the DeepStream Test5 application!
@sowmiya-masterworks Have you loaded libdecodeplugin.so when using https://github.com/wang-xinyu/tensorrtx/tree/master/retinaface?
@sowmiya-masterworks Try this. BTW, how do you decode those bboxes and landmarks? Since RetinaFace is an anchor-based model, the raw outputs must be post-processed; otherwise they are unreadable.
@zhouyuchong, thanks for the suggestion! Could you provide some guidance or recommend a repository for decoding the bounding boxes and landmarks from RetinaFace inside the DeepStream environment? Since it's an anchor-based model, I understand that the raw outputs need post-processing to be interpretable, and any pointers on how to approach this within DeepStream would be greatly appreciated.
@sowmiya-masterworks cpp version if you use
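For readers following the decoding question above, here is a minimal Python sketch of anchor-based post-processing in the style of the Pytorch_Retinaface repository. It is an illustration only, not the parser from this repo. The prior sizes, strides, and variances are assumed to match that repository's default config; the function names are hypothetical.

```python
import math
from itertools import product

# Assumed values, taken to match the Pytorch_Retinaface default config.
VARIANCES = (0.1, 0.2)
MIN_SIZES = ((16, 32), (64, 128), (256, 512))
STEPS = (8, 16, 32)


def make_priors(img_h, img_w):
    """Build normalized (cx, cy, w, h) anchors for every feature-map cell."""
    priors = []
    for sizes, step in zip(MIN_SIZES, STEPS):
        fm_h, fm_w = math.ceil(img_h / step), math.ceil(img_w / step)
        for i, j in product(range(fm_h), range(fm_w)):
            for size in sizes:
                priors.append(((j + 0.5) * step / img_w,
                               (i + 0.5) * step / img_h,
                               size / img_w,
                               size / img_h))
    return priors


def decode_box(loc, prior, img_w, img_h):
    """Turn one raw (dx, dy, dw, dh) regression into pixel (x1, y1, x2, y2)."""
    pcx, pcy, pw, ph = prior
    cx = pcx + loc[0] * VARIANCES[0] * pw
    cy = pcy + loc[1] * VARIANCES[0] * ph
    w = pw * math.exp(loc[2] * VARIANCES[1])
    h = ph * math.exp(loc[3] * VARIANCES[1])
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)


def decode_landmarks(lmk, prior, img_w, img_h):
    """Turn ten raw landmark offsets into five pixel (x, y) points."""
    pcx, pcy, pw, ph = prior
    points = []
    for k in range(0, 10, 2):
        x = (pcx + lmk[k] * VARIANCES[0] * pw) * img_w
        y = (pcy + lmk[k + 1] * VARIANCES[0] * ph) * img_h
        points.append((x, y))
    return points
```

Each anchor's raw output is only an offset relative to its prior, so values such as -1.57 or 1.41 are expected before decoding. After decoding, detections still need confidence thresholding and non-maximum suppression, which this sketch omits.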
I'm using the RetinaFace custom parser from the face-recognition-deepstream repo with DeepStream 6.3 and getting incorrect keypoint and bounding box outputs.
Environment
Hardware: NVIDIA GeForce RTX 3060
Driver Version: 555.42.06
CUDA Version: 12.5
DeepStream Version: 6.3
Operating System: [Please specify your OS, e.g., Ubuntu 20.04]
Test Applications: DeepStream Test5 (both C++ and Python versions)
Models and Weights
Models Tried: Both ResNet50 and MobileNet architectures were tested.
Weights: The model weights are sourced from the Pytorch_Retinaface repository (https://github.com/biubug6/Pytorch_Retinaface).
Expected Behavior
Accurate detection and output of face landmarks and bounding boxes in the video stream.
Actual Behavior
Landmarks: Many keypoint coordinates are either zero or negative, which does not correspond to valid pixel values.
Bounding Boxes: Outputs are often unrealistic (e.g., exceedingly large dimensions).
Video Output: No detections appear in the output video.
Steps to Reproduce
Tested the setup using the DeepStream Test5 application for both primary inference engine (pgie) and secondary inference engine (sgie).
Also tested using the Python3 main.py script provided in the repository.
In all tests, invalid bounding boxes and keypoints were observed across the different setups and models.
Console Output
Raw output array:
output[0] = 0.937012
output[1] = 1.41797
output[2] = -1.57422
output[3] = -0.820801
output[4] = 1.58301
output[5] = 0.605469
output[6] = -0.943848
output[7] = -0.15332
output[8] = -1.27832
output[9] = 1.35547
output[10] = -0.252197
output[11] = -0.817383
output[12] = 0.0300903
output[13] = 1.14355
output[14] = -0.243774
output[15] = -0.335449
Raw output array:
output[0] = 0.937012
output[1] = 1.41797
output[2] = -1.57422
output[3] = -0.820801
output[4] = 1.58301
output[5] = 0.605469
output[6] = -0.943848
output[7] = -0.15332
output[8] = -1.27832
output[9] = 1.35547
output[10] = -0.252197
output[11] = -0.817383
output[12] = 0.0300903
output[13] = 1.14355
output[14] = -0.243774
output[15] = -0.335449
Clipped BBox: 1.41797, 0, 0, 1.58301
Detection:
Top: 0
Left: 1
Width: 4.29497e+09
Height: 1
Confidence: 0.605469
Landmarks: 0 0 -1 1 0 0 0 1 0 0
Raw output array:
output[0] = 0.935547
output[1] = 1.41797
output[2] = -1.57324
output[3] = -0.819336
output[4] = 1.58203
output[5] = 0.60498
output[6] = -0.943359
output[7] = -0.15332
output[8] = -1.27734
output[9] = 1.35352
output[10] = -0.25293
output[11] = -0.816895
output[12] = 0.0317383
output[13] = 1.14355
output[14] = -0.243042
output[15] = -0.334473
Raw output array:
output[0] = 0.935547
output[1] = 1.41797
output[2] = -1.57324
output[3] = -0.819336
output[4] = 1.58203
output[5] = 0.60498
output[6] = -0.943359
output[7] = -0.15332
output[8] = -1.27734
output[9] = 1.35352
output[10] = -0.25293
output[11] = -0.816895
output[12] = 0.0317383
output[13] = 1.14355
output[14] = -0.243042
output[15] = -0.334473
Clipped BBox: 1.41797, 0, 0, 1.58203
Detection:
Top: 0
Left: 1
Width: 4.29497e+09
Height: 1
Confidence: 0.60498
Landmarks: 0 0 -1 1 0 0 0 1 0 0
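One reading that is consistent with the log above (an assumption, since the parser source is not shown here) is that the parser treats the raw tensor as already-decoded records: index 0 as a count, indices 1 to 4 as x1, y1, x2, y2, index 5 as confidence, and indices 6 to 15 as landmarks. The hypothetical Python sketch below applies that layout to the first raw array, with truncating integer casts and an unsigned 32-bit width, and reproduces the printed values, including the 4.29497e+09 width.

```python
# First raw output array, copied from the console output above.
RAW = [0.937012, 1.41797, -1.57422, -0.820801, 1.58301, 0.605469,
       -0.943848, -0.15332, -1.27832, 1.35547, -0.252197, -0.817383,
       0.0300903, 1.14355, -0.243774, -0.335449]


def as_uint32(value):
    """Mimic storing a (possibly negative) integer in an unsigned 32-bit field."""
    return int(value) & 0xFFFFFFFF


def reinterpret(raw):
    """Read raw values as if they were decoded: [count, x1, y1, x2, y2, conf, 10 landmarks]."""
    # Clip the four box values at zero, as the "Clipped BBox" line suggests.
    x1, y1, x2, y2 = (max(v, 0.0) for v in raw[1:5])
    return {
        "clipped": (x1, y1, x2, y2),
        "left": int(x1),
        "top": int(y1),
        # x2 < x1 here, so the difference is negative and wraps around.
        "width": as_uint32(int(x2) - int(x1)),
        "height": as_uint32(int(y2) - int(y1)),
        "confidence": raw[5],
        # int() truncates toward zero, like a C-style float-to-int cast.
        "landmarks": [int(v) for v in raw[6:16]],
    }


det = reinterpret(RAW)
print("Width: %g" % det["width"])
print("Landmarks:", *det["landmarks"])
```

If this reading holds, the values being parsed are undecoded anchor offsets, so the boxes and landmarks cannot be valid pixel coordinates until the outputs are decoded against the anchors.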