The confidence score #3903

ranjit-tiger · 2025-02-05T09:46:47Z

Describe the bug
After parsing a PDF, how can the parsing results be validated?

To Reproduce
The detection_class_prob key is not consistent: it is not available for all extracted elements.
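One way to see which element types lack the key is to tally it per type. The sketch below is only an illustration: the helper name tally_detection_prob and the sample data are hypothetical, and it assumes the elements have first been converted to dictionaries (for example with [el.to_dict() for el in elements]), with the score stored under metadata["detection_class_prob"] when present.

```python
from collections import Counter


def tally_detection_prob(element_dicts):
    """Count, per element type, how many elements carry detection_class_prob.

    Hypothetical helper: expects dictionaries shaped like the output of
    Element.to_dict(), i.e. with "type" and "metadata" keys.
    """
    present, missing = Counter(), Counter()
    for el in element_dicts:
        prob = el.get("metadata", {}).get("detection_class_prob")
        bucket = present if prob is not None else missing
        bucket[el.get("type", "Unknown")] += 1
    return present, missing


# Made-up sample data for illustration, not real output from the PDF in this issue.
sample = [
    {"type": "Table", "metadata": {"detection_class_prob": 0.93}},
    {"type": "Image", "metadata": {}},
    {"type": "NarrativeText", "metadata": {"detection_class_prob": 0.71}},
    {"type": "Image", "metadata": {}},
]

present, missing = tally_detection_prob(sample)
print("with score:", dict(present))
print("without score:", dict(missing))
```

Running this against the real elements should show whether the key is missing for whole element types or only for individual elements.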

Expected behavior
Say I am parsing a PDF that contains images, text, tables rendered as images, and so on, using partition_pdf() with the hi_res strategy. I would expect the metadata of every element to include the detection_class_prob key, which gives the confidence score. However, detection_class_prob is missing for some elements. For example, it is available for a Table element but not for an Image element, and it is similarly unavailable for other elements. The expected behavior is for this key to be present on all elements.

Screenshots

Environment Info
please use 👍
unstructured version: 0.16.23

from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="/content/data/Cocktails_Spirits.pdf",
    strategy="hi_res",
    infer_table_structure=True,  # Infers table structures from content
    extract_images_in_pdf=True,  # Extract images from the PDF
    extract_image_block_types=["Image", "Table"],  # Image and Table extraction
    extract_image_block_to_payload=True,  # Return images in the response
    output_format="application/json",  # JSON output format
    extract_image_block_output_dir="extracted_data_test",
)

Additional context
We should get a probability value for every element.

ds-filipknefel · 2025-02-26T11:51:29Z

The detection_class_prob field is not always present because not all detection methods rely on probabilistic approaches. It could be that in those cases the field should still be available, e.g. with a value of 1.

@christinestraub, could you help decide whether this should be reclassified from a bug, and whether there is room for a change here at this time?
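Until that is decided, callers can handle the missing field on their side. The following is a minimal sketch, not the library's behavior: confidence_or_default is a hypothetical helper, and it assumes elements converted to dictionaries (e.g. via el.to_dict()) with the score under metadata["detection_class_prob"]. Using 1.0 as the fallback mirrors the suggestion above, but it is a choice; keeping None distinguishes "not scored" from "fully confident".

```python
def confidence_or_default(element_dict, default=None):
    """Return detection_class_prob if present, otherwise the given default.

    Hypothetical helper: expects a dictionary shaped like Element.to_dict().
    """
    prob = element_dict.get("metadata", {}).get("detection_class_prob")
    return default if prob is None else prob


# Made-up examples for illustration.
table = {"type": "Table", "metadata": {"detection_class_prob": 0.93}}
image = {"type": "Image", "metadata": {}}

table_score = confidence_or_default(table, default=1.0)  # score is present
image_score = confidence_or_default(image, default=1.0)  # falls back to default
image_unscored = confidence_or_default(image)            # stays None
print(table_score, image_score, image_unscored)
```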

ranjit-tiger added the "bug" (Something isn't working) label on Feb 5, 2025
