The confidence score #3903

Open
ranjit-tiger opened this issue Feb 5, 2025 · 1 comment
Labels
bug Something isn't working


ranjit-tiger commented Feb 5, 2025

Describe the bug
After parsing a PDF, how can the parsing results be validated?

To Reproduce
The detection_class_prob metadata key is inconsistent: it is not present on all extracted elements.

Expected behavior
I am parsing a PDF that contains images, text, tables rendered as images, and so on, using partition_pdf() with the hi_res strategy. I expect the metadata of every element to include the detection_class_prob key, which gives the confidence score. However, the key is missing for some elements. For example, it is present on a Table element but not on an Image element, and it is likewise missing on other elements. The expected behavior is for this key to be present on all elements.

Screenshots

(Two screenshots attached in the original issue.)

Environment Info
please use 👍
unstructured version : 0.16.23

from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="/content/data/Cocktails_Spirits.pdf",
    strategy="hi_res",
    infer_table_structure=True,  # Infers table structures from content
    extract_images_in_pdf=True,  # Extract images from the PDF
    extract_image_block_types=["Image", "Table"],  # Image and Table extraction
    extract_image_block_to_payload=True,  # Return images in the response
    output_format="application/json",  # JSON output format
    extract_image_block_output_dir="extracted_data_test",
)
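
To make the inconsistency easier to see, here is a minimal sketch that counts, per element type, how many elements carry detection_class_prob. It works on plain dictionaries shaped like the output of each element's to_dict() method. The function name summarize_detection_probs and the sample data are hypothetical, made up for illustration, and are not taken from the actual PDF.

```python
from collections import defaultdict


def summarize_detection_probs(element_dicts):
    """Count, per element type, how many elements have detection_class_prob."""
    summary = defaultdict(lambda: {"with_prob": 0, "without_prob": 0})
    for el in element_dicts:
        metadata = el.get("metadata", {})
        has_prob = metadata.get("detection_class_prob") is not None
        key = "with_prob" if has_prob else "without_prob"
        summary[el.get("type", "Unknown")][key] += 1
    return dict(summary)


# Hypothetical sample data, standing in for real partition output.
sample_elements = [
    {"type": "Table", "metadata": {"detection_class_prob": 0.93}},
    {"type": "Image", "metadata": {}},
    {"type": "NarrativeText", "metadata": {"detection_class_prob": 0.81}},
    {"type": "NarrativeText", "metadata": {}},
]

summary = summarize_detection_probs(sample_elements)
for element_type, counts in summary.items():
    print(element_type, counts)
```

With real output, the input could be built as [el.to_dict() for el in raw_pdf_elements], assuming the elements expose to_dict() as they do in this version.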

Additional context
A probability value should be returned for every element.

ranjit-tiger added the bug label Feb 5, 2025
ds-filipknefel (Contributor) commented

The detection_class_prob field is not always present because not all detection methods rely on probabilistic approaches. It may be that in those cases the field should still be present, e.g. with a value of 1.
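
Until that is decided, downstream code can treat a missing detection_class_prob as "unscored" instead of assuming any particular value. The following is only a sketch under that assumption: split_by_confidence, the 0.5 threshold, and the sample data are hypothetical and not part of the library.

```python
def split_by_confidence(element_dicts, threshold=0.5):
    """Split element dicts into confident, uncertain, and unscored groups.

    An element is "unscored" when its metadata has no detection_class_prob,
    which happens when the detection method is not probabilistic.
    The default threshold of 0.5 is an arbitrary illustrative choice.
    """
    confident, uncertain, unscored = [], [], []
    for el in element_dicts:
        prob = el.get("metadata", {}).get("detection_class_prob")
        if prob is None:
            unscored.append(el)
        elif prob >= threshold:
            confident.append(el)
        else:
            uncertain.append(el)
    return confident, uncertain, unscored


# Hypothetical sample data, shaped like element.to_dict() output.
elements = [
    {"type": "Table", "metadata": {"detection_class_prob": 0.93}},
    {"type": "Title", "metadata": {"detection_class_prob": 0.35}},
    {"type": "Image", "metadata": {}},
]

confident, uncertain, unscored = split_by_confidence(elements)
print(len(confident), len(uncertain), len(unscored))
```

Keeping unscored elements in a separate group avoids mixing a real model probability with a placeholder such as 1, which could otherwise make non-probabilistic detections look more certain than model-based ones.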

@christinestraub, tagging you to help decide whether this should be reclassified from a bug and whether there is room to change this behavior at this time.
