
Conversation

@robertandremitchell robertandremitchell commented Dec 8, 2025

Description

Creates a function that expects a small payload of data on the returned/expected LOINC to evaluate the degree of accuracy of our algorithm. Definitions of how we currently think about accuracy are outlined here: https://docs.google.com/document/d/1yA5NJ06mf1EfLZRmNrrNKopWL6ExMj-dPYKy8wlVDGs/edit?tab=t.0#heading=h.b1r0q3mit8hy

This also updates additional parts of the code to add the expected LOINC to the example data. However, there are open questions:

  • do we expect the returned codes to always be of one particular type (long name, common name, etc.)?
  • if so, is the intention to make a call to the LOINC API or to our internal CSV to try to match it to a code?

I have a dummy notebook with small edits to performance.ipynb (here: https://ml.azure.com/fileexplorerAzNB?wsid=/subscriptions/6848426c-8ca8-4832-b493-fed851be1f95/resourcegroups/dibbs-ttc-training/providers/Microsoft.MachineLearningServices/workspaces/dibbsttc&tid=28cf58df-efe8-4135-b2d1-f697ee74c00c&activeFilePath=Users/robert.a.mitchell/performance-copy.ipynb&notebookPivot=0) that does the following:

  • dynamically updates the code based on the model name
  • switches from un-pickling to loading JSONL
  • outputs JSONL that looks like the below:
{"example_idx": 0, "query_input": "Hester Davis fall risk scale", "expected_label": "Hester Davis fall risk scale", "k": 10, "encoding_time_s": 0.32662534713745117, "search_time_s": 0.0002624988555908203, "expected_match": {"rank": null, "score": null, "is_correct_in_topk": false, "is_correct_top1": false}, "results": [{"rank": 1, "corpus_id": 934, "label": "Coronavirus anxiety scale", "loinc_type": "Order", "score": 0.8334473371505737}, {"rank": 2, "corpus_id": 70, "label": "Abbreviated Injury Scale panel AAAM", "loinc_type": "Order", "score": 0.8256902694702148}, {"rank": 3, "corpus_id": 886, "label": "Goal attainment scale - Reported", "loinc_type": "Order", "score": 0.8089544177055359}, {"rank": 4, "corpus_id": 22, "label": "17-Hydroxyprogesterone [Measurement] in DBS", "loinc_type": "Order", "score": 0.8040225505828857}, {"rank": 5, "corpus_id": 712, "label": "Bacterial susceptibility panel by Disk diffusion (KB)", "loinc_type": "Order", "score": 0.8019613027572632}, {"rank": 6, "corpus_id": 117, "label": "Active range of motion panel Quantitative", "loinc_type": "Order", "score": 0.7989861369132996}, {"rank": 7, "corpus_id": 166, "label": "ADL functional rehabilitation potential Set", "loinc_type": "Order", "score": 0.7988804578781128}, {"rank": 8, "corpus_id": 676, "label": "Cholinesterase activity panel - Serum or Plasma", "loinc_type": "Order", "score": 0.7974450588226318}, {"rank": 9, "corpus_id": 906, "label": "Centers for Environmental Health trace metals screen panel [Mass/volume] - Urine", "loinc_type": "Order", "score": 0.7957779169082642}, {"rank": 10, "corpus_id": 499, "label": "Anemia evaluation panel - Serum or Blood", "loinc_type": "Order", "score": 0.7948352098464966}]}
{"example_idx": 1, "query_input": "F9 gene familial mut Doc analysis molecular genetics (Bld/Tiss)", "expected_label": "F9 gene familial mut analysis Molgen Doc (Bld/Tiss)", "k": 1, "encoding_time_s": 0.5327327251434326, "search_time_s": 0.0006182193756103516, "expected_match": {"rank": null, "score": null, "is_correct_in_topk": false, "is_correct_top1": false}, "results": [{"rank": 1, "corpus_id": 885, "label": "Glycosylation congenital disorders multigene analysis in Blood or Tissue by Molecular genetics method", "loinc_type": "Order", "score": 0.8319992423057556}]}
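For reference, a minimal sketch (a hypothetical helper, not the notebook code itself) of loading this JSONL output and summarizing accuracy from the expected_match fields shown above:

```python
import json

def summarize_accuracy(jsonl_lines):
    """Compute top-1 and top-k accuracy from evaluation JSONL records.

    Each record is assumed to carry an "expected_match" object with
    "is_correct_top1" and "is_correct_in_topk" booleans, as in the
    sample output above.
    """
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    n = len(records)
    top1 = sum(r["expected_match"]["is_correct_top1"] for r in records)
    topk = sum(r["expected_match"]["is_correct_in_topk"] for r in records)
    return {"n": n, "top1_acc": top1 / n, "topk_acc": topk / n}

# Tiny inline example mirroring the shape of the sample records above
lines = [
    '{"example_idx": 0, "expected_match": {"is_correct_top1": false, "is_correct_in_topk": false}}',
    '{"example_idx": 1, "expected_match": {"is_correct_top1": true, "is_correct_in_topk": true}}',
]
print(summarize_accuracy(lines))  # {'n': 2, 'top1_acc': 0.5, 'topk_acc': 0.5}
```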

I've only run it on a small fraction (1 of 283 files). Ideally, the examples file we load would also include the LOINC ID and LOINC type, so that we could do a more comprehensive check of whether:

  • The LOINC ID and/or name matches
  • If it does not match, there is a candidate among the high-probability matches that is closer on type (i.e., if the expected code is an Order but there's an 85% match that's an Observation and an 83% match that's an Order, we would in theory want to use the 83% match)
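The type-aware fallback described above could look something like this (a sketch only; it assumes each result dict carries the "loinc_type" and "score" keys shown in the sample output, and the selection rule is hypothetical):

```python
def pick_with_type_preference(results, expected_type):
    """Prefer the highest-scoring candidate whose loinc_type matches the
    expected type; fall back to the overall top-scoring result otherwise."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    for r in ranked:
        if r["loinc_type"] == expected_type:
            return r
    return ranked[0]

# Mirrors the 85% Observation vs. 83% Order example above
results = [
    {"label": "A", "loinc_type": "Observation", "score": 0.85},
    {"label": "B", "loinc_type": "Order", "score": 0.83},
]
print(pick_with_type_preference(results, "Order")["label"])  # B
```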

For the sake of an end-to-end product, I've taken a small snippet of the data and run the text fields against the LOINC API to get a LOINC ID. I've also rewritten the scripts that generate key pairs for examples to include the LOINC codes, so that future runs can skip the call to the LOINC API.
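The skip-the-API idea can be as simple as a cached lookup keyed on the text field. A sketch, with hypothetical function names, column names, and a dummy LOINC code (this is not the PR's actual script):

```python
import csv

def load_code_cache(path):
    """Load previously resolved text -> LOINC code pairs from a CSV with
    hypothetical 'text' and 'loinc_code' columns, so future runs can
    skip the LOINC API."""
    with open(path, newline="") as f:
        return {row["text"]: row["loinc_code"] for row in csv.DictReader(f)}

def resolve_code(text, cache, api_lookup):
    """Return a cached code if present; otherwise call the (injected)
    API lookup function and remember the answer."""
    if text not in cache:
        cache[text] = api_lookup(text)
    return cache[text]

# Cache hit: the injected API lookup is never called
cache = {"Hester Davis fall risk scale": "12345-6"}  # dummy code for illustration
code = resolve_code("Hester Davis fall risk scale", cache, api_lookup=lambda t: None)
print(code)  # 12345-6
```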

Related Issues

Closes #173

Additional Notes

The logic of the third-degree match is still a tad shaky. The sample data shows two different kinds: one where the LOINCs and OIDs differ but connect to the same condition, and another where the LOINCs and OIDs differ but connect to several conditions that are all the same.

Related to this code itself, I think the only other function we may want to add down the line is one to transform the data we need from the matching protocol into the right shape, but that should be relatively straightforward since this script really only needs two columns' worth of data.

Checklist

Please review and complete the following checklist before submitting your pull request:

  • I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
  • I have reviewed my changes to ensure they are clear, concise, and well-documented.
  • I have updated the documentation, if applicable.
  • I have added or updated test cases to cover my changes, if applicable.
  • I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

  • The code follows best practices and conventions.
  • The changes implement the desired functionality or fix the reported issue.
  • The tests cover the new changes and pass successfully.
  • Any potential edge cases or error scenarios have been considered.

@robertandremitchell robertandremitchell linked an issue Dec 8, 2025 that may be closed by this pull request
codecov-commenter commented Dec 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.58%. Comparing base (ec2be1b) to head (a5b4c54).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #175   +/-   ##
=======================================
  Coverage   93.58%   93.58%           
=======================================
  Files          17       17           
  Lines         561      561           
=======================================
  Hits          525      525           
  Misses         36       36           

☔ View full report in Codecov by Sentry.

@m-goggins m-goggins added the Algorithm Development Tasks related to training, testing, evaluating and improving language models label Dec 10, 2025
@robertandremitchell robertandremitchell marked this pull request as draft January 8, 2026 19:19
@robertandremitchell robertandremitchell marked this pull request as ready for review January 8, 2026 21:00
[
{
"example_idx": 1,
"k-run": 1,
Collaborator


These sample files are super helpful, thank you for adding them!

Is k-run always expected to be 1? Are you selecting 1 because we want to return the "top" result (which is 1 in the list of returned results until we have the re-ranker)

Collaborator Author


k is 1 here mostly because I was lazy in generating dummy data 😅
data/accuracy_evaluation/sample_data/evaluation_results_eval_results_snippet_with_loinc_codes.json for example has multiple k-runs (1, 3, 5, 10).

Collaborator


Ah! haha makes sense 😆


Labels

Algorithm Development Tasks related to training, testing, evaluating and improving language models


Development

Successfully merging this pull request may close these issues.

Create function(s) to assess valueset accuracy

5 participants