Real input sample testing#9

Open
Yueqiao12Zhang wants to merge 17 commits into main from real-input-sample-testing

Conversation

@Yueqiao12Zhang
Contributor

Description

Added an integration test for the pipeline using legacy XML manually labeled training data:

  • /core/tests/fixtures/Interactive_Classifier_GameraXML_TrainingData.xml

Using this training data, I evaluated the model against a new real-world input page provided by Kyrie:

  • /core/tests/NZ-Wt MSR-03 109v

Results & Observations:
Evaluation on the new page yielded poor accuracy, indicating that the legacy GameraXML training data does not generalize well to this manuscript page. However, leave-one-out (LOO) and n-fold cross-validation on the training data alone gave ~95% accuracy with k=1. To improve performance and establish a reliable baseline, we will need to manually label the new page.
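For readers unfamiliar with the two evaluation modes mentioned above, here is a minimal sketch of n-fold and LOO accuracy for a 1-nearest-neighbour classifier. It is not the code in this PR: the `Sample` shape, `predict_1nn`, `kfold_accuracy`, and `loo_accuracy` are hypothetical names, and plain Euclidean distance on toy feature vectors stands in for whatever features and metric the real pipeline uses.

```python
import math
import random
from typing import List, Sequence, Tuple

# Hypothetical sample shape: (feature vector, label).
Sample = Tuple[Sequence[float], str]


def predict_1nn(train: List[Sample], features: Sequence[float]) -> str:
    """Return the label of the training sample closest in Euclidean distance."""
    nearest = min(train, key=lambda s: math.dist(s[0], features))
    return nearest[1]


def kfold_accuracy(samples: List[Sample], n_folds: int = 5, seed: int = 0) -> float:
    """Shuffle once, split into n_folds, and score 1-NN on each held-out fold."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being scored.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        correct += sum(predict_1nn(train, x) == y for x, y in held_out)
    return correct / len(samples)


def loo_accuracy(samples: List[Sample]) -> float:
    """Leave-one-out is n-fold cross-validation with one sample per fold."""
    return kfold_accuracy(samples, n_folds=len(samples))


# Toy data only: two well-separated clusters with illustrative labels.
toy = [
    ((0.0, 0.0), "punctum"), ((0.1, 0.0), "punctum"), ((0.0, 0.1), "punctum"),
    ((5.0, 5.0), "clivis"), ((5.1, 5.0), "clivis"), ((5.0, 5.1), "clivis"),
]
print(kfold_accuracy(toy, n_folds=3), loo_accuracy(toy))
```

Both scores only measure how well the training set predicts itself, which is why a high figure here can coexist with poor accuracy on a page from a different manuscript.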

Associated Scripts & Changes

  • test_real_input_knn.py — Added the new pipeline/KNN test case using the real input.
  • evaluate.py — Used to execute the evaluation pipeline.
  • visualize.py — Used to analyze and visualize the poor classification results.
  • conftest.py — Updated test fixtures and configurations to support the new fixture paths.

Next Steps

  • Manually label the /core/tests/NZ-Wt MSR-03 109v page to provide accurate training data for the new manuscript type.

Contributor

Copilot AI left a comment

Pull request overview

Adds a real-world, end-to-end pipeline test and supporting utilities to evaluate KNN classification using a legacy GameraXML training set against a new manuscript page sample. This helps quantify generalization gaps and provides tooling to inspect prediction quality visually.

Changes:

  • Adds an integration-style pytest module covering smoke, determinism, and (optional) 5-fold / LOO accuracy checks.
  • Adds test-support helpers to ingest the sample page, run classification, and print summary reports.
  • Adds a visualization script plus a CSV vocabulary fixture used for label sanity checks.

Reviewed changes

Copilot reviewed 5 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| core/tests/test_real_input_knn.py | New end-to-end + CV/LOO tests for real sample input and legacy XML training DB |
| core/tests/sample_input/evaluate.py | Shared helpers to ingest/filter annotations, classify the sample page, and print a summary |
| core/tests/sample_input/visualize.py | Script to render annotation/prediction overlays for manual inspection |
| core/tests/sample_input/csv-square_notation_neume_level_newest.csv | Canonical vocabulary CSV fixture used for label sanity checks |
| core/tests/conftest.py | Registers the slow marker used by the new integration tests |
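
As a point of reference for the conftest.py entry, a custom marker is commonly registered through the `pytest_configure` hook. The sketch below shows that pattern; the marker description text is an assumption, and the PR's actual conftest.py may register the marker differently.

```python
# conftest.py (sketch; the description string is an assumption)

def pytest_configure(config):
    """Register the custom 'slow' marker so pytest does not warn about it."""
    config.addinivalue_line(
        "markers",
        "slow: long-running integration test (deselect with -m 'not slow')",
    )
```

With the marker registered, a test decorated with `@pytest.mark.slow` can be skipped in quick runs via `pytest -m "not slow"`.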
Comments suppressed due to low confidence (2)

core/tests/test_real_input_knn.py:94

  • test_real_page_smoke requests the training_db fixture but never uses it (and classify_page() reloads the XML training set internally). Keeping the unused fixture parameter forces an extra XML parse and feature extraction work; drop the fixture from the signature or refactor classify_page to accept a preloaded training set.
def test_real_page_smoke(page_glyphs, training_db, vocab):
    classified, classifier = classify_page()
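
One way to read the second option in this comment is sketched below: the classify function takes an optional preloaded training set and only loads one itself when none is given. This is a toy stand-in, not the PR's `classify_page`; the signature, `load_training_set`, `LOAD_STATS`, and the one-dimensional "features" are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Counts expensive loads, only to make the saving visible in this sketch.
LOAD_STATS: Dict[str, int] = {"calls": 0}

TrainingSet = List[Tuple[float, str]]  # hypothetical: (feature value, label)


def load_training_set() -> TrainingSet:
    """Stand-in for parsing the training XML and extracting features."""
    LOAD_STATS["calls"] += 1
    return [(0.0, "a"), (10.0, "b")]


def classify_page(page: List[float],
                  training_set: Optional[TrainingSet] = None) -> List[str]:
    """Label each page value with the label of its nearest training value.

    Loads the training set only when the caller did not supply one.
    """
    if training_set is None:
        training_set = load_training_set()
    return [min(training_set, key=lambda t: abs(t[0] - v))[1] for v in page]
```

A test that already receives the training set from a fixture can then pass it straight through, so the expensive load happens once per fixture scope rather than once per call.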

core/tests/test_real_input_knn.py:234

  • The comment says “Index the full DB by id for O(1) … lookup”, but the implementation rebuilds train by scanning training_db for every held-out glyph ([g for g in training_db if g.id != held_out.id]), which is O(N) per iteration and will be noticeably slow for larger IC_LOO_LIMIT. Consider precomputing an id -> index map and using list slicing (training_db[:i] + training_db[i+1:]) or similar to avoid repeated full scans, and update the comment accordingly.
    # Index the full DB by id for O(1) "everyone except this glyph" lookup.
    correct = 0
    for held_out in subset:
        train = [g for g in training_db if g.id != held_out.id]
        clf = InteractiveClassifier(k=1).fit(train)
        pred = clf.predict(held_out)
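
The slicing approach suggested in this comment could look roughly like the sketch below. It is self-contained and does not use `InteractiveClassifier`; `Glyph`, `nearest_label`, and `loo_accuracy` are hypothetical names, the `limit` parameter only mimics the role of `IC_LOO_LIMIT`, and the sketch assumes at least two glyphs.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical glyph shape: (feature vector, label).
Glyph = Tuple[Sequence[float], str]


def nearest_label(train: List[Glyph], features: Sequence[float]) -> str:
    """1-NN prediction: label of the closest training glyph."""
    return min(train, key=lambda g: math.dist(g[0], features))[1]


def loo_accuracy(training_db: List[Glyph], limit: Optional[int] = None) -> float:
    """Leave-one-out accuracy over the first `limit` glyphs (all if None)."""
    n = len(training_db) if limit is None else min(limit, len(training_db))
    correct = 0
    for i in range(n):
        features, label = training_db[i]
        # Hold out glyph i by position instead of comparing ids on every element.
        train = training_db[:i] + training_db[i + 1:]
        if nearest_label(train, features) == label:
            correct += 1
    return correct / n


# Toy data only: two well-separated clusters.
db = [((0.0, 0.0), "x"), ((0.1, 0.1), "x"), ((9.0, 9.0), "y"), ((9.1, 9.1), "y")]
print(loo_accuracy(db), loo_accuracy(db, limit=2))
```

Note that the slice still copies N-1 references per iteration, so this is a constant-factor saving over the id-comparison scan rather than a change in asymptotic cost; the comment in the test would still need updating either way.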

@kyrieb-ekat

As a quick note, the training data was produced from a square notation manuscript; the manuscript page you used/that is featured in the PR is in hufnagel script, which will have contributed to the poor performance. The detection is very impressive, however!

@Yueqiao12Zhang
Contributor Author

> As a quick note, the training data was produced from a square notation manuscript; the manuscript page you used/that is featured in the PR is in hufnagel script, which will have contributed to the poor performance. The detection is very impressive, however!

Yes, I noticed that, and I am currently investigating the similarities and differences in the neume shapes. I will look into the Hufnagel script and manually label a small sample.

@kyrieb-ekat

I also posted this on the Neon issue, but here's a quick Hufnagel sample annotation with neume labels:
rapid_hufnagel_annotation.csv

@Yueqiao12Zhang
Contributor Author

Update: Using the Hufnagel annotation sample from Kyrie, the new predictions for the NZ-Wt page show satisfactory results on a 98-glyph training set.

Development

Successfully merging this pull request may close these issues.

Add a minimal non-frontend test workflow for the Interactive Classifier

3 participants