Automated JSON Schema Validation Against Sample Data #8047
Co-authored-by: pethers <[email protected]>
📊 JSON Schema Validation Results
Validation Summary:
Schema Status
📄 Full Report: See uploaded artifacts for detailed validation report
Pull request overview
This pull request implements automated validation of 5 JSON export schemas against 142 CSV sample data files to detect schema-data mismatches and ensure correctness. The solution includes a Python validation script (506 lines), GitHub Actions workflow for CI/CD integration, and comprehensive documentation.
Key Changes:
- Automated Python validation engine with flexible field matching and type inference
- CI/CD integration via GitHub Actions (daily runs, PR checks, manual dispatch)
- Generated validation reports identifying 127 field mismatches across schemas
- Documentation of field mappings between JSON schemas and database columns
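As a rough illustration of what "flexible field matching" between JSON schema fields and CSV columns might look like, here is a minimal sketch. It assumes matching is done by ignoring case and underscores; the function names, the sample CSV header, and the schema field names are all hypothetical and are not taken from validate_schemas.py.

```python
import csv
import io


def normalize(name: str) -> str:
    """Lowercase a name and drop underscores so camelCase and snake_case compare equal."""
    return name.replace("_", "").lower()


def find_mismatches(schema_fields, csv_columns):
    """Return schema fields that have no matching CSV column after normalization."""
    available = {normalize(col) for col in csv_columns}
    return [field for field in schema_fields if normalize(field) not in available]


# Hypothetical sample data standing in for one of the CSV sample files.
sample_csv = io.StringIO("person_id,first_name,party\n1,Anna,S\n")
columns = next(csv.reader(sample_csv))

# Hypothetical schema field names; only votingRecord lacks a CSV counterpart.
missing = find_mismatches(["personId", "firstName", "votingRecord"], columns)
print(missing)
```

A real implementation would likely need explicit mappings for computed fields, which is what FIELD_MAPPING.md documents.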
Reviewed changes
Copilot reviewed 8 out of 10 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| json-export-specs/validate_schemas.py | Core validation script parsing schema markdown, analyzing CSV data, flexible field matching |
| .github/workflows/validate-json-schemas.yml | GitHub Actions workflow with automatic PR comments and issue creation on failures |
| json-export-specs/schemas/validation-results.json | Machine-readable validation output (1,209 lines) |
| json-export-specs/schemas/SCHEMA_VALIDATION_REPORT.md | Human-readable validation report with detailed findings |
| json-export-specs/VALIDATION_README.md | Tool usage documentation and methodology |
| json-export-specs/FIELD_MAPPING.md | Explicit JSON↔Database field mappings with computed field logic |
| json-export-specs/IMPLEMENTATION_SUMMARY.md | Executive summary of implementation and findings |
| json-export-specs/README.md | Added validation status section with badge |
| .gitignore | Added json-export-specs/__pycache__/ exclusion |
| json-export-specs/__pycache__/validate_schemas.cpython-312.pyc | Compiled Python bytecode (should not be committed) |
| **Politician** | `view_riksdagen_politician`, `view_riksdagen_politician_summary`, `view_riksdagen_politician_document`, `view_riksdagen_politician_ballot` |
| **Party** | `view_riksdagen_party`, `view_riksdagen_party_summary`, `view_riksdagen_party_document_summary`, `view_riksdagen_party_ballot` |
| **Committee** | `view_riksdagen_committee`, `view_riksdagen_committee_decisions`, `view_riksdagen_committee_roles`, `view_riksdagen_committee_ballot_decision` |
| **Ministry** | `view_riksdagen_goverment`, `view_riksdagen_goverment_roles`, `view_ministry_decision_impact`, `view_ministry_effectiveness_trends` |
Copilot AI, Dec 8, 2025
Spelling error: "goverment" should be "government". This typo appears in the view name references.
Suggested change:
- | **Ministry** | `view_riksdagen_goverment`, `view_riksdagen_goverment_roles`, `view_ministry_decision_impact`, `view_ministry_effectiveness_trends` |
+ | **Ministry** | `view_riksdagen_government`, `view_riksdagen_government_roles`, `view_ministry_decision_impact`, `view_ministry_effectiveness_trends` |
Fixed in e625cae. Changed "goverment" to "government" in view name references.
json-export-specs/FIELD_MAPPING.md
Outdated
| `id` | `ministry_code` | string | `view_riksdagen_goverment` |
| `name` | `details` | string | `view_riksdagen_goverment` |
| `minister` | - | string | `view_riksdagen_goverment_role_member` |
Copilot AI, Dec 8, 2025
Spelling error: "goverment" should be "government". This typo appears multiple times in view name references (lines 149-151).
Suggested change:
- | `id` | `ministry_code` | string | `view_riksdagen_goverment` |
- | `name` | `details` | string | `view_riksdagen_goverment` |
- | `minister` | - | string | `view_riksdagen_goverment_role_member` |
+ | `id` | `ministry_code` | string | `view_riksdagen_government` |
+ | `name` | `details` | string | `view_riksdagen_government` |
+ | `minister` | - | string | `view_riksdagen_government_role_member` |
Fixed in e625cae. All 3 occurrences of "goverment" corrected to "government".
    "column_types": self._infer_column_types(rows, columns)
}

except Exception as e:
Copilot AI, Dec 8, 2025
The broad except Exception as e handler at line 149 catches every exception derived from Exception when reading CSV files, including unrelated programming errors such as TypeError or AttributeError. While the error is logged, this could mask serious issues during execution.
Consider catching more specific exceptions:
    except (IOError, csv.Error, UnicodeDecodeError) as e:
        print(f" ✗ Error reading {csv_file.name}: {e}")
This keeps file-related errors handled while letting unexpected exceptions propagate.
Suggested change:
- except Exception as e:
+ except (IOError, csv.Error, UnicodeDecodeError) as e:
Fixed in e625cae. Changed to catch specific exceptions: IOError, csv.Error, UnicodeDecodeError. Critical exceptions like KeyboardInterrupt now propagate properly.
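For reference, a minimal sketch of the narrowed handler is shown below. The read_rows helper and the file name are hypothetical, not the PR's actual code. One detail worth keeping in mind: in Python 3, KeyboardInterrupt and SystemExit derive from BaseException rather than Exception, so a plain except Exception would not intercept them either. The practical gain from narrowing is that unrelated bugs are no longer swallowed.

```python
import csv
from pathlib import Path


def read_rows(csv_file: Path):
    """Return the rows of a CSV file, or None if the file cannot be read."""
    try:
        with csv_file.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    except (IOError, csv.Error, UnicodeDecodeError) as e:
        # IOError is an alias of OSError in Python 3, so a missing file lands here.
        print(f"  ✗ Error reading {csv_file.name}: {e}")
        return None


# KeyboardInterrupt is not a subclass of Exception; UnicodeDecodeError is.
print(issubclass(KeyboardInterrupt, Exception))
print(issubclass(UnicodeDecodeError, Exception))

# Hypothetical missing file: the error is reported and None is returned.
result = read_rows(Path("does-not-exist-sample.csv"))
print(result)
```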
schema_info["fields"][field_name] = {
    "type": field_type,
    "required": True  # Default assumption
}
Copilot AI, Dec 8, 2025
The mermaid field extraction pattern r'\+(\w+)\s+(\w+)' is too broad and may incorrectly match non-field content. For example, in the validation results (line 34), a field named "8" was extracted, which is likely a false match from mermaid diagram syntax (e.g., list items or other markers).
Consider making the pattern more specific by:
- Ensuring it only matches within mermaid class diagrams
- Validating that extracted field names are valid identifiers (start with letter, not just digits)
- Adding a filter to exclude numeric-only field names
Example fix:
    if field_name and not field_name.isdigit():  # Add validation
        schema_info["fields"][field_name] = {
            "type": field_type,
            "required": True
        }
Suggested change:
- schema_info["fields"][field_name] = {
-     "type": field_type,
-     "required": True  # Default assumption
- }
+ # Only add field if it starts with a letter and is not numeric-only
+ if re.match(r'^[A-Za-z]\w*$', field_name):
+     schema_info["fields"][field_name] = {
+         "type": field_type,
+         "required": True  # Default assumption
+     }
Fixed in e625cae. Added validation to ensure field names start with a letter: ^[A-Za-z]\w*$. The invalid "8" field is now excluded from extraction. Total field mismatches reduced from 127 to 125.
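The effect of the added check can be sketched as follows. The two regular expressions are the ones quoted in this thread. The extract_fields function, the assumption that the first capture group is the type and the second is the field name, and the sample diagram are all hypothetical.

```python
import re

# Pattern quoted in the review for class-diagram members such as "+string name".
MEMBER_PATTERN = re.compile(r"\+(\w+)\s+(\w+)")
# Validation pattern from the fix: names must start with a letter.
VALID_FIELD_NAME = re.compile(r"^[A-Za-z]\w*$")


def extract_fields(mermaid_text: str) -> dict:
    """Collect members whose name is a valid identifier, skipping names like "8"."""
    fields = {}
    # Assumption: group 1 is the type and group 2 is the field name.
    for field_type, field_name in MEMBER_PATTERN.findall(mermaid_text):
        if VALID_FIELD_NAME.match(field_name):
            fields[field_name] = {"type": field_type, "required": True}
    return fields


# Hypothetical diagram containing one bogus numeric "field".
diagram = """
class Politician {
    +string id
    +int 8
    +string firstName
}
"""
fields = extract_fields(diagram)
print(sorted(fields))
```

Since \w also matches digits, the member pattern alone accepts "8" as a name; the second pattern is what filters it out.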
    "field": "8",
    "issue": "Field defined in schema but not found in data",
    "suggestions": [
        "8"
    ]
},
{
Copilot AI, Dec 8, 2025
In the validation results JSON, the field "8" appears in both politician and party schemas (lines 34 and 353), which confirms the issue with the mermaid pattern extraction. This invalid field name pollutes the validation results and creates meaningless field mismatches.
This is a consequence of the overly broad regex pattern in the schema parsing logic.
Suggested change:
-     "field": "8",
-     "issue": "Field defined in schema but not found in data",
-     "suggestions": [
-         "8"
-     ]
- },
- {
Fixed in e625cae. The invalid "8" field has been eliminated by adding field name validation in the mermaid pattern extraction logic.
    cache: 'pip'

- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    # No external pip packages required - script uses only Python standard library
Copilot AI, Dec 8, 2025
[nitpick] The "Install dependencies" step (lines 40-43) upgrades pip but installs no packages since the script uses only the Python standard library. The cache: 'pip' option on line 38 is also unnecessary in this case.
Consider either:
- Removing the cache configuration and simplifying the install step
- Adding a comment explaining why pip is upgraded despite no dependencies
This would reduce workflow execution time slightly and improve clarity.
Suggested change:
-     cache: 'pip'
- - name: Install dependencies
-   run: |
-     python -m pip install --upgrade pip
-     # No external pip packages required - script uses only Python standard library
+ # No dependencies to install - script uses only Python standard library
Fixed in e625cae. Removed the cache: 'pip' option and "Install dependencies" step. Added comment: "No dependencies to install - script uses only Python standard library".
"This report validates the 5 JSON export schemas against 142 real CSV sample data files "
"from the CIA database to ensure schema correctness and identify gaps between "
Copilot AI, Dec 8, 2025
Implicit string concatenation. Maybe missing a comma?
Suggested change:
- "This report validates the 5 JSON export schemas against 142 real CSV sample data files "
- "from the CIA database to ensure schema correctness and identify gaps between "
+ "This report validates the 5 JSON export schemas against 142 real CSV sample data files ",
+ "from the CIA database to ensure schema correctness and identify gaps between ",
Fixed in e625cae. Added explicit commas between the strings to prevent implicit string concatenation.
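The behaviour behind this warning can be sketched as below, assuming the strings sit inside a list literal (the thread does not show the surrounding code, so that is an assumption, and the shortened strings are illustrative). Adjacent string literals without a comma are merged by Python into a single element, while commas make each literal its own element.

```python
# Without a comma between them, adjacent literals merge into one list element.
implicit = [
    "This report validates the schemas "
    "against sample data.",
]

# With commas, each literal is a separate list element.
explicit = [
    "This report validates the schemas ",
    "against sample data.",
]

print(len(implicit), len(explicit))

# Joined with an empty separator, both produce the same text.
same_text = "".join(implicit) == "".join(explicit)
print(same_text)
```

Whether the comma is the right fix depends on how the list is consumed: joining with an empty string gives identical output, but joining with a newline would split the sentence across two lines.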
import csv
import json
import os
Copilot AI, Dec 8, 2025
Import of 'os' is not used.
Suggested change:
- import os
Fixed in e625cae. Removed unused os import.
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Set, Tuple, Any
Copilot AI, Dec 8, 2025
Import of 'Set' is not used.
Suggested change:
- from typing import Dict, List, Set, Tuple, Any
+ from typing import Dict, List, Tuple, Any
Fixed in e625cae. Removed unused Set from typing imports.
    # This is a data structure example
    schema_info["example_json_paths"].append(self._extract_paths(parsed.get("data", [])))
except json.JSONDecodeError:
    pass
Copilot AI, Dec 8, 2025
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
- pass
+ print(f"⚠️ Warning: Failed to parse JSON example in {schema_path}. Example snippet: {example[:60]}...")
Fixed in e625cae. Added explanatory comment: "Skip invalid JSON examples in markdown".
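A minimal sketch of the commented handler is shown below. The parse_examples function and the sample snippets are hypothetical; counting the skipped examples is an addition for illustration and is not something the PR is stated to do.

```python
import json


def parse_examples(examples):
    """Parse JSON example snippets, returning (parsed_values, skipped_count)."""
    parsed, skipped = [], 0
    for example in examples:
        try:
            parsed.append(json.loads(example))
        except json.JSONDecodeError:
            # Skip invalid JSON examples in markdown (e.g. snippets with "..." placeholders).
            skipped += 1
    return parsed, skipped


# Hypothetical snippets: the second one contains a placeholder and is not valid JSON.
good, skipped = parse_examples(['{"data": [1, 2]}', '{"data": [ ... ]}'])
print(good, skipped)
```

Tracking the count keeps the handler silent per item while still allowing a summary line at the end of a run.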
@copilot apply changes based on the comments in this thread
…lidate field names Co-authored-by: pethers <[email protected]>
Pull request overview
Copilot reviewed 8 out of 10 changed files in this pull request and generated no new comments.
Dependency Review
✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.
Snapshot Warnings
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.
Signed-off-by: James Pether Sörling <[email protected]>
Schema Validation Against Sample Data - PR Review Fixes Applied ✅
All feedback from PR review has been addressed.
✅ Changes Made (Based on Review Comments)
1. Fixed Spelling Errors
2. Improved Exception Handling
- Changed except Exception to specific exceptions IOError, csv.Error, UnicodeDecodeError
- Allows KeyboardInterrupt to propagate
3. Fixed Mermaid Field Extraction
- Field names must match ^[A-Za-z]\w*$
4. Removed Unused Imports
- Removed os import
- Removed Set from typing imports
5. Added Explanatory Comment
6. Fixed String Concatenation
7. Workflow Optimization
- Removed cache: 'pip' option (no dependencies to cache)
- continue-on-error: true is needed for artifact upload
✅ Validation Results (After Fixes)
Field Count Changes:
🧪 Testing
Validated all changes:
📝 Files Modified
- .github/workflows/validate-json-schemas.yml (8 lines changed)
- json-export-specs/FIELD_MAPPING.md (6 lines changed)
- json-export-specs/VALIDATION_README.md (2 lines changed)
- json-export-specs/validate_schemas.py (20 lines changed)
- json-export-specs/schemas/SCHEMA_VALIDATION_REPORT.md (regenerated)
- json-export-specs/schemas/validation-results.json (regenerated)
Status: ✅ ALL REVIEW COMMENTS ADDRESSED
Quality: ✅ CODE REVIEW FEEDBACK IMPLEMENTED
Testing: ✅ VALIDATED AND WORKING