
feat: Add infer_schema with confidence-scored type inference (#855) #2580

Open
dynamo-pentester wants to merge 1 commit into im-anishraj:main from dynamo-pentester:feat/infer-schema

Conversation

@dynamo-pentester

Contributor

Summary

Implements automatic schema/data type inference with confidence scoring, as scoped in #855.

Adds infer_schema(frame: ArFrame) -> InferredSchema to the quality layer. For each column, it returns a ColumnInference with a best-guess inferred_type, a deterministic confidence score in [0.0, 1.0], an is_ambiguous flag, and the full candidates score breakdown.
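For reviewers who want a picture of the result shape before opening the diff, here is a minimal sketch of the two containers. The field names (inferred_type, confidence, is_ambiguous, candidates, columns) come from the summary above. Everything else is an assumption made for illustration and may differ from the code in arnio/quality.py, including the type annotations and the candidate ordering rule (score descending, then name).

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Dict


# Hypothetical stand-ins for the PR's dataclasses, not the actual implementation.
@dataclass(frozen=True)
class ColumnInference:
    inferred_type: str            # best-guess type, e.g. "int64"
    confidence: float             # score in [0.0, 1.0]
    is_ambiguous: bool            # True when a runner-up scores close to the top
    candidates: Dict[str, float]  # full per-type score breakdown

    def to_dict(self) -> dict:
        # Assumed ordering: highest score first, ties broken by type name,
        # so repeated calls always serialize identically.
        ordered = sorted(self.candidates.items(), key=lambda kv: (-kv[1], kv[0]))
        return {
            "inferred_type": self.inferred_type,
            "confidence": self.confidence,
            "is_ambiguous": self.is_ambiguous,
            "candidates": dict(ordered),
        }


@dataclass(frozen=True)
class InferredSchema:
    columns: Dict[str, ColumnInference]

    def to_dict(self) -> dict:
        # Dicts preserve insertion order, so column order follows the input.
        return {name: col.to_dict() for name, col in self.columns.items()}


age = ColumnInference(
    inferred_type="int64",
    confidence=0.95,
    is_ambiguous=False,
    candidates={"string": 0.10, "int64": 0.95, "float64": 0.60},
)
schema = InferredSchema(columns={"age": age})

# Only str, float, bool and dict values are involved, so json.dumps works directly.
payload = json.dumps(schema.to_dict())
```

Freezing the dataclasses prevents fields from being reassigned on a result object after inference, although the candidates dict itself remains mutable in this sketch.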

Closes #855

What changed

  • arnio/quality.py

    • Added ColumnInference and InferredSchema frozen dataclasses
    • Added infer_schema(frame) — a thin layer over the existing profile() output (reuses suggested_dtype, semantic_type, null_ratio, unique_ratio; no second parsing engine)
    • Defined _INFER_CANDIDATE_TYPES (the six supported types: int64, float64, bool, datetime, categorical, string) and _AMBIGUITY_THRESHOLD = 0.15 in one place
    • ColumnInference.to_dict() / InferredSchema.to_dict() — fully JSON-safe, deterministic column/candidate ordering
    • InferredSchema.to_schema(): maps only to the dtypes Field/Schema already accept (int64, float64, bool, datetime, string); categorical → string. All fields default to nullable=True as the conservative choice
    • Imported Field and Schema from .schema for the to_schema() return type
  • arnio/__init__.py

    • Exported ColumnInference, InferredSchema, infer_schema
  • tests/test_infer_schema.py (new)

    • Primitive type detection (int64, float64, bool, datetime, string)
    • Boolean variants: yes/no, true/false, 1/0
    • Multiple datetime formats
    • Mixed numeric/string columns
    • High-null and all-null columns (predictable fallback, no misleading high confidence)
    • Ambiguity boundary (second candidate within 0.15 of top)
    • Categorical detection (documented rule: ≥2 distinct values AND unique_ratio ≤ 0.20 — not uniqueness alone)
    • Score validation: all confidence/candidate scores finite and in [0.0, 1.0]
    • Deterministic output across repeated calls
    • to_dict() JSON serialization and ordering
    • to_schema() → valid Schema, correct dtype mappings, usable with ar.validate()
    • Input validation (TypeError on non-ArFrame)
  • website/api.html

    • Added infer_schema(frame) entry to the Quality section, documenting ColumnInference fields and InferredSchema methods
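The documented rules in the list above (ambiguity threshold, categorical detection, score validation, and the categorical → string mapping) can be sketched in isolation as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names are hypothetical, the ambiguity comparison is assumed to be inclusive, and unique_ratio is assumed to be the number of distinct values divided by the number of non-null values.

```python
import math
from typing import Dict, List, Optional

AMBIGUITY_THRESHOLD = 0.15           # value stated in this PR
CATEGORICAL_MAX_UNIQUE_RATIO = 0.20  # value stated in this PR

# Mapping described for to_schema(): only dtypes Field/Schema already accept.
DTYPE_FOR_SCHEMA = {
    "int64": "int64",
    "float64": "float64",
    "bool": "bool",
    "datetime": "datetime",
    "string": "string",
    "categorical": "string",
}


def is_ambiguous(candidates: Dict[str, float]) -> bool:
    """Hypothetical helper: True when the runner-up is within the threshold of the top score."""
    scores = sorted(candidates.values(), reverse=True)
    if len(scores) < 2:
        return False  # nothing to be confused with
    # Assumption: the boundary is inclusive.
    return (scores[0] - scores[1]) <= AMBIGUITY_THRESHOLD


def looks_categorical(values: List[Optional[str]]) -> bool:
    """Hypothetical helper: at least 2 distinct values AND unique_ratio <= 0.20."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False  # an all-null column gives no evidence
    distinct = len(set(non_null))
    unique_ratio = distinct / len(non_null)  # assumed definition
    return distinct >= 2 and unique_ratio <= CATEGORICAL_MAX_UNIQUE_RATIO


def scores_are_valid(candidates: Dict[str, float]) -> bool:
    """Every score must be finite and lie in [0.0, 1.0]."""
    return all(math.isfinite(s) and 0.0 <= s <= 1.0 for s in candidates.values())


close_call = is_ambiguous({"int64": 0.90, "float64": 0.80})  # gap of about 0.10
clear_win = is_ambiguous({"int64": 0.90, "string": 0.50})    # gap of about 0.40
status_col = looks_categorical(["open", "closed"] * 10)      # 2 distinct out of 20
constant_col = looks_categorical(["open"] * 20)              # only 1 distinct value
id_col = looks_categorical([str(i) for i in range(20)])      # every value unique
```

Requiring at least two distinct values keeps a constant column from being labelled categorical merely because its unique_ratio is low, which matches the "not uniqueness alone" note in the test list.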

Out of scope (untouched)

  • profile(), compare_profiles(), detect_drift() / DriftReport
  • Schema, Field, validators (only consumed via to_schema())
  • C++ core, CSV parser, pipeline

How to test

pip install -e ".[dev]"
pytest tests/test_infer_schema.py -v
pytest tests/ -q   # full suite, ensure no regressions

Example usage

import arnio as ar

frame = ar.read_csv("data.csv")
schema = ar.infer_schema(frame)

for name, col in schema.columns.items():
    print(name, col.inferred_type, f"{col.confidence:.2f}", "ambiguous" if col.is_ambiguous else "")

# Pipe directly into validate()
result = ar.validate(frame, schema.to_schema())

@vercel

vercel Bot commented Jun 13, 2026


@dynamo-pentester is attempting to deploy a commit to the xtylishanish-gmailcom's projects Team on Vercel.

A member of the Team first needs to authorize it.


Development

Successfully merging this pull request may close these issues.

Feature: Add automatic schema/data type inference with confidence scoring
