Releases: pipe1os/modelinfo-cli
v1.4.2
ModelInfo CLI (v1.4.2)
Fixed
- Fixed unused imports (`os`, `json`) in architecture parsing logic.
- Fixed a bug where the `--max-vram` argument was ignored when evaluating single models without a target GPU.
- Fixed a bug where the target GPU's memory limit was ignored in favor of the default max VRAM when rendering the multi-model comparison table.
- Prevented a potential `ValueError` crash in the remote fetcher by enforcing a minimum of 1 worker for the `ThreadPoolExecutor`.
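
The worker-count fix can be illustrated with a minimal sketch. `ThreadPoolExecutor` raises `ValueError` when `max_workers` is 0 or negative, which can happen if the pool size is derived from an empty task list. The names `pick_worker_count` and `fetch_all` are hypothetical and not taken from the project's source; the cap of 8 mirrors the limit mentioned in the v1.2.0 notes.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # cap mentioned in the v1.2.0 notes


def pick_worker_count(n_tasks: int, cap: int = MAX_WORKERS) -> int:
    """Clamp the pool size to the range [1, cap] (hypothetical helper)."""
    return max(1, min(cap, n_tasks))


def fetch_all(tasks, fetch):
    """Run `fetch` over `tasks` concurrently, preserving input order."""
    workers = pick_worker_count(len(tasks))
    # Without the clamp, an empty task list would give max_workers=0,
    # and ThreadPoolExecutor would raise ValueError.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, tasks))
```

With the clamp in place, an empty shard list simply produces an empty result instead of crashing.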
v1.4.1
v1.4.0
ModelInfo CLI (v1.4.0)
This release adds multi-GPU hardware topology modeling and a subtractive vLLM memory engine for inference planning. We also overhauled how remote Hub interactions work to speed up metadata fetching.
Added
- Added the `--vllm` flag to switch from additive VRAM checks to subtractive "Serving Capacity" estimates. It calculates PagedAttention block limits based on a configurable `--gpu-util` ratio.
- Topology-Aware Overhead Scaling: Added `--topology` (`nvlink`, `pcie4`, `pcie3`) and `--strategy` (`tp`, `pp`) flags. The calculator now applies NCCL communication penalties directly to weights and activations instead of using a generic fixed multiplier.
- Mapped explicit `ggml_type` enums (0-33) for GGUF files to fix VRAM under-reporting for specific quantization types.
- The CLI now does algorithmic estimation via `index.json` by default. If you need the exact size breakdown of every tensor, pass `--tensors` to force it to fetch all remote shards.
- Added comprehensive `pytest` test coverage for the new vLLM subtractive math engine, topology penalties, and explicit GGUF byte mappings.
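
The difference between an additive check and a subtractive "Serving Capacity" estimate can be sketched as follows. This is an illustration under stated assumptions, not the project's actual engine: the function name, its parameters, and the defaults (`gpu_util=0.9`, `block_size=16`) are hypothetical choices for this example. The idea is to start from the usable share of VRAM, subtract weights and fixed overhead, and divide what remains into fixed-size KV cache blocks.

```python
from dataclasses import dataclass


@dataclass
class ServingCapacity:
    kv_budget_bytes: int  # memory left for the KV cache
    num_blocks: int       # whole PagedAttention-style blocks that fit
    max_tokens: int       # tokens those blocks can hold in total


def serving_capacity(
    total_vram_bytes: int,
    weight_bytes: int,
    overhead_bytes: int,
    kv_bytes_per_token: int,
    gpu_util: float = 0.9,   # assumed default for this sketch
    block_size: int = 16,    # tokens per block, assumed for this sketch
) -> ServingCapacity:
    """Subtractive estimate: usable VRAM minus fixed costs, split into blocks."""
    usable = int(total_vram_bytes * gpu_util)
    kv_budget = max(0, usable - weight_bytes - overhead_bytes)
    block_bytes = kv_bytes_per_token * block_size
    num_blocks = kv_budget // block_bytes
    return ServingCapacity(kv_budget, num_blocks, num_blocks * block_size)
```

An additive check answers "do weights plus cache fit?", while this form answers "how many tokens can be served once the weights are resident?".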
Changed
- Removed KV cache from the distributed overhead multiplier because Tensor Parallelism partitions context blocks rather than duplicating them.
- Changed the network logic to infer metadata directly from `index.json` and `config.json`. It skips iterative chunk requests for sharded arrays unless `--tensors` is passed.
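
As a rough sketch of why the index file alone can be enough for an estimate: a Hugging Face sharded checkpoint index typically carries a `metadata.total_size` field (total tensor bytes) and a `weight_map` from tensor names to shard files. The helper below is hypothetical and only shows how those two fields could be summarized without touching any shard.

```python
import json


def summarize_index(index_text: str) -> dict:
    """Summarize a sharded-checkpoint index without fetching shards."""
    index = json.loads(index_text)
    # total_size may be missing in hand-written indexes, hence .get().
    total = index.get("metadata", {}).get("total_size")
    weight_map = index.get("weight_map", {})
    shards = sorted(set(weight_map.values()))
    return {
        "total_bytes": total,
        "tensor_count": len(weight_map),
        "shards": shards,
    }
```

Per-tensor shapes and dtypes are not in the index, which is presumably why an exact breakdown still requires `--tensors`.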
v1.3.0
ModelInfo CLI (v1.3.0)
This release adds comprehensive hardware fit diagnostics, dynamic GPU scaling, and side-by-side model comparison to instantly evaluate operational deployment trade-offs.
Added
- Hardware Discovery Engine: Added the `--gpu` flag with multi-vendor normalization (NVIDIA, AMD, Intel) to calculate whether a model fits within specific hardware constraints. Supports named GPUs (`rtx4090`), explicit sizes (`24`), and native local hardware discovery (`auto`).
- Fragmentation Defense: Implemented a 3-tier UI heuristic (✓ Safe, ⚠ Warning, ✗ Fail) to defend against memory fragmentation and generation-time transient spikes.
- Side-by-Side Comparison: Passing multiple models via the CLI (`modelinfo modelA modelB`) now implicitly triggers a dedicated side-by-side comparison table, surfacing parameter counts, context lengths, and VRAM footprints to evaluate architectural trade-offs.
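
A 3-tier fit heuristic of this kind might look like the sketch below. The release notes do not state the actual thresholds, so the 90% headroom ratio and the function name `classify_fit` are assumptions made purely for illustration.

```python
def classify_fit(required_gb: float, capacity_gb: float, safe_ratio: float = 0.9) -> str:
    """Map a VRAM requirement onto three tiers (thresholds are assumed)."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    usage = required_gb / capacity_gb
    if usage <= safe_ratio:
        return "✓ Safe"      # headroom left for fragmentation and spikes
    if usage <= 1.0:
        return "⚠ Warning"   # fits on paper, little or no headroom
    return "✗ Fail"          # exceeds device memory
```

The middle tier is the interesting one: the model nominally fits, but fragmentation or a transient spike during generation could still cause an out-of-memory error.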
Changed
- Multi-GPU Overhead Scaling: The CUDA context initialization overhead (600 MB) now dynamically scales based on the detected `gpu_count` to prevent silent prefill OOMs on multi-GPU deployments.
- Mathematical Transparency: Added the `Dtype` column to the comparison table to show exactly why quantization scales VRAM footprints downward.
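
Both changes come down to simple arithmetic, sketched here under assumptions. The notes say the 600 MB overhead scales with `gpu_count` but not how, so linear scaling (one CUDA context per GPU) is assumed. The dtype table and function names are hypothetical.

```python
CUDA_CONTEXT_MB = 600  # per-context overhead from the release notes

# Bytes per parameter for a few common dtypes (illustrative subset).
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def overhead_mb(gpu_count: int) -> int:
    """Assumes one CUDA context per GPU, i.e. linear scaling."""
    return CUDA_CONTEXT_MB * max(1, gpu_count)


def weight_bytes(param_count: int, dtype: str) -> int:
    """Weights footprint: parameter count times bytes per parameter."""
    return param_count * DTYPE_BYTES[dtype]
```

Seen this way, the `Dtype` column explains the footprint directly: halving the bytes per parameter halves the weight memory.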
v1.2.0
ModelInfo CLI (v1.2.0)
This release adds remote Hugging Face Hub inspection, dynamic VRAM overhead modeling, and sensible context defaults for operational inference planning.
Added
- Remote Hugging Face Hub Support: Inspect any public or gated model directly via its repo ID (e.g., `modelinfo meta-llama/Llama-2-7b-hf`) without downloading the full checkpoint. Uses concurrent `Range` requests (max 8 workers) to extract the first 500 KB of safetensors shards to prevent synchronous I/O bottlenecks and bypass CDN rate-limits.
- Framework Overhead Modeling: VRAM estimates now include a static 600 MB CUDA context overhead alongside the model weights and KV cache for operational accuracy.
- Hierarchical VRAM UI: Redesigned the output terminal UI to group memory footprints into Weights, KV Cache, and Overhead.
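
Fetching only a prefix works because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing every tensor. The sketch below parses that header from a byte prefix and builds the matching `Range` header. The function names are hypothetical, and reading 500 KB as 500 × 1024 bytes is an assumption.

```python
import json
import struct


def range_header(n_bytes: int = 500 * 1024) -> dict:
    """HTTP header requesting the first n_bytes of a file."""
    return {"Range": f"bytes=0-{n_bytes - 1}"}


def parse_safetensors_header(prefix: bytes):
    """Parse the JSON header from the start of a safetensors file.

    Returns None when the header is longer than the fetched prefix,
    in which case a larger range would be needed.
    """
    if len(prefix) < 8:
        raise ValueError("need at least 8 bytes for the length field")
    (header_len,) = struct.unpack("<Q", prefix[:8])
    end = 8 + header_len
    if end > len(prefix):
        return None
    return json.loads(prefix[8:end].decode("utf-8"))
```

Each header entry lists a tensor's `dtype`, `shape`, and `data_offsets`, which is enough to compute sizes without reading the tensor data itself.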
Changed
- Sane Context Defaults: Hard-capped the default `--context` value at 8192 tokens. For models with extreme architectural boundaries (e.g., 128k), the native limit is still read and printed in the UI, but the cap protects users from unrealistic default memory calculations.
- Authentication Fallback: Remote HTTP fetcher now supports token extraction from the `HF_TOKEN` environment variable, `~/.cache/huggingface/token`, and the legacy `~/.huggingface/token` path.
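
The token fallback can be sketched as a simple ordered lookup. The order below follows the order in which the sources are listed in the note, which is an assumption about precedence. The function name and its injectable `env` and `home` parameters are hypothetical, added to keep the example testable.

```python
import os
from pathlib import Path


def resolve_hf_token(env=None, home=None):
    """Return the first non-empty token found, or None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token

    candidates = (
        home / ".cache" / "huggingface" / "token",  # current location
        home / ".huggingface" / "token",            # legacy location
    )
    for path in candidates:
        try:
            token = path.read_text(encoding="utf-8").strip()
        except OSError:
            continue  # missing or unreadable file: try the next source
        if token:
            return token
    return None
```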
v1.1.0
ModelInfo CLI (v1.1.0)
This release shifts the tool from heuristic guesswork to deterministic binary metadata extraction, introducing native context limit bounds checking and pure, decoupled math engines.
Core Engine
- GGUF Metadata Extraction: The parser now extracts global key-value pairs (e.g., `general.architecture`, `attention.head_count_kv`) from the binary file before parsing the tensor list, guaranteeing accurate architecture mapping.
- Context Limit Warnings: Extracts `max_position_embeddings` (Hugging Face) and `{gen_arch}.context_length` (GGUF) to actively warn users if the requested `--context` exceeds the model's native boundary.
- SafeTensors Config Fallback: Seamlessly reads adjacent `config.json` files for robust fallback parsing of architectures lacking explicit tensor structures.
- Fused Tensor Estimation: Handled mathematical edge cases for GQA, ALiBi, and older MHA models when extracting KV cache dimensionality from fused `qkv_proj.weight` tensors.
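
To illustrate the fused-tensor case: when Q, K, and V are stacked into one `qkv_proj.weight`, the number of KV heads can be recovered from the row count. The sketch below assumes the layout is Q, then K, then V, that the Q block has `hidden_size` rows, and that `head_dim = hidden_size / n_heads`; some architectures break these assumptions. The function name is hypothetical.

```python
def infer_kv_heads(qkv_rows: int, hidden_size: int, n_heads: int):
    """Return (n_kv_heads, head_dim) from a fused QKV weight's row count.

    Rows are assumed to be hidden_size (Q) + 2 * n_kv_heads * head_dim (K, V).
    """
    if hidden_size % n_heads:
        raise ValueError("hidden_size must be divisible by n_heads")
    head_dim = hidden_size // n_heads
    kv_rows = qkv_rows - hidden_size
    if kv_rows <= 0 or kv_rows % (2 * head_dim):
        raise ValueError("row count does not match a Q/K/V layout")
    return kv_rows // (2 * head_dim), head_dim
```

For a 4096-wide model with 32 heads, 12288 rows implies classic MHA (32 KV heads), while 6144 rows implies GQA with 8 KV heads, which shrinks the KV cache by a factor of four.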
UI & Architecture
- Decoupled Math Engine: Moved all filesystem operations and config parsing into `cli.py` to maintain Separation of Concerns, keeping `calculator.py` and `architecture.py` as pure, testable math engines.
- Terminal Aesthetics: Solved terminal text-wrapping bugs by constraining `rich` table columns and removed non-standard emojis to enforce a clean, professional Unix aesthetic.
Full Changelog: v1.0.0...v1.1.0
v1.0.0
ModelInfo CLI (v1.0.0)
Initial public release of the modelinfo-cli package.
Core Engine
- Zero-Dependency Parsers: Native standard-library (`os`, `struct`, `json`) binary deserialization for `.safetensors`, `.gguf`, and legacy `.pt` files.
- Sharded Architecture Support: Auto-detects and gracefully parses `model.safetensors.index.json` manifests without crashing on partial downloads.
- Hardware Calculator: Calculates exact VRAM footprints (including dynamic KV cache overhead) using precise GGUF block quantization ratios (Q8 through Q2).
- Restricted Unpickler: Defangs arbitrary code execution when inspecting legacy PyTorch checkpoints.
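
The restricted-unpickler idea follows the pattern described in the Python `pickle` documentation: subclass `pickle.Unpickler` and override `find_class` so that only allow-listed globals can be resolved. The sketch below is a generic illustration, not the project's implementation. The allow-list is hypothetical, and real PyTorch checkpoints need extra handling (zip containers, persistent IDs) that is omitted here.

```python
import io
import pickle

# Hypothetical allow-list: only these (module, name) globals may be loaded.
SAFE_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global.
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but refuses any global not on the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers and primitives load normally because they never go through `find_class`, while a payload that references something like `os.system` is rejected before it can run.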
CI/CD & Constraints
- Cross-platform testing matrix (`ubuntu`, `macos`, `windows` on Python 3.10-3.12) verifying binary struct unpacking behavior.
- Hardened pipeline constraints enforcing the sub-100ms startup latency by blocking heavy ML ecosystem imports (e.g., `torch`, `transformers`).