
Commit a8fdb01

chore: update project description to align with exact hardware capabilities
1 parent 0c6fcdf commit a8fdb01

5 files changed

Lines changed: 12 additions & 12 deletions

CHANGELOG.md

Lines changed: 3 additions & 3 deletions
@@ -8,14 +8,14 @@ and this project adheres to [Semantic Versioning](http://semver.org/).

 ## [1.4.0] - 2026-06-07

-This release adds multi-GPU hardware topology modeling and a subtractive vLLM memory engine for inference planning. We also overhauled how remote Hub interactions work to speed up metadata fetching.
+This release adds multi-GPU hardware topology modeling and a vLLM serving capacity engine for inference planning. We also overhauled how remote Hub interactions work to speed up metadata fetching.

 ### Added
-- Added the `--vllm` flag to switch from additive VRAM checks to subtractive "Serving Capacity" estimates. It calculates PagedAttention block limits based on a configurable `--gpu-util` ratio.
+- Added the `--vllm` flag to switch from additive VRAM checks to a "Serving Capacity" simulation. It calculates PagedAttention block limits based on a configurable `--gpu-util` ratio.
 - **Topology-Aware Overhead Scaling:** Added `--topology` (`nvlink`, `pcie4`, `pcie3`) and `--strategy` (`tp`, `pp`) flags. The calculator now applies NCCL communication penalties directly to weights and activations instead of using a generic fixed multiplier.
 - Mapped explicit `ggml_type` enums (0-33) for GGUF files to fix VRAM under-reporting for specific quantization types.
 - The CLI now does algorithmic estimation via `index.json` by default. If you need the exact size breakdown of every tensor, pass `--tensors` to force it to fetch all remote shards.
-- Added comprehensive `pytest` test coverage for the new vLLM subtractive math engine, topology penalties, and explicit GGUF byte mappings.
+- Added comprehensive `pytest` test coverage for the new vLLM serving capacity engine, topology penalties, and explicit GGUF byte mappings.

 ### Changed
 - Removed KV cache from the distributed overhead multiplier because Tensor Parallelism partitions context blocks rather than duplicating them.
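
Reviewer note: the changelog describes the `--vllm` mode as starting from a GPU memory budget (`--gpu-util` times total VRAM) and working out how many tokens fit in the PagedAttention pool. The project's actual calculation is not shown in this diff, so the following is only a minimal sketch of that idea. The function names, the flat `activation_bytes` reserve, and the block size of 16 tokens are assumptions made for illustration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def serving_capacity_tokens(total_vram_bytes: int, weight_bytes: int,
                            activation_bytes: int, bytes_per_token: int,
                            gpu_util: float = 0.9, block_size: int = 16) -> int:
    """Estimate how many tokens fit in the KV cache pool (hypothetical helper).

    Assumption: the pool is whatever remains of the utilization budget after
    weights and a flat activation reserve, and it is handed out in whole
    blocks of `block_size` tokens.
    """
    budget = int(total_vram_bytes * gpu_util)
    pool = budget - weight_bytes - activation_bytes
    if pool <= 0:
        return 0  # the model alone exceeds the budget
    blocks = pool // (bytes_per_token * block_size)
    return blocks * block_size


# Illustrative shape: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
tokens = serving_capacity_tokens(
    total_vram_bytes=24 * 1024**3,
    weight_bytes=13 * 1024**3,
    activation_bytes=1 * 1024**3,
    bytes_per_token=per_token,
)
print(per_token, tokens)
```

Rounding down to whole blocks keeps the estimate conservative: a partially filled block still reserves the memory of a full one.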

README.md

Lines changed: 4 additions & 4 deletions
@@ -8,14 +8,14 @@

 ModelInfo is a CLI tool that inspects machine learning model checkpoints (`.safetensors`, `.gguf`, `.pt`) and calculates hardware requirements completely offline.

-It reads binary headers directly using the Python standard library. By bypassing full tensor payload loading and strictly excluding heavy ecosystems like PyTorch or HuggingFace, the tool executes in under 100 milliseconds.
+It reads binary headers directly using the Python standard library. It skips the full tensor payload entirely (no PyTorch, no HuggingFace) and parses in under 100ms.

 ## Features

 - **Zero-Dependency Parsing**: Reads `.safetensors` 8-byte JSON prefixes and `.gguf` binary key-value metadata directly via `struct` and `json` (falling back to `config.json` if needed).
 - **Remote Hugging Face Hub Inspection**: Pass a repo ID (e.g., `meta-llama/Llama-2-7b-hf`) and it uses concurrent byte-range requests to read the headers off the CDN in under 2 seconds. No need to download the checkpoint.
 - Parses `model.safetensors.index.json` to support sharded models without crashing on partial downloads.
-- **Dynamic VRAM & Subtractive vLLM Math**: Calculates exact VRAM limits based on the model's architecture and your target context length. If you use the `--vllm` flag, it switches to a subtractive "Serving Capacity" engine that calculates exactly how many tokens fit in the PagedAttention pool based on your `--gpu-util` ratio.
+- **Dynamic VRAM & vLLM Capacity Planning**: Calculates exact VRAM limits based on the model's architecture and your target context length. If you use the `--vllm` flag, it switches to a "Serving Capacity" simulation that calculates exactly how many tokens fit in the PagedAttention pool based on your `--gpu-util` ratio.
 - **Hardware Fit Diagnostics**: Check if a model fits your cluster with `--gpu` (e.g. `--gpu RTX4090` or `--gpu auto`). It enforces Apple Silicon's 75% unified memory wire limit, and you can explicitly model multi-GPU NCCL communication penalties with `--topology` and `--strategy`.
 - **Side-by-Side Comparison**: Pass multiple models to trigger a comparison table (parameters, data types, context lengths, VRAM footprints).
 - Uses exact `ggml_type` mappings for GGUF formats to calculate byte-scaling coefficients, preventing VRAM under-reporting.
@@ -48,7 +48,7 @@ pip install -e ".[dev]"

 ## Testing

-The testing suite enforces cross-platform structural integrity and guards the zero-dependency latency constraint. Tests are isolated against custom binary mocks in `tests/fixtures/`.
+Tests cover the binary parsers and verify the sub-100ms local parse constraint using binary mocks in `tests/fixtures/`.

 Run the test suite using pytest:

@@ -147,7 +147,7 @@ Qwen2.5-0.5B 494.0M BF16 8K 1.6 GB ✓
 | `--gpu` | `--gpu rtx4090` | Check if the model fits. Accepts GPU names (`rtx4090`, `b200`, `rx7900xtx`), explicit VRAM limits in GB (`--gpu 24`), or local hardware auto-discovery (`--gpu auto`). |
 | `--context` | `--context 32768` | Adjust the target KV cache length. Essential for calculating the dynamic memory footprint of long-context models. Defaults to `8192`. |
 | `--max-vram` | `--max-vram 80` | Adjusts the color-coded heat mapping thresholds (Green/Yellow/Red) in the terminal output to match a specific hardware ceiling. |
-| `--vllm` | `--vllm --gpu auto` | Switches from additive memory checking to a subtractive serving capacity estimation. Shows exactly how many tokens fit in the PagedAttention pool. |
+| `--vllm` | `--vllm --gpu auto` | Switches from additive memory checking to a serving capacity simulation. Shows exactly how many tokens fit in the PagedAttention pool. |
 | `--gpu-util` | `--gpu-util 0.9` | Sets the vLLM `gpu_memory_utilization` ratio. Defaults to `0.9` (reserves 10% for PyTorch context). |
 | `--topology` | `--topology nvlink` | Set interconnect topology to calculate exact communication overhead penalties (`nvlink`, `pcie4`, `pcie3`). Defaults to `pcie4`. |
 | `--strategy` | `--strategy tp` | Selects the parallelization strategy for multi-GPU setups (`tp` for Tensor Parallelism, `pp` for Pipeline Parallelism). Defaults to `tp`. |
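
Reviewer note: the README's "Zero-Dependency Parsing" bullet says `.safetensors` headers are read with only `struct` and `json`. As a rough illustration of why that is cheap, here is a sketch of such a reader. It relies on the published safetensors layout (an 8-byte little-endian unsigned length, then that many bytes of JSON, then the tensor data). The helper names are hypothetical and are not taken from the ModelInfo source.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("file is too short to hold a safetensors length prefix")
        # "<Q" = little-endian unsigned 64-bit integer: the header length in bytes.
        (header_len,) = struct.unpack("<Q", prefix)
        raw = f.read(header_len)
        if len(raw) != header_len:
            raise ValueError("header is truncated")
    return json.loads(raw)


def count_parameters(header: dict) -> int:
    """Sum the element counts of all tensors, skipping the optional metadata entry."""
    total = 0
    for name, entry in header.items():
        if name == "__metadata__":
            continue
        elements = 1
        for dim in entry["shape"]:
            elements *= dim
        total += elements
    return total


def total_tensor_bytes(header: dict) -> int:
    """Sum tensor payload sizes from each entry's [begin, end) data offsets."""
    total = 0
    for name, entry in header.items():
        if name == "__metadata__":
            continue
        begin, end = entry["data_offsets"]
        total += end - begin
    return total
```

Because only the prefix and the header are read, the cost depends on the header size rather than the checkpoint size. The same layout lends itself to byte-range requests against a remote file.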

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "modelinfo-cli"
 version = "1.4.0"
-description = "A sub-100ms, zero-dependency CLI to inspect ML models (.safetensors, .gguf) locally or via Hugging Face, calculate exact VRAM footprints, and determine hardware fit."
+description = "A CLI tool to inspect ML checkpoints (.safetensors, .gguf, .pt) and calculate inference VRAM, multi-GPU memory splits, and vLLM serving capacity."
 readme = "README.md"
 requires-python = ">=3.10"
 license = { text = "MIT" }

src/modelinfo/cli.py

Lines changed: 2 additions & 2 deletions
@@ -64,7 +64,7 @@ def parse_args(argv: Sequence[str] | None = None) -> argparse.Namespace:
     parser.add_argument(
         "--vllm",
         action="store_true",
-        help="Enable Subtractive Math Engine: Calculate max context tokens using vLLM PagedAttention allocation.",
+        help="Enable vLLM Capacity Simulation: Calculate max context tokens using PagedAttention allocation.",
     )
     parser.add_argument(
         "--gpu-util",
@@ -185,7 +185,7 @@ def main(argv: Sequence[str] | None = None) -> int:

     if len(args.file) > 1:
         if args.vllm:
-            console.print("[red]Error: Side-by-side comparison does not currently support the subtractive --vllm engine. Compare models sequentially or remove --vllm.[/red]")
+            console.print("[red]Error: Side-by-side comparison does not currently support the --vllm capacity simulation. Compare models sequentially or remove --vllm.[/red]")
             return 1

         models = []
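
Reviewer note: the two `cli.py` hunks touch the `--vllm` help text and the guard that rejects `--vllm` in comparison mode. The surrounding code is not visible in this diff, so the sketch below only reconstructs the pattern with the standard `argparse` module. The `prog` name, the positional `file` argument, the `--gpu-util` type and default, and the `comparison_error` helper are assumptions made for illustration.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="modelinfo")  # prog name is assumed
    # Assumed positional: one or more checkpoint paths or repo IDs.
    parser.add_argument("file", nargs="+")
    parser.add_argument(
        "--vllm",
        action="store_true",
        help="Enable vLLM Capacity Simulation: Calculate max context tokens "
             "using PagedAttention allocation.",
    )
    # Assumed type and default, following the README's flag table.
    parser.add_argument("--gpu-util", type=float, default=0.9)
    return parser.parse_args(argv)


def comparison_error(args: argparse.Namespace) -> Optional[str]:
    """Return an error message if the flag combination is unsupported, else None."""
    if len(args.file) > 1 and args.vllm:
        return ("Side-by-side comparison does not currently support the "
                "--vllm capacity simulation. Compare models sequentially "
                "or remove --vllm.")
    return None
```

Returning the message instead of printing it keeps the check easy to unit test. A caller such as `main` could print the message and return exit code 1, as the diff does.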

tests/test_calculator.py

Lines changed: 2 additions & 2 deletions
@@ -143,8 +143,8 @@ def test_strategy_pp():
     assert fp_pp["penalty_percentage"] == 0.0
     assert fp_pp["overhead_bytes"] == (4 * 600 * 1024 * 1024)

-def test_vllm_subtractive_math():
-    """Verify the subtractive vLLM serving capacity engine calculates exact tokens."""
+def test_vllm_capacity_simulation():
+    """Verify the vLLM serving capacity engine calculates exact tokens."""
     tensors = {
         "model.layers.0.attn.weight": {"shape": [1024, 1024], "dtype": "F16"} # Base: 2MB
     }
