BTSbot update: v1.0.1 -> v2.0.0; acai vram optimization #476
Open
Theodlz wants to merge 4 commits into
…er performance and uses a lot less vram at runtime
Pull request overview
Updates BTSBot from v1.0.1 to v2.0.0 for ZTF enrichment and GPU validation.
Changes:
- Replaces BTSBot model paths in shared model loading and GPU smoke validation.
- Adds the new BTSBot v2.0.0 ONNX LFS pointer and removes v1.0.1.
- Removes the old BTSBot model copy step from the GPU Dockerfile.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/enrichment/models/mod.rs | Loads BTSBot v2.0.0 for CPU and GPU model pools. |
| src/bin/scheduler.rs | Uses BTSBot v2.0.0 in GPU inference validation. |
| Dockerfile.gpu | Removes old BTSBot v1.0.1 model copy. |
| data/models/btsbot-v2.0.0.onnx | Adds the new ONNX model via Git LFS pointer. |
| data/models/btsbot-v1.0.1.onnx | Removes the old ONNX model pointer. |
…simply takes the post-transpose image format as input (which happens to match btsbot v2, so we don't hold 2 copies of it in memory and lower the vram usage a lot, even though inference seems slower somehow)
While the new model has far more parameters (8.8M vs 230k for the old model), performance is improved and VRAM usage on the GPU is much lower: on my system, the old model uses around 6GB at batch size = 1024, while the new one uses around 2GB. A huge improvement. Runtime is also better: I see 30 ms instead of 45 ms.
Now here's the issue. BTSbot wants the images in (N,3,63,63) format (NCHW), but acai (and the old btsbot) expect (N,63,63,3) format (NHWC). That's a bit of a pain, because we now need 2 sets of image tensors! But while looking at the ONNX graph of ACAI, I noticed something I had stumbled upon a few years back: ACAI takes (N,63,63,3) images as input, but its first ONNX operator is a transpose (0,3,1,2) that converts them to (N,3,63,63), exactly the input format of btsbot v2! Not only that, but this seemingly unnecessary transpose means CUDA has to use twice the memory it needs to hold the images. So we do useless compute and use more VRAM.
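To make the layout mismatch concrete, here is a minimal NumPy sketch of the same (0,3,1,2) permutation applied to a channels-last batch. It is only an illustration: the helper name `nhwc_to_nchw` and the batch size of 4 are assumptions, and it runs on the CPU rather than reproducing what ONNX Runtime or CUDA do internally.

```python
import numpy as np


def nhwc_to_nchw(batch: np.ndarray) -> np.ndarray:
    """Reorder (N, 63, 63, 3) images to (N, 3, 63, 63).

    Hypothetical helper: applies the same axis permutation (0, 3, 1, 2)
    as the leading Transpose operator described above, and materializes
    the result as a new contiguous array.
    """
    return np.ascontiguousarray(batch.transpose(0, 3, 1, 2))


# Small illustrative batch (the PR uses batch size 1024).
rng = np.random.default_rng(0)
batch_nhwc = rng.random((4, 63, 63, 3), dtype=np.float32)
batch_nchw = nhwc_to_nchw(batch_nhwc)

print(batch_nhwc.shape, "->", batch_nchw.shape)

# The transposed copy is a second buffer of the same size, so holding
# both layouts doubles the memory spent on the images.
print(batch_nhwc.nbytes + batch_nchw.nbytes, "bytes held for both layouts")
```

Pixel `(n, h, w, c)` in the channels-last batch ends up at `(n, c, h, w)` in the channels-first one, and the two arrays do not share memory, which mirrors why keeping both layouts around costs twice the image storage.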
So I got Claude's help to edit the ONNX files of ACAI to drop the transpose and simply take (N,3,63,63) images as input directly. VRAM usage at a batch size of 1024 drops from 3GB to 2GB. However, latency is somehow worse: each model takes 20ms instead of 15ms. Since we have 5 ACAI models, that accumulates, but the 15ms shaved off BTSbot compensates a bit. To me, that tradeoff is definitely acceptable given the VRAM savings.
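As a rough back-of-the-envelope check of that tradeoff, the per-model timings quoted above can be summed for one sequential pass. This sketch assumes the timings are per batch, that the 5 ACAI models and BTSbot run one after another, and that nothing else contributes; the function name is made up for illustration.

```python
ACAI_MODELS = 5  # number of ACAI models, as stated above


def sequential_latency_ms(acai_ms: float, btsbot_ms: float,
                          n_acai: int = ACAI_MODELS) -> float:
    """Total time for one batch when all models run one after another."""
    return n_acai * acai_ms + btsbot_ms


# Before: original ACAI (15 ms each) + BTSbot v1.0.1 (45 ms).
before_ms = sequential_latency_ms(acai_ms=15, btsbot_ms=45)
# After: transpose-free ACAI (20 ms each) + BTSbot v2.0.0 (30 ms).
after_ms = sequential_latency_ms(acai_ms=20, btsbot_ms=30)

print(f"before: {before_ms} ms, after: {after_ms} ms, "
      f"delta: {after_ms - before_ms:+} ms")
```

Under those assumptions the ACAI slowdown adds 25 ms per batch and the BTSbot speedup removes 15 ms, for a net cost of about 10 ms per batch.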
So all in all, with the new BTSbot and the "nchw" acai to match its input format and lower VRAM, running a 1024 batch size with all the models sequentially uses 7.9GB of VRAM on my system, when previously it needed so much that it went OOM (I believe Sushant said it needed something like 16GB, which if true means roughly a 2x improvement). Throughput might be slightly lower due to the curiously increased runtime of the modified acai models, but I'd still call that a win: VRAM is our limiting factor for ZTF, not throughput.
Notes:
TODOs: