"tensorboardx-2.5.1.drv' failed with exit code 134" on nixos unstable, building invokeai-amd #38
Here are the contents of
|
And here's my current NixOS version:
|
It may be that your system is running out of memory while running the tests during the from-source tensorboardx build, since it looks like Python is crashing during the build. Check whether the build is being killed for lack of memory.
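(A minimal way to confirm or rule out memory pressure, assuming a systemd-based NixOS install; these are standard tools and nothing here is specific to this flake:)

```
# Watch memory headroom in a second terminal while the derivation builds.
watch -n 2 free -h

# After the build aborts, look for evidence of the kernel OOM killer.
journalctl -k -b | grep -i "out of memory"
```

If the OOM killer does show up there, adding swap or limiting parallel build jobs would be the usual next step.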
My system has 32GB of RAM overall, plus 6.3GB of swap.
Then it may be a race condition in the Python tests that surfaces when your system is particularly fast or slow while building the derivation. This would be an upstream problem too, so if you look at the tensorboard issue tracker you may find complaints about race conditions in the tests.
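(If flaky tests do turn out to be the culprit, one possible local workaround, sketched here rather than taken from this repo, is to skip the tensorboardx test suite in an overlay; the `pythonPackagesExtensions` hook assumes a reasonably recent nixpkgs:)

```nix
# Hypothetical overlay: build tensorboardx without running its test suite.
final: prev: {
  pythonPackagesExtensions = prev.pythonPackagesExtensions ++ [
    (pyFinal: pyPrev: {
      tensorboardx = pyPrev.tensorboardx.overridePythonAttrs (old: {
        doCheck = false;  # skip the possibly flaky tests at build time
        # If the package uses pytestCheckHook, a narrower option would be
        # disabledTests = (old.disabledTests or [ ]) ++ [ "name_of_flaky_test" ];
      });
    })
  ];
}
```

Skipping tests trades some confidence for a build that completes, so it is only a stopgap until any upstream race is fixed.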
Python does not crash by aborting like this on any of my systems, so it could be something specific to your setup.
Just to make sure: you're not emulating x86 on an ARM machine via binfmt, are you?
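(For reference, two quick ways to confirm a build is running natively rather than under binfmt emulation; both commands are standard Nix/coreutils:)

```
# Native machine architecture; expect "x86_64" on real x86 hardware.
uname -m

# Whether Nix has been configured to accept extra (emulated) platforms.
nix show-config | grep -E '^(system|extra-platforms) '
```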
No, I'm not emulating; I'm building for x86_64-linux on x86_64-linux. Here's my CPU info, in case this has never been run on the same CPU. Also, maybe I should run a memory check to see if my RAM is still fine.
|
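(On the memory-check idea: a rough, no-reboot way to exercise a slice of RAM is memtester from nixpkgs; the 1G size and single pass below are arbitrary examples, and a full memtest86+ run from the boot menu is still the more thorough option:)

```
# CPU model, to compare against machines where this build succeeds.
lscpu | grep 'Model name'

# Rough RAM soundness check: one pass over 1 GiB of memory.
nix run nixpkgs#memtester -- 1G 1
```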
Another question: do you know why this derivation's result isn't cached? A cache hit would sidestep the flaky Python tests entirely, or maybe they are the reason it isn't cached?
It probably isn't cached because we've run out of cache space on Cachix, since this stuff definitely extends beyond 10G of disk space. I asked @domenkozar for "moar cache!" and he obliged, but I think we may have run out of that too. An alternative would be to host our own cache via S3, etc., but that is all very difficult considering the project doesn't make enough money via sponsors to fund that infrastructure.
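(For what it's worth, a binary cache can be queried directly for a given output; the https://ai.cachix.org URL below is an assumption based on the project's Cachix name, so substitute whatever nixified.ai actually advertises in its flake's nixConfig:)

```
# Evaluate the flake output's store path without building it.
out=$(nix eval --raw .#invokeai-amd)

# Ask the (assumed) Cachix endpoint whether that path is already present.
nix path-info --store https://ai.cachix.org "$out" \
  && echo "already cached" \
  || echo "not cached yet"
```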
I'm happy to sponsor more; how much do you think you need? Also, you might want to use pins.
@domenkozar I'd only ever want to pin the latest, so I'd need a way of unpinning anything that is old. At that point I'm making tons of API calls and managing state, which I don't want to do. The garbage collection approach works fine for this project; it's just that the size window can go beyond 10G. If you could do 50G, I think that would satisfy this project forever, as I can't see the closure size of all of the flake outputs ever exceeding that, even with the Nvidia/AMD build matrix we have. TL;DR: if you give me 50G, the latest master of the flake outputs in this repo should always be in the cache, regardless of garbage collection (since the flake will stay within the 50G boundary).
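(To illustrate the state management being avoided here, a pin-based workflow would look roughly like the sketch below; the exact `cachix pin` arguments are an assumption, so check `cachix --help` for the real interface:)

```
# Hypothetical pinning churn: pin the newest closure under a name,
# then track and remove stale pins by hand on every update, which is
# exactly the bookkeeping the garbage-collection approach avoids.
cachix pin ai latest-master "$(nix eval --raw .#invokeai-amd)"
```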
Done!
@MatthewCroughan let me know how I can check when this is cached. I can retry and close the issue then. |
I'm not sure if there's an easy way to make hercules-ci upload the missing paths to Cachix without a mass rebuild (updating nixpkgs), so this should fix itself when we update nixified.ai to 23.05, for example. Maybe @roberth has an answer in the meantime, though. I will endeavour to upstream our overlays as explained in #33, and in doing so update nixpkgs, sometime this week or next.
@MatthewCroughan you could restart the job to recreate the missing outputs so they get pushed to the cache.
I've pushed the latest versions of all our packages to the cache. You should get a cache hit if you simply repeat the command you ran before.
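(A quick way to confirm the cache hit before a full run is a dry-run build, assuming the `invokeai-amd` flake output from the issue title:)

```
# Lists what would be fetched from substituters versus built locally;
# tensorboardx appearing only in the "fetched" list means the cache hit worked.
nix build --dry-run .#invokeai-amd
```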
Nice, I can progress further! I have another issue now; should I open a separate ticket? It complains about a missing "configs/models.yaml". Is that expected?
Here is what I'm getting:
I am one of those "have zero clue about the packaging, let me just run it" kind of people :) I hope this report helps with reaching the goal of making AI tooling more reproducible. I'll be around if you need more info, and I can test things from time to time to check fixes.
I've re-run this 3 times, with the same result every time.