
Assess "Too many open files" #9832

Open
berland opened this issue Jan 22, 2025 · 3 comments
Comments


berland commented Jan 22, 2025

In light of #9830, Ert's behaviour regarding open files should be assessed. Some questions:

  • How does the number of open file handles scale with the number of realizations?
  • What are the differences between running locally and on a compute cluster?

berland commented Jan 22, 2025

Tips for testing (a descriptor-counting sketch follows this list):

  • In one terminal, run watch -n 0.1 "lsof -c ert | wc -l"
  • In another terminal, inspect the output of lsof -c ert manually
  • Run poly_example on LOCAL or LSF with 1000 realizations
  • Let poly_eval.py sleep for 10 seconds to simulate actual work being done
  • Set SUBMIT_SLEEP to 0 so that all realizations are submitted immediately
  • Increase _max_batch_size and decrease _batching_interval in evaluator.py to make Ert snappier
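As a supplement to the lsof commands above, here is a minimal sketch (assuming Linux and a readable /proc) that counts only actual file descriptors per matching process; lsof output also contains non-descriptor rows, so the two counts will differ:

```python
# Sketch: count real file descriptors for every process whose command name
# contains "ert", polling once per second. Unlike `lsof -c ert | wc -l`,
# this counts only entries under /proc/<pid>/fd.
import os
import time


def fd_counts(name_substring: str = "ert") -> dict:
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            if name_substring in comm:
                counts[int(pid)] = len(os.listdir(f"/proc/{pid}/fd"))
        except (FileNotFoundError, PermissionError, ProcessLookupError):
            continue  # process exited, or its fd directory is not readable
    return counts


if __name__ == "__main__":
    while True:
        print(fd_counts())
        time.sleep(1)
```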


berland commented Jan 22, 2025

Some findings:

ulimit -n states that the per-process limit on open files is 1024. One would expect an OSError when opening new files once this limit is exceeded.
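For reference, a small sketch that reads the limits and deliberately exhausts the soft limit in a throwaway process, showing how the expected OSError (errno EMFILE) surfaces:

```python
# Sketch: read the soft/hard RLIMIT_NOFILE limits and deliberately exhaust
# the soft limit to observe "Too many open files" (errno EMFILE).
# Run in a throwaway process, not inside Ert.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")  # e.g. soft=1024, hard=262144

handles = []
try:
    while True:
        handles.append(open("/dev/null"))
except OSError as err:
    # Typically fails after soft minus the handful of already-open descriptors
    print(f"Failed after {len(handles)} opens: {err}")
finally:
    for handle in handles:
        handle.close()
```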

Starting Ert required approximately 440 file handles, irrespective of ensemble size.

Running 1000 sleeping poly realizations on LSF still keeps Ert at roughly 400-500 open file handles; the submission of jobs to the cluster is quick enough that submissions do not overlap.

Every TCP connection to the compute nodes counts as one open file handle (they appear in the lsof output).
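A sketch (Linux, /proc) that splits a process's descriptors into sockets versus everything else can confirm this; the pid argument is assumed to be the Ert main process:

```python
# Sketch: classify the descriptors of a given pid into sockets and other
# entries by reading the symlinks under /proc/<pid>/fd.
import os
import sys


def fd_breakdown(pid: int) -> dict:
    sockets = other = 0
    for fd in os.listdir(f"/proc/{pid}/fd"):
        try:
            target = os.readlink(f"/proc/{pid}/fd/{fd}")
        except OSError:
            continue  # descriptor was closed while iterating
        if target.startswith("socket:"):
            sockets += 1
        else:
            other += 1
    return {"sockets": sockets, "other": other}


if __name__ == "__main__":
    print(fd_breakdown(int(sys.argv[1])))
```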

Running 1000 realizations yields an observable peak of 1200 open file handles for the Ert process, but no OSError "Too many open files" is observed ❓ This contradicts what ulimit -n says.
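One possible explanation is that lsof lists more than file descriptors (cwd, txt and memory-mapped libraries appear as rows, and -c ert matches every process whose command name starts with "ert"), so its line count can exceed the descriptor limit without any descriptor actually being exhausted. A sketch comparing the two counts for a single pid:

```python
# Sketch: compare the lsof row count for one pid with the number of actual
# descriptors under /proc/<pid>/fd. lsof also lists non-descriptor entries
# (cwd, txt, memory-mapped libraries), so its count is expected to be higher.
import os
import subprocess
import sys

pid = int(sys.argv[1])
lsof_rows = subprocess.run(
    ["lsof", "-p", str(pid)], capture_output=True, text=True, check=False
).stdout.count("\n")
real_fds = len(os.listdir(f"/proc/{pid}/fd"))
print(f"lsof rows: {lsof_rows}, real descriptors: {real_fds}")
```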

Hitting Ctrl-C starts killing all realizations; the killing starts immediately, apparently with one subprocess per kill (?). This yields another observable peak in open file handles, also surpassing the 1024 limit, but still with no crash reproduced.


berland commented Jan 22, 2025

There is also a hard limit, ulimit -Hn, which yields 262144 on-prem (TGX).
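If the soft limit ever becomes the bottleneck, a process may raise its own soft limit up to the hard limit without extra privileges, and the new limit is inherited by child processes. A minimal sketch:

```python
# Sketch: raise the soft RLIMIT_NOFILE for the current process up to the
# hard limit (262144 on-prem according to `ulimit -Hn`).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f"RLIMIT_NOFILE soft limit raised from {soft} to {hard}")
```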

sondreso moved this to Todo in SCOUT on Jan 22, 2025