Finetuna crashes after the first DFT calculation #46

Open
bjkreitz opened this issue Aug 31, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@bjkreitz

Issue

I tried to run FINETUNA with VASP 6.3.0 to relax H*CO on Pt(111) using the provided ASE example template (no. 1). Ten steps are performed with the MLP, and then a DFT calculation is triggered. However, after the DFT calculation converges, the software crashes with the following error message:

Trying to close the VASP stream but encountered error:
process PID not found (pid=181196)
Will now force closing the VASP process. The OUTCAR and vasprun.xml outputs may be incomplete
Force below threshold: check with parent
OnlineLearner: Parent calculation required
Traceback (most recent call last):
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_common.py", line 442, in wrapper
ret = self._cache[fun]
AttributeError: _cache

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_pslinux.py", line 1642, in wrapper
return fun(self, *args, **kwargs)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_common.py", line 445, in wrapper
return fun(self)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_pslinux.py", line 1684, in _parse_stat_file
data = bcat("%s/%s/stat" % (self._procfs_path, self.pid))
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_common.py", line 775, in bcat
return cat(fname, fallback=fallback, _open=open_binary)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_common.py", line 763, in cat
with _open(fname) as f:
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_common.py", line 727, in open_binary
return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/181196/stat'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/__init__.py", line 361, in _init
self.create_time()
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/__init__.py", line 714, in create_time
self._create_time = self._proc.create_time()
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_pslinux.py", line 1642, in wrapper
return fun(self, *args, **kwargs)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_pslinux.py", line 1852, in create_time
ctime = float(self._parse_stat_file()['create_time'])
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/_pslinux.py", line 1649, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: process no longer exists (pid=181196)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/gpfs/data/cfgoldsm/bkreitz1/VASP/methane-oxidation/neb/h--co-diss/IS/finetuna/example.py", line 106, in <module>
relaxer.run(
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/finetuna/atomistic_methods.py", line 198, in run
dyn.run(fmax=self.fmax, steps=self.steps)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/ase/optimize/optimize.py", line 294, in run
return Dynamics.run(self)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/ase/optimize/optimize.py", line 181, in run
for converged in Dynamics.irun(self):
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/ase/optimize/optimize.py", line 168, in irun
self.log()
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/ase/optimize/optimize.py", line 308, in log
forces = self.atoms.get_forces()
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/ase/atoms.py", line 790, in get_forces
forces = self._calc.get_forces(self)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/ase/calculators/abc.py", line 23, in get_forces
return self.get_property('forces', atoms)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/ase/calculators/calculator.py", line 736, in get_property
self.calculate(atoms, [name], system_changes)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/finetuna/online_learner/online_learner.py", line 189, in calculate
energy, forces, fmax = self.get_energy_and_forces(atoms)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/finetuna/online_learner/online_learner.py", line 259, in get_energy_and_forces
energy, forces, constrained_forces = self.add_data_and_retrain(
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/finetuna/online_learner/online_learner.py", line 491, in add_data_and_retrain
self.parent_calc._pause_calc()
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/vasp_interactive/vasp_interactive.py", line 471, in _pause_calc
mpi_process = _find_mpi_process(pid)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/vasp_interactive/vasp_interactive.py", line 65, in _find_mpi_process
process_list = [psutil.Process(pid)]
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/__init__.py", line 332, in __init__
self._init(pid)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/psutil/__init__.py", line 373, in _init
raise NoSuchProcess(pid, msg='process PID not found')
psutil.NoSuchProcess: process PID not found (pid=181196)
Trying to close the VASP stream but encountered error:
'psutil'

Software

OpenMPI 4.0.5
Intel 2020.2
Python 3.9.0

Executed on 2 nodes, with 2 tasks per node and 8 CPUs per task (not sure if that's relevant).

I adjusted the VASP calculator as follows:

vasp_calc = VaspInteractive(
    ibrion=-1,
    nsw=0,
    ispin=1,
    ediff=1e-6,
    ediffg=-0.03,
    encut=450.0,
    laechg=False,
    lcharg=False,
    lwave=False,
    #ncore=4,
    xc="beef-vdw",
    kpts=(3,3,1),
)
@alchem0x2A
Contributor

@bjkreitz I think this issue is related to vasp-interactive. It could be that vasp-interactive isn't compatible with your local VASP build, so the parsing stopped. Another possibility is that it is related to multi-node MPI. In vasp-interactive we use psutil to find the MPI process and send pause signals while the ML jobs run, but this hasn't been tested on multi-node machines.
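
To make the pausing idea concrete, here is a minimal sketch of suspending and resuming a child process with POSIX signals. It is not the vasp-interactive implementation: the thread does not say which signals are sent, so SIGSTOP and SIGCONT are used here purely as an assumption, and a short-lived Python child stands in for the MPI launcher. The helper names `pause_process` and `resume_process` are hypothetical.

```python
import os
import signal
import subprocess
import sys
import time


def pause_process(pid: int) -> None:
    """Suspend a process. SIGSTOP is an assumed choice for illustration."""
    os.kill(pid, signal.SIGSTOP)


def resume_process(pid: int) -> None:
    """Let a previously suspended process continue."""
    os.kill(pid, signal.SIGCONT)


# Stand-in for the MPI launcher: a child that would normally exit after 0.2 s.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])

pause_process(child.pid)
time.sleep(0.6)  # longer than the child's own runtime
# While suspended the child cannot finish, so poll() still reports "running".
still_alive_while_paused = child.poll() is None

resume_process(child.pid)
exit_code = child.wait(timeout=10)
```

On a multi-node job the process that receives the signal is typically only the launcher on the first node, which is one reason this kind of pausing may behave differently across MPI setups.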

Could you test whether VaspInteractive works as a plain ASE calculator on your structure with the same MPI settings? I suspect it will fail in this case as well.

Also, are the OUTCAR, vasprun.xml, and vasp.out files truncated in your setup?

@alchem0x2A
Contributor

If so, let's raise the issue in https://github.com/ulissigroup/vasp-interactive instead. You can likely work around it by switching from VaspInteractive to the normal Vasp calculator, at some cost in computation time.

@bjkreitz
Author

When I run the relaxation with just vasp-interactive and identical settings, it seems to work. A few optimization steps were performed without a crash.

@alchem0x2A
Contributor

@bjkreitz Thanks for testing. So it seems your VASP build is compatible, and the issue might be related to MPI pausing on multiple nodes. Could you try a simple test like this on your setup (just atoms + VaspInteractive, no finetuna involved)?

import time
from vasp_interactive import VaspInteractive

# `atoms` is your ase.Atoms structure and `params` your dict of VASP settings
atoms.calc = VaspInteractive(**params)
atoms.get_potential_energy()
# Potentially not working on multiple nodes?
with atoms.calc.pause():
    time.sleep(5)
# Just simulate a second step
atoms.rattle(0.01)
atoms.get_potential_energy()
atoms.calc.finalize()

@bjkreitz
Author

bjkreitz commented Sep 1, 2022

Yes, when I try your example it fails after computing the first potential energy. It fails on multiple nodes and also on a single node with 16 CPUs, due to an MPI issue.
I get this error message:

Traceback (most recent call last):
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/vasp_interactive/vasp_interactive.py", line 500, in pause
self._pause_calc()
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/vasp_interactive/vasp_interactive.py", line 471, in _pause_calc
mpi_process = _find_mpi_process(pid)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/vasp_interactive/vasp_interactive.py", line 80, in _find_mpi_process
mpi_proc = mpi_candidates[-1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/gpfs/data/cfgoldsm/bkreitz1/VASP/methane-oxidation/neb/h--co-diss/IS/finetuna/interactive/minimal.py", line 25, in <module>
with syst.calc.pause():
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/contextlib.py", line 119, in __enter__
return next(self.gen)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/vasp_interactive/vasp_interactive.py", line 503, in pause
self._resume_calc()
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/vasp_interactive/vasp_interactive.py", line 483, in _resume_calc
mpi_process = _find_mpi_process(pid)
File "/users/bkreitz1/anaconda/finetuna/lib/python3.9/site-packages/vasp_interactive/vasp_interactive.py", line 80, in _find_mpi_process
mpi_proc = mpi_candidates[-1]
IndexError: list index out of range
Trying to close the VASP stream but encountered error:
list index out of range
Will now force closing the VASP process. The OUTCAR and vasprun.xml outputs may be incomplete

@alchem0x2A
Contributor

@bjkreitz Thanks for the test! Yes, it seems the way VaspInteractive looks up the MPI process via psutil does not work on your specific system. This can happen because we have probably only considered a few MPI combinations; the _find_mpi_process function needs to be updated. If possible, could you tell us the $VASP_COMMAND or $ASE_VASP_COMMAND you're using?
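
For illustration only, a more defensive lookup could look roughly like the sketch below. It is not the actual `_find_mpi_process` code, which the thread does not show. The two failure modes it guards against are the ones in the tracebacks: the PID no longer existing (`psutil.NoSuchProcess`) and an empty candidate list (`IndexError` on `mpi_candidates[-1]`). The names `KNOWN_LAUNCHERS`, `related_processes`, and `pick_mpi_candidate` are hypothetical, and the launcher list is an assumption.

```python
from typing import Iterable, List, Optional, Tuple

# Assumed launcher names; the thread does not list which ones are recognised.
KNOWN_LAUNCHERS = ("mpirun", "mpiexec", "orterun", "srun")


def related_processes(pid: int) -> List[Tuple[int, str]]:
    """Return (pid, name) for the process, its parents and its children.

    Returns an empty list instead of raising if the process has already exited.
    Requires psutil >= 5.6.0 for Process.parents().
    """
    import psutil  # imported lazily so the selection logic below has no dependency

    try:
        proc = psutil.Process(pid)
        family = proc.parents() + [proc] + proc.children(recursive=True)
        return [(p.pid, p.name()) for p in family]
    except psutil.NoSuchProcess:
        return []


def pick_mpi_candidate(
    processes: Iterable[Tuple[int, str]],
    launchers: Tuple[str, ...] = KNOWN_LAUNCHERS,
) -> Optional[Tuple[int, str]]:
    """Pick the last process whose name looks like an MPI launcher, or None."""
    candidates = [
        (pid, name)
        for pid, name in processes
        if any(tag in name for tag in launchers)
    ]
    if not candidates:
        # Nothing recognised: report it to the caller instead of indexing an empty list.
        return None
    return candidates[-1]
```

A caller could then skip pausing, or raise a clearer error, when `pick_mpi_candidate(related_processes(pid))` returns `None`.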

Meanwhile, if you're OK with testing finetuna, simply switching VaspInteractive to Vasp should give a smooth transition. @jmusiel Is the current implementation of finetuna able to disable pausing when using VaspInteractive?

@jmusiel
Collaborator

jmusiel commented Sep 6, 2022

Yes, switching VaspInteractive to Vasp should disable pausing and work as expected.
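
As a rough sketch of that workaround, the settings from the original report can be kept in one dict and passed to either calculator. The helper `make_parent_calc` and the name `VASP_PARAMS` are hypothetical, and the sketch assumes both classes accept the same keywords; the rest of the finetuna example script is assumed to stay unchanged.

```python
# VASP settings copied from the original report.
VASP_PARAMS = dict(
    ibrion=-1,
    nsw=0,
    ispin=1,
    ediff=1e-6,
    ediffg=-0.03,
    encut=450.0,
    laechg=False,
    lcharg=False,
    lwave=False,
    xc="beef-vdw",
    kpts=(3, 3, 1),
)


def make_parent_calc(interactive: bool = False):
    """Build the parent DFT calculator.

    interactive=False uses the standard ASE Vasp calculator, which starts a
    fresh VASP run per call and therefore needs no process pausing.
    """
    if interactive:
        from vasp_interactive import VaspInteractive

        return VaspInteractive(**VASP_PARAMS)
    from ase.calculators.vasp import Vasp

    return Vasp(**VASP_PARAMS)
```

In the example script, `vasp_calc = make_parent_calc(interactive=False)` would replace the `VaspInteractive(...)` call, at the cost of restarting VASP for every parent calculation.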
