
Conversation

@mducle (Member) commented Mar 20, 2022

Changes the checks in @parallel_config and @MPI* classes to allow compiled workers.

When the worker field of parallel_config is set to something which is not a Matlab function, it is checked to see whether it may be a compiled worker. Something is considered a compiled worker if:

  • It is a full path to an existing file; OR
  • It is a command on the system PATH (as determined by the which (Linux) or where (Windows) command).

The full path to the file is set as the worker and the is_compiled flag is set to true. No check is made as to whether this file is actually executable or can run Horace etc.; if it is not, any parallel computation will fail further down the chain. Such checks would involve (perhaps significant) modification of worker_v2 and/or setting up an actual computation, which would take too long... we assume users will not set a worker they know will not work (and if they do, they will get an error when they try to run a computation).
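As a rough sketch of the check described above (the function name resolve_worker and its outputs are made up for illustration; the real logic sits inside parallel_config and the MPI* classes):

function [worker_path, is_compiled] = resolve_worker(worker)
    % Sketch only: decide whether the configured worker is a compiled executable
    worker_path = worker;
    is_compiled = false;
    if ~isempty(which(worker))
        return;   % resolvable as a Matlab function: not a compiled worker
    end
    if exist(worker, 'file') == 2
        is_compiled = true;   % full path to an existing file
        return;
    end
    % Otherwise look for it on the system PATH with where (Windows) / which (Linux)
    if ispc
        [status, out] = system(['where ', worker]);
    else
        [status, out] = system(['which ', worker]);
    end
    if status == 0
        out = splitlines(strip(out));
        worker_path = out{1};   % take the first match if there is more than one
        is_compiled = true;
    end
end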

The actual compiled worker code is in Horace PR #794. Tests will be added in the Horace PR (later!)

@oerc0122 (Collaborator) left a comment

If this works, that's good; otherwise just some minor formatting complaints.

%
wrkr = which(obj.worker_);
mff = MPI_clusters_factory.instance();
assert(~isempty(which(obj.worker)) || exist(obj.worker, 'file'), ...
@oerc0122 (Collaborator):

This code seems duplicated 3x; is it worth making it a function?

@mducle (Member, Author):

Yeah... maybe... it's just that it's in three different classes... :/
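For reference, one possible shape for such a shared helper, as a sketch only (the function name check_worker_exists and the error identifier are made up here):

function check_worker_exists(worker)
    % Shared validation that each of the three classes could call
    % instead of repeating the same assert.
    assert(~isempty(which(worker)) || exist(worker, 'file') == 2, ...
        'HERBERT:parallel_config:invalid_argument', ...
        'Worker %s is neither on the Matlab path nor an existing file', worker);
end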

@abuts (Member) commented Mar 22, 2022

@mducle Do you really want it in rel_3_6? OK, but tests for master will probably fail then.

@mducle (Member, Author) commented Mar 22, 2022

@mducle Do you really want it in rel_3_6? OK, but tests for master will probably fail then.

Well... the changes are needed for pace-python to work. At the moment I'm patching the 3.6.2 sources, but I would like to be able to use it directly without patching in e.g. 3.6.3.

I guess I could apply this PR to master and then back-port it to rel_3_6 but I would like it to be in the release...

@abuts (Member) left a comment

Given that the sticky points I have mentioned work, this looks OK. I have not noticed changes to ClusterSlurm and ClusterHerbert; do they all use the same launcher as ClusterMpi? I thought each one had its own, but maybe I am wrong and the launcher is on the parent class.

If so, then it's fine.

% only one version of mpiexec is used now. May change in the
% future.
mpi_exec = fullfile(external_dll_dir, 'mpiexec.exe');
[rs, rv] = system('where mpiexec');
@abuts (Member):

I am concerned about this logic. I have seen three MS MPI versions (8, 9 and 10) and they do not work together. If you've compiled your cpp_communicator with MS MPI 8, it will not work with MS MPI 9, and the issue is very difficult to debug: you need to run MPI on the command line, and when it meets the other version it reports something cryptic like a protocol error, so you never realise it is just a version conflict. I have not tried all combinations, so maybe if you compile against mpiexec 8 & MS MPI 8 and then run it under mpiexec 9 (and the v9 DLL?) it will work, but I tried cpp_communicator compiled with 8 under mpiexec 9 and it fails.

I would try to run the worker under the internal (bundled) mpiexec if possible; there is no compelling reason to try an external MPI first.
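A sketch of the preference order suggested in this comment (bundled mpiexec first, system PATH only as a fallback); external_dll_dir is taken from the snippet above:

mpi_exec = fullfile(external_dll_dir, 'mpiexec.exe');
if exist(mpi_exec, 'file') ~= 2
    % Only fall back to an mpiexec found on the system PATH
    [rs, rv] = system('where mpiexec');
    if rs == 0
        rv = splitlines(strip(rv));
        mpi_exec = rv{1};   % first match on the PATH
    end
end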

@mducle (Member, Author) commented Mar 24, 2022

I tested it with MSMPI v8, MSMPI v9 and MSMPI v10 and it works on all three with the Jenkins-compiled mex files - I just download the released Horace bundle and use it as is; I don't recompile the mex files.

I think the cpp_communicator only uses MPI 2.2 (or earlier) features; that standard came out in 2009 and is supported by all three versions of MS MPI we're talking about here. There seem to be no ABI changes between the versions either, so they all work OK.

The issue with the Horace-bundled mpiexec.exe is due to the firewall (at least for me, I think). Even though I allowed it through the firewall, it keeps failing with a "Failed to post close command" error 1726... it might be due to some signing issue, but I couldn't find a way around it other than installing the official package...

% Assume if it is on the system path, then it is a compiled worker
out = splitlines(strip(rv));
out = out{1}; % Only take first path if there is more than one
end
@abuts (Member):

I would rather distinguish between a compiled and a Matlab worker by extension: a Matlab worker would be .m and a compiled one everything else. This would be simpler than what is done here.
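For illustration, the extension-based test suggested here might look like the following (obj.worker as in the snippets above):

[~, ~, ext] = fileparts(obj.worker);
is_compiled = ~strcmpi(ext, '.m');   % .m means a Matlab worker, anything else compiled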

@mducle (Member, Author):

So this bit is to allow a user shortcut: what we want is the fully qualified path to pass to the Java ProcessBuilder, so if the user just sets pc.worker = 'horace_worker.exe' without the full path it would not work; the system('[which|where] <worker>') call here obtains the full path. .m files will not be returned by this (unless they have erroneously been marked as executable by the filesystem on Linux).

@abuts (Member) commented Mar 22, 2022

Sorry, I was writing my comments alongside Jacob's, but the editor separated them. They may look a bit unclear on their own.

@abuts (Member) commented Mar 22, 2022

I guess I could apply this PR to master and then back-port it to rel_3_6 but I would like it to be in the release...

You started it from rel_3_6, so merging it to 3.6 is OK. The branch should then be relatively easy to merge into master. This actually makes me wonder why the tests are failing...

@mducle mducle merged commit e9a58ed into rel_3_6 Mar 31, 2022
@mducle mducle deleted the compiled_worker_herbert branch March 31, 2022 00:42
mducle added a commit that referenced this pull request Mar 31, 2022
* Add compiled worker changes

* Address PR comments and fix test fail

* Compiled workers: fixed windows mpiexec test fail
mducle added a commit that referenced this pull request Mar 31, 2022
* Add compiled worker changes

* Address PR comments and fix test fail

* Compiled workers: fixed windows mpiexec test fail
mducle added a commit to pace-neutrons/Horace that referenced this pull request Jun 7, 2022
* Test: pace-neutrons/Herbert#434

* Add herbert/horace amalgamation proposal SMG

* Fix herbert amalgamation SMG typos
mducle added a commit to pace-neutrons/Horace that referenced this pull request Jun 8, 2022
* Test: pace-neutrons/Herbert#434

* Test: Herbert#434

* Add herbert/horace amalgamation proposal SMG

* Fix herbert amalgamation SMG typos
mducle added a commit to pace-neutrons/Horace that referenced this pull request Jun 16, 2022
* Add compiled worker changes

* Address PR comments and fix test fail

* Compiled workers: fixed windows mpiexec test fail