Compiled workers - Herbert changes #434
Conversation
oerc0122 left a comment
If this works, that's good; otherwise, just some minor formatting complaints.
%
wrkr = which(obj.worker_);
mff = MPI_clusters_factory.instance();
assert(~isempty(which(obj.worker)) || exist(obj.worker, 'file'), ...
This code seems duplicated 3x; is it worth making it a function?
yeah... maybe... it's just it's in three different classes... :/
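A minimal sketch of what such a shared helper could look like, assuming it can live somewhere common to the three classes; the function name and error identifier below are illustrative, not from this PR:

function check_worker_exists_(worker)
    % Illustrative helper: fail early if the configured worker is neither a
    % MATLAB function/script on the path nor a file on disk.
    if isempty(which(worker)) && ~exist(worker, 'file')
        error('HERBERT:parallel_config:invalid_argument', ...
              'Worker %s is not on the MATLAB path and is not a file', worker);
    end
end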
@mducle Do you really want it to go to rel_3_6? OK, tests for master will probably fail then.
Well... the changes are needed for ... I guess I could apply this PR to ...
Given the sticky points I have mentioned, the work looks ok. I have not noticed changes to ClusterSlurm
and ClusterHerbert; do they all use the same launcher as ClusterMpi? I thought each one had its own, but maybe I am wrong and the launcher is placed on the parent.
Then it's fine.
% only one version of mpiexec is used now. May change in the
% future.
mpi_exec = fullfile(external_dll_dir, 'mpiexec.exe');
[rs, rv] = system('where mpiexec');
I am concerned about this logic. I have seen three MS MPI versions (8, 9 and 10) and they do not work together. If you've compiled your cpp_communicator with MS MPI 8, it will not work with MS MPI 9, and the issue is very difficult to debug. You need to run MPI on the command line, and when it meets the other version, it prints some gibberish like a protocol error or whatever. You will never know it is just a conflicting version. I have not tried all combinations, so maybe if you compile against mpiexec 8 & msmpi 8 and then run it under mpiexec 9 (and dll 9?) it will run. But I tried cpp_communicator compiled with 8 under mpiexec 9 and it fails.
I would try to run the worker under the internal mpiexec if possible, unless there is a compelling reason to try an external MPI first.
I tested it with MSMPI v8, MSMPI v9 and MSMPI v10 and it works on all 3 with the Jenkins compiled mex files - I just download the released Horace bundle and use it as is; I don't recompile the mex files.
I think that the cpp_communicator only uses MPI2.2 (or earlier) features and that standard came out in 2009 and is supported by all three versions of MS MPI we're talking about here. It seems there are no ABI changes between the versions either so they all seem to work ok.
The issue with the Horace-bundled mpiexec.exe is due to the firewall (at least for me, I think). Even though I set it to be allowed by the firewall it keeps failing with a 'Failed to post close command' error (1726)... it might be due to some signing issues but I couldn't figure out a way around it except by installing the official package...
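For context, a minimal sketch (not the PR's exact code) of the selection logic being discussed, assuming external_dll_dir points at the bundled binaries and taking the Windows case:

% Prefer an installed MS-MPI mpiexec if one is on the system PATH,
% otherwise fall back to the bundled copy.
mpi_exec = fullfile(external_dll_dir, 'mpiexec.exe');  % bundled copy as fallback
[rs, rv] = system('where mpiexec');                    % look for an installed MS-MPI
if rs == 0 && ~isempty(strip(rv))
    out = splitlines(strip(rv));
    mpi_exec = out{1};  % take the first mpiexec found on the system PATH
end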
% Assume if it is on the system path, then it is a compiled worker
out = splitlines(strip(rv));
out = out{1}; % Only take first path if there is more than one
end
I would rather distinguish between a compiled and a Matlab worker by extension: a Matlab worker would be .m and a compiled one anything else. This would be simpler than what is done here.
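A one-line sketch of the extension-based test suggested here (this illustrates the suggestion, not what the PR implements; worker is an illustrative variable name):

[~, ~, ext] = fileparts(worker);   % worker is the configured worker name/path
is_compiled = ~strcmpi(ext, '.m'); % .m => Matlab worker, anything else => compiled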
So this bit is to allow a user shortcut: what we want is the fully qualified path to pass to the Java ProcessBuilder, so if the user just sets pc.worker = 'horace_worker.exe' without the full path it would not work. The system('[which|where] <worker>') call here is to get the full path. .m files will not be returned by this (unless they have erroneously been marked as executable by the filesystem on Linux).
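A small usage example of the shortcut being described; the resolved path shown in the comment is illustrative only:

pc = parallel_config;
pc.worker = 'horace_worker.exe';  % bare executable name, no directory
% the setter resolves it via 'where'/'which', so pc.worker then holds the full
% path (e.g. C:\Horace\horace_worker.exe) to hand to the Java ProcessBuilder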
Sorry, I was writing my comments alongside Jacob's but the editor separated them. They may look a bit unclear separately.
You started it from rel_3_6, so merging it to 3.6 is ok. The branch should then be relatively easy to merge to master. This actually makes me wonder why the tests are failing...
* Add compiled worker changes * Address PR comments and fix test fail * Compiled workers: fixed windows mpiexec test fail
* Test: pace-neutrons/Herbert#434 * Test: Herbert#434 * Add herbert/horace amalgamation proposal SMG * Fix herbert amalgamation SMG typos
Changes the checks in @parallel_config and @MPI* classes to allow compiled workers.

When the worker field of parallel_config is set to something which is not a Matlab function, then it is checked to see if it may be a compiled worker. This check considers something a compiled worker if it is on the system PATH (as determined by the which (Linux) or where (Windows) command). The full path to the file is set as the worker and the is_compiled flag is set to true. No check is made as to whether this file is actually executable or can run Horace etc... if it is not, then any parallel computation will fail further down the chain; such checks would involve (perhaps significant) modification of worker_v2 and/or setting up an actual computation, which would take too long... we assume users would not set a worker they know will not work... (and if they do, they will get an error when they try to run a computation...)

The actual compiled worker code is in Horace PR #794. Tests will be added in the Horace PR (later!)
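As a summary, a condensed sketch of the behaviour described above, with illustrative variable names and error text (the real logic sits in check_and_set_worker_.m and is more involved):

% Decide whether the configured worker is a Matlab worker or a compiled one.
if ~isempty(which(worker))
    is_compiled = false;               % an ordinary Matlab worker on the path
else
    if ispc
        [rs, rv] = system(['where ', worker]);
    else
        [rs, rv] = system(['which ', worker]);
    end
    if rs ~= 0 || isempty(strip(rv))
        error('Worker %s is neither a Matlab function nor on the system PATH', worker);
    end
    out = splitlines(strip(rv));
    worker = out{1};                   % store the fully qualified path
    is_compiled = true;                % no check that it can actually run Horace
end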
is_compiledflag is set to true. No check is made as to whether this file is actually executable or can run Horace etc... if it is not then any parallel computation will fail further down the chain; such checks would involve (perhaps significant) modification ofworker_v2and/or setting up an actual computation which would take too long... we assume users would not set a worker they know will not work... (and if they do they will get an error when they try to run a computation...)The actual compiled worker code is in a Horace PR #794. Tests will be added in the Horace PR (later!)