Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 12 additions & 3 deletions herbert_core/classes/MPIFramework/@ClusterMPI/ClusterMPI.m
Original file line number Diff line number Diff line change
Expand Up @@ -172,9 +172,18 @@ function check_availability(obj)
rootpath = fileparts(which('herbert_init'));
external_dll_dir = fullfile(rootpath, 'DLL','external');
if ispc()
% only one version of mpiexec is used now. May change in the
% future.
mpi_exec = fullfile(external_dll_dir, 'mpiexec.exe');
[rs, rv] = system('where mpiexec');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concern about this logic. I have seed three MS mpi versions (8,9 and 10) and they are not working together. If you've compiled your cpp_communicator with MS MPI 8, it will not work with MS MPI 9 and the issue is very difficult to debug. You need to run mpi in command line, and when it meets the other version, it will tell some gibberish like protocol error or whatever. You will never know its just conflicting version. I have not tried all combinations, so may be if you compiled mpiexec 8 & msmpi8 and then run it under mpiexec 9 and (dll? 9) it will run. But I tried cpp_communicator compiled with 8 and mpiexec 9 and it fails.

I would try to run worker under internal mpiexec if it is possible and there are no compelling reason to try external mpi first.

Copy link
Member Author

@mducle mducle Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it with MSMPI v8, MSMPI v9 and MSMPI v10 and it works on all 3 with the Jenkins compiled mex files - I just download the released Horace bundle and use it as is; I don't recompile the mex files.

I think that the cpp_communicator only uses MPI2.2 (or earlier) features and that standard came out in 2009 and is supported by all three versions of MS MPI we're talking about here. It seems there are no ABI changes between the versions either so they all seem to work ok.

The issue with the Horace-bundled mpiexec.exe is due to the firewall (at least for me; I think). Even though I set it to be allowed by the firewall it keeps failing with a Failed to post close command error 1726... it might be due to some signing issues but I couldn't figure out a way around it except by installing the official package...

mpis = splitlines(strip(rv));
% Ignore Matlab-bundled mpiexec (firewall issues)
mpis(cellfun(@(x) contains(x, matlabroot), mpis)) = [];
if rs == 0 && ~isempty(mpis)
% If multiple mpiexec on path, prefer user installed MS MPI
mpi_id = [1 find(cellfun(@(x) contains(x,'Microsoft'), mpis), 1)];
mpi_exec = mpis{max(mpi_id)};
else
% No mpiexec on path, use pre-packaged version
mpi_exec = fullfile(external_dll_dir, 'mpiexec.exe');
end
else
mpi_exec = fullfile(external_dll_dir, 'mpiexec');

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,9 @@
end

obj = init@ClusterWrapper(obj,n_workers,mess_exchange_framework,log_level);
assert(~obj.is_compiled_script_, ...
'HERBERT:ClusterParpoolWrapper:invalid_argument', ...
'Parpool cluster does not work with compiled workers')

% delete interactive parallel cluster if any exist
cl = gcp('nocreate');
Expand Down
16 changes: 6 additions & 10 deletions herbert_core/classes/MPIFramework/@ClusterWrapper/ClusterWrapper.m
Original file line number Diff line number Diff line change
Expand Up @@ -213,11 +213,9 @@
obj.matlab_starter_= fullfile(obj.matlab_starter_,'matlab');
end
if obj.is_compiled_script_
% TODO -- need checking and may be expansion when compiled
% horace ticket is executed.
obj.common_env_var_('HERBERT_PARALLEL_EXECUTOR')= obj.worker_name_;
obj.common_env_var_('HERBERT_PARALLEL_EXECUTOR') = obj.worker_name_;
else
obj.common_env_var_('HERBERT_PARALLEL_EXECUTOR') = obj.matlab_starter_;
obj.common_env_var_('HERBERT_PARALLEL_EXECUTOR') = obj.matlab_starter_;
end
% additional Matlab m-files search path to be available to
% workers
Expand Down Expand Up @@ -461,12 +459,10 @@ function check_availability(~)
% Should throw PARALLEL_CONFIG:not_avalable exception
% if the particular framework is not available.
worker = config_store.instance.get_value('parallel_config','worker');
pkp = which(worker);
if isempty(pkp)
error('HERBERT:ClusterWrapper:not_available',...
'Parallel worker %s is not on Matlab path. Parallel extensions are not available',...
worker);
end
assert(~isempty(which(worker)) || exist(worker, 'file'), ...
'HERBERT:ClusterWrapper:not_available',...
'Parallel worker %s is not on Matlab path. Parallel extensions are not available',...
worker);
end
% The property returns the list of the configurations, available for
% usage by the
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -215,42 +215,39 @@
%-----------------------------------------------------------------
% overloaded getters
function wrkr = get.worker(obj)
wrkr= get_or_restore_field(obj,'worker');
wrkr = obj.get_or_restore_field('worker');
end
function wrkr = get.is_compiled(obj)
% incomplete! Should be derived from worker
wrkr= obj.is_compiled_;
wrkr = obj.get_or_restore_field('is_compiled');
end

function frmw =get.parallel_cluster(obj)
function frmw = get.parallel_cluster(obj)
%
wrkr = config_store.instance.get_value(obj,'worker');
pkp = which(wrkr);
if isempty(pkp)
frmw = 'none';
return
frmw = 'none';
if ~isempty(which(wrkr)) || exist(wrkr, 'file')
frmw = obj.get_or_restore_field('parallel_cluster');
end
frmw = get_or_restore_field(obj,'parallel_cluster');
end
function conf = get.cluster_config(obj)
conf = get_or_restore_field(obj,'cluster_config');
conf = obj.get_or_restore_field('cluster_config');
end
%
function folder =get.shared_folder_on_local(obj)
folder = get_or_restore_field(obj,'shared_folder_on_local');
function folder = get.shared_folder_on_local(obj)
folder = obj.get_or_restore_field('shared_folder_on_local');
if isempty(folder)
is_depl = MPI_State.instance().is_deployed;
if is_depl
folder = get_or_restore_field(obj,'working_directory');
folder = obj.get_or_restore_field('working_directory');
if isempty(folder)
folder = tmp_dir;
end
end
end
end
%
function folder =get.shared_folder_on_remote(obj)
folder = get_or_restore_field(obj,'shared_folder_on_remote');
function folder = get.shared_folder_on_remote(obj)
folder = obj.get_or_restore_field('shared_folder_on_remote');
if isempty(folder)
folder = obj.shared_folder_on_local;
end
Expand All @@ -261,7 +258,7 @@
if is_depl
work_dir = obj.shared_folder_on_remote;
else
work_dir = get_or_restore_field(obj,'working_directory');
work_dir = obj.get_or_restore_field('working_directory');
end
if isempty(work_dir)
work_dir = tmp_dir;
Expand All @@ -275,7 +272,7 @@
if is_depl
work_dir = obj.shared_folder_on_remote;
else
work_dir = get_or_restore_field(obj,'working_directory');
work_dir = obj.get_or_restore_field('working_directory');
end
if isempty(work_dir)
is = true;
Expand Down Expand Up @@ -411,7 +408,7 @@
end
%
function mpirunner = get.external_mpiexec(obj)
mpirunner = get_or_restore_field(obj,'external_mpiexec');
mpirunner = obj.get_or_restore_field('external_mpiexec');
end
%
function obj=set.external_mpiexec(obj,val)
Expand Down Expand Up @@ -452,4 +449,4 @@
the_opt = select_option_(opt,arg);
end
end
end
end
Original file line number Diff line number Diff line change
Expand Up @@ -7,13 +7,11 @@
% The cluster name (can be defined by single symbol)
% or by a cluster number in the list of clusters
%
wrkr = which(obj.worker_);
mff = MPI_clusters_factory.instance();
assert(~isempty(which(obj.worker)) || exist(obj.worker, 'file'), ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code seems duplicated 3x is it worth making it a function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... maybe... it's just it's in three different classes... :/

'HERBERT:parallel_config:not_available', ...
'Parallel worker is not on the Matlab path so parallel features are not available');

if isempty(wrkr)
error('HERBERT:parallel_config:not_available',...
'Parallel worker is not on the Matlab path so parallel features are not available')
else
mff = MPI_clusters_factory.instance();
known_clusters = mff.known_cluster_names;
full_cl_name = obj.select_option(known_clusters,cluster_name);
mff.parallel_cluster = full_cl_name;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,16 @@
'The worker property needs the executable script name')
end
scr_path = which(new_wrkr);
config_instance = config_store.instance();
if isempty(scr_path)
% Check if it is a compiled worker
compiled_wrkr = check_compiled_(new_wrkr);
if ~isempty(compiled_wrkr)
config_instance.store_config(obj, 'worker', new_wrkr);
config_instance.store_config(obj, 'is_compiled', true);
return
end

def_wrkr = obj.worker_;
if strcmp(new_wrkr,def_wrkr)
cur_fmw = get_or_restore_field(obj,'parallel_cluster');
Expand All @@ -18,12 +27,12 @@
'to all running Matlab sessions but parallel config can not find it.',...
' Parallel extensions are disabled'],...
new_wrkr)
config_store.instance().store_config(obj,...
config_instance.store_config(obj,...
'parallel_cluster','none','cluster_config','none');

end
else
config_store.instance().store_config(obj,'worker',def_wrkr);
config_instance.store_config(obj,'worker',def_wrkr);
error('PARALLEL_CONFIG:invalid_argument',...
['The script to run in parallel (%s) should be available ',...
'to all running Matlab sessions but parallel config can not find it.',...
Expand All @@ -32,6 +41,27 @@

end
else % worker function is available.
config_store.instance().store_config(obj,'worker',new_wrkr);
config_instance.store_config(obj, 'worker', new_wrkr);
config_instance.store_config(obj, 'is_compiled', false);
end
end % function

function out = check_compiled_(worker)
out = '';
if is_file(worker) && ~endsWith(worker, '.m')
% Assume if input is full path to file, then it is a compiled worker
out = worker;
else
if ispc()
cmd = 'where';
else
cmd = 'which';
end
[rs, rv] = system([cmd ' ' worker]);
if rs == 0
% Assume if it is on the system path, then it is a compiled worker
out = splitlines(strip(rv));
out = out{1}; % Only take first path if there is more than one
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather distinguish between compiled and Matlab worker on extension. The Matlab worker would be .m and compiled -- everything else. This would be simpler then what is done here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this bit is to allow a user shortcut because what we want is the fully qualified path to pass to the Java ProcessBuilder, so if the user just sets pc.worker = 'horace_worker.exe' without the full path it would not work; so the system('[which|where] <worker>') here is to get the full path. .m files will not be returned by this (unless they have erroneously been marked as executable by the filesystem on Linux).

end
end