
Using the consortium license to run MATLAB Distributed Computing Server workers

Introduction

Since 2009, WestGrid has made an annual purchase of a 64-worker "consortium" license for the MATLAB Distributed Computing Server; in March 2013, the number of worker licenses was increased to 160. The consortium license allows researchers from Canadian academic institutions who have licensed the Parallel Computing Toolbox (or have access to it through a local server) to submit jobs to a WestGrid cluster, even if that cluster is not located at their home institution. Orcinus is the only WestGrid cluster on which the Distributed Computing Server workers run under the consortium license. A separate SFU site license allows Distributed Computing Server jobs to be run on Bugaboo by researchers from that university.

The MathWorks web site has detailed instructions for running MATLAB in parallel using a combination of the Parallel Computing Toolbox and the Distributed Computing Server. If you own a Parallel Computing Toolbox license and would like to get started using the MATLAB Distributed Computing Server environment on Orcinus, please contact WestGrid technical support.  Researchers from several WestGrid institutions (including U of A, U of C, UBC and SFU) can avoid purchasing Parallel Computing Toolbox licenses, as these are available on local servers.  Other institutions may also provide this support.

To get a feeling for what is involved in setting up and using MATLAB through the Distributed Computing Server, you could refer to the MathWorks web site at http://www.mathworks.com/products/parallel-computing/.

Using public key authentication to connect to Orcinus

The details in this section on public key authentication are only needed for versions of MATLAB before R2011a, so most users can skip them.

The MATLAB Parallel Computing Toolbox uses the SSH (secure shell) network protocol to log in to Orcinus to execute commands (such as qsub for submitting batch jobs).   Similarly, the SCP (secure copy) protocol is used for transferring files back and forth between Orcinus and the system on which the Parallel Computing Toolbox is running. It would be very cumbersome to be prompted for a password every time MATLAB needed to execute a remote command or transfer a file.  This can be avoided by using public key authentication.  With this method, you have to enter your WestGrid password only once during a session in which you are submitting MATLAB jobs.

Before attempting to use the Parallel Computing Toolbox for remote MATLAB job submission on Orcinus, you should set up public key authentication and verify that you can use it to connect to Orcinus with an SSH client and a secure copy (scp) file transfer client.  The details of how you do that depend on what type of system (Linux, Microsoft Windows, Mac OS X, ...) you are using.

In brief, for Linux and Macintosh systems, you generate keys with ssh-keygen (making sure that you use a pass phrase) and transfer the public key to the .ssh/authorized_keys file on Orcinus.  Then, whenever you want to submit MATLAB jobs, you run commands like ssh-agent /bin/bash and ssh-add key_file (which should prompt you for the pass phrase associated with your ssh key) before starting MATLAB.  On Microsoft Windows systems the idea is similar, in that you need to generate ssh keys and install the public key in .ssh/authorized_keys on Orcinus and then request that ssh/scp connections use the installed keys. However, the key generation and management software does not come pre-installed.  Typically, PuTTY is used, as described here.
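As a rough sketch of the Linux or Macintosh setup (the key file name and paths below are only illustrative, and ssh-copy-id can be replaced by appending the public key to .ssh/authorized_keys by hand), the steps might look like this:

ssh-keygen -t rsa -f ~/.ssh/orcinus_key
(choose a pass phrase when prompted)
ssh-copy-id -i ~/.ssh/orcinus_key.pub myWestgridID@orcinus.westgrid.ca
ssh-agent /bin/bash
ssh-add ~/.ssh/orcinus_key
(give passphrase)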

Distributed Computing Server - basic concepts

As mentioned above, the MATLAB Parallel Computing Toolbox is used to control job submission when using the Distributed Computing Server installation of MATLAB on a WestGrid cluster, such as Orcinus. Details of these MathWorks products are available on their web site, including a user's guide for the Parallel Computing Toolbox.  There is also an administrator's guide for the Distributed Computing Server, but most end users will not need to look at that.

The figure below illustrates the relationships among the various software and hardware components involved when the Parallel Computing Toolbox on your computer is used to submit a batch job on an Orcinus login node, which, in turn, runs workers under the Distributed Computing Server license on the Orcinus compute nodes.

[Figure: MATLAB on Orcinus]

Some of the details of the interaction among the various components shown above are either largely hidden from view or require only a one-time setup.  For example, the SSH interactions in the diagram are taken care of by the public-key authentication setup described in the previous section.  However, a basic understanding of what is going on "under the hood" is helpful, so it is discussed below. For additional details, see the MathWorks web site, particularly the section on Using the Generic Scheduler Interface in the Parallel Computing Toolbox User's Guide.

The Generic Scheduler Interface is a way of describing to MATLAB where you want to run your job (on a remote cluster in most cases) and sending the necessary commands to the batch job scheduler on that remote system.  This is done by defining something called a scheduler object in your MATLAB session.  The scheduler object is a structure with several components, most of which will be the same from job to job.  Specific examples of how to set up the scheduler object for submitting jobs to a WestGrid cluster will be given later in these notes.

One of the most important components of the scheduler object is a reference to the MATLAB code, called a submit function, that is used to construct the batch job script (or a series of scripts if you are submitting several tasks at the same time), send that script to the remote cluster, construct the TORQUE qsub command line used to submit the job to the cluster's batch system, and then actually run the qsub command.  The submit function can also copy any data files needed for the job from your local machine to the remote cluster, although if you have a large data set that is referenced by several different jobs, you may prefer to copy it to the cluster manually ahead of time.
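For example, with the single-processor settings used in getschedule.m later in these notes, the qsub command line constructed by the submit function might resemble the following (the job script name Job1.sh is just a placeholder):

qsub -l procs=1,mem=512mb,walltime=00:05:00,software=MDCE:1 -m bea -M myEmail@email_address Job1.sh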

When the batch job script created by the submit function is actually executed on the compute nodes of the cluster, a Distributed Computing Server worker is started. Some environment variables defined in the batch job are used by the MATLAB worker to locate such things as the directory containing the data needed for the job.

You may use one of two different submit functions, depending on whether your MATLAB calculations are essentially a number of independent (serial) tasks or whether you have a parallel calculation in which different workers need to communicate and exchange data as the calculation proceeds.  You may not need to know the details of the submit function code if you are using an institutional server to submit your jobs.  However, if you are submitting the jobs directly from your own computer, you will have to edit a few lines of the sample submit functions that MathWorks provides and install them where they can be found by your MATLAB session.  Until these notes are more self-contained, please contact support@westgrid.ca for more specific advice on editing and installing the submit functions.

Examples for MATLAB versions before R2011a

Some examples are given in the tutorial notes available here.

These examples rely on scripts that should be installed in the toolbox/local directory (matlab-toolboxwin.zip or matlab-toolboxunix.tar.gz).

Make sure that the directory referred to by remoteDataLocation exists.
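If it does not, you can create it once from a terminal before submitting any jobs (substituting your own WestGrid ID), for example:

ssh myWestgridID@orcinus.westgrid.ca "mkdir -p /global/scratch/myWestgridID"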

Once your public key authentication is set up, make a script based on the following example:

getschedule.m

function [ sched ] = getschedule()
%Change the following five lines
WestgridID='myWestgridID';
Email='myEmail@email_address';
Nprocs='1';
Wtime='00:05:00';
Memory='512mb';

SubmitArguments=strcat(' -l procs=',Nprocs,',mem=',Memory,',walltime=',Wtime,',software=MDCE:',Nprocs,' -m bea -M ',Email);
VER=version('-release');

switch VER
  case '2009a'
   remoteMatlabRoot='/global/software/matlab-2009a';
  case '2009b'
   remoteMatlabRoot='/global/software/matlab-2009b';
  case '2010a'
   remoteMatlabRoot='/global/software/matlab-2010a';
  case '2010b'
   remoteMatlabRoot='/global/software/matlab-2010b';
  otherwise
   fprintf(' Matlab version %s is not supported\n',VER);
   return;
end
clusterHost=strcat(WestgridID,'@orcinus.westgrid.ca');
remoteDataLocation=strcat('/global/scratch/',WestgridID);
sched = findResource('scheduler','type','generic');
set(sched,'ClusterSize',1);
set(sched, 'ClusterOsType', 'unix');
set(sched,'HasSharedFilesystem',0);
set(sched,'ClusterMatlabRoot',remoteMatlabRoot);
set(sched,'GetJobStateFcn',@pbsGetJobState);
set(sched,'DestroyJobFcn',@pbsDestroyJob);
set(sched,'SubmitFcn',{@pbsNonSharedSimpleSubmitFcn,clusterHost,remoteDataLocation,SubmitArguments});
set(sched,'ParallelSubmitFcn',{@pbsNonSharedParallelSubmitFcn,clusterHost,remoteDataLocation,SubmitArguments})

testserial.m

function testserial()
sched=getschedule
j=createJob(sched)
createTask(j,@rand,1,{3,3});
submit(j)

To run,

ssh-agent bash
ssh-add
(give passphrase)
matlab
testserial
quit

You will see two numbers corresponding to the job. The first is the job ID assigned by MATLAB and the second is the identifier assigned by the batch system on Orcinus, which has the form xxxxx.orca1.ibb.

Once the job has finished, you should get an email. Then to retrieve the results you can use these steps:

ssh-agent bash
ssh-add
(give passphrase)
matlab
sched=getschedule
%using as an example matlab job ID=1
j=findJob(sched,'ID',1)
results=getAllOutputArguments(j)
results{:}
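
If you would rather check on the job from within MATLAB (started the same way as above) instead of waiting for the email, you can query its state; a minimal sketch, again using MATLAB job ID 1 as an example:

sched=getschedule;
j=findJob(sched,'ID',1);
get(j,'State')
%typically returns 'pending', 'queued', 'running' or 'finished'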

Here is an example of a job using matlabpool and parfor.

In getschedule.m, change

Nprocs='1'

to

Nprocs='4';

testparfor2.m

function a = testparfor2(N)
%Record which task handles each iteration of the parfor loop
a = zeros(N,1);
parfor(i=1:N)
t=getCurrentTask();
a(i) = t.ID;
end

testparfor.m

function testparfor()
sched=getschedule
j = createMatlabPoolJob(sched);
j.FileDependencies={'testparfor2.m'};
set(j,'MaximumNumberOfWorkers',4);
set(j,'MinimumNumberOfWorkers',4);
t = createTask(j,@testparfor2,1,{3})
alltasks = get(j, 'Tasks');
set(alltasks, 'CaptureCommandWindowOutput', true)
submit(j)

The procs value in SubmitArguments should correspond to MaximumNumberOfWorkers and MinimumNumberOfWorkers. The argument to testparfor2 is set to 3 because the number of workers available to the parfor loop is one less than the number of processors requested; one processor is used by the master process.

To run,

ssh-agent bash
ssh-add
(give passphrase)
matlab
testparfor
quit

A parallel job is slightly different, as shown below:

testparallel.m

function testparallel()
sched=getschedule
% create the matlab job
pjob=createParallelJob(sched);

set(pjob, 'MaximumNumberOfWorkers', 4)
set(pjob, 'MinimumNumberOfWorkers', 4)
%create parallel task using colsum.m
set(pjob, 'FileDependencies', {'colsum.m'})
t=createTask(pjob, @colsum, 1, {})
%submit PBS job
submit(pjob)

If your job requires input data files and/or produces output files, it is easiest to transfer them to and from Orcinus using scp. Note that MATLAB jobs on Orcinus start in your home directory, so if you transferred your input data to the /global/scratch/myWestgridID/DATA directory, you should include

cd /global/scratch/myWestgridID/DATA

in the main function that you are running in your job.
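For example, a hypothetical main function myanalysis.m (the directory, file and variable names below are only placeholders) might begin like this:

myanalysis.m

function result = myanalysis()
%Change to the directory holding the input data copied with scp
cd /global/scratch/myWestgridID/DATA
%Load the input and do the actual work
data = load('input.mat');
result = sum(data.x);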

Examples for MATLAB version R2011a

In this version, MathWorks changed how jobs are created and the scripts for sending jobs to the remote server.

The new scripts that need to be installed in toolbox/local can be obtained from matlab2011-toolbox.zip.

The corresponding example scripts for running jobs are

getschedule.m

function [ sched ] = getschedule()
%Change the following five lines
WestgridID='myWestgridID';
Email='myEmail@email_address';
Nprocs='1';
Wtime='00:01:00';
Memory='512mb';

submitArguments=strcat(' -l procs=',Nprocs,',mem=',Memory,',walltime=',Wtime,',software=MDCE:',Nprocs,' -m bea -M ',Email);
VER=version('-release');
switch VER
    case '2011a'
        remoteMatlabRoot='/global/software/matlab-2011a';
    otherwise
        fprintf(' Matlab version %s is not supported\n',VER);
        return;
end
clusterHost='orcinus.westgrid.ca';
remoteDataLocation=strcat('/global/scratch/',WestgridID);
sched = findResource('scheduler','type','generic');
set(sched,'ClusterSize',str2num(Nprocs));
set(sched, 'ClusterOsType', 'unix');
set(sched,'HasSharedFilesystem',0);
set(sched,'ClusterMatlabRoot',remoteMatlabRoot);
set(sched,'GetJobStateFcn',@getJobStateFcn);
set(sched,'DestroyJobFcn',@destroyJobFcn);
set(sched,'SubmitFcn',{@distributedSubmitFcn,clusterHost,remoteDataLocation,submitArguments});
set(sched,'ParallelSubmitFcn',{@parallelSubmitFcn,clusterHost,remoteDataLocation,submitArguments});

testserial.m

function testserial()
sched=getschedule
j=createJob(sched)
createTask(j,@rand,1,{3,3});
submit(j)
wait(j)
results = getAllOutputArguments(j);
results{:}

Public key authentication is no longer used. To run testserial,

matlab
testserial
Enter the username for orcinus.westgrid.ca:
(enter Westgrid ID)
Use an identity file to login to orcinus.westgrid.ca? (y or n)
n
Please enter the password for user myWestgridID on orcinus.westgrid.ca:
(enter password)

Examples for MATLAB versions R2012a to R2015b

These versions require further script changes. The new scripts that need to be installed in toolbox/local can be obtained from matlab2012a-toolboxunix.tar.gz. The corresponding example scripts for running jobs are

getcluster.m

function [ cluster ] = getcluster()
%Change the following five lines
WestgridID='myWestgridID';
Email='myEmail@email_address';
Nprocs='1';
Wtime='00:01:00';
Memory='512mb';

submitArguments=strcat(' -l procs=',Nprocs,',mem=',Memory,',walltime=',Wtime,',software=MDCE:',Nprocs,' -m bea -M ',Email);
VER=version('-release');
switch VER
    case '2012a'
        remoteMatlabRoot='/global/software/matlab-2012a';
    case '2013a'
        remoteMatlabRoot='/global/software/matlab-2013a';
    case '2013b'
        remoteMatlabRoot='/global/software/matlab-2013b';
    case '2014a'
        remoteMatlabRoot='/global/software/matlab-2014a';
    case '2014b'
        remoteMatlabRoot='/global/software/matlab-2014b';
    case '2015b'
        remoteMatlabRoot='/global/software/matlab-2015b';
    otherwise
        fprintf(' Matlab version %s is not supported\n',VER);
        return;
end
clusterHost='orcinus.westgrid.ca';
remoteJobStorageLocation = strcat('/global/scratch/',WestgridID);
cluster = parallel.cluster.Generic();
set(cluster, 'HasSharedFilesystem', false);
set(cluster, 'ClusterMatlabRoot', remoteMatlabRoot);
set(cluster, 'OperatingSystem', 'unix');
% The IndependentSubmitFcn must be a MATLAB cell array that includes the three additional inputs
set(cluster, 'IndependentSubmitFcn', {@independentSubmitFcn, clusterHost, remoteJobStorageLocation, submitArguments});
% If you want to run communicating jobs (including matlabpool), you must specify a CommunicatingSubmitFcn
set(cluster, 'CommunicatingSubmitFcn', {@communicatingSubmitFcn, clusterHost, remoteJobStorageLocation, submitArguments});
set(cluster, 'GetJobStateFcn', @getJobStateFcn);
set(cluster, 'DeleteJobFcn', @deleteJobFcn);

testserial.m

function testserial()
cluster=getcluster
j=createJob(cluster)
createTask(j,@rand,1,{3,3});
submit(j)
wait(j)
results = fetchOutputs(j);
results{:}

Public key authentication is no longer used. To run testserial,

matlab
testserial
Enter the username for orcinus.westgrid.ca:
(enter Westgrid ID)
Use an identity file to login to orcinus.westgrid.ca? (y or n)
n
Please enter the password for user myWestgridID on orcinus.westgrid.ca:
(enter password)


For a matlabpool parallel job, change Nprocs in getcluster.m to 8.

testparfor.m

function testparfor()
cluster=getcluster
j = createCommunicatingJob(cluster,'Type','pool')
j.AttachedFiles={'testparfor2.m'};
t = createTask(j,@testparfor2,1,{7});
j.NumWorkersRange = [8,8]
submit(j)
wait(j)
results = fetchOutputs(j);
results{:}

testparallel.m

function testparallel()
cluster=getcluster
% create the matlab job
pjob = createCommunicatingJob(cluster,'Type','spmd')
pjob.AttachedFiles={'colsum.m'};

%create parallel task using colsum.m
t=createTask(pjob, @colsum, 1, {}) ;
pjob.NumWorkersRange = [8,8]
%submit PBS job
submit(pjob)

colsum.m

function total_sum = colsum
if labindex == 1
% Send magic square to other labs
A = labBroadcast(1,magic(numlabs)) ;
else
% Receive broadcast on other labs
A = labBroadcast(1) ;
end

% Calculate sum of column identified by labindex for this lab
column_sum = sum(A(:,labindex)) ;
% Calculate total sum by combining column sum from all labs
total_sum = gplus(column_sum);
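
As with the pool example, the results of the spmd job can be collected once it finishes; a minimal sketch, assuming the job object pjob from testparallel.m above is still in your workspace:

wait(pjob)
results = fetchOutputs(pjob);
%typically one cell per worker, each containing the total sum
results{:}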

License status

Distributed Computing Server jobs on Orcinus are not started until Distributed Computing Server licenses are available. The license status can be checked by running the lmstat command on Orcinus:

/global/software/matlab-flexlm/etc/lmstat -a

 

Updated: 2014-06-17 - Removed links to old documentation on other sites and corrected command to check license status.