HPC System Administrator

University of Saskatchewan; Information and Communication Technology, Academic and Research Technologies
Saskatoon
Saskatchewan

In support of Research Computing in ICT, the Advanced Computing System Administrator performs system administration on University of Saskatchewan and WestGrid / Compute Canada systems, including large storage systems, backup servers, tape libraries, and High Performance Computing (HPC) clusters. The position works to ensure that storage and computing resources are available, effectively administered, patched, secured, repaired and maintained, and that they achieve required levels of performance for University and Compute Canada users. The Advanced Computing System Administrator, in consultation with the Research Computing group, implements technical solutions to research problems for researchers within the university and Compute Canada communities.

Nature of Work:

The Advanced Computing System Administrator handles a wide variety of tasks: responding to requests from users; performing urgent maintenance on failed infrastructure; implementing advanced computing systems and clusters; and performing long-term analysis on future directions for software and hardware for supported systems.

This position requires expertise in the use of research technologies, a client-focused and collaborative approach to service, and an understanding of the research process and the IT environment of the university. The role includes hardware installation and maintenance, HPC cluster support, and software installation and configuration. The position requires keen attention to detail and careful, measured action; because the incumbent is a primary administrator on a multi-petabyte data storage system, errors that compromise data have considerable impact. The position requires the ability to work effectively both independently and within a team, and to appropriately prioritize responses to a wide variety of time-sensitive tasks.

Participation in local, WestGrid, and Compute Canada meetings and working groups is expected, and occasional travel, both within Canada and internationally, may be required. Occasional work outside of core business hours is required, balanced by flex-time consideration.

Accountabilities:

The key accountabilities for this position are:

  • Maintaining Compute Canada / WestGrid TSM backup and storage systems at remote Compute Canada sites (e.g. the Cedar national system at Simon Fraser University), and the integrity of the data on those systems;
  • Optimizing the storage access and network performance seen by Compute Canada / WestGrid users at remote Compute Canada sites;
  • Supporting and installing hardware and software infrastructure associated with High Performance Computing for groups of researchers and for ICT, where assigned;
  • Participating effectively in Compute Canada / WestGrid technical and working groups, where assigned;
  • Maintaining appropriate levels of system security on assigned systems, including patching;
  • Actively participating in the Research Computing team, including discussions, decisions and tasks related to the overall configuration, operation, and support of IT for research and scholarship.

Qualifications:

  • Education: An undergraduate degree in Computer Science or related field, or significant experience with HPC and storage system administration. A combination of education and relevant experience may be considered.

  • Experience: Two years of work experience with systems administration. An equivalent combination of education and experience may be considered. Demonstrable experience with IBM Spectrum Protect (tape backup storage hardware and software) is strongly desired.

  • Skills: Extensive and current knowledge of an IT-based research environment, including data storage, HPC clustering tools, Linux, and modern server hardware.

Experience with any of the following will be considered an asset:

  • Tape backup for storage systems (IBM TSM/Spectrum Protect)
  • Clustered filesystems (e.g. GPFS, Lustre)
  • Linux system administration (Red Hat 6/7, other distributions)
  • Large (petabyte-scale) storage hardware management: SAN, NAS
  • Scripting languages (e.g. Perl, Python, Unix shells)
  • HPC cluster hardware (servers/blades, switches, racks)
  • HPC clustering software (e.g. Bright, Rocks, xCAT)
  • HPC job batching systems (e.g. Slurm, Maui/Moab, SGE)
  • High performance networking interconnects (e.g. InfiniBand)
  • IP networking and troubleshooting, performance optimization
  • POSIX file systems

Knowledge of the University of Saskatchewan's culture and goals would be an asset.