Common job submission errors and how to avoid them - Webinar Recap

Earlier this month, Kamil Marcinkowski, WestGrid Site Lead at the University of Alberta, gave a webinar on ‘Common job submission errors and how to avoid them’ for researchers using Compute Canada systems.

This session was aimed at users who are familiar with submitting jobs on Compute Canada systems but who could use some tips on using the scheduler more effectively. Kamil shared some of the common mistakes users make with the scheduler and how to avoid them in order to run more jobs and get research done more quickly. From troubleshooting tips to explanations of job memory and priority, his presentation provides a helpful breakdown of what to do if a job is not running properly and how to get help when you need it.

We’ve summarized a few of the key points below, but you can view the full set of slides and watch the webinar recording on the WestGrid Training Materials site.

Troubleshooting

When you're troubleshooting issues with the scheduler, there are some key things that can help you (and support analysts) determine what might be causing issues. Important pieces of information to note include:

  • any error messages you receive
  • the environment settings your jobs were using when the problem occurred
  • the output of the commands you ran on the system that led you to think there is a problem

Memory

Each job you schedule takes a specific amount of memory. The smaller the job, the less memory it takes and the sooner it can finish. If a job runs out of memory you can request more, but be careful how much you ask for. Some things to keep in mind:

  • The more memory a job requests, the harder it is to schedule
  • There is only 4 GB of RAM per core on most cluster nodes
  • Requesting more memory counts as using more resources, which affects your priority and your allocation of resources
  • All RAM requests should be made in thousands of MiB
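
As an illustration of the points above, here is a minimal Python sketch that rounds a memory estimate up to the next thousand MiB and flags requests above the per-core figure. It rests on several assumptions not stated in the webinar recap: that the scheduler is Slurm (so the request is written as a `--mem=...M` option), that "4 GB per core" can be approximated as 4000 MiB, and that the helper names `round_up_mem_mib` and `mem_request` are hypothetical.

```python
import math

# Assumption: roughly 4 GB of RAM per core, expressed here as 4000 MiB.
MIB_PER_CORE = 4000


def round_up_mem_mib(needed_mib, step=1000):
    """Round a memory estimate (in MiB) up to the next multiple of `step`."""
    if needed_mib <= 0:
        raise ValueError("memory estimate must be positive")
    return math.ceil(needed_mib / step) * step


def mem_request(needed_mib, cores=1):
    """Return a Slurm-style memory option and whether it exceeds the per-core figure.

    The second value is True when the rounded request is larger than
    cores * MIB_PER_CORE, which suggests the job may be harder to schedule.
    """
    rounded = round_up_mem_mib(needed_mib)
    exceeds_per_core = rounded > cores * MIB_PER_CORE
    return f"--mem={rounded}M", exceeds_per_core


if __name__ == "__main__":
    print(mem_request(2500))           # ('--mem=3000M', False)
    print(mem_request(9200, cores=2))  # ('--mem=10000M', True)
```

A request that comes back flagged is not wrong in itself. It only signals that the job asks for more than the typical per-core share and may wait longer in the queue.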

Running Jobs

If you're having trouble getting your job running, some things to check include:

  • the JobState and Reason
  • the resources available, using Compute Canada’s partition-stats command
  • the job priority
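
To make the first check concrete, the following Python sketch pulls the JobState and Reason fields out of job status text. It assumes the scheduler is Slurm and that the text comes from `scontrol show job <jobid>`. The sample text, job ID, and account name below are made up for illustration, and `job_state_and_reason` is a hypothetical helper.

```python
import re

# Illustrative sample only: a shortened, made-up excerpt in the style of
# `scontrol show job <jobid>` output (assumes a Slurm scheduler).
SAMPLE_OUTPUT = """JobId=123456 JobName=example
   Priority=1000 Nice=0 Account=def-someuser QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
"""


def job_state_and_reason(text):
    """Return (JobState, Reason) from job status text, or None for a missing field."""
    state = re.search(r"\bJobState=(\S+)", text)
    reason = re.search(r"\bReason=(\S+)", text)
    return (
        state.group(1) if state else None,
        reason.group(1) if reason else None,
    )


if __name__ == "__main__":
    print(job_state_and_reason(SAMPLE_OUTPUT))  # ('PENDING', 'Priority')
```

In this sample, a pending job with the reason "Priority" is waiting behind higher-priority jobs rather than failing, so the next thing to look at would be the job priority.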

Priority

The priority of your job depends on your team's usage. The fewer resources your team has used, the higher the priority your job will have. To help your jobs get better priority, you can add checkpoints to your work so that it runs as many smaller jobs. The benefits of creating checkpoints are that they:

  • Break long-running jobs into smaller runs
  • Minimize the work lost to hardware failures
  • Minimize downtime on your work
  • Allow smaller jobs to fit more easily into the schedule
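
The checkpoint idea can be sketched in a few lines of Python. This is a toy example under stated assumptions, not the method presented in the webinar: the "work" is a running sum, the state is stored as JSON, and the names `save_checkpoint` and `run_with_checkpoints` are hypothetical. Each call does a bounded number of steps and then stops, so a long computation can be split across several shorter jobs that pick up where the previous one left off.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to a temporary file, then rename it over the checkpoint.

    Renaming avoids leaving a half-written checkpoint if the job is killed
    in the middle of the write.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_with_checkpoints(total_steps, checkpoint_path, max_steps_this_run, every=10):
    """Advance a toy computation, resuming from checkpoint_path if it exists."""
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    # Stop at the end of the work or at this run's step budget, whichever is first.
    stop_at = min(total_steps, state["step"] + max_steps_this_run)
    while state["step"] < stop_at:
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % every == 0 or state["step"] == stop_at:
            save_checkpoint(state, checkpoint_path)
    return state


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.json")
    first = run_with_checkpoints(100, path, max_steps_this_run=40)
    second = run_with_checkpoints(100, path, max_steps_this_run=100)
    print(first["step"], second["step"], second["total"])  # 40 100 4950
```

The second call does not repeat the first 40 steps. It reads the saved state and finishes the remaining 60, which is the behaviour that lets one long job become several short ones.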

Getting Help

If you’re having issues with the scheduler and aren't able to troubleshoot them yourself, you can reach out to the Compute Canada support team and they will be able to help you work out a solution. When asking for help, please remember to:

  • Ask for help in a respectful manner
  • Include your account name and the job name
  • Include the system name and describe what isn’t working / what you’re having difficulties with
  • Be as clear about the issue as possible, giving details that explain what is happening and what you need help with

For a more in-depth tutorial on scheduling and job management on Compute Canada clusters, check out Kamil’s three-part webinar series “Scheduling & Job Management: How to Get the Most from a Cluster”. The recording is available on WestGrid's Training Materials site.

If you have questions at any time about using Compute Canada resources, you can email support@computecanada.ca.