Getting comfortable with Machine Learning on Servers

6 minute read

Published:

A lot of new members ask about how to work with servers, so I thought it would be nice to put everything in one place. This post is a list of things you need to know; you can google for an expanded tutorial on any of them. I will write based on what I have experienced and what we face daily at MIDAS Lab. If you take away only one useful tip from this article, it should be this: be polite to the people that are sharing the server.

Getting the Server access

  • For starters, SSH is the way you connect to the servers.
  • If you are accessing the server from outside the university network, you might need to set up a VPN connection. FortiClient on Windows/Linux and openfortivpn on Linux let you do that.
  • For Linux, SSH is not a problem. On Windows, you can either set up WSL (Windows Subsystem for Linux) or use PuTTY.
  • Most of the labs use key-based SSH login. For that, you need to generate an SSH key pair, which can be done with ssh-keygen on Linux and with PuTTYgen (part of PuTTY) on Windows.

Getting a feel of your new workspace

  • Servers usually run Linux. Please read about standard Linux commands like ls to list directories, cd to change directories, etc. The Missing Semester of your CS Education provides a good overview of these.
  • Use htop to view the current CPU and RAM usage on the server. This is very handy when the server feels ‘slow’ and you need to inspect your running processes.
  • If the server has a GPU, use nvtop to view the GPUs on the server, their VRAM usage, and their current utilization. nvidia-smi is a related command that prints the same GPU information, but as a one-off snapshot rather than a live view.
  • To view how full the different storage mounts (home, data_dump) are, use df -h. If you are not sure which mount you are working on, use pwd to see where you are (df -h . shows the mount backing the current directory).
  • If you work with videos or a large number of small files, you may run out of inodes (the filesystem’s limit on the number of files) even when df -h shows free space. To check this, use df -i.
  • ncdu is also a useful command for finding out which directories are taking up the most space.
  • Use tmux. It not only lets you run multiple scripts in the same SSH session, it also keeps your scripts running after you close the session. There are a lot of tmux cheatsheets available online; keep one of them handy initially.
  • Learn to use rsync for copying files and transferring them across servers. Another related command is scp.
  • You can also run Jupyter kernels on the server while viewing the notebooks in your local browser. You do this via SSH tunneling (port forwarding).
  • You can also connect to your workspace using VS Code’s Remote - SSH extension.

Running your experiments and best practices

  • Please learn how to use Git. Push your code to GitHub or any other preferred version control host whenever you finish a work session; it just takes two minutes. Again, The Missing Semester of your CS Education is very useful for this.
  • For different projects, create different conda environments. This is very important since sometimes different projects have conflicting dependencies. While creating a conda environment, do it inside a tmux window since dependency installations take time.
  • Run your model’s training with Python scripts instead of Jupyter notebooks; you will save some time. You can convert notebooks to scripts using nbconvert. The fastai library has decorators for this purpose, and VS Code lets you run parts of a script with notebook-style cells.
  • If you do load models in notebooks for some reason, please release the GPU memory or shut down your notebook once you are done. Leaving it running is one of the most common bad practices: since the kernel keeps running, your model occupies GPU memory overnight at zero utilization, which causes a lot of problems for anyone who wants to run experiments after you (see the first sketch after this list).
  • On a related note, please use CUDA_VISIBLE_DEVICES to assign your training processes to specific GPUs. Without it, your process may grab memory on several GPUs when it could run on a single one (also shown after this list).
  • Keep your CPU usage in check if you can, especially during data preprocessing; taskset lets you pin a process to specific cores, and you can also cap thread counts from inside Python.
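
On the notebook point above, here is a minimal sketch of handing GPU memory back when you are done; the torch.nn.Linear is just a stand-in for whatever model you actually loaded:

    import gc
    import torch

    model = torch.nn.Linear(1000, 1000).cuda()   # stand-in for whatever you loaded
    # ... experiments ...

    del model                    # drop the Python references to the model
    gc.collect()
    torch.cuda.empty_cache()     # return PyTorch's cached GPU memory to the driver
    # Or simply shut the kernel down from the Jupyter UI once you are done.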
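
And a small sketch of pinning a process to one GPU and capping its CPU threads from inside Python; the GPU index and thread count here are made up, so pick whatever is actually free on your server. The same pinning works from the shell by prefixing your command, e.g. CUDA_VISIBLE_DEVICES=1 python train.py.

    import os

    # Must be set before torch (or tensorflow) is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

    import torch

    torch.set_num_threads(4)            # cap intra-op CPU threads on a shared machine
    print(torch.cuda.device_count())    # only the pinned GPU is visible now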

Tools to make your life easier and run experiments faster

  • In PyTorch, the better way to use multiple GPUs (when you do not need model parallelism) is DistributedDataParallel, not DataParallel. This holds even when you only want two GPUs on the same machine; a minimal sketch follows this list.
  • Please use TensorBoard to log your training and testing stats (a small logging example follows this list).
  • Use PyMagnitude to work with embeddings.
  • Use tqdm for progress bars. Import it from tqdm.auto, which works well in both scripts and notebooks (see the example after this list).
  • If you can, parallelize your jobs using either threading or multiprocessing. Both are Python standard-library modules suited to different cases: threading for I/O-bound work, multiprocessing for CPU-bound work. If these two don’t satisfy your needs, use GNU Parallel. A small multiprocessing sketch also follows this list.
  • Have a look at awesome emerging tools like RAPIDS for GPU-accelerated ML methods, fastai v2 and their famous DL course, AllenNLP, and PyTorch Lightning (my personal favourite).
  • Check out Knock Knock to get a notification when your training run finishes or crashes. Hugging Face also provides a lot of pretrained models for NLP-related tasks.
  • Check out Weights and Biases for experiment management. They are very kind to provide free academic licenses.
  • I also like Notion to maintain a research journal, draft posts for my blog, and other stuff. They are also very kind to provide free student licenses.
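
For the DistributedDataParallel point above, here is a minimal sketch that runs one process per GPU on a single machine; the tiny linear model and the random data are stand-ins for your real training loop:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train(rank, world_size):
        # One process per GPU; they find each other through these env vars.
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = DDP(torch.nn.Linear(10, 1).cuda(rank), device_ids=[rank])
        opt = torch.optim.SGD(model.parameters(), lr=0.01)

        for step in range(5):
            x = torch.randn(32, 10, device=rank)
            y = torch.randn(32, 1, device=rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()   # gradients are averaged across processes here
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(train, args=(world_size,), nprocs=world_size)

With a real dataset you would also wrap it in a DistributedSampler so that each process sees a different shard of the data.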
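
Logging scalars to TensorBoard from PyTorch only takes a few lines; the run name and the fake loss below are made up for illustration:

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="runs/my_experiment")   # hypothetical run name
    for step in range(100):
        loss = 1.0 / (step + 1)                            # stand-in for your real loss
        writer.add_scalar("train/loss", loss, step)
    writer.close()

You can then run tensorboard --logdir runs on the server and view the dashboard in your browser using the same SSH tunneling trick mentioned earlier.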
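
The tqdm.auto import mentioned above looks like this; the sleep is just a stand-in for real work:

    import time
    from tqdm.auto import tqdm   # picks the right frontend for scripts and notebooks

    for batch in tqdm(range(100), desc="training"):
        time.sleep(0.01)         # stand-in for real work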
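
And a small multiprocessing sketch for CPU-bound preprocessing; the file names, the worker count, and the preprocess function are hypothetical, so keep the pool size modest on a shared server:

    from multiprocessing import Pool

    def preprocess(path):
        # stand-in for real work, e.g. decoding a video or resizing images
        return len(path)

    if __name__ == "__main__":
        files = [f"clip_{i}.mp4" for i in range(1000)]   # hypothetical file list
        with Pool(processes=8) as pool:
            results = pool.map(preprocess, files)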

Thanks to Karmanya and Pakhi for useful suggestions and reviews. Feel free to suggest and comment on this article using this Twitter thread or this LinkedIn post.