Frequently Asked Questions and Troubleshooting

Dataloader sometimes hangs when num_workers > 0

Add torch.multiprocessing.set_start_method("forkserver") at the start of your program. The default "fork" start method is error-prone by design. For more information, see the PyTorch documentation and StackOverflow.
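A minimal sketch of where the call belongs; the dataset, batch size, and worker count below are placeholders:

```python
import torch

def main():
    # DataLoader worker processes are now created with "forkserver"
    # instead of the default "fork" start method.
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(torch.arange(1000).float()),
        batch_size=32,
        num_workers=4,
    )
    for batch in loader:
        pass

if __name__ == "__main__":
    # Must be called once, before any DataLoader workers are spawned.
    torch.multiprocessing.set_start_method("forkserver")
    main()
```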

Error when installing Rust

If you see an error like the message below, remove the existing installation record first with rm -rf /root/.rustup and then reinstall.

error: could not rename component file from '/root/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/cargo' to '/root/.rustup/tmp/m74fkrv0gv6708f6_dir/bk'
error: caused by: other os error

Hang when running a distributed program

Check whether the machine has multiple network interfaces, and set the environment variable NCCL_SOCKET_IFNAME to the name of the interface you want to use (such as eth0), usually a physical one. Interface names can be listed with ls /sys/class/net/.
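You can export the variable in the shell before launching, or set it from inside the training script. A minimal sketch, assuming the nccl backend and an interface named eth0 (replace it with one of the names printed by ls /sys/class/net/):

```python
import os

import torch.distributed as dist

# Tell NCCL which network interface to use. The variable must be set
# in every process, before the process group is initialized.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder interface name

dist.init_process_group(backend="nccl")
```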

Model accuracy drops

Using a different algorithm or using more GPUs has a similar effect to using a different optimizer, so you need to retune your hyperparameters. Some tricks you can try:

  1. Train for more epochs, increasing the number of training iterations by about 20-30% over the original.
  2. Scale the learning rate. If the total batch size of distributed training is increased by a factor of $N$, the learning rate should also be increased by a factor of $N$, to $N \times lr$.
  3. Performing a gradual learning rate warmup for several epochs often helps (see also Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour). A combined sketch of tips 2 and 3 follows this list.
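A minimal sketch of linear learning rate scaling plus gradual warmup, using torch.optim.lr_scheduler.LambdaLR; the base learning rate, the scaling factor, the warmup length, and the toy model are placeholder assumptions:

```python
import torch

base_lr = 0.1        # learning rate tuned for the original batch size (assumption)
n = 8                # factor by which the total batch size grew (assumption)
warmup_iters = 500   # number of warmup iterations (assumption)

model = torch.nn.Linear(10, 10)
# Linear scaling rule: the target learning rate is N times the original one.
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * n)

def warmup_factor(step):
    # Ramp the learning rate linearly from base_lr to base_lr * n over the
    # first `warmup_iters` iterations, then keep it constant.
    if step >= warmup_iters:
        return 1.0
    alpha = step / warmup_iters
    return (1.0 / n) * (1.0 - alpha) + alpha

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for step in range(1000):
    optimizer.step()   # forward/backward passes omitted in this sketch
    scheduler.step()
```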