Profiling a ResNet-18 Model on Multiple GPUs
In some cases, your model may be too large, or the dataset you are working with might be so extensive that leveraging multiple GPUs becomes necessary to handle the workload efficiently. Running computations in parallel across multiple GPUs can significantly speed up training and improve performance. This section explains how to extend the previous example of profiling a ResNet-18 model on a single GPU to a multi-GPU setup. We will use PyTorch’s Distributed Data Parallel (DDP) module to distribute the workload across multiple GPUs. Additionally, we will utilize the PyTorch Profiler to collect and analyze performance metrics, just as we did for the single GPU case, but now in a multi-GPU environment.
Code Example: Setting Up DDP for the ResNet-18 Model
# Import all the necessary libraries
import os
import torch
import torch.nn as nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T
from torchvision.models import ResNet18_Weights
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler


def ddp_setup():
    # Initialise the default process group (NCCL backend for GPU communication)
    # and bind this process to the GPU given by torchrun via LOCAL_RANK.
    init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))


def main():
    ddp_setup()
    rank = int(os.environ["LOCAL_RANK"])

    # Prepare data
    transform = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    trainset = torchvision.datasets.CIFAR10(
        root='./data',
        train=True,
        download=True,
        transform=transform
    )

    # DistributedSampler gives each process its own shard of the dataset
    sampler = DistributedSampler(trainset, shuffle=True)
    trainloader = torch.utils.data.DataLoader(
        trainset,
        batch_size=4,
        sampler=sampler,
        num_workers=2,
        pin_memory=True,
        persistent_workers=True
    )

    # Model setup: move the model to this process's GPU and wrap it in DDP
    device = torch.device(f"cuda:{rank}")
    model = torchvision.models.resnet18(weights=ResNet18_Weights.DEFAULT).to(device)
    model = DDP(model, device_ids=[rank])
    criterion = nn.CrossEntropyLoss().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    model.train()

    # Profiler setup: each rank writes its own trace (worker_name) to ./outgpus
    prof = torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA
        ],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(
            './outgpus',
            worker_name=f'worker{rank}'
        ),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    )
    prof.start()

    for step, data in enumerate(trainloader):
        inputs = data[0].to(device, non_blocking=True)
        labels = data[1].to(device, non_blocking=True)

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        prof.step()  # advance the profiler schedule
        print(f"Rank {rank} - step: {step}, loss: {loss.item():.4f}")

        if step >= 10:
            break

    prof.stop()

    destroy_process_group()


if __name__ == "__main__":
    main()
Understanding the DDP setup
To enable DDP in our code, we made several modifications to the previous implementation. DDP is a PyTorch module that allows us to parallelize our model across multiple GPUs or even multiple machines. You can learn more about DDP in the official PyTorch DDP tutorial. Additionally, if you are looking for an example of setting up DDP for our cluster, refer to this project: DDP Example on Saga.
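For reference, torchrun launches one process per GPU and passes the distributed configuration to each process through environment variables; the ddp_setup() function above reads LOCAL_RANK from these. The short sketch below simply shows the variables involved; the values in the comments assume a single node with two GPUs, as in our job script.

import os

# Environment variables set by torchrun for each spawned process:
local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node (0 or 1 with two GPUs)
global_rank = int(os.environ["RANK"])        # unique rank of this process across all nodes
world_size = int(os.environ["WORLD_SIZE"])   # total number of processes in the job

print(f"local rank: {local_rank}, global rank: {global_rank}, world size: {world_size}")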
Key Changes in the Code
Wrapping the Main Logic: The main logic of the code has been encapsulated in a function. This is a common practice when working with DDP to ensure proper initialization and cleanup of distributed processes.
Data Loading Optimizations: We added persistent_workers=True to the DataLoader to improve performance by keeping worker processes alive between iterations. We also set non_blocking=True for the data transfers so that copies to the GPU can overlap with computation, further optimizing the data loading process.
Memory Efficiency: To reduce memory usage, we now clear gradients with optimizer.zero_grad(set_to_none=True). This sets the gradients to None instead of zeroing them out, which can save memory during training.
Profiling Across Multiple GPUs: Since we are using multiple GPUs, the profiling code now collects metrics from all GPUs, which allows us to analyze the performance of the entire distributed setup. However, if you want to profile a specific GPU, you can modify the code to log metrics only for that rank, for example by adding a condition like if rank == 0: to restrict profiling to a single GPU (typically GPU 0); a minimal sketch of this is shown right after this list. While it is possible to profile all GPUs during testing, in practice profiling is often limited to a single GPU to reduce profiling overhead once the setup is verified.
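As referenced in the list above, here is a minimal sketch of restricting the profiler to a single rank. It is a fragment, not a complete script: it reuses the rank variable and the training loop from the example above and only replaces the profiler setup.

# Create the profiler only on rank 0; the other ranks train without recording a trace.
prof = None
if rank == 0:
    prof = torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA
        ],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(
            './outgpus',
            worker_name=f'worker{rank}'
        ),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    )
    prof.start()

# Inside the training loop, guard the profiler call:
#     if prof is not None:
#         prof.step()
# ...and after the loop:
#     if prof is not None:
#         prof.stop()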
Job Script for Utilizing Multiple GPUs
To run our code on multiple GPUs, we need to make a few modifications to the previous job script. These changes ensure that the script is configured to utilize multiple GPUs effectively and include PyTorch-specific parameters required for distributed training.
For example, we use the torchrun utility with the --standalone argument to indicate that the code will run on multiple GPUs within a single node. This is a key step in enabling PyTorch’s DDP functionality.
Below is the updated job script with the necessary changes for multi-GPU execution.
#!/bin/bash -l
#SBATCH --job-name=PyTprofilergpus
#SBATCH --account=<project_number>
#SBATCH --time=00:10:00 #wall-time
#SBATCH --partition=accel #partition
#SBATCH --nodes=1 #nbr of nodes
#SBATCH --ntasks=1 #nbr of tasks
#SBATCH --ntasks-per-node=1 #nbr of tasks per node (MPI processes)
#SBATCH --cpus-per-task=1 #nbr of threads
#SBATCH --gpus=2 #total nbr of gpus
#SBATCH --mem=4G #main memory
#SBATCH -o PyTprofilergpus.out #slurm output
# Set up job environment
set -o errexit # exit on any error
set -o nounset # treat unset variables as error
#define paths
Mydir=/cluster/work/users/<user_name>
MyContainer=${Mydir}/Container/pytorch_22.12-py3.sif
MyExp=${Mydir}/MyEx
#specify bind paths by setting the environment variable
#export SINGULARITY_BIND="${MyExp},$PWD"
#TF32 is enabled by default in the NVIDIA NGC TensorFlow and PyTorch containers
#To disable TF32 set the environment variable to 0
#export NVIDIA_TF32_OVERRIDE=0
#to run singularity container
srun singularity exec --nv -B ${MyExp},$PWD ${MyContainer} torchrun --standalone --nnodes=1 --nproc_per_node=${SLURM_GPUS_PER_NODE:-2} ${MyExp}/resnet18_api_ddp.py
echo
echo "--Job ID:" $SLURM_JOB_ID
echo "--total nbr of gpus" $SLURM_GPUS
echo "--nbr of gpus_per_node" $SLURM_GPUS_PER_NODE
Performance Metrics
In this section, we present screenshots of various performance metrics captured using the PyTorch Profiler. These metrics provide insights into GPU-specific logs and help us analyze the performance of our multi-GPU setup.
1. GPU Usage
When profiling performance metrics for multiple GPUs, the TensorBoard dashboard allows us to select and view the metrics for each GPU individually. For example, as shown in the figure below, we can analyze the GPU utilization for each GPU in the system:

The dashboard provides detailed information, including GPU utilization for each GPU. In our case, both GPUs are utilized at approximately 33%, and the profiler also provides performance recommendations for both GPUs. These insights are valuable for identifying bottlenecks and optimizing GPU usage in a distributed training setup.
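Besides the TensorBoard dashboard, the profiler object itself can print an aggregated operator summary to the Slurm output. The snippet below is a minimal sketch, assuming the prof object and rank variable from the example above, and printing from one rank only to keep the log readable:

# After prof.stop(): show the operators that consumed the most GPU time on this rank.
if rank == 0:
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))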
2. Trace View
Similar to the single GPU case, we can now view the trace for each individual GPU in a multi-GPU setup. The trace view provides a detailed timeline of operations, helping us identify potential bottlenecks for each GPU. As shown in the figure below, the trace view allows us to analyze the execution patterns and performance of each GPU:

This visualization is particularly useful for pinpointing inefficiencies and understanding how workloads are distributed across GPUs in a distributed training setup.
3. Memory View
The Memory View allows us to compare the memory usage of each GPU over time. This view provides valuable insights into how memory is allocated and utilized by each GPU during training. By analyzing the memory usage across different timeframes for individual GPUs, we can identify potential inefficiencies and determine whether any optimizations are needed to improve memory utilization. Below are the memory views for the two GPUs used in our setup:
Memory View for GPU 0:

Memory View for GPU 1:

These visualizations help us monitor and compare memory usage across GPUs, making it easier to identify imbalances or areas for improvement.
4. Distributed View
The Distributed View provides detailed information about the devices used in the multi-GPU setup. This includes details such as the device name, memory usage, and other relevant metrics for each GPU. As shown in the figure below, this view helps us understand the hardware configuration and resource utilization for each GPU:

Additionally, the Distributed View offers an overview of computation and synchronization across GPUs in a graphical format. This visualization is particularly useful for analyzing how workloads are distributed and synchronized between GPUs, helping us identify potential inefficiencies in the distributed training run.
Conclusion
In this guide, we demonstrated how to profile GPU-accelerated deep learning models using the PyTorch Profiler in a multi-GPU setup. By leveraging multiple GPUs, we showcased how to analyze performance metrics, identify bottlenecks, and optimize resource utilization to improve the efficiency of distributed training workflows.