Measuring NVLink Traffic when using TensorFlow’s NCCL Library

Auro Tripathy
2 min read · Mar 4, 2019

NCCL (pronounced “Nickel”) is a library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, and reduce-scatter. It is optimized to achieve high bandwidth over PCIe, NVLink, and other interconnects.

If you fired up a GPU-powered EC2 instance such as p3.8xlarge and looked at the topology with nvidia-smi, you’d see the GPU-to-GPU NVLink configuration below.
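One way to print that topology is nvidia-smi’s topo subcommand:

nvidia-smi topo -m    # connectivity matrix; NV# cells mark GPU pairs joined by # bonded NVLink links

On a p3.8xlarge, the matrix shows NV# entries between the V100 GPUs, indicating direct NVLink connections.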

Now we’ll develop a data-pump app that continuously pumps data (albeit the same values) into two GPUs and uses the TensorFlow API tensorflow.contrib.nccl.all_sum() to perform an all-reduce on the two placeholders, one on each GPU. We chose this setup because it roughly mirrors how gradients are aggregated during data-parallel training.

A quick primer on all_sum() is shown below. Note how it reduces (sums, in this case) the values on the two GPUs and distributes the result to both GPUs.
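As a concrete illustration, using the same values the script below feeds in:

g0 on GPU 0: [[1, 1], [1, 1]]
g1 on GPU 1: [[2, 2], [2, 2]]

all_sum([g0, g1]) leaves the element-wise sum [[3, 3], [3, 3]] on both GPU 0 and GPU 1.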

The simplest data-pump example can be written up as follows:

import tensorflow as tf
from itertools import repeat
from tensorflow.contrib.nccl import all_sum

# Place one 2x2 placeholder on each GPU.
with tf.device('/gpu:0'):
    g0 = tf.placeholder(tf.float32, (2, 2), "g0")
with tf.device('/gpu:1'):
    g1 = tf.placeholder(tf.float32, (2, 2), "g1")

# NCCL all-reduce (sum) across the two GPUs.
all_reduce_sum = all_sum([g0, g1])

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                        allow_soft_placement=False))
init = tf.global_variables_initializer()
sess.run(init)

# Pump the same two tensors across NVLink indefinitely.
r = [[1, 1], [1, 1]], [[2, 2], [2, 2]]
for x, y in repeat(r):
    sess.run(all_reduce_sum, feed_dict={g0: x, g1: y})

Now set up the NVLink utilization counters with nvidia-smi so you can measure the data moving across the links.
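A sketch of the counter setup, assuming the nvlink counter flags available on driver versions of that era (the exact counter options have since been deprecated in favor of the throughput options):

nvidia-smi nvlink --status    # confirm the links are up and see their line rate
nvidia-smi nvlink -sc 0bz     # set utilization counter 0 to count bytes and zero it (0 = counter, b = bytes, z = reset)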

Run the data-pump script.
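Assuming the listing above is saved as data_pump.py (a hypothetical filename), launch it and leave it running, since the feed loop never terminates:

python data_pump.py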

The last step is to measure the bytes flowing (received and transmitted) across the link.
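To watch the counters while the script is running, a sketch assuming the same era’s counter flags (newer drivers replace them with nvidia-smi nvlink -gt d, which reports data-payload throughput):

watch -n 1 nvidia-smi nvlink -g 0 -i 0    # poll counter 0 on GPU 0; the rx/tx byte counts should climb steadily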
