The Importance of Attention

Language tasks such as question answering and document classification are now built with transformer networks. The attention building block is the central component of the transformer. Recently, transformer-based computer vision models have attained state-of-the-art results, further underscoring the importance of attention. This write-up provides an intuitive understanding of the attention block at the heart of transformers.

I assume the reader has a basic understanding of Neural Net-based spatial and sequence learning (i.e., CNNs and LSTMs).

Prior to transformers (with attention), recurrent nets (LSTMs) were the natural way to construct sequence-learning tasks like language translation. However, the main disadvantage of…
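The computation at the heart of the attention block is scaled dot-product attention, following the standard formulation: each query produces a softmax-weighted sum of the values, with weights given by query-key similarity. A minimal NumPy sketch (the shapes and random inputs are illustrative, not from the write-up):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of the values

# Illustrative shapes: 3 queries, 4 key/value pairs, model dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (3, 8)
```

Because every query attends to every key in one matrix multiply, the whole sequence is processed in parallel, with no step-by-step recurrence.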

I highlight ways to lower the barrier to entry into biomedical text processing and speed up progress in this vitally important area impacting mankind.

Spotlighting the Challenge

The medical literature has grown to the point that PubMed, the search engine and repository for biomedical research articles, adds about 4,000 new papers every day and over a million every year. Supervised Deep Learning methods for mining this avalanche of data have not kept up, primarily because of the paucity of labeled training data in the medical field.

Another data point to underscore the need for rapid medical text understanding is the 30,000 COVID-19…


By now you may have come across the position paper, PyTorch: An Imperative Style, High-Performance Deep Learning Library, presented at the 2019 Conference on Neural Information Processing Systems (NeurIPS). The paper promotes PyTorch as a Deep Learning framework that balances usability with pragmatic performance, sacrificing neither. Read the paper and judge for yourself.

Here, I highlight just one aspect: the ease of creating your own custom Deep Learning layer as part of a neural network (NN) model. Typically, you’ll reuse the existing layers, but often you’ll need to create your own. I show how straightforward that is. Once you have…
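To give a sense of how little boilerplate a custom layer needs: subclass torch.nn.Module, register the parameters, and define forward(); autograd supplies the backward pass. The layer below is a hypothetical example of my own (a per-feature learned scale), not one from the paper:

```python
import torch
import torch.nn as nn

class LearnedScale(nn.Module):
    """Hypothetical custom layer: multiplies each feature by a learned scale."""
    def __init__(self, num_features):
        super().__init__()
        # nn.Parameter registers the tensor with the module, so it shows up
        # in .parameters() and receives gradients automatically.
        self.scale = nn.Parameter(torch.ones(num_features))

    def forward(self, x):
        return x * self.scale  # autograd derives the backward pass for us

layer = LearnedScale(4)
out = layer(torch.ones(2, 4))  # composes like any built-in layer
```

Once defined, the layer drops into an nn.Sequential or a larger model exactly like nn.Linear or nn.ReLU would.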

Wondering how long CUDA operations take in your PyTorch-based training code? For instance, how much time does the feed-forward pass take? Likewise, how much time does the backpropagation pass take? You might think this is as simple as wrapping the call (loss.backward(), in the case of backpropagation) between a recorded start time and a recorded end time. Not so.

The Wrong Way

# Below timing method will NOT work for asynchronous CUDA calls
import time as timer
start = timer.time()
loss.backward()  # returns as soon as the kernels are queued, not when they finish
print("Time taken", timer.time() - start)  # measures only the launch overhead
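A reliable approach is to bracket the call with torch.cuda.Event timestamps, which are recorded on the GPU stream itself, and synchronize before reading the elapsed time. A sketch (assumes a CUDA device is available):

```python
import torch

def time_backward(loss):
    """Time loss.backward() on the GPU using CUDA events (milliseconds)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()                  # timestamp enqueued on the CUDA stream
    loss.backward()
    end.record()
    torch.cuda.synchronize()        # block until all queued GPU work finishes
    return start.elapsed_time(end)  # elapsed GPU time between the two events
```

Because the events are enqueued on the same stream as the kernels, the measurement brackets the GPU work itself rather than just the asynchronous launch.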

NCCL (pronounced “Nickel”) is a library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, and reduce-scatter. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink and other connectivity links.
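Conceptually, all-reduce leaves every participant holding the same reduction of everyone's buffers. A toy NumPy illustration of all-reduce with a sum (NCCL performs this in place over the interconnect, with no central node; this sketch only shows the semantics):

```python
import numpy as np

def all_reduce_sum(buffers):
    """Toy all-reduce: every rank ends with the elementwise sum of all buffers."""
    total = np.sum(buffers, axis=0)          # reduce across ranks
    return [total.copy() for _ in buffers]   # every rank gets an identical copy

# Two "GPUs", each holding its own gradient buffer
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
reduced = all_reduce_sum(grads)  # both ranks now hold [4.0, 6.0]
```

This is exactly the pattern data-parallel training relies on: each GPU computes gradients on its shard of the batch, then all-reduce gives every GPU the summed gradients before the weight update.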

If you fired up a GPU-powered EC2 instance such as p3.8xlarge and looked at the topology with nvidia-smi, you’d see its GPU-to-GPU NVLink configuration.

Now we’ll develop a data-pump app that continuously pumps data (albeit, the same value) into two GPUs and uses the TensorFlow API, tensorflow.contrib.nccl.all_sum(), to perform the all-reduce operation on the two placeholders, one located on each GPU. …

PuDB is a lightweight, keyboard-friendly visual debugger for Python without the encumbrances of a full-blown Integrated Development Environment (IDE). pdb is its better-known, no-frills cousin. I easily get lost in pdb’s line-oriented view of the code. My go-to Python debugger is PuDB.

Recently, I wanted to debug a multiprocessing Reinforcement Learning algorithm called Asynchronous Advantage Actor-Critic (A3C). Because it asynchronously trains agents, it can launch tens of processes, and each process (agent) contributes to the globally shared model.

How would you debug such a massively parallel algorithm?

Jests aside, understanding TensorFlow’s relationship with GPU memory is rewarding when designing for real-world data and for models spanning multiple GPUs. Here are some considerations.

Greedy By Design

To explore, start by running a three-layer MNIST classifier on a multi-GPU system (in my examples, a three-GPU system). You’ll notice in nvidia-smi that TensorFlow has allocated to itself the entire memory of all three available GPUs (34.5 GB!). The model has just 502,410 trainable parameters. Even with memory allocated for the gradients, the intermediate activations (feature maps), and scratch space, that still doesn’t justify grabbing the other GPUs’ memory. To be fair, the TensorFlow…
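In the TF 1.x API of the time, the greedy default could be reined in with session configuration. A sketch (a config fragment, not a complete program):

```python
import tensorflow as tf

# Restrict TensorFlow to a single visible GPU and grow allocations on
# demand instead of grabbing all device memory up front (TF 1.x-era API).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True        # allocate as needed
config.gpu_options.visible_device_list = "0"  # expose only GPU 0 to this process
# sess = tf.Session(config=config)
```

With allow_growth set, nvidia-smi shows memory usage tracking what the model actually needs, leaving the other GPUs free for other processes.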

Machine Learning (ML) will become pervasive across all data-center services. Here, you’ll learn about ML data-center workloads at Facebook and Google. My findings come from their own publications [1, 2], and I’m grateful to them for sharing their insights. Workloads come in two flavors: training (model building) and inference (model use).

ML Inference Workloads in the Data Center

The Google paper covers model inference from a Neural Network (NN) angle, i.e., the kinds of NN compute that are offloaded to hardware accelerators to handle the explosion in demand for NN-based applications. …

Auro Tripathy

Machine Learning Modeler
