Thankfully, we can derive a variety of models from the BERT architecture to fit our memory and latency needs. It turns out that model capacity (the number of parameters) is determined by three variables: the number of layers, the hidden embedding size, and the number of attention heads. This post puts the…
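
To make this concrete, here is a minimal sketch, assuming the Hugging Face `transformers` library, that builds BERT variants with different values for these three variables and compares their parameter counts. The base and large shapes match the canonical BERT configurations; the "small" variant is a hypothetical compact configuration for illustration.

```python
from transformers import BertConfig, BertModel

# (num_hidden_layers, hidden_size, num_attention_heads)
variants = {
    "bert-base-like": (12, 768, 12),    # canonical BERT-base shape
    "bert-large-like": (24, 1024, 16),  # canonical BERT-large shape
    "bert-small-like": (4, 512, 8),     # hypothetical compact variant
}

for name, (layers, hidden, heads) in variants.items():
    config = BertConfig(
        num_hidden_layers=layers,
        hidden_size=hidden,
        num_attention_heads=heads,
        intermediate_size=4 * hidden,  # BERT convention: FFN is 4x hidden size
    )
    model = BertModel(config)  # randomly initialized, just for counting
    print(f"{name}: {model.num_parameters():,} parameters")
```

Note that `hidden_size` must be divisible by `num_attention_heads`, since each head operates on an equal slice of the embedding dimension.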