A BERT Flavor to Sample and Savor

Auro Tripathy
2 min read · Jan 10, 2022

Thankfully, we can derive a variety of models from the BERT architecture to fit our memory and latency needs. It turns out that model capacity (the number of parameters) is determined by three variables: the number of layers, the hidden embedding size, and the number of attention heads.

This post puts the spotlight on these three parameters.
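To make these knobs concrete, here is a minimal sketch (assuming the Hugging Face transformers library listed in the references, with its BertConfig/BertModel API) that builds the Base and Large flavors from just those settings and counts their parameters:

```python
from transformers import BertConfig, BertModel

# The three capacity knobs, in Hugging Face naming:
#   num_hidden_layers   -> number of encoder layers (L)
#   hidden_size         -> hidden embedding size (H)
#   num_attention_heads -> number of attention heads (A)
base = BertConfig(num_hidden_layers=12, hidden_size=768,
                  num_attention_heads=12, intermediate_size=3072)
large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                   num_attention_heads=16, intermediate_size=4096)

for name, config in [("BERT Base", base), ("BERT Large", large)]:
    model = BertModel(config)  # randomly initialized, just for counting
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```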

Number of Encoder Layers

The BERT model is made of several structurally identical transformer encoder layers stacked on top of each other; the output of layer N-1 is the input to layer N. BERT Base has 12 such layers and BERT Large has 24.
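As an illustration (again assuming Hugging Face's BertModel, where the stack happens to be exposed as model.encoder.layer), the depth is just a config setting:

```python
from transformers import BertConfig, BertModel

config = BertConfig(num_hidden_layers=12)   # 12 for BERT Base, 24 for BERT Large
model = BertModel(config)

# The encoder is literally a stack; each layer consumes the previous layer's output.
print(len(model.encoder.layer))             # -> 12
```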

Hidden Embedding Size

The hidden embedding size is the dimensionality of each layer's output (768 for BERT Base and 1024 for BERT Large).
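A quick way to see this dimension is to run a toy batch through the model and inspect the output shape; the sketch below assumes the same Hugging Face/PyTorch setup:

```python
import torch
from transformers import BertConfig, BertModel

config = BertConfig(hidden_size=768)        # 768 for BERT Base, 1024 for BERT Large
model = BertModel(config)

# A toy batch: batch_size=1, sequence_length=8, arbitrary token ids.
input_ids = torch.randint(0, config.vocab_size, (1, 8))
outputs = model(input_ids)

# Every layer's output (and hence the final hidden states) carries the
# hidden embedding size as its last dimension.
print(outputs.last_hidden_state.shape)      # -> torch.Size([1, 8, 768])
```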

Number of Attention Heads

Just as convolutional neural nets apply several filters to produce several different feature maps, multi-head attention serves a similar purpose. The keys, queries, and values (K, Q, V) are projected into different spaces multiple times (A times) with different, learned linear projections. BERT Base uses A=12 attention heads and BERT Large uses A=16 (not to be confused with the L=12 encoder layers of BERT Base).
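The sketch below (plain PyTorch, using the common formulation in which one learned projection is reshaped into A subspaces) shows how each head ends up working with hidden_size / A dimensions:

```python
import torch
import torch.nn as nn

hidden_size, num_heads, seq_len = 768, 12, 8   # BERT Base values (1024 and 16 for Large)
head_dim = hidden_size // num_heads            # 64 dimensions per head

# One learned projection for the queries (K and V follow the same pattern);
# reshaping it into num_heads chunks is equivalent to projecting A separate times.
q_proj = nn.Linear(hidden_size, hidden_size)
x = torch.randn(1, seq_len, hidden_size)                            # (batch, seq, hidden)
q = q_proj(x).view(1, seq_len, num_heads, head_dim).transpose(1, 2) # (batch, heads, seq, head_dim)
print(q.shape)  # -> torch.Size([1, 12, 8, 64])
```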

Of Note

The internal feed-forward layer is always four times the hidden embedding size (so it's an implicit variable, not an independent one).
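In the Hugging Face config this shows up as intermediate_size. A small sketch of the 4x convention (the library does not derive it automatically, so non-default flavors set it explicitly):

```python
from transformers import BertConfig

# BERT Base defaults: hidden_size=768, intermediate_size=3072 (= 4 * 768).
base = BertConfig()
print(base.intermediate_size == 4 * base.hidden_size)   # -> True

# Other flavors follow the same 4x convention, but it must be set explicitly,
# e.g. BERT Large: hidden_size=1024 -> intermediate_size=4096.
large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                   num_attention_heads=16, intermediate_size=4 * 1024)
```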

Summarizing

The three parameters above (number of encoder layers, hidden embedding size, and number of attention heads) are the knobs that set a BERT flavor's capacity. Student models shrink one or more of them; they are compact and meant for resource-constrained environments (think mobile and edge).
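As a rough sketch of what such a compact flavor looks like (the sizes below loosely follow the Mini model from the Well-Read Students reference and are meant only as an example):

```python
from transformers import BertConfig, BertModel

# A compact "student" flavor: all three knobs (and the implicit feed-forward
# size) shrink. Sizes are illustrative, not prescriptive.
student_config = BertConfig(num_hidden_layers=4, hidden_size=256,
                            num_attention_heads=4, intermediate_size=4 * 256)
student = BertModel(student_config)
print(f"{sum(p.numel() for p in student.parameters()) / 1e6:.1f}M parameters")
```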

References

Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

Hugging Face Pretrained Models

TinyBERT: Distilling BERT for Natural Language Understanding

Understanding searches better than ever before

Nice BERT Diagram (Figure 1)
