Chapter 03.04: Transformer Parameter Count

This chapter deals with the number of parameters of a transformer. The parameter count of a transformer model is the total number of learnable parameters in its architecture, distributed across several components (see the counting sketch after the list):

  1. Embedding Layers: Parameters of the input and output token embeddings, which map tokens to dense vectors. The input and output embedding matrices are often tied (shared), which roughly halves this part of the count.
  2. Encoder Layers: Parameters within each encoder layer, namely the self-attention projections, the position-wise feed-forward network, and the layer normalizations.
  3. Decoder Layers: Parameters within each decoder layer, namely the self-attention projections, the cross-attention projections over the encoder output, the position-wise feed-forward network, and the layer normalizations.
  4. Positional Encodings: Parameters used to encode token positions. Only learned positional embeddings contribute parameters; fixed sinusoidal encodings add none.
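
Below is a minimal counting sketch in Python for a vanilla encoder-decoder Transformer. It assumes learned positional embeddings, bias terms on every linear layer, and two (encoder) or three (decoder) layer normalizations per layer; names such as d_model, d_ff, vocab_size, and max_len are illustrative hyperparameters rather than fixed choices from this chapter.

```python
# A rough parameter-count sketch for a vanilla encoder-decoder Transformer.
# Assumptions: learned positional embeddings, biases on every linear layer,
# and LayerNorm gain/bias vectors; hyperparameter names are illustrative.

def attention_params(d_model: int) -> int:
    # Q, K, V, and output projections: four d_model x d_model matrices plus biases.
    return 4 * (d_model * d_model + d_model)

def ffn_params(d_model: int, d_ff: int) -> int:
    # Two linear layers (d_model -> d_ff -> d_model), each with a bias.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def layer_norm_params(d_model: int) -> int:
    # One gain vector and one bias vector per LayerNorm.
    return 2 * d_model

def encoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + feed-forward + two LayerNorms.
    return attention_params(d_model) + ffn_params(d_model, d_ff) + 2 * layer_norm_params(d_model)

def decoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + cross-attention + feed-forward + three LayerNorms.
    return 2 * attention_params(d_model) + ffn_params(d_model, d_ff) + 3 * layer_norm_params(d_model)

def transformer_params(vocab_size: int, max_len: int, d_model: int, d_ff: int,
                       n_enc: int, n_dec: int, tie_embeddings: bool = True) -> int:
    embeddings = vocab_size * d_model              # input token embedding
    if not tie_embeddings:
        embeddings += vocab_size * d_model         # separate output projection
    positions = max_len * d_model                  # learned positional embeddings
    encoder = n_enc * encoder_layer_params(d_model, d_ff)
    decoder = n_dec * decoder_layer_params(d_model, d_ff)
    return embeddings + positions + encoder + decoder

# Base-sized hyperparameters (d_model=512, d_ff=2048, 6+6 layers, 37k vocab)
# give 63,344,640 under these assumptions, i.e. roughly 63M parameters.
print(transformer_params(vocab_size=37000, max_len=512, d_model=512,
                         d_ff=2048, n_enc=6, n_dec=6))
```

In practice, exact counts differ slightly between implementations (bias usage, pre- versus post-norm, weight tying), so formulas like these are estimates rather than definitive numbers.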

As the sketch above illustrates, the total parameter count of a transformer model is the sum of the parameters from all these components; the exact number depends on the specific architecture and on the chosen hyperparameters, such as the model dimension, feed-forward width, number of layers, and vocabulary size.
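
As a cross-check, the count for a concrete model can also be read off by summing the sizes of its parameter tensors. The snippet below sketches this with PyTorch's nn.Transformer; note that nn.Transformer bundles only the encoder and decoder stacks (plus a final layer normalization on each), so token embeddings and positional parameters would have to be counted separately.

```python
import torch.nn as nn

# Encoder/decoder stacks only: embeddings and positional parameters are not
# part of nn.Transformer and would need to be counted separately.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048)

# The total parameter count is the sum over all learnable tensors.
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total:,} trainable parameters")
```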

Lecture Slides

Additional Resources