Transformer layers

class dynn.layers.transformer_layers.CondTransformer(pc, input_dim, hidden_dim, cond_dim, n_heads, activation=<function relu>, dropout=0.0)

Bases: dynn.layers.base_layers.ParametrizedLayer

Conditional transformer layer.

As described in Vaswani et al. (2017), this is the “decoder” side of the transformer, i.e. self attention + attention to the encoder context.

Parameters:
  • pc (dynet.ParameterCollection) – Parameter collection to hold the parameters
  • input_dim (int) – Hidden dimension (used everywhere)
  • hidden_dim (int) – Dimension of the hidden layer of the MLP
  • cond_dim (int) – Conditional dimension (dimension of the “encoder” side, used for attention)
  • n_heads (int) – Number of heads for attention.
  • activation (function, optional) – MLP activation (defaults to relu).
  • dropout (float, optional) – Dropout rate (defaults to 0)
__call__(x, c, lengths=None, left_aligned=True, mask=None, triu=False, lengths_c=None, left_aligned_c=True, mask_c=None, return_att=False)

Run the transformer layer.

The input is expected to have dimensions d x L where L is the length dimension.

Parameters:
  • x (dynet.Expression) – Input (dimensions input_dim x L)
  • c (dynet.Expression) – Context (dimensions cond_dim x l)
  • lengths (list, optional) – Defaults to None. List of lengths for masking (used for self attention)
  • left_aligned (bool, optional) – Defaults to True. Used for masking in self attention.
  • mask (dynet.Expression, optional) – Defaults to None. As an alternative to lengths, you can pass a mask expression directly (useful to reuse masks across layers).
  • triu (bool, optional) – Upper triangular self attention. Mask such that each position can only attend to the previous positions.
  • lengths_c (list, optional) – Defaults to None. List of lengths for masking (used for conditional attention)
  • left_aligned_c (bool, optional) – Defaults to True. Used for masking in conditional attention.
  • mask_c (dynet.Expression, optional) – Defaults to None. As an alternative to lengths_c, you can pass a mask expression directly (useful to reuse masks across layers).
  • return_att (bool, optional) – Defaults to False. Return the self and conditional attention weights
Returns:

The output expression (+ the attention weights if return_att is True)

Return type:

tuple or dynet.Expression
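
A minimal usage sketch (not taken from the library’s docs): the dimensions, the dummy inputs and the init(test=...) call are assumptions following the usual dynn/DyNet pattern for a ParametrizedLayer.

    import numpy as np
    import dynet as dy
    from dynn.layers.transformer_layers import CondTransformer

    pc = dy.ParameterCollection()
    # input_dim=64, hidden_dim=128 (MLP), cond_dim=96, 4 attention heads
    decoder = CondTransformer(pc, 64, 128, 96, 4, dropout=0.1)

    dy.renew_cg()
    decoder.init(test=True)  # assumed ParametrizedLayer initialization step
    x = dy.inputTensor(np.random.randn(64, 10))  # decoder input: input_dim x L
    c = dy.inputTensor(np.random.randn(96, 12))  # encoder context: cond_dim x l
    # triu=True masks self attention so each position only sees previous positions
    h = decoder(x, c, triu=True)                 # output: input_dim x L

With return_att=True the result is a tuple containing the output and the attention weights instead of a single expression.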

__init__(pc, input_dim, hidden_dim, cond_dim, n_heads, activation=<function relu>, dropout=0.0)

Creates a subcollection for this layer with a custom name

step(state, x, c, lengths=None, left_aligned=True, mask=None, triu=False, lengths_c=None, left_aligned_c=True, mask_c=None, return_att=False)

Runs the transformer for one step. Useful for decoding.

The “state” of the transformer is the list of the L-1 previous inputs, and its output is the L-th output. This returns a tuple of the new state (the L-1 previous inputs with the L-th input concatenated) and the L-th output.

Parameters:
  • x (dynet.Expression) – Input (dimension input_dim)
  • state (dynet.Expression, optional) – Previous “state” (dimensions input_dim x (L-1))
  • c (dynet.Expression) – Context (dimensions cond_dim x l)
  • lengths_c (list, optional) – Defaults to None. List of lengths for masking (used for conditional attention)
  • left_aligned_c (bool, optional) – Defaults to True. Used for masking in conditional attention.
  • mask_c (dynet.Expression, optional) – Defaults to None. As an alternative to lengths_c, you can pass a mask expression directly (useful to reuse masks across layers).
  • return_att (bool, optional) – Defaults to False. Return the self and conditional attention weights
Returns:

The new state and the output at the current position (+ the attention weights if return_att is True)

Return type:

tuple
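
A sketch of step-by-step decoding with this layer. The representation of the empty state at the very first step is not specified above, so passing None is a guess; the dimensions and dummy data are also assumptions.

    import numpy as np
    import dynet as dy
    from dynn.layers.transformer_layers import CondTransformer

    pc = dy.ParameterCollection()
    decoder = CondTransformer(pc, 64, 128, 96, 4)

    dy.renew_cg()
    decoder.init(test=True)  # assumed ParametrizedLayer initialization step
    c = dy.inputTensor(np.random.randn(96, 12))  # encoder context: cond_dim x l
    state = None  # assumed representation of the empty state at the first step
    for t in range(5):
        x_t = dy.inputTensor(np.random.randn(64))  # one input_dim vector
        # step returns (new state, output at the current position)
        state, h_t = decoder.step(state, x_t, c)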

class dynn.layers.transformer_layers.StackedCondTransformers(pc, n_layers, input_dim, hidden_dim, cond_dim, n_heads, activation=<function relu>, dropout=0.0)

Bases: dynn.layers.combination_layers.Sequential

Multilayer transformer.

Parameters:
  • pc (dynet.ParameterCollection) – Parameter collection to hold the parameters
  • n_layers (int) – Number of layers
  • input_dim (int) – Hidden dimension (used everywhere)
  • hidden_dim (int) – Dimension of the hidden layer of the MLP
  • cond_dim (int) – Conditional dimension (dimension of the “encoder” side, used for attention)
  • n_heads (int) – Number of heads for self attention.
  • activation (function, optional) – MLP activation (defaults to relu).
  • dropout (float, optional) – Dropout rate (defaults to 0)
__call__(x, c, lengths=None, left_aligned=True, mask=None, triu=False, lengths_c=None, left_aligned_c=True, mask_c=None, return_att=False, return_last_only=True)

Run the multilayer transformer.

The input is expected to have dimensions d x L where L is the length dimension.

Parameters:
  • x (dynet.Expression) – Input (dimensions input_dim x L)
  • c (list) – list of contexts (one per layer, each of dim cond_dim x L). If this is not a list (but an expression), the same context will be used for each layer.
  • lengths (list, optional) – Defaults to None. List of lengths for masking (used for self attention)
  • left_aligned (bool, optional) – Defaults to True. Used for masking in self attention.
  • mask (dynet.Expression, optional) – Defaults to None. As an alternative to lengths, you can pass a mask expression directly (useful to reuse masks across layers).
  • triu (bool, optional) – Upper triangular self attention. Mask such that each position can only attend to the previous positions.
  • lengths_c (list, optional) – Defaults to None. List of lengths for masking (used for conditional attention)
  • left_aligned_c (bool, optional) – Defaults to True. Used for masking in conditional attention.
  • mask_c (dynet.Expression, optional) – Defaults to None. As an alternative to lengths_c, you can pass a mask expression directly (useful to reuse masks across layers).
  • return_att (bool, optional) – Defaults to False. Return the self and conditional attention weights
  • return_last_only (bool, optional) – Return only the output of the last layer (as opposed to the output of all layers).
Returns:

The output expression (+ the attention weights if return_att is True)

Return type:

tuple or dynet.Expression
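
A sketch of running the full stack over a sequence. Passing a single expression as c (shared by all layers) follows the parameter description above; the dimensions, dummy data and init(test=...) call are assumptions.

    import numpy as np
    import dynet as dy
    from dynn.layers.transformer_layers import StackedCondTransformers

    pc = dy.ParameterCollection()
    # 3 layers, input_dim=64, hidden_dim=128, cond_dim=96, 4 heads
    stack = StackedCondTransformers(pc, 3, 64, 128, 96, 4, dropout=0.1)

    dy.renew_cg()
    stack.init(test=True)  # assumed ParametrizedLayer initialization step
    x = dy.inputTensor(np.random.randn(64, 10))  # input_dim x L
    c = dy.inputTensor(np.random.randn(96, 12))  # same context for all 3 layers
    h = stack(x, c, triu=True)  # last layer's output (return_last_only=True)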

__init__(pc, n_layers, input_dim, hidden_dim, cond_dim, n_heads, activation=<function relu>, dropout=0.0)

Initialize self. See help(type(self)) for accurate signature.

step(state, x, c, lengths=None, left_aligned=True, mask=None, triu=False, lengths_c=None, left_aligned_c=True, mask_c=None, return_att=False, return_last_only=True)

Runs the transformer for one step. Useful for decoding.

The “state” of the multilayer transformer is a list of n_layers inputs of length L-1, and its output is the output of the last layer. This returns a tuple of the new state (a list of n_layers inputs of length L) and the L-th output.

Parameters:
  • x (dynet.Expression) – Input (dimension input_dim)
  • state (list) – Previous “state” (list of n_layers expressions of dimensions input_dim x (L-1))
  • c (dynet.Expression) – Context (dimensions cond_dim x l)
  • lengths_c (list, optional) – Defaults to None. List of lengths for masking (used for conditional attention)
  • left_aligned_c (bool, optional) – Defaults to True. Used for masking in conditional attention.
  • mask_c (dynet.Expression, optional) – Defaults to None. As an alternative to lengths_c, you can pass a mask expression directly (useful to reuse masks across layers).
  • return_att (bool, optional) – Defaults to False. Return the self and conditional attention weights
Returns:

The new state and the last layer’s output (+ the attention weights if return_att is True)

Return type:

tuple
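
A step-by-step decoding sketch for the stack. How the empty state should be represented at the first step is not documented above, so the [None] * n_layers initialization below is a guess; the rest mirrors the single-layer example.

    import numpy as np
    import dynet as dy
    from dynn.layers.transformer_layers import StackedCondTransformers

    pc = dy.ParameterCollection()
    stack = StackedCondTransformers(pc, 3, 64, 128, 96, 4)

    dy.renew_cg()
    stack.init(test=True)  # assumed ParametrizedLayer initialization step
    c = dy.inputTensor(np.random.randn(96, 12))  # encoder context: cond_dim x l
    state = [None] * 3  # guessed empty per-layer states at the first step
    for t in range(5):
        x_t = dy.inputTensor(np.random.randn(64))  # one input_dim vector
        # step returns (new list of per-layer states, last layer's output)
        state, h_t = stack.step(state, x_t, c)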

class dynn.layers.transformer_layers.StackedTransformers(pc, n_layers, input_dim, hidden_dim, n_heads, activation=<function relu>, dropout=0.0)

Bases: dynn.layers.combination_layers.Sequential

Multilayer transformer.

Parameters:
  • pc (dynet.ParameterCollection) – Parameter collection to hold the parameters
  • n_layers (int) – Number of layers
  • input_dim (int) – Hidden dimension (used everywhere)
  • hidden_dim (int) – Dimension of the hidden layer of the MLP
  • n_heads (int) – Number of heads for self attention.
  • activation (function, optional) – MLP activation (defaults to relu).
  • dropout (float, optional) – Dropout rate (defaults to 0)
__call__(x, lengths=None, left_aligned=True, triu=False, mask=None, return_att=False, return_last_only=True)

Run the multilayer transformer.

The input is expected to have dimensions d x L where L is the length dimension.

Parameters:
  • x (dynet.Expression) – Input (dimensions input_dim x L)
  • lengths (list, optional) – Defaults to None. List of lengths for masking (used for attention)
  • left_aligned (bool, optional) – Defaults to True. Used for masking
  • triu (bool, optional) – Upper triangular self attention. Mask such that each position can only attend to the previous positions.
  • mask (dynet.Expression, optional) – Defaults to None. As an alternative to lengths, you can pass a mask expression directly (useful to reuse masks across layers)
  • return_att (bool, optional) – Defaults to False. Return the self attention weights
  • return_last_only (bool, optional) – Return only the output of the last layer (as opposed to the output of all layers).
Returns:

The output expression (+ the attention weights if return_att is True)

Return type:

tuple or dynet.Expression
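
A sketch of encoding a single sequence with a 3-layer stack, under the same assumed conventions as the examples above (dummy data, the init(test=...) call).

    import numpy as np
    import dynet as dy
    from dynn.layers.transformer_layers import StackedTransformers

    pc = dy.ParameterCollection()
    # 3 layers, input_dim=64, hidden_dim=128, 4 heads
    encoder = StackedTransformers(pc, 3, 64, 128, 4, dropout=0.1)

    dy.renew_cg()
    encoder.init(test=True)  # assumed ParametrizedLayer initialization step
    x = dy.inputTensor(np.random.randn(64, 10))  # input_dim x L
    h = encoder(x)  # last layer's output, input_dim x L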

__init__(pc, n_layers, input_dim, hidden_dim, n_heads, activation=<function relu>, dropout=0.0)

Initialize self. See help(type(self)) for accurate signature.

class dynn.layers.transformer_layers.Transformer(pc, input_dim, hidden_dim, n_heads, activation=<function relu>, dropout=0.0)

Bases: dynn.layers.base_layers.ParametrizedLayer

Transformer layer.

As described in Vaswani et al. (2017), this is the “encoder” side of the transformer, i.e. self attention only.

Parameters:
  • pc (dynet.ParameterCollection) – Parameter collection to hold the parameters
  • input_dim (int) – Hidden dimension (used everywhere)
  • hidden_dim (int) – Dimension of the hidden layer of the MLP
  • n_heads (int) – Number of heads for self attention.
  • activation (function, optional) – MLP activation (defaults to relu).
  • dropout (float, optional) – Dropout rate (defaults to 0)
__call__(x, lengths=None, left_aligned=True, triu=False, mask=None, return_att=False)

Run the transformer layer.

The input is expected to have dimensions d x L where L is the length dimension.

Parameters:
  • x (dynet.Expression) – Input (dimensions input_dim x L)
  • lengths (list, optional) – Defaults to None. List of lengths for masking (used for attention)
  • left_aligned (bool, optional) – Defaults to True. Used for masking
  • triu (bool, optional) – Upper triangular self attention. Mask such that each position can only attend to the previous positions.
  • mask (dynet.Expression, optional) – Defaults to None. As an alternative to lengths, you can pass a mask expression directly (useful to reuse masks across layers)
  • return_att (bool, optional) – Defaults to False. Return the self attention weights
Returns:

The output expression (+ the attention weights if return_att is True)

Return type:

tuple or dynet.Expression
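
A single-layer self-attention sketch, with the same assumptions as the stacked example above (dummy data, the init(test=...) call); triu=True gives the causal masking described above.

    import numpy as np
    import dynet as dy
    from dynn.layers.transformer_layers import Transformer

    pc = dy.ParameterCollection()
    # input_dim=64, hidden_dim=128, 4 heads
    encoder = Transformer(pc, 64, 128, 4, dropout=0.1)

    dy.renew_cg()
    encoder.init(test=True)  # assumed ParametrizedLayer initialization step
    x = dy.inputTensor(np.random.randn(64, 10))  # input_dim x L
    h = encoder(x, triu=True)  # causal self attention, output input_dim x L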

__init__(pc, input_dim, hidden_dim, n_heads, activation=<function relu>, dropout=0.0)

Creates a subcollection for this layer with a custom name