dynn.data.batching package

Batching procedures

Iterators implementing common batching strategies.

class dynn.data.batching.NumpyBatches(data, targets, batch_size=32, shuffle=True)

Bases: object

Wraps a list of numpy arrays and a list of targets as a batch iterator.

You can then iterate over this object and get tuples of batch_data, batch_targets ready for use in your computation graph.

Example for classification:

import numpy as np
from dynn.data.batching import NumpyBatches

# 1000 10-dimensional inputs
data = np.random.uniform(size=(1000, 10))
# Class labels
labels = np.random.randint(10, size=1000)
# Iterator
batched_dataset = NumpyBatches(data, labels, batch_size=20)
# Training loop
for x, y in batched_dataset:
    # x has shape (10, 20) while y has shape (20,)
    # Do something with x and y

Example for multidimensional regression:

# 1000 10-dimensional inputs
data = np.random.uniform(size=(1000, 10))
# 5-dimensional outputs
labels = np.random.uniform(size=(1000, 5))
# Iterator
batched_dataset = NumpyBatches(data, labels, batch_size=20)
# Training loop
for x, y in batched_dataset:
    # x has shape (10, 20) while y has shape (5, 20)
    # Do something with x and y
Parameters:
  • data (list) – List of numpy arrays containing the data
  • targets (list) – List of targets
  • batch_size (int, optional) – Batch size (default: 32)
  • shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
__getitem__(index)

Returns the index-th sample

This returns something different every time the data is shuffled.

If index is a list or a slice this will return a batch.

The result is a tuple batch_data, batch_target where each of those is a numpy array in Fortran layout (for more efficient input in dynet). The batch size is always the last dimension.

Parameters:index (int, slice) – Index or slice
Returns:batch_data, batch_target
Return type:tuple
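
For instance, a minimal indexing sketch, reusing batched_dataset from the classification example above (exact values depend on your data):

x, y = batched_dataset[0:20]        # batch of the first 20 samples
print(x.shape)                      # (10, 20): batch size is last
print(x.flags["F_CONTIGUOUS"])      # True: Fortran layout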
__init__(data, targets, batch_size=32, shuffle=True)

Initialize self. See help(type(self)) for accurate signature.

__len__()

This returns the number of batches in the dataset (not the total number of samples)

Returns:Number of batches in the dataset, i.e. ceil(len(data)/batch_size)
Return type:int
__weakref__

list of weak references to the object (if defined)

just_passed_multiple(batch_number)

Checks whether the current number of batches processed has just passed a multiple of batch_number.

For example, you can use this to report at regular intervals (e.g. every 10 batches)

Parameters:batch_number (int) – Multiple to check against
Returns:True if the number of batches processed so far has just passed a multiple of batch_number, False otherwise
Return type:bool
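
For example, a minimal reporting sketch (assuming a configured batched_dataset, and that percentage_done() below returns a number):

for x, y in batched_dataset:
    # ... forward, backward, update ...
    if batched_dataset.just_passed_multiple(10):
        print(f"{batched_dataset.percentage_done():.1f}% of epoch done")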
percentage_done()

What percent of the data has been covered in the current epoch

reset()

Reset the iterator and shuffle the dataset if applicable

class dynn.data.batching.SequenceBatch(sequences, original_idxs=None, pad_idx=None, left_aligned=True)

Bases: object

Batched sequence object with padding

This wraps a list of integer sequences into a nice array padded to the longest sequence. The batch dimension (number of sequences) is the last dimension.

By default the sequences are padded to the right, which means that they are aligned to the left (they all start at index 0).

Parameters:
  • sequences (list) – List of list of integers
  • original_idxs (list) – This list should point to the original position of each sequence in the data (before shuffling/reordering). This is useful when you want to access information that has been discarded during preprocessing (e.g. the original sentence before numberizing and <unk>-ing in MT).
  • pad_idx (int) – Default index for padding
  • left_aligned (bool, optional) – Align to the left (all sequences start at the same position).
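
A minimal construction sketch (assuming the padded array is exposed as the sequences attribute that get_mask() below refers to):

from dynn.data.batching import SequenceBatch

batch = SequenceBatch([[1, 2, 3], [4, 5]], pad_idx=0)
# The padded array has shape (max_len, batch_size) = (3, 2);
# the shorter sequence is padded on the right with pad_idx:
#   [[1 4]
#    [2 5]
#    [3 0]]
print(batch.sequences)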
__init__(sequences, original_idxs=None, pad_idx=None, left_aligned=True)

Initialize self. See help(type(self)) for accurate signature.

__weakref__

list of weak references to the object (if defined)

collate(sequences)

Pad and concatenate sequences to an array

Parameters:sequences (list) – List of lists of integers (padding uses this batch's pad_idx)

Returns:max_len x batch_size array
Return type:np.ndarray
get_mask(base_val=1, mask_val=0)

Return a mask expression with specific values for padding tokens.

This will return an expression of the same shape as self.sequences where the i-th element of batch b is base_val if i <= lengths[b] (and mask_val otherwise).

For example, if the max length is 4 and lengths is [1,2,4] then the returned mask will be:

1 0 0 0
1 1 0 0
1 1 1 1

(here each row is a batch element)

Parameters:
  • base_val (int, optional) – Value of the mask for non-masked indices (typically 1 for multiplicative masks and 0 for additive masks). Defaults to 1.
  • mask_val (int, optional) – Value of the mask for masked indices (typically 0 for multiplicative masks and -inf for additive masks). Defaults to 0.
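
A minimal sketch of the two conventions described above (the -inf additive variant is the usual pre-softmax attention idiom, not something this class prescribes):

from dynn.data.batching import SequenceBatch

batch = SequenceBatch([[1, 2, 3], [4, 5]], pad_idx=0)
# Multiplicative convention: 1 at real tokens, 0 at padding
# (multiply into expressions whose padded positions should vanish)
mul_mask = batch.get_mask()
# Additive convention: 0 at real tokens, -inf at padding
# (add to attention scores before a softmax)
add_mask = batch.get_mask(base_val=0, mask_val=float("-inf"))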
class dynn.data.batching.PaddedSequenceBatches(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)

Bases: object

Wraps a list of sequences and a list of targets as a batch iterator.

You can then iterate over this object and get tuples of batch_data, batch_targets ready for use in your computation graph.

Example:

import numpy as np
import dynn
from dynn.data.batching import PaddedSequenceBatches

# Dictionary
dic = dynn.data.dictionary.Dictionary(symbols="abcde".split())
# 1000 sequences of various lengths up to 10
data = [np.random.randint(len(dic), size=np.random.randint(10))
        for _ in range(1000)]
# Class labels
labels = np.random.randint(10, size=1000)
# Iterator with at most 20 samples per batch
batched_dataset = PaddedSequenceBatches(
    data,
    targets=labels,
    max_samples=20,
    pad_idx=dic.pad_idx,
)
# Training loop
for x, y in batched_dataset:
    # x is a SequenceBatch object
    # and y has shape (batch_size,)
    # Do something with x and y

# Without labels
batched_dataset = PaddedSequenceBatches(
    data,
    max_samples=20,
    pad_idx=dic.pad_idx,
)
for x in batched_dataset:
    # x is a SequenceBatch object
    # Do something with x
Parameters:
  • data (list) – List of numpy arrays containing the data
  • targets (list) – List of targets
  • pad_idx (int, optional) – Value at padded positions (default: 0)
  • max_samples (int, optional) – Maximum number of samples per batch
  • max_tokens (int, optional) – Maximum number of tokens per batch. This count doesn’t include padding tokens
  • strict_token_limit (bool, optional) – Padding tokens will count towards the max_tokens limit
  • shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
  • group_by_length (bool, optional) – Group sequences by length. This minimizes the number of padding tokens. The batches are not strictly IID though.
  • left_aligned (bool, optional) – Align the sequences to the left
__getitem__(index)

Returns the index-th sample

The result is a tuple batch_data, batch_target where the first is a batch of sequences and the other is a numpy array in Fortran layout (for more efficient input in dynet).

batch_data is a SequenceBatch object

Parameters:index (int, slice) – Index or slice
Returns:batch_data, batch_target
Return type:tuple
__init__(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)

Initialize self. See help(type(self)) for accurate signature.

__len__()

This returns the number of batches in the dataset (not the total number of samples)

Returns:Number of batches in the dataset, i.e. ceil(len(data)/batch_size)
Return type:int
__weakref__

list of weak references to the object (if defined)

just_passed_multiple(batch_number)

Checks whether the current number of batches processed has just passed a multiple of batch_number.

For example, you can use this to report at regular intervals (e.g. every 10 batches)

Parameters:batch_number (int) – Multiple to check against
Returns:True if the number of batches processed so far has just passed a multiple of batch_number, False otherwise
Return type:bool
percentage_done()

What percent of the data has been covered in the current epoch

reset()

Reset the iterator and shuffle the dataset if applicable

class dynn.data.batching.BPTTBatches(data, batch_size=32, seq_length=30)

Bases: object

Wraps a list of sequences as a contiguous batch iterator.

This will iterate over batches of contiguous subsequences of size seq_length. The data is laid out as batch_size parallel streams of contiguous text; each step of the iterator returns the next seq_length positions of every stream together with the same slice shifted by one token (the targets), as is standard for truncated backpropagation through time.

Example:

import numpy as np
from dynn.data.batching import BPTTBatches

# Sequence of length 1000
data = np.random.randint(10, size=1000)
# Iterator over subsequences of length 20 with batch size 5
batched_dataset = BPTTBatches(data, batch_size=5, seq_length=20)
# Training loop
for x, y in batched_dataset:
    # x and y have shape (seq_length, batch_size)
    # y[t] == x[t + 1]
    # Do something with x and y
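
For intuition, a small numpy-only sketch of the underlying layout (an illustration of the batching strategy, not dynn's actual implementation):

import numpy as np

data = np.arange(12)                      # a toy corpus: 0..11
batch_size, seq_length = 2, 3
n = len(data) // batch_size               # tokens per stream
# Split the corpus into batch_size contiguous streams, one per
# column: column 0 holds 0..5, column 1 holds 6..11
streams = data[:n * batch_size].reshape(batch_size, n).T
# One BPTT batch: seq_length steps of every stream, plus the
# same slice shifted by one position as targets
x = streams[0:seq_length]                 # shape (seq_length, batch_size)
y = streams[1:seq_length + 1]             # y[t] == x[t + 1]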
Parameters:
  • data (list) – List of numpy arrays containing the data
  • batch_size (int, optional) – Batch size
  • seq_length (int, optional) – BPTT length
__getitem__(index)

Returns the index-th sample

The result is a tuple x, next_x of numpy arrays of shape (seq_length, batch_size). seq_length is determined by the range specified by index, and next_x[t] == x[t + 1] for all t.

Parameters:index (int, slice) – Index or slice
Returns:x, next_x
Return type:tuple
__init__(data, batch_size=32, seq_length=30)

Initialize self. See help(type(self)) for accurate signature.

__len__()

This returns the number of batches in the dataset (not the total number of samples)

Returns:Number of batches in the dataset, i.e. ceil(len(data)/batch_size)
Return type:int
__weakref__

list of weak references to the object (if defined)

just_passed_multiple(batch_number)

Checks whether the current number of batches processed has just passed a multiple of batch_number.

For example, you can use this to report at regular intervals (e.g. every 10 batches)

Parameters:batch_number (int) – Multiple to check against
Returns:True if the number of batches processed so far has just passed a multiple of batch_number, False otherwise
Return type:bool
percentage_done()

What percent of the data has been covered in the current epoch

reset()

Reset the iterator and shuffle the dataset if applicable

class dynn.data.batching.SequencePairsBatches(src_data, tgt_data, src_dictionary, tgt_dictionary=None, labels=None, max_samples=32, max_tokens=99999999, strict_token_limit=False, shuffle=True, group_by_length='source', src_left_aligned=True, tgt_left_aligned=True)

Bases: object

Wraps two lists of sequences as a batch iterator.

This is useful for sequence-to-sequence problems or sentence pair classification (entailment, paraphrase detection…). Following seq2seq conventions the first sequence is referred to as the “source” and the second as the “target”.

You can then iterate over this object and get tuples of src_batch, tgt_batch ready for use in your computation graph.

Example:

import numpy as np
import dynn
from dynn.data.batching import SequencePairsBatches

# Dictionary
dic = dynn.data.dictionary.Dictionary(symbols="abcde".split())
# 1000 source sequences of various lengths up to 10
src_data = [np.random.randint(len(dic), size=np.random.randint(10))
            for _ in range(1000)]
# 1000 target sequences of various lengths up to 10
tgt_data = [np.random.randint(len(dic), size=np.random.randint(10))
            for _ in range(1000)]
# Iterator with at most 20 samples per batch
batched_dataset = SequencePairsBatches(
    src_data, tgt_data, dic, max_samples=20
)
# Training loop
for x, y in batched_dataset:
    # x and y are SequenceBatch objects
    # Do something with x and y
Parameters:
  • src_data (list) – List of source sequences (list of int iterables)
  • tgt_data (list) – List of target sequences (list of int iterables)
  • src_dictionary (Dictionary) – Source dictionary
  • tgt_dictionary (Dictionary) – Target dictionary
  • max_samples (int, optional) – Maximum number of samples per batch (one sample is a pair of sentences)
  • max_tokens (int, optional) – Maximum number of total tokens per batch (source + target tokens)
  • strict_token_limit (bool, optional) – Padding tokens will count towards the max_tokens limit
  • shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
  • group_by_length (str, optional) – Group sequences by length. One of "source" or "target". This minimizes the number of padding tokens. The batches are not strictly IID though.
  • src_left_aligned (bool, optional) – Align the source sequences to the left
  • tgt_left_aligned (bool, optional) – Align the target sequences to the left
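
A hedged consumption sketch in DyNet, continuing the example above (E is a hypothetical embedding table, and the padded index array is assumed to be exposed as .sequences, as for SequenceBatch):

import dynet as dy

model = dy.ParameterCollection()
E = model.add_lookup_parameters((len(dic), 64))  # embedding table

for src, tgt in batched_dataset:
    dy.renew_cg()
    # one batched embedding per source time step
    src_embs = [dy.lookup_batch(E, ids) for ids in src.sequences]
    # ... encode src_embs, decode against tgt, compute a loss ...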
__getitem__(index)

Returns the index-th sample

The result is a tuple src_batch, tgt_batch where each is a SequenceBatch object.

Parameters:index (int, slice) – Index or slice
Returns:src_batch, tgt_batch
Return type:tuple
__init__(src_data, tgt_data, src_dictionary, tgt_dictionary=None, labels=None, max_samples=32, max_tokens=99999999, strict_token_limit=False, shuffle=True, group_by_length='source', src_left_aligned=True, tgt_left_aligned=True)

Initialize self. See help(type(self)) for accurate signature.

__len__()

This returns the number of batches in the dataset (not the total number of samples)

Returns:Number of batches in the dataset, i.e. ceil(len(data)/batch_size)
Return type:int
__weakref__

list of weak references to the object (if defined)

just_passed_multiple(batch_number)

Checks whether the current number of batches processed has just passed a multiple of batch_number.

For example, you can use this to report at regular intervals (e.g. every 10 batches)

Parameters:batch_number (int) – Multiple to check against
Returns:True if the number of batches processed so far has just passed a multiple of batch_number, False otherwise
Return type:bool
percentage_done()

What percent of the data has been covered in the current epoch

reset()

Reset the iterator and shuffle the dataset if applicable

Submodules