class dynn.data.batching.padded_sequence_batching.PaddedSequenceBatches(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)

Bases: object

Wraps a list of sequences and a list of targets as a batch iterator.

You can then iterate over this object and get tuples of batch_data, batch_targets ready for use in your computation graph.

Example:

import numpy as np

import dynn.data.dictionary
from dynn.data.batching import PaddedSequenceBatches

# Dictionary
dic = dynn.data.dictionary.Dictionary(symbols="abcde".split())
# 1000 sequences of random length (up to 9 tokens)
data = [np.random.randint(len(dic), size=np.random.randint(10))
        for _ in range(1000)]
# Class labels
labels = np.random.randint(10, size=1000)
# Iterator with at most 20 samples or 50 tokens per batch
batched_dataset = PaddedSequenceBatches(
    data,
    targets=labels,
    max_samples=20,
    max_tokens=50,
    pad_idx=dic.pad_idx,
)
# Training loop
for x, y in batched_dataset:
    # x is a SequenceBatch object
    # and y has shape (batch_size,)
    # Do something with x and y

# Without labels
batched_dataset = PaddedSequenceBatches(
    data,
    max_samples=20,
    pad_idx=dic.pad_idx,
)
for x in batched_dataset:
    # x is a SequenceBatch object
    # Do something with x
Parameters:
  • data (list) – List of numpy arrays containing the data
  • targets (list) – List of targets
  • pad_idx (number) – Value used at padded positions
  • max_samples (int, optional) – Maximum number of samples per batch
  • max_tokens (int, optional) – Maximum number of tokens per batch. By default this count doesn't include padding tokens (see the sketch after this list)
  • strict_token_limit (bool, optional) – If True, padding tokens count towards the max_tokens limit
  • shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
  • group_by_length (bool, optional) – Group sequences by length. This minimizes the number of padding tokens, at the cost of batches that are not strictly IID
  • left_aligned (bool, optional) – Align the sequences to the left
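
The sample and token limits interact as sketched below (a minimal example reusing data, labels and dic from above; the assertion holds because the targets array has one entry per sample in the batch):

# Cap batches at 16 samples or 50 non-padding tokens,
# whichever limit is reached first
batched_dataset = PaddedSequenceBatches(
    data,
    targets=labels,
    max_samples=16,
    max_tokens=50,
    pad_idx=dic.pad_idx,
)
for x, y in batched_dataset:
    # No batch ever exceeds the sample limit
    assert len(y) <= 16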
__getitem__(index)

Returns the index-th sample

The result is a tuple batch_data, batch_target where the first is a batch of sequences and the second is a numpy array in Fortran layout (for more efficient input into DyNet).

batch_data is a SequenceBatch object

Parameters:index (int, slice) – Index or slice
Returns:batch_data, batch_target
Return type:tuple
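
For instance, reusing the batched_dataset built above (a minimal sketch; the exact shapes depend on which sequences land in the first batch):

# Fetch the first batch directly by index
batch_data, batch_target = batched_dataset[0]
# batch_data is a SequenceBatch; batch_target is a numpy
# array with one entry per sample in the batch
print(batch_target.shape)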
__init__(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)

Initialize self. See help(type(self)) for accurate signature.

__len__()

This returns the number of batches in the dataset (not the total number of samples)

Returns:
Number of batches in the dataset:
ceil(len(data)/max_samples) when only max_samples limits the batch size
Return type:int
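
For instance, with the 1000-sample dataset from the example above (a sketch; the count below assumes max_tokens is left at its default, so only max_samples limits the batch size):

batched_dataset = PaddedSequenceBatches(
    data,
    targets=labels,
    max_samples=20,
    pad_idx=dic.pad_idx,
)
print(len(batched_dataset))  # ceil(1000 / 20) = 50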
__weakref__

list of weak references to the object (if defined)

just_passed_multiple(batch_number)

Checks whether the current number of batches processed has just passed a multiple of batch_number.

For example, you can use this to report at regular intervals (e.g. every 10 batches); see the sketch below.

Parameters:batch_number (int) – The multiple to check against
Returns:True if the number of batches processed so far has just passed a multiple of batch_number, False otherwise
Return type:bool
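
A typical use for periodic logging (a minimal sketch; the batch counter is tracked internally by the iterator, so only the reporting interval is passed in):

for i, (x, y) in enumerate(batched_dataset):
    # ... forward/backward pass on x, y ...
    if batched_dataset.just_passed_multiple(10):
        # Report every 10 batches
        print(f"Processed {i + 1} batches")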
percentage_done()

Returns the percentage of the data covered so far in the current epoch; see the sketch below.
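
For example (a sketch; this assumes the return value is expressed as a percentage between 0 and 100):

for x, y in batched_dataset:
    # ... training step on x, y ...
    if batched_dataset.just_passed_multiple(100):
        # Assumption: percentage_done() returns a number in [0, 100]
        print(f"{batched_dataset.percentage_done():.1f}% of the epoch done")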

reset()

Reset the iterator and shuffle the dataset if applicable
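
reset is mainly useful when fetching batches by index rather than iterating; below is a sketch of a multi-epoch loop (this assumes, as a usage pattern not shown above, that the order seen through __getitem__ is refreshed by each reset when shuffle=True):

for epoch in range(3):
    # Assumption: reset() reshuffles the order exposed by indexing
    batched_dataset.reset()
    for batch_idx in range(len(batched_dataset)):
        x, y = batched_dataset[batch_idx]
        # ... training step on x, y ...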