class dynn.data.batching.padded_sequence_batching.PaddedSequenceBatches(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)

Bases: object

Wraps a list of sequences and a list of targets as a batch iterator.

You can then iterate over this object and get tuples of batch_data, batch_targets ready for use in your computation graph.

Example:

import numpy as np

import dynn.data.dictionary
from dynn.data.batching import PaddedSequenceBatches

# Dictionary
dic = dynn.data.dictionary.Dictionary(symbols="abcde".split())
# 1000 sequences of random length (up to 9 tokens)
data = [np.random.randint(len(dic), size=np.random.randint(10))
        for _ in range(1000)]
# Class labels
labels = np.random.randint(10, size=1000)
# Iterator with at most 20 samples or 50 tokens per batch
batched_dataset = PaddedSequenceBatches(
    data,
    targets=labels,
    max_samples=20,
    max_tokens=50,
    pad_idx=dic.pad_idx,
)
# Training loop
for x, y in batched_dataset:
    # x is a SequenceBatch object
    # and y has shape (batch_size,)
    # Do something with x and y

# Without labels
batched_dataset = PaddedSequenceBatches(
    data,
    max_samples=20,
    pad_idx=dic.pad_idx,
)
for x in batched_dataset:
    # x is a SequenceBatch object
    # Do something with x
Parameters:
  • data (list) – List of numpy arrays containing the data
  • targets (list) – List of targets
  • pad_idx (number) – Value used at padded positions
  • max_samples (int, optional) – Maximum number of samples per batch
  • max_tokens (int, optional) – Maximum number of tokens per batch. By default this count doesn't include padding tokens (see the sketch after this list)
  • strict_token_limit (bool, optional) – If True, padding tokens count towards the max_tokens limit
  • shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
  • group_by_length (bool, optional) – Group sequences by length. This minimizes the number of padding tokens, at the cost of batches that are not strictly IID
  • left_aligned (bool, optional) – Align the sequences to the left
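
The sample and token limits interact as sketched below (a minimal example reusing data, labels and dic from above; the assertion holds because the targets array has one entry per sample in the batch):

# Cap batches at 16 samples or 50 non-padding tokens,
# whichever limit is reached first
batched_dataset = PaddedSequenceBatches(
    data,
    targets=labels,
    max_samples=16,
    max_tokens=50,
    pad_idx=dic.pad_idx,
)
for x, y in batched_dataset:
    # No batch ever exceeds the sample limit
    assert len(y) <= 16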
__getitem__(index)

Returns the index-th sample

The result is a tuple batch_data, batch_target where the first is a batch of sequences and the second is a numpy array in Fortran layout (for more efficient input into DyNet).

batch_data is a SequenceBatch object

Parameters:index (int, slice) – Index or slice
Returns:batch_data, batch_target
Return type:tuple
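
For instance, reusing the batched_dataset built above (a minimal sketch; the exact shapes depend on which sequences land in the first batch):

# Fetch the first batch directly by index
batch_data, batch_target = batched_dataset[0]
# batch_data is a SequenceBatch; batch_target is a numpy
# array with one entry per sample in the batch
print(batch_target.shape)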
__init__(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)

Initialize self. See help(type(self)) for accurate signature.

__len__()

This returns the number of batches in the dataset (not the total number of samples)

Returns:
Number of batches in the dataset:
ceil(len(data)/max_samples) when only max_samples limits the batch size
Return type:int
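
For instance, with the 1000-sample dataset from the example above (a sketch; the count below assumes max_tokens is left at its default, so only max_samples limits the batch size):

batched_dataset = PaddedSequenceBatches(
    data,
    targets=labels,
    max_samples=20,
    pad_idx=dic.pad_idx,
)
print(len(batched_dataset))  # ceil(1000 / 20) = 50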
__weakref__

list of weak references to the object (if defined)

just_passed_multiple(batch_number)

Checks whether the current number of batches processed has just passed a multiple of batch_number.

For example, you can use this to report at regular intervals (e.g. every 10 batches); see the sketch below.

Parameters:batch_number (int) – The multiple to check against
Returns:True if the number of batches processed so far has just passed a multiple of batch_number, False otherwise
Return type:bool
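
A typical use for periodic logging (a minimal sketch; the batch counter is tracked internally by the iterator, so only the reporting interval is passed in):

for i, (x, y) in enumerate(batched_dataset):
    # ... forward/backward pass on x, y ...
    if batched_dataset.just_passed_multiple(10):
        # Report every 10 batches
        print(f"Processed {i + 1} batches")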
percentage_done()

Returns the percentage of the data covered so far in the current epoch; see the sketch below.
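
For example (a sketch; this assumes the return value is expressed as a percentage between 0 and 100):

for x, y in batched_dataset:
    # ... training step on x, y ...
    if batched_dataset.just_passed_multiple(100):
        # Assumption: percentage_done() returns a number in [0, 100]
        print(f"{batched_dataset.percentage_done():.1f}% of the epoch done")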

reset()

Reset the iterator and shuffle the dataset if applicable
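
reset is mainly useful when fetching batches by index rather than iterating; below is a sketch of a multi-epoch loop (this assumes, as a usage pattern not shown above, that the order seen through __getitem__ is refreshed by each reset when shuffle=True):

for epoch in range(3):
    # Assumption: reset() reshuffles the order exposed by indexing
    batched_dataset.reset()
    for batch_idx in range(len(batched_dataset)):
        x, y = batched_dataset[batch_idx]
        # ... training step on x, y ...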