class dynn.data.batching.padded_sequence_batching.PaddedSequenceBatches(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)

Bases: object
Wraps a list of sequences and a list of targets as a batch iterator. You can then iterate over this object and get tuples of batch_data, batch_targets ready for use in your computation graph.

Example:

```python
# Dictionary
dic = dynn.data.dictionary.Dictionary(symbols="abcde".split())
# 1000 sequences of various lengths up to 10
data = [np.random.randint(len(dic), size=np.random.randint(10))
        for _ in range(1000)]
# Class labels
labels = np.random.randint(10, size=1000)
# Iterator with at most 20 samples or 50 tokens per batch
batched_dataset = PaddedSequenceBatches(
    data,
    targets=labels,
    max_samples=20,
    max_tokens=50,
    pad_idx=dic.pad_idx,
)
# Training loop
for x, y in batched_dataset:
    # x is a SequenceBatch object
    # and y has shape (batch_size,)
    # Do something with x and y
    pass

# Without labels
batched_dataset = PaddedSequenceBatches(
    data,
    max_samples=20,
    pad_idx=dic.pad_idx,
)
for x in batched_dataset:
    # x is a SequenceBatch object
    # Do something with x
    pass
```
Parameters:
- data (list) – List of numpy arrays containing the data
- targets (list) – List of targets
- pad_idx (number) – Value used at padded positions
- max_samples (int, optional) – Maximum number of samples per batch
- max_tokens (int, optional) – Maximum number of tokens per batch. This count doesn't include padding tokens.
- strict_token_limit (bool, optional) – If True, padding tokens also count towards the max_tokens limit
- shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
- group_by_length (bool, optional) – Group sequences by length. This minimizes the number of padding tokens, but the batches are not strictly IID.
- left_aligned (bool, optional) – Align the sequences to the left
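The effect of group_by_length can be illustrated with a short, self-contained sketch (the make_batches helper below is hypothetical, not dynn's actual implementation): sorting indices by sequence length before chunking keeps similarly-sized sequences together, so each batch needs little padding.

```python
def make_batches(seqs, max_samples, group_by_length=True):
    # Hypothetical sketch of length-grouped batching (not dynn's code).
    order = list(range(len(seqs)))
    if group_by_length:
        # Sort indices by sequence length so each batch holds
        # similarly-sized sequences and wastes few padding tokens.
        order.sort(key=lambda i: len(seqs[i]))
    # Chunk the (possibly sorted) index list into batches of indices.
    return [order[i:i + max_samples]
            for i in range(0, len(order), max_samples)]

seqs = [[0], [1, 2, 3], [4, 5], [6, 7, 8, 9]]
batches = make_batches(seqs, max_samples=2)
# The length-1 and length-2 sequences end up in the same batch.
```

This is also why the batches are not strictly IID: neighbouring batches contain sequences of correlated lengths, which is why shuffling between epochs is on by default.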
__getitem__(index)

Returns the index-th sample.

The result is a tuple batch_data, batch_target where the first is a batch of sequences and the other is a numpy array in Fortran layout (for more efficient input in dynet). batch_data is a SequenceBatch object.

Parameters: index (int, slice) – Index or slice
Returns: batch_data, batch_target
Return type: tuple
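As an illustration of what left-aligned padded batch data looks like (a sketch of the idea, not dynn's SequenceBatch internals), each sequence is written from the left and the remainder of the row is filled with pad_idx:

```python
import numpy as np

def pad_left_aligned(seqs, pad_idx=0):
    # Sketch: stack variable-length sequences into one padded matrix,
    # content on the left, pad_idx filling the rest of each row.
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_idx, dtype=int)
    for row, seq in enumerate(seqs):
        out[row, :len(seq)] = seq
    return out

padded = pad_left_aligned([[1, 2, 3], [4, 5]], pad_idx=0)
# padded:
# [[1 2 3]
#  [4 5 0]]
```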
__init__(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)

Initialize self. See help(type(self)) for accurate signature.
__len__()

Returns the number of batches in the dataset (not the total number of samples).

Returns: Number of batches in the dataset, ceil(len(data)/batch_size)
Return type: int
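For instance, with the 1000-sequence dataset from the example above and max_samples=20 (and no effective token limit), the batch count works out to:

```python
import math

# ceil(len(data) / batch_size) with 1000 sequences, 20 samples per batch.
n_batches = math.ceil(1000 / 20)
# n_batches == 50, so len(batched_dataset) would report 50 here.
```

Note that when max_tokens is binding, batches can close early, so the actual count may be larger than this lower bound.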
__weakref__

list of weak references to the object (if defined)
just_passed_multiple(batch_number)

Checks whether the number of batches processed so far has just passed a multiple of batch_number. For example, you can use this to report at regular intervals (e.g. every 10 batches).

Parameters: batch_number (int) – Multiple to check against
Returns: True if the current batch count has just passed a multiple of batch_number
Return type: bool
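The intended behaviour can be approximated with plain modular arithmetic (a sketch, not dynn's implementation):

```python
def just_passed_multiple(batches_processed, batch_number):
    # Sketch: True exactly when the batch counter has reached a
    # multiple of batch_number (e.g. 10, 20, 30 for batch_number=10).
    return batches_processed > 0 and batches_processed % batch_number == 0

# Report every 10 batches over a 30-batch epoch:
report_points = [b for b in range(1, 31) if just_passed_multiple(b, 10)]
# report_points == [10, 20, 30]
```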
percentage_done()

Returns the percentage of the data that has been covered in the current epoch.
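The quantity it reports amounts to a simple ratio (a hypothetical helper mirroring the meaning, not dynn's code):

```python
def percentage_done(samples_seen, dataset_size):
    # Sketch: fraction of the current epoch completed, as a percentage.
    return 100 * samples_seen / dataset_size

# After 250 of 1000 samples:
progress = percentage_done(250, 1000)
# progress == 25.0
```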
reset()

Reset the iterator and shuffle the dataset if applicable.