dynn.data.batching package
Batching procedures
Iterators implementing common batching strategies.
class dynn.data.batching.NumpyBatches(data, targets, batch_size=32, shuffle=True)
Bases: object
Wraps a list of numpy arrays and a list of targets as a batch iterator.
You can then iterate over this object and get tuples of batch_data, batch_targets ready for use in your computation graph.
Example for classification:

    # 1000 10-dimensional inputs
    data = np.random.uniform(size=(1000, 10))
    # Class labels
    labels = np.random.randint(10, size=1000)
    # Iterator
    batched_dataset = NumpyBatches(data, labels, batch_size=20)
    # Training loop
    for x, y in batched_dataset:
        # x has shape (10, 20) while y has shape (20,)
        # Do something with x and y
Example for multidimensional regression:
    # 1000 10-dimensional inputs
    data = np.random.uniform(size=(1000, 10))
    # 5-dimensional outputs
    labels = np.random.uniform(size=(1000, 5))
    # Iterator
    batched_dataset = NumpyBatches(data, labels, batch_size=20)
    # Training loop
    for x, y in batched_dataset:
        # x has shape (10, 20) while y has shape (5, 20)
        # Do something with x and y
Parameters: - data (list) – List of numpy arrays containing the data
- targets (list) – List of targets
- batch_size (int, optional) – Batch size (default: 32)
- shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
__getitem__(index)
Returns the index-th sample.
This returns something different every time the data is shuffled. If index is a list or a slice this will return a batch. The result is a tuple batch_data, batch_target where each of those is a numpy array in Fortran layout (for more efficient input in dynet). The batch size is always the last dimension.
Parameters: index (int, slice) – Index or slice
Returns: batch_data, batch_target
Return type: tuple
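The batch-last, Fortran-ordered layout described above can be illustrated with plain numpy (a sketch of the layout only, not NumpyBatches' actual implementation):

```python
import numpy as np

# 1000 samples of dimension 10, as in the example above
data = np.random.uniform(size=(1000, 10))

# Select a batch of 20 samples and make the batch the LAST dimension,
# stored in Fortran (column-major) order for efficient dynet input
batch = np.asfortranarray(data[:20].T)

print(batch.shape)                  # (10, 20)
print(batch.flags["F_CONTIGUOUS"])  # True
```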
__init__(data, targets, batch_size=32, shuffle=True)
Initialize self. See help(type(self)) for accurate signature.

__len__()
Returns the number of batches in the dataset (not the total number of samples).
Returns: Number of batches in the dataset, ceil(len(data) / batch_size)
Return type: int

__weakref__
List of weak references to the object (if defined)
just_passed_multiple(batch_number)
Checks whether the current number of batches processed has just passed a multiple of batch_number. For example, you can use this to report at regular intervals (e.g. every 10 batches).
Parameters: batch_number (int) – The multiple to check against
Returns: True if the current number of batches processed is a multiple of batch_number
Return type: bool
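The check boils down to modular arithmetic on the batch counter; a minimal standalone sketch of that logic (hypothetical helper, not the library's code):

```python
def just_passed_multiple(current_batch, batch_number):
    # True when the number of processed batches hits a multiple of batch_number
    return current_batch > 0 and current_batch % batch_number == 0

# Report every 10 batches in a training loop
for step in range(1, 31):
    if just_passed_multiple(step, 10):
        print(f"processed {step} batches")  # fires at 10, 20 and 30
```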
percentage_done()
What percentage of the data has been covered in the current epoch.

reset()
Reset the iterator and shuffle the dataset if applicable.
class dynn.data.batching.SequenceBatch(sequences, original_idxs=None, pad_idx=None, left_aligned=True)
Bases: object
Batched sequence object with padding
This wraps a list of integer sequences into a nice array padded to the longest sequence. The batch dimension (number of sequences) is the last dimension.
By default the sequences are padded to the right which means that they are aligned to the left (they all start at index 0)
Parameters: - sequences (list) – List of list of integers
- original_idxs (list) – This list should point to the original position of each sequence in the data (before shuffling/reordering). This is useful when you want to access information that has been discarded during preprocessing (e.g. the original sentence before numberizing and <unk>-ing in MT).
- pad_idx (int) – Default index for padding
- left_aligned (bool, optional) – Align to the left (all sequences start at the same position)
__init__(sequences, original_idxs=None, pad_idx=None, left_aligned=True)
Initialize self. See help(type(self)) for accurate signature.

__weakref__
List of weak references to the object (if defined)
collate(sequences)
Pad and concatenate the sequences into an array, padding with the object's pad_idx.
Parameters: sequences (list) – List of list of integers
Returns: max_len x batch_size array
Return type: np.ndarray
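The padding scheme can be sketched in plain numpy (an illustrative stand-in, not the actual implementation; the real method takes pad_idx and alignment from the object):

```python
import numpy as np

def collate_sketch(sequences, pad_idx=0, left_aligned=True):
    # Pad integer sequences to the longest length; batch size is the LAST dim
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((max_len, len(sequences)), pad_idx, dtype=np.int64)
    for b, seq in enumerate(sequences):
        if left_aligned:
            batch[:len(seq), b] = seq            # pad on the right
        else:
            batch[max_len - len(seq):, b] = seq  # pad on the left
    return batch

# Two sequences padded to length 3; columns are batch elements
print(collate_sketch([[1, 2, 3], [4, 5]]).T.tolist())  # [[1, 2, 3], [4, 5, 0]]
```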
get_mask(base_val=1, mask_val=0)
Return a mask expression with specific values for padding tokens.
This will return an expression of the same shape as self.sequences where the i-th element of batch b is base_val if i <= lengths[b] (and mask_val otherwise).
For example, if size is 4 and lengths is [1, 2, 4] then the returned mask will be:

    1 0 0 0
    1 1 0 0
    1 1 1 1

(here each row is a batch element)
Parameters: - base_val (int, optional) – Value for non-padding positions (default: 1)
- mask_val (int, optional) – Value for padding positions (default: 0)
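The mask values are easy to reproduce in numpy (a numpy sketch of the values only; the real method returns a dynet expression):

```python
import numpy as np

def mask_values(lengths, size, base_val=1, mask_val=0):
    # Position t of batch element b gets base_val while t < lengths[b],
    # and mask_val over the padding positions
    positions = np.arange(size)[:, None]  # (size, 1)
    return np.where(positions < np.asarray(lengths)[None, :],
                    base_val, mask_val)   # (size, batch_size)

# size 4 and lengths [1, 2, 4], printed with one row per batch element
print(mask_values([1, 2, 4], 4).T)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 1]]
```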
class dynn.data.batching.PaddedSequenceBatches(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)
Bases: object
Wraps a list of sequences and a list of targets as a batch iterator.
You can then iterate over this object and get tuples of batch_data, batch_targets ready for use in your computation graph.
Example:

    # Dictionary
    dic = dynn.data.dictionary.Dictionary(symbols="abcde".split())
    # 1000 sequences of various lengths up to 10
    data = [np.random.randint(len(dic), size=np.random.randint(10))
            for _ in range(1000)]
    # Class labels
    labels = np.random.randint(10, size=1000)
    # Iterator with at most 20 samples per batch
    batched_dataset = PaddedSequenceBatches(
        data,
        targets=labels,
        max_samples=20,
        pad_idx=dic.pad_idx,
    )
    # Training loop
    for x, y in batched_dataset:
        # x is a SequenceBatch object
        # and y has shape (batch_size,)
        # Do something with x and y

    # Without labels
    batched_dataset = PaddedSequenceBatches(
        data,
        max_samples=20,
        pad_idx=dic.pad_idx,
    )
    for x in batched_dataset:
        # x is a SequenceBatch object
        # Do something with x
Parameters: - data (list) – List of numpy arrays containing the data
- targets (list) – List of targets
- pad_idx (number) – Value at padded positions
- max_samples (int, optional) – Maximum number of samples per batch
- max_tokens (int, optional) – Maximum number of tokens per batch. This count doesn't include padding tokens
- strict_token_limit (bool, optional) – Padding tokens will count towards the max_tokens limit
- shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
- group_by_length (bool, optional) – Group sequences by length. This minimizes the number of padding tokens. The batches are not strictly IID though.
- left_aligned (bool, optional) – Align the sequences to the left
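The interaction between max_samples and max_tokens amounts to a greedy budget: keep adding sequences to the current batch until either limit would be exceeded. A simplified sketch of that policy (hypothetical helper, not dynn's exact algorithm):

```python
def make_batches(lengths, max_samples=32, max_tokens=float("inf")):
    # Greedily group sequence indices under a sample and a token budget
    batches, current, tokens = [], [], 0
    for i, n in enumerate(lengths):
        # Start a new batch if adding this sequence would break a limit
        if current and (len(current) >= max_samples or tokens + n > max_tokens):
            batches.append(current)
            current, tokens = [], 0
        current.append(i)
        tokens += n
    if current:
        batches.append(current)
    return batches

print(make_batches([5, 5, 5], max_samples=2))              # [[0, 1], [2]]
print(make_batches([5, 5, 5], max_tokens=8))               # [[0], [1], [2]]
```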
__getitem__(index)
Returns the index-th sample.
The result is a tuple batch_data, batch_target where the first is a batch of sequences and the second is a numpy array in Fortran layout (for more efficient input in dynet). batch_data is a SequenceBatch object.
Parameters: index (int, slice) – Index or slice
Returns: batch_data, batch_target
Return type: tuple
__init__(data, targets=None, max_samples=32, pad_idx=0, max_tokens=inf, strict_token_limit=False, shuffle=True, group_by_length=True, left_aligned=True)
Initialize self. See help(type(self)) for accurate signature.

__len__()
Returns the number of batches in the dataset (not the total number of samples).
Returns: Number of batches in the dataset, ceil(len(data) / batch_size)
Return type: int

__weakref__
List of weak references to the object (if defined)
just_passed_multiple(batch_number)
Checks whether the current number of batches processed has just passed a multiple of batch_number. For example, you can use this to report at regular intervals (e.g. every 10 batches).
Parameters: batch_number (int) – The multiple to check against
Returns: True if the current number of batches processed is a multiple of batch_number
Return type: bool
percentage_done()
What percentage of the data has been covered in the current epoch.

reset()
Reset the iterator and shuffle the dataset if applicable.
class dynn.data.batching.BPTTBatches(data, batch_size=32, seq_length=30)
Bases: object
Wraps a list of sequences as a contiguous batch iterator.
This will iterate over batches of contiguous subsequences of size seq_length. The data is first split into batch_size contiguous chunks; each batch then contains the next seq_length steps of every chunk, together with the same subsequences shifted by one step (the prediction targets).
Example:

    # Sequence of length 1000
    data = np.random.randint(10, size=1000)
    # Iterator over subsequences of length 20 with batch size 5
    batched_dataset = BPTTBatches(data, batch_size=5, seq_length=20)
    # Training loop
    for x, y in batched_dataset:
        # x and y have shape (seq_length, batch_size)
        # y[t] == x[t + 1]
        # Do something with x and y
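The chunk-then-window layout can be sketched with a numpy reshape (an illustration under the assumptions above, not BPTTBatches' actual code):

```python
import numpy as np

def bptt_batches(data, batch_size, seq_length):
    # Split the token stream into batch_size contiguous chunks, then yield
    # (x, y) windows of seq_length steps, with y = x shifted by one token
    data = np.asarray(data)
    n = (len(data) - 1) // batch_size * batch_size   # drop the remainder
    x_all = data[:n].reshape(batch_size, -1).T       # (steps, batch_size)
    y_all = data[1:n + 1].reshape(batch_size, -1).T  # same, one step ahead
    for t in range(0, len(x_all), seq_length):
        yield x_all[t:t + seq_length], y_all[t:t + seq_length]

x, y = next(bptt_batches(np.arange(10), batch_size=2, seq_length=2))
print(x.tolist())  # [[0, 4], [1, 5]]
print(y.tolist())  # [[1, 5], [2, 6]]
```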
Parameters: - data (list) – List of integer sequences
- batch_size (int, optional) – Batch size (default: 32)
- seq_length (int, optional) – Length of the subsequences (default: 30)
__getitem__
(index)¶ Returns the
index
th sampleThe result is a tuple
x, next_x
of numpy arrays of shapeseq_len x batch_size
seq_length
is determined by the range specified byindex
, andnext_x[t]=x[t+1]
for allt
Parameters: index (int, slice) – Index or slice Returns: x, next_x
Return type: tuple
__init__(data, batch_size=32, seq_length=30)
Initialize self. See help(type(self)) for accurate signature.

__len__()
Returns the number of batches in the dataset (not the total number of samples).
Returns: Number of batches in the dataset, ceil(len(data) / batch_size)
Return type: int

__weakref__
List of weak references to the object (if defined)
just_passed_multiple(batch_number)
Checks whether the current number of batches processed has just passed a multiple of batch_number. For example, you can use this to report at regular intervals (e.g. every 10 batches).
Parameters: batch_number (int) – The multiple to check against
Returns: True if the current number of batches processed is a multiple of batch_number
Return type: bool
percentage_done()
What percentage of the data has been covered in the current epoch.

reset()
Reset the iterator and shuffle the dataset if applicable.
class dynn.data.batching.SequencePairsBatches(src_data, tgt_data, src_dictionary, tgt_dictionary=None, labels=None, max_samples=32, max_tokens=99999999, strict_token_limit=False, shuffle=True, group_by_length='source', src_left_aligned=True, tgt_left_aligned=True)
Bases: object
Wraps two lists of sequences as a batch iterator.
This is useful for sequence-to-sequence problems or sentence pairs classification (entailment, paraphrase detection…). Following seq2seq conventions the first sequence is referred to as the “source” and the second as the “target”.
You can then iterate over this object and get tuples of src_batch, tgt_batch ready for use in your computation graph.
Example:

    # Dictionary
    dic = dynn.data.dictionary.Dictionary(symbols="abcde".split())
    # 1000 source sequences of various lengths up to 10
    src_data = [np.random.randint(len(dic), size=np.random.randint(10))
                for _ in range(1000)]
    # 1000 target sequences of various lengths up to 10
    tgt_data = [np.random.randint(len(dic), size=np.random.randint(10))
                for _ in range(1000)]
    # Iterator with at most 20 samples per batch
    batched_dataset = SequencePairsBatches(
        src_data, tgt_data, max_samples=20
    )
    # Training loop
    for x, y in batched_dataset:
        # x and y are SequenceBatch objects
Parameters: - src_data (list) – List of source sequences (list of int iterables)
- tgt_data (list) – List of target sequences (list of int iterables)
- src_dictionary (Dictionary) – Source dictionary
- tgt_dictionary (Dictionary) – Target dictionary
- labels (list, optional) – List of labels
- max_samples (int, optional) – Maximum number of samples per batch (one sample is a pair of sentences)
- max_tokens (int, optional) – Maximum number of total tokens per batch (source + target tokens)
- strict_token_limit (bool, optional) – Padding tokens will count towards the max_tokens limit
- shuffle (bool, optional) – Shuffle the dataset whenever starting a new iteration (default: True)
- group_by_length (str, optional) – Group sequences by length. One of "source" or "target". This minimizes the number of padding tokens. The batches are not strictly IID though.
- src_left_aligned (bool, optional) – Align the source sequences to the left
- tgt_left_aligned (bool, optional) – Align the target sequences to the left
__getitem__(index)
Returns the index-th sample.
The result is a tuple src_batch, tgt_batch where each is a SequenceBatch object.
Parameters: index (int, slice) – Index or slice
Returns: src_batch, tgt_batch
Return type: tuple
__init__(src_data, tgt_data, src_dictionary, tgt_dictionary=None, labels=None, max_samples=32, max_tokens=99999999, strict_token_limit=False, shuffle=True, group_by_length='source', src_left_aligned=True, tgt_left_aligned=True)
Initialize self. See help(type(self)) for accurate signature.

__len__()
Returns the number of batches in the dataset (not the total number of samples).
Returns: Number of batches in the dataset, ceil(len(data) / batch_size)
Return type: int

__weakref__
List of weak references to the object (if defined)
just_passed_multiple(batch_number)
Checks whether the current number of batches processed has just passed a multiple of batch_number. For example, you can use this to report at regular intervals (e.g. every 10 batches).
Parameters: batch_number (int) – The multiple to check against
Returns: True if the current number of batches processed is a multiple of batch_number
Return type: bool
percentage_done()
What percentage of the data has been covered in the current epoch.

reset()
Reset the iterator and shuffle the dataset if applicable.