Preprocessing functions
Useful functions for preprocessing data.
-
dynn.data.preprocess.lowercase(data)
Lowercase text.
Parameters: data (list, str) – Data to lowercase (either a string or a list [of lists...] of strings)
Returns: Lowercased data
Return type: list, str
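The recursive behavior described above (a string, or a list of lists of strings) can be illustrated with a small stand-alone sketch. This is not dynn's implementation; `lowercase_sketch` is a hypothetical name used only to show the expected input and output shapes.

```python
def lowercase_sketch(data):
    """Hypothetical stand-in: lowercase a string or a nested list of strings."""
    if isinstance(data, str):
        return data.lower()
    # Anything that is not a string is assumed to be a list to recurse into.
    return [lowercase_sketch(item) for item in data]


print(lowercase_sketch("Hello World"))
print(lowercase_sketch([["The", "Cat"], ["SAT"]]))
```

The nesting of the input is preserved in the output, which matches the documented return type of list or str.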
-
dynn.data.preprocess.normalize(data)
Normalize the data to mean 0 and standard deviation 1.
Parameters: data (list, np.ndarray) – Data to normalize
Returns: Normalized data
Return type: list, np.ndarray
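As a rough illustration of what "mean 0, standard deviation 1" means, the sketch below standardizes a flat list of numbers using only the standard library. It is not dynn's code: `normalize_sketch` is a hypothetical name, and both the use of the population standard deviation and the choice to normalize over all values at once (rather than per feature) are assumptions.

```python
import statistics


def normalize_sketch(values):
    """Hypothetical stand-in: rescale a flat list to mean 0 and std 1."""
    mean = statistics.fmean(values)
    # Population standard deviation (assumption); zero if all values are equal,
    # in which case the division below would fail.
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


normalized = normalize_sketch([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

With NumPy arrays the same computation is usually written as `(x - x.mean()) / x.std()`.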
-
dynn.data.preprocess.tokenize(data, tok='space', lang='en')
Tokenize text data.
Five tokenizers are supported:
- “space”: split on whitespace
- “char”: split into characters
- “13a”: Official WMT tokenization
- “zh”: Chinese tokenization (see the sacrebleu doc)
- “moses”: Moses tokenizer (you can specify the language). Uses sacremoses.
Parameters:
Returns: Tokenized data
Return type:
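The two simplest modes, “space” and “char”, can be sketched without any external dependency. The code below is an illustration only, not dynn's implementation: `tokenize_sketch` is a hypothetical name, accepting nested lists is an assumption modeled on `lowercase`, and whether the real “char” mode keeps whitespace characters is not stated in this page. The “13a”, “zh” and “moses” modes rely on sacrebleu and sacremoses and are left out.

```python
def tokenize_sketch(data, tok="space"):
    """Hypothetical stand-in covering only the "space" and "char" modes."""
    if isinstance(data, list):
        # Assumption: nested lists are tokenized element by element.
        return [tokenize_sketch(item, tok=tok) for item in data]
    if tok == "space":
        # str.split() with no argument splits on any run of whitespace.
        return data.split()
    if tok == "char":
        # Every character, including spaces, becomes its own token here.
        return list(data)
    raise ValueError(f"Tokenizer not covered by this sketch: {tok!r}")


print(tokenize_sketch("the cat  sat"))
print(tokenize_sketch("cat", tok="char"))
```

Raising on unknown modes keeps the sketch honest about what it does not cover.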