Preprocessing functions

Useful functions for preprocessing data

dynn.data.preprocess.lowercase(data)

Lowercase text

Parameters:data (list,str) – Data to lowercase (either a string or a list [of lists..] of strings)
Returns:Lowercased data
Return type:list, str
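
The recursive behaviour described above (a string, or a list of lists of strings) can be illustrated with a minimal sketch. It does not import dynn and is not the library's implementation; the name lowercase_sketch is hypothetical and only mirrors the documented input and output shapes.

```python
def lowercase_sketch(data):
    """Lowercase a string, or every string in a (possibly nested) list."""
    if isinstance(data, str):
        return data.lower()
    # Recurse into lists so that nested lists keep their original shape
    return [lowercase_sketch(item) for item in data]


print(lowercase_sketch("Hello World"))
print(lowercase_sketch([["The", "Cat"], ["SAT"]]))
```

The nesting of the input is preserved, so a list of sentences comes back as a list of sentences.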
dynn.data.preprocess.normalize(data)

Normalize the data to have mean 0 and standard deviation 1

Parameters:data (list,np.ndarray) – data to normalize
Returns:Normalized data
Return type:list, np.ndarray
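
As a rough illustration of what normalizing to mean 0 and standard deviation 1 involves, here is a standard-library sketch for a flat list of numbers. It is not dynn's code: normalize_sketch is a hypothetical name, and the use of the population standard deviation is an assumption, since the documentation does not say which estimator is used.

```python
from statistics import fmean, pstdev


def normalize_sketch(values):
    """Shift and scale a flat list of numbers to mean 0, std 1."""
    mean = fmean(values)
    # Assumption: population standard deviation (divide by N, not N - 1)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


normalized = normalize_sketch([1.0, 2.0, 3.0, 4.0])
print(normalized)
print(fmean(normalized), pstdev(normalized))
```

Note that this sketch divides by zero when all values are identical; how the library handles that case is not documented here.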
dynn.data.preprocess.tokenize(data, tok='space', lang='en')

Tokenize text data.

Five tokenizers are supported:

  • “space”: split on whitespace
  • “char”: split into characters
  • “13a”: Official WMT tokenization
  • “zh”: Chinese tokenization (see the sacrebleu documentation)
  • “moses”: Moses tokenizer (you can specify the language).
    Uses sacremoses.
Parameters:
  • data (list, str) – String or list (of lists…) of strings.
  • tok (str, optional) – Tokenization. Defaults to “space”.
  • lang (str, optional) – Language (only useful for the moses tokenizer). Defaults to “en”.
Returns:

Tokenized data

Return type:

list, str
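
The two simplest tokenizers listed above, “space” and “char”, can be sketched in plain Python as follows. This is an illustration only, not dynn's implementation: tokenize_sketch is a hypothetical name, returning a list of tokens for each string is an assumption, and the “13a”, “zh” and “moses” tokenizers are left out because they depend on sacrebleu and sacremoses.

```python
def tokenize_sketch(data, tok="space"):
    """Tokenize a string, or every string in a (possibly nested) list."""
    if isinstance(data, str):
        if tok == "space":
            # Assumption: runs of whitespace act as a single separator
            return data.split()
        if tok == "char":
            # Every character, including spaces, becomes its own token
            return list(data)
        raise ValueError(f"Tokenizer not covered by this sketch: {tok}")
    # Recurse into lists so the nesting of the input is preserved
    return [tokenize_sketch(item, tok) for item in data]


print(tokenize_sketch("the cat  sat"))
print(tokenize_sketch(["ab", "c d"], tok="char"))
```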