For any NLP task in deep learning, the first step is preprocessing the text data into numbers!

In recent years, almost all DL packages have started to provide their own APIs for text preprocessing. However, each one has its own subtle differences which, if not understood correctly, will lead to improper data preparation and thus skew model training.

When I resumed my hobby in DL with Transformers + TensorFlow 2.0, I came across different APIs doing the same text tokenization across the TensorFlow ecosystem tutorials.

From the days of writing our own tokenizers and encoders/decoders, we now have APIs that can simplify our work a lot. However, care should be taken while using such APIs. Questions worth asking include:

How do you want the text to be split?
How does the tokenizer handle punctuation and special characters?
How are out-of-vocabulary (OOV) words handled?
Do you want to use WordPiece tokenization?
Does the tokenizer/encoder support character-level encoding?
How is the vocabulary length calculated? Does it include the PAD and OOV tokens?
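
To make these questions concrete, here is a minimal pure-Python sketch of a word- and character-level encoder. It is not any TensorFlow API: the names `PAD`, `OOV`, `tokenize`, `build_vocab`, and `encode` are hypothetical, and the choices made here (lowercasing, punctuation as separate tokens, ID 0 for padding, ID 1 for OOV) are assumptions for illustration. Real libraries may decide each of these differently, which is exactly where the subtle differences come from.

```python
import re

# Hypothetical reserved tokens; real APIs may name or index these differently.
PAD, OOV = "<pad>", "<oov>"


def tokenize(text, char_level=False):
    """Split text into tokens (assumption: lowercase everything first)."""
    text = text.lower()
    if char_level:
        # Character level: every character, including spaces, is a token.
        return list(text)
    # Word level: runs of word characters, with each punctuation mark
    # kept as its own token instead of being dropped.
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(texts, char_level=False):
    """Map tokens to integer IDs, reserving 0 for PAD and 1 for OOV."""
    vocab = {PAD: 0, OOV: 1}
    for text in texts:
        for token in tokenize(text, char_level):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab, char_level=False):
    """Convert text to IDs; unseen tokens fall back to the OOV ID."""
    return [vocab.get(token, vocab[OOV]) for token in tokenize(text, char_level)]


word_vocab = build_vocab(["Hello, world!"])
word_ids = encode("Hello there, world!", word_vocab)  # "there" is unseen

char_vocab = build_vocab(["ab a"], char_level=True)
char_ids = encode("abc", char_vocab, char_level=True)  # "c" is unseen

print(len(word_vocab), word_ids)
print(len(char_vocab), char_ids)
```

Note that `len(word_vocab)` here counts the PAD and OOV entries. An API that reports vocabulary size without them would be off by two when sizing an embedding layer, so it is worth checking which convention a given tokenizer follows.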

Choosing the right API for the task, with multiple options out there, is not an easy job, as each API is built for a specific purpose to fit with its counterparts. Some work natively with Tensors, some with TensorFlow Datasets, some offer character-level support, etc.

This is a quick, skimmable reference blog for word- and character-level encoding in TensorFlow, and it explains the pitfall with the TensorFlow Datasets tokenizer.

Check out the gist @ https://gist.github.com/Mageswaran1989/70fd26af52ca4afb86e611f84ac83e97#file-text_preprocessing-ipynb