
You are working on an NLP pipeline and need to convert raw text into units a model can process. Before training or using a language model, you need to decide how text should be split and represented.
What is tokenization?
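Tokenization is the process of splitting raw text into discrete units called tokens (words, subwords, characters, or bytes) and mapping each token to an integer ID from a fixed vocabulary. It is the bridge between raw strings and model inputs, and the choice of scheme affects vocabulary coverage, sequence length, handling of rare words, and downstream model quality.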
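
To make the effect on sequence length and rare-word handling concrete, here is a minimal sketch that compares a word-level and a character-level tokenizer on a toy sentence. The function names (`word_tokenize`, `char_tokenize`, `build_vocab`, `encode`), the `<unk>` placeholder, and the example sentences are illustrative assumptions and do not come from any particular library. Production systems typically use learned subword schemes such as BPE rather than either of these.

```python
import re

UNK = "<unk>"  # hypothetical placeholder for tokens missing from the vocabulary


def word_tokenize(text):
    # Lowercase, then split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def char_tokenize(text):
    # Every character, including spaces, becomes its own token.
    return list(text.lower())


def build_vocab(tokens):
    # Assign integer IDs in order of first appearance; ID 0 is reserved for UNK.
    vocab = {UNK: 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    # Map each token to its ID, falling back to the UNK ID for unseen tokens.
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


train_text = "The cat sat on the mat."
word_vocab = build_vocab(word_tokenize(train_text))
char_vocab = build_vocab(char_tokenize(train_text))

new_text = "The dog sat."
word_ids = encode(word_tokenize(new_text), word_vocab)
char_ids = encode(char_tokenize(new_text), char_vocab)

print(word_ids)       # [1, 0, 3, 6]  ("dog" is unseen, so the whole word becomes <unk>)
print(len(char_ids))  # 12            (longer sequence; only "d" and "g" are unseen)
```

The word-level encoding is short but loses the unseen word entirely, while the character-level encoding keeps most of the information at the cost of a sequence three times as long. Subword tokenizers aim for a middle ground between these two extremes.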