Large Language Models (LLMs) like ChatGPT and GPT-4
generate human-like text by processing and predicting language patterns. However,
these models don’t work directly with words or sentences. Instead, they rely on
“tokens”—the numerical building blocks that represent text.
What is a Token?
In the context of LLMs, a token is a small chunk of text. Tokens can represent a whole word, part of a word, or even punctuation marks, depending on the model and the language.
For example:
1. The word "Hello" might be one token.
2. The word "Fantastic!" could be split into multiple tokens (like "Fant", "astic", "!").
Let’s experiment with OpenAI Tokenizer (https://platform.openai.com/tokenizer).
When I type the word "Hello" into the OpenAI Tokenizer, it is represented as one token in GPT-4o, GPT-3.5, and GPT-3 (Legacy).
For the word "Fantastic!":
a. GPT-4o and GPT-3.5 divided it into 2 tokens: "Fantastic", "!"
b. GPT-3 (Legacy) divided it into 4 tokens: "F", "ant", "astic", "!"
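You can reproduce this experiment programmatically with OpenAI's open-source tiktoken library (the sketch below assumes it is installed, e.g. via pip install tiktoken). The encoding names o200k_base, cl100k_base, and r50k_base are tiktoken's identifiers for the vocabularies used by GPT-4o, GPT-3.5/GPT-4, and GPT-3 (Legacy) respectively.

import tiktoken

# tiktoken's encoding names for each model family:
#   o200k_base  -> GPT-4o
#   cl100k_base -> GPT-3.5 / GPT-4
#   r50k_base   -> GPT-3 (Legacy)
for name in ["o200k_base", "cl100k_base", "r50k_base"]:
    enc = tiktoken.get_encoding(name)
    for text in ["Hello", "Fantastic!"]:
        ids = enc.encode(text)                   # text -> token IDs
        pieces = [enc.decode([i]) for i in ids]  # decode each ID back to text
        print(f"{name}: {text!r} -> {len(ids)} token(s) {pieces}, IDs {ids}")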
Each token corresponds to a unique number that the model can use for computation. The unique number, or ID, assigned to each token is consistent across contexts within a specific model's vocabulary. This means that each token has a fixed ID in that model, regardless of the sentence or context in which it appears.
For instance, if the token "SQL" is assigned the ID 10430 in a model, every time "SQL" appears, the model will use 10430 to represent it.
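A short sketch of this consistency, again using tiktoken (the ID 10430 above is illustrative; the code prints whatever IDs the real vocabulary assigns):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5 / GPT-4 vocabulary

# Encoding is a pure lookup: the same text always maps to the same IDs
# within a given vocabulary.
print(enc.encode("SQL"))
print(enc.encode("SQL") == enc.encode("SQL"))  # True: the mapping is fixed

# Decoding uses the same fixed table in reverse.
print(enc.decode(enc.encode("SQL")))           # 'SQL'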
However, there are a few points to keep in mind.
1. Different Tokenizers, Different IDs: If you switch between different tokenization algorithms or models (e.g., between GPT-3 and BERT), the ID for "SQL" may be different, as each model has its own vocabulary and tokenization process (see the sketch after this list).
2. Subword Tokenization: Some words might break down differently based on the model’s tokenization scheme. For example, in certain contexts, a tokenizer might split a word into subwords (like "fantastic" becoming "fant" and "astic"), each with its own unique ID.
3. Dynamic Contextual Embeddings: Although token IDs remain fixed, the actual meaning that a model derives from a token can vary depending on surrounding context, as the embeddings (internal representations) of these tokens adapt to the context of the sentence.
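Point 1 is easy to demonstrate with tiktoken by comparing two OpenAI vocabularies (BERT's tokenizer lives in other libraries, so this sketch contrasts the GPT-3 (Legacy) and GPT-4o encodings instead):

import tiktoken

text = "SQL"

# Each encoding has its own vocabulary, so the same text can get
# different IDs, and even a different number of tokens.
legacy = tiktoken.get_encoding("r50k_base")   # GPT-3 (Legacy)
modern = tiktoken.get_encoding("o200k_base")  # GPT-4o
print("r50k_base :", legacy.encode(text))
print("o200k_base:", modern.encode(text))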
Why Do LLMs Need Tokenization?
LLMs are built using neural networks that operate on numbers rather than plain text. Tokenization acts as the bridge that converts human-readable language into numbers that the model can process. By breaking text down into tokens, models can better recognize patterns and relationships between words or phrases, enabling them to predict the next word (or token) in a sentence.
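To make that bridge concrete, here is a toy sketch (with made-up sizes and IDs, not a real model's) of what happens after tokenization: the integer IDs index into a learned embedding table, and those vectors are what the neural network actually consumes.

import numpy as np

# Hypothetical toy sizes; real models use vocabularies of roughly
# 50k-200k tokens and much larger embedding dimensions.
vocab_size, embed_dim = 50_000, 8
embeddings = np.random.rand(vocab_size, embed_dim)  # stands in for learned weights

token_ids = [15496, 2159]        # hypothetical IDs produced by a tokenizer
vectors = embeddings[token_ids]  # the numeric input the network operates on
print(vectors.shape)             # (2, 8): one vector per token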
How Tokenization Works in LLMs
The tokenization process can vary by language model, but it typically involves a tokenizer that (see the toy sketch after this list):
1. Breaks down text into tokens.
2. Maps each token to a unique number (its ID).
3. Passes these numerical representations to the model for processing.
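The toy sketch below walks through these three steps with a deliberately tiny, hypothetical vocabulary; real LLM tokenizers use subword algorithms such as Byte Pair Encoding (BPE), but the pipeline is the same.

# Step 2's lookup table: token -> unique ID (all IDs here are made up).
VOCAB = {"hello": 0, "world": 1, "!": 2, "<unk>": 3}

def tokenize(text: str) -> list[str]:
    # Step 1: break the text into tokens (naive whitespace split for this toy).
    return text.lower().replace("!", " !").split()

def encode(text: str) -> list[int]:
    # Step 2: map each token to its ID; unknown tokens fall back to <unk>.
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokenize(text)]

# Step 3: the resulting integer sequence is what gets passed to the model.
print(tokenize("Hello world!"))  # ['hello', 'world', '!']
print(encode("Hello world!"))    # [0, 1, 2]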
In the LLM’s training and operation phases, the model is presented with tokens as sequences of numbers. The model uses these sequences to learn the patterns of language and make predictions for the next likely token in a sequence.
Reference
OpenAI Tokenizer, https://platform.openai.com/tokenizer