tokenize

October 12, 2024

word embedding

Because deep neural network models cannot process raw text directly, we need to represent
our word as continuous valued vectors. This process is called embedding. embedding is
to convert non-numeric data into a format that neural networks can process.

Embedding ways:

use specific models like Word2Vec
use LLM itself

Tiktoken

Tiktoken is a rust implemented BPE algorithm.

Table of Contents

tokenize

word embedding

Tiktoken