I have a text file. I used the Hugging Face tokenizers library to run the WordPiece and BPE tokenizer algorithms on it. I trained both on the file and got the vocabulary sizes. Contrary to my expectation, the WordPiece vocabulary turned out to be larger. The result was:
WordPiece vocabulary size: 17555
Byte-Pair Encoding vocabulary size: 16553
Does anyone know the reason for this result? Here is my code for the WordPiece algorithm:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a WordPiece tokenizer and train it on the text file
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train([path], trainer)
And here is the code for Byte-Pair Encoding:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a BPE tokenizer and train it on the same text file
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
bpe_tokenizer.pre_tokenizer = Whitespace()
bpe_tokenizer.train([path], bpe_trainer)
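The vocabulary sizes quoted above can be read from the trained tokenizers with get_vocab_size(); a minimal sketch, assuming the two tokenizer objects from the code above and the same path variable:

# Compare the trained vocabularies (special tokens are included by default)
wordpiece_size = tokenizer.get_vocab_size()
bpe_size = bpe_tokenizer.get_vocab_size()
print("WordPiece vocabulary size:", wordpiece_size)
print("Byte-Pair Encoding vocabulary size:", bpe_size)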