I have a text file. I used the Hugging Face tokenizers library to run the WordPiece and BPE tokenizer algorithms on it. I trained both on the file and got the vocabulary sizes. Contrary to my expectation, the WordPiece vocabulary turned out to be larger. The result was:
WordPiece vocabulary size: 17555
Byte-Pair Encoding vocabulary size: 16553
Does anyone know the reason for this result? Here is my code for the WordPiece algorithm:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a WordPiece tokenizer and train it on the text file
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train([path], trainer)
And here is the code for Byte-Pair Encoding:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a BPE tokenizer and train it on the same text file
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
bpe_tokenizer.pre_tokenizer = Whitespace()
bpe_tokenizer.train([path], bpe_trainer)
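The vocabulary sizes quoted above can be read from the trained tokenizers with get_vocab_size(); a minimal sketch, assuming the two tokenizer objects from the code above and the same path variable:

# Compare the trained vocabularies (special tokens are included by default)
wordpiece_size = tokenizer.get_vocab_size()
bpe_size = bpe_tokenizer.get_vocab_size()
print("WordPiece vocabulary size:", wordpiece_size)
print("Byte-Pair Encoding vocabulary size:", bpe_size)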