
Compare vocabulary size of WordPiece and BPE tokenizer algorithm


I have a text file. I used the Hugging Face tokenizers library to run the WordPiece and BPE tokenizer algorithms on it. I trained both tokenizers on the file and compared the resulting vocabulary sizes. Contrary to my expectation, the WordPiece vocabulary came out larger. The result was:

WordPiece vocabulary size: 17555

Byte-Pair Encoding vocabulary size: 16553

Does anyone know the reason for this result? Here is my code for the WordPiece algorithm:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# WordPiece model with an unknown-token placeholder
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train([path], trainer)

And here is the code for Byte-Pair Encoding:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# BPE model with the same special tokens and pre-tokenizer as above
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
bpe_tokenizer.pre_tokenizer = Whitespace()
bpe_tokenizer.train([path], bpe_trainer)
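For reference, the vocabulary sizes quoted above can be read back from the trained tokenizers with get_vocab_size(). A minimal sketch of the comparison (assuming both tokenizers have already been trained as shown, and that path points to the same text file):

# Compare the learned vocabulary sizes of the two trained tokenizers
wordpiece_size = tokenizer.get_vocab_size()
bpe_size = bpe_tokenizer.get_vocab_size()
print("WordPiece vocabulary size:", wordpiece_size)
print("Byte-Pair Encoding vocabulary size:", bpe_size)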
