I am trying to create a simple `SentencePieceBPETokenizer` without training it:
```python
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]

test_tokenizer = SentencePieceBPETokenizer(unk_token="<unk>", replacement="▁")
test_tokenizer.add_special_tokens(special_tokens)

print(test_tokenizer.token_to_id("<unk>"))  # prints 0
```
This makes sense so far. Then I wrap the above tokenizer with `PreTrainedTokenizerFast`:
```python
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=test_tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)
wrapped_tokenizer.push_to_hub("tokenizer-test", private=True)
```
and use it like this:
```python
from transformers import AutoTokenizer

tokenizer_test = AutoTokenizer.from_pretrained("myHFusername/tokenizer-test")
print(tokenizer_test.unk_token_id)  # prints 0
```
So far this is correct too. But since the tokenizer does not contain any tokens other than the special tokens, every word of a paragraph should be tokenized to the unknown token when I encode it:
```python
print(tokenizer_test.encode_plus("I am a boy.", add_special_tokens=True))
```
However, it raises an exception:
```
Exception: Unk token `<unk>` not found in the vocabulary
```
Then I double-check whether the unknown token exists:
```python
print(tokenizer_test.vocab)
```
which prints:
```
{'<unk>': 0, '<cls>': 2, '<pad>': 1, '<mask>': 4, '<sep>': 3}
```
The unknown token exists!
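That said, I suspect `tokenizer_test.vocab` also includes the tokens registered via `add_special_tokens` as "added tokens", which live on top of the underlying BPE model rather than inside its vocabulary. As a sanity check (assuming `backend_tokenizer` exposes the underlying `tokenizers.Tokenizer`), the model-level vocabulary can be inspected directly:

```python
# Inspect the BPE model's own vocabulary, excluding added tokens.
# My expectation is that this prints {}, which would explain why the
# model-level lookup for "<unk>" fails during encoding.
print(tokenizer_test.backend_tokenizer.get_vocab(with_added_tokens=False))
```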
My question is: how do I resolve this issue?
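For what it's worth, one workaround I am considering (just a sketch, based on my assumption that the BPE model's own vocabulary must contain `<unk>`) is to seed the vocabulary with the special tokens at construction time instead of only calling `add_special_tokens`:

```python
from tokenizers import SentencePieceBPETokenizer

# Assumption: passing an explicit vocab that already contains "<unk>"
# (plus an empty merge list) gives the BPE model a model-level unknown
# token, so encoding should map every out-of-vocabulary word to <unk>.
vocab = {"<unk>": 0, "<pad>": 1, "<cls>": 2, "<sep>": 3, "<mask>": 4}
merges = []

test_tokenizer = SentencePieceBPETokenizer(
    vocab=vocab,
    merges=merges,
    unk_token="<unk>",
    replacement="▁",
)
print(test_tokenizer.token_to_id("<unk>"))  # prints 0
```

But I am not sure this is the intended approach.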