I am trying to create a simple `SentencePieceBPETokenizer` without training it:
```python
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]

test_tokenizer = SentencePieceBPETokenizer(unk_token="<unk>", replacement="▁")
test_tokenizer.add_special_tokens(special_tokens)

print(test_tokenizer.token_to_id("<unk>"))  # prints 0
```
This makes sense so far. Then I wrap the above tokenizer with `PreTrainedTokenizerFast`:
```python
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=test_tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)
wrapped_tokenizer.push_to_hub("tokenizer-test", private=True)
```
and use it like this:
```python
from transformers import AutoTokenizer

tokenizer_test = AutoTokenizer.from_pretrained("myHFusername/tokenizer-test")
print(tokenizer_test.unk_token_id)  # prints 0
```
So far this is correct too. But since the tokenizer does not contain any tokens other than the special tokens, every word of a paragraph should be tokenized to the unknown token when I encode it:
```python
print(tokenizer_test.encode_plus("I am a boy.", add_special_tokens=True))
```
However, it raises an exception:
```
Exception: Unk token `<unk>` not found in the vocabulary
```
Then I double-check whether the unknown token exists:
```python
print(tokenizer_test.vocab)
```
which prints:
```
{'<unk>': 0, '<cls>': 2, '<pad>': 1, '<mask>': 4, '<sep>': 3}
```
The unknown token exists!
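That said, I suspect `tokenizer_test.vocab` also includes the tokens registered via `add_special_tokens` as "added tokens", which live on top of the underlying BPE model rather than inside its vocabulary. As a sanity check (assuming `backend_tokenizer` exposes the underlying `tokenizers.Tokenizer`), the model-level vocabulary can be inspected directly:

```python
# Inspect the BPE model's own vocabulary, excluding added tokens.
# My expectation is that this prints {}, which would explain why the
# model-level lookup for "<unk>" fails during encoding.
print(tokenizer_test.backend_tokenizer.get_vocab(with_added_tokens=False))
```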
My question is: how do I resolve this issue?
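For what it's worth, one workaround I am considering (just a sketch, based on my assumption that the BPE model's own vocabulary must contain `<unk>`) is to seed the vocabulary with the special tokens at construction time instead of only calling `add_special_tokens`:

```python
from tokenizers import SentencePieceBPETokenizer

# Assumption: passing an explicit vocab that already contains "<unk>"
# (plus an empty merge list) gives the BPE model a model-level unknown
# token, so encoding should map every out-of-vocabulary word to <unk>.
vocab = {"<unk>": 0, "<pad>": 1, "<cls>": 2, "<sep>": 3, "<mask>": 4}
merges = []

test_tokenizer = SentencePieceBPETokenizer(
    vocab=vocab,
    merges=merges,
    unk_token="<unk>",
    replacement="▁",
)
print(test_tokenizer.token_to_id("<unk>"))  # prints 0
```

But I am not sure this is the intended approach.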