For SentencePieceBPETokenizer, Exception: Unk token `<unk>` not found in the vocabulary

I am trying to create a simple SentencePieceBPETokenizer without training it.

    from tokenizers import SentencePieceBPETokenizer

    special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]
    test_tokenizer = SentencePieceBPETokenizer(unk_token="<unk>", replacement="▁")
    test_tokenizer.add_special_tokens(special_tokens)
    print(test_tokenizer.token_to_id("<unk>"))  # prints 0

So far this makes sense. Next, I wrap the tokenizer above in PreTrainedTokenizerFast:

    from transformers import PreTrainedTokenizerFast

    wrapped_tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=test_tokenizer,
        unk_token="<unk>",
        pad_token="<pad>",
        cls_token="<cls>",
        sep_token="<sep>",
        mask_token="<mask>",
        padding_side="left",
    )
    wrapped_tokenizer.push_to_hub('tokenizer-test', private=True)

and use it like this:

    from transformers import AutoTokenizer

    tokenizer_test = AutoTokenizer.from_pretrained("myHFusername/tokenizer-test")
    print(tokenizer_test.unk_token_id)  # prints 0

This is still correct. But when I try to encode a sentence, since the tokenizer's vocabulary contains nothing but the special tokens, every word should be tokenized to the unknown token.

    print(tokenizer_test.encode_plus("I am a boy.", add_special_tokens=True))

However, it raises an exception:

    Exception: Unk token `<unk>` not found in the vocabulary

Then I double-check whether the unknown token exists in the vocabulary:

    print(tokenizer_test.vocab)

which prints:

    {'<unk>': 0, '<cls>': 2, '<pad>': 1, '<mask>': 4, '<sep>': 3}

The unknown token does exist!
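For extra context, I suspect the exception refers to the BPE model's own vocabulary rather than the wrapper-level vocab printed above, since the model itself was never trained. A quick way to check this (a sketch; backend_tokenizer exposes the underlying tokenizers.Tokenizer, whose get_vocab accepts a with_added_tokens flag):

    # Model-level vocabulary only, excluding added special tokens.
    # If this prints {}, the BPE model has no entry for <unk> itself,
    # even though the wrapper reports all five special tokens.
    print(tokenizer_test.backend_tokenizer.get_vocab(with_added_tokens=False))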

My question is: how do I resolve this issue?
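For reference, one direction I have not verified (a sketch, assuming the SentencePieceBPETokenizer constructor accepts vocab and merges arguments) would be to seed the model vocabulary with the special tokens at construction time, so that <unk> exists inside the BPE model itself rather than only as an added token:

    from tokenizers import SentencePieceBPETokenizer

    # Untested sketch: place the special tokens directly in the BPE model's
    # vocabulary instead of only registering them via add_special_tokens.
    vocab = {"<unk>": 0, "<pad>": 1, "<cls>": 2, "<sep>": 3, "<mask>": 4}
    test_tokenizer = SentencePieceBPETokenizer(
        vocab=vocab,
        merges=[],          # no merge rules, since nothing was trained
        unk_token="<unk>",
        replacement="▁",
    )
    print(test_tokenizer.token_to_id("<unk>"))  # should still print 0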

