I followed this tutorial to create a custom Tokenizer based on the SentencePieceBPE
class, with a custom pre-tokenizer class. The new Tokenizer was trained on a dataset and saved to the HuggingFace Hub without any problems.
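For context, the training step looked roughly like the sketch below (the corpus, vocabulary size, and special tokens here are placeholders, and my custom pre-tokenizer is omitted):

from tokenizers import SentencePieceBPETokenizer

corpus = ["first example sentence", "second example sentence"]  # placeholder corpus

sp_tokenizer = SentencePieceBPETokenizer()
sp_tokenizer.train_from_iterator(
    corpus,
    vocab_size=32000,  # placeholder value
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
sp_tokenizer.save("my_tokenizer.json")  # later wrapped by MyCustomTokenizerFast and pushed to the Hub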
I can load my custom Tokenizer class without a problem, using code like this:
tokenizer = MyCustomTokenizerFast.from_pretrained('myusername/mytokenizer')
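Encoding a quick sample also looks fine; for example (the sentence below is just a placeholder):

encoded = tokenizer("这是一个测试句子")  # placeholder sentence
print(encoded.input_ids)                 # token IDs from my custom vocabulary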
Specifically, let's take a pre-trained model as an example. My sequence-to-sequence pipeline originally uses the pre-trained checkpoint fnlp/bart-base-chinese for both the tokenizer and the model.
from transformers import BertTokenizer, BartForConditionalGeneration

checkpoint = 'fnlp/bart-base-chinese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)
If I change the tokenizer to my custom Tokenizer class, it seems the model has to be modified / re-trained as well.
checkpoint = 'fnlp/bart-base-chinese'
tokenizer = MyCustomTokenizerFast.from_pretrained('myusername/mytokenizer')
model = BartForConditionalGeneration.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)  # this one has to change?!
As I understand it, a tokenizer converts a corpus into a particular set of token IDs, and the model that uses the tokenizer only understands that set of token IDs: its embedding layer and output head are sized and ordered according to the tokenizer's vocabulary. If I change the tokenizer to my custom one, the model therefore has to change / be re-trained as well.
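To make the concern concrete, here is the kind of mismatch I expect (just a sketch, reusing MyCustomTokenizerFast and the same checkpoint and tokenizer names as above):

from transformers import BartForConditionalGeneration

checkpoint = 'fnlp/bart-base-chinese'
tokenizer = MyCustomTokenizerFast.from_pretrained('myusername/mytokenizer')  # my custom class from above
model = BartForConditionalGeneration.from_pretrained(checkpoint)

print(len(tokenizer))           # vocabulary size of my custom tokenizer
print(model.config.vocab_size)  # vocabulary size the checkpoint was pre-trained with

# The embedding matrix can at least be resized to the new vocabulary...
model.resize_token_embeddings(len(tokenizer))
# ...but the IDs now refer to different tokens than during pre-training,
# which is why I suspect re-training / fine-tuning is unavoidable.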
Is my understanding correct? If so, how should I rebuild the model to make it "compatible" with my custom Tokenizer?
Thanks in advance.