
After creating a Custom Tokenizer using HF Tokenizers library, how to create a model that fits the Tokenizer?


I followed this tutorial to create a custom tokenizer based on the SentencePieceBPE class, with a custom pre-tokenizer class. The tokenizer was successfully trained on a dataset and saved to the Hugging Face Hub.

I can load my custom Tokenizer class without a problem, using code like this:

tokenizer = MyCustomTokenizerFast.from_pretrained('myusername/mytokenizer')
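Once loaded, it behaves like any other fast tokenizer. A minimal sanity check (the sample sentence below is just an illustration, not from my real data) looks like this:

encoded = tokenizer('今天天气很好')  # encode an arbitrary sample sentence
print(encoded['input_ids'])          # token IDs drawn from the custom tokenizer's vocabulary
print(len(tokenizer))                # size of that vocabulary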

To make this concrete, let's take a pre-trained model as an example. Originally, my sequence-to-sequence pipeline uses the pre-trained checkpoint fnlp/bart-base-chinese for both the tokenizer and the model.

from transformers import BertTokenizer, BartForConditionalGeneration

checkpoint = 'fnlp/bart-base-chinese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)

If I change the tokenizer to my custom tokenizer class, it seems the model has to be modified/re-trained as well.

checkpoint = 'fnlp/bart-base-chinese'
tokenizer = MyCustomTokenizerFast.from_pretrained('myusername/mytokenizer')
model = BartForConditionalGeneration.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)  # this one has to change?!

My understanding is that a tokenizer maps a corpus onto a set of token IDs, and the model that uses the tokenizer only understands that particular set of IDs. So if I swap in my custom tokenizer, the model has to change / be re-trained as well.
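To make the concern concrete, here is a minimal sketch of what I believe the mismatch looks like, reusing the names from the snippets above; the resize_token_embeddings call is only my guess at the usual first step, not a confirmed solution:

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('fnlp/bart-base-chinese')
tokenizer = MyCustomTokenizerFast.from_pretrained('myusername/mytokenizer')

# The checkpoint's embedding matrix is sized for the original vocabulary.
print(model.config.vocab_size)  # vocabulary size baked into the checkpoint
print(len(tokenizer))           # vocabulary size of my custom tokenizer

# Resizing makes the shapes match, but the embedding rows no longer correspond
# to the custom tokenizer's IDs in any trained way, so re-training or
# fine-tuning would presumably still be needed.
model.resize_token_embeddings(len(tokenizer))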

Is my understanding correct? If so, how should I rebuild the model so that it is "compatible" with my custom tokenizer?

Thanks in advance.

