
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



Generally speaking, a tokenizer translates a user's sentence (natural language)
into a specific sequence of tokens that the machine learning model can understand.
Different models have different vocabularies and tokenization methods; there is no single standard.

More details are here:
https://github.com/openai/tiktoken
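For illustration, a minimal sketch using the tiktoken library (the encoding name
"gpt2" is one of the encodings tiktoken ships; nothing here is specific to whisper):

    import tiktoken

    # Load the GPT-2 byte-pair encoding and turn a sentence into token ids.
    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode("Hello world")   # e.g. [15496, 995]
    print(ids)
    print(enc.decode(ids))            # round-trips back to "Hello world"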

The lengthy .tiktoken files look like a sort of token vocabulary to me, but I don't have time to verify.
GPT-2 is a large language model; gpt2.tiktoken cannot mean anything other than GPT-2's tokenizer vocabulary.
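If it helps the review: each line in those files appears to be a base64-encoded
token (a byte sequence) followed by its integer rank. A minimal sketch of reading
such a file back into a vocabulary, assuming that format:

    import base64

    # Parse a .tiktoken file: "<base64 token bytes> <rank>" per line.
    ranks = {}
    with open("whisper/assets/gpt2.tiktoken") as f:
        for line in f:
            if not line.strip():
                continue
            tok_b64, rank = line.split()
            ranks[base64.b64decode(tok_b64)] = int(rank)

    # The first sample line quoted below, "IQ== 0", decodes to the byte
    # string b"!" with rank 0, i.e. "!" is token 0 in GPT-2's vocabulary.
    print(base64.b64decode("IQ=="))   # b'!'

So regenerating the files would amount to writing that mapping back out
(base64 of the token bytes plus the rank); where the original ranks come from
is the part I cannot answer.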


On Tue, 2023-06-06 at 20:29 +0200, Petter Reinholdtsen wrote:
> Only replying to the list for now in the hope someone can help answer
> this question from the ftpmasters.
> 
> [Thorsten Alteholz]
> > can you please explain how I can recreate the files *.tiktoken?
> > There seem to be some sources missing ...
> 
> I do not know much about the inner workings of whisper, so I do not
> really know the answer to this question.  As far as I can tell, the
> files in question are whisper/assets/gpt2.tiktoken and
> whisper/assets/multilingual.tiktoken in the source.  I believe these are
> loaded by get_tokenizer() in whisper/tokenizer.py, and that the files in
> question, which are ascii files starting like this, are tiktoken
> tokenizer rule files:
> 
> IQ== 0
> Ig== 1
> Iw== 2
> JA== 3
> JQ== 4
> Jg== 5
> Jw== 6
> KA== 7
> KQ== 8
> Kg== 9
> 
> I have no idea what tiktoken tokenizer rule files really are, nor how
> they are created.  Their size makes me suspect they are generated, as
> they consist of over 50k lines following the structure shown.
> 
> Does anyone understand more about this tiktoken stuff and can shed some
> light on the topic?
> 

