
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



Generally speaking, a tokenizer translates a user's sentence (natural language)
into a specific sequence of tokens that the machine learning model can understand.
Different models have different vocabularies and tokenization methods; there is no single standard.

More details are here:
https://github.com/openai/tiktoken
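For illustration, a minimal sketch using the tiktoken library (the encoding name
"gpt2" is one of the encodings tiktoken ships; nothing here is specific to whisper):

    import tiktoken

    # Load the GPT-2 byte-pair encoding and turn a sentence into token ids.
    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode("Hello world")   # e.g. [15496, 995]
    print(ids)
    print(enc.decode(ids))            # round-trips back to "Hello world"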

The lengthy .tiktoken files look like a sort of token vocabulary to me, but I don't have time to verify.
GPT-2 is a large language model; gpt2.tiktoken cannot mean anything other than GPT-2's tokenizer vocabulary.
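If it helps the review: each line in those files appears to be a base64-encoded
token (a byte sequence) followed by its integer rank. A minimal sketch of reading
such a file back into a vocabulary, assuming that format:

    import base64

    # Parse a .tiktoken file: "<base64 token bytes> <rank>" per line.
    ranks = {}
    with open("whisper/assets/gpt2.tiktoken") as f:
        for line in f:
            if not line.strip():
                continue
            tok_b64, rank = line.split()
            ranks[base64.b64decode(tok_b64)] = int(rank)

    # The first sample line quoted below, "IQ== 0", decodes to the byte
    # string b"!" with rank 0, i.e. "!" is token 0 in GPT-2's vocabulary.
    print(base64.b64decode("IQ=="))   # b'!'

So regenerating the files would amount to writing that mapping back out
(base64 of the token bytes plus the rank); where the original ranks come from
is the part I cannot answer.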


On Tue, 2023-06-06 at 20:29 +0200, Petter Reinholdtsen wrote:
> Only replying to the list for now in the hope someone can help answer
> this question from the ftpmasters.
> 
> [Thorsten Alteholz]
> > can you please explain how I can recreate the files *.tiktoken?
> > There seem to be some sources missing ...
> 
> I do not know much about the inner workings of whisper, so I do not
> really know the answer to this question.  As far as I can tell, the
> files in question are whisper/assets/gpt2.tiktoken and
> whisper/assets/multilingual.tiktoken in the source.  I believe these are
> loaded by get_tokenizer() in whisper/tokenizer.py, and that the files in
> question, which are ascii files starting like this, are tiktoken
> tokenizer rule files:
> 
> IQ== 0
> Ig== 1
> Iw== 2
> JA== 3
> JQ== 4
> Jg== 5
> Jw== 6
> KA== 7
> KQ== 8
> Kg== 9
> 
> I have no idea what tiktoken tokenizer rule files really are, nor how
> they are created.  Their size makes me suspect they are generated, as
> they consist of over 50k lines following the structure shown.
> 
> Does anyone understand more about this tiktoken stuff and can shed some
> light on the topic?
> 

