Bug#1034091: RFP: whisper -- Robust Speech Recognition via Large-Scale Weak Supervision
- To: 1034091@bugs.debian.org
- Subject: Bug#1034091: RFP: whisper -- Robust Speech Recognition via Large-Scale Weak Supervision
- From: Petter Reinholdtsen <pere@hungry.com>
- Date: Wed, 21 Jun 2023 17:33:44 +0200
- Message-id: <sa6fs6kkexz.fsf@hjemme.reinholdtsen.name>
- Reply-to: Petter Reinholdtsen <pere@hungry.com>, 1034091@bugs.debian.org
- In-reply-to: <sa6v8husa2w.fsf@hjemme.reinholdtsen.name>
- References: <sa67cumhr96.fsf@hjemme.reinholdtsen.name> <sa6edop1ix0.fsf@hjemme.reinholdtsen.name> <sa6pm83vbbg.fsf@hjemme.reinholdtsen.name> <sa6v8husa2w.fsf@hjemme.reinholdtsen.name> <sa6jzymicr5.fsf@hjemme.reinholdtsen.name>
The upload to contrib / experimental was rejected by the ftpmasters with
the following comment:
> can you please explain how I can recreate the files *.tiktoken? There
> seem to be some sources missing ...
The two files in question are 50k lines of ASCII text that seem to be
some kind of index / vocabulary, and I have no idea how they were
created. I suspect they might be an artifact of the model training, but
I do not know. Does anyone have a clue to spare on how these were
created and how to rebuild them? If we lack the source to rebuild them,
I currently believe the whisper package will have to go to non-free, not
contrib. Any help figuring this out would be most appreciated.
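
For anyone who wants to look inside the files, here is a minimal sketch.
It assumes each line holds a base64-encoded byte sequence followed by an
integer rank, which as far as I can tell is the layout the tiktoken
library reads, but I have not verified this against the whisper files.
The function name and the sample lines are made up for illustration, and
this only helps with inspecting the content, not with rebuilding it.

```python
import base64


def parse_tiktoken_lines(lines):
    """Parse lines assumed to look like '<base64 token> <integer rank>'.

    Returns a dict mapping the decoded token bytes to the rank.
    The line layout is an assumption, not a confirmed specification.
    """
    ranks = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Hypothetical sample lines, not copied from the real files.
sample = ["IQ== 0", "Ig== 1", "aGVsbG8= 2"]
vocab = parse_tiktoken_lines(sample)
for token, rank in sorted(vocab.items(), key=lambda item: item[1]):
    print(rank, token)
```

To try it on a real file, pass an open file object instead of the sample
list, for example parse_tiktoken_lines(open(path)) with path pointing at
one of the *.tiktoken files.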
--
Happy hacking
Petter Reinholdtsen