Tokenisation turns a continuous signal into a normalized discrete vocabulary, e.g. the stock "went up a lot", "went up a little", or "stayed flat". This smooths out noise and makes it easier to match similar but not identical signals.
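To make that concrete, here is a minimal sketch of bucketing price changes into a handful of tokens. The function name `tokenize_returns`, the token labels, and the 0.5% / 2% thresholds are all made up for illustration; real cut-offs would have to be tuned per signal.

```python
def tokenize_returns(prices, small=0.005, large=0.02):
    """Map consecutive relative price changes to a small token vocabulary.

    `small` and `large` are hypothetical thresholds (0.5% and 2%).
    """
    tokens = []
    for prev, curr in zip(prices, prices[1:]):
        change = (curr - prev) / prev  # relative change between neighbours
        if change > large:
            tokens.append("UP_BIG")
        elif change > small:
            tokens.append("UP_SMALL")
        elif change < -large:
            tokens.append("DOWN_BIG")
        elif change < -small:
            tokens.append("DOWN_SMALL")
        else:
            tokens.append("FLAT")
    return tokens


prices = [100.0, 100.1, 103.0, 102.0, 99.0]
tokens = tokenize_returns(prices)
print(tokens)  # ['FLAT', 'UP_BIG', 'DOWN_SMALL', 'DOWN_BIG']
```

Small wiggles inside the `FLAT` band all collapse to the same token, which is the noise-smoothing effect described above.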
> We tokenize text because text isn't numbers.
Text is actually numbers. People tried inputting UTF-8 directly into transformers, but it doesn't work that well. Karpathy explains why:
https://www.youtube.com/watch?v=zduSFxRajkE
Text can be represented by numbers, but they aren't the same datatype. They don't support the same operations (addition, subtraction, multiplication, etc.).
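A quick illustration of that point, as a sketch rather than a proof: the UTF-8 bytes of a word are perfectly good integers, but doing arithmetic on them gives something with no meaningful relation to the original word.

```python
text = "cat"

# UTF-8 encoding gives one integer per byte for ASCII text.
codes = list(text.encode("utf-8"))
print(codes)  # [99, 97, 116]

# Adding 1 to each byte is valid arithmetic, but the result is unrelated text.
shifted = bytes(c + 1 for c in codes).decode("utf-8")
print(shifted)  # dbu
```

Numerically "cat" and "dbu" are as close as two three-letter strings can get, yet they have nothing in common as text.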
Interesting. Can you explain how this is superior and/or different from traditional DSP filters or other non-tokenization tricks in the signal processing field?
Traditional DSP filters still output a continuous signal. It's also a well-explored domain, so it's hard to imagine any low-hanging fruit there.
My intuition is the following: transformers work really well for text, so we could try turning a time series into a "story" (limited vocabulary) and see what happens.