- May 30, 2023
- Posted by: [email protected]
Researchers at Meta AI may have found a solution to the “tokenization” problem in GPT models. Meta AI recently released a pre-print research paper introducing the “Megabyte” framework, a ground-breaking approach to building generative pre-trained transformer (GPT) systems. Andrej Karpathy, a renowned AI scientist who previously worked for Tesla, has called the architecture “promising.” Megabyte’s main purpose is to process massive amounts of data, such as photos, novels and video files, without relying on tokenization.
Tokenization works much like file compression: in GPT models, it converts bytes of data into tokens, which the transformer then processes to produce output tokens. This lets AI systems handle longer strings of data as sequences of numerical values. For instance, the sentence “my favorite color is red” would be tokenized into the token ID sequence “3666, 4004, 3124, 318, 2266, 13” for processing in OpenAI’s ChatGPT.
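To see what this looks like in practice, here is a minimal sketch using OpenAI’s open-source tiktoken library with its GPT-2 encoding. The exact IDs depend on which encoding a given model uses, so the printed numbers may differ from the example above.

```python
# Minimal tokenization sketch using OpenAI's open-source tiktoken library
# (pip install tiktoken). The GPT-2 byte-pair encoding is used here for
# illustration; other models use different encodings and different IDs.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "my favorite color is red"
token_ids = enc.encode(text)                 # text -> list of integer token IDs
print(token_ids)                             # the numeric sequence the model actually sees
print(enc.decode(token_ids))                 # decoding round-trips back to the text
print(len(text.encode("utf-8")), "bytes ->", len(token_ids), "tokens")
```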
Even with tokenization, today’s cutting-edge systems have limited capacity. GPT-3.5, for example, can manage slightly more than 4,000 tokens, or around 3,000 words, while GPT-4 can handle approximately 32,000 tokens, or roughly 24,000 words. Megabyte, by contrast, does away with tokenization altogether and instead uses a multiscale decoder architecture capable of end-to-end modeling of sequences of more than 1 million bytes.
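The core idea can be sketched in a few dozen lines. The following is a heavily simplified, hypothetical PyTorch rendering of that multiscale approach, not Meta’s implementation: raw bytes are grouped into fixed-size patches, a larger “global” transformer models the patch sequence, and a small “local” transformer predicts the bytes within each patch. Class names and hyperparameters are illustrative only.

```python
# Simplified, hypothetical sketch of Megabyte's multiscale idea in PyTorch.
# This is NOT Meta's code: bytes are grouped into patches, a large global
# transformer models the patch sequence, and a small local transformer
# predicts the bytes inside each patch. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class MegabyteSketch(nn.Module):
    def __init__(self, patch_size=8, d_global=512, d_local=128, vocab=256):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(vocab, d_local)
        # Patch embedding: concatenate a patch's byte embeddings and project.
        self.patch_proj = nn.Linear(patch_size * d_local, d_global)
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True),
            num_layers=4)
        self.global_to_local = nn.Linear(d_global, d_local)
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True),
            num_layers=2)
        self.to_logits = nn.Linear(d_local, vocab)

    def forward(self, byte_ids):       # (batch, seq_len), seq_len divisible by patch_size
        b, t = byte_ids.shape
        p = self.patch_size
        x = self.byte_embed(byte_ids)                         # (b, t, d_local)
        patches = x.view(b, t // p, p * x.size(-1))
        g_in = self.patch_proj(patches)                       # (b, num_patches, d_global)
        # Shift patch inputs right by one so patch k only conditions on patches < k.
        g_in = torch.cat([torch.zeros_like(g_in[:, :1]), g_in[:, :-1]], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(g_in.size(1))
        g_out = self.global_model(g_in, mask=causal)
        # Add each patch's global context to its byte embeddings, then run the
        # small local model over every patch independently (causal within a patch).
        ctx = self.global_to_local(g_out).unsqueeze(2)        # (b, num_patches, 1, d_local)
        local_in = (x.view(b, t // p, p, -1) + ctx).reshape(b * (t // p), p, -1)
        local_mask = nn.Transformer.generate_square_subsequent_mask(p)
        h = self.local_model(local_in, mask=local_mask)
        # Position i of each patch is trained to predict the following byte
        # (targets shifted by one during training).
        return self.to_logits(h).view(b, t, -1)

model = MegabyteSketch()
logits = model(torch.randint(0, 256, (2, 64)))                # 64 raw bytes per sequence
print(logits.shape)                                           # torch.Size([2, 64, 256])
```

Because the expensive global model runs only once per patch rather than once per byte, most of the compute is spent on a much shorter sequence, which is what allows end-to-end modeling at the million-byte scale.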
Standard English text encodings such as ASCII use 8 bits per character, so each character occupies one byte of data. By avoiding tokenization, Megabyte can therefore process 1 million bytes of data, roughly the equivalent of a text document containing 750,000 words, a 3,025% increase over the roughly 24,000 words GPT-4 can handle. To put that in perspective, where GPT-4 can process about 10 long-form news articles in a single prompt, Megabyte could analyze Leo Tolstoy’s War and Peace in its entirety along with two other average-length novels.
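As a quick sanity check, the percentage follows directly from the word-count estimates quoted above:

```python
# Rough arithmetic behind the percentage quoted above, using the article's
# own word-count estimates (approximations, not measured limits).
gpt4_words = 24_000        # ~32,000 tokens for GPT-4
megabyte_words = 750_000   # ~1,000,000 bytes handled without tokenization
increase = (megabyte_words - gpt4_words) / gpt4_words * 100
print(f"{increase:,.0f}% increase")   # 3,025% increase
```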
In addition to text processing, Meta’s Megabyte model has also demonstrated impressive performance on ImageNet tests and audio-file benchmarks. It matches or surpasses existing byte-based transformer models such as DeepMind’s Perceiver AR while using only half the compute.
This research has far-reaching implications. Because of context-length limits and the significant time and energy required for training, tokenization has been a limiting factor. By removing it, AI models could be trained to better support non-English languages that are hard to represent with standard 8-bit encodings. That could further democratize these technologies, allowing cryptocurrency trading bots and decentralized autonomous organization (DAO) tooling to be built in native languages around the world. It would also improve the ability of models like ChatGPT to handle multimedia files such as images, video and audio with roughly the same time and energy cost as text-based tasks.