Release of Claude 3.0

March 4 saw the release of Claude 3, the newest large language model (LLM) from Anthropic. Claude 3 competes with other LLMs such as GPT-4 in ChatGPT from OpenAI.

Claude 3.0 comes in three versions: Haiku (the smallest and fastest model, which is yet to be released as of this writing and is intended for corporate chatbots and similar tasks), Sonnet (the free version, similar to GPT-3.5 from OpenAI), and Opus (paid). Opus is the largest model and scores best on most benchmarks. It is not clear whether the benchmarks in the release notes compare Claude 3 Opus against GPT-4 or the newer GPT-4 Turbo. The benchmarks (and release notes) can be found here. Take a closer look; it makes for interesting reading.

An exciting advancement is the larger context window in Claude 3. Anthropic’s models already have some of the largest context windows. The context window, measured in tokens, limits the combined size of the input (which includes prompts and other data) and the model’s response. As an example, a larger context window means that we can upload larger documents in our prompts. The model can then draw on these documents when returning a response. Note that the very largest context windows are only available to select customers.
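To make the idea of a token budget concrete, here is a minimal Python sketch that checks whether a prompt plus an uploaded document would fit in a context window while leaving room for the response. Everything in it is an assumption for illustration: the function names are made up, and the four-characters-per-token rule is only a rough heuristic for English text. A real tokenizer will give different counts, and the actual window size depends on the model and your plan.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def fits_in_context(prompt: str, document: str,
                    context_window: int, reserved_for_response: int) -> bool:
    """Return True if the input plus the space reserved for the response
    stays within the context window (all sizes in tokens)."""
    used = estimate_tokens(prompt) + estimate_tokens(document)
    return used + reserved_for_response <= context_window
```

For example, `fits_in_context(question, report_text, context_window=100_000, reserved_for_response=2_000)` would tell us whether a hypothetical `report_text` can be sent in one piece or needs to be split up first.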

A known problem with large input windows is the needle-in-a-haystack problem. It refers to an LLM’s inability to recall information from the middle of large inputs. Companies have devised tests for this problem. Such a needle-in-a-haystack test verifies a model’s recall ability by inserting a target sentence (the needle) into a corpus of one or more documents (the haystack) and asking a question that can only be answered by using information in the needle. Company officials note surprise at how good the new Claude 3 model is at this test. Claude can actually recognize that it is being tested in this way and can mention this in its response.
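The test described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used by Anthropic or anyone else: the function names are hypothetical, the simple substring check stands in for a proper evaluation, and the call to the model itself is left out.

```python
def build_haystack(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def build_prompt(haystack: str, question: str) -> str:
    """Combine the long document and the question into a single prompt."""
    return f"{haystack}\n\nQuestion: {question}\nAnswer using only the document above."


def recalled(response: str, expected: str) -> bool:
    """Crude check: does the model's response contain the expected fact?"""
    return expected.lower() in response.lower()
```

In practice one would repeat this for several depths and haystack lengths, send each prompt to the model, and record for which combinations `recalled` returns `True`.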

Language Processing Units™

Proprietary large language models (LLMs) such as GPT-4 and open-source models such as Llama 2 are trained on parallel processors such as the graphics processing units (GPUs) provided by Nvidia. Similar processing architectures are used during inference. Here, inference refers to the situation where the trained model is called upon to generate text in response to a prompt.

While parallel processors used in training are well-suited to the task of optimizing billions or even trillions of parameters, a different architecture is required for speedy inference. We have all experienced the slow typed response of OpenAI’s ChatGPT, Microsoft’s Copilot, and now Google’s Gemini interfaces.

Along comes the company Groq with its language processing units (LPUs). Groq claims to make inference 10-100 times faster. Try this for yourself at the Groq home page. At the time of writing, Groq allows testers on its site to choose between Mixtral 8x7B-32k and Llama 2 70B-4k.

A quicker response time by LLMs greatly enhances their usability. It feels more natural and interactive. This CNN video gives us a quick glimpse.

We will certainly see faster inference in the future. Perhaps we will even see such chips embedded in our own computers. LLMs are very large, though, so we will also need bigger storage and more memory.