Meta built the Llama 4 models using a mixture-of-experts (MoE) architecture, which is one way around the constraints of running huge AI models. Think of MoE as a large team of specialized workers: instead of everyone working on everything, only the relevant experts are activated for a specific task.
For example, Llama 4 Maverick has 400 billion total parameters, but only 17 billion are active at once across one of its 128 experts. Similarly, Scout contains 109 billion total parameters, but only 17 billion are active at once across one of its 16 experts. This design can reduce the computing power needed to run the model, since smaller portions of the neural network weights are active simultaneously. A minimal sketch of this routing idea follows below.
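To make the routing idea concrete, here is a minimal sketch of a top-1 MoE layer in Python. Everything in it (the gating matrix, the toy dimensions, NumPy as the framework) is an illustration of the general technique, not Meta's actual implementation, whose routing details the article does not cover; only the 16-expert count mirrors the Scout figure above.

```python
import numpy as np

NUM_EXPERTS = 16  # mirrors the Scout configuration cited above
HIDDEN_DIM = 64   # toy size; the real model is vastly larger

rng = np.random.default_rng(0)
gate_weights = rng.standard_normal((HIDDEN_DIM, NUM_EXPERTS))
experts = [rng.standard_normal((HIDDEN_DIM, HIDDEN_DIM))
           for _ in range(NUM_EXPERTS)]

def moe_layer(token_vec):
    # The gate scores every expert, but only the top-scoring one runs,
    # so most of the layer's weights stay idle for this token.
    scores = token_vec @ gate_weights
    chosen = int(np.argmax(scores))
    return experts[chosen] @ token_vec, chosen

token = rng.standard_normal(HIDDEN_DIM)
output, expert_id = moe_layer(token)
print(f"token routed to expert {expert_id} of {NUM_EXPERTS}")
```

The point of the sketch is the compute saving: for each token, one expert's weights are multiplied while the other fifteen sit unused, which is why a 109-billion-parameter model can run with only 17 billion parameters active at a time.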
Llama 4’s reality check arrives quickly
Current AI models have a relatively limited short-term memory. In AI, a context window works somewhat like that memory, determining how much information a model can process at once. AI language models such as Llama typically handle this memory as chunks of data called tokens, which can be whole words or fragments of longer words. Large context windows allow an AI model to process long documents, large code bases, and lengthy conversations.
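As a rough worked example of what a context window constrains, the sketch below estimates whether a text fits in a given window. The four-characters-per-token figure is a common rule of thumb for English text, not a property of Llama's actual tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb, not Llama's real tokenizer

def fits_in_window(text: str, window_tokens: int) -> bool:
    # Estimate the token count and compare it against the window size.
    est_tokens = len(text) // CHARS_PER_TOKEN
    return est_tokens <= window_tokens

book = "x" * 2_000_000  # ~2 MB of text, roughly 500,000 tokens
print(fits_in_window(book, 128_000))     # False: overflows a 128k window
print(fits_in_window(book, 10_000_000))  # True: fits a 10M window
```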
Despite Meta's promotion of Llama 4 Scout's 10 million token context window, developers have so far found that using even a fraction of that amount is challenging due to memory limitations. Simon Willison reported on his blog that third-party services providing access, such as Groq and Fireworks, limit Scout's context to just 128,000 tokens. Another provider, Together AI, offered 328,000 tokens.
Evidence suggests that accessing larger contexts requires immense resources. Willison pointed to Meta's example notebook ("build_with_llama_4"), which states that running a 1.4 million token context requires eight high-end Nvidia H100 GPUs.
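A back-of-envelope calculation hints at why: in a transformer, the key/value cache grows linearly with context length. The sketch below uses hypothetical model dimensions (the article does not give Scout's layer count or attention head sizes) purely to show the scaling.

```python
def kv_cache_gib(context_tokens: int, n_layers: int,
                 n_kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    # Keys and values are both cached (factor of 2) for every layer,
    # KV head, and token; 2 bytes per element assumes fp16/bf16.
    total_bytes = (2 * context_tokens * n_layers
                   * n_kv_heads * head_dim * bytes_per_elem)
    return total_bytes / 2**30

# Hypothetical dimensions, chosen only to illustrate the growth:
print(kv_cache_gib(128_000,   n_layers=48, n_kv_heads=8, head_dim=128))
print(kv_cache_gib(1_400_000, n_layers=48, n_kv_heads=8, head_dim=128))
```

Under these assumed dimensions, the cache alone grows from roughly 23 GiB at 128,000 tokens to roughly 256 GiB at 1.4 million, on top of the model weights themselves, which is consistent with needing a multi-GPU machine.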
Willison documented his own testing problems. When he asked Llama 4 Scout, via the OpenRouter service, to summarize a long online conversation (around 20,000 tokens), the result was not useful. He described it as "complete junk output" that devolved into repetitive loops.
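For readers who want to try the same kind of test, the sketch below sends a long transcript to Scout through OpenRouter's OpenAI-compatible API. The model slug and the input file name are assumptions made for illustration; check OpenRouter's model catalog for the current identifier.

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the standard
# OpenAI client works with a different base URL and API key.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

with open("conversation.txt") as f:  # ~20,000 tokens of discussion
    transcript = f.read()

resp = client.chat.completions.create(
    model="meta-llama/llama-4-scout",  # assumed slug; verify on OpenRouter
    messages=[{
        "role": "user",
        "content": "Summarize this discussion:\n\n" + transcript,
    }],
)
print(resp.choices[0].message.content)
```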