<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Mac on Uncle Xiang&#39;s Notebook</title>
        <link>https://ttf248.life/en/tags/mac/</link>
        <description>Recent content in Mac on Uncle Xiang&#39;s Notebook</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Thu, 09 Apr 2026 15:45:31 +0800</lastBuildDate><atom:link href="https://ttf248.life/en/tags/mac/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Google has released Gemma 4 this time (III)</title>
        <link>https://ttf248.life/en/p/gemma-4-series-vram-cliff-and-mac-unified-memory/</link>
        <pubDate>Wed, 08 Apr 2026 23:56:20 +0800</pubDate>
        
        <guid>https://ttf248.life/en/p/gemma-4-series-vram-cliff-and-mac-unified-memory/</guid>
        <description>&lt;p&gt;While browsing the forum this time, what struck me most wasn&amp;rsquo;t which company released another leaderboard, but a very basic statement: &amp;ldquo;Not enough VRAM; no matter how large the parameters are, it&amp;rsquo;s useless.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Previously, I always read &amp;ldquo;the model is slow&amp;rdquo; as a compute problem. But the more I read, the clearer it became that often the GPU can do the math just fine; the data simply isn&amp;rsquo;t living in the right place. Once the memory path changes, token speed doesn&amp;rsquo;t just dip; it collapses.&lt;/p&gt;
&lt;p&gt;The previous two posts covered the preliminary issues. &lt;a class=&#34;link&#34; href=&#34;https://ttf248.life/en/p/gemma-4-series-models-and-license/&#34; &gt;First Post&lt;/a&gt; discussed the release and protocol, and &lt;a class=&#34;link&#34; href=&#34;https://ttf248.life/en/p/gemma-4-series-local-test-on-rtx-3060/&#34; &gt;Second Post&lt;/a&gt; explained why we should first look at &lt;code&gt;26B A4B&lt;/code&gt; on the 3060 12GB. This final post will only discuss how speed actually collapses.&lt;/p&gt;
&lt;h2 id=&#34;out-of-vram-and-not-just-a-little-bit-slow&#34;&gt;Out of VRAM, and Not Just a Little Bit Slow
&lt;/h2&gt;&lt;p&gt;When you break down the process of inference, there are two particularly critical components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model weights&lt;/li&gt;
&lt;li&gt;KV cache&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The weights define the model itself, while the KV cache records the state of previous tokens. The longer the context, the larger the KV cache. As long as both these parts can be stably kept in the GPU VRAM, generating a token basically involves reading data, performing calculations, and writing back results within high-bandwidth memory, and the speed is usually quite good.&lt;/p&gt;
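&lt;p&gt;A back-of-envelope sketch makes these two components concrete. The layer count, KV-head count, and head dimension below are illustrative placeholders, not Gemma 4 specs:&lt;/p&gt;

```python
# Rough estimate of the two components that must live in VRAM.
# Layer/head numbers are illustrative placeholders, not Gemma 4 specs.

def weights_gb(params_b, bits_per_weight):
    """Quantized weight size in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache: 2 tensors (K and V) per layer per token, fp16 elements."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

w = weights_gb(26, 4)   # a 26B model at 4-bit quantization
kv = kv_cache_gb(layers=48, kv_heads=8, head_dim=128, context_len=32768)
print(f"weights ~{w:.1f} GB, KV cache ~{kv:.2f} GB at a 32k context")
```

&lt;p&gt;Note how the KV cache alone grows linearly with context length: doubling to a 64k context doubles it again, which is exactly the pressure described above.&lt;/p&gt;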
&lt;p&gt;The real problem is when you run out of VRAM.
Once it doesn&amp;rsquo;t fit, the inference framework has to compromise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Storing some weights in system memory&lt;/li&gt;
&lt;li&gt;Or storing some KV cache in system memory&lt;/li&gt;
&lt;li&gt;Or even shuttling data back and forth between the CPU and GPU&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At this point, the issue is no longer about &amp;ldquo;a little bit of extra computation,&amp;rdquo; but rather &amp;ldquo;waiting for data with every single token.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;why-the-cliff-not-a-linear-slowdown&#34;&gt;Why the Cliff, Not a Linear Slowdown
&lt;/h2&gt;&lt;p&gt;When many people first encounter this pitfall, their intuition is that if the model goes from &lt;code&gt;14B&lt;/code&gt; to &lt;code&gt;31B&lt;/code&gt;, it will just be more than twice as slow, which they can tolerate.&lt;/p&gt;
&lt;p&gt;Reality is not like that.&lt;/p&gt;
&lt;p&gt;The real dividing line isn&amp;rsquo;t doubling the parameters; it&amp;rsquo;s whether or not the working set crosses the VRAM boundary.&lt;/p&gt;
&lt;p&gt;As long as it hasn&amp;rsquo;t crossed, increasing model size usually results in a predictable slowdown.&lt;/p&gt;
&lt;p&gt;Once it crosses, the system state changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Previously, it was an &amp;ldquo;on-chip closed loop&amp;rdquo; within VRAM.&lt;/li&gt;
&lt;li&gt;Now, it becomes &amp;ldquo;VRAM + System RAM + Bus Transfer.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Changing the path completely changes the cost. This is especially true during the decoding stage, which inherently proceeds token by token with a small batch size. At this point, what is most feared is having every single token wait for a batch of data to be transferred across devices.&lt;/p&gt;
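&lt;p&gt;One way to see why the cost changes so abruptly: during decode, tokens/s is roughly capped by bandwidth divided by the bytes touched per token. A sketch with assumed, illustrative bandwidth figures, not measurements:&lt;/p&gt;

```python
# Decode is memory-bound: each token must stream the resident weights,
# so tokens/s is roughly capped by bandwidth / bytes-per-token.
# Bandwidth figures below are illustrative assumptions, not measurements.

def ceiling_tok_s(bytes_per_token_gb, bandwidth_gb_s):
    return bandwidth_gb_s / bytes_per_token_gb

model_gb = 13.0    # e.g. a 26B model at 4-bit
vram_bw = 360.0    # GDDR6 on a 3060-class card, GB/s
pcie_bw = 25.0     # effective PCIe 4.0 x16, GB/s

# Everything resident in VRAM: the on-chip closed loop.
print(f"in VRAM: ~{ceiling_tok_s(model_gb, vram_bw):.0f} tok/s ceiling")

# 30% of the weights spill to system RAM: that slice crosses the bus
# on every token, and the slowest path sets the pace.
spilled_gb = 0.3 * model_gb
print(f"spilled: ~{ceiling_tok_s(spilled_gb, pcie_bw):.1f} tok/s ceiling")
```

&lt;p&gt;Crossing the boundary swaps a ~360 GB/s path for a ~25 GB/s path on part of every token, which is why the drop looks like a cliff rather than a slope.&lt;/p&gt;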
&lt;p&gt;So, you see a very typical phenomenon:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model isn&amp;rsquo;t unable to run.&lt;/li&gt;
&lt;li&gt;The GPU isn&amp;rsquo;t completely idle either.&lt;/li&gt;
&lt;li&gt;But the tokens/s rate is extremely poor.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is what people call the &amp;ldquo;cliff.&amp;rdquo; It&amp;rsquo;s not that the model suddenly got dumber; it&amp;rsquo;s that the memory path suddenly became inefficient.&lt;/p&gt;
&lt;h2 id=&#34;why-26b-a4b-is-friendlier-to-local-players&#34;&gt;Why &lt;code&gt;26B A4B&lt;/code&gt; is Friendlier to Local Players
&lt;/h2&gt;&lt;p&gt;This also explains why I consistently favored &lt;code&gt;26B A4B&lt;/code&gt; in the previous article.
Its total parameter count is certainly not small, but only about &lt;code&gt;3.8B&lt;/code&gt; parameters are activated per token. Under similar deployment conditions, the pressure it puts on compute and on the VRAM path is usually easier to control than with a dense model of comparable size.
This isn&amp;rsquo;t magic.
If your context window grows too long, or if quantization and framework support are lacking, it will struggle just like any other model. But compared to a dense model that immediately maxes out the entire VRAM, &lt;code&gt;26B A4B&lt;/code&gt; feels more like a realistic path for consumer-grade GPUs.
So the real question is often not whether &lt;code&gt;31B&lt;/code&gt; is weak, but which model is better suited to long-term coexistence with local hardware.&lt;/p&gt;
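&lt;p&gt;The arithmetic behind that preference fits in one line: per token, a dense model streams all its weights, while a MoE model streams only the shared layers plus the experts the router selects. A deliberately simplified sketch; it ignores routing overhead and assumes the active slice is already resident:&lt;/p&gt;

```python
# Per-token weight traffic, dense vs MoE. Illustrative only: ignores
# the router and assumes the active experts are already resident.

def per_token_read_gb(active_params_b, bits_per_weight):
    return active_params_b * 1e9 * bits_per_weight / 8 / 1e9

dense_31b = per_token_read_gb(31, 4)   # dense: every weight, every token
moe_a4b = per_token_read_gb(3.8, 4)    # MoE: ~3.8B activated per token
print(f"dense 31B: ~{dense_31b:.1f} GB/token; 26B A4B: ~{moe_a4b:.1f} GB/token")
```

&lt;p&gt;The full 26B of weights still has to fit somewhere, so this doesn&amp;rsquo;t remove the capacity problem; it mainly eases the per-token bandwidth pressure.&lt;/p&gt;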
&lt;h2 id=&#34;why-macs-seem-less-prone-to-running-out-of-memory&#34;&gt;Why Macs seem &amp;ldquo;less prone to running out of memory&amp;rdquo;
&lt;/h2&gt;&lt;p&gt;What is most different about a Mac is not the model, but the memory architecture.
&lt;code&gt;Apple silicon&lt;/code&gt; uses unified memory: the CPU and GPU share a single pool, unlike machines with discrete GPUs, where separate VRAM and main memory have to exchange data across a bus.
The biggest advantage of this structure is that many models which would &amp;ldquo;not fit in VRAM at all&amp;rdquo; on a discrete GPU end up in a different state on a Mac:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It might not be fast,&lt;/li&gt;
&lt;li&gt;but it can probably fit into memory in the first place.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, a Mac is less likely to hit that rigid &amp;ldquo;VRAM wall&amp;rdquo; right from the start.
This is why many people feel Macs are particularly suitable as a fallback for running large models locally: they solve the problem of whether the entire working set can be loaded into a single memory pool.&lt;/p&gt;
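&lt;p&gt;The difference can be stated as a simple fit check: on a discrete GPU the working set must fit in VRAM alone, while unified memory only asks that it fit in the shared pool. The working-set figure and headroom constant below are rough illustrative assumptions:&lt;/p&gt;

```python
# Fit check under the two memory models. The working-set figure and the
# OS/runtime headroom are rough illustrative assumptions.

def fits(working_set_gb, pool_gb, headroom_gb=2.0):
    return pool_gb - working_set_gb >= headroom_gb

working_set = 19.4   # e.g. quantized weights plus a long-context KV cache

print(fits(working_set, pool_gb=12.0))   # 12GB discrete VRAM: False
print(fits(working_set, pool_gb=32.0))   # 32GB unified-memory Mac: True
```

&lt;p&gt;Passing the fit check says nothing about speed; it only says the rigid &amp;ldquo;VRAM wall&amp;rdquo; is gone, which is exactly the point of the next section.&lt;/p&gt;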
&lt;h2 id=&#34;but-unified-memory-doesnt-mean-high-speed-for-you&#34;&gt;But Unified Memory Doesn&amp;rsquo;t Mean High Speed for You
&lt;/h2&gt;&lt;p&gt;This part has to be looked at separately.
What did unified memory solve?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The hard separation between dedicated VRAM and main memory.&lt;/li&gt;
&lt;li&gt;Many cases where models simply couldn&amp;rsquo;t fit onto cards with small VRAM.&lt;/li&gt;
&lt;li&gt;Some very ugly cross-device data transfers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And what didn&amp;rsquo;t it solve?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It didn&amp;rsquo;t change large model inference from being, at heart, massive memory reads.&lt;/li&gt;
&lt;li&gt;It didn&amp;rsquo;t make large models any less bandwidth-hungry.&lt;/li&gt;
&lt;li&gt;It didn&amp;rsquo;t automatically give every inference framework the mature ecosystem of CUDA.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the comfort of a Mac is not the same thing as the speed of a high-VRAM NVIDIA card.
A Mac offers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Quiet operation&lt;/li&gt;
&lt;li&gt;A large total memory pool&lt;/li&gt;
&lt;li&gt;A unified architecture&lt;/li&gt;
&lt;li&gt;The ability to finally run models that previously wouldn&amp;rsquo;t fit at all&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A high-VRAM NVIDIA card offers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A mature ecosystem&lt;/li&gt;
&lt;li&gt;The complete CUDA toolchain&lt;/li&gt;
&lt;li&gt;Speed that is much easier to push once the model and cache truly stay on the GPU&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;why-speed-ultimately-depends-on-nvidias-large-vram&#34;&gt;Why Speed Ultimately Depends on NVIDIA&amp;rsquo;s Large VRAM
&lt;/h2&gt;&lt;p&gt;This is not an emotional judgment, but a practical conclusion drawn after wrestling with local deployment.
If you are pursuing these things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A local assistant that is always running&lt;/li&gt;
&lt;li&gt;Multi-turn long conversations&lt;/li&gt;
&lt;li&gt;Long context windows&lt;/li&gt;
&lt;li&gt;A higher tokens/s rate&lt;/li&gt;
&lt;li&gt;As little waiting as possible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then a large-VRAM NVIDIA card is ultimately what matters, because what you are truly buying is the ability to keep the model and cache stably on the GPU.
A Mac also works, but it suits a different set of requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I want a machine that can load an even larger model in the first place.&lt;/li&gt;
&lt;li&gt;I accept average speed, but I don&amp;rsquo;t want to fight drivers and peripherals.&lt;/li&gt;
&lt;li&gt;I care more about the overall experience, power consumption, and noise.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both paths are reasonable; they just solve different problems.&lt;/p&gt;
&lt;h2 id=&#34;back-to-gemma-4-my-final-verdict&#34;&gt;Back to Gemma 4, My Final Verdict
&lt;/h2&gt;&lt;p&gt;This time, with &lt;code&gt;Gemma 4&lt;/code&gt;, I really feel that local open-source models have reached a stage where the hardware deserves serious discussion.
But just because the model gets stronger doesn&amp;rsquo;t mean the laws of physics loosen up accordingly.
No matter how strong the &lt;code&gt;31B&lt;/code&gt; model is, it will slow down if VRAM is insufficient.
Even with the practical &lt;code&gt;26B A4B&lt;/code&gt;, long contexts still put pressure on the system.
And while Apple&amp;rsquo;s unified memory is comfortable, it only makes it easier to &amp;ldquo;get something running&amp;rdquo;; it won&amp;rsquo;t give you the speed of a large VRAM CUDA card for free.
So, I&amp;rsquo;ll end with this rather blunt summary:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;speed&lt;/strong&gt;, prioritize NVIDIA with large VRAM.&lt;/li&gt;
&lt;li&gt;For a &lt;strong&gt;reliable fallback&lt;/strong&gt;, Mac&amp;rsquo;s unified memory is indeed comfortable.&lt;/li&gt;
&lt;li&gt;If you plan to run on hardware like a &lt;code&gt;3060 12GB&lt;/code&gt; long-term, don&amp;rsquo;t keep chasing dense, massive models; a path like &lt;code&gt;26B A4B&lt;/code&gt; is more realistic.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With this set of articles, I will wrap up here.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/docs/core/model_card_4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4 model card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.apple.com/newsroom/2023/10/apple-unveils-m3-m3-pro-and-m3-max-the-most-advanced-chips-for-a-personal-computer/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Apple unveils M3, M3 Pro, and M3 Max&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.apple.com/newsroom/2023/01/apple-unveils-m2-pro-and-m2-max-next-generation-chips-for-next-level-workflows/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Apple unveils M2 Pro and M2 Max&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.apple.com/macbook-pro/specs/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MacBook Pro Tech Specs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;writing-notes&#34;&gt;Writing Notes
&lt;/h2&gt;&lt;h3 id=&#34;original-prompt&#34;&gt;Original Prompt
&lt;/h3&gt;&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;$blog-writer Google has released the Gemma4 model after a year. As usual, I&#39;m trying to deploy it locally on that old desktop with an unupgraded NVIDIA 3060 12GB graphics card. This time I caught the initial release, but I couldn&#39;t find an upgraded version of the commonly used Gemma3. However, there is a similar version called GemmaE4b. Please first search and introduce all the models released this time, what the letters in their abbreviations mean, and then search for online reviews about Gemma4. The key point is that Google updated the model&#39;s protocol this time, so the restrictions for users are fewer. The biggest surprise: my usual test question—write a piece of C++ code to output a five-pointed star in the console. Last year&#39;s smaller parameter open-source models couldn&#39;t handle this problem, but Google managed it this time. In the first version, it gave the answer, completely exceeding my expectations. It knew about my trap; outputting a five-pointed star to the console is very difficult, so it directly hardcoded a string of a five-pointed star for direct console output. This is the original text: Because drawing a five-pointed star with precise geometric structure using mathematical logic in a pure text console (Console) is very complex (involving coordinate system transformation and pixel filling), the most classic and visually best method is to use ASCII Art. After I forced it to perform calculations, it also managed it through mathematical calculation, successfully drawing the five-pointed star. Previously, I often used Gemma4 for local translation tasks; many multilingual versions of historical articles on current blogs are like this. The model used for local testing: gemma-4-26b-a4b. The 31b version is indeed too slow. But looking at the reviews, the 31b effect is very good, and its ranking performance is excellent. 
While browsing forums, I realized that if the VRAM is insufficient and the model parameters are increased, the token generation speed will drop drastically. Can you explain why? Macs don&#39;t have this problem because they use unified memory; please explain the technical reason. Also, if speed is required, then an NVIDIA card with large VRAM is still necessary. The Mac solution can serve as a fallback, but it cannot match the speed. This content is very extensive; please evaluate whether it should be split into a series of articles.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;writing-outline-summary&#34;&gt;Writing Outline Summary
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;The third article will only retain the two threads: &amp;ldquo;Why speed collapses&amp;rdquo; and &amp;ldquo;Why Mac does not equal fast,&amp;rdquo; without revisiting the content of the previous two articles.&lt;/li&gt;
&lt;li&gt;Start by discussing VRAM limitations, then move to non-linear slowdowns; this logic is smoother than the previous version.&lt;/li&gt;
&lt;li&gt;Mac and Nvidia will be discussed as two different dimensions—one focusing on reliability/fallback, and one focusing on speed.&lt;/li&gt;
&lt;li&gt;The conclusion will only retain hardware judgment, without repeating the explanation of why series separation occurred.&lt;/li&gt;
&lt;/ul&gt;</description>
        </item>
        
    </channel>
</rss>
