The previous article, "Google Released Gemma 4 (Part 1): Don't Rush to Local Deployment; You Need to Understand the Model and Protocol First," covered the release and the new protocol. This article covers only the local experience itself; the final installment, "Google Released Gemma 4 (Part 3): Why Running Out of VRAM Causes a Cliff, and Why a Mac Works as a Fallback but Is Slow," picks up from there.
Why I ran 26B A4B first
The reason is actually quite basic: it’s about hardware reality.
31B is certainly powerful; the official leaderboards and early community feedback have been very strong. But run it on a machine like a 3060 with 12 GB of VRAM, and the question immediately shifts from "Is it powerful?" to "Is it worth the wait?" Once the model weights and cache start offloading to system memory, generation speed can easily collapse. I will cover this in detail in the third article.
26B A4B is different.
Although its total parameters are 25.2B, only about 3.8B are actually activated per token. Simply put, it’s the one in Gemma 4 that feels most “designed for local users.”
So, if your machine is similar to mine—an older consumer-grade card—here is a straightforward way to decide:
- If you want to see benchmark scores, go with 31B.
- If you plan to use it locally in the long term, start with 26B A4B.
The Five-Pointed Star Problem: Someone Finally Understood My Trap This Time
I have always had a rather basic test question: asking the model to write a piece of C++ code that outputs a five-pointed star to the console.
This problem might look like a joke, but it’s actually quite tricky. Many models tend to interpret it as a pure mathematical drawing problem, and then they start talking about coordinates, trigonometric functions, and loops, ultimately outputting a mess of characters in the plain text console that is completely unreadable.
Last year, many small-parameter open-source models failed at this point.
Gemma 4's first response this time genuinely surprised me. It didn't rush to pretend it understood; instead, it identified the constraints first and offered this judgment:
Since drawing a five-pointed star with precise geometric structure directly using mathematical logic in a plain text console (Console) is very complex (involving coordinate system transformation and pixel filling), the most classic and visually effective method is to use ASCII Art.
To put it plainly, it first understood the environmental constraint behind the question: the console is not a canvas, and a character grid is not a pixel grid. You have to figure out how to output a five-pointed star stably before you can talk about mathematical drawing. Its first version then simply hardcoded the star as a string. That move was exactly right. It wasn't about showing off derivations; it was about solving the problem correctly first.
What surprised me even more is that it could continue further.
If it had only stopped at ASCII Art, this problem would have only shown that it recognized the trap. What really impressed me was that when I continued to ask it to perform mathematical calculations afterward, it didn’t falter; instead, it was able to proceed logically, mapping the geometric relationships onto a character grid and finally calculating the pentagram. This demonstrates not just “it can write some code,” but rather that it understands this problem has two layers:
- The first layer: What is the most stable answer for the console?
- The second layer: If you insist on doing calculations, how do you reduce a geometric problem onto a character grid?
Previously, many local small models would jump straight to the second layer and fail at the first.
Gemma 4 reversed that order this time: it identified the boundaries first, then chose the method. I think that is worth more than any single benchmark score.
This Coding Improvement Isn’t Just About Being “Smarter”
The reason this five-pointed-star problem is so useful is that it doesn't just test syntax. What it truly tests is:
- The ability to first understand the output environment.
- The ability to admit when an intuitive solution is inappropriate.
- The ability to switch between achieving “optimal presentation effect” and fulfilling “user-mandated calculation.”
Once a model can solve this type of problem correctly, it indicates that the model is starting to act more like a development assistant capable of handling real-world constraints, rather than just one that completes code snippets.
This is also why my first impression of Gemma 4 is much better than last year's batch of smaller open models. Many of those were good at chatting, completing, and getting by, but they showed their limits as soon as a problem had even mild boundary conditions. This time, Google has at least addressed that weakness.
For Translation, You Can't Simply Say "Gemma 4 Replaces Everything"
Here is a key point: people have long used Gemma for local translation. The move to Gemma 4 isn't that linear, though, because Google separately released TranslateGemma in February 2026, and it was built on the Gemma 3 architecture.
What does this mean?
It means that if your existing local translation pipeline is already working smoothly, you don’t necessarily have to switch everything over to Gemma 4 in the short term. Especially for scenarios with very specific goals—like only needing stable multilingual conversion—a dedicated translation model still has its value.
However, if what you want is a single local model that can reasonably handle translation, Q&A, code, and general text tasks, then a more versatile route like 26B A4B is smoother.
It might not be the most specialized one, but it’s more like choosing the “good enough main model to get running first” option in a real-world scenario.
Why I Don’t Want to Keep Praising 31B in the Second Article
It’s not that 31B is bad; quite the opposite, it’s too good, which makes it easy to get distracted.
If you keep focusing on the leaderboard performance of 31B, it’s easy to write this article as “Strong models are truly strong.” But what local deployment fears most is exactly that kind of talk. Because what truly determines whether you will continue using it every day isn’t the leaderboard, but rather:
- Is the startup too slow?
- Does the response speed drop severely?
- Does long context quickly ruin the experience?
- Can your own machine actually handle it?
On a machine like a 3060 12GB, these practical issues matter far more than the leaderboard. So my conclusion for this second article is simple: 31B is worth looking at; 26B A4B is worth using. For local players, those two statements are not the same thing.
My Initial Local Conclusion
If I had to summarize my experience from this test in one sentence, it would be:
Gemma 4 finally feels like a local model that understands real-world scenarios.
Especially the 26B A4B. It might not be the model best for showing off on leaderboards, but under real-world constraints—like older hardware, consumer-grade GPUs, and long-term local use—it actually feels more like the true workhorse choice.
At least with this five-star test, Google has passed.
References
- Gemma 4: Byte for byte, the most capable open models
- Gemma 4 model card
- google/gemma-4-26B-A4B-it on Hugging Face
- Gemma 3: The Developer Guide
- TranslateGemma: A new family of open translation models
- Gemma 4 31B on FoodTruck Bench
Writing Notes
Original Prompt
$blog-writer Google has released the Gemma4 model after a year. As usual, I'm trying to deploy it locally on that old desktop with an unupgraded NVIDIA 3060 12GB graphics card. This time I caught the initial release, but I couldn't find an upgraded version of the commonly used Gemma3. However, there is a similar version called GemmaE4b. Please search and introduce all the models released this time, what the abbreviation letters mean in them, and then search for online reviews about Gemma4. The key point is that Google updated the model's protocol this time, and the restrictions for users are fewer. The biggest surprise: my usual test question—write a piece of C++ code to output a five-pointed star in the console. Last year's smaller open-source models couldn't handle this problem, but Google managed it this time. In the first version, it gave an answer that completely exceeded my expectations; it knew about my trap. Outputting a five-pointed star to the console is very tricky, so it directly hardcoded a string for the five-pointed star, which was outputted directly to the console. This is the original text: Because drawing a five-pointed star with precise geometric structure using mathematical logic in a pure text console (Console) is very complex (involving coordinate system transformation and pixel filling), the most classic and visually best method is to use ASCII Art. After I forced it to perform calculations, it also succeeded through mathematical calculation, successfully drawing the five-pointed star. Previously, I often used Gemma4 for local translation tasks; many multilingual versions of historical articles on current blogs are like this. The model used for local testing: gemma-4-26b-a4b. The 31b version is indeed too slow. But looking at the reviews, the 31b effect is very good, and its ranking performance is excellent. 
Also, while browsing forums, I realized that if the VRAM is insufficient and the model parameters are increased, the token generation speed will drop drastically. Can you explain why? Macs don't have this problem because they use unified memory; please explain the technical reason. Furthermore, if speed is required, only an NVIDIA card with large VRAM will do. The Mac solution can serve as a fallback, but it cannot match the speed. This content is very extensive; please evaluate whether it should be split into a series of articles.
Writing Outline Summary
- The second article will only retain the local experience, and will no longer summarize the first article or explain VRAM principles for the third one.
- First provide the hard judgment on “why run 26B A4B first,” then expand with the five-star test.
- The five-star question is treated as the main axis because it better illustrates the boundary sense in coding scenarios than benchmark scores.
- The translation task will be given its own section, to avoid making Gemma 4 seem like a linear successor to all previous workflows.