<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Gemma on Uncle Xiang&#39;s Notebook</title>
        <link>https://ttf248.life/en/tags/gemma/</link>
        <description>Recent content in Gemma on Uncle Xiang&#39;s Notebook</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Thu, 09 Apr 2026 15:45:31 +0800</lastBuildDate><atom:link href="https://ttf248.life/en/tags/gemma/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Google has released Gemma 4 this time (III)</title>
        <link>https://ttf248.life/en/p/gemma-4-series-vram-cliff-and-mac-unified-memory/</link>
        <pubDate>Wed, 08 Apr 2026 23:56:20 +0800</pubDate>
        
        <guid>https://ttf248.life/en/p/gemma-4-series-vram-cliff-and-mac-unified-memory/</guid>
        <description>&lt;p&gt;While browsing the forum this time, what struck me most wasn&amp;rsquo;t which company released another leaderboard, but a very basic statement: &amp;ldquo;Not enough VRAM; no matter how large the parameters are, it&amp;rsquo;s useless.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Previously, I always understood &amp;ldquo;slow model&amp;rdquo; as a computational power issue. However, the more I read, the clearer it became that often, the problem isn&amp;rsquo;t that the GPU can&amp;rsquo;t compute it, but rather that the data cannot reside in the right place. Just by changing the memory path, the token speed doesn&amp;rsquo;t just slow down; it drops drastically.&lt;/p&gt;
&lt;p&gt;The previous two posts covered the preliminary issues. &lt;a class=&#34;link&#34; href=&#34;https://ttf248.life/en/p/gemma-4-series-models-and-license/&#34; &gt;First Post&lt;/a&gt; discussed the release and protocol, and &lt;a class=&#34;link&#34; href=&#34;https://ttf248.life/en/p/gemma-4-series-local-test-on-rtx-3060/&#34; &gt;Second Post&lt;/a&gt; explained why we should first look at &lt;code&gt;26B A4B&lt;/code&gt; on the 3060 12GB. This final post will only discuss how speed actually collapses.&lt;/p&gt;
&lt;h2 id=&#34;out-of-vram-and-not-just-a-little-bit-slow&#34;&gt;Out of VRAM, and Not Just a Little Bit Slow
&lt;/h2&gt;&lt;p&gt;When you break down the process of inference, there are two particularly critical components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model weights&lt;/li&gt;
&lt;li&gt;KV cache&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The weights define the model itself, while the KV cache records the state of previous tokens. The longer the context, the larger the KV cache. As long as both these parts can be stably kept in the GPU VRAM, generating a token basically involves reading data, performing calculations, and writing back results within high-bandwidth memory, and the speed is usually quite good.&lt;/p&gt;
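&lt;p&gt;To make &amp;ldquo;the longer the context, the larger the KV cache&amp;rdquo; concrete, here is a back-of-the-envelope sketch. The layer count, head count, and head dimension below are illustrative assumptions, not official Gemma 4 specifications:&lt;/p&gt;

```python
# Back-of-the-envelope KV-cache size for a decoder-only transformer.
# Architecture numbers here are illustrative assumptions, not official
# Gemma 4 specifications.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical mid-size model: 48 layers, 8 KV heads, head_dim 128, fp16.
size_gib = kv_cache_bytes(48, 8, 128, 32_768) / 2**30
print(f"KV cache at a 32k context: {size_gib:.1f} GiB")
```

&lt;p&gt;At fp16, this hypothetical configuration already needs about 6 GiB for the cache alone at a 32k context, before a single weight is stored. That is why long conversations eat VRAM even when the model itself fits.&lt;/p&gt;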
&lt;p&gt;The real problem is when you run out of VRAM.
Once it doesn&amp;rsquo;t fit, the inference framework has to compromise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Storing some weights in system memory&lt;/li&gt;
&lt;li&gt;Or storing some KV cache in system memory&lt;/li&gt;
&lt;li&gt;Or even shuttling data back and forth between the CPU and GPU&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At this point, the issue is no longer about &amp;ldquo;a little bit of extra computation,&amp;rdquo; but rather &amp;ldquo;waiting for data with every single token.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;why-the-cliff-not-a-linear-slowdown&#34;&gt;Why the Cliff, Not a Linear Slowdown
&lt;/h2&gt;&lt;p&gt;When many people first encounter this pitfall, their intuition is that if the model goes from &lt;code&gt;14B&lt;/code&gt; to &lt;code&gt;31B&lt;/code&gt;, it will just be more than twice as slow, which they can tolerate.&lt;/p&gt;
&lt;p&gt;Reality is not like that.&lt;/p&gt;
&lt;p&gt;The real dividing line isn&amp;rsquo;t doubling the parameters; it&amp;rsquo;s whether or not the working set crosses the VRAM boundary.&lt;/p&gt;
&lt;p&gt;As long as it hasn&amp;rsquo;t crossed, increasing model size usually results in a predictable slowdown.&lt;/p&gt;
&lt;p&gt;Once it crosses, the system state changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Previously, it was an &amp;ldquo;on-chip closed loop&amp;rdquo; within VRAM.&lt;/li&gt;
&lt;li&gt;Now, it becomes &amp;ldquo;VRAM + System RAM + Bus Transfer.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Changing the path completely changes the cost. This is especially true during the decoding stage, which inherently proceeds token by token with a small batch size. At this point, what is most feared is having every single token wait for a batch of data to be transferred across devices.&lt;/p&gt;
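&lt;p&gt;The cliff falls straight out of the arithmetic. During decode, roughly the whole active working set is read once per token, so tokens/s is bounded by bandwidth over bytes moved. The bandwidth figures below are ballpark assumptions (on-card VRAM around 360 GB/s for a 3060-class GPU, PCIe 4.0 x16 around 32 GB/s), not measurements:&lt;/p&gt;

```python
# Why the cliff: decode is bandwidth-bound, not compute-bound.
# Per generated token, roughly the whole active working set is read once.
# Bandwidth numbers are ballpark assumptions, not measurements.

def tokens_per_sec(active_bytes, bandwidth_bytes_per_sec):
    # Upper bound on decode speed if reads happen at this bandwidth.
    return bandwidth_bytes_per_sec / active_bytes

active = 8e9      # ~8 GB of weights + cache touched per token (assumed)
vram_bw = 360e9   # on-card VRAM path
pcie_bw = 32e9    # path once data spills to system RAM over the bus

print(f"all in VRAM : {tokens_per_sec(active, vram_bw):.0f} tok/s")
print(f"over PCIe   : {tokens_per_sec(active, pcie_bw):.0f} tok/s")
```

&lt;p&gt;Same model, same GPU, roughly an order of magnitude apart, purely because the bytes travel a different path. That is a cliff, not a slope.&lt;/p&gt;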
&lt;p&gt;So, you see a very typical phenomenon:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model isn&amp;rsquo;t unable to run.&lt;/li&gt;
&lt;li&gt;The GPU isn&amp;rsquo;t completely idle either.&lt;/li&gt;
&lt;li&gt;But the tokens/s rate is extremely poor.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is what people call the &amp;ldquo;cliff.&amp;rdquo; It&amp;rsquo;s not that the model suddenly got dumber; it&amp;rsquo;s that the memory path suddenly became inefficient.&lt;/p&gt;
&lt;h2 id=&#34;why-26b-a4b-is-friendlier-to-local-players&#34;&gt;Why &lt;code&gt;26B A4B&lt;/code&gt; is Friendlier to Local Players
&lt;/h2&gt;&lt;p&gt;This also explains why I consistently favored &lt;code&gt;26B A4B&lt;/code&gt; in the previous article. Its total parameter count is certainly not small, but only about &lt;code&gt;3.8B&lt;/code&gt; parameters are activated per token. Under similar deployment conditions, the pressure it puts on compute and on the VRAM path is therefore often easier to control than that of a dense model of comparable size.&lt;/p&gt;
&lt;p&gt;This isn&amp;rsquo;t magic. If the context window gets too long, or if quantization and framework support are inadequate, it will struggle just like any other model. But compared to a dense model that immediately maxes out the entire VRAM, &lt;code&gt;26B A4B&lt;/code&gt; feels like a more realistic path for consumer-grade GPUs. So the real question is often not whether &lt;code&gt;31B&lt;/code&gt; is weak, but which model is better suited for long-term coexistence with local hardware.&lt;/p&gt;
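&lt;p&gt;The per-token arithmetic makes the difference visible. The parameter counts echo this post (&lt;code&gt;25.2B&lt;/code&gt; total, about &lt;code&gt;3.8B&lt;/code&gt; active, versus a &lt;code&gt;31B&lt;/code&gt; dense model); the 4-bit byte cost is a simplifying assumption and the sketch ignores the KV cache and activations:&lt;/p&gt;

```python
# Dense vs MoE: how many GB of weights are streamed per generated token.
# Assumes ~4-bit quantization (0.5 bytes per parameter) and ignores the
# KV cache and activations; parameter counts echo the post.

BYTES_PER_PARAM = 0.5  # ~4-bit quantization (assumption)

def read_per_token_gb(active_params_billions):
    # 1e9 params at 0.5 bytes each is 0.5 GB, so billions map to GB directly.
    return active_params_billions * BYTES_PER_PARAM

# Dense 31B: every parameter participates in every token.
# 26B A4B: all 25.2B must still be resident somewhere,
# but only ~3.8B are actually read per token.
print("31B dense :", read_per_token_gb(31.0), "GB per token")
print("26B A4B   :", read_per_token_gb(3.8), "GB per token")
```

&lt;p&gt;Note the caveat in the comments: the sparse model still has to keep all &lt;code&gt;25.2B&lt;/code&gt; parameters resident somewhere. What it saves is per-token memory traffic, which is exactly the quantity the decode stage is starved for.&lt;/p&gt;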
&lt;h2 id=&#34;why-macs-seem-less-prone-to-running-out-of-memory&#34;&gt;Why Macs seem &amp;ldquo;less prone to running out of memory&amp;rdquo;
&lt;/h2&gt;&lt;p&gt;What&amp;rsquo;s most different about Mac is not the model, but the memory architecture.
&lt;code&gt;Apple silicon&lt;/code&gt; uses unified memory. The CPU and GPU share a single pool of memory, unlike dedicated graphics machines where there are separate VRAM and main memory that have to be moved across a bus.
The biggest advantage of this structure is that many models which would &amp;ldquo;not fit in VRAM at all&amp;rdquo; on dedicated graphics machines can take on a different state on Mac:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It might not be fast,&lt;/li&gt;
&lt;li&gt;But it can probably fit into memory in the first place.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, Macs are less likely to hit that rigid &amp;ldquo;VRAM wall&amp;rdquo; right from the start. This is why many people consider Macs particularly suitable as a fallback for running large models locally: they solve the problem of whether the entire working set can be loaded into a single memory pool.&lt;/p&gt;
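&lt;p&gt;A minimal &amp;ldquo;does it fit&amp;rdquo; check captures the contrast. The pool sizes are illustrative, and the 12.6 GB weight figure is just &lt;code&gt;25.2B&lt;/code&gt; parameters at an assumed 4-bit (0.5 bytes each):&lt;/p&gt;

```python
# Minimal fit check: does the whole working set stay in one memory pool?
# Pool capacities and the 4-bit weight estimate are illustrative assumptions.

def fits(weights_gb, kv_gb, pool_gb, overhead_gb=1.0):
    # True when weights + KV cache + runtime overhead stay inside the pool.
    return pool_gb - (weights_gb + kv_gb + overhead_gb) >= 0

# ~12.6 GB for 25.2B params at 4-bit, plus an assumed 4 GB KV cache.
weights, kv = 12.6, 4.0
print("3060 12GB        :", fits(weights, kv, 12.0))  # spills, hits the cliff
print("Mac 64GB unified :", fits(weights, kv, 64.0))  # fits, runs (slower)
```

&lt;p&gt;The discrete card fails the check and starts shuttling data across the bus; the unified-memory machine passes it and at least runs, which is precisely the &amp;ldquo;fallback&amp;rdquo; role described above.&lt;/p&gt;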
&lt;h2 id=&#34;but-unified-memory-doesnt-mean-high-speed-for-you&#34;&gt;But Unified Memory Doesn&amp;rsquo;t Mean High Speed for You
&lt;/h2&gt;&lt;p&gt;You must look at this part separately.
What did unified memory solve?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It solved the hard separation between dedicated VRAM and main memory.&lt;/li&gt;
&lt;li&gt;It solved many cases where models couldn&amp;rsquo;t fit at all on cards with small VRAM.&lt;/li&gt;
&lt;li&gt;It solved some very ugly cross-device data transfers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But what didn&amp;rsquo;t it solve?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It didn&amp;rsquo;t change large model inference from being &amp;ldquo;massive memory reads&amp;rdquo; into something else.&lt;/li&gt;
&lt;li&gt;It didn&amp;rsquo;t make large models suddenly stop being bandwidth-hungry.&lt;/li&gt;
&lt;li&gt;It didn&amp;rsquo;t automatically give every inference framework the mature ecosystem of CUDA.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the comfort of a Mac is not the same thing as the speed of a high-VRAM NVIDIA card. A Mac offers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Quiet operation&lt;/li&gt;
&lt;li&gt;A large total memory pool&lt;/li&gt;
&lt;li&gt;A unified architecture&lt;/li&gt;
&lt;li&gt;The ability to at least run models that previously wouldn&amp;rsquo;t fit&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A high-VRAM NVIDIA card offers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A mature ecosystem&lt;/li&gt;
&lt;li&gt;A complete CUDA toolchain&lt;/li&gt;
&lt;li&gt;Speed that is much easier to boost once the model and cache truly stay on the GPU&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;why-speed-ultimately-depends-on-nvidias-large-vram&#34;&gt;Why Speed Ultimately Depends on NVIDIA&amp;rsquo;s Large VRAM
&lt;/h2&gt;&lt;p&gt;This is not an emotional judgment, but a practical conclusion drawn after wrestling with local deployment.
If you are pursuing these things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A local assistant that is always running&lt;/li&gt;
&lt;li&gt;Multi-turn long conversations&lt;/li&gt;
&lt;li&gt;Long context windows&lt;/li&gt;
&lt;li&gt;A higher tokens/s rate&lt;/li&gt;
&lt;li&gt;As little waiting as possible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then ultimately, a large-VRAM NVIDIA card is what matters, because what you are truly buying is the ability to keep the model and cache stably on the GPU. A Mac also works, but it is better suited to a different set of requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I want a machine that can at least load an even larger model.&lt;/li&gt;
&lt;li&gt;I accept average speed, but I don&amp;rsquo;t want to fight with drivers and peripherals.&lt;/li&gt;
&lt;li&gt;I care more about overall experience, power consumption, and noise.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both paths are reasonable; they just solve different problems.&lt;/p&gt;
&lt;h2 id=&#34;back-to-gemma-4-my-final-verdict&#34;&gt;Back to Gemma 4, My Final Verdict
&lt;/h2&gt;&lt;p&gt;This time with &lt;code&gt;Gemma 4&lt;/code&gt;, I really feel that local open-source models have reached a stage where hardware discussion is more warranted.
But just because the model gets stronger doesn&amp;rsquo;t mean the laws of physics loosen up accordingly.
No matter how strong the &lt;code&gt;31B&lt;/code&gt; model is, it will slow down if VRAM is insufficient.
Even with the practical &lt;code&gt;26B A4B&lt;/code&gt;, long contexts still put pressure on the system.
And while Apple&amp;rsquo;s unified memory is comfortable, it only makes it easier to &amp;ldquo;get something running&amp;rdquo;; it won&amp;rsquo;t give you the speed of a large VRAM CUDA card for free.
So, I&amp;rsquo;ll end with this rather blunt summary:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;speed&lt;/strong&gt;, prioritize NVIDIA with large VRAM.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;reliability/fallback&lt;/strong&gt;, Mac&amp;rsquo;s unified memory is indeed comfortable.&lt;/li&gt;
&lt;li&gt;If you plan to run on hardware like the &lt;code&gt;3060 12GB&lt;/code&gt; long-term, don&amp;rsquo;t keep aiming for dense, massive models; a path like &lt;code&gt;26B A4B&lt;/code&gt; is more realistic.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And with that, this set of articles wraps up here.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/docs/core/model_card_4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4 model card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.apple.com/newsroom/2023/10/apple-unveils-m3-m3-pro-and-m3-max-the-most-advanced-chips-for-a-personal-computer/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Apple unveils M3, M3 Pro, and M3 Max&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.apple.com/newsroom/2023/01/apple-unveils-m2-pro-and-m2-max-next-generation-chips-for-next-level-workflows/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Apple unveils M2 Pro and M2 Max&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.apple.com/macbook-pro/specs/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MacBook Pro Tech Specs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;writing-notes&#34;&gt;Writing Notes
&lt;/h2&gt;&lt;h3 id=&#34;original-prompt&#34;&gt;Original Prompt
&lt;/h3&gt;&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;$blog-writer Google has released the Gemma4 model after a year. As usual, I&#39;m trying to deploy it locally on that old desktop with an unupgraded NVIDIA 3060 12GB graphics card. This time I caught the initial release, but I couldn&#39;t find an upgraded version of the commonly used Gemma3. However, there is a similar version called GemmaE4b. Please first search and introduce all the models released this time, what the letters in their abbreviations mean, and then search for online reviews about Gemma4. The key point is that Google updated the model&#39;s protocol this time, so the restrictions for users are fewer. The biggest surprise: my usual test question—write a piece of C++ code to output a five-pointed star in the console. Last year&#39;s smaller parameter open-source models couldn&#39;t handle this problem, but Google managed it this time. In the first version, it gave the answer, completely exceeding my expectations. It knew about my trap; outputting a five-pointed star to the console is very difficult, so it directly hardcoded a string of a five-pointed star for direct console output. This is the original text: Because drawing a five-pointed star with precise geometric structure using mathematical logic in a pure text console (Console) is very complex (involving coordinate system transformation and pixel filling), the most classic and visually best method is to use ASCII Art. After I forced it to perform calculations, it also managed it through mathematical calculation, successfully drawing the five-pointed star. Previously, I often used Gemma4 for local translation tasks; many multilingual versions of historical articles on current blogs are like this. The model used for local testing: gemma-4-26b-a4b. The 31b version is indeed too slow. But looking at the reviews, the 31b effect is very good, and its ranking performance is excellent. 
While browsing forums, I realized that if the VRAM is insufficient and the model parameters are increased, the token generation speed will drop drastically. Can you explain why? Macs don&#39;t have this problem because they use unified memory; please explain the technical reason. Also, if speed is required, then an NVIDIA card with large VRAM is still necessary. The Mac solution can serve as a fallback, but it cannot match the speed. This content is very extensive; please evaluate whether it should be split into a series of articles.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;writing-outline-summary&#34;&gt;Writing Outline Summary
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;The third article will only retain the two threads: &amp;ldquo;Why speed collapses&amp;rdquo; and &amp;ldquo;Why Mac does not equal fast,&amp;rdquo; without revisiting the content of the previous two articles.&lt;/li&gt;
&lt;li&gt;Start by discussing VRAM limitations, then move to non-linear slowdowns; this logic is smoother than the previous version.&lt;/li&gt;
&lt;li&gt;Mac and Nvidia will be discussed as two different dimensions—one focusing on reliability/fallback, and one focusing on speed.&lt;/li&gt;
&lt;li&gt;The conclusion will only retain hardware judgment, without repeating the explanation of why series separation occurred.&lt;/li&gt;
&lt;/ul&gt;</description>
        </item>
        <item>
        <title>Google released Gemma 4 this time (Part II)</title>
        <link>https://ttf248.life/en/p/gemma-4-series-local-test-on-rtx-3060/</link>
        <pubDate>Wed, 08 Apr 2026 23:52:20 +0800</pubDate>
        
        <guid>https://ttf248.life/en/p/gemma-4-series-local-test-on-rtx-3060/</guid>
        <description>&lt;p&gt;If you only look at the leaderboard, &lt;code&gt;31B&lt;/code&gt; is definitely the most eye-catching.
But when you actually get the machine out, it&amp;rsquo;s still that un-upgraded &lt;code&gt;RTX 3060 12GB&lt;/code&gt;, and your judgment will change immediately. How should I put it? For local deployment, in the end, it&amp;rsquo;s not about who looks the fanciest, but who seems like the one you can live with long-term. For me, what is truly worth running first this time isn&amp;rsquo;t &lt;code&gt;31B&lt;/code&gt;, but &lt;code&gt;26B A4B&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The previous article &lt;a class=&#34;link&#34; href=&#34;https://ttf248.life/en/p/gemma-4-series-models-and-license/&#34; &gt;Google released Gemma 4 (Part 1): Don&amp;rsquo;t rush to local deployment; you need to understand the model and protocol first&lt;/a&gt; covered the release and protocols. This current article only talks about the local experience itself; the last one continues with &lt;a class=&#34;link&#34; href=&#34;https://ttf248.life/en/p/gemma-4-series-vram-cliff-and-mac-unified-memory/&#34; &gt;Google released Gemma 4 (Part 3): Why does running out of VRAM cause a cliff, and why can Mac act as a fallback but is slow&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;why-i-ran-26b-a4b-first&#34;&gt;Why I ran &lt;code&gt;26B A4B&lt;/code&gt; first
&lt;/h2&gt;&lt;p&gt;The reason is actually quite basic: it&amp;rsquo;s about hardware reality.
&lt;code&gt;31B&lt;/code&gt; is certainly powerful, and the official leaderboards and initial community feedback have been very strong. However, run it on a machine like a &lt;code&gt;3060 12GB&lt;/code&gt; and the question immediately shifts from &amp;ldquo;Is it powerful?&amp;rdquo; to &amp;ldquo;Is it worth waiting for?&amp;rdquo; Once the model and cache start offloading to system memory, speed can easily collapse. I cover this in detail in the third article.
&lt;code&gt;26B A4B&lt;/code&gt; is different.
Although its total parameters are &lt;code&gt;25.2B&lt;/code&gt;, only about &lt;code&gt;3.8B&lt;/code&gt; are actually activated per token. Simply put, it&amp;rsquo;s the one in Gemma 4 that feels most &amp;ldquo;designed for local users.&amp;rdquo;
So, if your machine is similar to mine—an older consumer-grade card—here is a straightforward way to decide:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you want to see benchmark scores, go with &lt;code&gt;31B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If you plan to use it locally in the long term, start with &lt;code&gt;26B A4B&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;the-five-point-star-problem-someone-finally-understood-my-trap-this-time&#34;&gt;The Five-Pointed Star Problem: Someone Finally Understood My Trap This Time
&lt;/h2&gt;&lt;p&gt;I have always had a rather basic test question: asking the model to write a piece of &lt;code&gt;C++&lt;/code&gt; code that outputs a five-pointed star to the console.&lt;/p&gt;
&lt;p&gt;This problem might look like a joke, but it&amp;rsquo;s actually quite tricky. Many models tend to interpret it as a pure mathematical drawing problem, and then they start talking about coordinates, trigonometric functions, and loops, ultimately outputting a mess of characters in the plain text console that is completely unreadable.&lt;/p&gt;
&lt;p&gt;Last year, many small-parameter open-source models failed at this point.&lt;/p&gt;
&lt;p&gt;My first reaction to &lt;code&gt;Gemma 4&lt;/code&gt; this time was actually quite surprising. It didn&amp;rsquo;t rush to pretend it understood; instead, it first identified the constraints and provided this judgment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Since drawing a five-pointed star with precise geometric structure directly using mathematical logic in a plain text console (Console) is very complex (involving coordinate system transformation and pixel filling), the most classic and visually effective method is to use ASCII Art.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;To put it plainly, it first understood the environmental constraints behind the problem. The console is not a canvas, and the character grid is not a pixel grid. You must first figure out &amp;ldquo;how to stably output a five-pointed star&amp;rdquo; before discussing mathematical drawing.
Then, in its first version, it directly provided a hardcoded string for the five-pointed star.
This action was very on point. It wasn&amp;rsquo;t about showing off derivations; it was about getting the problem solved correctly first.&lt;/p&gt;
&lt;h2 id=&#34;what-surprised-me-even-more-is-that-it-could-continue-further&#34;&gt;What Surprised Me Even More Is That It Could Continue Further
&lt;/h2&gt;&lt;p&gt;If it had only stopped at ASCII Art, this problem would have only shown that it recognized the trap.
What really impressed me was that when I continued to ask it to perform mathematical calculations afterward, it didn&amp;rsquo;t falter; instead, it was able to proceed logically, mapping the geometric relationships onto a character grid and finally calculating the pentagram.
This demonstrates not just &amp;ldquo;it can write some code,&amp;rdquo; but rather that it understands this problem has two layers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The first layer: What is the most stable answer for the console?&lt;/li&gt;
&lt;li&gt;The second layer: if you insist on doing calculations, how do you reduce a geometric problem onto a character grid?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Previously, many local small models would jump straight to the second layer and fail at the first. &lt;code&gt;Gemma 4&lt;/code&gt; reversed the approach this time: it first identified the boundaries and then decided on the solution method. I think this is more valuable than any single benchmark score.&lt;/p&gt;
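&lt;p&gt;For the curious, the &amp;ldquo;second layer&amp;rdquo; can be sketched in a few lines. This is my own illustration, not Gemma&amp;rsquo;s output, and it is in Python rather than the C++ of my test question; the grid size, the x-stretch factor for tall console cells, and the sampling density are all arbitrary choices:&lt;/p&gt;

```python
# Illustrative sketch (not Gemma's answer): map a pentagram's geometry
# onto a character grid by connecting every second vertex of a pentagon.
import math

W, H = 42, 21
grid = [[" "] * W for _ in range(H)]

cx, cy, r = W / 2, H / 2, 9
# Five outer vertices of a regular pentagon, starting at the top.
# x is stretched by 2 because console cells are taller than wide.
pts = []
for k in range(5):
    a = -math.pi / 2 + k * 2 * math.pi / 5
    pts.append((cx + 2 * r * math.cos(a), cy + r * math.sin(a)))

def draw(p, q, steps=200):
    # Sample the segment densely and round each point onto the grid.
    for i in range(steps + 1):
        t = i / steps
        x = round(p[0] + t * (q[0] - p[0]))
        y = round(p[1] + t * (q[1] - p[1]))
        grid[y][x] = "*"

# A pentagram connects every second vertex: 0-2, 2-4, 4-1, 1-3, 3-0.
for k in range(5):
    draw(pts[k], pts[(k + 2) % 5])

print("\n".join("".join(row).rstrip() for row in grid))
```

&lt;p&gt;Everything difficult about the problem lives in the two small decisions here: stretching x because a character cell is not square, and rounding continuous coordinates onto a discrete grid. That is the reduction the model eventually performed when I forced it past the ASCII-art answer.&lt;/p&gt;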
&lt;h2 id=&#34;this-coding-improvement-isnt-just-about-being-smarter&#34;&gt;This Coding Improvement Isn&amp;rsquo;t Just About Being &amp;ldquo;Smarter&amp;rdquo;
&lt;/h2&gt;&lt;p&gt;The reason this five-pointed-star problem is so useful is that it doesn&amp;rsquo;t just test syntax.
What it truly tests is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The ability to first understand the output environment.&lt;/li&gt;
&lt;li&gt;The ability to admit when an intuitive solution is inappropriate.&lt;/li&gt;
&lt;li&gt;The ability to switch between achieving the &amp;ldquo;optimal presentation effect&amp;rdquo; and fulfilling a &amp;ldquo;user-mandated calculation.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once a model can solve this type of problem correctly, it is starting to act like a development assistant that can handle real-world constraints, rather than one that merely completes code snippets. This is also why my first impression of &lt;code&gt;Gemma 4&lt;/code&gt; is much better than last year&amp;rsquo;s batch of smaller open models. Many of those were good at chatting, completing, and getting by, but showed their limits as soon as a problem had even slight boundary conditions. At least Google has addressed this weakness this time.&lt;/p&gt;
&lt;h2 id=&#34;translating-this-line-cannot-simply-be-stated-as-gemma-4-completely-replaces-everything&#34;&gt;On Translation, It Isn&amp;rsquo;t Simply &amp;ldquo;Gemma 4 Replaces Everything&amp;rdquo;
&lt;/h2&gt;&lt;p&gt;There is one key point here: previously, people often used &lt;code&gt;Gemma&lt;/code&gt; for local translation.
The transition to &lt;code&gt;Gemma 4&lt;/code&gt; isn&amp;rsquo;t actually that linear. This is because Google released &lt;code&gt;TranslateGemma&lt;/code&gt; separately in February 2026, and it was built on the architecture of &lt;code&gt;Gemma 3&lt;/code&gt;.
What does this mean?
It means that if your existing local translation pipeline is already working smoothly, you don&amp;rsquo;t necessarily have to switch everything over to &lt;code&gt;Gemma 4&lt;/code&gt; in the short term. Especially for scenarios with very specific goals—like only needing stable multilingual conversion—a dedicated translation model still has its value.
However, if what you want is a single local model that can reasonably handle translation, Q&amp;amp;A, code, and general text tasks, then a more versatile route like &lt;code&gt;26B A4B&lt;/code&gt; is smoother.
It might not be the most specialized one, but it&amp;rsquo;s more like choosing the &amp;ldquo;good enough main model to get running first&amp;rdquo; option in a real-world scenario.&lt;/p&gt;
&lt;h2 id=&#34;why-i-dont-want-to-keep-praising-31b-in-the-second-article&#34;&gt;Why I Don&amp;rsquo;t Want to Keep Praising &lt;code&gt;31B&lt;/code&gt; in the Second Article
&lt;/h2&gt;&lt;p&gt;It&amp;rsquo;s not that &lt;code&gt;31B&lt;/code&gt; is bad; quite the opposite, it&amp;rsquo;s &lt;em&gt;too&lt;/em&gt; good, which makes it easy to get distracted.
If you keep focusing on the leaderboard performance of &lt;code&gt;31B&lt;/code&gt;, it&amp;rsquo;s easy to write this article as &amp;ldquo;Strong models are truly strong.&amp;rdquo; But what local deployment fears most is exactly that kind of talk. Because what truly determines whether you will continue using it every day isn&amp;rsquo;t the leaderboard, but rather:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is the startup too slow?&lt;/li&gt;
&lt;li&gt;Does the response speed drop severely?&lt;/li&gt;
&lt;li&gt;Does long context quickly ruin the experience?&lt;/li&gt;
&lt;li&gt;Can your own machine actually handle it?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On a machine like the &lt;code&gt;3060 12GB&lt;/code&gt;, these practical issues matter far more than the leaderboard. So my conclusion for the second article is simple: &lt;code&gt;31B&lt;/code&gt; is worth looking at; &lt;code&gt;26B A4B&lt;/code&gt; is worth using. For local players, those two statements are not the same thing.&lt;/p&gt;
&lt;h2 id=&#34;my-initial-local-conclusion&#34;&gt;My Initial Local Conclusion
&lt;/h2&gt;&lt;p&gt;If I had to summarize my experience from this test in one sentence, it would be:
&lt;code&gt;Gemma 4&lt;/code&gt; finally feels like a local model that understands context/scenarios.
Especially the &lt;code&gt;26B A4B&lt;/code&gt;. It might not be the model best for showing off on leaderboards, but under real-world constraints—like older hardware, consumer-grade GPUs, and long-term local use—it actually feels more like the true workhorse choice.
At least on this five-pointed-star test, Google has passed.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/docs/core/model_card_4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4 model card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-26B-A4B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-26B-A4B-it on Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://developers.googleblog.com/introducing-gemma3/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 3: The Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://blog.google/innovation-and-ai/technology/developers-tools/translategemma/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TranslateGemma: A new family of open translation models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://foodtruckbench.com/blog/gemma-4-31b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4 31B on FoodTruck Bench&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;writing-notes&#34;&gt;Writing Notes
&lt;/h2&gt;&lt;h3 id=&#34;original-prompt&#34;&gt;Original Prompt
&lt;/h3&gt;&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;$blog-writer Google has released the Gemma4 model after a year. As usual, I&#39;m trying to deploy it locally on that old desktop with an unupgraded NVIDIA 3060 12GB graphics card. This time I caught the initial release, but I couldn&#39;t find an upgraded version of the commonly used Gemma3. However, there is a similar version called GemmaE4b. Please search and introduce all the models released this time, what the abbreviation letters mean in them, and then search for online reviews about Gemma4. The key point is that Google updated the model&#39;s protocol this time, and the restrictions for users are fewer. The biggest surprise: my usual test question—write a piece of C++ code to output a five-pointed star in the console. Last year&#39;s smaller open-source models couldn&#39;t handle this problem, but Google managed it this time. In the first version, it gave an answer that completely exceeded my expectations; it knew about my trap. Outputting a five-pointed star to the console is very tricky, so it directly hardcoded a string for the five-pointed star, which was outputted directly to the console. This is the original text: Because drawing a five-pointed star with precise geometric structure using mathematical logic in a pure text console (Console) is very complex (involving coordinate system transformation and pixel filling), the most classic and visually best method is to use ASCII Art. After I forced it to perform calculations, it also succeeded through mathematical calculation, successfully drawing the five-pointed star. Previously, I often used Gemma4 for local translation tasks; many multilingual versions of historical articles on current blogs are like this. The model used for local testing: gemma-4-26b-a4b. The 31b version is indeed too slow. But looking at the reviews, the 31b effect is very good, and its ranking performance is excellent. 
Also, while browsing forums, I realized that if the VRAM is insufficient and the model parameters are increased, the token generation speed will drop drastically. Can you explain why? Macs don&#39;t have this problem because they use unified memory; please explain the technical reason. Furthermore, if speed is required, only an NVIDIA card with large VRAM will do. The Mac solution can serve as a fallback, but it cannot match the speed. This content is very extensive; please evaluate whether it should be split into a series of articles.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;writing-outline-summary&#34;&gt;Writing Outline Summary
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;The second article will only retain the local experience, and will no longer summarize the first article or explain VRAM principles for the third one.&lt;/li&gt;
&lt;li&gt;First provide the hard judgment on &amp;ldquo;why run 26B A4B first,&amp;rdquo; then expand with the five-star test.&lt;/li&gt;
&lt;li&gt;The five-star question is treated as the main axis because it better illustrates the boundary sense in coding scenarios than benchmark scores.&lt;/li&gt;
&lt;li&gt;The translation task will be given its own section to avoid making &lt;code&gt;Gemma 4&lt;/code&gt; seem like a linear successor to all previous processes.&lt;/li&gt;
&lt;/ul&gt;</description>
        </item>
        <item>
        <title>Google has released Gemma 4 this time (Part 1)</title>
        <link>https://ttf248.life/en/p/gemma-4-series-models-and-license/</link>
        <pubDate>Wed, 08 Apr 2026 23:48:20 +0800</pubDate>
        
        <guid>https://ttf248.life/en/p/gemma-4-series-models-and-license/</guid>
        <description>&lt;p&gt;On the day of the initial release, what I originally wanted to do was simple: find an upgraded version corresponding to &lt;code&gt;Gemma 3&lt;/code&gt; and download it to run.
However, after looking around, I was a bit stunned. The familiar naming convention of &lt;code&gt;4B / 12B / 27B&lt;/code&gt; is gone; instead, we have &lt;code&gt;E4B&lt;/code&gt;, &lt;code&gt;26B A4B&lt;/code&gt;, and &lt;code&gt;31B&lt;/code&gt;. How should I put it? This time, what Google truly changed wasn&amp;rsquo;t just the model sizes, but even &amp;ldquo;how you should understand this batch of models.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve broken down these articles into three parts. This current article only clarifies the release information, model names, and protocols; the next one will cover &lt;a class=&#34;link&#34; href=&#34;https://ttf248.life/en/p/gemma-4-series-local-test-on-rtx-3060/&#34; &gt;Google Released Gemma 4 (Part II): Running Locally on a 3060 12GB, 26B A4B is More Realistic&lt;/a&gt;; and the last one will conclude with &lt;a class=&#34;link&#34; href=&#34;https://ttf248.life/en/p/gemma-4-series-vram-cliff-and-mac-unified-memory/&#34; &gt;Google Released Gemma 4 (Part III): Why VRAM Insufficiency Causes a Cliff, and Why Mac Can Be a Fallback But Isn&amp;rsquo;t Fast&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;lets-first-clarify-what-was-actually-released-this-time&#34;&gt;Let&amp;rsquo;s first clarify what was actually released this time
&lt;/h2&gt;&lt;p&gt;Last year, &lt;code&gt;Gemma 3&lt;/code&gt; was released on March 12, 2025, and this &lt;code&gt;Gemma 4&lt;/code&gt; was released on April 2, 2026, almost exactly a year apart.
However, we cannot approach this by asking, &amp;ldquo;Who is the next generation after 27B?&amp;rdquo; The four main sizes provided by the official source are no longer simply categorized by total parameters.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Structure&lt;/th&gt;
          &lt;th&gt;Key Numbers&lt;/th&gt;
          &lt;th&gt;Typical Scenarios&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;E2B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Dense&lt;/td&gt;
          &lt;td&gt;2.3B effective, 5.1B including embeddings, 128K context&lt;/td&gt;
          &lt;td&gt;On-device, ultra-lightweight local&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;E4B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Dense&lt;/td&gt;
          &lt;td&gt;4.5B effective, 8B including embeddings, 128K context&lt;/td&gt;
          &lt;td&gt;The original 4B small model main line&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;26B A4B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;MoE&lt;/td&gt;
          &lt;td&gt;25.2B total, approx. 3.8B active, 256K context&lt;/td&gt;
          &lt;td&gt;Consumer GPUs, local deployment, balancing quality and speed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;31B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Dense&lt;/td&gt;
          &lt;td&gt;30.7B dense, 256K context&lt;/td&gt;
          &lt;td&gt;Aiming for the upper limit, leaderboards, and more stable quality&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you only look at the surface, the naming might feel more confusing. But it&amp;rsquo;s not random; Google is deliberately splitting three tracks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small model for on-device use, given to &lt;code&gt;E2B / E4B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Local player track, given to &lt;code&gt;26B A4B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Quality and upper limit track, given to &lt;code&gt;31B&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is also why many people&amp;rsquo;s first impression might be, &amp;ldquo;The previously familiar upgrade path has been broken.&amp;rdquo; It&amp;rsquo;s not that Google didn&amp;rsquo;t release an upgraded version; it&amp;rsquo;s that Google no longer wants to sell products on just one dimension: total parameters.&lt;/p&gt;
&lt;h2 id=&#34;e-and-a-are-not-decorative-letters-this-time&#34;&gt;&amp;lsquo;E&amp;rsquo; and &amp;lsquo;A&amp;rsquo; are not decorative letters this time
&lt;/h2&gt;&lt;p&gt;In this batch of names, the most confusing ones are &lt;code&gt;E4B&lt;/code&gt; and &lt;code&gt;A4B&lt;/code&gt;.
The &amp;lsquo;E&amp;rsquo; in &lt;code&gt;E2B&lt;/code&gt; and &lt;code&gt;E4B&lt;/code&gt; stands for &lt;code&gt;effective parameters&lt;/code&gt;, according to the official documentation. Because these two models use &lt;code&gt;Per-Layer Embeddings&lt;/code&gt;, the total parameter count and the effective parameter count are not the same metric. Simply put, Google is reminding you that this is not the old &amp;ldquo;simple 4B dense model.&amp;rdquo;
The &amp;lsquo;A&amp;rsquo; in &lt;code&gt;26B A4B&lt;/code&gt; stands for &lt;code&gt;active parameters&lt;/code&gt;. The total size is &lt;code&gt;25.2B&lt;/code&gt;, but only about &lt;code&gt;3.8B&lt;/code&gt; are actually activated per token. This is key to the MoE approach: the total model size is large, but the part that actually participates in computation at runtime is much smaller.
So, even though both names seem to have a &amp;lsquo;4B&amp;rsquo;, their meanings are completely different:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;E4B&lt;/code&gt; is for the small model line.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;26B A4B&lt;/code&gt; is a large MoE whose &amp;ldquo;activation scale is around 4B&amp;rdquo; during local inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This naming convention was indeed awkward at first, but it is closer to the actual deployment experience than before.&lt;/p&gt;
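&lt;p&gt;To make the difference between total, effective, and active parameters concrete, here is a back-of-envelope sketch. The parameter counts are the official ones above; the 4.5 bits-per-weight figure is my own assumption for a typical mid-range quantization, not an official number.&lt;/p&gt;

```python
# Rough sketch: weight memory scales with TOTAL parameters, while per-token
# compute scales with ACTIVE parameters. Parameter counts come from the
# release table; 4.5 bits/weight is an assumed mid-range quantization.

def weight_gib(total_params_b, bits_per_weight=4.5):
    """Approximate weight footprint in GiB at a given quantization level."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 2**30

# (total params in billions, params exercised per token in billions)
models = {
    "E4B":     (8.0, 4.5),   # dense, but only ~4.5B "effective" (Per-Layer Embeddings)
    "26B A4B": (25.2, 3.8),  # MoE: large on disk, small per token
    "31B":     (30.7, 30.7), # plain dense: everything is active
}

for name, (total_b, active_b) in models.items():
    print(f"{name}: ~{weight_gib(total_b):.1f} GiB of weights, "
          f"~{active_b}B params per token")
```

&lt;p&gt;Under that assumed quantization, the MoE&amp;rsquo;s weights still land on the order of 13 GiB, yet each token only exercises roughly a 4B model&amp;rsquo;s worth of compute; that asymmetry is exactly what the VRAM post in this series digs into.&lt;/p&gt;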
&lt;h2 id=&#34;if-you-previously-used-gemma-3-how-to-find-the-corresponding-relationship-this-time&#34;&gt;If you previously used Gemma 3, how to find the corresponding relationship this time
&lt;/h2&gt;&lt;p&gt;I think the easiest place to misjudge with this generation is to treat it as a linear upgrade from &lt;code&gt;Gemma 3&lt;/code&gt;.
If you look at it based on usage habits, you can roughly understand it like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Those who used to focus on &lt;code&gt;4B&lt;/code&gt; for light tasks should now first look at &lt;code&gt;E4B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Those who used to focus on &lt;code&gt;27B&lt;/code&gt; to see the model&amp;rsquo;s upper limit should now look at &lt;code&gt;31B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;If you previously wanted a balance point on consumer-grade GPUs that is &amp;ldquo;powerful enough but not completely unrunnable,&amp;rdquo; now focus on &lt;code&gt;26B A4B&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you don&amp;rsquo;t clarify this layer first, local deployment will easily go wrong later. You might complain, &amp;ldquo;Why isn&amp;rsquo;t there the familiar upgraded version?&amp;rdquo; while mistakenly choosing a model that isn&amp;rsquo;t actually suitable for you.&lt;/p&gt;
&lt;h2 id=&#34;the-most-valuable-update-this-time-isnt-the-parameters&#34;&gt;The most valuable update this time isn&amp;rsquo;t the parameters
&lt;/h2&gt;&lt;p&gt;What really made this release feel like a &amp;ldquo;finally figured it out&amp;rdquo; moment wasn&amp;rsquo;t the leaderboard, but the license.
The old &lt;code&gt;Gemma&lt;/code&gt; terms weren&amp;rsquo;t unusable, but they always felt a bit awkward. Especially if you care about these things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Redistribution&lt;/li&gt;
&lt;li&gt;Distillation or secondary packaging&lt;/li&gt;
&lt;li&gt;Integrating the model into your own product pipeline&lt;/li&gt;
&lt;li&gt;Commercial deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You always had to go back and check how the notices, downstream restrictions, and accompanying agreements in the terms should be handled.
By switching directly to &lt;code&gt;Apache 2.0&lt;/code&gt; this time, &lt;code&gt;Gemma 4&lt;/code&gt; made things much cleaner. The core message is very clear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Commercially usable&lt;/li&gt;
&lt;li&gt;Modifiable&lt;/li&gt;
&lt;li&gt;Redistributable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The main obligations are limited to retaining familiar open-source elements such as the license, notices, and modification documentation.
Simply put, Google didn&amp;rsquo;t just open-source a model this time; it smoothed out the entire question of &amp;ldquo;whether people feel safe actually using it.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;initial-community-feedback-basically-two-lines&#34;&gt;Initial Community Feedback, Basically Two Lines
&lt;/h2&gt;&lt;p&gt;If you only look at the first week&amp;rsquo;s buzz, there are roughly two main sentiments.&lt;/p&gt;
&lt;p&gt;The first line is that &lt;code&gt;31B&lt;/code&gt; is genuinely capable. The official benchmarks are already very impressive. In the &lt;code&gt;Arena AI&lt;/code&gt; text leaderboard, 31B was ranked among the top open-source models upon release, and it also showed a significant improvement over &lt;code&gt;Gemma 3 27B&lt;/code&gt; on &lt;code&gt;LiveCodeBench v6&lt;/code&gt;. Many people&amp;rsquo;s first reaction is that achieving this level of performance with this size is quite beyond expectations.&lt;/p&gt;
&lt;p&gt;The second line is that &lt;code&gt;26B A4B&lt;/code&gt; seems like a lifeline for local users. It might not be the flashiest flagship model at first glance, but it is very practical. Especially if you aren&amp;rsquo;t running things in a data center, but rather on consumer-grade GPUs, workstations, or even older machines, the local experience tends to fall onto this line.&lt;/p&gt;
&lt;p&gt;Of course, there&amp;rsquo;s a very realistic prerequisite for the initial wave of feedback: the ecosystem is still catching up with the versions. Templates, quantization methods, inference frameworks, and front-end tools—many haven&amp;rsquo;t fully kept pace yet. Therefore, when looking at comments right now, it&amp;rsquo;s best to view them in two layers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The core model:&lt;/strong&gt; There has indeed been a big improvement here.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Local experience:&lt;/strong&gt; This will continue to be influenced by the maturity of the toolchain.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;my-conclusion-on-the-first-article&#34;&gt;My Conclusion on the First Article
&lt;/h2&gt;&lt;p&gt;If you just want to know what Google actually released this time, one sentence is enough.
&lt;code&gt;Gemma 4&lt;/code&gt; is no longer following the old idea of &amp;ldquo;a line of dense models from small to large,&amp;rdquo; but rather separating three paths: device-side deployment, local deployment, and quality ceiling. The names like &lt;code&gt;E4B&lt;/code&gt;, &lt;code&gt;26B A4B&lt;/code&gt;, and &lt;code&gt;31B&lt;/code&gt; sound strange, but behind them is a very practical division of labor for deployment.
But if you ask me what the biggest change is this time, I still stick to that judgment:
It&amp;rsquo;s not about parameters, nor is it about leaderboards; it&amp;rsquo;s that Google finally put &lt;code&gt;Gemma 4&lt;/code&gt; into an open-source protocol that everyone feels more comfortable actually using.
This step is more important than the numbers on the surface.
In the next article, I won&amp;rsquo;t continue the launch-event narrative; I&amp;rsquo;ll go straight back to the local machine: still that unupgraded &lt;code&gt;RTX 3060 12GB&lt;/code&gt;, and why my first focus wasn&amp;rsquo;t &lt;code&gt;31B&lt;/code&gt; but &lt;code&gt;26B A4B&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/docs/core/model_card_4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4 model card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/terms&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma Terms of Use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/apache_2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Apache License 2.0 for Gemma 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://foodtruckbench.com/blog/gemma-4-31b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 4 31B on FoodTruck Bench&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.reddit.com/r/LocalLLaMA/comments/1san4kd/will_gemma_4_124b_moe_open_as_well/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LocalLLaMA discussion on Gemma 4 license changes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://developers.googleblog.com/introducing-gemma3/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 3: The Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;writing-notes&#34;&gt;Writing Notes
&lt;/h2&gt;&lt;h3 id=&#34;original-prompt&#34;&gt;Original Prompt
&lt;/h3&gt;&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;$blog-writer Google has released the Gemma4 model after a year. As usual, I&#39;m trying to deploy it locally on that old desktop with an unupgraded NVIDIA 3060 12GB graphics card. This time I caught the initial release, but I couldn&#39;t find an upgraded version of the previously used Gemma3. However, there is a similar version called GemmaE4b. Please first search and introduce all the models released this time, what the abbreviation letters mean in them, and then search for online reviews about Gemma4. The key point is that Google updated the model&#39;s protocol this time, so the restrictions for users are fewer. The biggest surprise: my usual test question—write a piece of C++ code to output a five-pointed star in the console. Last year&#39;s smaller parameter open-source models couldn&#39;t handle this problem, but Google managed it this time. In the first version, it gave an answer that completely exceeded my expectations; it knew about my trap. Outputting a five-pointed star to the console is very tricky, so it directly hardcoded a string for the five-pointed star, and the console outputted it directly. This is the original text: Because drawing a five-pointed star with precise geometric structure using mathematical logic in a pure text console (Console) is very complex (involving coordinate system transformation and pixel filling), the most classic and visually best method is to use ASCII Art. After I forced it to perform calculations, it also managed it through mathematical calculation and successfully drew the five-pointed star. Previously, I often used Gemma4 for local translation tasks; many multilingual versions of historical articles on current blogs are like this. The model used for local testing: gemma-4-26b-a4b. The 31b version is indeed too slow. But looking at the reviews, the 31b performs very well, and its ranking scores are excellent. 
Also, while browsing forums, I realized that if the VRAM is insufficient and the model parameters increase, the token generation speed will drop drastically. Can you explain why? Macs don&#39;t have this problem because they use unified memory; please explain the technical reason. Furthermore, if speed is required, then an NVIDIA card with large VRAM is still necessary. The Mac solution can serve as a fallback, but it cannot match the speed. This content is quite extensive; please evaluate whether it should be split into a series of articles.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;writing-outline-summary&#34;&gt;Writing Outline Summary
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;The first article will only focus on clarifying &amp;ldquo;what was actually released this time&amp;rdquo; and &amp;ldquo;why the protocol is important,&amp;rdquo; avoiding topics that compete with local experience discussions.&lt;/li&gt;
&lt;li&gt;We will separate the model roadmap breakdown and then explain the meaning of the letters, making the logical flow more direct than the previous version.&lt;/li&gt;
&lt;li&gt;For the protocol section, we retained the judgment: &amp;ldquo;What was truly released this time is not the parameters, but the usage restrictions.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Community feedback will only be used for synthesis/conclusion, without preemptively including too many local experience details.&lt;/li&gt;
&lt;/ul&gt;</description>
        </item>
        <item>
        <title>Don&#39;t force weak models onto hard tasks.</title>
        <link>https://ttf248.life/en/p/weaker-models-shouldnt-do-frontier-work/</link>
        <pubDate>Thu, 02 Apr 2026 22:05:00 +0800</pubDate>
        
        <guid>https://ttf248.life/en/p/weaker-models-shouldnt-do-frontier-work/</guid>
        <description>&lt;p&gt;Recently, I&amp;rsquo;ve been migrating some edge cases to &lt;code&gt;MiniMax&lt;/code&gt; and local models. The more I use them, the more I feel that we shouldn&amp;rsquo;t always measure things by the standard of &amp;ldquo;the most powerful model.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;My judgment is straightforward: don&amp;rsquo;t force weak models into hard tasks. Models like &lt;code&gt;MiniMax&lt;/code&gt; are indeed limited in capability: for complex coding, long-chain reasoning, or ambiguous requirement decomposition, they fall short. However, ask them to do data cleaning, document writing, or searching for proposal materials, and they handle it perfectly well. The same logic applies to local models around the &lt;code&gt;12B&lt;/code&gt; size; translation, format rewriting, and batch cleaning are where they fit best.&lt;/p&gt;
&lt;p&gt;To put it plainly, it&amp;rsquo;s not that the models lack value; it&amp;rsquo;s just that we shouldn&amp;rsquo;t place them in the wrong roles.&lt;/p&gt;
&lt;h2 id=&#34;the-real-problem-isnt-how-strong-the-model-is-but-whether-it-works-correctly&#34;&gt;The real problem isn&amp;rsquo;t how strong the model is, but whether it works correctly.
&lt;/h2&gt;&lt;p&gt;Many people who talk about large models automatically think of the most difficult tasks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writing complex engineering code independently&lt;/li&gt;
&lt;li&gt;Deconstructing an entire system in one go&lt;/li&gt;
&lt;li&gt;Multi-turn reasoning over long contexts&lt;/li&gt;
&lt;li&gt;Planning and executing while searching&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are certainly important. But in real-world work, what actually piles up on your desk most often isn&amp;rsquo;t that kind of task.
It&amp;rsquo;s more like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cleaning up a pile of dirty fields&lt;/li&gt;
&lt;li&gt;Organizing scattered information into readable documents&lt;/li&gt;
&lt;li&gt;Converting long texts into summaries, FAQs, or outlines&lt;/li&gt;
&lt;li&gt;Standardizing mixed Chinese and English content formats&lt;/li&gt;
&lt;li&gt;Gathering data from multiple web pages and compiling it into a draft proposal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For these types of tasks, what is most needed is not &amp;ldquo;the model thinking like a genius,&amp;rdquo; but three things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Instruction following must be reasonably accurate.&lt;/li&gt;
&lt;li&gt;Output structure should be as stable as possible.&lt;/li&gt;
&lt;li&gt;The cost must be low enough that you are willing to use it repeatedly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is why I always feel that weak models are not useless; they just cannot fight the same battles as flagship models.&lt;/p&gt;
&lt;h2 id=&#34;minimax-whats-actually-suitable-for-it&#34;&gt;MiniMax: What&amp;rsquo;s Actually Suitable for It
&lt;/h2&gt;&lt;p&gt;First, let&amp;rsquo;s talk about &lt;code&gt;MiniMax&lt;/code&gt;.
The official positioning of &lt;code&gt;MiniMax-M2.5&lt;/code&gt; is actually quite high. Press releases and open-platform documentation push it toward programming, tool calling, search, and office productivity, even emphasizing speed and cost advantages. I don&amp;rsquo;t dismiss those claims outright, but I prefer to break them down.
For me, what &lt;code&gt;MiniMax&lt;/code&gt; is genuinely good at isn&amp;rsquo;t &amp;ldquo;the most complex development tasks,&amp;rdquo; but rather the following:&lt;/p&gt;
&lt;h3 id=&#34;data-cleaning&#34;&gt;Data Cleaning
&lt;/h3&gt;&lt;p&gt;A lot of data cleaning is essentially manual labor involving semi-structured text.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Name unification&lt;/li&gt;
&lt;li&gt;Field mapping&lt;/li&gt;
&lt;li&gt;Anomaly labeling&lt;/li&gt;
&lt;li&gt;Classification tagging&lt;/li&gt;
&lt;li&gt;Table field completion&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What these tasks fear most is not the model being &amp;ldquo;dumb,&amp;rdquo; but inconsistent formatting or divergent outputs. As long as the model can reliably output results as &lt;code&gt;JSON&lt;/code&gt;, tables, or fixed templates, it&amp;rsquo;s actually sufficient. Powerful models can certainly do this too, but using the most expensive tier of model just to clean fields is often not cost-effective.&lt;/p&gt;
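&lt;p&gt;As a minimal sketch of that &amp;ldquo;stable format first&amp;rdquo; idea: &lt;code&gt;call_model&lt;/code&gt; is a hypothetical stand-in for whichever cheap API you use, and the schema keys are invented for illustration.&lt;/p&gt;

```python
# Sketch of "stable output beats clever output" for cleaning jobs.
# call_model is a stand-in for any cheap model API; the schema keys
# ("name", "category", "anomaly") are invented for illustration.
import json

SCHEMA_HINT = (
    'Return ONLY a JSON object with keys "name" (string), '
    '"category" (string), and "anomaly" (true or false). No prose.'
)

REQUIRED = {"name", "category", "anomaly"}

def clean_record(raw, call_model, retries=3):
    """Ask the model for one cleaned record; retry until the JSON is valid."""
    prompt = f"{SCHEMA_HINT}\n\nRaw field:\n{raw}"
    for _ in range(retries):  # with a cheap model, retrying costs little
        reply = call_model(prompt)
        try:
            obj = json.loads(reply)
        except json.JSONDecodeError:
            continue          # malformed output: just ask again
        if REQUIRED.issubset(obj):
            return obj
    raise ValueError("no valid JSON after retries")
```

&lt;p&gt;The validation loop, not the model, is what keeps a batch job usable: a weak model that fails a format check occasionally is fine as long as the check exists.&lt;/p&gt;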
&lt;h3 id=&#34;documentation-writing&#34;&gt;Documentation Writing
&lt;/h3&gt;&lt;p&gt;Writing documentation is annoying, not difficult.
When an interface changes, a process changes, or a field is modified, the documentation has to change accordingly. This process doesn&amp;rsquo;t require strong creativity from the model; rather, it requires the model &lt;em&gt;not&lt;/em&gt; to overreach and rewrite clearly defined things into something ambiguous.
&lt;code&gt;MiniMax&lt;/code&gt; is often more reliable for these kinds of tasks than one might expect. Especially when you have already prepared the context, it acts more like a capable documentation assistant rather than an actual engineer.&lt;/p&gt;
&lt;h3 id=&#34;solution-material-search&#34;&gt;Solution Material Search
&lt;/h3&gt;&lt;p&gt;The official platform is also promoting search and tool calling, so this direction is fine.
Many times, what we need is not for the model to &amp;ldquo;come up with an answer out of thin air,&amp;rdquo; but rather for it to first find relevant web pages, documents, announcements, or materials, and then organize them neatly. In this scenario, cheaper models like &lt;code&gt;MiniMax&lt;/code&gt; are very valuable because searching, summarizing, and integrating are inherently high-frequency, mundane tasks.
So my actual view is: &lt;code&gt;MiniMax&lt;/code&gt; isn&amp;rsquo;t incapable; rather, it is better suited for the dirty, tiring, and repetitive tasks within a production pipeline. If you let it act as an assistant or general laborer, it is often competent; but if you ask it to handle the entire engineering process, the probability of disappointment increases.&lt;/p&gt;
&lt;h2 id=&#34;local-12b-models-best-suited-for-bringing-back-these-tasks&#34;&gt;Local 12B Models, Best Suited for Bringing Back These Tasks
&lt;/h2&gt;&lt;p&gt;Looking further down, the logic for local deployment is actually the same.
When many people talk about local models, they inevitably ask one question: Can it replace the flagship cloud models?
I think this question is flawed from the start.
For local models around &lt;code&gt;12B&lt;/code&gt;, what has real practical value isn&amp;rsquo;t &amp;ldquo;proving that it can handle the most powerful tasks,&amp;rdquo; but rather bringing back those stable, repetitive, sensitive, low-profit, yet high-frequency tasks.&lt;/p&gt;
&lt;h3 id=&#34;translation&#34;&gt;Translation
&lt;/h3&gt;&lt;p&gt;This is one of the most natural scenarios for local models.
As explicitly mentioned in the official blog of &lt;code&gt;Qwen2.5&lt;/code&gt;, it has enhanced capabilities for long-text generation, structured data understanding, and &lt;code&gt;JSON&lt;/code&gt; output, and supports over 29 languages. This combination is inherently suitable for tasks like translation, bilingual rewriting, format standardization, and terminology normalization.
Technical documentation, field descriptions, product introductions, and API comments—these items often have stable structures and fixed terminology. While local models might not produce the most elegant translations, they are usually sufficient.&lt;/p&gt;
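&lt;p&gt;One practical way to get that terminology stability out of a local model is to pin the glossary in the prompt. This is purely my own convention; the glossary pairs and prompt wording are illustrative, not anything from the &lt;code&gt;Qwen2.5&lt;/code&gt; docs.&lt;/p&gt;

```python
# Sketch: pin terminology when using a local model for translation.
# The glossary pairs and prompt wording are my own illustrative convention.
GLOSSARY = {
    "显存": "VRAM",
    "量化": "quantization",
    "上下文": "context window",
}

def translation_prompt(text):
    """Build a prompt that forces fixed term mappings on the model."""
    terms = "\n".join(f"- {src} -> {dst}" for src, dst in GLOSSARY.items())
    return (
        "Translate the following into English.\n"
        "Use exactly these term mappings, never synonyms:\n"
        f"{terms}\n\nText:\n{text}"
    )
```

&lt;p&gt;For stable, template-heavy material such as field descriptions and API comments, a fixed glossary in the prompt usually matters more than raw model quality.&lt;/p&gt;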
&lt;h3 id=&#34;data-cleaning-1&#34;&gt;Data Cleaning
&lt;/h3&gt;&lt;p&gt;This is also where local models are particularly realistic.
There are many spreadsheets, documents, and business materials you might not want to upload to the cloud. For internal data, customer records, meeting minutes, and draft proposals in particular, where privacy and permissions are involved, running locally provides much more peace of mind.
At this point, the significance of a local model around &lt;code&gt;12B&lt;/code&gt; isn&amp;rsquo;t &amp;ldquo;how smart it is,&amp;rdquo; but rather that &amp;ldquo;it&amp;rsquo;s on my machine, and it can reliably handle these dirty tasks.&amp;rdquo;&lt;/p&gt;
&lt;h3 id=&#34;fixed-format-rewriting&#34;&gt;Fixed Format Rewriting
&lt;/h3&gt;&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Meeting minutes organized into a fixed template&lt;/li&gt;
&lt;li&gt;Product titles cleaned into a unified naming convention&lt;/li&gt;
&lt;li&gt;Bug descriptions rewritten into ticket format&lt;/li&gt;
&lt;li&gt;Mixed Chinese and English text cleaned into single-language versions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These types of tasks share consistent characteristics: clear rules, large batches, high repetition, low value per instance, but significant cumulative effort.
This is exactly what local models are best suited for.&lt;/p&gt;
&lt;h2 id=&#34;can-the-3060-12gb-actually-run-a-model-around-12b&#34;&gt;Can the 3060 12GB actually run a model around 12B?
&lt;/h2&gt;&lt;p&gt;I prefer to write about this realistically: &amp;ldquo;It can run it, but don&amp;rsquo;t get your hopes up too high.&amp;rdquo;
Google provided a very useful VRAM table in the official documentation for &lt;code&gt;Gemma 3&lt;/code&gt;. The &lt;code&gt;Gemma 3 12B&lt;/code&gt; roughly requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;About &lt;code&gt;20 GB&lt;/code&gt; of VRAM to load the full precision version.&lt;/li&gt;
&lt;li&gt;About &lt;code&gt;12.2 GB&lt;/code&gt; to load the medium quantization version.&lt;/li&gt;
&lt;li&gt;About &lt;code&gt;8.7 GB&lt;/code&gt; to load a lower VRAM consumption version.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The official documentation also specifically notes that this covers model loading only, excluding the prompt and runtime overhead.
That caveat is key. It means running a model around 12B on a card like the &lt;code&gt;3060 12GB&lt;/code&gt; is not impossible, but the prerequisites are usually:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are running a quantized version.&lt;/li&gt;
&lt;li&gt;The context length is not too long.&lt;/li&gt;
&lt;li&gt;The task isn&amp;rsquo;t too complex.&lt;/li&gt;
&lt;li&gt;You accept average, or even slow, speed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are willing to accept these premises, then running a local 12B model is indeed feasible. Translation, summarization, table cleaning, and fixed-format conversion are not a stretch at this level.
Furthermore, the official repository for &lt;code&gt;Qwen2.5-14B-Instruct-GGUF&lt;/code&gt; itself provides multiple quantization formats, which makes the intent very clear: models in this class are built for the local inference ecosystem.
So my conclusion has never been that &amp;ldquo;the 3060 12GB can easily handle a 12B model,&amp;rdquo; but rather:
it can run these models, and it is best suited to work with low expectations, high repetition, and high privacy requirements.&lt;/p&gt;
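&lt;p&gt;The &amp;ldquo;loading only, excluding runtime&amp;rdquo; caveat can be turned into a quick feasibility check. A hedged sketch: the KV-cache formula is the standard K-and-V accounting, but the layer, head, and context numbers below are placeholders rather than any model&amp;rsquo;s real configuration.&lt;/p&gt;

```python
# Quick feasibility check: weights + KV cache + overhead vs. available VRAM.
# The layer/head/context numbers below are placeholders, not any model's
# real configuration; the KV formula is the usual fp16 K-and-V accounting.

def kv_cache_gib(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    """K and V tensors (hence the factor 2) for every layer, full context."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 2**30

def fits_in_vram(weights_gib, kv_gib, vram_gib=12.0, overhead_gib=1.5):
    """True if everything fits, leaving a rough allowance for runtime buffers."""
    return vram_gib >= weights_gib + kv_gib + overhead_gib

kv = kv_cache_gib(layers=48, kv_heads=8, head_dim=128, context=8192)
print(fits_in_vram(weights_gib=8.7, kv_gib=kv))  # the ~8.7 GB quant above
```

&lt;p&gt;With these placeholder numbers, the 8.7 GB quantization squeezes onto a 12 GB card while the 12.2 GB one does not, which is exactly why &amp;ldquo;loading size&amp;rdquo; alone never tells the whole story.&lt;/p&gt;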
&lt;h2 id=&#34;cheap-models-and-local-models-its-not-just-about-saving-api-costs&#34;&gt;Cheap Models and Local Models: It&amp;rsquo;s Not Just About Saving API Costs
&lt;/h2&gt;&lt;p&gt;When people talk about this, the first reaction is always saving money.
Of course, saving money is important. But I think the greater value is that you start daring to outsource all those little tasks you used to avoid doing.
Before, you might not have written a dedicated script just to clean up a few hundred data points. You also wouldn&amp;rsquo;t manually adjust dozens of pages of mixed Chinese and English documents to achieve uniform formatting. And you certainly wouldn&amp;rsquo;t read through every single webpage to gather materials for an ad-hoc proposal.
Things are different now.
As long as the cost is low enough and the barrier is low enough, these tasks that were previously considered &amp;ldquo;not worth the effort&amp;rdquo; suddenly become worthwhile. You no longer hesitate over whether or not to do it; instead, you just throw it to a cheap model or a local model to run through first.
This is what I see as the most realistic change.
Powerful models are responsible for tackling core problems, weaker models handle miscellaneous tasks, and local models provide fallback and batch processing.
With this division of labor, the entire workflow becomes smooth.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;So, the final word remains: don&amp;rsquo;t always try to make one model conquer everything.
Models like &lt;code&gt;MiniMax&lt;/code&gt; are weak in capability, but they aren&amp;rsquo;t useless. If you use them to tackle complex engineering tasks, vague requirements, or multi-turn reasoning, you will naturally be disappointed; however, if you use them for data cleaning, document drafting, or searching for proposal materials, they often work quite smoothly.
The same applies to local models around &lt;code&gt;12B&lt;/code&gt;. Their purpose isn&amp;rsquo;t to prove that &amp;ldquo;I no longer need cloud flagships,&amp;rdquo; but rather to reliably move stable, repetitive, sensitive, and high-volume tasks back onto their own machines.
Simply put: don&amp;rsquo;t let a weak model do what it is not good at.
Place them in the right role, and they will have real value.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.minimax.io/news/minimax-m25&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniMax M2.5: Built for Real-World Productivity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://platform.minimaxi.com/docs/guides/text-generation&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniMax Open Platform: Text Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://qwenlm.github.io/blog/qwen2.5/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2.5: A Party of Foundation Models!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2.5-14B-Instruct-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/docs/core&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 3 model overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;writing-notes&#34;&gt;Writing Notes
&lt;/h2&gt;&lt;h3 id=&#34;original-prompt&#34;&gt;Original Prompt
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;Minimax&amp;rsquo;s large model is weak in capability, but it&amp;rsquo;s fine for tasks like data cleaning, document writing, and searching for proposal materials; with the same logic, deploying a large model locally for translation or data cleaning work is also good. The model parameter count is around 12b, and even a local GPU like the RTX 3060 with 12GB can handle it.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h3 id=&#34;writing-outline-summary&#34;&gt;Writing Outline Summary
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Retained the core judgment of &amp;ldquo;don&amp;rsquo;t force weak models onto hard tasks,&amp;rdquo; and did not write it as a model leaderboard comparison.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;MiniMax&lt;/code&gt; section is mainly based on the official positioning for programming, searching, and office work, then applies this judgment back to real-world tasks like data cleaning, document handling, and information retrieval.&lt;/li&gt;
&lt;li&gt;For local models, I selected two officially sourced options, &lt;code&gt;Qwen2.5&lt;/code&gt; and &lt;code&gt;Gemma 3&lt;/code&gt;: one cited for its multilingual and structured-output support, the other for its &lt;code&gt;12B&lt;/code&gt; size and documented VRAM requirements.&lt;/li&gt;
&lt;li&gt;The description for the &lt;code&gt;3060 12GB&lt;/code&gt; was intentionally phrased as &amp;ldquo;capable, but don&amp;rsquo;t get too carried away,&amp;rdquo; to avoid presenting quantized inference as an absolute conclusion.&lt;/li&gt;
&lt;li&gt;In the conclusion, I re-categorized strong models, weak models, and local models based on their respective roles, making the main thread more focused.&lt;/li&gt;
&lt;/ul&gt;</description>
        </item>
        
    </channel>
</rss>
