I have been using AI heavily for programming in a recent project, probably the deepest AI integration in my work over the past three years. These notes are not systematic; I simply wrote down whatever came to mind.
Background
Linux environment, backend service development; no UI or frontend work involved.
Models
I have tried the domestic "Big Three" (MiniMax, GLM, and Kimi), and Kimi performed the best. Claude handles large requests well by breaking them down, while Codex is the most suitable for production environments; it is exceptionally cautious.
- Claude is currently the strongest all-rounder; in the programming arena nothing beats it, but it is expensive.
- MiniMax offers the best value for money, with fast enough speed and stable performance; it is a beneficiary of the "Shrimp Wave."
- Codex is generally good, but its instruction following sometimes goes astray: it keeps trying to optimize and improve performance when I don't need that. For unit tests, I prefer verbose output that lets me clearly understand each case.
- Kimi has a very high instruction-following rate and feels the most natural to use among the domestic models.
- GLM was fine before the holiday, but suffered a severe compute shortage afterwards, so I abandoned it.
Positioning
AI's breadth of knowledge far exceeds any individual's. For some module designs, discussing the solution with AI can effectively extend your chain of thought and lead to a more reasonable design.
A senior mentor and a capable assistant.
Localization
If you don't understand something, you can ask it; if you have a clearly defined development task, you can hand it over for execution. In essence, you have brought on a very capable assistant.
Issues
Domestic models hit widespread compute shortages after the Spring Festival, making output excessively slow. They are affordable and good value, but the sluggishness significantly hurt interactive efficiency in real-world use. The chaos around Zhihuo during the Spring Festival (accusations of betraying developers, arbitrary changes to package pricing) escalated dramatically, culminating in an apology released on the fifth day of the New Year. Their internal processes were also somewhat disorganized: after I requested a refund, they refunded my historical packages in full, even though the announcement said only upgrades would be refunded while older packages would retain their entitlements.
Zhihuo's initial omission of weekly limits was a misjudgment; purchases now include them, which shows they underestimated users' ability to "clip coupons." The full refund also ended my coupon-clipping. In instruction following and adherence, GLM-4.7's capabilities are comparable to Kimi-2.5.
Notably, the output of every model currently still requires manual review.
Unit Testing
In the initial project design, each module was made independently testable via unit tests. During later development, I found that the large model would generate code and then write its own unit test cases, and most scenarios passed outright. Since this was not test-driven development, the purpose of unit testing shifted to the later business-iteration and refactoring phases: auditing AI modifications to ensure they do not break existing functionality.
Performance Testing
For some core functions, if AI were not involved, development would likely not bother with performance testing at all. With AI, a supplementary benchmark report can be generated to see what the numbers look like.
Documentation
Maintaining documentation is laborious for humans, but AI is different: it can keep the documentation synchronized with the latest code branch as it modifies the code.
New Ability Analysis
We tried using Codex for service performance optimization after granting it authorization. It could invoke perf for analysis automatically, but it was not smart enough. It identified frequent memory allocation as the cause of the inefficiency, but failed to see that the allocations were driven by excessive loop iterations: a logical flaw in the code where numerous temporary variables were constructed and destructed inside the loop on every pass.
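The pattern described above can be sketched as follows. The function names and data shapes are hypothetical, not the project's code; the point is that perf attributes the cost to the allocator, while the actual flaw is a temporary constructed and destructed on every iteration, which a loop-shape fix removes.

```cpp
#include <cstddef>
#include <vector>

// The flaw: a temporary buffer is constructed and destructed on every
// iteration, so a profiler sees "frequent allocation" as the hot spot
// even though the loop shape is the real cause. (Illustrative code.)
size_t slow_sum_sizes(const std::vector<std::vector<int>>& rows) {
    size_t total = 0;
    for (const auto& row : rows) {
        std::vector<int> tmp(row);  // allocation + copy, once per iteration
        total += tmp.size();
    }
    return total;
}

// The fix: hoist the buffer out of the loop so its capacity is reused
// across iterations (or avoid the copy entirely where possible).
size_t fast_sum_sizes(const std::vector<std::vector<int>>& rows) {
    std::vector<int> tmp;  // reused across iterations
    size_t total = 0;
    for (const auto& row : rows) {
        tmp.assign(row.begin(), row.end());  // reuses existing capacity
        total += tmp.size();
    }
    return total;
}
```

Both functions compute the same result; only the allocation pattern differs, which is exactly the distinction the model failed to draw.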
Process
Maintaining an AI-assisted project still involves human intervention and module-by-module, function-by-function iterative development. We do not expect AI to continuously maintain new features on its own; writing each prompt is like hand-crafting a small development plan, spelling out which modules are involved and where the most appropriate modifications are.
I have not tried many of the workflows circulating online in my environment; the process used here is relatively traditional, and it feels the most intuitive.