The initial design of the blog translation project was overly complex: first parse the Markdown, then protect the content with placeholders, and finally send it to a large model for translation. This was entirely unnecessary. Large models already recognize Markdown syntax, so they can translate the original content directly while preserving its formatting.
Our work therefore shifted from debugging code to debugging the model's prompts.
- Model: google/gemma-3-4b
- Hardware: NVIDIA RTX 3060 (12 GB)
We deliberately chose a non-thinking model, since thinking models are inefficient at translation tasks. We also compared the 4B and 12B variants: for translation, gemma3's 4B model was sufficient, and the 12B model offered no significant advantage while being far slower.
12B speed: 11.32 tok/sec; 4B speed: 75.21 tok/sec.
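For illustration, here is a minimal sketch of the simplified approach described above, assuming a local OpenAI-compatible endpoint such as the one LM Studio exposes on port 1234; the URL, prompt wording, and helper name are placeholders rather than the project's actual code:

import requests

SYSTEM_PROMPT = (
    "You are a translator. Translate the user's Markdown into English. "
    "Preserve all Markdown syntax (headings, links, code fences) exactly. "
    "Output only the translated Markdown, with no explanations."
)

def translate_markdown(markdown_text: str) -> str:
    # Send the raw Markdown straight to the model; no parsing, no placeholders.
    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "google/gemma-3-4b",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": markdown_text},
            ],
            "temperature": 0.2,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]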
Background
Despite adding various constraints to the system prompt, the translated output still had problems: Markdown formatting was not preserved, and extraneous explanatory text crept in. Even though the role definition explicitly told the model to protect the Markdown format and output only the translation, the results remained unstable.
At this point I remembered a comic translation project I had come across earlier, which also relied on large models, and whose translations seemed better than mine. After reading its code and comparing the request payloads, I found that the comic project sent a chunk of context with every request: in addition to the text currently being translated, it also included the previously translated content.
What did this buy? It not only improved coherence between consecutive translations, it also kept the output format stable, as the sketch below shows.
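Here is a minimal sketch of that idea applied to the blog translator; the helper names and the window size are illustrative assumptions, not code from either project. Each request replays the last few source/translation pairs as user/assistant turns before the new chunk:

from collections import deque

# Rolling window of previous (source, translation) pairs.
history = deque(maxlen=3)

def build_messages(system_prompt: str, chunk: str) -> list[dict]:
    messages = [{"role": "system", "content": system_prompt}]
    # Replay earlier chunks and their translations as user/assistant turns,
    # so the model sees exactly how previous output was formatted.
    for source, translation in history:
        messages.append({"role": "user", "content": source})
        messages.append({"role": "assistant", "content": translation})
    messages.append({"role": "user", "content": chunk})
    return messages

def remember(chunk: str, translation: str) -> None:
    # Call after each successful request so the next one carries fresh context.
    history.append((chunk, translation))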
The Importance of Conversation History
As large AI models (the GPT series, Claude, Gemini, and so on) become more widespread, more and more businesses and developers are calling them through APIs to build intelligent customer service, content generation, code assistants, and other applications. However, many people hit a common problem when they first integrate: the model's outputs are disjointed, lack contextual understanding, and sometimes answer the wrong question entirely.
A key reason for this is simple: the API requests do not include the conversation history.
What Is Conversation History?
Conversation history refers to the exchanges between the user and the model within a single session. In most large language model APIs (such as OpenAI's Chat Completions API), developers must construct the complete messages array themselves, passing the earlier turns in order as user and assistant messages.
Example
{
  "model": "gpt-4",
  "messages": [
    {"role": "user", "content": "Write me a resignation letter"},
    {"role": "assistant", "content": "Okay, what do you want to write about as the reason for your resignation?"},
    {"role": "user", "content": "I want to pursue personal career development"}
  ]
}
If you only send the last sentence:
{"role": "user", "content": "I want to pursue personal career development"}
The model won’t know you are writing a resignation letter, and its output quality will be very poor because it doesn’t understand the context.
Why Is Conversation History So Important?
1. It Builds Context and Improves Coherence
AI models are inherently "context-driven." They cannot remember anything that happened "previously" unless you explicitly tell them. By passing in the dialogue history, the model can better understand your intent and the context of the topic, producing output much closer to what you expect.
2. It Reduces Misunderstandings
If you want the model to carry out a multi-turn task, such as writing, summarizing, or debugging code, the historical context lets it build up understanding step by step instead of going off-topic or losing the thread midway.
3. It Simulates Realistic Human Dialogue
In practical applications such as customer service systems, educational assistants, and health consultations, user questions usually unfold gradually rather than being stated clearly all at once. Preserving the dialogue history lets the AI behave more like an assistant that actually remembers.
How to Correctly Include Conversation History in an API Request
Using OpenAI’s API as an example, we recommend following the structure below:
import openai  # legacy (pre-1.0) SDK; newer versions use OpenAI().chat.completions.create

messages = [
    {"role": "system", "content": "You are a professional legal assistant"},
    {"role": "user", "content": "What are the essential conditions for a contract?"},
    {"role": "assistant", "content": "Contract validity requires fulfilling several conditions: ..."},
    {"role": "user", "content": "Does an oral agreement count?"}
]

# Send the full history, not just the latest question
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages
)
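To continue the conversation, append the model's reply and the next user turn to the same list before the next call. A brief sketch continuing the example above (the follow-up question is invented for illustration):

# Keep the assistant's reply in the history, then add the next question
messages.append(response["choices"][0]["message"])
messages.append({"role": "user", "content": "What if there is no written contract at all?"})
response = openai.ChatCompletion.create(model="gpt-4", messages=messages)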
Note:
- Use the system message to set the model's behavior and identity.
- Keep only the recent, relevant turns rather than the entire history, to avoid exceeding token limits.
- In long sessions, truncate the early content and keep a summary of the core information to control token consumption; a truncation sketch follows below.
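For instance, a minimal truncation sketch; the turn budget and helper name are illustrative assumptions (in practice you would trim by token count with your model's tokenizer):

def trim_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    # Keep any system messages plus only the most recent turns.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]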
Practical Recommendations
- Dialogue state management: the backend needs a caching layer (e.g. Redis or a database) to record each user's conversation history; see the sketch after this list.
- Length limits: context windows are finite (GPT-4 Turbo offers 128k tokens, Claude 3 offers 200k, with 1M available to select customers), so the history must be truncated sensibly.
- Dynamic summarization of history: when the history grows too long, have a model summarize the older turns first, then add that summary to the dialogue context instead.
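As a rough illustration of the first and third recommendations, here is a sketch of Redis-backed history storage with summarization of older turns. The key naming, the turn threshold, and the summarize() stub are assumptions made for the example, not part of any specific product:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_history(user_id: str) -> list[dict]:
    # Each user's history is stored as a Redis list of JSON-encoded messages.
    return [json.loads(m) for m in r.lrange(f"chat:{user_id}", 0, -1)]

def append_message(user_id: str, role: str, content: str) -> None:
    r.rpush(f"chat:{user_id}", json.dumps({"role": role, "content": content}))

def summarize(messages: list[dict]) -> str:
    # Stub: in practice, send the old turns to a model with a
    # "summarize this conversation" prompt and return its reply.
    return "Summary of the earlier conversation: ..."

def compact_history(user_id: str, keep_last: int = 6) -> list[dict]:
    # Replace everything except the most recent turns with a single summary.
    history = load_history(user_id)
    if len(history) <= keep_last:
        return history
    summary = summarize(history[:-keep_last])
    return [{"role": "system", "content": summary}] + history[-keep_last:]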
Summary
Large AI models are powerful, but developers need to "feed" them enough contextual information. Including the conversation history in API requests not only significantly improves the quality and coherence of the model's output, it also gives users a more natural, realistic conversation.
Whether you are building AI customer service, writing assistants, coding helpers, or educational applications, this is an optimization you cannot afford to ignore.