Prompt Engineering for Production LLM Apps

Savan Padaliya · April 27, 2026 · 7 min read

Production LLM apps live and die on prompt quality. Temperature, context injection, and output format are mentioned everywhere as afterthoughts — this post teaches them properly, with code you can copy into a real system.

What Prompt Engineering Actually Is

Prompt engineering is not about finding magic words. It's about understanding that an LLM is a function from text to text, and then designing inputs that reliably produce useful outputs.

The three things that matter most in production:

  1. System prompt — defines the model's role, constraints, and output format
  2. Context injection — what information you provide at runtime (from RAG retrieval or your database)
  3. Generation parameters — temperature, max tokens, top_p — the settings that control the output distribution

Get these right and model selection matters less. Get them wrong and even GPT-4o will fail consistently.

System Prompts: The Foundation

The system prompt runs before every user message. It is where you set the model's persona, what it should and should not do, the output format, and any domain-specific rules.

Bad system prompt:

You are a helpful assistant.

Production system prompt:

const SYSTEM_PROMPT = `You are a customer support agent for Acme SaaS, a project management tool.

Your job:
- Answer questions about Acme features, pricing, and billing
- Help users troubleshoot issues using the provided documentation context
- Escalate to a human agent when you cannot resolve an issue

Rules:
- Only answer questions about Acme. Politely decline off-topic requests.
- If the answer is not in the provided context, say "I don't have that information" — do not guess.
- Always end with "Is there anything else I can help you with?"

Output format:
- Short paragraphs, no bullet lists unless listing steps
- Use plain language, avoid jargon
- Maximum 3 paragraphs per response`;

The difference is specificity. Every vague instruction ("be helpful") should become a concrete rule ("if you can't answer, say X"). Vague instructions produce inconsistent behavior — concrete rules produce predictable outputs.

Temperature: The Most Misunderstood Parameter

Temperature controls randomness in the output distribution. Higher temperature means more varied, creative output. Lower means more deterministic, focused output.

Use Case                              Temperature
Fact-based Q&A, RAG                   0
Structured data extraction            0
Code generation                       0.1–0.3
Summarization                         0.3–0.5
Marketing copy, creative writing      0.7–1.0
Brainstorming, ideation               1.0+

For most production use cases, start at 0 or 0.1 and increase only if outputs feel robotic or repetitive. temperature: 0 is not "dumb" — it means the model always picks the most likely next token. For factual tasks, that is exactly what you want.

const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'system', content: SYSTEM_PROMPT },
    { role: 'user', content: userMessage },
  ],
  temperature: 0,
  max_tokens: 500, // always set this — unbounded output is expensive
});

Context Injection: Structuring RAG Inputs

When you retrieve documents from a vector database and inject them into the prompt, how you structure the context matters as much as what you retrieve.

Bad context injection:

const prompt = `${retrievedDocs.join(' ')} Answer this question: ${userQuestion}`;

Production context injection:

function buildPromptWithContext(userQuestion, retrievedDocs) {
  const contextBlock = retrievedDocs
    .map(
      (doc, i) =>
        `[Document ${i + 1}]\nSource: ${doc.metadata.source}\n${doc.pageContent}`
    )
    .join('\n\n---\n\n');

  return `You are answering a question based on the following documentation.
Use only information from these documents. If the answer is not present, say so.

DOCUMENTATION:
${contextBlock}

QUESTION:
${userQuestion}

ANSWER:`;
}

The improvements:

  • Clear delimiters — [Document N] headers tell the model where one document ends and another begins
  • Source attribution — the model can reference where information came from
  • Explicit instruction — "use only information from these documents" reduces hallucination
  • Position — most models attend more strongly to content at the beginning and end of long prompts; put the most important context first
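
To wire this into a completion call, here is a minimal usage sketch, assuming the openai client from earlier and that retrievedDocs already came back from your vector store query:

const prompt = buildPromptWithContext(userQuestion, retrievedDocs);

const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: prompt }],
  temperature: 0, // factual RAG answers, no creativity needed
  max_tokens: 500,
});

console.log(completion.choices[0].message.content);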

Few-Shot Examples

Few-shot prompting gives the model 2–5 examples of input → output pairs before the real input. It is the fastest way to enforce a specific output format.

const SYSTEM_WITH_EXAMPLES = `You extract structured data from support tickets.

Output valid JSON only. No other text.

Examples:

Input: "Hi, I've been trying to log in for the past hour and keep getting 'invalid credentials'."
Output: {"issue_type": "authentication", "urgency": "high", "action_required": "password_reset_failure"}

Input: "Can you remind me how to export my project to PDF?"
Output: {"issue_type": "feature_question", "urgency": "low", "action_required": "documentation_link"}

Input: "Your billing system charged me twice this month!"
Output: {"issue_type": "billing", "urgency": "high", "action_required": "refund_review"}`;

Three examples is usually enough. More than five often reduces rather than improves consistency — the model starts trying to pattern-match the variety instead of following the format.
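
A minimal sketch of this prompt in a call, assuming the openai client from earlier; the ticket text is illustrative, and response_format is covered in the next section:

const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'system', content: SYSTEM_WITH_EXAMPLES },
    { role: 'user', content: 'Input: "The mobile app logs me out every time I switch projects."' },
  ],
  response_format: { type: 'json_object' }, // see the next section
  temperature: 0, // extraction should be deterministic
  max_tokens: 200,
});

const ticket = JSON.parse(completion.choices[0].message.content);
// ticket.issue_type, ticket.urgency, ticket.action_required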

Structured Output with JSON Mode

For any use case where you need to parse the model's output programmatically, use JSON mode:

const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'system', content: 'Extract ticket metadata. Output JSON only.' },
    { role: 'user', content: ticketText },
  ],
  response_format: { type: 'json_object' }, // guarantees valid JSON
  temperature: 0,
});

const metadata = JSON.parse(completion.choices[0].message.content);

Never parse JSON from free-form text with a regex. Use JSON mode — it guarantees parseable output and eliminates an entire class of production errors.
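
JSON mode guarantees syntactically valid JSON, not that the fields you expect are present, so validating the parsed object before using it is still worth it. A minimal sketch using zod (any schema validator works; the field names follow the ticket example above, and the urgency values are an assumption):

import { z } from 'zod';

const TicketMetadata = z.object({
  issue_type: z.string(),
  urgency: z.enum(['low', 'medium', 'high']), // assumed set of values
  action_required: z.string(),
});

// Fails fast with a clear error instead of breaking downstream code
const metadata = TicketMetadata.parse(
  JSON.parse(completion.choices[0].message.content)
);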

Chain-of-Thought for Complex Reasoning

For tasks that require multi-step reasoning, tell the model to show its work before giving the answer:

const COT_SYSTEM = `When solving a problem, always:
1. Restate what is being asked
2. List the relevant information provided
3. Work through your reasoning step by step
4. State your final answer clearly, prefixed with "ANSWER:"

Do not skip directly to the answer.`;

This works because intermediate tokens influence the final output — forcing the model to reason explicitly conditions the final answer on that reasoning, which tends to make it more accurate. It is especially relevant for LangChain.js agent planning steps, where each reasoning step becomes a tool decision.
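
Because the final answer is prefixed with "ANSWER:", pulling it out of the response is a simple string operation. A minimal sketch, assuming the completion was generated with COT_SYSTEM:

function extractFinalAnswer(responseText) {
  // Everything before "ANSWER:" is reasoning; keep only the final answer
  const match = responseText.match(/ANSWER:\s*([\s\S]*)$/);
  return match ? match[1].trim() : responseText.trim(); // fall back to the full text
}

const answer = extractFinalAnswer(completion.choices[0].message.content);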

Prompt Versioning

In production, prompts are code. Version them like code.

const PROMPTS = {
  'support-v1': {
    system: `...original prompt...`,
    temperature: 0.2,
    maxTokens: 500,
  },
  'support-v2': {
    system: `...revised prompt with examples...`,
    temperature: 0,
    maxTokens: 400,
  },
};

function getPrompt(version = 'support-v2') {
  return PROMPTS[version];
}

When you're testing a prompt change, run both versions against the same inputs and compare outputs before deploying. LangSmith (covered in How to Monitor AI Pipelines in Production) has a built-in prompt comparison tool. On OpenAI and Vertex AI, provider playgrounds let you run evaluations before switching production traffic.
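
If you want a quick in-house comparison before reaching for those tools, here is a minimal sketch, assuming a small array of representative test inputs and the openai client from earlier:

async function compareVersions(testInputs, versions = ['support-v1', 'support-v2']) {
  const results = [];
  for (const input of testInputs) {
    const row = { input };
    for (const version of versions) {
      const { system, temperature, maxTokens } = getPrompt(version);
      const completion = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          { role: 'system', content: system },
          { role: 'user', content: input },
        ],
        temperature,
        max_tokens: maxTokens,
      });
      row[version] = completion.choices[0].message.content;
    }
    results.push(row);
  }
  return results; // review side by side, or score with an eval harness
}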

Common Pitfalls

Instruction conflation: Mixing format instructions ("respond in JSON") with content instructions ("only discuss billing") in the same paragraph. Separate them into labeled sections.

Missing negative instructions: "Be concise" is vague. "Do not exceed 3 sentences" is enforceable. Models respond better to explicit constraints than vague preferences.

Unbounded output: Always set max_tokens. Without it, a verbose model can produce thousands of tokens on a simple question — your costs scale with output length.

Prompt injection: If user input is interpolated directly into the system prompt, a user can break out of your instructions. Always inject user input in the user role message, never directly into the system prompt string.

// Vulnerable — user could write "Ignore all previous instructions..."
const systemPrompt = `You are a support agent. User said: ${userInput}`;

// Safe — user input stays in the user role
const messages = [
  { role: 'system', content: 'You are a support agent.' },
  { role: 'user', content: userInput }, // user input isolated here
];

Prompt engineering is not a one-time task. It is an ongoing process of measuring output quality, identifying failure modes, and incrementally tightening your prompts. The monitoring infrastructure you build alongside it is what makes that iteration fast.

Savan Padaliya

Senior Full Stack Developer who ships faster with AI. Available for freelance, consulting, and project work.