From Agentic Pilot to Production, Part 2: Disobedient or just probabilistic?

In this second post in our series, From Agentic Pilot to Production, we look at the probabilistic nature of LLMs, the problems it can create for marketing leaders piloting Agentic AI, and how to add determinism to prompts and workflows to mitigate them.

A Quick Story

At RSG, we built a simple multi-agent flow to simulate an exciting agentic possibility. The process collected public signals, proposed campaigns, generated copy, and organized a calendar in Sheets. It looked slick. But we ran into a recurring problem: even with explicit instructions, sometimes repeated, the agent would ignore them and wander.

(You can read more in the first post in this series, From Agentic Pilot to Production, Part 1: Autonomy with Brakes: Why Refusal Comes First.)

Even the big AI vendors face similar issues in production. Salesforce leaders recently acknowledged declining trust in LLM-driven agents after real-world failures.

The lesson here is to treat LLM output as probabilistic: outputs will vary across runs and won't be exactly repeatable.

Disobedient or Probabilistic?

 

Key Instances from Our Demo:

Spreadsheet Structure Inconsistency

We repeatedly asked for specific Google Sheets formats. Agents would sometimes create different column structures or data arrangements. Even with explicit instructions, the output varied between runs. 

Output Format Deviations

When requesting "markdown format" or specific data structures, agents added extra prose, altered structure, or interpreted the request differently. We had to refine prompts multiple times to get consistent output, but it would still break occasionally. 

Tool Usage Interpretation

For the same task, agents would use different tool paths. Server queries varied across runs. Even when we specified exact methods, they'd take alternative approaches.

Data Extraction Variations

When analyzing competitor data or market research, the same instructions would yield different levels of detail or focus areas. Agents emphasized different aspects of the same data across runs.

All of this happens not because of bugs but because of LLM characteristics such as:

  • Temperature/randomness in sampling
  • Variation in how context is interpreted
  • Attention mechanisms that can focus differently each run
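These characteristics are easiest to see in miniature. Here is a minimal sketch, in plain Python with no LLM API, of how temperature scaling sharpens or flattens a probability distribution over candidate outputs:

```python
import math

def softmax_with_temperature(scores, temperature):
    """Convert raw candidate scores into a probability distribution.

    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more run-to-run variation).
    """
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Two near-tied candidate continuations and one weaker one.
scores = [2.0, 1.9, 0.5]

cold = softmax_with_temperature(scores, temperature=0.1)
hot = softmax_with_temperature(scores, temperature=2.0)

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At low temperature the top candidate dominates; at high temperature the near-tie stays a near-tie, so repeated sampling will pick different options. That is exactly the run-to-run variation described above.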

For example,

In traditional programming:

if x == 5: do_exactly_this() → Always the same result

There is a 100% predictable execution path.

But that is not the case with LLM agents.

"Create a budget spreadsheet" → Multiple valid interpretations

There is probabilistic decision-making at each step.

A deterministic system returns exactly what you specify. A probabilistic system generates what seems most likely under a distribution. Small spec changes, near-tie probabilities, or alternate tool paths can therefore change what you get back.

Tool use is similar. Models can make valid calls and still pass the wrong or stale arguments as chains get deeper. Reliability falls as dependency depth grows.  

 

What can you do?

You cannot make an LLM deterministic. You can contain variance with clearer contracts and checks, then reduce it with better prompts and review loops.

 

1. Prompt Engineering

Most LLM vendors publish prompt-engineering guidance, and no matter how specific you make your prompt, there's always room to improve it. Here are some tips:

Prefer Specificity Over Generality:

Wrong: "Create a budget spreadsheet"

Better: "Create a Google Sheets budget with these exact columns: Item, Q1 Cost, Q2 Cost, Q3 Cost, Q4 Cost, Total. Use currency format for costs. Add formulas for totals."

Provide Examples and Templates:

Provide one exact example of final output. Use structured formats: “Output as JSON with these keys: …” Include a single “Do not include commentary” note.
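A minimal sketch of such a prompt template (the keys and the example values are illustrative, not from our actual pilot):

```python
import json

# One exact example of the final output we expect back.
EXAMPLE_OUTPUT = {
    "campaign": "Spring Launch",
    "channel": "email",
    "budget": 5000,
}

PROMPT = f"""Propose one campaign for Q2.

Output as JSON with these keys: campaign, channel, budget.
Here is one exact example of the expected output:
{json.dumps(EXAMPLE_OUTPUT, indent=2)}

Do not include commentary; return only the JSON object."""

print(PROMPT)
```

Anchoring the request to a concrete example plus an explicit key list leaves the model far fewer valid interpretations to choose between.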

List Constraints and Boundaries:

State hard rules: allowed sources only, date format YYYY-MM-DD, max length, approved channels. Treat violations as errors, not suggestions.
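Those hard rules can be enforced in code rather than trusted to the model. A minimal sketch (the field names, source whitelist, and limits here are illustrative assumptions):

```python
import re

ALLOWED_SOURCES = {"press_release", "earnings_call", "company_blog"}  # example whitelist
MAX_SUMMARY_LENGTH = 500
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # YYYY-MM-DD

def validate_record(record):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    if record.get("source") not in ALLOWED_SOURCES:
        errors.append(f"disallowed source: {record.get('source')!r}")
    if not DATE_PATTERN.match(record.get("date", "")):
        errors.append(f"bad date format: {record.get('date')!r}")
    if len(record.get("summary", "")) > MAX_SUMMARY_LENGTH:
        errors.append("summary exceeds max length")
    return errors

good = {"source": "press_release", "date": "2025-03-01", "summary": "Q1 launch."}
bad = {"source": "random_forum", "date": "03/01/2025", "summary": "Short."}

assert validate_record(good) == []
assert len(validate_record(bad)) == 2  # source and date both fail
```

Treating each violation as an error that blocks the pipeline, rather than as a suggestion, is what turns the prompt's rules into an actual contract.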

2. Validation Steps

Build Validation Agents

Add a lightweight quality checker that verifies schema, required columns, and constraints before anything is committed.
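For the Sheets case above, such a checker might verify the column schema before anything is committed. A minimal sketch, assuming rows arrive as dicts and reusing the budget columns from earlier as the required schema:

```python
REQUIRED_COLUMNS = ["Item", "Q1 Cost", "Q2 Cost", "Q3 Cost", "Q4 Cost", "Total"]

def check_sheet_rows(rows):
    """Verify the agent's proposed rows match the required column schema.

    `rows` is a list of dicts, one per spreadsheet row.
    Returns (ok, problems) where problems lists every schema drift found.
    """
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        extra = [c for c in row if c not in REQUIRED_COLUMNS]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        if extra:
            problems.append(f"row {i}: unexpected columns {extra}")
    return (not problems, problems)

ok, problems = check_sheet_rows([
    {"Item": "Ads", "Q1 Cost": 100, "Q2 Cost": 120,
     "Q3 Cost": 90, "Q4 Cost": 110, "Total": 420},
    {"Item": "Events", "Quarter 1": 50},  # drifted schema: should be flagged
])
assert not ok
```

Running this gate before the write means a drifted column structure becomes a caught error instead of a silently wrong spreadsheet.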

Develop Multi-Stage Workflows

Research → Draft → Review → Revise → Final Output. Keep each task scoped and checkable.
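That staged flow can be sketched as a pipeline of small, individually checkable functions (the stage bodies here are placeholders for real agent calls):

```python
def research(topic):
    # Placeholder: in a real flow this would call a research agent/tool.
    return {"topic": topic, "facts": ["fact A", "fact B"]}

def draft(notes):
    return f"Draft about {notes['topic']}: " + "; ".join(notes["facts"])

def review(text):
    # Placeholder reviewer: flag drafts that are too short to be useful.
    return [] if len(text) >= 20 else ["draft too short"]

def revise(text, issues):
    if not issues:
        return text
    return text + " [revised to address: " + ", ".join(issues) + "]"

def run_pipeline(topic):
    notes = research(topic)
    text = draft(notes)
    issues = review(text)
    return revise(text, issues)

final = run_pipeline("Q3 campaign")
print(final)
```

Because each stage has one scoped job and a defined output, you can validate between stages instead of hoping the whole run comes out right end to end.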

Define Output Verification Tasks

Add explicit checks for format compliance and cross-reference against the original requirements and source list.

3. Iterative Refinement

Create Feedback Loops

Have the agent self-review against the requirement list. Persist reviewer notes for the next run.

Implement Context Chaining

Pass validator feedback forward. Build accuracy across the chain rather than hoping for a perfect first pass.
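A minimal sketch of folding validator feedback into the next attempt's prompt (the template wording is an assumption, not a prescribed format):

```python
def build_next_prompt(base_prompt, previous_output, feedback):
    """Fold validator feedback into the next attempt's prompt
    so each retry starts from known problems, not from scratch."""
    if not feedback:
        return base_prompt
    notes = "\n".join(f"- {f}" for f in feedback)
    return (
        f"{base_prompt}\n\n"
        f"Your previous attempt was:\n{previous_output}\n\n"
        f"It failed these checks:\n{notes}\n"
        "Fix only these problems and keep everything else identical."
    )

prompt = build_next_prompt(
    "Create the budget table as JSON.",
    '{"Item": "Ads"}',
    ["missing columns: Q1 Cost, Total"],
)
print(prompt)
```

Chaining context this way narrows the model's search space on each pass, which is usually cheaper and more reliable than regenerating blind.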

Focus on Error Recovery

Design for retries and fallbacks. If a Sheet write fails, produce a CSV or Markdown table and log the failure.
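A minimal sketch of that retry-then-fallback pattern; `write_to_sheet` is a hypothetical stand-in for the real Sheets call, made to fail here so the fallback path runs:

```python
import csv
import io
import logging

def write_to_sheet(rows):
    # Hypothetical Sheets writer; always fails here to exercise the fallback.
    raise ConnectionError("Sheets API unavailable")

def rows_to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def write_with_fallback(rows, retries=2):
    for attempt in range(retries):
        try:
            write_to_sheet(rows)
            return "sheet"
        except ConnectionError as exc:
            # Log the failure so the degraded run is visible, not silent.
            logging.warning("Sheet write failed (attempt %d): %s", attempt + 1, exc)
    # Fallback: produce a CSV the team can import manually.
    return rows_to_csv(rows)

result = write_with_fallback([{"Item": "Ads", "Total": 420}])
```

The agent still delivers something usable when a tool is down, and the logged failure gives you the trail you need to fix the real path.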

Bottom line

Your agent is not disobeying you. Generative AI is probabilistic by nature. So stage your work with contracts, caps, schemas, and an explicit refusal path. You'll get safer speed and fewer rollbacks. That's how pilots survive contact with production.

If your firm is an RSG corporate member, you have access to the complete case study and learnings, as well as a private review of your agentic strategy to date. For more practical support converting your pilots to productive solutions, contact us about consulting offerings.

Other Agent AI for Marketing posts

Getting ROI from Agentic Marketing Starts with the Foundation

Marketers are chasing the next AI win - faster content, smarter insights, automated decisions. But the organizations actually getting return on investment from AI in marketing all share one thing in common: they built the foundation first.