From Agentic Pilot to Production, Part 2: Disobedient or just probabilistic?
In this second post in our series, From Agentic Pilot to Production, we look at the probabilistic nature of LLMs, the issues it can create for marketing leaders piloting agentic AI, and how to mitigate them by adding determinism to your prompts and workflows.
A Quick Story
At RSG, we built a simple multi-agent flow to simulate an exciting agentic possibility. The process collected public signals, proposed campaigns, generated copy, and organized a calendar in Sheets. It looked slick. But we ran into a recurring problem: despite explicit instructions, sometimes repeated across runs, the agent just wouldn't follow them. It ignored clear directions and wandered.
(You can read more in the first post in this series, From Agentic Pilot to Production, Part 1: Autonomy with Brakes: Why Refusal Comes First.)
Even the big AI vendors face similar issues in production. Salesforce leaders recently acknowledged declining trust in LLM-driven agents after real-world failures.
The lesson here is to treat LLM output as probabilistic: outputs vary from run to run and are not repeatable.

Key Instances from Our Demo:
Spreadsheet Structure Inconsistency
We repeatedly asked for specific Google Sheets formats. Agents would sometimes create different column structures or data arrangements. Even with explicit instructions, the output varied between runs.
Output Format Deviations
When requesting "markdown format" or specific data structures, agents added extra prose, altered structure, or interpreted the request differently. We had to refine prompts multiple times to get consistent output, but it would still break occasionally.
Tool Usage Interpretation
For the same task, agents would use different tool paths. Server queries varied across runs. Even when we specified exact methods, they'd use alternative approaches.
Data Extraction Variations
When analyzing competitor data or market research, the same instructions would yield different levels of detail or focus areas. Agents emphasized different aspects of the same data across runs.
All of this happens not because of bugs but because of LLM characteristics such as:
- Temperature and randomness in sampling
- Variable interpretation of context
- Attention mechanisms that focus differently each run
For example, in traditional programming:
if x == 5: do_exactly_this() → always the same result
The execution path is 100% predictable.
That is not the case with LLM agents:
"Create a budget spreadsheet" → multiple valid interpretations
Decision-making is probabilistic at each step.
A deterministic system returns exactly what you specify. A probabilistic system generates what seems most likely under a distribution. Small changes in the spec, near-tie probabilities between alternatives, or different tool paths can all change what comes back.
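A toy sketch can make the "near-tie probabilities" point concrete. The layouts and weights below are invented for illustration, not real model probabilities; the point is that the "same" request is a weighted draw, not a lookup.

```python
import random

# Toy next-token-style sampler: the model's "choice" is a draw from a
# probability distribution, not a deterministic lookup. The candidate
# layouts and their weights are invented purely for illustration.
def sample_column_order(seed=None):
    rng = random.Random(seed)
    candidates = [
        ("Item, Q1, Q2, Q3, Q4, Total", 0.55),  # the layout you asked for
        ("Item, Total, Q1, Q2, Q3, Q4", 0.30),  # a near-tie alternative
        ("Quarter, Item, Cost",         0.15),  # a different interpretation
    ]
    layouts, weights = zip(*candidates)
    return rng.choices(layouts, weights=weights, k=1)[0]

# Run the "same" request several times: the most likely layout usually
# wins, but the alternatives appear often enough to break a pipeline.
results = [sample_column_order() for _ in range(10)]
print(results)
```

Even with a 55% favorite, the other layouts show up in a meaningful fraction of runs, which is exactly the spreadsheet inconsistency we saw in the demo.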
Tool use is similar. Models can make valid calls and still pass the wrong or stale arguments as chains get deeper. Reliability falls as dependency depth grows.
What can you do?
You cannot make an LLM deterministic. You can contain variance with clearer contracts and checks, then reduce it with better prompts and review loops.
1. Prompt Engineering
Most LLM vendors publish prompting guidance, and no matter how specific you make your prompt, there's always room to improve it. Here are some tips:
Prefer Specificity Over Generality:
Wrong: "Create a budget spreadsheet"
Better: "Create a Google Sheets budget with these exact columns: Item, Q1 Cost, Q2 Cost, Q3 Cost, Q4 Cost, Total. Use currency format for costs. Add formulas for totals."
Provide Examples and Templates:
Provide one exact example of final output. Use structured formats: “Output as JSON with these keys: …” Include a single “Do not include commentary” note.
List Constraints and Boundaries:
State hard rules: allowed sources only, date format YYYY-MM-DD, max length, approved channels. Treat violations as errors, not suggestions.
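The three tips above can be combined into a single, tightly specified prompt. This is a sketch; the column names, constraints, and example row are assumptions for illustration, not a universal schema.

```python
# Sketch of a tightly specified prompt combining specificity, an exact
# example, and hard constraints. The schema here is illustrative only.
REQUIRED_COLUMNS = ["Item", "Q1 Cost", "Q2 Cost", "Q3 Cost", "Q4 Cost", "Total"]

def build_budget_prompt(items):
    return "\n".join([
        "Create a budget table for the items below.",
        f"Output as JSON with exactly these keys per row: {REQUIRED_COLUMNS}.",
        "Constraints:",
        "- Dates in YYYY-MM-DD format.",
        "- Costs as plain numbers, no currency symbols.",
        "- Do not include commentary or markdown fences.",
        "Example row:",
        '{"Item": "Paid search", "Q1 Cost": 12000, "Q2 Cost": 12000,'
        ' "Q3 Cost": 15000, "Q4 Cost": 15000, "Total": 54000}',
        "Items: " + ", ".join(items),
    ])

print(build_budget_prompt(["Paid search", "Events"]))
```

Keeping the prompt in code also means the same schema constant can drive your validation step, so the contract lives in one place.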
2. Validation Steps
Build Validation Agents
Add a lightweight quality checker that verifies schema, required columns, and constraints before anything is committed.
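A minimal version of that checker might look like this. The column names mirror the budget example earlier in the post; in a real system this would run before any Sheet write is committed.

```python
# Lightweight output validator: verifies required columns, numeric
# types, and a totals constraint before anything is committed.
REQUIRED_COLUMNS = ["Item", "Q1 Cost", "Q2 Cost", "Q3 Cost", "Q4 Cost", "Total"]

def validate_rows(rows):
    """Return a list of violations; an empty list means the output passes."""
    errors = []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            errors.append(f"row {i}: missing columns {missing}")
            continue
        for col in REQUIRED_COLUMNS[1:]:
            if not isinstance(row[col], (int, float)):
                errors.append(f"row {i}: {col} is not numeric")
        quarters = REQUIRED_COLUMNS[1:-1]
        if all(isinstance(row[c], (int, float)) for c in quarters):
            if row["Total"] != sum(row[c] for c in quarters):
                errors.append(f"row {i}: Total does not equal the quarterly sum")
    return errors
```

Treating violations as errors (a non-empty list) rather than suggestions is what turns a soft prompt instruction into an enforceable contract.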
Develop Multi-Stage Workflows
Research → Draft → Review → Revise → Final Output. Keep each task scoped and checkable.
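One way to sketch that staging: each step is a callable whose hand-off gets checked before the next one runs. The lambdas below are placeholders standing in for agent calls, not real model integrations.

```python
# Sketch of the staged workflow: each stage's output is checked before
# the next stage runs. Stages here are placeholder functions standing
# in for agent calls.
def run_staged(brief, stages, check):
    artifact = brief
    for name, stage in stages:
        artifact = stage(artifact)
        problems = check(name, artifact)
        if problems:
            raise ValueError(f"{name} failed checks: {problems}")
    return artifact

stages = [
    ("research", lambda x: x + " | findings"),
    ("draft",    lambda x: x + " | copy"),
    ("review",   lambda x: x + " | approved"),
]
result = run_staged("Q3 campaign", stages,
                    lambda name, art: [] if art else ["empty output"])
print(result)
```

Because every hand-off is small and checkable, a failure surfaces at the stage that caused it rather than in the final output.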
Define Output Verification Tasks
Add explicit checks for format compliance and cross-reference against the original requirements and source list.
3. Iterative Refinement
Create Feedback Loops
Have the agent self-review against the requirement list. Persist reviewer notes for the next run.
Implement Context Chaining
Pass validator feedback forward. Build accuracy across the chain rather than hoping for a perfect first pass.
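A feedback loop with context chaining can be sketched in a few lines. Here `generate` and `validate` are stand-ins for the model call and the validation agent; the loop feeds reviewer notes from one attempt into the next.

```python
# Sketch of a self-review loop: validator notes from one attempt are
# passed forward into the next generation. `generate` and `validate`
# are placeholders for the model call and the validation agent.
def refine(prompt, generate, validate, max_rounds=3):
    notes = []
    for _ in range(max_rounds):
        output = generate(prompt, notes)
        notes = validate(output)
        if not notes:
            return output   # passed all checks
    return None             # escalate to a human instead of shipping
```

The cap on rounds matters: without it, a borderline prompt can loop indefinitely, and returning None forces an explicit escalation path instead of silently shipping a failing output.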
Focus on Error Recovery
Design for retries and fallbacks. If a Sheet write fails, produce a CSV or Markdown table and log the failure.
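The Sheet-write fallback described above might be sketched like this. `write_to_sheet` is a hypothetical function passed in by the caller; only the retry-then-CSV pattern is the point.

```python
# Retry-with-fallback sketch: attempt the Sheet write, log failures,
# and fall back to a CSV string so the run still produces a usable
# artifact. `write_to_sheet` is a hypothetical caller-supplied function.
import csv
import io
import logging

def write_with_fallback(rows, columns, write_to_sheet, retries=2):
    for attempt in range(retries):
        try:
            return write_to_sheet(rows)
        except Exception as exc:
            logging.warning("Sheet write failed (attempt %d): %s",
                            attempt + 1, exc)
    # Fallback: emit CSV so the work isn't lost, and the log shows why.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The fallback output is degraded but inspectable, and the logged failures give you the evidence you need to fix the integration rather than rerunning blind.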
Bottom line
Your agent is not disobeying you. Generative AI is probabilistic by nature. So stage your work with contracts, caps, schemas, and an explicit refusal path. You'll get safer speed and fewer rollbacks. That's how pilots survive contact with production.
If your firm is an RSG corporate member, you have access to the complete case study and learnings, as well as a private review of your agentic strategy to date. For more practical support converting your pilots to productive solutions, contact us about consulting offerings.