Designing Smarter AI Workflows: The Strategic Power of Human-in-the-Loop 🚀

Hey tech enthusiasts! Ever feel like AI is moving at lightning speed, but sometimes leaves us scratching our heads about how it makes its decisions? Ashley Nutter, a seasoned product leader at CNN, recently shed some light on this fascinating challenge, especially when it comes to designing “agentic workflows” – AI systems where an agent makes decisions or takes actions.

Ashley’s talk dives deep into a crucial aspect: human-in-the-loop (HITL). It’s not about eliminating humans, but about strategically placing their judgment where it matters most, enabling us to harness the power of AI agents without sacrificing trust or brand integrity.

Understanding Agentic Workflows and Human-in-the-Loop 🤖🤝

First things first, let’s clarify terms:

  • Agentic Workflows: These are AI systems where the output is an agent that makes decisions or takes actions. Think of an AI assistant that can book appointments or an AI that can draft an article outline.
  • Human-in-the-Loop (HITL): This refers to any point where a human needs to interact before a decision can be made or an action can be taken by the AI.

Ashley, with her background in consumer streaming products and now at CNN, understands the dual challenge of serving millions of news consumers globally and supporting a vast network of journalists. The goal? To create compelling storytelling, scale experiences rapidly, and deliver more value to users faster. Agentic workflows offer a powerful path to achieve this, but they come with inherent risks.

The Spectrum of Control vs. Agency: Navigating the Trade-offs ⚖️

Ashley beautifully illustrates the design space as a spectrum:

  • Full Control of Outcomes: This is where humans are heavily involved, ensuring predictable results.
  • Full Agency: This is where AI agents operate with significant autonomy.

The general direction for many teams is towards agency, as it unlocks:

  • Rapid Scaling: Delivering more value to more users, faster.
  • Unprecedented Personalization: Offering each user the best possible product experience.
  • New Avenues for User Value: Discovering innovative ways to serve users.

However, this increased agency introduces risks:

  • Unpredictability: An agent might take an unexpected action.
  • Erosion of User Trust: If an AI’s actions are consistently off, users lose faith.
  • Brand Integrity Compromise: Decisions inconsistent with brand values can be damaging.
  • Business Outcomes at Risk: Ultimately, these factors can impact the bottom line.

The Bottleneck of Traditional HITL: Why We Need a Smarter Approach 🚧

Often, HITL is employed to mitigate risks. But the traditional approach can become a bottleneck, slowing down progress or acting as a gatekeeper. Ashley points out two key issues:

  1. Linear Scaling & Inconsistency:
    • Every decision routed to a human adds cost, so review cost scales linearly with volume.
    • Humans can produce inconsistent outcomes due to differing backgrounds and decision trees.
  2. Counteracting Lack of Confidence: When HITL is used as a band-aid for low confidence in the AI, it can:
    • Delay system-level improvements for the agent.
    • Prevent crisp definitions of success, as human judgment can always “fix” issues.
    • Lead to workarounds that undermine the value of agency itself.

Strategic Allocation of Judgment: The Key to Unlocking Agency ✨

The goal isn’t zero human involvement, but rather the strategic allocation of human judgment to the most impactful parts of the system. This helps us move faster and unlock agency without undermining trust.

User trust is paramount, especially in news, but it applies to every product – whether it’s about quality, reliability, or accuracy. The core challenge for product leaders is to decide:

  • What risk is acceptable?
  • Where does judgment truly belong in the system?
  • How do we scale up without undermining what users value?

Where Does Judgment Have an Outsized Impact? 🤔

Ashley highlights three key areas where human judgment is particularly powerful:

  1. Ambiguous Outcomes: Situations involving nuance, context, or values judgments. For example, assessing fairness standards in interview questions is more impactful than simple fact-checking.
  2. Irreversible or Reputationally Costly Consequences: Decisions with permanent impact or those that could severely damage reputation. Decisions about archiving millions of hours of video footage are effectively irreversible, while lapses in journalistic integrity carry immense reputational cost.
  3. System Improvement: When human judgment can actively make the system better over time.

Operationalizing Strategic HITL: A Four-Question Framework 🛠️

To implement this strategic approach, Ashley proposes answering four key questions when designing workflows:

1. When Does a Human Step In? (The Trigger) 🚨

The trigger for human review should be based on product risk, not just agent confidence.

  • Risk vs. Confidence: Don’t rely solely on the agent’s confidence level. A low-confidence decision might be low-stakes (e.g., a $5 refund), while a high-confidence decision could still be disastrous if it turns out to be wrong (e.g., a refund on a large corporate account). Focus on how bad it would be if the agent is wrong.
  • Explicit and Testable Triggers:
    • Explicit: Clear, articulable rules for human involvement. No “vibes” allowed!
    • Testable: Scenario-based tests (evals) to ensure triggers are effective. These evals should answer: “Does this require a human?” “Would this cause harm?” “Would this cross a trust boundary?”

Examples of Explicit and Testable Triggers:

  • External Publication: Flag for review before publishing to any external channel. 📢
  • One-Way Door Trigger: Flag if the agent’s action is irreversible. 🚪
  • Blast Radius: Flag if the decision impacts a certain number of users or revenue. 🎯
  • Regulated or Sensitive Domains: Flag for financial data, medical information, etc. ⚕️
  • Scenario-Based Triggers: Use red teams to identify high-risk scenarios (e.g., products involving firearms, alcohol, or religion). 🔥

Minimizing Unnecessary Review: The trigger should catch only what matters. If it’s too permissive (flagging too little), ad-hoc manual checks creep back in; if it’s too broad (flagging too much), reviewers start rubber-stamping. Aim for reduced volume and as little unnecessary review as possible.
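To make “explicit and testable” concrete, here is a minimal Python sketch of a risk-based trigger. The `AgentAction` shape, the `requires_human_review` function, and the thresholds are all hypothetical assumptions for illustration, not something from Ashley’s talk:

```python
from dataclasses import dataclass, field

@dataclass
class AgentAction:
    """Hypothetical summary of an action an agent wants to take."""
    publishes_externally: bool = False         # external publication trigger
    is_reversible: bool = True                 # one-way door trigger
    users_affected: int = 0                    # blast radius: users
    revenue_at_risk: float = 0.0               # blast radius: revenue
    domains: set = field(default_factory=set)  # e.g. {"medical", "financial"}

SENSITIVE_DOMAINS = {"financial", "medical", "firearms", "alcohol", "religion"}

def requires_human_review(action: AgentAction) -> bool:
    """Explicit rules keyed to product risk, not to the agent's confidence."""
    return (
        action.publishes_externally
        or not action.is_reversible
        or action.users_affected > 10_000
        or action.revenue_at_risk > 5_000
        or bool(action.domains & SENSITIVE_DOMAINS)
    )

# Scenario-based tests ("evals"): does this require a human? would it cross a trust boundary?
assert requires_human_review(AgentAction(publishes_externally=True))
assert requires_human_review(AgentAction(is_reversible=False))
assert requires_human_review(AgentAction(domains={"medical"}))
assert not requires_human_review(AgentAction(users_affected=12))  # low-stakes: let the agent act
```

Because the rules are written down and covered by scenario tests, a red team can keep adding high-risk cases to the eval list as they are discovered — which is exactly what makes the trigger testable rather than “vibes.”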

2. What is the Human Actually Doing? (The Feedback Type) ✍️

The type of feedback humans provide directly impacts scalability and effectiveness. The trade-off is between:

  • Binary Response (Approve/Reject):
    • Pros: Fast, highly scalable, low effort, consistent. Great for safety gates.
    • Cons: Weak signal for the agent (doesn’t explain why), struggles with ambiguity.
    • Best for: Clear decision criteria, well-understood risks (e.g., policy violations). The human’s role is to gate, not teach.
  • Open-Ended Review (Comments, Edits, Explanations):
    • Pros: Richer information, captures judgment, best for improving the agent over time.
    • Cons: More expensive, doesn’t scale as well, harder to standardize.
    • Best for: Evolving rules, situations with high nuance (e.g., editorial judgment).

Hybrid Approaches: Often, a mix is best. For example, binary approval/rejection with open-ended feedback only on rejections. Or staging feedback, starting broad and becoming more binary over time. The choice depends on optimizing for safety or smarts.
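As a rough sketch of that hybrid approach, the snippet below keeps approvals as a cheap binary gate but requires an explanation on rejection so the richer signal isn’t lost. The `ReviewResult` schema and `collect_review` helper are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewResult:
    """One reviewer decision on one agent output (hypothetical schema)."""
    approved: bool                 # binary gate: fast, scalable, consistent
    comment: Optional[str] = None  # open-ended signal, collected only on rejection

def collect_review(approved: bool, comment: Optional[str] = None) -> ReviewResult:
    # Hybrid policy: approvals stay cheap; rejections must capture the "why"
    # so it can later feed prompt updates, evals, or training data.
    if not approved and not comment:
        raise ValueError("Rejections need a short explanation of what went wrong.")
    return ReviewResult(approved=approved, comment=comment)

collect_review(True)                                                          # fast approval
collect_review(False, "Headline overstates what the study actually found.")  # teachable rejection
```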

3. What Happens After the Human Acts? (Capturing Judgment) 🔄

Human judgment must be captured to drive future improvements:

  • Updating Prompts, Evals, or Training Data: The goal is to reduce future workload.
  • Using Evals: Evals help identify if HITL is compensating for bad system design or if risk is genuinely decreasing. For instance, an agent generating metadata needs to include attribution (e.g., “according to a study”) to avoid misrepresenting findings.
  • Repetition is Key: If reviewers repeatedly provide the same feedback, encode it into the system (see the sketch after this list). This helps identify gaps in how the system updates for new information.
  • Leveraged Judgment: When review reduces workload, human judgment becomes leverage.
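For instance, if reviewers keep rejecting generated metadata for missing attribution, that recurring judgment can be distilled into a simple automated check. This toy eval and its cue phrases are illustrative assumptions only:

```python
ATTRIBUTION_CUES = ("according to", "a study found", "researchers say", "per the report")

def has_attribution(metadata_summary: str) -> bool:
    """Toy eval distilled from repeated reviewer feedback: summaries of research
    findings should attribute them rather than state them as established fact."""
    text = metadata_summary.lower()
    return any(cue in text for cue in ATTRIBUTION_CUES)

# Once the same feedback recurs, it becomes a regression test that runs on every
# output instead of another item in the manual review queue.
assert has_attribution("According to a study, sleep improves recall.")
assert not has_attribution("Sleep improves recall.")  # would still be flagged for review
```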

4. How Does It Change Over Time? (Defining Success and Downgrading) 📈

The ultimate goal is to reduce future workload and allow the AI agent to take on more responsibility.

  • Clear Success Criteria: Define success upfront and ensure it’s evaluable against data and observations. “Good vibes” aren’t enough for product management.
  • Failure to Define Downgrade Criteria: If clear criteria for reducing human review aren’t set, HITL will likely persist indefinitely. This is an organizational failure, not a technical one (see the sketch below for one possible downgrade rule).
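One way to make downgrade criteria explicit is to tie the review sampling rate to measured human/agent agreement. The `next_review_rate` function and its thresholds below are purely illustrative assumptions:

```python
def next_review_rate(current_rate: float, agreement: float, decisions: int) -> float:
    """Hypothetical downgrade rule: once reviewers have agreed with the agent on
    at least 98% of a large enough sample, review a smaller share of outputs."""
    if decisions >= 500 and agreement >= 0.98:
        return max(current_rate / 2, 0.05)  # keep a 5% audit floor, never drop to zero
    return current_rate  # criteria not met: keep the current level of review

print(next_review_rate(0.50, agreement=0.99, decisions=800))  # 0.25 -> review half as much
print(next_review_rate(0.50, agreement=0.90, decisions=800))  # 0.50 -> unchanged
```

Because the rule is written down and evaluable against data, reducing human review becomes a deliberate product decision rather than something that never happens.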

The Future of AI Workflows: Judgment Where It Counts 💡

As AI systems become more agentic, product leaders face the critical task of deciding where judgment lives. It won’t be everywhere or nowhere, but strategically placed where it truly counts.

By thoughtfully designing human-in-the-loop processes, we can enable AI agents to operate with greater autonomy, unlock immense potential for scaling and personalization, and deliver more value to users, all without undermining the trust that forms the bedrock of any successful product.

Thank you, Ashley, for this insightful perspective on building smarter, more trustworthy AI!
