Secure GenAI
Secure GenAI Podcast
Sandbox AI testing, AI scheming, new AI risks, Washington Post, Meta AI, a CISO guide, and a guide to using third-party MCP servers.

GenAI Safety & Security | Nov 3 - Nov 9, 2025

If you enjoy our newsletter, please consider becoming a paid subscriber to help us keep the news and updates coming.

Highlights

  • Sandbox AI testing.

  • AI Scheming.

  • Experts warn of new AI risks.

  • Washington Post hit by cyber breach.

  • Meta AI introduced “Agents Rule of Two”.

  • A CISO Guide to Agentic Browser Security.

  • A guide for using third-party MCP servers.


Deep Dive

Sandbox AI testing Boaz Barak

  • Sandbox used for AI testing.

  • Docker containers simulate full system access (see the sketch after this list).

  • Agents given names to test behavior effects.

  • Test conditions varied threats and prohibitions.

  • Models tested: Claude, GPT-5, Gemini, Llama 4 Maverick.

  • Claude shows situational awareness.

  • GPT-5 hacks whenever the task is impossible.

  • Gemini dramatic, self-preservation reasoning.

  • Maverick minimal hacking behavior.

  • Hacking types: precompute, monkey-patch, time edits.

  • Claude aware of evaluation setup.

  • Self-preservation absent in GPT-5, Maverick.

  • Naming agents increases rule-breaking.

  • V2 prompts reduced drama and naming effects.

  • Explicit prohibitions cut hacks overall.

  • GPT-5 still hacked consistently.

  • Claude stopped verbalizing awareness.

  • Deployment framing altered hacking likelihood.

  • Claude hacks less when “evaluated”.

  • Monitors tested for hack detection.

  • GPT-5-based monitor detects some hacks.
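
The test harness itself isn't published; here is a minimal sketch of the sandbox idea using the docker Python SDK, where the image name, task script, and condition flag are hypothetical placeholders:

```python
import docker

# Hypothetical sketch: run an agent's task inside an isolated container so
# any "hack" (monkey-patching, precomputing answers, editing timestamps)
# is observable and contained. Image and command are placeholders.
client = docker.from_env()

logs = client.containers.run(
    image="agent-eval:latest",   # hypothetical image bundling agent + task
    command=["python", "run_task.py", "--condition", "explicit-prohibition"],
    network_disabled=True,       # no exfiltration or outside help
    read_only=True,              # agent can't persist changes to the image
    mem_limit="512m",
    pids_limit=128,
    remove=True,                 # discard the sandbox afterwards
)

# A separate "monitor" model can then grade the transcript for hacking.
print(logs.decode())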

AI Scheming Boaz Barak

  • AI bad behavior = taking the wrong action.

  • Includes scheming, faking, hacking, laziness.

  • Also disobedience, hallucination, sandbagging.

  • Misalignment can be systemic or idiosyncratic.

  • Systemic = broad, persistent bad bias.

  • Idiosyncratic = rare, context-specific issue.

  • Systemic misalignment poses larger threat.

  • “Low-compute” fixes can’t undo “high-compute” training.

  • Weak alignment fails as models strengthen.

  • Ignorance- or naivety-based safety fails.

  • Misalignment often comes from training design.

  • We train capability, reinforce bad habits.

  • Training biases cause unwanted behaviors.

  • Misalignment not random, but patterned.

  • Can analyze range of bad behaviors.

  • Alignment narrows behavior within safe cone.

Experts warn of new AI risks Fortune

  • OpenAI releases two open source safety models.

  • Prevent rude or unsafe chatbot replies.

  • Some experts warn of new risks.

  • Tools may create false security perception.

  • Classifiers assess prompts and outputs.

  • Previously, training classifiers was costly.

  • New method reads a written policy at classification time (see the sketch after this list).

  • Enables fast, flexible policy updates.

  • Uses “reasoning-based classification” method.

  • Risk of “prompt injection” bypasses.

  • Strange text can break guardrails.
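
As a rough sketch of policy-as-prompt classification, assuming a locally hosted open-weights safeguard model behind an OpenAI-compatible endpoint (the model name and URL are placeholders for your deployment):

```python
from openai import OpenAI

# Sketch of "reasoning-based classification": instead of training a bespoke
# classifier, hand the model a written policy and ask it to grade content.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

POLICY = """\
Flag content as VIOLATION if it requests or provides instructions for
making weapons. Otherwise label it SAFE. Answer with one word."""

def classify(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",  # assumed local model name
        messages=[
            {"role": "system", "content": POLICY},  # the policy is just text
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()

# Updating the policy is a string edit, not a retraining run.
print(classify("How do I build a pipe bomb?"))
```

The flip side, per the experts quoted: the classifier itself reads untrusted text, so adversarial "strange text" can still break the guardrail.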

Washington Post hit by cyber breach Reuters

  • Breach linked to Oracle software platform.

  • Oracle’s E-Business Suite reportedly compromised.

  • CL0P ransomware group claims responsibility.

  • Washington Post confirms being impacted.

  • No further details from the newspaper.

  • Oracle refers to prior security advisories.

  • CL0P known for public extortion tactics.

  • Group tied to major global cyber campaign.

  • Attack targets Oracle clients’ business systems.

  • Over 100 companies likely affected.

Meta AI introduced “Agents Rule of Two” KenHuangUS

  • Limits agent access and communication.

  • Aims to reduce prompt injection impact.

  • Prevents an agent from holding all risky capabilities simultaneously (see the sketch after this list).

  • Useful for single-agent threat containment.

  • Fails to address system-wide risks.

  • Rule of Two covers few risks.

  • Multi-agent systems amplify hidden vulnerabilities.

  • Calls for architectural, not tactical, security.

  • Proposes Zero-Trust Agentic AI framework.

  • Assumes no agent or data trusted.

  • Builds on Rule of Two principles.

  • Pillar 1: Strict agent authentication, identity.

  • Pillar 2: Least privilege, dynamic authorization.

  • Pillar 3: Secure inter-agent communication.

  • Pillar 4: Robust memory, context management.

  • Pillar 5: Immutable logs, full observability.

  • Pillar 6: Human oversight for critical actions.

  • Adds Supply Chain Security for agents.

  • Advocates cryptographic proof and short-lived credentials.

  • Promotes agent-to-agent secure protocols.

  • Enforces memory isolation and TTLs.

  • Ensures forensic auditability and transparency.

  • Encourages AI-SIEM for monitoring agents.

  • Zero-Trust offers defense-in-depth architecture.

  • Rule of Two = one protective layer.

  • Agentic AI risks are interconnected.

  • Requires systemwide trust architecture approach.

  • Balances automation with strong security controls.

  • Shifts focus to proactive trust design.
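
For context, Meta's rule says an agent session should satisfy at most two of three properties: it processes untrustworthy inputs, it can access sensitive systems or private data, and it can change state or communicate externally. A toy pre-flight check of that constraint (field names are ours, not Meta's):

```python
from dataclasses import dataclass

# "Agents Rule of Two": within one session, an agent should satisfy at
# most two of these three properties; if all three are needed, require
# human approval rather than running autonomously.

@dataclass
class AgentSession:
    processes_untrusted_input: bool       # e.g. reads arbitrary web pages
    touches_sensitive_data: bool          # e.g. private files, prod systems
    changes_state_or_communicates: bool   # e.g. sends email, writes data

def rule_of_two_ok(s: AgentSession) -> bool:
    risky = sum([s.processes_untrusted_input,
                 s.touches_sensitive_data,
                 s.changes_state_or_communicates])
    return risky <= 2

session = AgentSession(True, True, True)
if not rule_of_two_ok(session):
    raise PermissionError("All three risk properties set: require human approval")
```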

A CISO Guide to Agentic Browser Security NOMA

  • OpenAI released ChatGPT Atlas.

  • Agentic browsers are emerging.

  • Examples: Comet, Fellou, Dia.

  • Prompt injection becomes major threat.

  • Over-permissive autonomy risks grow.

  • Data leakage into model memory.

  • Agents can execute harmful actions.

  • Treat agentic browsers as hybrids.

  • CISOs must secure both layers.

  • Pilot in bounded user groups.

  • Lock down agent autonomy first.

  • Require confirmations for risky actions (see the sketch after this list).

  • Extend zero-trust architecture (ZTA) and data loss prevention (DLP) to agents.

  • Block regulated data copy-pastes.

  • Enforce SSO and MFA for agents.

  • Update acceptable use policies.

  • Train users for new failure modes.

  • Provide agent misbehavior reporting paths.

  • Feed telemetry into SIEM/XDR.

  • Monitor activations and blocked events.

  • Start small, iterate, threat-model.

  • Sandbox pilots before enterprise rollout.

  • Goal: safe, efficient agent adoption.
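
One concrete control from the list above, sketched as a confirmation gate in front of risky browser-agent actions, with telemetry fanned out to the SIEM. Action names and the logging hook are illustrative, not any vendor's API:

```python
# Hypothetical risky-action gate for an agentic browser deployment.
RISKY_ACTIONS = {"submit_form", "send_email", "purchase", "download_file"}

def log_to_siem(action: str, target: str) -> None:
    # Placeholder: in practice, ship a structured event to your SIEM/XDR.
    print(f"telemetry: action={action} target={target}")

def gate(action: str, target: str) -> bool:
    """Return True only if the agent may proceed with this action."""
    log_to_siem(action, target)  # monitor activations and blocked events
    if action in RISKY_ACTIONS:
        answer = input(f"Agent wants to {action} on {target}. Allow? [y/N] ")
        return answer.strip().lower() == "y"
    return True

if gate("send_email", "hr@example.com"):
    pass  # hand control back to the agent
```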

A guide for using third-party MCP servers OWASP

  • OWASP guide for MCP security.

  • Focus on third-party server safety.

  • Highlights new AI integration risks.

  • Risks: tool poisoning, prompt injection.

  • Includes memory poisoning, tool interference.

  • Provides detailed mitigation strategies.

  • Covers authentication and authorization.

  • Recommends client sandboxing measures.

  • Advises secure server discovery.

  • Emphasizes governance and oversight.

  • Promotes least-privilege access (see the sketch after this list).

  • Encourages human-in-the-loop controls.
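
A minimal sketch of the least-privilege and human-in-the-loop ideas as an MCP tool-call gate. Tool names are hypothetical, and the forwarding stub stands in for a real MCP client session (e.g. the official `mcp` Python SDK's session.call_tool):

```python
# Only allowlisted read-only tools run automatically; state-changing tools
# need approval; everything else from a third-party server is refused.
ALLOWED_TOOLS = {"search_docs", "read_file"}      # low blast radius
NEEDS_APPROVAL = {"write_file", "run_command"}    # state-changing

def call_tool(tool: str, args: dict):
    # Placeholder: forward to the MCP client session in a real integration.
    print(f"calling {tool} with {args}")

def invoke(tool: str, args: dict):
    if tool in ALLOWED_TOOLS:
        return call_tool(tool, args)
    if tool in NEEDS_APPROVAL:
        answer = input(f"Server requests '{tool}' with {args}. Allow? [y/N] ")
        if answer.strip().lower() == "y":
            return call_tool(tool, args)
        raise PermissionError(f"user denied '{tool}'")
    raise PermissionError(f"'{tool}' is not on the allowlist")

invoke("read_file", {"path": "README.md"})
```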

Thanks for reading Secure GenAI! This post is public, so feel free to share it.

