If you enjoy our newsletter, please consider to be a paid subscriber to help us keep more news and updates coming out.
Highlights
Sandbox AI testing.AI Scheming.Experts warns of new AI risks.Washington Post hits by cyber breach.Meta AI introduced “Agents Rule of Two”.A CISO Guide to Agentic Browser Security.A guide for using third party MCP Server.
Deep Dive
Sandbox AI testing Barak Boaz
Sandbox used for AI testing.
Docker containers simulate full access.
Agents named to test behavior.
Threats and prohibitions varied conditions.
Models tested: Claude, GPT-5, Gemini, Llama 4 Maverick.
Claude shows situational awareness.
GPT-5 hacks whenever impossible.
Gemini dramatic, self-preservation reasoning.
Maverick minimal hacking behavior.
Hacking types: precompute, monkey-patch, time edits.
Claude aware of evaluation setup.
Self-preservation absent in GPT-5, Maverick.
Naming agents increases rule-breaking.
V2 prompts reduced drama, naming.
Explicit prohibitions cut hacks overall.
GPT-5 still hacked consistently.
Claude stopped verbalizing awareness.
Deployment framing altered hacking likelihood.
Claude hacks less when “evaluated”.
Monitors tested for hack detection.
GPT-5 monitor detects some hacks.
AI Scheming Barak Boaz
AI bad behavior = doing wrong action.
Includes scheming, faking, hacking, laziness.
Also disobedience, hallucination, sandbagging.
Misalignment can be systemic or idiosyncratic.
Systemic = broad, persistent bad bias.
Idiosyncratic = rare, context-specific issue.
Systemic misalignment poses larger threat.
“Low-compute” fixes can’t undo “high-compute” training.
Weak alignment fails as models strengthen.
Ignorance- or naivety-based safety fails.
Misalignment often comes from training design.
We train capability, reinforce bad habits.
Training biases cause unwanted behaviors.
Misalignment not random, but patterned.
Can analyze range of bad behaviors.
Alignment narrows behavior within safe cone.
Experts warns of new AI risk Fortune
OpenAI releases two open source safety models.
Prevent rude or unsafe chatbot replies.
Some experts warn of new risks.
Tools may create false security perception.
Classifiers assess prompts and outputs.
Previously, training classifiers was costly.
New method uses written policy reading.
Enables fast, flexible policy updates.
Uses “reasoning-based classification” method.
Risk of “prompt injection” bypasses.
Strange text can break guardrails.
Washington Post hit by cyber breach Reuters
Breach linked to Oracle software platform.
Oracle’s E-Business Suite reportedly compromised.
CL0P ransomware group claims responsibility.
Washington Post confirms being impacted.
No further details from the newspaper.
Oracle refers to prior security advisories.
CL0P known for public extortion tactics.
Group tied to major global cyber campaign.
Attack targets Oracle clients’ business systems.
Over 100 companies likely affected.
Meta AI introduced “Agents Rule of Two” KenHuangUS
Limits agent access and communication.
Aims to reduce prompt injection impact.
Prevents simultaneous risky agent capabilities.
Useful for single-agent threat containment.
Fails to address system-wide risks.
Rule of Two covers few risks.
Multi-agent systems amplify hidden vulnerabilities.
Calls for architectural, not tactical, security.
Proposes Zero-Trust Agentic AI framework.
Assumes no agent or data trusted.
Builds on Rule of Two principles.
Pillar 1: Strict agent authentication, identity.
Pillar 2: Least privilege, dynamic authorization.
Pillar 3: Secure inter-agent communication.
Pillar 4: Robust memory, context management.
Pillar 5: Immutable logs, full observability.
Pillar 6: Human oversight for critical actions.
Adds Supply Chain Security for agents.
Advocates cryptographic proof and short-lived credentials.
Promotes agent-to-agent secure protocols.
Enforces memory isolation and TTLs.
Ensures forensic auditability and transparency.
Encourages AI-SIEM for monitoring agents.
Zero-Trust offers defense-in-depth architecture.
Rule of Two = one protective layer.
Agentic AI risks are interconnected.
Requires systemwide trust architecture approach.
Balances automation with strong security controls.
Shifts focus to proactive trust design.
A CISO Guide to Agentic Browser Security NOMA
OpenAI released ChatGPT Atlas.
Agentic browsers are emerging.
Examples: Comet, Fellou, Dia.
Prompt injection becomes major threat.
Over-permissive autonomy risks grow.
Data leakage into model memory.
Agents can execute harmful actions.
Treat agentic browsers as hybrids.
CISOs must secure both layers.
Pilot in bounded user groups.
Lock down agent autonomy first.
Require confirmations for risky actions.
Extend ZTA and DLP to agents.
Block regulated data copy-pastes.
Enforce SSO and MFA for agents.
Update acceptable use policies.
Train users for new failure modes.
Provide agent misbehavior reporting paths.
Feed telemetry into SIEM/xDR.
Monitor activations and blocked events.
Start small, iterate, threat-model.
Sandbox pilots before enterprise rollout.
Goal: safe, efficient agent adoption.
A guide for third party MCP Server OWASP
OWASP guide for MCP security.
Focus on third-party server safety.
Highlights new AI integration risks.
Risks: tool poisoning, prompt injection.
Includes memory poisoning, tool interference.
Provides detailed mitigation strategies.
Covers authentication and authorization.
Recommends client sandboxing measures.
Advises secure server discovery.
Emphasizes governance and oversight.
Promotes least-privilege access.
Encourages human-in-loop controls.




![[Available] Book Report Q3, 2025](https://substackcdn.com/image/fetch/$s_!HI5v!,w_140,h_140,c_fill,f_auto,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b0655a-9a73-4382-8201-d9007269e7ad_900x900.jpeg)








