I think the core issue is in static benchmarks and the community needs to start moving beyond measuring pass/fail (which worked when agents were incapable of doing much of the work) to dynamic evals that simulate more how we evaluate humans.
We're doing that internally to continuously improve our own agent and make it robust against adversarial attacks itself. We will release some insights about self-improvement soon!
AI agents break in ways traditional software doesn't. Logic bugs, reasoning failures, edge cases that manual testing and static benchmarks don't fully explore.
Nyx is an autonomous adversarial harness that probes your agents for vulnerabilities. Since agents are non-deterministic, it can be hard to find the gaps by just reading code. So it interacts with your AI agents in blackbox mode to surface issues across security, logic, and alignment at scale, before they reach users. It's also massively parallel by default
Instead of spending time writing static evals for the key failure modes of your AI agents, point Nyx at any system and it autonomously discovers failure modes that matter. It can typically find issues in under 10 minutes that manual audits take hours to surface.
This is early work and we know the methodology is still going to evolve. We would love nothing more than feedback from the community as we iterate on this.
We wrote some thoughts on static vs. dynamic evals and how it relates to understanding the security posture of an AI system. Static security evals no longer carry the signal they used to. A one-shot pass/fail tells you almost nothing about real-world risk.
we did a lot of thinking around this topic. and distilled it into a new way to dynamically evaluate the security posture of an AI system (which can apply for any system for that matter). we wrote some thoughts on this here: https://fabraix.com/blog/adversarial-cost-to-exploit
Not sure which version of Gemini are you using but Claude is so much better for me. Gemini is generally overeager to make a code change even when I am just asking conceptual questions, among other issues.
Yup! But in my opinion the current state of guardrails is still lacking and I hope this is one way that helps improve our understanding of these systems.
I think the core issue is in static benchmarks and the community needs to start moving beyond measuring pass/fail (which worked when agents were incapable of doing much of the work) to dynamic evals that simulate more how we evaluate humans.
reply