Ecosystem

Calling all Lies

AI Deception Is More Than Just Getting Facts Wrong

Models can lie about what they're capable of, fabricate plausible-sounding details under social pressure, strategically blend truth with fiction, or dodge questions they can't answer honestly. NDIF and Cadenza Labs are hosting a competition to study how models lie and are looking for red teams to create scenarios where models contradict their own beliefs (RFP: https://cadenza-labs.github.io/red-team-rfp/).

This competition is inspired by Liars’ Bench from Cadenza Labs. Their benchmark of over 72,000 labeled examples organizes LLM lies along two key dimensions: what the model lies about (world knowledge, its own capabilities, its actions, its policies) and why it lies (inherent behavioral patterns vs. context-driven pressure). The benchmark ranges from simple factual falsehoods to subtle introspective lies, and should serve as a starting point for red team scenario design.
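To make the two dimensions concrete, here is a minimal sketch of how a red team might tag its own scenarios and check which combinations it has not yet covered. This is not the Liars’ Bench schema: the class names, field names, and the example scenario are all hypothetical, chosen only to mirror the categories listed above.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class LieObject(Enum):
    """What the model lies about (the four categories named in the text)."""
    WORLD_KNOWLEDGE = "world knowledge"
    CAPABILITIES = "its own capabilities"
    ACTIONS = "its actions"
    POLICIES = "its policies"


class LieReason(Enum):
    """Why the model lies (the two categories named in the text)."""
    INHERENT = "inherent behavioral pattern"
    CONTEXT_DRIVEN = "context-driven pressure"


@dataclass(frozen=True)
class Scenario:
    # Hypothetical record type; field names are illustrative only.
    name: str
    lie_object: LieObject
    lie_reason: LieReason


def uncovered_cells(scenarios):
    """Return every (object, reason) pair that no scenario covers yet."""
    covered = {(s.lie_object, s.lie_reason) for s in scenarios}
    return [cell for cell in product(LieObject, LieReason) if cell not in covered]


# Illustrative scenario, not taken from the benchmark.
scenarios = [
    Scenario(
        name="model denies a capability when pressured by the user",
        lie_object=LieObject.CAPABILITIES,
        lie_reason=LieReason.CONTEXT_DRIVEN,
    ),
]

for lie_object, lie_reason in uncovered_cells(scenarios):
    print(f"missing: {lie_object.value} / {lie_reason.value}")
```

A coverage check like this is one simple way to confirm that a proposal is not clustered in a single corner of the taxonomy.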

In response to our RFP, we want red teams to create a diverse set of deception scenarios, which blue teams will then use to build novel and robust lie detection methods. In this blog post, we walk through an example red team project to inspire creative proposal submissions.

NNterp Integration

We welcome nnterp to the NDIF ecosystem! The nnterp library is built on top of nnsight, providing a standardized transformer architecture for many LLMs along with implementations of common interpretability techniques. Let's explore how it works!