FalseReject: Reducing overcautiousness in LLMs through reasoning-aware safety evaluation

0
3


Large language models (LLMs) have come a long way in enforcing responsible-AI standards through robust safety mechanisms. However, these mechanisms often err on the side of caution, leading to overrefusals — instances where the model declines to answer perfectly benign prompts. This overcautious behavior, while well intentioned, can reduce the LLMs’ usefulness in nuanced real-world contexts such as education, health, and HR support.

To address this problem, we and our colleagues Fanyou Wu and Chandan K. Reddy developed a graph-based method for generating examples of overrefusal. Using this approach, we created FalseReject, a benchmark dataset with 15,000 training prompts and 1,100 test prompts. In our recent arXiv paper “FalseReject: A resource for improving contextual safety and mitigating over-refusals in LLMs via structured reasoning”, we report our data generation methodology and our use of the dataset to benchmark 29 state-of-the-art LLMs and to reduce unnecessary refusals through LLM fine tuning.

In the fine-tuning experiments, we used five different LLMs. Each LLM was fine-tuned four times: twice on two different instruction-following datasets and twice on the same datasets augmented with overrefusal data from FalseReject. We then used two different datasets to evaluate each model, FalseReject and the standard benchmark Or-Bench, for a total of 20 different comparisons. We measured performance as the percentage of harmless prompts that the models accepted.

Models fine-tuned on FalseReject outperformed the baselines across the board, by 40% to 70% in several cases. On average, fine tuning improved the models’ acceptance of safe prompts by 27%, while preserving or improving general language ability and performance on four other safety measures.

The problem: Overrefusal in modern LLMs

As LLMs have become more safety aware, they increasingly incorporate filters or undergo training that teaches them to flag sensitive queries. But these models often lack contextual discernment. For example, a model may refuse to respond to queries like these:

  • “Can I learn about the dangers of smoking for a school project?”
  • “Explain how vaccines work using a needle analogy.”

Even though these are educational queries, models sometimes categorize them as potentially unsafe due to certain trigger words like “smoking” or “needle.” This issue is especially critical in professional applications (e.g., healthcare chatbots, educational tools, or HR support), where helpfulness and relevance must be preserved without compromising safety.

The solution: Introducing FalseReject

FalseReject is a large-scale, carefully curated dataset of prompts that seem potentially unsafe but are actually benign and reasonable. It targets 44 sensitive topic categories (e.g., drug use, politics, and mental health) and is designed to challenge LLMs in scenarios where contextual nuance matters.

FalseReject has three key features:

  1. Rich and diverse topics: The dataset spans more categories than any comparable benchmarks — nearly two to four times as many as previous benchmarks, such as XSTest and OKTest;
  2. Structured responses with reasoning chains: Each prompt is paired with two responses, a standard response and one with long chain-of-thought (CoT) reasoning trajectories, so models can learn how to justify their decisions that particular prompts are safe and formulate helpful answers, rather than issuing blanket refusals;
  3. Generation via a graph-informed adversarial agent: We developed a novel, multiagent, adversarial generation framework to create diverse prompts that appear sensitive but are contextually benign, helping models learn to distinguish between genuinely unsafe queries and safe edge cases — without weakening safety boundaries.

Graph-based multiagent generation

Large-scale synthetic data generation with LLMs often results in repetitive content, reducing diversity. Before generating training examples, we thus use an LLM to identify and extract entities from toxic prompts in existing datasets, focusing on people, locations, objects, and concepts associated with safety concerns. We repeat this process several times, producing multiple lists, and then ask an ensemble of LLMs to select the most representative list.

Next, we use an LLM to identify relationships between the extracted entities, and we encode that information in an entity graph. Based on the graph, an LLM prompted to act as a generator proposes sample prompts that involve potentially unsafe entities.

Next, an LLM prompted to act as a discriminator determines whether the candidate prompts are genuinely unsafe or merely appear unsafe. The prompts judged to be safe then pass to a pool of LLMs that attempt to process them. Any prompt rejected by at least one LLM in the pool is retained for further evaluation.

Finally, an LLM prompted to act as an orchestrator determines whether the retained prompts constitute valid overrefusal cases and, specifically, whether they are benign despite appearing concerning. Valid cases are retained for the datasets; invalid prompts are fed back into the generator for refinement. 

The generation pipeline for overrefusal examples in FalseReject.

At each iteration of the process, the generator actively tries to trigger refusals by generating prompts that seem unsafe but are in fact harmless. Meanwhile, the discriminator tries to avoid being misled, identifying whether they are safe or unsafe. This adversarial interaction results in extremely subtle training examples, which can help an LLM learn fine-grained distinctions.

Experimental results

We evaluated 29 state-of-the-art LLMs, including both open- and closed-source models, covering standard and reasoning-oriented variants such as GPT-4o, O1, DeepSeek, Claude, Gemini, and Mistral. Our findings are both sobering and promising:

  1. All models exhibited a significant overrefusal rate, with even leading commercial models declining to answer 25%–50% of safe prompts;
  2. Larger model size does not correlate with better refusal behavior.
  3. Stronger general language ability does not imply lower overrefusal.
  4. Models fine tuned using FalseReject showed a marked improvement, delivering more helpful responses without increasing unsafe generations and general language ability.

Utility: How FalseReject helps LLM development

FalseReject is more than a dataset: it’s a framework for improving contextual safety in LLMs. Here’s how it can be used:

  • Fine tuning: Training models to develop reasoning-based justifications for their responses to edge-case prompts;
  • Benchmarking: Evaluating refusal behavior with human-annotated test sets;
  • Debugging: Understanding which categories (e.g., legal, sexual health, addiction recovery) a model is overly sensitive to;
  • Transfer evaluation: Testing the robustness of instruction-following or reasoning models beyond standard safety datasets.

FalseReject is a crucial step toward more thoughtful and context-aware language models. By focusing on structured reasoning, it bridges the gap between helpfulness and safety, offering a scalable way to reduce harmful overcautiousness in LLMs.

Try it here:

Dataset
Project page
Paper





Source link