Multiagent AI for generating chain-of-thought training data



Chain-of-thought reasoning, in which a large language model (LLM) is asked not only to perform multistep actions but to explain its reasons for taking the steps it does, has been shown to improve LLMs’ reasoning capability. One promising application of chain-of-thought (CoT) reasoning is ensuring that LLMs adhere to responsible-AI policies.

Using CoT to optimize an LLM for policy adherence requires high-quality training data annotated with chains of thought. But hiring human annotators to generate such training data is expensive and time-consuming.

Inspired by current work on incorporating artificial experts into the standard LLM training pipeline, researchers in Amazon’s Artificial General Intelligence organization have begun exploring the possibility of using ensembles of AI agents to generate high-quality CoT data. We report the results of our initial experiments in a paper we presented at this year’s meeting of the Association for Computational Linguistics (ACL).

Using two different LLMs and five different datasets, we compared models fine-tuned on data created through our multiagent-deliberation approach to both baseline pretrained models and models fine-tuned via supervised fine-tuning (SFT) on conventional data.


Our approach increases average safety (across in-domain, out-of-domain, and jailbreak test sets) by 96% relative to the baseline and by 73% relative to the conventionally fine-tuned model when using a non-safety-trained model (Mixtral). The increases were 12% and 44%, respectively, for a safety-trained model (Qwen).

Multiagent deliberation

Our approach divides the task of generating policy-compliant chains of thought into three stages, each of which uses LLMs: intent decomposition, deliberation, and refinement.

During intent decomposition, an LLM receives the user query and identifies explicit and implicit user intents. These, together with the query, are then passed to another LLM, which generates an initial CoT.

Deliberation is an iterative process in which multiple LLMs (agents) expand the CoT in sequential fashion, factoring in a defined set of policies. Each agent is prompted to review and correct the version of the CoT it receives — or to confirm that it’s good as is. This stage ends when an agent judges the CoT complete or when a predefined deliberation budget is exhausted.

Finally, in the refinement stage, an LLM takes the outputs of the deliberation stage and post-processes them to filter out redundant, deceptive, and policy-inconsistent thoughts.
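The three stages above can be sketched as a simple loop. Everything in this sketch (the `llm` callable, the prompt wording, the agent count, and the `COMPLETE` sentinel) is an illustrative assumption, not the paper's actual implementation:

```python
from typing import Callable

def generate_safe_cot(query: str, policies: list[str], llm: Callable[[str], str],
                      n_agents: int = 3, budget: int = 6) -> str:
    """Three-stage multiagent deliberation: decompose, deliberate, refine."""
    # Stage 1: intent decomposition -- surface explicit and implicit intents,
    # then draft an initial CoT conditioned on the query and those intents.
    intents = llm(f"List the explicit and implicit intents of: {query}")
    cot = llm(f"Draft a chain of thought for: {query}\nIntents: {intents}")

    # Stage 2: deliberation -- agents sequentially review/correct the CoT,
    # stopping when one judges it complete or the budget is exhausted.
    policy_text = "\n".join(policies)
    for step in range(budget):
        agent_id = step % n_agents
        verdict = llm(
            f"[agent {agent_id}] Review this chain of thought against the "
            f"policies. Return a corrected version, or reply COMPLETE if it "
            f"needs no changes.\nPolicies:\n{policy_text}\nCoT:\n{cot}"
        )
        if verdict.strip() == "COMPLETE":
            break
        cot = verdict  # adopt the corrected version for the next agent

    # Stage 3: refinement -- filter out redundant, deceptive, and
    # policy-inconsistent thoughts from the deliberated CoT.
    return llm(f"Refine this CoT, removing redundant or policy-violating steps:\n{cot}")
```

Keeping the model behind a plain `Callable[[str], str]` makes the pipeline easy to test with a stubbed LLM and to swap across model providers.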

A schematic of our multiagent-deliberation framework to generate safety-embedded CoTs.

Evaluation

Following prior work, we analyze the quality of the generated CoTs by measuring three fine-grained attributes: (1) relevance, (2) coherence, and (3) completeness. Each attribute is evaluated on a scale from 1 to 5, where 1 represents the lowest quality and 5 represents the highest. As test data, we use examples from several standard CoT benchmark datasets.


We also assess faithfulness along three dimensions: (1) between the policy and the generated CoT; (2) between the policy and the generated response; and (3) between the generated CoT and the final response. We use an LLM fine-tuned as an auto-grader to evaluate faithfulness on a scale from 1 to 5, where 1 indicates minimal faithfulness and 5 indicates complete adherence.
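An auto-grader of this kind typically emits a short rating that must be parsed back into a number. The template and `Score:` output format below are hypothetical stand-ins for the fine-tuned grader's actual format:

```python
import re

# Hypothetical prompt template for a pairwise faithfulness judgment;
# the fine-tuned auto-grader's real template is not published here.
GRADER_TEMPLATE = (
    "Rate the faithfulness of the {target} to the {reference} on a 1-5 scale, "
    "where 1 is minimal faithfulness and 5 is complete adherence.\n"
    "{reference}:\n{reference_text}\n{target}:\n{target_text}\n"
    "Answer with 'Score: <n>'."
)

def parse_score(grader_output: str) -> int:
    """Extract an integer 1-5 from the grader's reply; raise if malformed."""
    m = re.search(r"Score:\s*([1-5])\b", grader_output)
    if m is None:
        raise ValueError(f"unparseable grader output: {grader_output!r}")
    return int(m.group(1))
```

Constraining the regex to `[1-5]` means out-of-range or free-form replies fail loudly instead of silently corrupting the averages.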

As can be seen in the table below, using our framework provides quality improvements across all metrics, with an improvement of more than 10% in CoTs’ policy faithfulness.
 
Average auto-grader scores on the generated-CoT datasets (1-5 scale), including general-reasoning metrics to evaluate the quality of CoTs and faithfulness metrics to evaluate policy adherence.

| Metric | LLM_ZS | AIDSAFE | Delta |
| --- | --- | --- | --- |
| Relevance | 4.66 | 4.68 | 0.43% |
| Coherence | 4.93 | 4.96 | 0.61% |
| Completeness | 4.86 | 4.92 | 1.23% |
| CoTs' faithfulness (policy) | 3.85 | 4.27 | 10.91% |
| Response faithfulness (policy) | 4.85 | 4.91 | 1.24% |
| Response faithfulness (CoT) | 4.99 | 5.00 | 0.20% |
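The delta column is the relative improvement of AIDSAFE over the LLM_ZS baseline, computed as:

```python
def relative_delta(baseline: float, ours: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100

# CoT policy faithfulness, from the table above:
print(round(relative_delta(3.85, 4.27), 2))  # → 10.91
```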

Fine tuning

We use several benchmarks to measure the performance improvements provided by our generated CoT data: Beavertails and WildChat (for safety), XSTest (for overrefusal, or erroneously flagging safe generations as unsafe), MMLU (for utility), and StrongREJECT (for jailbreak robustness).


We used two widely used open-source LLMs in our tests, Qwen and Mixtral. The base versions of these models provide one baseline, and we add a second by fine-tuning them on only the prompts and responses from the original dataset, without the generated CoTs. Our method shows significant improvements over both baselines, particularly on safety and jailbreak robustness, with some trade-offs on utility and overrefusal.

Below are the results of evaluating the supervised fine-tuned (SFT) models. "Base" denotes the LLM without SFT, SFT_OG denotes the model SFT'd on the original response data without any CoTs, and SFT_DB denotes the model SFT'd on our generated CoTs and responses.

LLM: Mixtral

| Eval dimension | Metric | Dataset | Base | SFT_OG | SFT_DB (ours) |
| --- | --- | --- | --- | --- | --- |
| Safety | Safe response rate | Beavertails | 76 | 79.57 | 96 |
| Safety | Safe response rate | WildChat | 31 | 33.5 | 85.95 |
| Overrefusal | 1 − overrefuse rate | XSTest | 98.8 | 87.6 | 91.84 |
| Utility | Answer accuracy | MMLU | 35.42 | 31.38 | 34.51 |
| Jailbreak robustness | Safe response rate | StrongREJECT | 51.09 | 67.01 | 94.04 |

LLM: Qwen

| Eval dimension | Metric | Dataset | Base | SFT_OG | SFT_DB (ours) |
| --- | --- | --- | --- | --- | --- |
| Safety | Safe response rate | Beavertails | 94.14 | 87.95 | 97 |
| Safety | Safe response rate | WildChat | 95.5 | 59.42 | 96.5 |
| Overrefusal | 1 − overrefuse rate | XSTest | 99.2 | 98 | 93.6 |
| Utility | Answer accuracy | MMLU | 75.78 | 55.73 | 60.52 |
| Jailbreak robustness | Safe response rate | StrongREJECT | 72.84 | 59.48 | 95.39 |
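For a rough sanity check, the safety-related Mixtral rows can be averaged per configuration. Note that the headline relative-improvement figures quoted earlier use the paper's own aggregation, which may differ from this simple mean:

```python
# Safe-response rates for Mixtral from the table above
# (Beavertails, WildChat, StrongREJECT).
mixtral = {
    "Base":   [76.0, 31.0, 51.09],
    "SFT_OG": [79.57, 33.5, 67.01],
    "SFT_DB": [96.0, 85.95, 94.04],
}

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

for name, scores in mixtral.items():
    print(f"{name}: {mean(scores):.2f}")  # average safe-response rate
```

Even this crude average shows the ordering reported in the paper: the CoT-trained model (SFT_DB) well ahead of both the base model and conventional SFT.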

Acknowledgements: We would like to acknowledge our coauthors and collaborators, Kai-Wei Chang, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Aram Galstyan, Richard Zemel, and Rahul Gupta, for their contributions.




