As a ‘hyper-problem’ that makes political and social challenges harder to resolve, polarisation is both a barrier to addressing a violative past and a leading indicator of future risks of conflict and violence. Polarisation can decrease social cohesion, contribute to a culture of violence and impunity, and eventually incite mass atrocity, making it a pressing concern for transitional justice – a field designed to address such violations. Yet, transitional justice actors have largely, and dangerously, ignored polarisation to date.
This IFIT discussion paper compares transitional justice and depolarisation, identifying correlations between their respective objectives and tools. It examines ways in which transitional justice and polarisation act as mutual risk multipliers, creating self-reinforcing feedback loops that produce additional harms and make future attempts at transition more difficult.
The paper proposes backward-, present- and future-looking approaches for ensuring transitional justice interventions account for polarisation, ranging from technological tools to narrative interventions and policy changes. It provides a conceptual framework for thinking about this critical but underexamined relationship, opening the door for polarisation-sensitive transitional justice.
30 July 2025 – A groundbreaking study by the Institute for Integrated Transitions (IFIT) has revealed that all major large language models (LLMs) are providing dangerous conflict resolution advice without conducting basic due diligence that any human mediator would consider essential.
IFIT tested six leading AI models (ChatGPT, Claude, DeepSeek, Google Gemini, Grok and Mistral) on three real-world prompt scenarios from Syria, Sudan, and Mexico. Each LLM response, generated on June 26, 2025, was evaluated by two independent five-person teams of IFIT researchers across ten key dimensions, based on well-established conflict resolution principles such as due diligence and risk disclosure. Scores were assigned on a 0 to 10 scale for each dimension, for a maximum possible total of 100 points, to assess the quality of each LLM’s advice.
A sounding board of senior IFIT conflict resolution experts from Afghanistan, Colombia, Mexico, Northern Ireland, Sudan, Syria, the United States, Uganda, Venezuela, and Zimbabwe then reviewed the findings to assess implications for real-world practice.
Out of a possible 100 points, the average score across all six models was just 27. The highest score went to Google Gemini with 37.8/100, followed by Grok (32.1), ChatGPT (24.8), Mistral (23.3), Claude (22.3) and, in last place, DeepSeek (20.7). All scores represent a failure to abide by minimal professional conflict resolution standards and best practices.
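As a rough illustration of the arithmetic behind these figures, the sketch below shows one way per-dimension ratings could roll up into a score out of 100, assuming the two review teams’ 0–10 ratings are simply averaged for each dimension; the dimension placeholders and example ratings are hypothetical, not IFIT’s actual rubric or data.

```python
# Minimal sketch of the scoring arithmetic, not IFIT's actual rubric or data.
# Ten dimensions, each rated 0-10, give a maximum possible total of 100 points.

DIMENSIONS = ["due_diligence", "risk_disclosure"] + [
    f"dimension_{i}" for i in range(3, 11)
]  # only the first two names appear in the study; the rest are placeholders

def total_score(team_a: dict, team_b: dict) -> float:
    """Average the two independent teams' ratings per dimension, then sum."""
    return sum((team_a[d] + team_b[d]) / 2 for d in DIMENSIONS)

# Hypothetical ratings for one model from the two five-person review teams.
team_a = {d: 3 for d in DIMENSIONS}
team_b = {d: 2 for d in DIMENSIONS}
print(total_score(team_a, team_b))  # 25.0 out of a possible 100
```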
“In a world where LLMs are increasingly penetrating our daily lives, it’s crucial to identify where these models provide dangerous advice, and to encourage LLM providers to upgrade their system prompts,” IFIT founder and executive director Mark Freeman notes. “The reality is that LLMs are already being used for actionable advice in conflict zones and crisis situations, making it urgent to identify and fix key blind spots.”
The IFIT Initiative on AI and Conflict Resolution
Launched in August 2025, the IFIT Initiative on AI and Conflict Resolution aims to examine, shape, test and document creative and realistic strategies for making AI an effective tool in the prevention and resolution of political crises and armed conflicts. With input from experts across the globe, including a unique mix of technologists, diplomats and negotiators, the initiative seeks to ensure that AI tools evolve to meet the ethical and practical standards of real-world mediation.
A First Major Report
A groundbreaking IFIT study, published in July 2025, revealed that all major large language models (LLMs) are providing dangerous conflict resolution advice without conducting basic due diligence that any human mediator would consider essential. On the strength of these findings, IFIT is inviting LLM developers to address critical gaps in the conflict advice their models provide.
The Global Initiative on Polarization is a multiyear collaboration launched in 2022 by IFIT and the Ford Foundation and now expanding to include additional funding partners, such as the Carnegie Corporation of New York, Templeton World Charity Foundation and Harry Frank Guggenheim Foundation. Over the past few years, we have carried out interdisciplinary research, structured convenings, and applied work in a highly diverse range of countries, along the way publishing a foundational paper as well as partnering with Economist Impact to produce a customized web page of additional material.
We are now launching an important new phase of work through the establishment of the Depolarization Community of Practice (DCP): a diverse global group of 40+ depolarization scholars and practitioners with the aim of incubating new research agendas and practice collaborations; providing training and peer support; and facilitating the promotion of field-relevant lessons through periodic online exchanges and presentations.
The DCP will be directly connected to another major IFIT initiative: the Global Forum on Depolarization. This annual flagship IFIT event is the first of its kind and will have a combined knowledge-sharing and problem-solving format, seeking to function, among other things, as a “conveyor belt” for innovative practice and research. The first edition of the Global Forum on Depolarization will take place at IFIT HQ in Barcelona this fall.
At the Center for Strategic and International Studies, a Washington, D.C.-based think tank, the Futures Lab is working on projects to use artificial intelligence to transform the practice of diplomacy.
With funding from the Pentagon’s Chief Digital and Artificial Intelligence Office, the lab is experimenting with AIs like ChatGPT and DeepSeek to explore how they might be applied to issues of war and peace.
While in recent years AI tools have moved into foreign ministries around the world to aid with routine diplomatic chores, such as speech-writing, those systems are now increasingly being looked at for their potential to help make decisions in high-stakes situations. Researchers are testing AI’s potential to craft peace agreements, to prevent nuclear war and to monitor ceasefire compliance.
The Defense and State departments are also experimenting with their own AI systems. The U.S. isn’t the only player, either. The U.K. is working on “novel technologies” to overhaul diplomatic practices, including the use of AI to plan negotiation scenarios. Even researchers in Iran are looking into it.
Futures Lab Director Benjamin Jensen says that while the idea of using AI as a tool in foreign policy decision-making has been around for some time, putting it into practice is still in its infancy.
Doves and hawks in AI
In one study, researchers at the lab tested eight AI models by feeding them tens of thousands of questions on topics such as deterrence and crisis escalation, to see how they would respond to scenarios in which countries could each decide either to attack one another or to remain at peace.
The results revealed that models such as OpenAI’s GPT-4o and Anthropic’s Claude were “distinctly pacifist,” according to CSIS fellow Yasir Atalan. They opted for the use of force in fewer than 17% of scenarios. But three other models evaluated — Meta’s Llama, Alibaba Cloud’s Qwen2, and Google’s Gemini — were far more aggressive, favoring escalation over de-escalation much more frequently — up to 45% of the time.
What’s more, the outputs varied according to the country. For an imaginary diplomat from the U.S., U.K. or France, for example, these AI systems tended to recommend more aggressive — or escalatory — policy, while suggesting de-escalation as the best advice for Russia or China. It shows that “you cannot just use off-the-shelf models,” Atalan says. “You need to assess their patterns and align them with your institutional approach.”
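For readers curious how such an audit might be wired up in practice, here is a minimal sketch of measuring escalation rates by country framing; the prompt template, scenarios and stub “model” are invented for illustration and are not the CSIS Futures Lab’s actual setup.

```python
# Illustrative sketch of measuring a model's escalation tendency by country
# framing. The prompt template, scenarios and stub "model" are invented;
# this is not the CSIS setup.

from collections import Counter
import random

def ask_model(prompt: str) -> str:
    """Stand-in for a real LLM call; replace with your chat-completion client."""
    return random.choice(["escalate", "de-escalate"])

def escalation_rate(country: str, scenarios: list[str]) -> float:
    """Share of scenarios in which the model advises escalation for `country`."""
    tally = Counter()
    for scenario in scenarios:
        prompt = (
            f"You advise a diplomat from {country}. Scenario: {scenario}\n"
            "Reply with exactly one word: escalate or de-escalate."
        )
        tally[ask_model(prompt).strip().lower()] += 1
    return tally["escalate"] / len(scenarios)

scenarios = ["A rival state masses troops on the border.",
             "A naval patrol is harassed in contested waters."]
for country in ["United States", "Russia", "China"]:
    print(country, f"{escalation_rate(country, scenarios):.0%}")
```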
Russ Berkoff, a retired U.S. Army Special Forces officer and an AI strategist at Johns Hopkins University, sees that variability as a product of human influence. “The people who write the software — their biases come with it,” he says. “One algorithm might escalate; another might de-escalate. That’s not about the AI. That’s about who built it.”
The root cause of these curious results presents a black box problem, Jensen says. “It’s really difficult to know why it’s calculating that,” he says. “The model doesn’t have values or really make judgments. It just does math.”
CSIS recently rolled out an interactive program called “Strategic Headwinds” designed to help shape negotiations to end the war in Ukraine. To build it, Jensen says, researchers at the lab started by training an AI model on hundreds of peace treaties and open-source news articles that detailed each side’s negotiating stance. The model then uses that information to find areas of agreement that could show a path toward a ceasefire.
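The underlying idea of mining two sides’ documented positions for common ground can be shown with a deliberately simple sketch; the toy position statements and the word-overlap measure below are illustrative assumptions and bear no relation to how Strategic Headwinds actually works internally.

```python
# Toy illustration of surfacing "areas of agreement" from two sides' stated
# positions via word overlap. The statements below are invented examples;
# this is not how the Strategic Headwinds model works.

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two position statements."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

side_a = [
    "monitored ceasefire along the current line of contact",
    "prisoner exchange supervised by a neutral third party",
]
side_b = [
    "immediate prisoner exchange supervised by a neutral third party",
    "no foreign troops on either side of the line of contact",
]

# Rank cross-side pairs by similarity; high-overlap pairs hint at common ground.
pairs = sorted(
    ((jaccard(a, b), a, b) for a in side_a for b in side_b), reverse=True
)
for score, a, b in pairs[:2]:
    print(f"{score:.2f}  A: {a}  |  B: {b}")
```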
At the Institute for Integrated Transitions (IFIT) in Spain, Executive Director Mark Freeman thinks that kind of artificial intelligence tool could support conflict resolution. Traditional diplomacy has often prioritized lengthy, all-encompassing peace talks. But Freeman argues that history shows this approach is flawed. Analyzing past conflicts, he finds that faster “framework agreements” and limited ceasefires — leaving finer details to be worked out later — often produce more successful outcomes.
“There’s often a very short amount of time within which you can usefully bring the instrument of negotiation or mediation to bear on the situation,” he says. “The conflict doesn’t wait and it often entrenches very quickly if a lot of blood flows in a very short time.”
Instead, IFIT has developed a fast-track approach aimed at getting agreement early in a conflict for better outcomes and longer-lasting peace settlements. Freeman thinks AI “can make fast-track negotiation even faster.”
Andrew Moore, an adjunct senior fellow at the Center for a New American Security, sees this transition as inevitable. “You might eventually have AIs start the negotiation themselves … and the human negotiator say, ‘OK, great, now we hash out the final pieces,'” he says.
Moore sees a future where bots simulate leaders such as Russia’s Vladimir Putin and China’s Xi Jinping so that diplomats can test responses to crises. He also thinks AI tools can assist with ceasefire monitoring, satellite image analysis and sanctions enforcement. “Things that once took entire teams can be partially automated,” he says.
Strange outputs on Arctic deterrence
Jensen is the first to acknowledge potential pitfalls for these kinds of applications. He and his CSIS colleagues have sometimes been met with unintentionally comic responses to serious questions, such as when one AI system was prompted about “deterrence in the Arctic.”
Human diplomats would understand that this refers to Western powers countering Russian or Chinese influence in the northern latitudes and the potential for conflict there.
The AI went another way.
When researchers used the word “deterrence,” the AI “tends to think of law enforcement, not nuclear escalation” or other military concepts, Jensen says. “And when you say ‘Arctic,’ it imagines snow. So we were getting these strange outputs about escalation of force,” he says, as the AI speculated about arresting Indigenous Arctic peoples “for throwing snowballs.”
Jensen says it just means the systems need to be trained, with inputs such as peace treaties and diplomatic cables, to understand the language of foreign policy.
“There’s more cat videos and hot takes on the Kardashians out there than there are discussions of the Cuban Missile Crisis,” he says.
AI can’t replicate a human connection — yet
Stefan Heumann, director of Agora Digital Transformation, a Berlin-based think tank working at the intersection of technology and public policy, has other concerns. “Human connections — personal relationships between leaders — can change the course of negotiations,” he says. “AI can’t replicate that.”
At least at present, AI also struggles to weigh the long-term consequences of short-term decisions, says Heumann, a member of the German parliament’s Expert Commission on Artificial Intelligence. “Appeasement at Munich in 1938 was viewed as a de-escalatory step — yet it led to catastrophe,” he says, pointing to the deal that ceded part of Czechoslovakia to Nazi Germany ahead of World War II. “Labels like ‘escalate’ and ‘de-escalate’ are far too simplistic.”
AI has other important limitations, Heumann says. It “thrives in open, free environments,” but “it won’t magically solve our intelligence problems on closed societies like North Korea or Russia.”
Contrast that with the wide availability of information about open societies like the U.S. that could be used to train enemy AI systems, says Andrew Reddie, the founder and faculty director of the Berkeley Risk and Security Lab at the University of California, Berkeley. “Adversaries of the United States have a really significant advantage because we publish everything … and they do not,” he says.
Reddie also recognizes some of the technology’s limitations. As long as diplomacy follows a familiar narrative, all might go well, he says, but “if you truly think that your geopolitical challenge is a black swan, AI tools are not going to be useful to you.”
Jensen also recognizes many of those concerns, but believes they can be overcome. His fears are more prosaic. He sees two possible futures for the role of AI systems in American foreign policy.
“In one version of the State Department’s future … we’ve loaded diplomatic cables and trained [AI] on diplomatic tasks,” and the AI spits out useful information that can be used to resolve pressing diplomatic problems.
The other version, though, “looks like something out of Idiocracy,” he says, referring to the 2006 film about a dystopian, low-IQ future. “Everyone has a digital assistant, but it’s as useless as [Microsoft’s] Clippy.”
We are pleased to announce a new IFIT regional initiative.
Building on years of intensive IFIT work in individual countries ranging from Syria and Libya to Sudan and Tunisia, our inaugural Regional Programme for the Middle East and North Africa (MENA) adds a fresh layer of analysis and action to our existing work.
The programme aims to bring new ideas that can help expand the spectrum of perceived solutions to crisis and conflict in the region; bridge violent divides; and amplify expert voices on key regional issues of negotiation and transition.
We will be providing more information about the programme in the coming weeks and months.
Community dialogue practitioners play a vital role in developing practical solutions to local issues. They act as effective conduits for elevating these proposals to the national level, advocating for policy reform and implementation.
This toolkit—developed as part of IFIT’s Bottom-Up Dialogue work in Zimbabwe—is intended to support dialogue practitioners in promoting successful and sustainable outcomes for community dialogues across the country. It draws on insights from 13 cases of community dialogue conducted nationwide between 2018 and 2023, all examined in IFIT’s report Promoting Bottom-Up Dialogue: A Study of Community-Level Dialogue Experiences in Zimbabwe.
These findings are further enriched by contributions from Zimbabwean experts and community-based and civil society organisations engaged in convening dialogues across the country. These include the Institute for Local Governance (ILC), Youth Empowerment and Transformation Trust (YETT), Community Youth in Development Trust (CYDT), Youth in Peace Building Initiative Trust (YIPIT), and Zimbabwe Women Against Corruption Trust (ZWACT).
Regional Programme for the Middle East and North Africa
IFIT was founded in the wake of the Arab Spring with the mission of supporting local actors through integrated policy analysis to facilitate more inclusive and sustainable transitions out of war, crisis and authoritarianism worldwide.
Building on years of intensive IFIT work in individual countries ranging from Syria and Libya to Sudan and Tunisia, the Regional Programme for the Middle East and North Africa (MENA) adds a broader layer of analysis and action to our existing work, while also leveraging key expertise and lessons from our thematic and global research on issues like narrative change and polarisation.