USC Researchers Stress-Test AI Chatbots for Mental Health Care

A new study enlisted 100 mental health professionals to evaluate how leading language models respond to real patient questions, revealing both promise and safety concerns.

Omega Editorial · June 24, 2026 · 3 min read

Key takeaways

One hundred mental health professionals evaluated how leading AI chatbots respond to real patient questions, finding generally high empathy scores but recurring safety concerns.
High overall performance ratings did not eliminate risks like overgeneralization, limited personalization, and advice that crossed clinical boundaries.
Traditional AI evaluation methods focused on knowledge tasks miss critical aspects of mental health conversations, which require emotional awareness and open-ended engagement.
USC researchers received an OpenAI Mental Health Award to study how AI handles challenging scenarios like resistant clients and to develop improved safety protocols.
Interdisciplinary collaboration with domain experts is essential for identifying subtle safety issues that computer scientists alone might miss.

As artificial intelligence chatbots become increasingly common sources of mental health support, researchers are working to understand whether these systems can safely handle sensitive conversations about psychological well-being. More than a third of psychologists now report patients who use AI for mental health support, according to recent surveys.

Ruishan Liu, a computer science professor at USC Viterbi School of Engineering, is leading efforts to rigorously evaluate how AI performs in these high-stakes scenarios. Her latest research project, CounselBench, recruited 100 mental health professionals—more than 70% of them licensed therapists—to assess how leading language models respond to real patient questions.

Promising performance with persistent risks

The evaluation revealed a complex picture. Current AI models generally scored well on empathy and performed strongly across multiple dimensions. However, high overall ratings did not eliminate safety concerns. Clinicians identified recurring problems including overgeneralization, limited personalization, and advice that sometimes crossed clinical boundaries. These issues appeared even in responses that otherwise seemed empathetic and helpful.

To probe deeper, Liu's team conducted a second phase in which clinicians designed challenging questions specifically to stress-test the models and expose potential weaknesses. The approach departs from traditional AI evaluation methods, which typically focus on knowledge-based tasks like multiple-choice questions or standardized exams.

Why it matters

With a documented shortage of mental health resources and growing public reliance on AI for psychological support, understanding the capabilities and limitations of these systems has direct implications for patient safety. The research highlights a critical gap: computer scientists evaluating AI responses may miss subtle safety issues that trained clinicians immediately recognize, such as when advice crosses professional boundaries or fails to account for individual patient circumstances.

Beyond evaluation to improvement

Liu's team recently received an OpenAI Mental Health Award to extend the research. The next phase will examine more challenging scenarios, including how models handle resistant or uncooperative clients—common situations in real-world therapy settings.

The research also aims to move from identifying problems to solving them. Now that specific failure patterns have been documented, Liu's team is working on methods to make language models safer and more suitable for deployment in clinical contexts.

Liu emphasizes that building trustworthy AI for healthcare requires collaboration across disciplines. Computer scientists working alone may not recognize risks that domain experts immediately identify. For CounselBench, input from trained mental health professionals proved essential for spotting subtle safety concerns that might otherwise go unnoticed.

The findings were first reported by USC News in an interview with Liu about her research on AI evaluation methods for healthcare applications.

This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
