Lost in translation? Speech processing in psychiatry
- Páll Matthíasson
- Jun 17
- 10 min read
Interview with Dr. Jón Guðnason
Dr. Jón Guðnason is a Professor of Engineering at Reykjavik University, specializing in speech signal processing and language technology. He earned his MSc degree in electrical engineering from the University of Iceland in 2000, and a PhD degree in speech signal processing from Imperial College London in 2007.
His research focuses on the nature of the voice source and speech production, examining their effects on applications such as speech recognition and synthesis. After completing his doctoral studies, he worked as a post-doctoral researcher at Columbia University in 2008 before joining the faculty of engineering at Reykjavik University in 2009.
In 2014, he founded Almannarómur, an organization dedicated to promoting language technology for the Icelandic language. Professor Guðnason has led significant efforts to develop automatic speech recognition and text-to-speech synthesis for Icelandic, resulting in the implementation of Icelandic speech recognition in systems from major technology companies like Google and Microsoft.
As a prolific researcher in speech processing, signal processing, and machine learning, his work has been widely cited in the academic community. He currently directs the Language and Voice Lab (LVL) at Reykjavik University, collaborating with Almannarómur on various speech corpus collection efforts to advance Icelandic language technology.
Páll Matthíasson: Thank you for joining me today to discuss automatic speech recognition technology. Could you start by describing your background and how you became interested in speech processing?
Jón Guðnason: My background is in electrical engineering. I became interested in speech because it was the main focus of communication theory, which always fascinated me. I was also interested in signal processing and pattern recognition, which was a precursor to machine learning, so speech combined many of my interests. I joined Imperial College London in 2000 to pursue my PhD studies. There, the focus was very much on signal processing techniques to analyze voice. When I joined the faculty at Reykjavik University in 2009, I broadened my research to include the application of voice analysis in the health sector, as well as developing standard speech processing tools such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) for the Icelandic language.
Before we dive into the psychiatric applications, could you explain how automatic speech recognition technology actually works? Where does the potential for being “lost in translation” begin?
Certainly. ASR technology begins with voice analysis, converting acoustic signals into digital data that computers can process. The system analyzes various characteristics of speech including frequency patterns, amplitude variations, temporal features, and phonetic segments.
Modern ASR systems typically follow a processing pipeline: first, audio preprocessing reduces noise and enhances speech clarity; then feature extraction identifies relevant speech characteristics; acoustic modeling matches sound patterns to phonetic units; language modeling determines the most likely word sequences; and finally post-processing refines outputs based on context and domain knowledge. Importantly, most of these processing steps today are implemented using neural networks - complex mathematical systems that form the fundamental building blocks of artificial intelligence.
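To make that pipeline concrete, the sketch below walks through the same stages in Python. It is deliberately simplified - the function names and placeholder implementations are illustrative only and do not correspond to any particular ASR system; in real systems the acoustic and language models are the neural networks described above.

```python
import numpy as np

def preprocess(audio):
    """Audio preprocessing: peak normalisation standing in for noise
    reduction and speech enhancement."""
    return audio / (np.max(np.abs(audio)) + 1e-9)

def extract_features(audio, frame=400, hop=160):
    """Feature extraction: short-time log-magnitude spectra; production
    systems use log-mel features or learned representations."""
    frames = [audio[i:i + frame] for i in range(0, len(audio) - frame, hop)]
    return np.stack([np.log(np.abs(np.fft.rfft(f)) + 1e-9) for f in frames])

def acoustic_model(features):
    """Acoustic modelling: map each frame to a phonetic unit.
    A stub standing in for what is, in practice, a neural network."""
    return ["<phone>" for _ in features]

def language_model(phonetic_units):
    """Language modelling and post-processing: choose the most likely
    word sequence given the phonetic hypotheses and surrounding context."""
    return "<most likely word sequence>"

def transcribe(audio):
    return language_model(acoustic_model(extract_features(preprocess(audio))))
```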
This reliance on neural networks presents both opportunities and challenges. On one hand, these networks have dramatically improved ASR accuracy compared to earlier statistical methods. On the other hand, they operate as “black boxes” where the internal decision-making processes are not always transparent. For truly comprehensive voice understanding in clinical settings, we need to thoroughly understand what these neural networks are doing and how to work with them effectively. This is particularly crucial in psychiatry, where interpretability - knowing why the system made a certain transcription decision - can be as important as accuracy itself.
Translation errors can occur at any stage of this pipeline. A microphone might not capture subtle sounds; background noise might interfere; the system might misinterpret similar-sounding words; or it might fail to understand contextual meaning. In psychiatry, where nuance is crucial, these errors can be particularly problematic. While this technology can analyze voice patterns to assist with diagnosing psychological conditions, in clinical documentation, the main objective is to accurately recognize and transcribe the actual words and sentences spoken during patient interactions - a goal that's still surprisingly challenging to achieve perfectly.

How have you personally used voice analysis, and have you applied it in psychiatric contexts?
The main technical challenge in voice analysis for the health sector is identifying features in the speech signal that relate to specific symptoms or disorders, whether neurological or physiological. Most of my work has focused on cognitive workload. With my colleague Dr. Kamilla Rún Jóhannsdóttir in the Department of Psychology at Reykjavik University, we set up experiments where participants solved tasks of varying difficulty using speech, and our goal was to identify the difficulty level based solely on voice characteristics.
One practical application is monitoring people in safety-critical jobs, such as air traffic control, to ensure they're handling an appropriate workload. Others have used similar voice analysis techniques to quantify levels of depression and anxiety, though regrettably, we haven't expanded into that area yet.
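In practice, this kind of voice-based classification can be sketched in a few lines of Python using the open-source librosa and scikit-learn libraries. The file names, the three-level labels, and the feature set below are assumptions made for illustration - they are not the actual experimental setup described above.

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def voice_features(path):
    """A handful of classic voice features: pitch, spectral envelope, intensity."""
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # fundamental frequency track
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # spectral envelope
    rms = librosa.feature.rms(y=y)                        # frame-level intensity
    return np.concatenate([
        [np.nanmean(f0), np.nanstd(f0)],                  # pitch level and variability
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [rms.mean(), rms.std()],
    ])

# Recordings labelled with the difficulty of the task being performed (hypothetical files).
paths = ["easy_task.wav", "medium_task.wav", "hard_task.wav"]
labels = np.array([0, 1, 2])  # 0 = easy, 1 = medium, 2 = hard
features = np.stack([voice_features(p) for p in paths])

classifier = RandomForestClassifier().fit(features, labels)
```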
You mentioned developing ASR for the Icelandic language. Can you tell me more about that work?
When I returned to Iceland in 2009, there had been very little development of ASR and TTS. I actively campaigned for government funding for this development. Those were lean years after the financial crisis of 2008, with little interest in prioritizing language technology. Still, we started with a small funded student summer project in 2010 and then collaborated with Google to gather data for Icelandic ASR. By 2012, Google had included Icelandic speech recognition in its search services, demonstrating what was possible for the language.
The following years were more fruitful. We initiated projects on ASR for radiology in 2013, parliament in 2014, and created general open-source speech recognition and synthesis for Icelandic in 2015 and 2016. In 2019, the government launched the Language Technology Programme for Icelandic where ASR and TTS played significant roles.
The open-source software and data collected during these projects have made ASR and TTS for Icelandic available to many, allowing companies to develop solutions for specific customers. It was, for example, very satisfying to see Microsoft add Icelandic ASR and TTS to its speech infrastructure, and there has been an increase in language technology innovation and start-up activity in Iceland. However, challenges remain - Icelandic is not as well-developed as larger languages, meaning accuracy is often not as good as it should be and still too low for some applications that are already proving useful in more widely spoken languages.
Can you describe how ASR is used in healthcare systems? What are the communication gaps or “translation issues” you've encountered, and what have you achieved despite these challenges?
ASR is the core technology in dictation systems used throughout healthcare, enabling clinicians to create reports quickly and efficiently. This has been my focus for the past ten years. Recently, we've been exploring more challenging scenarios, including transcribing doctor-patient interviews and over-the-phone question-answering for triaging patients.
The translation challenges are significant. Medical terminology is notoriously complex, with terms that sound similar but have radically different meanings. Context matters tremendously - the same word can have different implications depending on the specialty. And conversations in healthcare are often emotionally charged, with patients speaking unclearly or clinicians using shorthand expressions that ASR systems struggle to interpret correctly.
These challenges multiply when dealing with unstructured clinical conversations versus structured dictation. In a psychiatric interview, for instance, patients might speak hesitantly, mumble, or use idiosyncratic expressions that carry important diagnostic information but confuse the ASR system.
We began developing a dictation system for Icelandic radiologists in 2013 and formed the spin-off company Tiro around that in 2016. The system is now used in radiology at LSH and several other facilities, and we're expanding the technology to other specialties. Orthopedics and pediatrics implementations are nearly complete.
The main challenge in developing healthcare dictation systems is that vocabulary varies dramatically between specialties - effectively different languages that need different translation approaches. When we started with radiology, we compiled most of the vocabulary manually, but now we're working on automating these processes to better bridge these communication gaps.
As a psychiatrist, I'm particularly interested in how this technology might help in mental health settings, where accurate understanding is absolutely crucial. What specific applications do you see for ASR in psychiatry, and what unique “translation” challenges might we face?
In psychiatry, ASR could be transformative in several ways, but the translation challenges are perhaps more profound than in any other medical specialty.
The most immediate application is clinical documentation - psychiatrists spend significant time on notes and paperwork, which ASR can streamline, allowing more time with patients. However, psychiatry involves capturing subtle emotional cues, implicit meanings, and sometimes even what remains unsaid - elements that current ASR systems often miss entirely.
Beyond documentation, ASR can enable better analysis of therapy sessions. By creating accurate transcripts, clinicians can review sessions more thoroughly, identify patterns in patient speech, and track progress over time. But here's where things get complicated: a patient saying “I'm fine” with a particular intonation might mean precisely the opposite. Current ASR systems capture the words but lose the meaning - a critical loss in translation.
For psychiatric applications specifically, the system needs specialized vocabulary covering DSM-5 terms, psychological assessment scales, medication names, and therapeutic approaches. The technology must also handle emotional speech with varying prosody and speech patterns that might be affected by psychiatric conditions themselves - from the pressured speech of mania to the halting, disorganized communication of schizophrenia.
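One common way to favour such domain vocabulary is to rescore the recogniser's candidate transcripts with a small bonus for known clinical terms, as sketched below. The term list, scores, and weighting are invented purely for illustration; production systems typically apply this kind of biasing inside the decoder itself.

```python
# Hypothetical clinical vocabulary and scores, for illustration only.
PSYCHIATRIC_TERMS = {"anhedonia", "akathisia", "dysthymia", "quetiapine"}

def rescore(hypotheses, boost=0.5):
    """hypotheses: (transcript, score) pairs from the recogniser's n-best list."""
    def bonus(text):
        return boost * sum(word.lower() in PSYCHIATRIC_TERMS for word in text.split())
    return max(hypotheses, key=lambda h: h[1] + bonus(h[0]))

n_best = [
    ("the patient reports and he don ia since starting quetiapine", -41.2),
    ("the patient reports anhedonia since starting quetiapine", -41.5),
]
print(rescore(n_best)[0])  # the domain term tips the balance to the second hypothesis
```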
In remote psychiatry, which has grown dramatically in recent years, ASR can provide real-time transcription during video sessions, helping clinicians maintain accurate records while focusing on the patient. But connection issues, background noise, and the difficulty of establishing rapport remotely all create additional translation barriers between what's said and what's recorded.
What do you see as the most important challenges in implementing ASR effectively in healthcare?
The most important challenge is designing systems that truly streamline clinicians' workflows. We've spent considerable time figuring out how to capture dictation efficiently and deliver written reports back to systems where they can be appropriately processed. Our objective is to make clinicians' work easier without sacrificing speed or accuracy, which ultimately affects patient safety.
For psychiatric applications specifically, the challenges include capturing nuanced emotional content, dealing with patients who might speak in disorganized ways, and handling the highly sensitive nature of the information being discussed. Privacy and security considerations are particularly important in mental health settings.
What about risks? In psychiatry, being “lost in translation” could have serious consequences. What are the potential pitfalls of these mistranslations in automatic dictation systems and how can they be mitigated?
Errors are inevitable when using automatic speech recognition and AI in general - perfect translation from human speech to accurate text simply doesn't exist yet. When developing the technology, we minimize the risk of errors by using training data - speech recordings paired with their correct transcripts. But even a few word errors can lead to catastrophic misunderstandings in psychiatric contexts.
Consider the difference between “The patient is not suicidal” and “The patient is now suicidal” - a single word error that completely inverts the meaning with potentially life-threatening consequences. Or imagine the system transcribing “The medication isn't working” as “The medication is working.” These aren't hypothetical scenarios; we've observed such critical mistranslations in our testing.
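The standard accuracy metric, word error rate (WER), helps explain why such errors slip through: it weighs every word equally, so this clinically catastrophic substitution barely registers. A quick check using the open-source jiwer package (chosen here purely for illustration) makes the point:

```python
import jiwer

reference  = "the patient is not suicidal"
hypothesis = "the patient is now suicidal"

print(jiwer.wer(reference, hypothesis))  # 0.2 - one substitution in five words
# A 20% error rate looks modest on paper, yet the transcript now states
# the opposite of what was said.
```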
In psychiatry specifically, the consequences of mistranslation can be severe: incorrect diagnoses, inappropriate treatment plans, or missed warning signs of deteriorating mental health. The system might fail to capture emotional nuances or cultural contexts that are essential for proper understanding.
It's therefore essential for clinicians to review and approve all automatically transcribed reports and correct them if needed. The interface between the clinician and the electronic health record system is critically important, as is the surrounding user experience design.
In healthcare, especially psychiatry, we shouldn't think of this as “Artificial Intelligence” replacing human judgment, but rather as tools providing computer-assisted processes where clinicians play the lead role - human translators who can verify and correct the machine's understanding. The technology should enhance, not replace, the human element of psychiatric care, which remains our best defense against these dangerous mistranslations.

Looking toward the future, what developments in ASR technology do you believe will help bridge these translation gaps and have the greatest impact on psychiatric practice?
I believe we'll see significant advances in contextual understanding - ASR systems that not only transcribe words accurately but interpret their meaning in clinical contexts, much like a skilled human translator does. This could help flag potential concerns in patient speech that even experienced clinicians might miss, while reducing the dangerous mistranslations we currently see.
The most promising development will be multimodal systems that don't just listen to words but observe facial expressions, body language, and vocal tone - capturing the full spectrum of human communication rather than just the verbal component. This would address one of the fundamental translation problems in current systems: they hear the words but miss the meaning conveyed through these other channels.
We're also likely to see more sophisticated emotion recognition capabilities integrated with ASR, helping identify subtle changes in patients' emotional states over time. This could be particularly valuable for monitoring conditions like depression or bipolar disorder, where emotional states can fluctuate dramatically between sessions.
The integration of ASR with other clinical systems will become more seamless, reducing administrative burden while enhancing the quality of patient records. And as these systems become more specialized for psychiatric applications, they'll better handle the unique vocabulary and speech patterns encountered in mental health settings - developing essentially a specialized “translator” for psychiatric contexts.
Ultimately, these technologies should give psychiatrists more time to focus on what matters most - the human connection with patients - while enhancing the quality and consistency of care through better documentation and analysis. The goal isn't perfect translation - that's likely impossible - but rather translation that's good enough to be helpful while making its limitations transparent.
Thank you for these insights. It seems like ASR holds significant promise for psychiatry, though the potential to be “lost in translation” remains a serious concern that requires thoughtful implementation and a clear understanding of both capabilities and limitations.
Absolutely. Like any translation tool, ASR will never be perfect - something is always lost when converting human experience into data. The key is remembering that these are tools to enhance human capabilities, not replace them. We need the human translator - the clinician - to verify what the machine thinks it heard against what was actually meant.
When implemented properly, with appropriate training and clear processes for verification, ASR can help psychiatrists focus more on patients and less on paperwork. The technology should serve as a bridge between patient experiences and clinical documentation, not a barrier. And while mistranslations will continue to occur, acknowledging this limitation is the first step toward using these tools responsibly. □