Marcel Wassink

New developments in speech recognition technology that take account of context and medical vocabulary could make it a useful tool for clinicians. But how can you train a computer to take accurate dictation – particularly in a medical context?

Transcriptionists do much more than simply type what was dictated. For a start, they leave out the ‘um’s and ‘eh’s, ignore dialogue that is not part of the dictation, implement corrections that are dictated as part of the text, fill the information into forms, and even rephrase sentences. They format and organise text, adding section headings, numbering lists and standard blocks of content. In short, they ensure that the final document communicates what was meant, rather than just what was said.

The new speech recognition technology is called Intelligent Speech Interpretation (ISI). It emulates the capabilities of a good medical transcriptionist, increases the productivity of secretarial staff, and frees resources for more critical tasks. Crucially, the technology is just as useful to doctors who prefer to look after the reporting process themselves as to those who delegate transcription and editing to someone else.

Situational intelligence

The first challenge is acoustic. Background noise, differences in dialect, variations in pitch and speed, and how distinct or slurred the pronunciation is can all make speech harder to recognise. By filtering out acoustic events that have no relevance to the current report, and by comparing the signal with known variations in speaker characteristics, the system can compensate for many of these deviations and normalise the speech for further processing.

Next, the system must recognise what the speaker said. As with other challenges in speech recognition, the context of the dictation is the key to generating high-quality and consistent results. This starts with vocabulary. Awareness of what people are likely to say not only helps recognise what they do say, it also helps identify what doesn’t belong.

For example, “PET” (positron emission tomography) is more likely in a radiologist’s report than “pet” (an animal kept at home). Likewise, the probability of “PET” being followed by “scan” is much higher than the probability of it being followed by “food”.
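The idea that context makes some word sequences more probable than others can be sketched with a very small bigram model. This is only an illustration under stated assumptions: the four-sentence corpus, the function names and the add-alpha smoothing are made up for the example and are not taken from the ISI system.

```python
from collections import Counter

# Tiny invented corpus standing in for a large collection of radiology reports.
CORPUS = [
    "pet scan shows no abnormal uptake",
    "pet scan of the chest was performed",
    "follow up pet scan recommended",
    "ct scan of the abdomen was normal",
]

def train_bigrams(sentences):
    """Count single words and adjacent word pairs in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def next_word_probability(unigrams, bigrams, prev, word, vocab_size, alpha=1.0):
    """Estimate P(word | prev); add-alpha smoothing keeps unseen pairs unlikely but possible."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

unigrams, bigrams = train_bigrams(CORPUS)
vocab_size = len(unigrams) + 1  # one extra slot for words never seen, such as "food"

p_scan = next_word_probability(unigrams, bigrams, "pet", "scan", vocab_size)
p_food = next_word_probability(unigrams, bigrams, "pet", "food", vocab_size)
print(f"P(scan | pet) = {p_scan:.3f}, P(food | pet) = {p_food:.3f}")
```

Trained on reports from a radiology department, a model of this kind favours “scan” after “PET”; trained on general consumer text, the balance could look quite different, which is why the vocabulary and statistics need to match the setting.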

Controlling the applications

An understanding of context also makes it possible to use speech recognition to control the user interface – a particularly attractive feature for doctors who prefer to look after the full report generation process themselves. By differentiating commands from dictation, speech recognition and word-processing applications can be linked more closely to eliminate most of the remaining keyboard work.

The doctor can open documents, switch italics on and off, go back and make corrections, spell unusual patient names, and so on, using speech alone. For a radiologist, it means being able to dictate and manipulate an image at the same time. With some simple configuration, a doctor can even define personal ‘speech macros’ to insert frequently used blocks of text, or to navigate through regularly used forms and fill them out without needing a keyboard.
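As a rough sketch of how commands, macros and dictation might be told apart, the snippet below routes each recognised utterance to one of three destinations. The command phrases, the "insert macro" prefix and the `Session` class are hypothetical names invented for this example; a real system would rely on context and a much richer command grammar rather than exact phrase matching.

```python
from dataclasses import dataclass, field

# Hypothetical command phrases mapped to editor actions (assumptions for this sketch).
COMMANDS = {
    "open document": "OPEN_DOCUMENT",
    "italics on": "ITALICS_ON",
    "italics off": "ITALICS_OFF",
}
MACRO_PREFIX = "insert macro "

@dataclass
class Session:
    macros: dict = field(default_factory=dict)   # macro name -> block of text
    text: list = field(default_factory=list)     # dictated content, in order
    actions: list = field(default_factory=list)  # editor actions, in order

def handle(session, utterance):
    """Route one recognised utterance to a command, a macro, or the report text."""
    phrase = utterance.strip().lower()
    macro_name = phrase[len(MACRO_PREFIX):] if phrase.startswith(MACRO_PREFIX) else None
    if phrase in COMMANDS:
        session.actions.append(COMMANDS[phrase])
    elif macro_name in session.macros:
        session.text.append(session.macros[macro_name])
    else:
        # Anything that is neither a command nor a known macro is plain dictation.
        session.text.append(utterance.strip())

session = Session(macros={"normal chest": "Heart and lungs are unremarkable."})
for spoken in ["italics on", "Findings", "italics off", "insert macro normal chest"]:
    handle(session, spoken)
print(session.actions)
print(session.text)
```

Keeping actions and text in separate lists mirrors the separation described above: the word processor receives the commands, while only the dictated content and expanded macros end up in the report.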

Rules and analyses

Whether the doctor is dictating the content of a report or issuing commands to the word-processing application, ISI works internally with phonetic representations of words and with rules for the structure of phrases, sentences and documents. Starting with basic representations and rules, along with a suitable vocabulary, the system then adds more detail by statistically examining large numbers of existing texts. When transcribing a dictation, the system compares the words at hand with these statistics to infer the intended word, phrase, sentence or document section, and adjusts the output accordingly.

By working closely with a number of medical system manufacturers, including MedQuist, the researchers developing the ISI system have been able to correlate dictations with both the machine-recognised texts and the manually corrected final reports. This has enabled them to improve both the initial recognition rate and the quality of the reports the system delivers. In particular, it helped them discover unexpected “rules”: situations where the speech recognition system recognised the dictated words correctly, but a trained transcriptionist would always write them differently, for instance by turning an acronym into a full phrase, or by writing “BP 118/82” instead of the dictated text “blood pressure colon 118 over 82”.
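A formatting rule like the blood pressure example can be sketched as a simple text rewrite. The regular expression and the function name `format_blood_pressure` are assumptions made for illustration; the article does not describe how ISI represents or applies such rules internally.

```python
import re

# Matches "blood pressure", an optional dictated "colon", and two readings
# separated by the word "over". Illustrative pattern, not taken from ISI.
BP_PATTERN = re.compile(
    r"\bblood pressure(?: colon)? (\d{2,3}) over (\d{2,3})\b",
    flags=re.IGNORECASE,
)

def format_blood_pressure(text):
    """Rewrite a dictated blood pressure reading the way a transcriptionist would."""
    return BP_PATTERN.sub(r"BP \1/\2", text)

print(format_blood_pressure("blood pressure colon 118 over 82"))
```

In the approach described above, rules of this kind were discovered by comparing recognised text with the transcriptionists' corrected reports, rather than written by hand one at a time as in this sketch.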


Speech recognition for dictation can be used by the clinician to produce reports immediately, or to speed up a traditional transcription service by letting the secretaries concentrate on quality control (where the system also helps, by getting the spellings right). Either way, it reduces the administrative overhead and shortens the time between dictation and release. Many radiology departments appreciate this, for example as a complement to a PACS that has already raised the expectations of their colleagues elsewhere in the hospital. The bottom line is shorter patient waiting times and higher patient satisfaction.

Off-the-shelf consumer software can only be offered at a low price because it provides a fairly basic speech engine for generic use. A professional speech recognition solution, by contrast, has to be carefully optimised for the specific requirements of the institution in which it is used.

Marcel Wassink, Managing Director
Philips Speech Recognition Systems

