This document discusses plans for migration of KDE Text-to-Speech (KTTS) to the KDE version 4 platform. The current architecture and implementation are described. Current limitations and issues are enumerated. A new implementation based on the Speech Dispatcher backend is presented. Finally, the current state of migration is given.
This document will be of interest to TTS and accessibility developers who wish to know the future plans with respect to KDE4 and TTS. It will serve as a guide to the KDE accessibility team and KTTS developers.
KTTS consists of two parts: kttsmgr, a GUI for setting TTS configuration options and controlling speech in real time, and kttsd, a non-GUI backend that implements the actual speech functions. kttsmgr is implemented as a KCModule, which means that the same GUI is also available in the KDE Control Center. kttsd runs as a background process that accepts requests for speech from applications via DCOP. From the viewpoint of kttsd, kttsmgr is just another application, albeit one that sets configuration options in the kttsdrc file. Note that kttsd does not perform TTS synthesis itself; instead, it dispatches text messages to one or more synthesizer backends, such as the Festival Speech Synthesizer.
When planning the migration strategy for KTTS, it is useful to review the objectives that were used in the design of the existing KTTS system.

Illustration 1: Current kttsd Architecture
Applications submit text to be spoken to the KSpeech interface of kttsd via DCOP. They may also specify a Talker Code, which contains the attributes of the desired speech. An example DCOP message coming from the KMouth application would be:
dcop kttsd KSpeech sayText "Guten Tag" "de"
In this example, KMouth wishes to say a greeting in German. Talker Codes may specify desired attributes such as language, gender, talking rate, and volume. For a complete description of Talker Codes, see the kspeech.h file.
kttsd places the message and Talker Code into one of several queues based on the type of the message. Message type is determined by the application. Message types determine the priority of speaking order. Message types listed in decreasing priority are:
Screen Reader Output
Warning
Message
Text Job
There is a separate queue for each type of message. All are first-in first-out queues except for Screen Reader Output, which is always a single-item queue; that is, new Screen Reader Output always interrupts and discards any in-progress Screen Reader Output.
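The queueing rules above can be illustrated with a minimal Python sketch. This is a hypothetical model, not the actual kttsd code: the class and method names are invented for illustration, and interruption of audio that is already playing is not modeled, only the discarding of pending Screen Reader Output.

```python
from collections import deque

# Message types in decreasing priority, as listed in the text.
PRIORITY_ORDER = ["Screen Reader Output", "Warning", "Message", "Text Job"]


class MessageQueues:
    """Hypothetical model of the per-type queues (names are assumptions)."""

    def __init__(self):
        self._queues = {kind: deque() for kind in PRIORITY_ORDER}

    def enqueue(self, kind, text, talker_code=""):
        pending = self._queues[kind]
        if kind == "Screen Reader Output":
            # Single-item queue: new output discards whatever was pending.
            pending.clear()
        pending.append((text, talker_code))

    def dequeue(self):
        """Return (kind, text, talker_code) for the next message, or None."""
        for kind in PRIORITY_ORDER:
            if self._queues[kind]:
                text, talker_code = self._queues[kind].popleft()
                return kind, text, talker_code
        return None
```

Within one message type the order is first-in first-out; across types the highest-priority non-empty queue is always served first.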
The DCOP interface, prioritization, and queueing are handled by the core code in kttsd. The rest of the code is implemented as a dequeuing algorithm that passes messages to a series of plugins. Most of the plugins run asynchronously, either in a separate KProcess or in a separate QThread. The steps a message passes through are:
The message passes through any Filters the user has configured. For example, the user can configure a String Replacer Filter to substitute words for chat emoticons.
If the message is a Text Job, it passes through a Sentence Boundary Detector (SBD) Filter, which splits the message up into individual sentences. Each sentence is dequeued and passed through each of the next steps separately.
kttsd uses the Talker Code specified by the application to find a matching Talker configured by the user. For example, if KMouth specifies "de" for the Talker Code, kttsd tries to find a Talker configured to speak German. If no such Talker has been configured, kttsd uses the default Talker.
kttsd passes the sentence, along with the name of a temporary wav file, to the chosen Talker plugin for synthesis. The synthesizer outputs the synthesized audio to this file.
If configured, the wav file is passed to the Sox Stretcher plugin to speed up or slow down the overall speech rate. (Most of the synth plugins also permit speech rate adjustment in their configuration dialogs.)
The wav file is passed to an audio plugin for playback.
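Two of the steps above, the String Replacer Filter and the selection of a Talker, can be sketched in Python as follows. This is an illustrative simplification under stated assumptions: the Talker Code is treated as a bare language code such as "de" (real Talker Codes can carry more attributes; see kspeech.h), and the function names and the dictionary layout of a Talker are hypothetical.

```python
def string_replacer_filter(text, replacements):
    """Replace each key of `replacements` found in `text` with its value."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


def choose_talker(talker_code, talkers, default_talker):
    """Return the first configured Talker matching the requested language.

    Assumption: a Talker is a dict with a "language" key, and the Talker
    Code is only a language code. Falls back to the default Talker.
    """
    for talker in talkers:
        if talker["language"] == talker_code:
            return talker
    return default_talker


# Hypothetical user configuration.
talkers = [
    {"language": "en", "synth": "Festival"},
    {"language": "de", "synth": "Hadifix"},
]
message = string_replacer_filter("Guten Tag :-)", {":-)": "smiley"})
talker = choose_talker("de", talkers, default_talker=talkers[0])
```

A request for a language with no configured Talker, for example "fr" here, would fall through to the default Talker, as described in the text.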
The dequeuing algorithm attempts to work ahead. While one sentence is being played back on the audio stream, the next few sentences are being synthesized and stretched. Incoming messages are simultaneously being filtered and broken up into sentences. Audio playback takes by far the most wall-clock time, so kttsd will typically have fully prepared 3 or 4 sentences ahead of the sentence currently being played back. When the audio of one sentence is finished, a wav file is immediately ready for the next sentence. The downside to this approach is that when the kttsd audio queue is empty, the next request is delayed slightly while it is filtered, broken into sentences, synthesized, and stretched. In practice, this delay is small, but it is sometimes noticeable when the CPU is slow or busy. The upside to kttsd handling the audio rather than the synthesizer is that kttsd can stop playback instantaneously in order to speak a higher priority message, or in response to a pause or stop request from the user or application.
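The work-ahead behaviour can be pictured as a producer/consumer arrangement with a small bounded buffer. The Python sketch below is only an analogy under assumptions: kttsd itself uses KProcess and QThread rather than Python threads, and synthesize() here is a stand-in that returns a fake file name instead of writing a wav file.

```python
import queue
import threading


def synthesize(sentence):
    # Stand-in for a Talker plugin; a real one would write a wav file.
    return f"{sentence}.wav"


def run_work_ahead(sentences, lookahead=3):
    """Prepare up to `lookahead` sentences ahead of playback.

    Returns the list of (fake) wav files in the order they were played.
    """
    ready = queue.Queue(maxsize=lookahead)
    played = []

    def producer():
        for sentence in sentences:
            # put() blocks while `lookahead` files are already waiting.
            ready.put(synthesize(sentence))
        ready.put(None)  # sentinel: nothing more to play

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        wav = ready.get()
        if wav is None:
            break
        played.append(wav)  # stand-in for audio playback
    worker.join()
    return played
```

Because the buffer is bounded, synthesis never runs more than a few sentences ahead of playback, which keeps the number of pending wav files small.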
There are some trade-offs and issues in the existing KTTS design described in the previous section:
As already mentioned, because kttsd handles audio playback itself, it can instantaneously pause or stop playback in response to user or application requests, or incoming higher priority messages, even when the synthesizer offers no way to stop once it has started. On the other hand, writing audio to a wav file adds latency because of the disk I/O.
Writing synthesized text to a wav file permits kttsd to keep a queue of synthesized sentences without extreme memory usage.
Most synthesizers offer a means to synthesize to a wav file, but few offer a more efficient means of returning audio without compiling and linking against a library. Linking against synthesizer libraries adds build and install dependencies. Since kttsd runs synthesizers in a separate subprocess, synthesizers can be upgraded without recompiling kttsd.
Performing Sentence Boundary Detection (SBD) reduces the synthesis time, especially if the synthesizer does not support audio output before it has finished synthesizing a full text message. SBD also permits rewind/advance by sentence, as well as interleaving of higher priority messages with low priority messages, which matters most when the synthesizer cannot be stopped once started.
Correct SBD is a complex task. The current algorithm used by kttsd SBD is based on regular expressions and sometimes breaks sentences incorrectly. The Festival synthesizer performs SBD itself using a more sophisticated algorithm based on analyzing parts of speech.
SBD is highly language dependent. kttsd does not perform SBD correctly for non-Latin languages such as Korean or Chinese. Fortunately for kttsd, open source synthesizers for these languages do not currently exist.
kttsd tries to match the attributes specified by the application in the Talker Code with the Talkers the user has configured. Obviously, this depends upon the user to configure the Talkers. It would be better if kttsd were able to discover and configure Talkers itself in response to requests. Currently, if the user has not configured any Talkers at all, kttsd will attempt to automatically configure a Talker, but a more robust capability to automatically configure multiple Talkers for a variety of languages and attributes as they are requested would be better. For example, if a user has Festival installed with both male and female voices, kttsd should configure two talkers so that it can respond effectively to gender requests.
kttsd does not permit simultaneous speech. It is designed to provide a single speech output stream. Blind users sometimes prefer that their system speak two or more streams simultaneously. Some users prefer to listen to a main TTS stream while permitting simultaneous alerts and short messages. These users have trained their ear to comprehend multiple streams, thereby operating more efficiently.
Since applications request speech via DCOP, kttsd can only support requests from KDE applications. Non-KDE applications, such as GTK-based (Gnome) apps or console applications, cannot make use of kttsd unless they are willing to add DCOP as a dependency.
kttsd only supports speech while KDE is running. This means it does not support speech from bootup until KDE is up and running.
Many users run both KDE and Gnome applications. The Gnome desktop uses its own speech API. This means that users may experience conflicts over the speech stream, simultaneous speech or blocking, and a generally degraded user experience. It would be better if Gnome and KDE (and other applications) could share a common speech subsystem to avoid these problems.
As new synthesizers, voices, and capabilities become available, kttsd must be modified to support them. The Gnome and Emacs projects must similarly enhance their speech subsystems. A shared subsystem would reduce maintenance manpower, which is already a scarce resource.
kttsd provides event feedback for sentence start/end and text job start/end. However, it does not provide word or phoneme-level feedback. Therefore, kttsd does not support word highlighting (follow the bouncing ball), or facial movement/speech synchronization (talking head).
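To illustrate why the regular-expression approach to SBD mentioned in the list above sometimes breaks sentences incorrectly, here is a minimal Python sketch. The pattern is an assumption chosen for illustration; the actual expressions used by the kttsd SBD filter are not reproduced in this document.

```python
import re

# Naive rule (assumed): a sentence ends at '.', '!' or '?' followed by
# whitespace. The lookbehind keeps the punctuation with its sentence.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split `text` into sentences using the naive boundary rule."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


good = split_sentences("Hello there. How are you?")
bad = split_sentences("Dr. Smith arrived.")
```

The first call yields two proper sentences, while the second splits after the abbreviation "Dr." and produces a fragment. Handling such cases reliably requires language-specific knowledge, which is the kind of analysis a synthesizer such as Festival performs itself.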
Speech Dispatcher is an open source project whose goal is to provide a device independent layer for speech synthesis through a simple, stable and well documented interface. It takes care of most of the tasks that speech-enabled applications would otherwise have to solve themselves. Speech Dispatcher is to speech synthesis what a very high-level GUI library is to graphics.
In many respects, Speech Dispatcher is very similar to KTTS. They both solve similar problems and achieve similar goals. Speech Dispatcher offers the following advantages:
Applications interface with Speech Dispatcher via a well defined protocol called SSIP. SSIP is currently implemented over TCP, as TCP is well understood, network transparent and available in all operating systems of interest. SSIP could be adapted to other transports, such as DBUS, in the future.
Not dependent upon any desktop.
Already supported by the Gnome Speech API via its Speech Dispatcher plugin.
Works with console applications.
Can provide speech support from near bootup. (It is normally started as an operating system service.)
Low latency from speech request to audio output.
Plugin-based backend supports a variety of speech synthesizers and audio subsystems.
Provides a rich set of message types and priorities, especially for screen reader use.
Proven performance in screen reader applications.
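As a rough idea of what talking SSIP over TCP involves, the Python sketch below builds the text of one speech request. It reflects the SSIP commands as documented by the Speech Dispatcher project (SET SELF, SPEAK, a terminating dot line, QUIT), but the function name, the client name, and the default values are assumptions made for illustration. A real client would send each command over the TCP connection and read the server's reply before sending the next one.

```python
def build_ssip_speak(text, language="de", priority="message",
                     client_name="user:example:main"):
    """Return the bytes of a minimal SSIP session that speaks `text`.

    Assumptions: client_name is a placeholder in user:client:connection
    form, and the server's replies are not read or checked here.
    """
    lines = [
        f"SET SELF CLIENT_NAME {client_name}",
        f"SET SELF LANGUAGE {language}",
        f"SET SELF PRIORITY {priority}",
        "SPEAK",
    ]
    for line in text.splitlines():
        # A text line starting with a dot gets an extra dot prepended,
        # so it cannot be mistaken for the end-of-text marker.
        lines.append("." + line if line.startswith(".") else line)
    lines.append(".")  # a single dot on its own line ends the text
    lines.append("QUIT")
    # SSIP lines are terminated by CR LF.
    return ("\r\n".join(lines) + "\r\n").encode("utf-8")


request = build_ssip_speak("Guten Tag")
```

The request is plain text, which is part of what makes the protocol usable from console applications and from any desktop.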
The plan for KTTS in KDE4 is to replace the synthesizer, stretcher, audio, and SBD filter plugins in kttsd with the single Speech Dispatcher backend. Illustration 2 shows the resulting new architecture:

Illustration 2: KTTS and Speech Dispatcher Under KDE4
In this architecture, Speech Dispatcher is the primary speech message dispatching component of the entire system, including KDE, Gnome, and console applications. KDE applications can continue to interface with kttsd via DCOP or DBUS. kttsd no longer performs the prioritization, queueing, synthesizing, or audio functions. Instead, these functions are taken over by Speech Dispatcher. KTTS continues to include the following functions:
DCOP/DBUS interface
KNotify interface
Text filtering
Configuration GUI and realtime user control GUI
(Note: The diagram above is a bit dated and does not accurately depict the TTS Engine API architecture. See the Brailcom website for more info.)
KDE applications have two possible interfaces for requesting speech. They can continue to use the KTTS DCOP or DBUS interface, or they may interface directly with Speech Dispatcher via TCP. The choice is up to the KDE application developer. If low latency and high performance are primary concerns, or desktop independence is important, it might be best to bypass KTTS and interface directly with Speech Dispatcher. If text filtering or ease of programming are primary concerns, KTTS will be a better choice.
In addition to the advantages of Speech Dispatcher listed above, this architecture also offers the following advantages, goals, and trade-offs:
Before the new architecture can be implemented, a few changes are needed to Speech Dispatcher. The Speech Dispatcher team has already agreed in principle to these changes:
These tasks are listed in rough chronological order.
Convert the KTTS code that will still be needed in the new architecture from Qt3/KDE3 to Qt4/KDE4. This includes the DCOP interface, filter plugins, KNotify interface, and portions of kttsmgr. In order to have a working KTTS, it might be useful to also convert the portions that will eventually be discarded, using the Qt3 Compatibility library.
Add DBUS support to kttsd, assuming that is the standard interprocess messaging protocol that KDE4 will use. Drop DCOP support.
Refactor Speech Dispatcher to support the low-level TTS Engine API specification from freedesktop.org.
Implement prerequisite changes to Speech Dispatcher.
Define a mapping from the KTTS KSpeech Interface to corresponding functions in the Speech Dispatcher SSIP interface, including mapping of message types. This may require some changes to the KSpeech API, but the goal will be to keep these to a minimum with minimal impact on existing KDE applications.
Interface kttsd with the new KNotify implementation in kdelibs4.
Remove the SBD filter, synthesizer, stretcher, and audio plugins from kttsd. Substitute the Speech Dispatcher backend.
Design new GUI screens for configuring KTTS and Speech Dispatcher and controlling speech output at runtime. Note: It would be good to work closely with usability.kde.org at this step.
Implement the new screens and functionality in kttsmgr.
Design and implement Speech Dispatcher/KDEMM (Phonon) integration. Most likely, this will be one or more (optional) audio plugins for Speech Dispatcher.
Update the existing applications (KMouth, KSayIt, and the Kate and KHtml plugins) as required by any changes to the KSpeech API.
Add support to Speech Dispatcher (more precisely, the TTS Engine API) for additional synthesizers and languages which KTTS currently supports but Speech Dispatcher does not. In particular, these are Epos, Hadifix, and possibly the commercial synthesizers Cepstral, DECTalk, and IBM TTS (formerly ViaVoice). Note that there are licensing issues that must be carefully considered when doing this.
When a working AT-SPI infrastructure is available in KDE4, integrate Speech Dispatcher as an AT Service.