KTTS KDE4 Roadmap

Executive Summary

This document discusses plans for migration of KDE Text-to-Speech (KTTS) to the KDE version 4 platform. The current architecture and implementation are described. Current limitations and issues are enumerated. A new implementation based on the Speech Dispatcher backend is presented. Finally, the current state of migration is given.

This document will be of interest to TTS and accessibility developers who wish to know the future plans with respect to KDE4 and TTS. It will serve as a guide to the KDE accessibility team and KTTS developers.


Current Architecture and Implementation

KTTS consists of two parts: kttsmgr, a GUI for configuring TTS options and controlling speech in real time, and kttsd, a non-GUI backend that implements the actual speech functions. kttsmgr is implemented as a KCModule, which means that the same GUI is also available in the KDE Control Center. kttsd runs as a background process that accepts requests for speech from applications via DCOP. From the viewpoint of kttsd, kttsmgr is just another application, albeit one that sets configuration options in the kttsdrc file. Note that kttsd does not perform TTS synthesis itself; instead, it dispatches text messages to one or more synthesizer backends, such as the Festival Speech Synthesizer.

Objectives

When planning the migration strategy for KTTS, it is useful to review the objectives that were used in the design of the existing KTTS system.

Low Dependency
Since applications communicate with kttsd via DCOP, they have only a runtime dependency on it. Applications can discover if kttsd is installed and running and modify their behavior accordingly.
Easy Speech Enabling
The KSpeech API is designed to make speech enabling of applications as simple as possible for the application developers, while still permitting more advanced speech capabilities for those applications which require it.
Synthesizer Transparency
Applications need not concern themselves with details of the speech synthesizers used to generate speech. The simplest speech-enabled application simply sends text to kttsd via DCOP. kttsd picks an appropriate synthesizer, prioritizes the message against other speech, handles the interface with the speech synthesizer, and plays the speech on the audio device. When appropriate, applications can specify the attributes they desire. For example, if the application knows the language of the text, it can pass that in the DCOP request to kttsd, and kttsd will choose the appropriate synthesizer and voice to speak the text. This is implemented via the concept of Talkers and Talker Codes. For details, see the KSpeech interface.
Low Latency versus Control
When an application requests speech, it is desirable for that speech to begin being heard as soon as possible. This is particularly important for screen reader applications. At the same time, one needs to be able to stop or pause speech and inject higher priority messages. Existing synthesizers provide a mixed bag of these capabilities. Most are good at reducing latency, but almost none provide a means to stop speech once it has started. kttsd in KDE 3.5 endeavoured to reconcile these two opposing requirements, with mixed success.
User Control
Most synthesizers provide little or no interactive user control over speech. Most use textual configuration options or hard-to-use command line options. kttsmgr provides a single GUI for configuring speech options and controlling speech interactively.
Integration With the K Desktop Environment
KTTS offers the following KDE integration features:
Text Filtering
Most synthesizers provide basic text to speech conversion. Some offer richer speech via markup languages such as SABLE or SSML. KTTS provides ways for users to make use of these facilities via filters that convert text from one format to a format supported by the synthesizer. For example, one can convert the XHTML of a web page to SSML, thereby speaking link URLs in a faster, softer voice. One might call this "rich speak". In addition, filtering provides a means for correcting misspoken text. For example, a smiley emoticon in chat can be spoken as "smiles" rather than "colon hyphen right-paren".
The KTTS project began as an effort to provide TTS services for ebooks and long documents. Support for screen readers and other assistive technologies is also included, but it should be noted that this was not the primary consideration in its design, at least not initially.



Illustration 1: Current kttsd Architecture


Illustration 1 shows the current architecture of the kttsd components.



Applications submit text to be spoken to the KSpeech interface of kttsd via DCOP. They may also specify a Talker Code, which contains the attributes of the desired speech. An example DCOP message from the KMouth application would be:

dcop kttsd KSpeech sayText "Guten Tag" "de"

In this example, KMouth wishes to say a greeting in German. Talker Codes may specify desired attributes such as language, gender, talking rate, and volume. For a complete description of Talker Codes, see the kspeech.h file.
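For scripting or testing, the same request can be issued from a program by invoking the dcop command-line client. The following is a minimal sketch, assuming the dcop tool is installed and kttsd is registered under the name shown in the example above; the function names build_say_text_command and say_text are hypothetical helpers, not part of KTTS.

```python
import shutil
import subprocess


def build_say_text_command(text, talker_code):
    """Return the argv list for the dcop call quoted in the text above."""
    return ["dcop", "kttsd", "KSpeech", "sayText", text, talker_code]


def say_text(text, talker_code=""):
    """Send the request if the dcop tool is on PATH.

    Returns True when the dcop call exits successfully, False otherwise.
    """
    if shutil.which("dcop") is None:
        return False  # no DCOP command-line client installed
    result = subprocess.run(build_say_text_command(text, talker_code))
    return result.returncode == 0
```

Passing the arguments as a list avoids shell quoting problems when the text itself contains quotes or spaces.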

kttsd places the message and Talker Code into one of several queues based on the type of the message. Message type is determined by the application. Message types determine the priority of speaking order. Message types listed in decreasing priority are:

There is a separate queue for each type of message. All are first-in first-out queues except for Screen Reader Output, which is always a single item queue, i.e., Screen Reader Output always interrupts and discards any in-progress Screen Reader Output.
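The queueing rules just described can be sketched as follows. This is an illustration only, not the kttsd code: the class name SpeechQueues is hypothetical, and of the four type names only Screen Reader Output and Text Job are taken from this document; WARNING and MESSAGE are placeholder names for the intermediate queues.

```python
from collections import deque

# Highest priority first. Only "Screen Reader Output" and "Text Job" are named
# in the text; WARNING and MESSAGE are placeholders (assumptions).
PRIORITY_ORDER = ["SCREEN_READER_OUTPUT", "WARNING", "MESSAGE", "TEXT_JOB"]


class SpeechQueues:
    """One FIFO queue per message type, drained in priority order."""

    def __init__(self):
        self._queues = {kind: deque() for kind in PRIORITY_ORDER}

    def enqueue(self, kind, text, talker_code=""):
        queue = self._queues[kind]
        if kind == "SCREEN_READER_OUTPUT":
            queue.clear()  # single-item queue: newest output replaces the old one
        queue.append((text, talker_code))

    def dequeue(self):
        """Return (kind, text, talker_code) for the most urgent message, or None."""
        for kind in PRIORITY_ORDER:
            if self._queues[kind]:
                text, talker_code = self._queues[kind].popleft()
                return kind, text, talker_code
        return None
```

The sketch models only the queue contents. Interrupting a message whose audio is already playing is a separate concern handled by the audio stage.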

The DCOP interface, prioritization, and queueing are handled by the core code in kttsd. The rest of the code is implemented as a dequeuing algorithm that passes messages to a series of plugins. Most of the plugins run asynchronously, either in a separate KProcess or in a separate QThread. The steps a message passes through are:

  1. The message passes through any Filters the user has configured. For example, the user can configure a String Replacer Filter to substitute words for chat emoticons.

  2. If the message is a Text Job, it passes through a Sentence Boundary Detector (SBD) Filter, which splits the message up into individual sentences. Each sentence is dequeued and passed through each of the next steps separately.

  3. kttsd uses the Talker Code specified by the application to find a matching Talker configured by the user. For example, if KMouth specifies "de" for the Talker Code, kttsd tries to find a Talker configured to speak German. If no such Talker has been configured, kttsd uses the default Talker.

  4. kttsd passes the sentence, together with the name of a temporary wav file, to the chosen Talker plugin for synthesis. The synthesizer outputs the synthesized audio to this file.

  5. If configured, the wav file is passed to the Sox Stretcher plugin to speed up or slow down the overall speech rate. (Most of the synth plugins also permit speech rate adjustment in their configuration dialogs.)

  6. The wav file is passed to an audio plugin for playback.
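Steps 1 to 3 above can be illustrated with a small sketch. It is a simplification under stated assumptions: the function names and the dictionary layout of a Talker are hypothetical, the Talker Code is treated as a bare language code (as in the "de" example), and the sentence splitter is a naive regular expression rather than the actual SBD filter.

```python
import re


def string_replacer_filter(text, replacements):
    """Step 1: substitute spoken words for literal strings such as emoticons."""
    for literal, spoken in replacements.items():
        text = text.replace(literal, spoken)
    return text


def split_sentences(text):
    """Step 2: naive sentence boundary detection on '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def choose_talker(talker_code, talkers, default):
    """Step 3: first configured Talker whose language matches, else the default."""
    for talker in talkers:
        if talker["lang"] == talker_code:
            return talker
    return default


# Usage: filter a chat line, split it, and pick a German Talker if one exists.
talkers = [{"lang": "en", "synth": "festival"}, {"lang": "de", "synth": "hadifix"}]
filtered = string_replacer_filter("Hello there :-) How are you? Fine.", {":-)": "smiles"})
sentences = split_sentences(filtered)
talker = choose_talker("de", talkers, default=talkers[0])
```

Each sentence would then be handed to the chosen Talker plugin (step 4) one at a time.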

The dequeuing algorithm attempts to work ahead. While one sentence is being played back on the audio stream, the next few sentences are being synthesized and stretched. Incoming messages are simultaneously being filtered and broken up into sentences. Audio playback takes by far the most wall-clock time, so kttsd will typically have fully prepared 3 or 4 sentences ahead of the sentence currently being played back. When the audio of one sentence is finished, a wav file is immediately ready for the next sentence. The downside to this approach is that when the kttsd audio queue is empty, the next request is delayed slightly while it is filtered, broken into sentences, synthesized, and stretched. In practice, this delay is small, but it is sometimes noticeable when the CPU is slow or busy. The upside to kttsd handling the audio rather than the synthesizer is that kttsd can stop playback instantaneously in order to speak a higher priority message, or in response to a pause or stop request from the user or application.
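The work-ahead behaviour can be pictured as a producer and a consumer joined by a bounded queue. The sketch below is not the kttsd implementation (which uses KProcess and QThread); run_pipeline, synthesize, and play are hypothetical names, and the lookahead of 3 simply mirrors the "3 or 4 sentences" figure quoted above.

```python
import queue
import threading

LOOKAHEAD = 3  # sentences prepared ahead of playback (assumed from the text)


def run_pipeline(sentences, synthesize, play, lookahead=LOOKAHEAD):
    """Synthesize sentences in a worker thread while the caller's thread plays them."""
    ready = queue.Queue(maxsize=lookahead)  # worker blocks once it is far enough ahead

    def worker():
        try:
            for sentence in sentences:
                ready.put(synthesize(sentence))
        finally:
            ready.put(None)  # sentinel: no more audio will arrive

    thread = threading.Thread(target=worker)
    thread.start()
    while True:
        audio = ready.get()  # waits only when the queue has run dry
        if audio is None:
            break
        play(audio)
    thread.join()


# Usage with stand-ins for the synthesizer and audio plugins.
played = []
run_pipeline(["a", "b", "c", "d", "e"], lambda s: s.upper() + ".wav", played.append)
```

The bounded queue captures both effects described above: playback rarely waits once the queue is primed, but the first sentence after an idle period must wait for synthesis. Stop and pause handling are omitted from the sketch.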

Current Limitations/Issues

There are some trade-offs and issues in the existing KTTS design described in the previous section:

Planned KDE4 Implementation

Speech Dispatcher is an open source project whose goal is to provide a device independent layer for speech synthesis through a simple, stable and well documented interface. It takes care of most of the tasks that speech enabled applications would otherwise have to solve themselves. Speech Dispatcher is to speech synthesis what a very high level GUI library is to graphics.

In many respects, Speech Dispatcher is very similar to KTTS. They both solve similar problems and achieve similar goals. Speech Dispatcher offers the following advantages:

The plan for KTTS in KDE4 is to replace the synthesizer, stretcher, audio, and SBD filter plugins in kttsd with the single Speech Dispatcher backend. Illustration 2 shows the resulting new architecture:


Illustration 2: KTTS and Speech Dispatcher Under KDE4



In this architecture, Speech Dispatcher is the primary speech message dispatching component of the entire system, including KDE, Gnome, and console applications. KDE applications can continue to interface with kttsd via DCOP or DBUS. kttsd no longer performs the prioritization, queueing, synthesizing, or audio functions. Instead, these functions are taken over by Speech Dispatcher. KTTS continues to include the following functions:

(Note: The diagram above is a bit dated and does not accurately depict the TTS Engine API architecture. See the Brailcom website for more info.)

KDE applications have two possible interfaces for requesting speech. They can continue to use the KTTS DCOP or DBUS interface, or they may interface directly with Speech Dispatcher via TCP. The choice is up to the KDE application developer. If low latency and high performance are primary concerns, or desktop independence is important, it might be best to bypass KTTS and interface directly with Speech Dispatcher. If text filtering or ease of programming is the primary concern, KTTS will be a better choice.
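To give a feel for the direct TCP route, the following sketch speaks one message using Speech Dispatcher's SSIP text protocol. It rests on several assumptions that should be checked against the SSIP documentation: the server listens on 127.0.0.1 port 6560, commands and text lines end in CR LF, a message is terminated by a line containing a single dot (with leading dots in the text doubled), and the last line of each reply has a space after the three-digit code. The client name and all function names are hypothetical.

```python
import socket

CRLF = "\r\n"


def ssip_commands(client_name, priority):
    """Commands sent before the message text (assumed SSIP syntax)."""
    return [
        "SET self CLIENT_NAME " + client_name,
        "SET self PRIORITY " + priority,
        "SPEAK",
    ]


def ssip_text_block(text):
    """Message body: dot-escaped lines followed by the terminating '.' line."""
    escaped = ["." + line if line.startswith(".") else line
               for line in text.splitlines()]
    return CRLF.join(escaped + ["."]) + CRLF


def read_reply(stream):
    """Read one reply; continuation lines look like 'NNN-...', the last like 'NNN ...'."""
    lines = []
    while True:
        line = stream.readline().rstrip("\r\n")
        lines.append(line)
        if len(line) < 4 or line[3] == " ":
            return lines


def speak(text, host="127.0.0.1", port=6560,
          client_name="user:ktts-sketch:main", priority="text"):
    """Connect, queue one message, and disconnect. Needs a running server."""
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rw", encoding="utf-8", newline="")
        for chunk in [c + CRLF for c in ssip_commands(client_name, priority)] + [
                ssip_text_block(text), "QUIT" + CRLF]:
            stream.write(chunk)
            stream.flush()
            read_reply(stream)  # replies are read but not checked in this sketch
```

A real client would inspect the reply codes and would more likely use one of the client libraries shipped with Speech Dispatcher than raw sockets.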

Advantages, Goals, and Trade-offs

In addition to the advantages of Speech Dispatcher listed above, this architecture also offers the following advantages, goals, and trade-offs:

Eliminate SBD
Because Speech Dispatcher is closely integrated with speech synthesizer libraries, the need for Sentence Boundary Detection is eliminated. The downside to this is a build and install dependency on particular synthesizer versions. Since Speech Dispatcher would be the primary component for both the Gnome and KDE desktops, this should be manageable.
Better Screen Reader Support
Speech Dispatcher is designed with Screen Reader applications in mind. It provides a rich set of message types and priorities tailored to the needs of these applications. For example, imagine downloading a file and a progress bar dialog. One wants to count off the percent progress in an intelligible way. Instead of speaking "10 percent 20 30 per 40 per fif 60", etc., it is better to speak "10 percent 40 percent 60 percent". Also, it is essential that "100 percent" is spoken when the download completes so that a blind user knows the download is finished. Speech Dispatcher automatically handles such a situation.
Eliminate wav File Disk I/O
Since Speech Dispatcher links to speech synthesizer libraries, there is no need to produce wav files on disk. Instead, audio is returned from the synthesizer to Speech Dispatcher as memory transfers or piped I/O.
Unified Accessibility Infrastructure
The FreeStandards and FreeDesktop Accessibility groups have been striving to unify accessibility capabilities in order to provide the best possible experience for users with disabilities. Towards that end, both the Gnome and KDE Accessibility teams have agreed to center their efforts around the Assistive Technology - Service Provider Interface (AT-SPI). The goal is to provide well-functioning assistive technologies regardless of whether a user runs KDE applications, Gnome applications, or both. The conversion to Qt4/KDE4 is a key enabling step towards this goal, and Speech Dispatcher would be an excellent component for providing TTS services in the AT-SPI infrastructure. Speech Dispatcher would also be an excellent choice for Braille message dispatching, since the requirements for Braille message dispatching are very similar to those for TTS and the two need to be coordinated.
Unified Low-level TTS API
The FreeStandards and FreeDesktop Accessibility groups are also endeavouring to agree on a single low-level TTS Engine API. Work on this API specification is currently underway. See http://lists.freedesktop.org/archives/accessibility/2006-February/000069.html. When it is ready, Speech Dispatcher will provide this API implementation in addition to its higher-level SSIP interface. See the "TTS Engine API" block in Illustration 2. An additional advantage to this standardized API is uniform treatment of text markup using Speech Synthesis Markup Language (SSML) with extensions.
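The progress-bar behaviour described under Better Screen Reader Support can be sketched as a speaker that keeps at most one pending progress message. This illustrates the idea only and is not Speech Dispatcher's implementation; the class ProgressSpeaker and its methods are hypothetical.

```python
class ProgressSpeaker:
    """Keeps at most one pending progress message while the synthesizer is busy."""

    def __init__(self):
        self.spoken = []      # messages actually handed to the synthesizer
        self._busy = False    # True while an utterance is being spoken
        self._pending = None  # newest message that arrived while busy

    def submit(self, message):
        if self._busy:
            self._pending = message  # overwrite: older unspoken progress is dropped
        else:
            self._busy = True
            self.spoken.append(message)

    def finished_speaking(self):
        """Called when the current utterance ends; speak the newest pending message."""
        self._busy = False
        if self._pending is not None:
            message, self._pending = self._pending, None
            self.submit(message)


# Usage: updates arrive faster than they can be spoken.
speaker = ProgressSpeaker()
for percent in (10, 20, 30, 40):
    speaker.submit(f"{percent} percent")
speaker.finished_speaking()
for percent in (50, 60):
    speaker.submit(f"{percent} percent")
speaker.finished_speaking()
speaker.submit("100 percent")
speaker.finished_speaking()
```

Because a pending message is only ever replaced by a newer one, the final "100 percent" is always spoken, and no utterance is cut off partway through.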

Getting There

Prerequisites

Before the new architecture can be implemented, a few changes are needed to Speech Dispatcher. The Speech Dispatcher team has already agreed in principle to these changes:

Long Text Message Type
The message types and priorities currently supported by Speech Dispatcher are ideal for the low latency, short lifetime messages of screen reader applications. Missing, however, is a message type for ebooks and other low priority, longer text messages. Without this new Long Text type, Speech Dispatcher will tend to discard text when higher priority requests are made.
Begin/End Sentence/Job Feedback
Since KTTS currently emits DCOP events at the beginning and end of a text job playback, and also at the beginning and end of each sentence, Speech Dispatcher must likewise supply this information to KTTS. Partial support for this has already been added to Speech Dispatcher in the form of markers that can be embedded in text. Ideally, sentence begin/end reporting would be accomplished without KTTS performing SBD.
KDEMM (Phonon) Integration
In order for speech audio output to "play" well in the KDE desktop, Speech Dispatcher will need to support the KDE4 multimedia framework. For instance, one does not wish for speech output to block other KDE audio output. Volume and spatial controls (e.g. front/back speakers) similarly should be coordinated. Of course, this must be implemented in such a way that it does not add hard KDE dependencies to Speech Dispatcher, which would be unacceptable on the Gnome desktop or on other platforms. An optional plugin implementation comes to mind. One simple solution relies on using ALSA for audio output, since ALSA already provides mixing without blocking of audio devices. Speech Dispatcher already supports ALSA, as well as NAS.

Tasks/Steps

These tasks are listed in rough chronological order.

  1. Convert to Qt4/KDE4 the KTTS Qt3/KDE3 code that will still be needed in the new architecture. This includes the DCOP interface, filter plugins, KNotify interface, and portions of kttsmgr. In order to have a working KTTS, it might be useful to also convert the portions that will eventually be discarded, using the Qt3 Compatibility library.

  2. Add DBUS support to kttsd, assuming that is the standard interprocess messaging protocol that KDE4 will use. Drop DCOP support.

  3. Refactor Speech Dispatcher to support the low-level TTS Engine API specification from freedesktop.org.

  4. Implement prerequisite changes to Speech Dispatcher.

  5. Define a mapping from the KTTS KSpeech Interface to corresponding functions in the Speech Dispatcher SSIP interface, including mapping of message types. This may require some changes to the KSpeech API, but the goal will be to keep these to a minimum with minimal impact on existing KDE applications.

  6. Interface kttsd with the new KNotify implementation in kdelibs4.

  7. Remove the SBD filter, synthesizer, stretcher, and audio plugins from kttsd. Substitute the Speech Dispatcher backend.

  8. Design new GUI screens for configuring KTTS and Speech Dispatcher and controlling speech output at runtime. Note: It would be good to work closely with usability.kde.org at this step.

  9. Implement the new screens and functionality in kttsmgr.

  10. Design and implement Speech Dispatcher/KDEMM (Phonon) integration. Most likely, this will be one or more (optional) audio plugins for Speech Dispatcher.

  11. Make changes to existing apps required by any changes to KSpeech API: KMouth, KSayIt, Kate and KHtml plugins.

  12. Add support to Speech Dispatcher (more precisely, the TTS Engine API) for additional synthesizers and languages which KTTS currently supports but Speech Dispatcher does not. In particular, Epos, Hadifix, and possibly commercial synths Cepstral, DECTalk, and IBM TTS (formerly ViaVoice). Note that there are licensing issues that must be carefully considered when doing this.

  13. When a working AT-SPI infrastructure is available in KDE4, integrate Speech Dispatcher as an AT Service.

Migration Status

29 Jun 2006
Conversion of the KTTS code in trunk/KDE/kdeaccessibility/kttsd to Qt4/KDE4 is complete. Conversion of the code from DCOP to DBUS is complete. Since kdelibs4 is still in flux, compilation breaks regularly.
The KMouth application in kdeaccessibility has also been converted to the latest kdelibs and works with kttsd. KSayIt compiles, but does not yet run.
DCOP has been eliminated from kdelibs. This means that applications interfacing with kttsd will require recoding to use DBUS. This provides some opportunities to improve the interface, and prepare for the conversion to the Speech Dispatcher backend. Application developers need not be concerned as the primary goal is to continue to keep simple requirements simple to implement.
Conversion to the new KNotify architecture presents some challenges. KNotify no longer broadcasts events, so it will be necessary to move the KTTS notification configuration and implementation code into kdelibs (KNotifyBySpeech). From a user perspective, this will be more intuitive and easier to understand. The downside is it will require additional widgets to be added to the KNotify configuration dialog, which is already quite crowded. There is some thinking going on to help deal with this.
A Phonon audio output plugin has been added to kttsd. It works, but there are some issues. When last tried, the Stop and Pause functions were not working. For the time being, the ALSA plugin remains in kttsd, but it appears probable that it can be removed in the future. This would eliminate an entire tab from the kttsmgr configuration dialog. There do not appear to be any obstacles to writing a Phonon plugin for Speech Dispatcher/TTS API.
Work on the TTS Engine API specification is complete and the Speech Dispatcher team has begun design work on an implementation. See the Brailcom TTS and TTS API Provider websites for more information.

ChangeLog

5 Mar 2006, Gary Cramblitt (a.k.a. PhantomsDad)
Initial version
5 Mar 2006, Gary Cramblitt
Suggestions from Hynek Hanke.
29 Jun 2006, Gary Cramblitt
Status update.