Text To Wave Server DLL : ttw2ksrv.dll
Physical Name : Ttw2ksrv.dll (Installation directory)
DispatchID (Reference name): uTTSEngine
Properties, Methods, Events, XML Tags Description, Wave Formats Supported
| Properties | Description |
| QueryNumberofSoundCards() As Long | Gets the number of sound cards installed on the machine |
| QueryNumberofVoicesInstalled() As Long | Gets the total number of text-to-speech voices installed on the user's machine |
| QueryNumberofWaveFormats() As Long | Gets the number of wave formats supported for production |
| QuerySoundCardName(ByVal Cardno As Long) As String | Gets the manufacturer's name and ID number of the sound card used for output. This matters on a machine with more than one sound card. Cardno is zero-based: zero refers to the first sound card, so the highest valid value is the total number of cards minus one. |
| QueryVoiceSpeed() As Long | Gets the voice speed. The value can range from -10 to +10. A value of zero means the voice speaks at its default rate. |
| QueryVoiceTypeName(ByVal VoiceIndex As Long) As String | Gets the voice title for the specified index. The index number must always be less than or equal to the total number of engines available (get it with QueryNumberofVoicesInstalled). |
| QueryVolume() As Long | Gets the volume level specified. The value can range from 0 to 100. |
| QueryWaveFormatsSupported(ByVal FormatIndex As Long) As String | Gets the wave format specification allowed for output for the specified index. The index number must always be less than or equal to the total number of wave formats available (get it with QueryNumberofWaveFormats). |
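The ranges documented above (speed from -10 to +10, volume from 0 to 100) can be enforced before values are handed to the engine. The following is a minimal sketch in Python; the helper names are hypothetical and are not part of ttw2ksrv.dll, and whether the DLL itself clamps out-of-range values is not stated in this reference.

```python
# Hypothetical helpers (not part of ttw2ksrv.dll): keep settings inside
# the ranges documented for QueryVoiceSpeed and QueryVolume.
from typing import Tuple

SPEED_RANGE = (-10, 10)   # zero is the voice's default rate
VOLUME_RANGE = (0, 100)


def clamp(value: int, bounds: Tuple[int, int]) -> int:
    """Return value limited to the inclusive (low, high) bounds."""
    low, high = bounds
    return max(low, min(high, value))


def normalize_settings(speed: int, volume: int) -> Tuple[int, int]:
    """Clamp a (speed, volume) pair to the documented ranges."""
    return clamp(speed, SPEED_RANGE), clamp(volume, VOLUME_RANGE)
```

For instance, `normalize_settings(15, -5)` yields `(10, 0)`.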
| Methods | Description |
| Public Function mTextToWave(ByVal WaveFilePath As String, ByVal TextData As String, Optional ByVal iSync As Boolean = True, Optional ByVal WaveFormat As String = "SAFT22kHz16BitMono", Optional ByVal DataisaFile As Boolean = False, Optional ByVal SpeakPunc As Boolean = False, Optional ByVal VoiceTypeIndex As String = "0", Optional ByVal VoiceSpeed As Long = "0", Optional ByVal VoiceVolume As Long = "100") As Boolean | Primary instruction, which converts text to a wave file. Returns True if successful. The conversion includes parsing of XML-scripted text as described below. The instruction takes the following parameters... |
| Public Function RealSpeak(ByVal TextData As String, Optional Queueup As Boolean = False, Optional ByVal iSync As Boolean = False, Optional ByVal DataisaFile As Boolean = False, Optional ByVal Priority As Boolean = False, Optional ByVal SoundCardIndex As Long = 0, Optional ByVal VoiceTypeIndex As String = 0, Optional ByVal VoiceSpeed As Long = 0, Optional ByVal VoiceVolume As Long = 100, Optional ByVal OutputFormat As String = "SAFT22kHz16BitMono", Optional ByVal SpeakPunc As Boolean = False, Optional ByVal Purge As Boolean = False) As Boolean | Primary instruction, which converts text to speech in real time. Returns True if successful. The conversion includes parsing of XML-scripted text as described below. The instruction takes the following parameters... |
| RealPause() | Pauses the readout of the text. You may restart the paused reading using RealResume. |
| RealResume() | Resumes the readout of the text. You may pause the reading again using RealPause. |
| RealWordSkip(ByVal wNumber As Long) | Skips the currently speaking engine ahead by the specified number of words (wNumber). |
| RealStop() | Totally stops the text to speech readout in real time. |
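The optional parameters of mTextToWave and their defaults can be mirrored in a small record type, which makes it easier to see which values a call would use when arguments are omitted. This is a sketch in Python only: the class and method names are hypothetical, the defaults are copied from the signature above, and how the arguments reach the COM object is left out because it depends on the host environment.

```python
from dataclasses import dataclass


@dataclass
class TextToWaveRequest:
    """Hypothetical container mirroring the documented mTextToWave parameters."""
    wave_file_path: str
    text_data: str
    i_sync: bool = True
    wave_format: str = "SAFT22kHz16BitMono"
    data_is_a_file: bool = False
    speak_punc: bool = False
    voice_type_index: str = "0"
    voice_speed: int = 0
    voice_volume: int = 100

    def as_positional_args(self) -> tuple:
        # Order follows the documented signature, left to right.
        return (
            self.wave_file_path, self.text_data, self.i_sync,
            self.wave_format, self.data_is_a_file, self.speak_punc,
            self.voice_type_index, self.voice_speed, self.voice_volume,
        )


request = TextToWaveRequest("out.wav", "Hello world")
```

Only the two required values are supplied here; every other field falls back to the documented default.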
| Events | Description |
| Event ConversionStarted() Event ConversionStopped() | These events are fired when a conversion run by the mTextToWave instruction starts and stops. |
| Event WordSpoken(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal CharacterPosition As Long, ByVal Length As Long) | The WordSpoken event occurs when the text-to-speech (TTS) engine detects a word boundary while speaking a stream. StreamNumber = queue number of the text string being spoken using the RealSpeak instruction (i.e. with Queueup=True). StreamPosition = the character position in the output stream at which the word begins. CharacterPosition = the character position in the input stream one character before the start of the word. Length = length of the word spoken. |
| Event Sentence(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal CharacterPosition As Long, ByVal Length As Long) | The Sentence event occurs when the text-to-speech (TTS) engine detects a sentence boundary. StreamNumber = queue number of the text string being spoken using the RealSpeak instruction (i.e. with Queueup=True). StreamPosition = the character position in the output stream at which the sentence begins. CharacterPosition = the character position in the input stream one character before the start of the sentence. Length = length of the sentence spoken. |
| Event Phoneme(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal Duration As Long, ByVal NextPhoneId As Integer, ByVal CurrentPhoneId As Integer) | The Phoneme event occurs when the text-to-speech (TTS) engine detects a phoneme boundary while speaking a stream. StreamNumber = queue number of the text string being spoken using the RealSpeak instruction (i.e. with Queueup=True). |
| Event Bookmark(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal Bookmark As String, ByVal BookmarkId As Long) | The Bookmark event occurs when the text-to-speech (TTS) engine detects a bookmark while speaking a stream. Note that Bookmark events may not be synchronized with the actual speaking of the words in text streams containing bookmarks. In some circumstances, TTS buffering considerations may cause a Bookmark event to be received sooner than the voice speaks the word preceding the bookmark in the text stream. |
| Event StartedRealDictation(ByVal StreamNumber As Long, ByVal StreamPosition As Variant) | The StartedRealDictation event occurs when the text-to-speech (TTS) engine begins speaking a stream. StreamNumber = queue number of the text string being spoken using the RealSpeak instruction (i.e. with Queueup=True). StreamPosition = the character position in the output stream at which the stream begins. |
| Event StoppedRealDictation(ByVal StreamNumber As Long, ByVal StreamPosition As Variant) | The StoppedRealDictation event occurs when the text-to-speech (TTS) engine reaches the end of speaking a stream. StreamNumber = queue number of the text string being spoken using the RealSpeak instruction (i.e. with Queueup=True). StreamPosition = the character position in the output stream at which the stream ends. |
| Event EnginePrivate(ByVal StreamNumber As Long, ByVal StreamPosition As Long, ByVal EngineData As Variant) | The EnginePrivate event occurs when a private text-to-speech (TTS) engine detects a custom event condition boundary while speaking a stream. StreamNumber = queue number of the text string being spoken using the RealSpeak instruction (i.e. with Queueup=True). StreamPosition = the character position in the output stream at which the word begins. EngineData = data returned by the engine with the event. When using another manufacturer's TTS engine, consult its documentation for details. |
| Event AudioLevel(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal level As Long) | The AudioLevel event occurs when the text-to-speech (TTS) engine detects an audio level change while speaking a stream. StreamNumber = queue number of the text string being spoken using the RealSpeak instruction (i.e. with Queueup=True). StreamPosition = the character position in the output stream at which the word begins. |
| Event VoiceChange(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal VoiceTitle As String) | The VoiceChange event occurs when the text-to-speech (TTS) engine detects a change of voice while speaking a stream. StreamNumber = queue number of the text string being spoken using the RealSpeak instruction (i.e. with Queueup=True). StreamPosition = the character position in the output stream at which the word begins. VoiceTitle = title of the newly selected voice. |
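A common use of the WordSpoken arguments is to locate the word currently being spoken in the original text, for example to highlight it. The sketch below is written in Python with a hypothetical function name, and it assumes that CharacterPosition can be used directly as a zero-based offset into the input text; that interpretation should be checked against the events the engine actually raises.

```python
def spoken_word(text: str, character_position: int, length: int) -> str:
    """Return the slice of the input text that a WordSpoken event refers to.

    Assumption: character_position is a zero-based offset into text.
    """
    return text[character_position:character_position + length]


text = "Hello brave new world"
# Argument values a handler might receive for the third word (assumed).
word = spoken_word(text, 12, 3)
```

With these assumed values, `word` is `"new"`.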
| XML Scripting Tags |
XML TTS Tutorial
SAPI XML TTS for Application Developers
SAPI text-to-speech (TTS) extensible markup language (XML) tags fall into several categories.
Voice state control tags
SAPI TTS XML supports five tags that control the state of the current voice: Volume, Rate, Pitch, Emph, and Spell.
Volume
The Volume tag controls the volume of a voice. The tag can be empty, in which case it applies to all subsequent text, or it can have content, in which case it only applies to that content. The Volume tag has one required attribute: Level. The value of this attribute should be an integer between zero and one hundred. Values outside of this range will be truncated.
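The two forms of the Volume tag (empty, or wrapping content) can be generated with a small helper. This is a sketch in Python with a hypothetical function name; it clamps Level to the 0 to 100 range itself to mirror the truncation described above, and it escapes the content so that characters such as `<` do not break the markup.

```python
from typing import Optional
from xml.sax.saxutils import escape


def volume_tag(level: int, content: Optional[str] = None) -> str:
    """Build a Volume tag; hypothetical helper, not part of any SAPI API."""
    level = max(0, min(100, level))  # mirror the documented truncation
    if content is None:
        # Empty form: applies to all subsequent text.
        return f'<volume level="{level}"/>'
    # Content form: applies only to the enclosed text.
    return f'<volume level="{level}">{escape(content)}</volume>'
```

For example, `volume_tag(50, "Half volume.")` returns `<volume level="50">Half volume.</volume>`.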
One hundred represents the default volume of a voice. Lower values represent percentages of this default. That is, 50 corresponds to 50% of full volume. Values specified using the Volume tag will be combined with values specified programmatically (using ISpVoice::SetVolume). For example, if you combine a SetVolume( 50 ) call with a <volume level="50"> tag, the volume of the voice should be 25% of its full volume.
Rate
The Rate tag controls the rate of a voice. The tag can be empty, in which case it applies to all subsequent text, or it can have content, in which case it only applies to that content. The Rate tag has two attributes, Speed and AbsSpeed, one of which must be present. The value of both of these attributes should be an integer between negative ten and ten. Values outside of this range may be truncated by the engine (but are not truncated by SAPI).
The AbsSpeed attribute controls the absolute rate of the voice, so a value of ten always corresponds to a value of ten, a value of five always corresponds to a value of five.
All text which follows should be spoken at rate ten.
Speed
The Speed attribute controls the relative rate of the voice. The absolute value is found by adding each Speed to the current absolute value.
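The interplay of AbsSpeed and Speed described above can be traced with a small model. The following is a sketch in Python under stated assumptions: the function name is hypothetical, the tags are treated as empty tags applied one after another (scoping by tag content is not modeled), and no engine-side truncation is applied.

```python
from typing import Iterable, List, Tuple


def trace_rate(tags: Iterable[Tuple[str, int]], start: int = 0) -> List[int]:
    """Return the absolute rate in effect after each tag.

    Each tag is ("absspeed", n) to set the rate or ("speed", n) to adjust it.
    """
    current = start
    history = []
    for attribute, value in tags:
        if attribute == "absspeed":
            current = value       # absolute: replaces the current rate
        elif attribute == "speed":
            current += value      # relative: added to the current rate
        else:
            raise ValueError(f"unknown Rate attribute: {attribute}")
        history.append(current)
    return history


rates = trace_rate([("absspeed", 5), ("speed", -2), ("speed", 4)])
```

Starting from the default rate of zero, the sequence above gives absolute rates of 5, 3 and 7.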
Zero represents the default rate of a voice, with positive values being faster and negative values being slower. Values specified using the Rate tag will be combined with values specified programmatically (using ISpVoice::SetRate).
Pitch
The Pitch tag controls the pitch of a voice. The tag can be empty, in which case it applies to all subsequent text, or it can have content, in which case it only applies to that content. The Pitch tag has two attributes, Middle and AbsMiddle, one of which must be present. The value of both of these attributes should be an integer between negative ten and ten. Values outside of this range may be truncated by the engine (but are not truncated by SAPI).
The AbsMiddle attribute controls the absolute pitch of the voice, so a value of ten always corresponds to a value of ten, a value of five always corresponds to a value of five.
All text which follows should be spoken at pitch ten.
The Middle attribute controls the relative pitch of the voice. The absolute value is found by adding each Middle to the current absolute value.
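The choice between the absolute and relative Pitch attributes can be made explicit when generating markup. This is a sketch in Python with a hypothetical helper name; the value is passed through unchanged because, as noted above, any truncation is left to the engine.

```python
from typing import Optional
from xml.sax.saxutils import escape


def pitch_tag(value: int, absolute: bool = False,
              content: Optional[str] = None) -> str:
    """Build a Pitch tag using AbsMiddle (absolute) or Middle (relative)."""
    attribute = "absmiddle" if absolute else "middle"
    if content is None:
        # Empty form: applies to all subsequent text.
        return f'<pitch {attribute}="{value}"/>'
    return f'<pitch {attribute}="{value}">{escape(content)}</pitch>'
```

For example, `pitch_tag(10, absolute=True)` returns `<pitch absmiddle="10"/>`, while `pitch_tag(-3, content="Lower.")` wraps only the given text.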
Zero represents the default middle pitch for a voice, with positive values being higher and negative values being lower.
Emph
The Emph tag instructs the voice to emphasize a word or section of text. The Emph tag cannot be empty.
The following word should be emphasized.
The method of emphasis may vary from voice to voice.
Spell
The Spell tag forces the voice to spell out all text, rather than using its default word and sentence breaking rules, normalization rules, and so forth. All characters should be expanded to corresponding words (including punctuation, numbers, and so forth). The Spell tag cannot be empty.
Direct item insertion tags
Three tags give applications the ability to insert items directly at some level: Silence, Pron, and Bookmark.
Silence
The Silence tag inserts a specified number of milliseconds of silence into the output audio stream. This tag must be empty, and must have one attribute, Msec.
Pron
The Pron tag inserts a specified pronunciation. The voice will process the sequence of phonemes exactly as they are specified. This tag can be empty, or it can have content. If it does have content, it will be interpreted as providing the pronunciation for the enclosed text. That is, the enclosed text will not be processed as it normally would be. The Pron tag has one attribute, Sym, whose value is a string of white space separated phonemes.
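The Silence and Pron tags described above can be generated as follows. This is a sketch in Python with hypothetical helper names; the phoneme symbols in the usage line are illustrative only, since the valid symbol set depends on the language and voice in use.

```python
from typing import Optional, Sequence
from xml.sax.saxutils import escape


def silence_tag(msec: int) -> str:
    """Build an empty Silence tag for the given number of milliseconds."""
    if msec < 0:
        raise ValueError("msec must not be negative")
    return f'<silence msec="{msec}"/>'


def pron_tag(phonemes: Sequence[str], content: Optional[str] = None) -> str:
    """Build a Pron tag; Sym is a white space separated phoneme string."""
    sym = " ".join(phonemes)
    if content is None:
        return f'<pron sym="{sym}"/>'
    # With content, the phonemes replace normal processing of that text.
    return f'<pron sym="{sym}">{escape(content)}</pron>'


# Illustrative phoneme symbols (assumed, not taken from this document).
markup = "Wait" + silence_tag(500) + pron_tag(["h", "eh", "l", "ow"], "hello")
```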
Bookmark
The Bookmark tag inserts a bookmark event into the output audio stream. Use this event to signal the application when the audio corresponding to the text at the Bookmark tag has been reached. The Bookmark tag must be empty. The Bookmark tag has one attribute, Mark, whose value is a string. This value can then be used to differentiate between bookmark events (each of which will contain the string value from their corresponding tag).
The application will receive an event here,
and another one here
Voice context control tags
Two tags provide context to the current voice: PartOfSp and Context. Those tags enable the voice to determine how to deal with the text it is processing. With both of these tags, the extent to which voices use the context may vary.
PartOfSp
The PartOfSp tag provides the voice with the part of speech of the enclosed word(s). Use this tag to enable the voice to pronounce a word with multiple pronunciations correctly depending on its part of speech. The PartOfSp tag cannot be empty. The PartOfSp tag has one attribute, Part, which takes a string corresponding to a SAPI part of speech as its attribute. Only SAPI defined parts of speech are supported - "Unknown", "Noun", "Verb", "Modifier", "Function", "Interjection".
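Because only the listed parts of speech are supported and the tag cannot be empty, both constraints can be checked when the markup is built. This is a sketch in Python with a hypothetical helper name; it writes the Part value with the capitalization used in the list above, and whether a given voice is sensitive to case is not stated here.

```python
from xml.sax.saxutils import escape

SAPI_PARTS = ("Unknown", "Noun", "Verb", "Modifier", "Function", "Interjection")


def part_of_sp_tag(part: str, words: str) -> str:
    """Build a PartOfSp tag after validating the part of speech."""
    # Match case-insensitively, then use the spelling from the list.
    by_lower = {name.lower(): name for name in SAPI_PARTS}
    canonical = by_lower.get(part.lower())
    if canonical is None:
        raise ValueError(f"unsupported part of speech: {part}")
    if not words:
        raise ValueError("the PartOfSp tag cannot be empty")
    return f'<partofsp part="{canonical}">{escape(words)}</partofsp>'


tagged = part_of_sp_tag("verb", "record")
```

A word such as "record", which is pronounced differently as a noun and as a verb, is the kind of case the tag is meant for.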
Context
The Context tag provides the voice with information which the voice may then use to determine how to normalize special items, like dates, numbers, and currency. Use this tag to enable the voice to distinguish between confusable date formats (see the example, below). The Context tag cannot be empty. The Context tag has one attribute, Id, which takes a string corresponding to the context of the enclosed text. Several contexts are defined by SAPI and are more likely to be recognized by SAPI compliant voices, but any string may be used. See documentation for a particular voice for more details.
Voice Selection Tags
There are two tags which can be used (potentially) to change the current voice: Voice and Lang.
Voice
The Voice tag selects a voice based on its attributes, Age, Gender, Language, Name, Vendor, and VendorPreferred. The tag can be empty, in which case it changes the voice for all subsequent text, or it can have content, in which case it only changes the voice for that content. The Voice tag has two attributes: Required and Optional. These correspond exactly to the required and optional attributes parameters to ISpObjectTokenCategory_EnumerateTokens and SpFindBestToken functions. The selected voice follows exactly the same rules as the latter of these two functions. That is, all the required attributes are present, and more optional attributes are present than with the other installed voices (if several voices have equal numbers of optional attributes, one is selected at random). See Object Tokens and Registry Settings for more details. In addition, the attributes of the current voice are always added as optional attributes when the Voice tag is used. This means that a voice which is more similar to the current voice will be selected over one which is less similar. If no voice is found that matches all of the required attributes, no voice change will occur.
The default voice should speak this sentence.
A female non-child should speak this sentence, if one exists.
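A Voice tag with Required and Optional attributes can be assembled from lists of attribute conditions. This is a sketch in Python with a hypothetical helper name; it assumes that conditions are joined with semicolons and that `!=` expresses "is not", as in SAPI token attribute matching, and the sample sentence is the one shown above.

```python
from typing import Optional, Sequence
from xml.sax.saxutils import escape, quoteattr


def voice_tag(required: Sequence[str] = (), optional: Sequence[str] = (),
              content: Optional[str] = None) -> str:
    """Build a Voice tag from required and optional attribute conditions."""
    attributes = ""
    if required:
        attributes += " required=" + quoteattr(";".join(required))
    if optional:
        attributes += " optional=" + quoteattr(";".join(optional))
    if content is None:
        # Empty form: changes the voice for all subsequent text.
        return f"<voice{attributes}/>"
    return f"<voice{attributes}>{escape(content)}</voice>"


markup = voice_tag(
    required=["Gender=Female", "Age!=Child"],
    content="A female non-child should speak this sentence, if one exists.",
)
```

If no installed voice satisfies every required condition, the text above says that no voice change occurs, so the markup remains safe to send.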
Lang
The Lang tag selects a voice based solely on its Language attribute. The tag can be empty, in which case it changes the voice for all subsequent text; or it can have content, in which case it only changes the voice for that content. The Lang tag has one attribute, LangId. This attribute should be a LANGID, such as 409 (U.S. English) or 411 (Japanese). Note that these numbers are hexadecimal, but without the typical "0x". The Lang tag is a shortened version of the Voice tag with the Required attribute containing "Language=xxx". So the following examples should produce exactly the same results:
Custom Pronunciation
An alternative to using the <P> tag with the DISP and PRON attributes is to use custom pronunciation. Using custom pronunciation, tags in the form of the following.
can be written as
More specifically, if you want to recognize the word hello only when it is pronounced as ah and display greeting when recognized, you would normally use something like the following.
Using custom pronunciation, the above would translate to the following. |
| Wave Formats | Description |
(1) = "SAFT8kHz8BitMono" (2) = "SAFT8kHz8BitStereo" (3) = "SAFT8kHz16BitMono" (4) = "SAFT8kHz16BitStereo"
(5) = "SAFT11kHz8BitMono" (6) = "SAFT11kHz8BitStereo" (7) = "SAFT11kHz16BitMono" (8) = "SAFT11kHz16BitStereo"
(9) = "SAFT12kHz8BitMono" (10) = "SAFT12kHz8BitStereo" (11) = "SAFT12kHz16BitMono" (12) = "SAFT12kHz16BitStereo"
(13) = "SAFT16kHz8BitMono" (14) = "SAFT16kHz8BitStereo" (15) = "SAFT16kHz16BitMono" (16) = "SAFT16kHz16BitStereo"
(17) = "SAFT22kHz8BitMono" (18) = "SAFT22kHz8BitStereo" (19) = "SAFT22kHz16BitMono" (20) = "SAFT22kHz16BitStereo"
(21) = "SAFT24kHz8BitMono" (22) = "SAFT24kHz8BitStereo" (23) = "SAFT24kHz16BitMono" (24) = "SAFT24kHz16BitStereo"
(25) = "SAFT32kHz8BitMono" (26) = "SAFT32kHz8BitStereo" (27) = "SAFT32kHz16BitMono" (28) = "SAFT32kHz16BitStereo"
(29) = "SAFT44kHz8BitMono" (30) = "SAFT44kHz8BitStereo" (31) = "SAFT44kHz16BitMono" (32) = "SAFT44kHz16BitStereo"
(33) = "SAFT48kHz8BitMono" (34) = "SAFT48kHz8BitStereo" (35) = "SAFT48kHz16BitMono" (36) = "SAFT48kHz16BitStereo"
(37) = "Raw textual data" (38) = "Amiga 8svx files" (39) = "Apple/SGI AIFF files" (40) = "SUN .au style files" (41) = "PCM u - law" (42) = "PCM A - law"
(43) = "G7xx ADPCM files (read only)" (44) = "mutant DEC .au files" (45) = "NeXT .snd files" (46) = "AVR Files" (47) = "CVS and VMS files (continuous variable slope)" (48) = "GSM"
(49) = "Macintosh HCOM files" (50) = "Amiga MAUD files" (51) = "IRCAM SoundFile files" (52) = "NIST SPHERE files" (53) = "Turtle beach SampleVision files." (54) = "Soundtool (DOS) files"
(55) = "Yamaha TX-16W sampler files." (56) = "Sound Blaster .VOC files" (57) = "Psion (palmtop) A-law WVE files" (58) = "MS ADPCM, IMA ADPCM"
(59) = "SAFTADPCM_11kHzMono" (60) = "SAFTADPCM_11kHzStereo" (61) = "SAFTADPCM_22kHzMono" (62) = "SAFTADPCM_22kHzStereo"
(63) = "SAFTADPCM_44kHzMono" (64) = "SAFTADPCM_44kHzStereo" (65) = "SAFTADPCM_8kHzMono" (66) = "SAFTADPCM_8kHzStereo"
(67) = "SAFTCCITT_ALaw_11kHzMono" (68) = "SAFTCCITT_ALaw_11kHzStereo" (69) = "SAFTCCITT_ALaw_8kHzMono" (70) = "SAFTCCITT_ALaw_8kHzStereo"
(71) = "SAFTCCITT_ALaw_22kHzMono" (72) = "SAFTCCITT_ALaw_22kHzStereo" (73) = "SAFTCCITT_ALaw_44kHzMono" (74) = "SAFTCCITT_ALaw_44kHzStereo"
(75) = "SAFTCCITT_uLaw_11kHzMono" (76) = "SAFTCCITT_uLaw_11kHzStereo" (77) = "SAFTCCITT_uLaw_8kHzMono" (78) = "SAFTCCITT_uLaw_8kHzStereo"
(79) = "SAFTCCITT_uLaw_22kHzMono" (80) = "SAFTCCITT_uLaw_22kHzStereo" (81) = "SAFTCCITT_uLaw_44kHzMono" (82) = "SAFTCCITT_uLaw_44kHzStereo"
(83) = "SAFTGSM610_11kHzMono" (84) = "SAFTGSM610_22kHzMono" (85) = "SAFTGSM610_8kHzMono" (86) = "SAFTGSM610_44kHzMono" (87) = "SAFTTrueSpeech_8kHz1BitMono" |
The wave file output is available in the formats listed above. Some formats may require a retail DLL (not supported in the demo) or may still be under development. |
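The plain PCM format names in the list follow a regular pattern (sample rate, bit depth, channel layout), so they can be decoded before being passed as a WaveFormat or OutputFormat argument. This is a sketch in Python with a hypothetical function name; it only recognizes names of the form SAFT<rate>kHz<bits>Bit<Mono|Stereo> and returns None for everything else, and the rate is the nominal figure from the name rather than an exact frequency in Hz.

```python
import re
from typing import Dict, Optional

_SAFT_PCM = re.compile(r"^SAFT(\d+)kHz(\d+)Bit(Mono|Stereo)$")


def parse_pcm_format(name: str) -> Optional[Dict[str, int]]:
    """Decode a plain PCM SAFT format name, or return None if it is not one."""
    match = _SAFT_PCM.match(name)
    if match is None:
        return None
    khz, bits, layout = match.groups()
    return {
        "nominal_khz": int(khz),
        "bits": int(bits),
        "channels": 1 if layout == "Mono" else 2,
    }


default_format = parse_pcm_format("SAFT22kHz16BitMono")
```

Compressed entries such as "SAFTADPCM_11kHzMono" do not match the pattern and yield None, so a caller can tell them apart from plain PCM formats.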