TTS Builder Pro v5.0 Help



TTS Builder Pro FAQ Index

  1. Can I build a SAPI TTS voice (engine) for any language?
    Yes, you can create a TTS voice for any language.
    Non-English retail customers: please ask for a quotation that includes the purchase price of a custom diphone list.
     
  2. Which international languages are supported?
    "American English"
    "British English"
    "French"
    "Italian"
    "Dutch"
    "German"
    "Portuguese"
    "Spanish"
    "Indian - Hindi"

     
  3. Do I need a professional recording studio and equipment to create such a voice?
    Yes and no. Yes, if you are a voice artist or wish to create a professional release of the TTS voice.
    No, if you are a hobbyist developer working at home. Please note that we can provide voice artists for any given language.
     
  4. What does the voice artist have to record?
    The voice artist is provided a list of nonsense words derived from a combination of diphones.
     
  5. You claim a voice can be created in a week. Can I hear a sample wave file created with TTS Builder in a week?
    Sure, download it here: Leisure8.wav. The voice talent completed the nonsense-word recordings only a week before its release, so the sample shows the state and quality of diphone editing after just one week of effort.
     
  6. Can I create a SAPI-compatible voice for all operating systems? Can I have a demo?
    Yes, please download KevinAlpha5.zip.
     
  7. Is TTS Builder based on the open source Festival system?
    Yes, but we have made extensive modifications and built a GUI for the whole workflow. For a layperson it is no longer a technical puzzle: there is no need to be a DSP engineer or to study Festival, and even a voice talent can use it directly. The most difficult part, a GUI editor for editing the diphone boundaries and labels, is included. We have added three DLLs of our own so that Festival works within the system.
     
  8. If I want to create a voice of my own, approximately how many days will it take?
    Once you have purchased TTS Builder, the manual explains how many recordings are necessary for the voice. After the recordings are done, say in a day or two, editing them takes around 3 weeks. Thereafter you simply click compile to SAPI Installer and within a few seconds have the setup.exe for your voice ready for redistribution.
     
  9. Why doesn't the editing happen automatically?
    TTS science is not yet developed enough for this step to be fully automatic, hence the manual editing. Some doctoral students are working on it, though the technology is still primitive and under test. Note that the diphone boundary identification module is independent of Festival and is much more accurate thanks to a proprietary TTS Builder technique.
     
  10. What is the price and what kind of support is available?
    The price is $499 for the full version. Source code is available for $999. Support with the order is strictly by email, available 24/7; no telephone support is included. Telephone support costs $19K for one month. Please email here for volume purchase discounts or if you have any product questions.
    Order URL : www.research-lab.com/ttsbuilderorder.htm
     
  11. How do I download the full version?
    Download instructions for the full version are provided as soon as the order is placed.
     
  12. How do I download the demo version?
    Download www.research-lab.com/downloads/ttsbuilder01.zip
     
  13. Can I see my voice in the Control Panel drop-down list on XP and Vista?
    Yes, that is the job of the SAPI Installer.
     
  14. What is a phoneme?
    In human language, a phoneme (from the Greek: φώνημα, phōnēma, "a sound uttered") is the smallest linguistically distinctive unit of sound. Phonemes carry no semantic content themselves. In theoretical terms, phonemes are not the physical segments themselves, but cognitive abstractions or categorizations of them. A morpheme is the smallest structural unit with meaning.
     
  15. What is a diphone?
     

    In phonetics, a diphone is an adjacent pair of phones. The term is usually used to refer to a recording of the transition between two phones.

    In the following diagram, a stream of phones is represented by P1, P2, etc., and the corresponding diphones are represented by D1-2, D2-3, etc.:

    |P1===|P2===|P3===|P4===|P5===|P6===|
    ===|D1-2=|D2-3=|D3-4=|D4-5=|D5-6=|===
    
  16. Why use a diphone?
    Two phones cannot simply be joined for playback without audible noise and discontinuities at the boundary between them. Diphones solve this: we store the transitions from one phone to the next and then join those transitions to produce text-to-speech output.

  17. What do we get ready-made from the TTSBuilder team for building a new language?

  18. What are the steps to build a TTS voice once we have acquired the language files?
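
As a rough illustration of the diagram under question 15 and the answer to question 16, the following Python sketch turns a phone sequence into the names of the diphones that would have to be joined. The phone symbols, the `pau` silence marker and the function name are assumptions made for this example; they are not taken from TTS Builder.

```python
def phones_to_diphones(phones, silence="pau"):
    """Return the diphone names (e.g. 't-aa') that cover a phone sequence.

    The sequence is padded with a silence phone on both sides so that the
    first and last phones also get a transition (hypothetical convention).
    """
    padded = [silence] + list(phones) + [silence]
    # Each diphone spans one phone and the phone that follows it.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    print(phones_to_diphones(["t", "aa", "b", "aa"]))
```

A sequence of N phones always needs N + 1 diphones with this padding, which is why the diphone list for a language grows with the square of the phone inventory.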


Note: The following are the main steps of new voice creation which we have done for you.

  1. Complete Custom Language Linguistic Features Kit Development.
    This involves Lexicon, Syntax, phonetics and phonology.
    Details of work involved :

    Adding support for Textual Analysis
    1 Tokenization
    2 Text Normalization
    2.1 Homograph Disambiguation
    3 Text Modes
    4 Extensions

    Linguistic Processing:
    POS Tagging and The Lexicon
    1 POS Tagging
    2 The Lexicon
    2.1 The Compiled Lexicon
    2.2 The Addenda
    2.3 Pronunciation Lookup
    2.4 LTS Rules
    2.5 Post-Lexical Rules
    3 Extensions for whole word codec tweak

    Determination of Prosody
    1 Prosodic Phrasing
    1.1 CART Method
    1.2 Full Statistical Method
    2 Duration
    2.1 Klatt Durations
    2.2 CART Durations
    3 Intonation
    3.1 Default Intonation
    3.2 Simple Method
    3.3 Intonation Tree Method
     
  2. Create Diphone.scm or Diphlist.scm as per your language. This is the basic skeleton file. Apart from that we require the following files for installation:
    Scheme Files
    A.1 Diphone Database Construction
    A.1.1  custom phones.scm
    A.1.2 custom schema.scm
    A.1.3 custom voice artist diphone.scm
    A.2 Text Analysis
    A.2.1 custom token.scm
    A.3 Pronunciation Prediction
    A.3.1 custom lex.scm
    A.3.2 custom postlex.scm
    A.3.3 allowables.scm
    A.3.4 custom-voice artist.desc
    A.4 Prosody Determination
    A.4.1 custom-voice artist dur.scm
    A.4.2 custom durtreeZ.scm
     
  3. Create customdiph.list, which lists all the diphones.
  4. Set the environment variables for Festvox and Speech Tools.
  5. Collect existing reference recordings of synthesized text-to-speech prompts (a prompt-wav collection) for pronunciation comparison while creating labels, or use SAPI/HMM-based labels.
    Read detailed documentation here

Note: The following are the main steps you should carry out using TTS Builder.

  1. Place an order for TTSBuilder at www.research-lab.com/ttsbuilderorder.htm
    or use PayPal (tim@research-lab.com): $199.99 for an individual license and $999 for an institutional license.
    Select the diphone list for your language.
    Record the diphone combinations or words from a voice talent.
    See finalize diphone set and recordings
     
     
  2. Create voice recordings and copy all wave and scm files to the exact locations mentioned in the email
    How to? Screenshot
     
  3. Create Label Files for diphone boundaries using any of the options
    How to? Screenshot
     
  4. Create EST file for diphone boundaries
    How to? Screenshot
     
  5. Create PitchMark Files for diphones
    How to? Screenshot
     
  6. Create LPC Compression of diphones (this keeps your voice distributable, at a size of around 20 MB rather than 400 MB)
    How to? Screenshot
     
  7. Create a SAPI Installer
    How to? Screenshot
     
  8. Voice is ready. Test the voice.
    How to? Screenshot
     
  9. Hear glitches during the first test? Edit all labels and pitch marks using the TTSBuilder GUI rules
    How to? Screenshot
     
  10. Check the files responsible for poorly reproduced words and edit those files again.
    How to? Screenshot
     

Create voice recordings and copy all wave and scm files to the exact locations mentioned in the members email
How to? Click "open diphone list" and, following the list of phones, say each nonsense word, for example "taabaabaa", "taapaapaa", etc. You may use any recording software, such as WavePad or the one included.

 

Please record every nonsense word with a short pause before and after it. The useng_xxxx.wav file name pattern is generic and is not specific to US English.
Please start with useng_0001.wav and number the files sequentially (ending with, say, useng_9999.wav).


You may use the included recording tool or a similar one. Record in the following format only: 16 kHz, 16-bit, mono WAV. No other format, please. For best results you may use a Pearl CC 30 professional condenser microphone together with the Marenius PM L-22 microphone amplifier.
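
Before moving on to labeling, it can help to confirm that every recording really is 16 kHz, 16-bit mono. The following is a minimal sketch using Python's standard `wave` module; the function names and the idea of running such a check outside TTS Builder are assumptions for illustration, not part of the product.

```python
import wave

REQUIRED_RATE = 16000    # 16 kHz
REQUIRED_WIDTH = 2       # bytes per sample, i.e. 16-bit
REQUIRED_CHANNELS = 1    # mono


def prompt_filename(index, prefix="useng"):
    """Build a recording name such as 'useng_0001.wav' from its number."""
    return f"{prefix}_{index:04d}.wav"


def check_recording(path):
    """Return a list of format problems found in a WAV file (empty if fine)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != REQUIRED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != REQUIRED_WIDTH:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits")
        if wav.getnchannels() != REQUIRED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels instead of mono")
    return problems
```

Calling `check_recording(prompt_filename(1))` in the recordings folder would return an empty list for a correctly formatted first prompt.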


 


Create Label Files for diphone boundaries using any of the options


Once all recordings of nonsense words are complete, you may start labeling the diphone boundaries. This activity simply creates one *.lab file in the lab directory for each wave file that was recorded. Click "Festival Labeling" and enter in the text box the number of the file you wish to process, in four-digit format, such as 0001, 0999 or 2000. Then click "Individual Creation" to process that file. Use the Start button to automatically process all wave files in increasing order, starting from the number typed in the text box. You may also create the same files using other techniques, such as SAPI-based speech recognition or an HMM-based technique. This version includes a SAPI-based technique.
We have an IEEE patent pending on a new hybrid technique.

 


Create EST file for diphone boundaries


Once the labeling process is complete, build the diphone index needed to extract the diphones from the wav files. The index file identifies which diphone comes from which file and at what time offsets. Click the button labeled "Create EST file". Note that before you click this button, the number of diphone label files must exactly equal the number of recorded wave files. This is a very important file: it contains vital boundary information and must be recreated every time you edit or repair the boundaries or pitch marks.

Sample file information:

    EST_File index
    DataType ascii
    NumEntries 1899
    EST_Header_End
    thf-hh useng_1614 0.668373 0.734956 0.791381
    tf-hh useng_1613 0.47872 0.538168 0.574933
    thf-w useng_1612 0.474566 0.555184 0.616686
    tf-w useng_1611 0.600997 0.661375 0.735687
    thf-wh useng_1610 0.46577 0.533615 0.602223
    tf-wh useng_1609 0.462249 0.517092 0.585254
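
To make the layout of the sample index above concrete, here is a small Python sketch that reads such a file into records. It assumes that each entry line holds the diphone name, the recording it comes from, and three time offsets in seconds (read here as start, middle boundary and end); that reading of the three numbers, and all names in the code, are assumptions for illustration.

```python
from typing import NamedTuple


class DiphoneEntry(NamedTuple):
    name: str      # diphone name, e.g. "thf-hh"
    file_id: str   # recording it is cut from, e.g. "useng_1614"
    start: float   # assumed: start of the diphone, in seconds
    mid: float     # assumed: boundary between the two phones
    end: float     # assumed: end of the diphone


def parse_est_index(text):
    """Parse the text of a diphone index file into DiphoneEntry records."""
    entries = []
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        if not in_body:
            # Everything up to EST_Header_End is header (DataType, NumEntries...).
            in_body = line == "EST_Header_End"
            continue
        name, file_id, start, mid, end = line.split()
        entries.append(
            DiphoneEntry(name, file_id, float(start), float(mid), float(end))
        )
    return entries
```

A quick consistency check after editing labels is to compare `len(parse_est_index(text))` with the NumEntries value in the header.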


Create PitchMark Files for diphones

Click "Festival Pitchmarking" to open a dialog for automatically creating pitchmark files. Click "individual creation" to create a pitchmark file with the extension *.pm. The pitchmarking program extracts pitchmarks from the wave signal; the results may need to be hand-corrected. It filters the incoming waveform with low-pass and high-pass filters and finds pitchmark peaks by autocorrelation. Unvoiced sections are filled with default pitchmarks. A corresponding pitchmark file must be created in the "pm" folder for every recorded wave file.
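
The autocorrelation idea mentioned above can be sketched in a few lines of Python. This is only an illustration of how a pitch period can be estimated for one voiced frame; it omits the low-pass and high-pass filtering and the placement of individual pitchmarks, and the function name, the 60 to 400 Hz search range and the fixed-window formulation are assumptions, not the TTS Builder or Festival implementation.

```python
import math


def estimate_pitch_period(frame, sample_rate=16000, f0_min=60.0, f0_max=400.0):
    """Estimate the pitch period of a voiced frame, in samples.

    Picks the lag with the highest autocorrelation among the lags that
    correspond to fundamental frequencies between f0_min and f0_max.
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    window = len(frame) - max_lag  # same number of products for every lag
    if window <= 0:
        raise ValueError("frame is too short for the requested f0_min")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


if __name__ == "__main__":
    # Synthetic 100 Hz tone sampled at 16 kHz, standing in for a voiced frame.
    tone = [math.sin(2 * math.pi * 100 * n / 16000) for n in range(1066)]
    period = estimate_pitch_period(tone)
    print("period in samples:", period, "F0 in Hz:", 16000 / period)
```

Real speech is far less regular than a pure tone, which is one reason the automatically generated pitchmarks may still need hand correction.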

 


 


Create the LPC Compression of diphones (this keeps your voice distributable, at a size of around 20 MB rather than 400 MB)

Once the pitchmark files are created, the data is ready to be compressed for concatenation. Here the pitchmark information for every wave file is supplied to the Linear Predictive Coding (LPC) compression step. This program generates LPC coefficients and residuals in the "lpc" directory.
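
For readers curious about what "LPC coefficients and residuals" means, the following Python sketch computes both for a short signal using the autocorrelation method with the Levinson-Durbin recursion. It is a textbook-style simplification under stated assumptions: it works on one whole frame rather than pitch-synchronously, and the function names are hypothetical rather than those of the TTS Builder or Speech Tools programs.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0..max_lag] of a frame."""
    return [
        sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(signal, order):
    """Return [1, a1, ..., ap], the prediction-error filter A(z)."""
    r = autocorrelation(signal, order)
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # silent or perfectly predicted frame
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient for this order
        previous = a[:]
        for j in range(1, i):
            a[j] = previous[j] + k * previous[i - j]
        a[i] = k
        error *= 1.0 - k * k
    return a


def lpc_residual(signal, a):
    """Filter the signal through A(z): e[n] = sum_j a[j] * x[n - j]."""
    return [
        sum(a[j] * signal[n - j] for j in range(len(a)) if n - j >= 0)
        for n in range(len(signal))
    ]
```

The coefficients describe the smooth spectral envelope and the residual carries what the predictor could not explain; storing the two together is what allows the original waveform to be rebuilt during synthesis.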

 


Create SAPI Installer for testing

Your files are now ready for your voice to be tested for text to speech. After installation you will also be able to see your voice's name in the list in the Speech section of the Windows Control Panel.

 

 

 




Voice is ready. Test the voice.
 

Click this to open your voice in Windows Control Panel list.


Hear spikes/glitches during the first test? Edit all labels and pitch marks using the TTSBuilder GUI editing rules
 

Many label boundaries and pitch marks need hand correction. Use the TTSBuilder graphical interface to load the wave file number in question (plain numbers from 1 to 2500 are accepted). Then click the "load diphones" button.

We advise you to check every label and pitch file. Once you load a file, click the "load diphones" button; this will mark the label boundaries.

To edit the label boundaries, press P to edit the last diphone boundary, just before the pause at the end of the recording; M to edit the second-to-last diphone boundary; F to edit the third (middle) diphone boundary; Y to edit the second diphone boundary; and X to edit the first diphone boundary, where the recording starts after a short pause.

To edit the pitch marks, double-click between the diphones to open another interface. Pitch marks should sit on the peaks of the waveform: if you see a pitch mark elsewhere, move the cursor to the peak (do not click) and press Q.

 


Check the files responsible for poorly reproduced words and edit those files again.
How to? Screenshot

 

Once you are done editing the label and pitchmark files, simply recreate the EST file and the LPC files, and rebuild the SAPI voice installer. If you discover that a particular word is badly reproduced, use the section 7 program to find the files and phones responsible. You must edit all the files listed in the drop-down list below.


The Three Basic Text To Speech Creation Steps, which we customize for your language

1. Text Analysis (WE DO THIS FOR YOU)

The aim of text analysis is to identify the words in the text to be synthesized, i.e. units that can be given pronunciations by means of lexicon lookup or letter-to-sound rules. The first step in this process is tokenization: the identification of chunks of the text called tokens. An initial approximation of these units would be strings of characters separated by white space and punctuation. However, this is obviously an over-simplification, and such phenomena as abbreviations, blank lines and hyphens will cause complications. Once the text has been tokenized, text normalization must be applied. Unrestricted text typically contains many abbreviations, acronyms, non-alphabetic characters, punctuation and strings of digits. All of these items may be interpreted differently depending on their context. For instance, the token `2001' could be expanded into `two thousand and one', `the year two thousand and one', `two zero zero one', `two thousand and first' etc. The aim of text normalization is to establish the correct interpretation of the input text.
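
The two stages described above can be sketched very roughly in Python. The tokenizer below is the "initial approximation" mentioned in the text (white space and punctuation), and the normalizer only implements the digit-by-digit reading of a number; choosing between the other readings of `2001' requires context that this sketch does not model. All function names are hypothetical.

```python
import re

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return re.findall(r"[A-Za-z']+|[0-9]+|[^\sA-Za-z0-9]", text)


def read_digits(token):
    """Expand a digit string digit by digit: '2001' -> 'two zero zero one'."""
    return " ".join(DIGIT_NAMES[int(ch)] for ch in token)


def normalize(tokens):
    """Replace digit tokens with words; leave all other tokens unchanged."""
    return [
        read_digits(tok) if re.fullmatch(r"[0-9]+", tok) else tok
        for tok in tokens
    ]


if __name__ == "__main__":
    print(normalize(tokenize("Room 2001, please.")))
```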



2. Linguistic Analysis (WE DO THIS FOR YOU)
The next stage to be performed is linguistic analysis. This involves much work and includes processes such as syntactic parsing, assigning word pronunciation and identifying lexical stress.

Syntactic Parsing
Syntactic analysis plays an extremely important role in the text-to-speech process. Typically, textual documents contain many homographs, and these cannot be disambiguated unless some kind of syntactic part-of-speech tagging is carried out. Syntactic parsing can also prove invaluable to the process of text normalization, since the proper interpretation of the text can only be determined with extra syntactic information. Syntax further plays a crucial part in the determination of prosodic aspects such as rhythm, duration and intonation. Orthographic commas often indicate clausal boundaries - however, they usually do not contribute much to the prediction of phrasal boundaries. Often, TTS systems use predefined lists of function words (articles, prepositions, conjunctions) to locate phrase boundaries and content words (nouns, verbs, adjectives) to identify intonational accents. Increasingly, the systems of today are discarding this inefficient method and employing statistical Part-Of-Speech (POS) taggers to determine the most probable syntactic analysis.

Semantic Analysis
An area that has not received much attention in TTS systems is that of semantic analysis. Often semantic and pragmatic knowledge is crucial to the disambiguation of certain sentences. For example, Klatt (reference) uses the example of "She hit the old man with the umbrella". It is unclear here who is actually holding the umbrella. To indicate that the woman is, what Klatt calls a pseudopause (a slowing down of speaking rate and a fall-rise in pitch) may be used between the words man and with. One such TTS system, DECtalk, deals with the above problem by allowing the user to specify locations of missing pseudopauses.

Word Pronunciation and Lexical Stress
The next important step is assigning appropriate pronunciation and lexical stress to words. If this task is not performed correctly, the intelligibility and indeed acceptability of the resultant speech will be undermined. In certain languages, pronunciation can be straightforwardly predicted from orthography - one such language, as noted by Witten ([32], p.234), is Esperanto, in which each letter in its orthography has only one sound. However, as we all know, English is not so straightforward, as there exist many unpredictable pronunciations for words. When proper names are considered, the problem is rendered even larger. TTS systems have in the past dealt with this problem by supplementing any letter-to-sound (LTS) rules with an exception dictionary. This approach has actually inverted over the years as the problem of computer storage has decreased, and systems today usually opt for considerably larger pronunciation dictionaries supplemented with LTS rules to deal with anything not contained within this dictionary. Three large, widely-used, on-line pronunciation dictionaries, originally intended for Automatic Speech Recognition (ASR) but adapted for TTS synthesis, are PRONLEX, CMUdict and CELEX. The first two of these are both for American pronunciation and cover most, if not all, words of the Wall Street Journal and the Switchboard Corpus - roughly 100,000. CELEX was designed for British English and contains about 160,000 words taken from the Oxford Advanced Learner's Dictionary (OALD) and the Longman Dictionary of Contemporary English. The three dictionaries employ different phone-sets, all of which are represented as ASCII key-symbols (see section 3.2.3) and represent three levels of lexical stress (primary, secondary and none).

LTS rules typically consist of various phonological rules modeled as Finite State Transducers and are often augmented with probabilities to model pronunciation variation. For instance, the word `the' can be pronounced as /ðə/ before consonants and as /ði/ before vowels. Mappings are often made between orthography and pronunciation using decision trees trained from a labeled corpus. One such decision tree used by many TTS systems, including Festival, is the Classification and Regression Tree (CART). CART trees contain a binary question (yes/no answer) about some feature at each node in the tree. When predicting categorical values, probability distributions are provided in the form of a classification tree, while means and standard deviations are provided in the form of a regression tree when gaussian or continuous values (e.g. ranges over real numbers) are predicted. For predicting pronunciation, a CART tree is built that classifies a particular situation (current phone) described by a set of features (left and right contexts) and assigns it a probability.

Of course, in fluent speech, certain co-articulatory effects occur over word boundaries. The pronunciation obtained by means of lexicon lookup or LTS rules will constitute that of the word in isolation. If co-articulatory effects are ignored by a synthesis system, the result will be over-precise articulation. In order to deal with alteration of segments, post-lexical rules are employed. For instance, in British English, post-vocalic `r' is deleted unless followed by a vowel, and in Hiberno-English, `t' is lenited to a fricative when it occurs in inter-vocalic position and certain other contexts.
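
The lookup order described above (lexicon first, LTS rules as a fallback, post-lexical rules last) can be illustrated with a toy Python sketch. The three-word lexicon, the ASCII phone symbols, the one-letter-one-phone fallback and the single post-lexical rule for `the' are all made up for this example; a real voice uses the compiled lexicon, addenda and trained LTS rules listed earlier.

```python
# Toy data: phone symbols and entries are illustrative assumptions only.
LEXICON = {
    "the": ["dh", "ax"],
    "old": ["ow", "l", "d"],
    "man": ["m", "ae", "n"],
}
ADDENDA = {}  # user-added entries, consulted before the main lexicon
NAIVE_LTS = {"a": "ae", "e": "eh", "i": "ih", "o": "ow", "u": "uw"}
VOWELS = {"ae", "ax", "eh", "ih", "iy", "ow", "uw"}


def letter_to_sound(word):
    """Very naive fallback: one phone per letter."""
    return [NAIVE_LTS.get(ch, ch) for ch in word]


def lookup(word):
    """Pronunciation of a word in isolation: addenda, lexicon, then LTS."""
    word = word.lower()
    for table in (ADDENDA, LEXICON):
        if word in table:
            return list(table[word])
    return letter_to_sound(word)


def pronounce(words):
    """Look up each word, then apply one post-lexical rule across words."""
    prons = [lookup(w) for w in words]
    for i in range(len(words) - 1):
        following = prons[i + 1]
        # Post-lexical rule: 'the' takes a full vowel before a vowel.
        if words[i].lower() == "the" and following and following[0] in VOWELS:
            prons[i] = ["dh", "iy"]
    return prons
```

For example, `pronounce(["the", "old", "man"])` changes the pronunciation of `the' because `old' starts with a vowel, while `pronounce(["the", "man"])` leaves it as looked up.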

The importance of lexical stress cannot be ignored - a difference in stress can affect the meaning of a word. For instance, take the word object: if the first syllable is stressed, as in OBject, the word is interpreted as a noun; if, on the other hand, the second syllable is stressed, as in obJECT, it is a verb. There exist many algorithms for predicting lexical stress and they are usually performed during a process known as syllabification. This process is quite a difficult one, as there is no agreed definition of where syllable boundaries should occur and how many syllables should be contained in a word. As the reference describes, words such as communism and mysticism can have either three syllables or four, depending on whether the final `m' is considered to be syllabic. One way of identifying syllables is to appeal to relative phone sonority - i.e. a sound's loudness relative to that of other sounds with the same length, stress and pitch. A possible theory of the syllable is that peaks of sonority coincide with peaks of syllabicity.

Morphological analysis often reduces the problem of syllabification and assigning lexical stress. Correct handling of the phenomenon frequently depends on morphemic structure. For instance, the reference cites the example hothead, in which the letters `th' should not be considered as a single phoneme, a fact that is only recognizable after morphemic decomposition. TTS systems need morpheme dictionaries to do this - Festival currently does not employ one.

Determination of Prosody
The task of determining prosody also falls under the category of linguistic analysis; however, I have decided that this important process merits a section of its own. There has been a great deal of research carried out on prosodic synthesis in order to make computer speech sound more natural and fluent. Prosody is important in that it conveys both linguistic and extra-linguistic information about a speaker's attitude, intentions and physical or emotional state. Without prosody, synthesized voices sound extremely wooden. Prosody is often termed suprasegmental. However, the distinction between prosodic and segmental effects is quite unclear. Just as there are close correlations between the prosodic element of timing and the segmental element of duration, intensity or prominence is strongly related to stress and accent. Likewise, the prosodic element of intonation can be traced down to the segmental level. When determining prosody in text-to-speech synthesis, three principal areas can be addressed: prominence, structure and tune. Each of these is highly important in that it affects the sensations of intensity, length and pitch - these closely interlinked aspects of prosody can be implemented by appealing to the human perception of speech.

Prominence
"The problem [. . . ] is that one cannot state a procedure for combining sonority, length, stress and pitch so as to form prominence. There is no way in which one can measure the prominence of a sound." As the reference quoted above describes, prominence is extremely difficult to define. It is clear, however, that it is closely related and perhaps secondary to the perception of stress, accent, duration and pitch. Prominence increases due to many factors - for instance, the intensity of a sound grows with vowel height, laryngeal state and F0. Intensity is also typically weaker near the end of an utterance, but as the reference points out, this also appears to be a secondary effect of changing voice source characteristics. TTS systems thus do not tend to model prominence as such, as it will be acoustically manifested in the next two aspects of prosody to be discussed.

Structure and Rhythm
When humans speak, they tend to group words together with noticeable breaks or disjunctions between them. Generally there is a limit to the length of these groups, often called intonational units, tone groups or prosodic phrases, just as there is a limit to the size of our lungs. TTS systems must carry out the task of identifying these prosodic phrases if they are to have any chance of producing natural-sounding speech. A crude approach would be to treat commas and other punctuation characters as boundaries - however, this treatment is not sufficient. Normally there are smaller internal phrases - for example, the prosodic phrase "I heard it through the grapevine" will have internal boundaries: I heard it | through | the grapevine. TTS systems estimate these smaller units by appealing to syntax - parts of speech and syntactic phrases will prove invaluable to the task in hand.

On a segmental level, durations play a key part in assigning structure and rhythm to an utterance. Typically, segments and indeed syllables will be longer just before a boundary, and pauses will be larger between whole prosodic phrases. TTS implementations modify duration by employing duration rules. On a first pass, these rules will determine the inherent durations of various phonemes - in English, for instance, diphthongs and vowels are usually longer than consonants, and certain consonants will be shorter than others. The rules may then go on to account for contextual effects on segment durations, by appealing to syntactic structure, phonological markers (stress and accents, for example) and phonetic context. Beyond these first approaches, non-linguistic factors such as speaking rate can be considered. Duration rules vary in complexity across TTS systems. Some merely look at inherent durations of segments, while others appeal to full statistical models based on POS and phrase break context. Festival offers various levels of complexity, from the most basic level, through hand-written rules, up to the employment of CART trees trained from data.
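
As a rough sketch of the simplest kind of duration rules described above, the following Python function starts from inherent durations, lengthens the segment just before a phrase boundary and then applies a speaking-rate factor. The millisecond values, the lengthening factor and the names are illustrative assumptions, not Klatt's published values or Festival's trained CART durations.

```python
# Illustrative inherent durations in milliseconds (made-up values).
INHERENT_MS = {"aa": 120.0, "ay": 150.0, "t": 60.0, "n": 64.0}
DEFAULT_MS = 80.0


def segment_durations(phones, final_lengthening=1.5, rate=1.0):
    """Return one duration in ms per phone of a prosodic phrase.

    final_lengthening stretches the last segment before the boundary;
    rate > 1.0 means faster speech and therefore shorter segments.
    """
    durations = [INHERENT_MS.get(p, DEFAULT_MS) for p in phones]
    if durations:
        durations[-1] *= final_lengthening
    return [d / rate for d in durations]


if __name__ == "__main__":
    print(segment_durations(["t", "aa", "n", "ay"], rate=1.25))
```

A CART-based model replaces the fixed table and factors with questions about the segment's context, but the output has the same shape: one predicted duration per segment.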

Tune
One of the characteristics of human speech is the continuous variation of what listeners perceive as pitch. The intonation or tune of a sentence or phrase is the pattern of changes in pitch that occurs over time, and it conveys much information about syntactic and discourse structure as well as extra-linguistic elements such as the speaker's attitude or emotion. Intonation is highly useful for emphasis and is used every day to distinguish questions from statements. The aim of the tune component in a TTS system is to determine the appropriate intonation contour for each spoken phrase. In the past, TTS systems have opted for extremely simple intonation models, if any at all. For instance, one such crude approach would be to use a monotone with falling pitch at the end of each sentence, or to have rising and falling pitch per word. However, as we all know, this is a huge simplification of what actually goes on - luckily, today's systems are much more sophisticated and the results often sound quite natural. Usually, intonation is determined separately per tone group and then further modifications are carried out depending on semantics, discourse structure and other aspects that may alter intonation.

There are numerous types of intonation models used by TTS systems and they all vary in both naturalness and ease of application. Generally, these models must all first divide an utterance into tone groups, a task already carried out when assigning structure. Next, the tonic syllable or principal pitch accent of each one is chosen. Obviously, lexical stress and accents play a key role in this step. Finally, a pitch contour is assigned to each tone group based on these target points. The way in which the pitch contour is assigned can vary - the reference gives a description of past classifications identifying five different primary intonation contours. For example, one of these would be a continuous rise in F0 corresponding to a question. The overall pitch movement per tone group is controlled by specifying the pitch at three locations: the start of the tone group, the beginning of the tonic syllable and the end of the tone group. The pitch can then be interpolated in different ways over each half of the tone group.

Two popular models of pitch accent classification are the ToBI (Tones and Break Indices) and the Tilt theories. The first model distinguishes five pitch accents for English, which are labeled by combining two simple tones (high (H) and low (L)) in different ways. Other symbols are also used to denote which tone falls on the stressed syllable (*), to mark boundary tones (%) and to mark phrase accents (-). The reference gives three example accent labels for the tone group "oh, really": L+H*, L*+H or L*. The ToBI system also permits the use of a Break Index, which indicates the strength of the boundary between words by means of a number - for instance, 0 could indicate absence of a break while 4 could appear between large tone groups such as whole sentences. Rather than employing discrete parameters, the Tilt model uses continuous parameters such as duration and amplitude to characterize accents. The reference gives a brief but informative description of the Tilt theory along with other similar ones such as the Fujisaki model. Currently, the most effective solutions for determining pitch contours use CART trees to predict accents and tone boundaries. These trees can be hand-written but can also be automatically trained from data. As regards assigning the F0 contour itself, statistically trained methods are usually employed.
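
The three-target description of pitch movement above (start of the tone group, tonic syllable, end of the tone group) can be sketched with simple linear interpolation in Python. Linear interpolation is only one of the "different ways" the text mentions, and the frame-based representation, function name and example F0 values are assumptions for illustration.

```python
def tone_group_contour(f0_start, f0_tonic, f0_end, tonic_index, length):
    """Linear F0 contour (Hz per frame) through three pitch targets.

    tonic_index is the frame where the tonic syllable begins and length
    is the number of frames in the tone group; the tonic must lie
    strictly inside the group so that both halves have a non-zero span.
    """
    if not 0 < tonic_index < length - 1:
        raise ValueError("tonic_index must lie strictly inside the tone group")
    contour = []
    for i in range(length):
        if i <= tonic_index:
            # First half: from the start target to the tonic target.
            t = i / tonic_index
            contour.append(f0_start + t * (f0_tonic - f0_start))
        else:
            # Second half: from the tonic target to the end target.
            t = (i - tonic_index) / (length - 1 - tonic_index)
            contour.append(f0_tonic + t * (f0_end - f0_tonic))
    return contour


if __name__ == "__main__":
    # A rise to the tonic syllable followed by a fall, as in a statement.
    print(tone_group_contour(120.0, 180.0, 100.0, tonic_index=2, length=5))
```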



3. Waveform Generation (YOU DO THIS YOURSELF USING MENU IN TTS BUILDER)

The final process in TTS synthesis involves the production of an acoustic signal. The process is conventionally divided into three approaches: formant synthesis by rule, articulatory synthesis by rule and concatenation based on recorded speech. Each of these approaches has its advantages and disadvantages, and the degree of naturalness varies from one to another. As O'Shaughnessy (reference) describes, current speech synthesizers must deal with the conflicting demands of producing speech of maximum quality, and employing minimum memory space, algorithmic complexity, and computation time.

Formant synthesis by rule
The idea of both formant and articulatory synthesis by rule stems from the source-filter theory, which sees speech as the result of the excitation of a linear filter by one or more sound sources. Formant synthesis by rule systems attempt to model the vocal tract transfer function by means of rules that vary the parameters of formant resonators (formant frequencies and bandwidths, F0, source spectrum parameters, etc.). These rules are usually simplified approximations to natural speech. The concept of formant synthesis by rule was first introduced in the 1960s and has evolved considerably since then. Systems based on the concept employ a resonance synthesizer, which may be analogue or digital and may be connected in series or in parallel. Rules are knowledge-based and vary in complexity across systems. The approach has many advantages, two of which are the ability to generate smooth transitions between segments and the relatively low storage requirements. Its main disadvantage is that the synthesized speech is quite unnatural, because it is difficult to specify sufficiently detailed and accurate rules. Examples of experimental TTS systems include MITalk and Klattalk, while commercial systems include DECTalk and InfoVox.

Articulatory synthesis by rule
The aim of articulatory synthesis is to simulate the resonance effects of the vocal tract by dividing it into a series of sections of varying cross-sectional area, each represented by elements that create the appropriate impedance corresponding to physical reality. Several research groups have attempted to construct simplified models of the articulators and the shape of the vocal tract, with varying degrees of success. However, both the processing cost and the lack of data from which to derive rules for specifying vocal tract shapes (speech postures) greatly hinder the success of TTS systems employing the procedure.
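Returning to formant synthesis, the source-filter idea can be illustrated with a minimal sketch: an impulse train (the source) is passed through a cascade of second-order digital resonators (the filter), one per formant. The resonator recurrence below is the standard two-pole form used in Klatt-style synthesizers. The function names, the formant frequencies and the bandwidths are illustrative assumptions, not values taken from TTS Builder or Festival.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of a two-pole digital resonator with unity gain at DC."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(samples, freq_hz, bw_hz, sample_rate):
    """Apply y[n] = a*x[n] + b*y[n-1] + c*y[n-2] to a list of samples."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0_hz=100.0, sample_rate=16000, duration_s=0.5):
    """Excite a cascade of formant resonators with an impulse train at F0."""
    n = int(sample_rate * duration_s)
    period = int(sample_rate / f0_hz)
    # Source: one impulse per pitch period (a crude stand-in for glottal pulses).
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Filter: resonators in series, one per (frequency, bandwidth) pair.
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal

# Illustrative (assumed) formant values for an open vowel, in Hz.
vowel = synthesize_vowel([(730.0, 60.0), (1090.0, 90.0), (2440.0, 150.0)])
```

In a rule-based system the formant frequencies and bandwidths would not be fixed as here; the rules would vary them over time to produce transitions between segments.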


Waveform Concatenation
Continuous speech is notoriously difficult, if not impossible, to model using a set of rules and formulas. This has prompted many to investigate the possibility of producing computer speech by concatenating units extracted from an inventory of digitally recorded speech. Dramatic falls in memory costs and gains in computational power have helped bring this idea to fruition. The question of what the units should actually be is an important one. While large units such as sentences and phrases may seem appealing, they have proved impractical for unrestricted text due to the huge storage overhead. They have, however, proved useful in voice-response systems, in which speech is reproduced directly from previously coded speech. Concatenation of words and morphemes is likewise inherently limited by storage requirements. The syllable seems highly appealing, but there are over 10,000 syllables in the English language. Apart from storage needs, the phenomenon of coarticulation thwarts attempts to concatenate syllable-sized units: a great deal of acoustic action occurs at syllable boundaries (stops are exploded, the sound source changes between voicing and frication, etc.). As Klatt points out, all efforts to string together phoneme-sized chunks of speech have failed because of the well-known coarticulatory effects between adjacent phonemes. While the choice of phonemes ensures a reduced inventory (about 43 in English), complex interpolation algorithms have to deal with coarticulation, and these are usually highly inadequate. As O'Shaughnessy ([29], Chp.2, p.80) describes, not enough is yet known or understood about coarticulation to establish a proper set of rules describing how the spectral parameters of each phone are modified by its neighbors.
The smallest coarticulatory effects are generally found at the acoustic center of a phoneme, a fact which has led to the discovery of the diphone. As I have mentioned above, conjoined phonemes will not sound natural, because speech postures do not alter so abruptly for each phoneme: there is gradual motion from one articulation to the next. Diphones, which are the acoustic units running from the center of one phoneme to the center of the next, avoid this problem since they already contain these problematic transitions.
Concatenating diphones in the proper sequence results in speech of much higher quality, since the sounds merged at the boundaries are spectrally similar. Although some further smoothing is required, the algorithms needed to do this are not complicated.
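To make the idea of "diphones in the proper sequence" concrete, here is a minimal sketch that turns a phoneme string into the diphone names a concatenative synthesizer would look up. The phone symbols, the "pau" silence marker and the "left-right" naming scheme are assumptions chosen for illustration; the names actually used by a given voice depend on its phone set and diphone list.

```python
def to_diphones(phones, boundary="pau"):
    """Return the diphone names covering a phoneme sequence.

    Silence is added at both ends so that the first and last phonemes
    also get a transition, giving len(phones) + 1 diphones in total.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

# Hypothetical phone symbols for a short word.
print(to_diphones(["h", "e", "l", "ou"]))
# prints ['pau-h', 'h-e', 'e-l', 'l-ou', 'ou-pau']
```

Each name spans the transition between two phonemes, so every join between consecutive units falls in the middle of a phoneme, where coarticulatory effects are smallest.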
Minimal storage requirements (only about 1600 diphones in English) have ensured that the diphone has become the most successfully employed unit so far in TTS systems. There are, however, potential disadvantages. As Klatt ([24], p.758) describes, discontinuities may occur when there is a mismatch of vowel quality between the diphones to be concatenated. Also, during recording, the speaker must maintain a constant voice quality so that there are no sudden changes in the source spectrum in the center of syllables. Despite these drawbacks, many diphone synthesizers have been developed, including one at AT&T Bell Laboratories and the Festival Speech Synthesis System developed at Edinburgh, which will be used in this project.

A related alternative to the diphone is the demi-syllable. As opposed to diphones, demi-syllables can contain complicated consonant clusters which are not easy to reproduce by diphone concatenation. However, they have difficulty handling between-syllable coarticulatory effects, as I described earlier. Triphones have also been proposed as possible units, and even units spanning four segments, but once again memory becomes the problem. Probably the most suitable way of concatenating speech is a hybrid diphone approach, where variable-phone units are chosen depending on context. An example of a system making use of this concept is one developed at AT&T Bell Laboratories by J. P. Olive (1990).

Whatever units are chosen, their representation must permit stretching and compression to achieve the target durations determined by duration rules, as well as smoothing of transitions at concatenation points. As we know, an F0 contour must also be imposed over the concatenated units. The arrival of linear prediction (LP) speech analysis/resynthesis opened up the possibility of performing these tasks. A comprehensive review of spectral smoothing techniques is given in Chappell and Hansen (1998). These include optimal coupling, where the boundaries of each diphone are shifted to provide the best fit with adjacent segments, waveform interpolation, and LP techniques involving interpolation of spectral parameters in any of several domains and waveform interpolation of the residual. Since LP separates the pitch of a signal from its spectral envelope, prosody manipulation can be performed easily. If the concatenated waveform is represented by a sequence of linear-prediction coefficients, an [...] and expansion of frames.

A similar method known as PSOLA (Pitch-Synchronous Overlap and Add) is pitch-synchronous in that each frame is centered on a pitch mark in the speech (extending one pitch period either side). A new waveform is created simply by overlapping and adding the frames. Pitch is increased by moving the pitch marks closer together (decreasing the distance between frames), and duration is increased by duplicating frames.

There are many tradeoffs when it comes to LPC synthesis: while LPC is fully automatic and simpler than formant synthesis, the synthesized speech sounds somewhat less natural. Also, standard LPC methods do not account for anti-resonances, even though these occur in the vocal tract. As a result, the output spectrum for sounds such as nasals is less accurate. Recent LPC methods such as multi-pulse residual excitation (see ref. [29], p.116) do, however, improve the spectral estimation of sounds such as voiced fricatives. Current LPC methods are highly successful, and it is difficult to tell whether formant-based synthesizers actually produce more natural output. "The choice between rule systems for formant synthesizers and concatenation strategies may ultimately depend on limits to the flexibility and naturalness of concatenation schemes involving encoded natural speech, but the best current LPC-based systems are quite competitive with the best formant-based rule programs."
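The overlap-and-add procedure described above for PSOLA can be sketched in a few lines. This is a toy version under strong assumptions: the input is fully voiced with a constant pitch period, the pitch marks are already known, and the function name and parameters (psola, pitch_factor, duration_factor) are hypothetical. The overlapping windows are normalized so that the output level stays roughly constant; practical implementations handle varying pitch, unvoiced regions and pitch-mark estimation.

```python
import math

def psola(signal, marks, pitch_factor=1.0, duration_factor=1.0):
    """Toy PSOLA: re-space windowed pitch-synchronous frames and overlap-add."""
    period = marks[1] - marks[0]                 # input pitch period (assumed constant)
    new_period = max(1, round(period / pitch_factor))
    out_len = int(len(signal) * duration_factor)
    out = [0.0] * out_len
    norm = [0.0] * out_len
    # Hann window spanning one pitch period either side of a pitch mark.
    window = [0.5 - 0.5 * math.cos(math.pi * i / period)
              for i in range(2 * period + 1)]
    t = new_period                               # position of the next output mark
    while t < out_len:
        # Pick the input frame nearest to the corresponding input time;
        # with duration_factor > 1 the same frame is reused (duplicated).
        src_time = t / duration_factor
        m = min(marks, key=lambda x: abs(x - src_time))
        for i, w in enumerate(window):
            src = m - period + i
            dst = t - period + i
            if 0 <= src < len(signal) and 0 <= dst < out_len:
                out[dst] += w * signal[src]
                norm[dst] += w
        t += new_period                          # closer marks give a higher pitch
    return [o / n if n > 1e-9 else 0.0 for o, n in zip(out, norm)]

# Synthetic "voiced" input: a sine wave with a 50-sample period.
signal = [math.sin(2 * math.pi * i / 50) for i in range(500)]
marks = list(range(50, 500, 50))
higher = psola(signal, marks, pitch_factor=1.25)     # marks 40 samples apart
longer = psola(signal, marks, duration_factor=2.0)   # twice as many samples
```

With both factors left at 1.0 the frames are added back where they came from, so the interior of the signal is reproduced unchanged, which is a convenient sanity check for the windowing.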


 

 Finalize a diphone set

Generally, the number of diphones in a language is roughly the square of the number of phones. We do not use diphone-genlist from Festival; instead, we provide a ready-made list of diphones for your language. The following folders must have been created in the path of the TTS Builder application. The output file usengdiph.list contains the list of diphones that must be recorded by the voice artist in a recording studio. This file must be available in the "etc" folder.
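The "square of the number of phones" estimate can be checked with a short sketch that enumerates every ordered pair of phones. The phone names below are placeholders and the function name is hypothetical; this is not how usengdiph.list is produced, only an illustration of where the upper bound comes from.

```python
from itertools import product

def candidate_diphones(phones):
    """Every ordered phone pair: an upper bound on the diphone inventory."""
    return [f"{left}-{right}" for left, right in product(phones, repeat=2)]

# Placeholder names for an inventory of 43 phonemes.
phones = [f"p{i}" for i in range(43)]
print(len(candidate_diphones(phones)))
# prints 1849
```

Not every pair occurs in a language, which is consistent with a practical English inventory of about 1600 diphones rather than the full 43 x 43 = 1849.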

 


 

 © www.research-lab.com