XML Schema: SAPI

This schema describes the SAPI 5.0 TTS XML grammar format. The SAPI TTS XML schema is included in the TTS XML parser. Hence, it is not necessary to include the schema in the XML file when authoring a grammar. NOTE: This schema is based on the Microsoft schema language and is not fully W3C compliant. This schema will be rewritten and will be compliant with the W3C standard once it has been approved by the W3C.

This schema describes the following elements and attributes:

Elements:
<BOOKMARK>
<CONTEXT>
<EMPH>
<LANG>
<PARTOFSP>
<PITCH>
<PRON>
<RATE>
<SAPI> (document element)
<SILENCE>
<SPELL>
<VOICE>
<VOLUME>

Attributes:
ABSMIDDLE
ABSSPEED
ID
LANGID
LEVEL
MARK
MIDDLE
MSEC
OPTIONAL
PART
REQUIRED
SPEED
SYM

Element-specific Attributes:
(none)

SAPI Elements

<BOOKMARK>
Inserts a bookmark into the input stream using the bookmark element. If an application specifies interest in bookmark events, it will receive an event when synthesis has passed this element in an input stream. If the audio output destination supports handling of events, then an application will receive this event once the synthesized speech up to this bookmark has been output. Otherwise, an application receives a bookmark event when the voice implementation has synthesized speech up to this bookmark.
syntax: <BOOKMARK
  MARK = int
/>
content: empty
order: many (default)
parents: SAPI
children: (none)
attributes: MARK
model: closed
source:
<ElementType name="BOOKMARK" content="empty" model="closed">
	<description>Inserts a bookmark into the input stream using the bookmark element. If an application specifies interest in bookmark events, it will receive an event when synthesis has passed this element in an input stream. If the audio output destination supports handling of events, then an application will receive this event once the synthesized speech up to this bookmark has been output. Otherwise, an application receives a bookmark event when the voice implementation has synthesized speech up to this bookmark. </description>
	<attribute type="MARK"/>
</ElementType>
<CONTEXT>
The context can specify the type of normalization rules which should be applied to the scoped text. SAPI does not guarantee any predefined contexts.
syntax: <CONTEXT
  ID = string
>
  mixed content
</CONTEXT>
content: mixed
order: many (default)
parents: SAPI
children: (none)
attributes: ID
model: closed
source:
<ElementType name="CONTEXT" content="mixed" model="closed">
	<description>The context can specify the type of normalization rules which should be applied to the scoped text. SAPI does not guarantee any predefined contexts. </description>
	<attribute type="ID"/>
</ElementType>
<EMPH>
Places emphasis on the words contained by this element.
syntax: <EMPH />
content: empty
order: many (default)
parents: SAPI
children: (none)
attributes: (none)
model: closed
source:
<ElementType name="EMPH" content="empty" model="closed">
	<description>Places emphasis on the words contained by this element. </description>
</ElementType>
<LANG>
Changes the LANGID of the scoped text. When the LANGID is changed, SAPI will try to detect if the current voice can handle the new language. If the voice does not speak the specified language, then an engine must choose another language it speaks as a best attempt. Using the VOICE tag and REQUIRED attribute, this fallback path can be prevented if it is not desired.
syntax: <LANG
  LANGID = int
>
  mixed content
</LANG>
content: mixed
order: many (default)
parents: SAPI
children: (none)
attributes: LANGID
model: closed
source:
<ElementType name="LANG" content="mixed" model="closed">
	<description>Changes the LANGID of the scoped text. When the LANGID is changed, SAPI will try to detect if the current voice can handle the new language. If the voice does not speak the specified language, then an engine must choose another language it speaks as a best attempt. Using the VOICE tag and REQUIRED attribute, this fallback path can be prevented if it is not desired. 
</description>
	<attribute type="LANGID"/>
</ElementType>
<PARTOFSP>
The part of speech of contained word(s). The PARTOFSP tag is used to force a particular pronunciation of a word (for example, the word record as a noun versus the word record as a verb).
syntax: <PARTOFSP
  PART = enumeration: noun|verb|modifier|function|interjection|unknown
>
  mixed content
</PARTOFSP>
content: mixed
order: many (default)
parents: SAPI
children: (none)
attributes: PART
model: closed
source:
<ElementType name="PARTOFSP" content="mixed" model="closed">
	<description>The part of speech of contained word(s). The PARTOFSP tag is used to force a particular pronunciation of a word (for example, the word record as a noun versus the word record as a verb). </description>
	<attribute type="PART"/>
</ElementType>
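
The PART enumeration can be checked before markup is generated. The following is a minimal sketch, not part of SAPI: the helper name `part_of_speech` is hypothetical, and the code only builds the tag text with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# The six values listed in the PART enumeration of this schema.
VALID_PARTS = {"noun", "verb", "modifier", "function", "interjection", "unknown"}

def part_of_speech(word, part):
    """Wrap `word` in a PARTOFSP element (hypothetical helper).

    Raises ValueError for a part of speech outside the enumeration.
    """
    if part not in VALID_PARTS:
        raise ValueError(f"invalid PART: {part!r}")
    element = ET.Element("PARTOFSP", PART=part)
    element.text = word
    return ET.tostring(element, encoding="unicode")

noun_markup = part_of_speech("record", "noun")
verb_markup = part_of_speech("record", "verb")
```

Here `noun_markup` and `verb_markup` differ only in the PART value, which is what lets an engine choose between the two pronunciations of "record".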
<PITCH>
The scoped/global element PITCH modifies the underlying numerical values of a speech block. Relative attribute values, those preceded by a minus sign (-) or a plus sign (+), increment the underlying numerical value by the specified amount. SAPI-compliant engines have the option of supporting only the guaranteed range of values, behaving as -10 for adjustments below -10 and as +10 for adjustments above +10.
syntax: <PITCH
 [ ABSMIDDLE = int ]
  MIDDLE = int
>
  mixed content
</PITCH>
content: mixed
order: many (default)
parents: SAPI
children: (none)
attributes: ABSMIDDLE, MIDDLE
model: closed
source:
<ElementType name="PITCH" content="mixed" model="closed">
	<description>The scoped/global element PITCH modifies the underlying numerical values of a speech block. Relative attribute values, those preceded by a minus sign (-) or a plus sign (+), increment the underlying numerical value by the specified amount. SAPI-compliant engines have the option of supporting only the guaranteed range of values, behaving as -10 for adjustments below -10 and as +10 for adjustments above +10.</description>
	<attribute type="MIDDLE"/>
	<attribute type="ABSMIDDLE"/>
</ElementType>
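
How a relative (MIDDLE) or absolute (ABSMIDDLE) value might resolve to the pitch level in effect inside a PITCH element can be sketched as follows. This is an illustration under stated assumptions, not SAPI code: the function name `resolve_pitch` is hypothetical, the resulting level (rather than each individual adjustment) is clipped, and when both attributes are present ABSMIDDLE is applied before MIDDLE.

```python
def resolve_pitch(current, middle=None, absmiddle=None, clip=True):
    """Return the pitch level in effect inside a PITCH element.

    current   -- pitch level in effect outside the element
    middle    -- relative adjustment (MIDDLE), added to the level
    absmiddle -- absolute level (ABSMIDDLE), replaces `current`
    clip      -- emulate an engine that supports only -10..+10
    """
    level = current
    if absmiddle is not None:
        level = absmiddle          # assumption: absolute value applied first
    if middle is not None:
        level += middle            # relative value increments the level
    if clip:
        level = max(-10, min(10, level))  # assumption: clip the result
    return level
```

For example, `resolve_pitch(8, middle=5)` yields 10 for an engine that clips, while `resolve_pitch(8, middle=5, clip=False)` yields 13 for one that accepts values outside the guaranteed range.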
<PRON>
Pronounces the contained text (possibly empty) according to the provided Unicode string.
syntax: <PRON
  SYM = char
>
  mixed content
</PRON>
content: mixed
order: many (default)
parents: SAPI
children: (none)
attributes: SYM
model: open
source:
<ElementType name="PRON" content="mixed" model="open">
	<description>Pronounces the contained text (possibly empty) according to the provided Unicode string. 
	</description>
	<attribute type="SYM"/>
</ElementType>
<RATE>
Set the relative speed adjustment at which words are synthesized.
syntax: <RATE
 [ ABSSPEED = int ]
 [ SPEED = int ]
>
  mixed content
</RATE>
content: mixed
order: many (default)
parents: SAPI
children: (none)
attributes: ABSSPEED, SPEED
model: closed
source:
<ElementType name="RATE" content="mixed" model="closed">
	<description>Set the relative speed adjustment at which words are synthesized.</description>
	<attribute type="SPEED"/>
	<attribute type="ABSSPEED"/>
</ElementType>
<SAPI>
At the beginning of the SAPI tag, the voice is in the same state as at the tag's insertion point. At the close of the SAPI tag, the voice returns to that state. SAPI tags may be nested. When a nested SAPI tag is closed, the voice state returns to what it was at the insertion point of the nested tag.
syntax: <SAPI >
  (many)
  <BOOKMARK>
  <SILENCE>
  <EMPH>
  <SPELL>
  <PARTOFSP>
  <PRON>
  <LANG>
  <VOICE>
  <RATE>
  <VOLUME>
  <PITCH>
  <CONTEXT>
  mixed content
</SAPI>
content: mixed
order: many (default)
parents: No parents found. This is probably the document element.
children: BOOKMARK, CONTEXT, EMPH, LANG, PARTOFSP, PITCH, PRON, RATE, SILENCE, SPELL, VOICE, VOLUME
attributes: (none)
model: open
source:
<ElementType name="SAPI" content="mixed" model="open">
	<description>At the beginning of the SAPI tag, the voice is in the same state as at the tag's insertion point. At the close of the SAPI tag, the voice returns to that state. SAPI tags may be nested. When a nested SAPI tag is closed, the voice state returns to what it was at the insertion point of the nested tag. </description>
	<element type="BOOKMARK"/>
	<element type="SILENCE"/>
	<element type="EMPH">
		<description>Place emphasis on the words contained by this element. It is up to the engine implementation to decide what emphasis means for that engine. </description>
	</element>
	<element type="SPELL">
		<description>Spell out, letter by letter, the words contained by this element. NOTE: The engine should not normalize the text scoped in the SPELL tag. This includes numbers, words, etc. For words that contain punctuation, such as “U.S.A”, both the letters and the punctuation scoped within the tag should be spelled out. </description>
	</element>
	<element type="PARTOFSP"/>
	<element type="PRON">
		<description>String representing a phoneme for a language supported by the voice that synthesizes the speech. </description>
	</element>
	<element type="LANG"/>
	<element type="VOICE"/>
	<element type="RATE"/>
	<element type="VOLUME">
		<description>0 to 100 (no overflow allowed)</description>
	</element>
	<element type="PITCH">
		<description>Set the relative pitch adjustment of synthesized speech.</description>
	</element>
	<element type="CONTEXT"/>
</ElementType>
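
The restore-on-close behavior described for the SAPI tag can be illustrated with a small tree walk. This is a sketch only, not the SAPI parser: the function name `walk` is hypothetical, only the relative SPEED attribute of RATE is modelled, and the voice state is reduced to a single rate value.

```python
import xml.etree.ElementTree as ET

def walk(element, rate=0, out=None):
    """Collect (text, rate) pairs, restoring the rate when a tag closes."""
    if out is None:
        out = []
    # Simplification: the only state change modelled is RATE with SPEED.
    if element.tag == "RATE" and "SPEED" in element.attrib:
        inner = rate + int(element.attrib["SPEED"])
    else:
        inner = rate
    if element.text and element.text.strip():
        out.append((element.text.strip(), inner))
    for child in element:
        walk(child, inner, out)
        # Text after the child's closing tag is spoken with the state
        # that was in effect at the child's insertion point.
        if child.tail and child.tail.strip():
            out.append((child.tail.strip(), inner))
    return out

doc = ('<SAPI>one <RATE SPEED="5">two <SAPI><RATE SPEED="-2">three</RATE>'
       ' four</SAPI></RATE> five</SAPI>')
segments = walk(ET.fromstring(doc))
```

Because the state is passed down by value, closing any tag, including the nested SAPI tag, leaves the enclosing scope with the state it had at the insertion point: "four" is spoken at rate 5 again and "five" at rate 0.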
<SILENCE>
Inserts a specified number of milliseconds of silence into the output audio stream.
syntax: <SILENCE
  MSEC = int
/>
content: empty
order: many (default)
parents: SAPI
children: (none)
attributes: MSEC
model: closed
source:
<ElementType name="SILENCE" content="empty" model="closed">
	<description>Inserts a specified number of milliseconds of silence into the output audio stream. </description>
	<attribute type="MSEC"/>
</ElementType>
<SPELL>
Spells out, letter by letter, the words contained by this element. Note: The engine should not normalize the text scoped in the SPELL tag. This includes numbers, words, etc. For words that contain punctuation, such as "U.S.A.", both the letters and the punctuation scoped within the tag should be spelled out.
syntax: <SPELL />
content: empty
order: many (default)
parents: SAPI
children: (none)
attributes: (none)
model: closed
source:
<ElementType name="SPELL" content="empty" model="closed">
	<description>Spells out, letter by letter, the words contained by this element. 
Note:  The engine should not normalize the text scoped in the SPELL tag. This includes numbers, words, etc. For words that contain punctuation, such as "U.S.A.", both the letters and the punctuation scoped within the tag should be spelled out. </description>
</ElementType>
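
The "no normalization" rule for SPELL can be pictured as splitting the scoped text into individual characters. The sketch below is only an illustration of that rule: the function name `spell_out` is hypothetical, and skipping whitespace is an assumption, since the schema does not say how spaces are treated.

```python
def spell_out(text):
    """Return the units to speak one at a time.

    Every non-space character is kept, punctuation and digits included,
    and nothing is normalized (no number expansion, no abbreviations).
    Skipping whitespace is an assumption of this sketch.
    """
    return [ch for ch in text if not ch.isspace()]

units = spell_out("U.S.A.")
```

With this sketch, `units` contains the three letters and the three periods in their original order, and a string such as "Room 42" is split into letters and the digits 4 and 2 rather than being read as "forty-two".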
<VOICE>
Sets which voice implementation is used for synthesis of associated input stream text. The best voice implementation given the required and optional attributes will be selected by SAPI.
syntax: <VOICE
 [ OPTIONAL = string ]
 [ REQUIRED = string ]
>
  mixed content
</VOICE>
content: mixed
order: many (default)
parents: SAPI
children: (none)
attributes: OPTIONAL, REQUIRED
model: closed
source:
<ElementType name="VOICE" content="mixed" model="closed">
	<description>Sets which voice implementation is used for synthesis of associated input stream text. The best voice implementation given the required and optional attributes will be selected by SAPI. </description>
	<attribute type="REQUIRED"/>
	<attribute type="OPTIONAL"/>
</ElementType>
<VOLUME>
The scoped/global element VOLUME modifies the underlying numerical values of a speech block. The underlying value can never be below zero or exceed 100. All negative value entries will result in zero and all values above 100 will result in 100. VOLUME may also receive an absolute value (no '-' or '+' character) of an integer between zero and 100.
syntax: <VOLUME
  LEVEL = int
>
  mixed content
</VOLUME>
content: mixed
order: many (default)
parents: SAPI
children: (none)
attributes: LEVEL
model: closed
source:
<ElementType name="VOLUME" content="mixed" model="closed">
	<description>The scoped/global element VOLUME modifies the underlying numerical values of a speech block. The underlying value can never be below zero or exceed 100. All negative value entries will result in zero and all values above 100 will result in 100. VOLUME may also receive an absolute value (no '-' or '+' character) of an integer between zero and 100. </description>
	<attribute type="LEVEL"/>
</ElementType>

SAPI Attributes

<... ABSMIDDLE="">
The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the pitch to the maximum or minimum pitch they support. Values of –24 and +24 must lower and raise pitch by 1 octave, respectively. Every increment/decrement of 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is absolute.
syntax: [ ABSMIDDLE = int ]
required: no (default)
datatype: int
elements: PITCH
source:
<AttributeType name="ABSMIDDLE" dt:type="int">
	<description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the pitch to the maximum or minimum pitch they support. Values of –24 and +24 must lower and raise pitch by 1 octave, respectively. Every increment/decrement of 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is absolute.</description>
</AttributeType>
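
The pitch scale described for this attribute corresponds to a multiplier of 2 raised to the power value/24. The short sketch below checks the quoted figures; the function name `pitch_multiplier` is hypothetical and the code is not part of SAPI.

```python
def pitch_multiplier(value):
    """Multiplier applied to the default pitch for a pitch value.

    One step multiplies the pitch by the 24th root of 2, so 24 steps
    double it (one octave up) and -24 steps halve it (one octave down).
    """
    return 2.0 ** (value / 24.0)

step = pitch_multiplier(1)      # one increment
upper = pitch_multiplier(10)    # top of the guaranteed range
lower = pitch_multiplier(-10)   # bottom of the guaranteed range
```

Note that 4/3 and 3/4 are rounded figures: the multipliers at +10 and –10 are close to, but not exactly, those fractions.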
<... ABSSPEED="">
The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the rate to the maximum or minimum rate they support. When scoped, this attribute is absolute.
syntax: [ ABSSPEED = int ]
required: no (default)
datatype: int
elements: RATE
source:
<AttributeType name="ABSSPEED" dt:type="int">
	<description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the rate to the maximum or minimum rate they support. When scoped, this attribute is absolute.</description>
</AttributeType>
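
The rate scale corresponds to a multiplier of 3 raised to the power value/10. The sketch below also models an engine that clips out-of-range values; the function name `rate_multiplier` and the `engine_min`/`engine_max` parameters are hypothetical, chosen here to match the guaranteed range.

```python
def rate_multiplier(value, engine_min=-10, engine_max=10):
    """Multiplier applied to the default speaking rate.

    engine_min / engine_max are hypothetical limits of an engine that
    clips out-of-range values instead of honoring them.
    """
    clipped = max(engine_min, min(engine_max, value))
    # One step multiplies the rate by the 10th root of 3.
    return 3.0 ** (clipped / 10.0)

fastest = rate_multiplier(15)   # clipped to +10 by this sketch
```

An engine that supports a wider range could be modelled by passing larger limits, for example `rate_multiplier(15, engine_max=20)`.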
<... ID="">
This specifies the type of context. Refer to the SAPI documentation for the various context IDs.
syntax: ID = string
required: yes
datatype: string
elements: CONTEXT
source:
<AttributeType name="ID" dt:type="string" required="yes">
	<description>This specifies the type of context. Refer to the SAPI documentation for the various context IDs.</description>
</AttributeType>
<... LANGID="">
Language identifier. The language identifier is specified as a hexadecimal value. For example, the LANGID for English (US) expressed in the hexadecimal form is 409.
syntax: LANGID = int
required: yes
datatype: int
elements: LANG
source:
<AttributeType name="LANGID" dt:type="int" required="yes">
	<description>Language identifier. The language identifier is specified as a hexadecimal value. For example, the LANGID for English (US) expressed in the hexadecimal form is 409. </description>
</AttributeType>
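
Because the LANGID value is written in hexadecimal, code that produces or reads it has to convert to and from base 16. A minimal sketch, with hypothetical helper names `parse_langid` and `format_langid`:

```python
def parse_langid(text):
    """Interpret a LANGID attribute value as a hexadecimal number."""
    return int(text, 16)

def format_langid(langid):
    """Format a numeric language identifier as hexadecimal digits."""
    return format(langid, "x")

english_us = parse_langid("409")
```

Here `english_us` is the integer 1033, the decimal equivalent of hexadecimal 409, and `format_langid(1033)` returns the string "409" for use in a LANG tag.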
<... LEVEL="">
This specifies the volume as a percentage of the maximum volume of the current voice. Each voice implementation has its own maximum volume. This value must be between 0 and 100 inclusive. Values above 100 or below 0 are clipped to 100 and 0 respectively.
syntax: LEVEL = int
required: yes
datatype: int
elements: VOLUME
source:
<AttributeType name="LEVEL" dt:type="int" required="yes">
	<description>This specifies the volume as a percentage of the maximum volume of the current voice. Each voice implementation has its own maximum volume. This value must be between 0 and 100 inclusive. Values above 100 or below 0 are clipped to 100 and 0 respectively.</description>
</AttributeType>
<... MARK="">
The value of a bookmark may be any string or integer.
syntax: MARK = int
required: yes
datatype: int
elements: BOOKMARK
source:
<AttributeType name="MARK" dt:type="int" required="yes">
	<description>The value of a bookmark may be any string or integer. </description>
</AttributeType>
<... MIDDLE="">
The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the pitch to the maximum or minimum pitch they support. Values of –24 and +24 must lower and raise pitch by 1 octave, respectively. Every increment/decrement of 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is relative.
syntax: MIDDLE = int
required: yes
datatype: int
elements: PITCH
source:
<AttributeType name="MIDDLE" dt:type="int" required="yes">
	<description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the pitch to the maximum or minimum pitch they support. Values of –24 and +24 must lower and raise pitch by 1 octave, respectively. Every increment/decrement of 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is relative.</description>
</AttributeType>
<... MSEC="">
Number of milliseconds, from zero to 65535, of silence. Value entries that exceed this range should be limited to 65535. Value entries that are below this range (negative values) should be set to zero.
syntax: MSEC = int
required: yes
datatype: int
elements: SILENCE
source:
<AttributeType name="MSEC" dt:type="int" required="yes">
	<description>Number of milliseconds, from zero to 65535, of silence. Value entries that exceed this range should be limited to 65535. Value entries that are below this range (negative values) should be set to zero. </description>
</AttributeType>
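
The limiting rule for MSEC can be applied before a SILENCE tag is written. The following is a sketch, not SAPI code: the helper name `silence_tag` is hypothetical, and the tag text is produced with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

MAX_MSEC = 65535  # upper bound stated for the MSEC attribute

def silence_tag(msec):
    """Return a SILENCE element with MSEC limited to 0..65535.

    Values above the range become 65535; negative values become 0.
    """
    limited = max(0, min(MAX_MSEC, int(msec)))
    element = ET.Element("SILENCE", MSEC=str(limited))
    return ET.tostring(element, encoding="unicode")

pause = silence_tag(500)
too_long = silence_tag(70000)
negative = silence_tag(-20)
```

In this sketch `too_long` carries MSEC="65535" and `negative` carries MSEC="0", while `pause` keeps its requested 500 milliseconds.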
<... OPTIONAL="">
The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags.
syntax: [ OPTIONAL = string ]
required: no (default)
datatype: string
elements: VOICE
source:
<AttributeType name="OPTIONAL" dt:type="string">
	<description>The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags. 
</description>
</AttributeType>
<... PART="">
String name of part of speech. Valid SAPI parts of speech are noun, verb, modifier, function, interjection and unknown.
syntax: PART = enumeration: noun|verb|modifier|function|interjection|unknown
required: yes
datatype: enumeration
values: noun|verb|modifier|function|interjection|unknown
elements: PARTOFSP
source:
<AttributeType name="PART" dt:type="enumeration" dt:values="noun|verb|modifier|function|interjection|unknown" required="yes">
	<description>String name of part of speech. Valid SAPI parts of speech are noun, verb, modifier, function, interjection and unknown. </description>
</AttributeType>
<... REQUIRED="">
The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags.
syntax: [ REQUIRED = string ]
required: no (default)
datatype: string
elements: VOICE
source:
<AttributeType name="REQUIRED" dt:type="string">
	<description>The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags. 
</description>
</AttributeType>
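
The "first registered voice containing all of the specified attributes" rule can be sketched as a linear search. Everything concrete here is an assumption made for illustration: the `Name=Value` form of each semicolon-delimited sub-string, the attribute names, the voice entries, and the function names `parse_attributes` and `select_voice`.

```python
def parse_attributes(spec):
    """Split 'Name=Value;Name=Value' into a dict (format assumed)."""
    wanted = {}
    for item in spec.split(";"):
        if "=" in item:
            name, value = item.split("=", 1)
            wanted[name.strip()] = value.strip()
    return wanted

def select_voice(voices, required):
    """Return the first registered voice that has every required
    attribute, or None when no voice qualifies (the case in which
    the speak call would fail)."""
    wanted = parse_attributes(required)
    for voice in voices:
        if all(voice.get(name) == value for name, value in wanted.items()):
            return voice
    return None

# Hypothetical registry, in registration order.
voices = [
    {"Name": "Voice A", "Gender": "Male", "Age": "Adult"},
    {"Name": "Voice B", "Gender": "Female", "Age": "Adult"},
    {"Name": "Voice C", "Gender": "Female", "Age": "Adult"},
]
chosen = select_voice(voices, "Gender=Female;Age=Adult")
missing = select_voice(voices, "Gender=Female;Age=Child")
```

Both "Voice B" and "Voice C" satisfy the first request, and the sketch returns "Voice B" because it was registered first; the second request matches nothing, so `missing` is None.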
<... SPEED="">
The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the rate to the maximum or minimum rate they support. When scoped, this attribute is relative.
syntax: [ SPEED = int ]
required: no (default)
datatype: int
elements: RATE
source:
<AttributeType name="SPEED" dt:type="int">
	<description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the rate to the maximum or minimum rate they support. When scoped, this attribute is relative.</description>
</AttributeType>
<... SYM="">
String representing a phoneme for a language supported by the voice that synthesizes the speech. Refer to the SAPI Phoneme Spec.
syntax: SYM = char
required: yes
datatype: char
elements: PRON
source:
<AttributeType name="SYM" dt:type="char" required="yes">
	<description>String representing a phoneme for a language supported by the voice that synthesizes the speech. Refer to the SAPI Phoneme Spec.</description>
</AttributeType>

SAPI Source

<Schema name="SAPI" xmlns="urn:schemas-microsoft-com:xml-data" xmlns:dt="urn:schemas-microsoft-com:datatypes">
	<description> This schema describes the SAPI 5.0 TTS XML grammar format. The SAPI TTS XML schema is included in the TTS XML parser. Hence, it is not necessary to include the schema in the XML file when authoring a grammar. NOTE: This schema is based on the Microsoft schema language and is not fully W3C compliant. This schema will be rewritten and will be compliant with the W3C standard once it has been approved by the W3C.</description>
	<!-- Attribute definitions -->
	<AttributeType name="ID" dt:type="string" required="yes">
		<description>This specifies the type of context. Refer to the SAPI documentation for the various context IDs.</description>
	</AttributeType>
	<AttributeType name="SYM" dt:type="char" required="yes">
		<description>String representing a phoneme for a language supported by the voice that synthesizes the speech. Refer to the SAPI Phoneme Spec.</description>
	</AttributeType>
	<AttributeType name="LANGID" dt:type="int" required="yes">
		<description>Language identifier. The language identifier is specified as a hexadecimal value. For example, the LANGID for English (US) expressed in the hexadecimal form is 409. </description>
	</AttributeType>
	<AttributeType name="LEVEL" dt:type="int" required="yes">
		<description>This specifies the volume as a percentage of the maximum volume of the current voice. Each voice implementation has its own maximum volume. This value must be between 0 and 100 inclusive. Values above 100 or below 0 are clipped to 100 and 0 respectively.</description>
	</AttributeType>
	<AttributeType name="MARK" dt:type="int" required="yes">
		<description>The value of a bookmark may be any string or integer. </description>
	</AttributeType>
	<AttributeType name="MIDDLE" dt:type="int" required="yes">
		<description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the pitch to the maximum or minimum pitch they support. Values of –24 and +24 must lower and raise pitch by 1 octave, respectively. Every increment/decrement of 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is relative.</description>
	</AttributeType>
	<AttributeType name="MSEC" dt:type="int" required="yes">
		<description>Number of milliseconds, from zero to 65535, of silence. Value entries that exceed this range should be limited to 65535. Value entries that are below this range (negative values) should be set to zero. </description>
	</AttributeType>
	<AttributeType name="OPTIONAL" dt:type="string">
		<description>The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags. 
</description>
	</AttributeType>
	<AttributeType name="REQUIRED" dt:type="string">
		<description>The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags. 
</description>
	</AttributeType>
	<AttributeType name="SPEED" dt:type="int">
		<description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the rate to the maximum or minimum rate they support. When scoped, this attribute is relative.</description>
	</AttributeType>
	<AttributeType name="PART" dt:type="enumeration" dt:values="noun|verb|modifier|function|interjection|unknown" required="yes">
		<description>String name of part of speech. Valid SAPI parts of speech are noun, verb, modifier, function, interjection and unknown. </description>
	</AttributeType>
	<AttributeType name="ABSMIDDLE" dt:type="int">
		<description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the pitch to the maximum or minimum pitch they support. Values of –24 and +24 must lower and raise pitch by 1 octave, respectively. Every increment/decrement of 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is absolute.</description>
	</AttributeType>
	<AttributeType name="ABSSPEED" dt:type="int">
		<description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 multiplies/divides the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the rate to the maximum or minimum rate they support. When scoped, this attribute is absolute.</description>
	</AttributeType>
	<!-- Definition of SAPI Element -->
	<ElementType name="SAPI" content="mixed" model="open">
		<description>At the beginning of the SAPI tag, the voice is in the same state as at the tag's insertion point. At the close of the SAPI tag, the voice returns to that state. SAPI tags may be nested. When a nested SAPI tag is closed, the voice state returns to what it was at the insertion point of the nested tag. </description>
		<element type="BOOKMARK"/>
		<element type="SILENCE"/>
		<element type="EMPH">
			<description>Place emphasis on the words contained by this element. It is up to the engine implementation to decide what emphasis means for that engine. </description>
		</element>
		<element type="SPELL">
			<description>Spell out, letter by letter, the words contained by this element. NOTE: The engine should not normalize the text scoped in the SPELL tag. This includes numbers, words, etc. For words that contain punctuation, such as “U.S.A”, both the letters and the punctuation scoped within the tag should be spelled out. </description>
		</element>
		<element type="PARTOFSP"/>
		<element type="PRON">
			<description>String representing a phoneme for a language supported by the voice that synthesizes the speech. </description>
		</element>
		<element type="LANG"/>
		<element type="VOICE"/>
		<element type="RATE"/>
		<element type="VOLUME">
			<description>0 to 100 (no overflow allowed)</description>
		</element>
		<element type="PITCH">
			<description>Set the relative pitch adjustment of synthesized speech.</description>
		</element>
		<element type="CONTEXT"/>
	</ElementType>
	<!-- Definition of elements -->
	<!-- Definition of BOOKMARK Element -->
	<ElementType name="BOOKMARK" content="empty" model="closed">
		<description>Inserts a bookmark into the input stream using the bookmark element. If an application specifies interest in bookmark events, it will receive an event when synthesis has passed this element in an input stream. If the audio output destination supports handling of events, then an application will receive this event once the synthesized speech up to this bookmark has been output. Otherwise, an application receives a bookmark event when the voice implementation has synthesized speech up to this bookmark. </description>
		<attribute type="MARK"/>
	</ElementType>
	<!-- Definition of SILENCE Element -->
	<ElementType name="SILENCE" content="empty" model="closed">
		<description>Inserts a specified number of milliseconds of silence into the output audio stream.</description>
		<attribute type="MSEC"/>
	</ElementType>
	<!-- Definition of EMPH Element -->
	<ElementType name="EMPH" content="empty" model="closed">
		<description>Places emphasis on the words contained by this element. </description>
	</ElementType>
	<!-- Definition of SPELL Element -->
	<ElementType name="SPELL" content="empty" model="closed">
		<description>Spells out, letter by letter, the words contained by this element. Note: The engine should not normalize the text scoped in the SPELL tag. This includes numbers, words, etc. Words that contain punctuation, such as "U.S.A.", should be spelled out with both the letters and the punctuation scoped within the tag.</description>
	</ElementType>
	<!-- Definition of PARTOFSP Element -->
	<ElementType name="PARTOFSP" content="mixed" model="closed">
		<description>The part of speech of contained word(s). The PARTOFSP tag is used to force a particular pronunciation of a word (for example, the word record as a noun versus the word record as a verb). </description>
		<attribute type="PART"/>
	</ElementType>
	<!-- Definition of PRON Element -->
	<ElementType name="PRON" content="mixed" model="open">
		<description>Pronounces the contained text (possibly empty) according to the provided Unicode string.</description>
		<attribute type="SYM"/>
	</ElementType>
	<!-- Definition of LANG Element -->
	<ElementType name="LANG" content="mixed" model="closed">
		<description>Changes the LANGID of the scoped text. When the LANGID is changed, SAPI will try to detect whether the current voice can handle the new language. If the voice does not speak the specified language, the engine must choose another language it speaks as a best attempt. This fallback can be prevented, if it is undesirable, by using the VOICE tag with the REQUIRED attribute.</description>
		<attribute type="LANGID"/>
	</ElementType>
	<!-- Definition of VOICE Element -->
	<ElementType name="VOICE" content="mixed" model="closed">
		<description>Sets which voice implementation is used for synthesis of associated input stream text. The best voice implementation given the required and optional attributes will be selected by SAPI. </description>
		<attribute type="REQUIRED"/>
		<attribute type="OPTIONAL"/>
	</ElementType>
	<!-- Definition of RATE Element -->
	<ElementType name="RATE" content="mixed" model="closed">
		<description>Set the relative speed adjustment at which words are synthesized.</description>
		<attribute type="SPEED"/>
		<attribute type="ABSSPEED"/>
	</ElementType>
	<!-- Definition of VOLUME Element -->
	<ElementType name="VOLUME" content="mixed" model="closed">
		<description>The scoped/global element VOLUME modifies the underlying numerical value of a speech block. The underlying value can never be below zero or exceed 100. All negative values result in zero, and all values above 100 result in 100. VOLUME may also receive an absolute value (no '-' or '+' character): an integer between zero and 100.</description>
		<attribute type="LEVEL"/>
	</ElementType>
	<!-- Definition of PITCH Element -->
	<ElementType name="PITCH" content="mixed" model="closed">
		<description>The scoped/global element PITCH modifies the underlying numerical values of a speech block. Relative attribute values, those preceded by a dash (-) or a plus sign (+), adjust the underlying numerical value by the specified amount. SAPI-compliant engines have the option of supporting only the guaranteed range of values, treating adjustments below -10 as -10 and adjustments above +10 as +10.</description>
		<attribute type="MIDDLE"/>
		<attribute type="ABSMIDDLE"/>
	</ElementType>
	<!-- Definition of CONTEXT Element -->
	<ElementType name="CONTEXT" content="mixed" model="closed">
		<description>The context can specify the type of normalization rules that should be applied to the scoped text. SAPI does not guarantee any predefined contexts.</description>
		<attribute type="ID"/>
	</ElementType>
</Schema>
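
Example (illustrative only):

The element definitions above can be hard to picture without an instance document, so the following is a minimal sketch of TTS input that uses several of them. The spoken text is invented for the example, and the attribute values (in particular PART="noun" and MARK="1") are assumptions chosen for illustration; they are not taken from this schema listing.

```xml
<!-- Illustrative sketch only: the text and attribute values are assumptions. -->
<SAPI>
  This sentence is spoken with the voice state in effect at the insertion point.
  <RATE ABSSPEED="5">This sentence is spoken faster than the default rate.</RATE>
  <RATE ABSSPEED="-10">This sentence is spoken at one third of the default rate.</RATE>
  <SILENCE MSEC="500"/>
  <VOLUME LEVEL="50">This sentence is spoken at half volume.</VOLUME>
  I want to <PARTOFSP PART="noun">record</PARTOFSP> sales this year.
  <BOOKMARK MARK="1"/>
</SAPI>
```

Following the ABSSPEED description, each step multiplies the rate by the 10th root of 3, so ABSSPEED="5" should correspond to about 3^(5/10), or roughly 1.73 times the default rate, while ABSSPEED="-10" corresponds to one-third of it. Once the outer SAPI tag closes, the voice returns to the state it had at the insertion point.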

Schema Attributes Reference:

open model
The element can contain elements, attributes, and text not specified in the content model. This is the default value.
closed model
The element cannot contain any elements, attributes, or text other than those specified in the content model. DTDs use a closed model.
textOnly content
The element can contain only text, not elements. Note that if the model attribute is set to "open", the element can contain text and additional elements.
eltOnly content
The element can contain only the specified elements, not free text. Note that if the model attribute is set to "open", the element can contain text and additional elements.
empty content
The element cannot contain text or elements. Note that if the model attribute is set to "open", the element can contain text and additional elements.
mixed content
The element can contain a mix of named elements and text. This is the default value.
one order
Permits only one of a set of elements.
seq order
Requires the elements to appear in the specified sequence.
many order
Permits the elements to appear (or not appear) in any order. This is the default.
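
Example (illustrative only):

To show how the model, content, and order values above fit together, here is a small sketch of a schema written in the same Microsoft schema language. All element names (title, author, pagebreak, book, chapter) and the schema name are hypothetical and are not part of the SAPI schema.

```xml
<!-- Hypothetical element names; a sketch only, not part of the SAPI schema. -->
<Schema name="ExampleSchema" xmlns="urn:schemas-microsoft-com:xml-data">
  <!-- textOnly content: text is allowed, child elements are not -->
  <ElementType name="title" content="textOnly" model="closed"/>
  <ElementType name="author" content="textOnly" model="closed"/>
  <!-- empty content, closed model: neither text nor child elements -->
  <ElementType name="pagebreak" content="empty" model="closed"/>
  <!-- eltOnly content, seq order: title must appear before author, with no free text -->
  <ElementType name="book" content="eltOnly" order="seq" model="closed">
    <element type="title"/>
    <element type="author"/>
  </ElementType>
  <!-- mixed content, many order: text interleaved with the listed elements in any order -->
  <ElementType name="chapter" content="mixed" order="many" model="closed">
    <element type="title"/>
    <element type="pagebreak"/>
  </ElementType>
</Schema>
```

Under these assumptions, a book element containing an author before a title would be rejected because of order="seq", while a chapter element may freely mix text with title and pagebreak elements.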

Datatype Reference:

bin.base64 datatype
MIME-style Base64 encoded binary BLOB.
bin.hex datatype
Hexadecimal digits representing octets.
boolean datatype
0 or 1, where 0 == "false" and 1 == "true".
char datatype
String, one character long.
date datatype
Date in a subset of ISO 8601 format, without the time data. For example: "1994-11-05".
dateTime datatype
Date in a subset of ISO 8601 format, with optional time and no optional zone. Fractional seconds can be as precise as nanoseconds. For example, "1988-04-07T18:39:09".
dateTime.tz datatype
Date in a subset of ISO 8601 format, with optional time and optional zone. Fractional seconds can be as precise as nanoseconds. For example: "1988-04-07T18:39:09-08:00".
entity datatype
Represents the XML ENTITY type.
entities datatype
Represents the XML ENTITIES type.
enumeration datatype
Represents an enumerated type (supported on attributes only).
fixed.14.4 datatype
Same as "number" but no more than 14 digits to the left of the decimal point, and no more than 4 to the right.
float datatype
Real number, with no limit on digits; can potentially have a leading sign, fractional digits, and optionally an exponent. Punctuation as in U.S. English. Values range from 1.7976931348623157E+308 to 2.2250738585072014E-308.
id datatype
Represents the XML ID type.
idref datatype
Represents the XML IDREF type.
idrefs datatype
Represents the XML IDREFS type.
int datatype
Number, with optional sign, no fractions, and no exponent.
nmtoken datatype
Represents the XML NMTOKEN type.
nmtokens datatype
Represents the XML NMTOKENS type.
notation datatype
Represents a NOTATION type.
number datatype
Number, with no limit on digits; can potentially have a leading sign, fractional digits, and optionally an exponent. Punctuation as in U.S. English. (Values have the same range as the largest real type, r8: 1.7976931348623157E+308 to 2.2250738585072014E-308.)
string datatype
Represents a string type.
time datatype
Time in a subset of ISO 8601 format, with no date and no time zone. For example: "08:15:27".
time.tz datatype
Time in a subset of ISO 8601 format, with no date but optional time zone. For example: "08:15:27-05:00".
i1 datatype
Integer represented in one byte. A number, with optional sign, no fractions, no exponent. For example: "1, 127, -128".
i2 datatype
Integer represented in one word (two bytes). A number, with optional sign, no fractions, no exponent. For example: "1, 703, -32768".
i4 datatype
Integer represented in four bytes. A number, with optional sign, no fractions, no exponent. For example: "1, 703, -32768, 148343, -1000000000".
r4 datatype
Real number, with seven digit precision; can potentially have a leading sign, fractional digits, and optionally an exponent. Punctuation as in U.S. English. Values range from 3.40282347E+38F to 1.17549435E-38F.
r8 datatype
Real number, with 15 digit precision; can potentially have a leading sign, fractional digits, and optionally an exponent. Punctuation as in U.S. English. Values range from 1.7976931348623157E+308 to 2.2250738585072014E-308.
ui1 datatype
Unsigned integer, one byte. A number, unsigned, no fractions, no exponent. For example: "1, 255".
ui2 datatype
Unsigned integer, two bytes. A number, unsigned, no fractions, no exponent. For example: "1, 255, 65535".
ui4 datatype
Unsigned integer, four bytes. A number, unsigned, no fractions, no exponent. For example: "1, 703, 3000000000".
uri datatype
Uniform Resource Identifier (URI). For example, "urn:schemas-microsoft-com:Office9".
uuid datatype
Hexadecimal digits representing octets, optional embedded hyphens that are ignored. For example: "333C7BC4-460F-11D0-BC04-0080C7055A83".
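
Example (illustrative only):

As a sketch of how the datatypes above might be attached to attributes and elements in this schema language, the following declares a few typed items. The schema name and all attribute and element names (count, enabled, size, created, item) are hypothetical; only the datatype names come from the list above.

```xml
<!-- Hypothetical attribute and element names; a sketch only. -->
<Schema name="DatatypeExamples"
        xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <!-- int: optional sign, no fractions, no exponent -->
  <AttributeType name="count" dt:type="int"/>
  <!-- boolean: 0 or 1 -->
  <AttributeType name="enabled" dt:type="boolean"/>
  <!-- enumeration: supported on attributes only; allowed values listed in dt:values -->
  <AttributeType name="size" dt:type="enumeration" dt:values="small medium large"/>
  <!-- dateTime: ISO 8601 subset, optional time, no zone -->
  <ElementType name="created" content="textOnly" dt:type="dateTime"/>
  <ElementType name="item" content="eltOnly" model="closed">
    <attribute type="count"/>
    <attribute type="enabled"/>
    <attribute type="size"/>
    <element type="created"/>
  </ElementType>
</Schema>
```

An instance fragment intended to satisfy these hypothetical declarations could look like this:

```xml
<item count="-42" enabled="1" size="medium">
  <created>1988-04-07T18:39:09</created>
</item>
```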