Speech Synthesis Markup Language (SSML) in Chatbots

by Michael Szul on Sat Jul 07 2018 10:53:54

When most people think about chatbots, they think about text applications (messaging) and interacting with bots through something like Skype or a web chat client. In fact, chatbots also work well inside voice-enabled personal digital assistants, and their spoken output can be refined with a dedicated markup language.

Speech Synthesis Markup Language (SSML) is an XML-based markup language designed to describe speech and voice metadata for things like text-to-speech. The Microsoft Bot Framework (as well as other services from both Azure and AWS) uses it to control how speech works for Cortana, web clients, and any other channel that allows for speech.

A typical SSML string would look like this:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string">What time is the movie playing?</speak>

In this example, <speak> is the root element. On its own, this markup doesn't do much for a Cortana skill, since Cortana will simply speak the provided text as if there were no markup. The power is in the other elements and attributes, which alter how the text is spoken.

For example, you can provide emphasis:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string">What <emphasis level="strong">time</emphasis> is the movie playing?</speak>

Another element that can be used to change the audio of the spoken text is the <voice> element. This allows you to alter gender, age and other variables that can affect the sentence.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string">
          <voice xml:lang="en-gb" gender="female">I have <emphasis level="moderate">nothing</emphasis> to say to you at this time.</voice>
      </speak>

The <prosody> element, meanwhile, allows you to control the pitch, range, volume, etc. of the voice in a text-to-speech application.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string">
          <voice xml:lang="en-gb" gender="female">I have <prosody pitch="+1st" rate="-10%" volume="90">nothing</prosody> to say to you at this time.</voice>
      </speak>

You can also implement pauses by using the <break> element.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string">
          <voice xml:lang="en-gb" gender="female">I have <break time="5000ms" />nothing to say to you at this time.</voice>
      </speak>
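
When the spoken text is assembled at runtime, writing these elements by hand gets error-prone. The following is a minimal sketch, assuming a handful of hypothetical helpers (escapeXml, element, emphasis, prosody, and pause are names invented here, not part of any SDK), that builds the same kinds of fragments shown above while escaping characters that would otherwise break the XML:

```javascript
// Hypothetical helpers for illustration; they are not part of any SDK.

// Escape characters that are special inside XML text and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build one SSML element. Pass body = null for an empty element such as <break />.
// A non-null body is inserted as-is, so it may contain nested SSML.
function element(name, attributes, body) {
  const attrs = Object.entries(attributes)
    .map(([key, value]) => ` ${key}="${escapeXml(value)}"`)
    .join("");
  return body === null
    ? `<${name}${attrs} />`
    : `<${name}${attrs}>${body}</${name}>`;
}

const emphasis = (text, level) => element("emphasis", { level }, escapeXml(text));
const prosody = (text, settings) => element("prosody", settings, escapeXml(text));
const pause = (milliseconds) => element("break", { time: `${milliseconds}ms` }, null);

const spoken = element(
  "voice",
  { "xml:lang": "en-gb", gender: "female" },
  `I have ${pause(500)}${prosody("nothing", { pitch: "+1st", rate: "-10%" })} to say.`
);

console.log(spoken);
```

Keeping the escaping inside the helpers means user-supplied text such as "R&D" cannot produce malformed markup.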

You can use the <say-as> element to tell the synthesis engine how to interpret a piece of text, such as a number, when speaking it.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string">
          <s>I own <say-as interpret-as="cardinal">15,000</say-as> comic books.</s>
      </speak>

An important note is that the <s> element used above represents a sentence, and it's optional. SSML has sentence and paragraph (<p>) elements to inform structure, but these can be handled automatically by the synthesis engine.

You can also explicitly instruct the synthesis engine on phonetic pronunciation. Amazon's SSML documentation has a great example using the word "pecan."

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string">
          <p>
              <s>You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.</s>
              <s>I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.</s>
          </p>
      </speak>
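
Because SSML is XML, every element that is opened must also be closed, and a missing slash in a closing tag is easy to overlook in nested markup like this. Here is a rough sketch of a sanity check you could run before sending a string to a speech channel. The function name findUnbalancedTag is made up for this example, and the check is deliberately simple: it is not a full XML parser and assumes attribute values do not contain the ">" character.

```javascript
// Hypothetical helper: a rough balance check, not a full XML parser.
// Returns the name of the first tag that is unbalanced, or null if all tags pair up.
function findUnbalancedTag(ssml) {
  const stack = [];
  // Captures: optional leading "/", the tag name, optional trailing "/" (empty element).
  const tagPattern = /<(\/?)([A-Za-z][\w:-]*)[^>]*?(\/?)>/g;
  let match;
  while ((match = tagPattern.exec(ssml)) !== null) {
    const [, closing, name, selfClosing] = match;
    if (selfClosing) {
      continue; // e.g. <break time="500ms" /> needs no closing tag
    }
    if (!closing) {
      stack.push(name); // opening tag: remember it
      continue;
    }
    if (stack.pop() !== name) {
      return name; // closing tag that does not match the most recent opening tag
    }
  }
  // Anything left on the stack was opened but never closed.
  return stack.length > 0 ? stack[stack.length - 1] : null;
}

const wellFormed =
  '<speak><p><s>You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.</s></p></speak>';
const missingSlash = "<speak><s>I own 15,000 comic books.</s><speak>";

console.log(findUnbalancedTag(wellFormed));
console.log(findUnbalancedTag(missingSlash));
```

For anything beyond a quick check, a real XML parser is the safer choice.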

There are other available elements for SSML, but those deal more with structure, word substitution, and attached metadata: elements you likely will not work with often for chatbots. Both Microsoft and Amazon have good online documentation for using SSML, so if you want to explore further, I suggest you check them out.

How does all this work with chatbots? A very basic way to incorporate this is to use the session.say() method inside of a Microsoft Bot Framework application. It takes the text to display, the SSML to speak, and an options object.

session.say("I have nothing to say to you at this time.", `<voice xml:lang="en-gb" gender="female">I have <emphasis level="moderate">nothing</emphasis> to say to you at this time.</voice>`, {
    inputHint: builder.InputHint.ignoringInput
});
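
Typing the sentence twice, once as display text and once as SSML, invites the two copies to drift apart. One option is to derive the display text from the SSML. The sketch below assumes a hypothetical stripSsml helper (not part of the Bot Framework SDK) and uses a stand-in session object so it runs without the SDK installed. The string "ignoringInput" is used here on the assumption that it matches the value of builder.InputHint.ignoringInput; in a real bot, pass the SDK constant as in the call above.

```javascript
// Hypothetical helper, not part of the Bot Framework SDK.
// Derives display text from an SSML fragment by dropping tags,
// decoding the basic XML entities, and collapsing whitespace.
function stripSsml(ssml) {
  return ssml
    .replace(/<[^>]+>/g, "")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&") // decoded last so "&amp;lt;" does not become "<"
    .replace(/\s+/g, " ")
    .trim();
}

// Stand-in for the SDK session; it only records what it was asked to say.
const fakeSession = {
  calls: [],
  say(text, speak, options) {
    this.calls.push({ text, speak, options });
  }
};

const ssml =
  '<voice xml:lang="en-gb" gender="female">I have <emphasis level="moderate">nothing</emphasis> to say to you at this time.</voice>';

// Assumption: "ignoringInput" mirrors builder.InputHint.ignoringInput.
fakeSession.say(stripSsml(ssml), ssml, { inputHint: "ignoringInput" });

console.log(fakeSession.calls[0].text);
```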