Text-To-Speech Audio Playback solution?

  • Question

  • I'm working on a TTS audio playback solution:

    1) text is sent to an Azure Function via Http

    2) Function calls Text-to-Speech API and gets the audio response (stream)

    3) Function streams the audio back to the client in the HTTP response

    4) Client plays the audio. 

    My questions are:

    1) How do I modify the code below to stream the audio to the client? (See context.res.) I have it working well and am getting the response from TTS; I just don't know how to return the audio via the function's HTTP response. (One possible approach is sketched after the code below.)

    2) Should I use Azure Media Player to stream the audio on the client? What do you recommend for audio playback on the client?

    3) Is there a better architecture you would recommend for streaming the audio?

    Thanks in advance for your help and suggestions!

    Donnie

    const axios = require('axios');
    const xmlbuilder = require('xmlbuilder');

    module.exports = async function (context, req) {
        let text;
        let apiToken = '';
        const subscriptionKey = "xxxxxx";

        if (req.body && req.body.text) {
            text = req.body.text;

            // Exchange the subscription key for a short-lived access token
            try {
                const tokenResponse = await axios({
                    url: 'https://eastus.api.cognitive.microsoft.com/sts/v1.0/issuetoken',
                    method: 'post',
                    headers: { 'Ocp-Apim-Subscription-Key': subscriptionKey }
                });
                apiToken = tokenResponse.data;
            } catch (error) {
                context.log(error);
            }

            if (apiToken !== '') {
                let audioResponse;
                try {
                    // Build the SSML body for the TTS request
                    let xml_body = xmlbuilder.create('speak')
                        .att('version', '1.0')
                        .att('xml:lang', 'en-us')
                        .ele('voice')
                        .att('xml:lang', 'en-us')
                        .att('name', 'Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)')
                        .txt(text)
                        .end();
                    let xmlbody = xml_body.toString();
                    context.log(xmlbody);

                    audioResponse = await axios({
                        method: 'post',
                        url: 'https://eastus.tts.speech.microsoft.com/cognitiveservices/v1',
                        data: xmlbody,
                        responseType: 'stream',
                        headers: {
                            'Authorization': 'Bearer ' + apiToken,
                            'cache-control': 'no-cache',
                            'User-Agent': 'SANDY',
                            'X-Microsoft-OutputFormat': 'riff-24khz-16bit-mono-pcm',
                            'Content-Type': 'application/ssml+xml'
                        }
                    });
                    context.log(audioResponse.data);
                } catch (error) {
                    context.log(error);
                }

                context.res = {
                    status: 200,
                    body: audioResponse.data.pipe(res) // ????? how do I return the audio here?
                };
            }
        }
        else {
            context.res = {
                status: 400,
                body: "Please pass text string in the request body"
            };
        }
    };
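
    One possible way to return the audio for question 1 (a sketch, not a confirmed solution): fetch the TTS response as an arraybuffer instead of a stream, then hand the Buffer to context.res; the isRaw flag should keep the Functions runtime from serializing the body. apiToken and xmlbody are assumed to be built exactly as in the function above.

      // Sketch: replaces the TTS call and the context.res block above.
      const audioResponse = await axios({
          method: 'post',
          url: 'https://eastus.tts.speech.microsoft.com/cognitiveservices/v1',
          data: xmlbody,
          responseType: 'arraybuffer', // axios returns a Node Buffer for this type
          headers: {
              'Authorization': 'Bearer ' + apiToken,
              'X-Microsoft-OutputFormat': 'riff-24khz-16bit-mono-pcm',
              'Content-Type': 'application/ssml+xml'
          }
      });

      context.res = {
          status: 200,
          headers: { 'Content-Type': 'audio/wav' }, // riff-24khz-16bit-mono-pcm is WAV
          body: Buffer.from(audioResponse.data),
          isRaw: true // return the bytes as-is, without JSON serialization
      };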


    Donnie Kerr


    • Edited by DonnieKerr Friday, February 8, 2019 2:32 PM
    Friday, February 8, 2019 2:24 PM

Answers

  • Hey Donnie, 

    I've been working on a project that takes the response from the TTS endpoint, loads it, and makes it available for playback in the browser. Here's the code snippet, written as a jQuery click handler that uses XMLHttpRequest. The trick was creating a blob, writing the data to that blob, then pointing an audio element at it and loading it.

      $("#text-to-speech").on("click", function(e) {
        e.preventDefault();
        var ttsInput = document.getElementById("translation-result").value;
        var ttsVoice = document.getElementById("select-voice").value;
        var ttsRequest = { 'text': ttsInput, 'voice': ttsVoice }
    
        var xhr = new XMLHttpRequest();
        xhr.open('post', '/text-to-speech', true);
        xhr.setRequestHeader("Content-Type", "application/json");
        xhr.responseType = "blob";
        xhr.onload = function(evt){
          if (xhr.status === 200) {
            audioBlob = new Blob([xhr.response], {type: "audio/mpeg"});
            audioURL = URL.createObjectURL(audioBlob);
            if (audioURL.length > 5){
              var audio = document.getElementById('audio');
              var source = document.getElementById('audio-source');
              source.src = audioURL;
              audio.load();
              audio.play();
            }else{
              console.log("An error occurred getting and playing the audio.")
            }
          }
        }
        xhr.send(JSON.stringify(ttsRequest));
      });
    Keep in mind that I'm making server-to-server API calls, so I'm not hitting the TTS endpoint directly. If you'd like to look at the source code for the project, here's the link: https://github.com/erhopf/translation-demo.
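
    For reference, the snippet assumes <audio> markup along these lines (an assumption; the element IDs come from the code above):

      <audio id="audio" controls>
        <source id="audio-source" src="" type="audio/mpeg" />
      </audio>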

    Wednesday, February 13, 2019 12:59 AM

All replies

  • Ling has some interesting stuff here https://twitter.com/donaldkerr/status/1094399327525588992 in a Twitter thread on how she made Cog Services bindings in .NET. But I just don't see how to do that in Node?

    I just need to know how to write the returned audio back to the client, or how to write it to a storage account so I can have a URL to pass to the HTML5 audio component to play it.


    Donnie Kerr


    • Edited by DonnieKerr Sunday, February 10, 2019 1:11 AM
    Sunday, February 10, 2019 1:10 AM
  • Thank you! In your saveAudio you return response.content. What format is the response?

    How did you know the blob needed to be of type audio/mpeg?

    I got back a wav file, and it was hard to tell whether it was a stream, a buffer, a blob, etc. There is no documentation that tells you what the format of the response is, so I wasn't sure if I could just return audioResponse.data or if I had to stream it. Piping it to a file doesn't work inside an Azure Function, so I needed a way to send the response back to the client. According to your server-side code, I would just need to send audioResponse.data back, similar to how you returned response.content in Python. Correct?

    I will give it a try and then convert it to a blob on the client like you did.

    Thanks!

    Donnie



    response = requests.post(constructed_url, headers=headers, data=body)
    # Write the response as a wav file for playback. The file is located
    # in the same directory where this sample is run.
    print(response)
    return response.content
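
    In Python, requests' response.content is the raw bytes of the body. A rough Node/axios equivalent (an assumption based on axios' responseType option; ttsUrl and ttsHeaders are hypothetical stand-ins for the endpoint and headers used earlier in the thread):

      // Ask axios for binary explicitly; the default text decoding would corrupt the audio.
      const audioResponse = await axios({
          method: 'post',
          url: ttsUrl,        // hypothetical: the cognitiveservices/v1 endpoint
          data: xmlbody,      // the SSML string
          responseType: 'arraybuffer',
          headers: ttsHeaders // hypothetical: same headers as in the function above
      });
      const audioBuffer = Buffer.from(audioResponse.data); // analogous to response.content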


    Donnie Kerr

    Wednesday, February 13, 2019 5:36 AM
  • Does your audio.play() work in Chrome? I can't get audio to play using the code below...

    Also, why does the Cog Services TTS API return a wav file for me and an mp3 for you? There doesn't seem to be a way to control that. We have the same audio output setting in the XML.

    Thanks,

    Donnie

    axios({
        method: 'post',
        url: apiurl,
        data: data,
        responseType: "blob"
    }).then(function (response) {
        let player = document.getElementById("player");
        let audioBlob = new Blob([response], { type: "audio/wav" });
        let audioURL = URL.createObjectURL(audioBlob);
        player.src = audioURL;
        player.load();
        player.play();
    }).catch(function (error) {
        console.log(error);
    });


    Donnie Kerr

    Wednesday, February 13, 2019 10:12 PM
  • I keep getting this error in Chrome. Ugh! This is frustrating! It should be easy to do!

    DOMException: Failed to load because no supported source was found


    Donnie Kerr

    Wednesday, February 13, 2019 11:55 PM
  • Sorry for the delay. It didn't notify me that you had responded. The response comes back as binary data. Whatever response format you specified in your request is what you'll want to use when setting the media type for your blob. Once the blob is created, you can access it via a URL. Depending on what you're using for playback, the code to load the audio may differ. What I'm using works in both Firefox and Chrome; I haven't tested it in Edge or IE.
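
    As a rule of thumb, the blob's media type should match the X-Microsoft-OutputFormat header sent to the service. A rough mapping for the two formats used in this thread:

      // Map X-Microsoft-OutputFormat values to playable MIME types.
      var mimeForOutputFormat = {
        "riff-24khz-16bit-mono-pcm": "audio/wav",       // RIFF/PCM is a WAV container
        "audio-16khz-32kbitrate-mono-mp3": "audio/mpeg" // MP3
      };
      var audioBlob = new Blob([xhr.response], { type: mimeForOutputFormat["audio-16khz-32kbitrate-mono-mp3"] });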
    Thursday, February 14, 2019 6:50 PM
  • You'll need to change the X-Microsoft-OutputFormat header. Here's a link to that reference topic:

    https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-apis#text-to-speech-api

    Thursday, February 14, 2019 6:51 PM
  • Mind posting your updated code? I can take a look. 
    Thursday, February 14, 2019 6:52 PM
  • I am using the same output format as you, but mine returned a wav and yours returned mpeg. It is not clear in this doc what format to expect back. No idea where the wav format came from.

    Donnie Kerr

    Thursday, February 14, 2019 7:39 PM
  • this is the client-side that calls the Azure Function api ...

    note: I was playing around with the blob type, but neither plays. There is also an issue where Chrome blocks player.play(), so the promise stuff is a workaround. I can't tell if the problem is the format of the blob URL or this blocking issue ("no supported source found" might just mean it is being blocked).

            
    let data = {
        text: this.$refs.text.value
    }

    axios({
        method: 'post',
        url: apiurl,
        data: data,
        responseType: "blob"
    }).then(function (response) {
        let player = document.getElementById("player");
        let audioBlob = new Blob([response], { type: "audio/mpeg" });
        let audioURL = window.URL.createObjectURL(audioBlob);
        player.crossOrigin = 'anonymous';
        player.src = audioURL;
        player.load();
        //player.play();
        var promise = player.play();

        if (promise !== undefined) {
            promise.catch(error => {
                if (error) {
                    alert(error)
                }
            }).then(() => {
                // Auto-play started
                player.onplaying = function () {
                    alert("Sandy is speaking");
                };
            });
        }
    }).catch(function (error) {
        console.log(error);
    });
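
    One thing worth double-checking in the snippet above: axios resolves with a response object, not the raw body, so the Blob should probably wrap response.data rather than response:

      // axios puts the payload on response.data; wrapping the whole response
      // object in a Blob yields an unplayable file.
      let audioBlob = new Blob([response.data], { type: "audio/mpeg" });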
      

    this is the part of the code in the Azure Function that makes the api call ...

    let xml_body = xmlbuilder.create('speak')
        .att('version', '1.0')
        .att('xml:lang', 'en-us')
        .ele('voice')
        .att('xml:lang', 'en-us')
        .att('name', 'Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)')
        .txt(text)
        .end();
    let xmlbody = xml_body.toString();
    context.log(xmlbody);

    let audioResponse = await axios({
        method: 'post',
        url: 'https://eastus.tts.speech.microsoft.com/cognitiveservices/v1',
        data: xmlbody,
        headers: {
            'Authorization': 'Bearer ' + sandyToken,
            'cache-control': 'no-cache',
            'User-Agent': 'tts',
            'X-Microsoft-OutputFormat': 'riff-24khz-16bit-mono-pcm',
            'Content-Type': 'application/ssml+xml'
        }
    })

    sendtoclient = audioResponse.data;


    Donnie Kerr

    Thursday, February 14, 2019 7:46 PM

  • [screenshot attachment: the TTS response format and the client-side console output in Chrome; image not preserved]

    Donnie Kerr

    Friday, February 15, 2019 3:23 PM
  • The blue is the format of the response I receive from text-to-speech. The red is the same response logged to the console on the client-side in Chrome.

    When this blob is passed to the audio component to play, nothing happens. It doesn't play, and sometimes I get the DOM exception "No supported source was found."

    It appears the x-wav format I'm getting isn't compatible, or gets corrupted. And I'm still confused about why Microsoft is returning a wav response. I still don't see any way to specify the file format in the request, so I'm not sure how Erik is getting mp4/mpeg.

    I'd rather not have to figure out how to convert it from a wav blob to mpeg. I'd rather figure out how to get mp4 in the first place.


    Donnie Kerr

    Friday, February 15, 2019 3:31 PM
  • I switched the format to mp3, but the blob URL still doesn't play / is still being blocked by Chrome and Safari.

    audio-16khz-32kbitrate-mono-mp3

    Donnie Kerr

    Friday, February 15, 2019 6:46 PM
  • Can't seem to overcome the blocking issue related to calling audio.play(). I even get this error when the user is required to click a button before play is called, so it appears to be a browser support issue with the audio component and play().

    Unhandled Promise Rejection: NotAllowedError: The request is not allowed by the user agent or the platform in the current context, possibly because the user denied permission. 
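
    For what it's worth, browsers generally allow play() only in direct response to a user gesture, and Safari wants the call made synchronously inside the gesture handler; kicking off an async request on click and calling play() in its callback can still be rejected. One pattern that sidesteps this (a sketch; the element IDs and fetch helper are hypothetical): fetch the audio first, then start playback from a dedicated click.

      // Fetch the audio up front, then let an explicit click start playback,
      // so play() runs synchronously inside the gesture handler.
      fetchSpeechBlobUrl().then(function (url) {        // hypothetical helper returning an object URL
        var player = document.getElementById("player"); // hypothetical <audio> element
        player.src = url;
        var btn = document.getElementById("play-btn");  // hypothetical button
        btn.disabled = false;
        btn.addEventListener("click", function () {
          player.play().catch(function (err) { console.error(err); });
        });
      });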


    Donnie Kerr

    Friday, February 15, 2019 7:46 PM
  • Well, I gave up on passing a blob/object URL to the audio component, because none of the browsers will play it.

    I decided to try saving the speech mp3 to blob storage and returning the URL to the client, hoping it would play from an actual URL. Unfortunately, it still doesn't play in the browser.

    I also tried downloading the mp3 from blob storage, but it won't even load into QuickTime Player. The speech file seems corrupted.

    Just trying everything, because there is no official Text-To-Speech documentation on how to play back the speech.

    Here is my Node.js Azure Function code that receives the text and converts it using the EXACT code sample from Microsoft. I then add the save-to-blob-storage part.

    Here is an example of the resulting URL, so you can try to play it and see that it doesn't work.

    https://sandyspeaksstorage.blob.core.windows.net/sandy-speaks/e5f3613d-31f1-41f5-beba-08443ec7aa17.mp3

    module.exports = function(context, req) {
    
    const xmlbuilder = require('xmlbuilder');
    const request = require('request');
    const fs = require('fs');
    const {
      Aborter,
      BlobURL,
      BlockBlobURL,
      ContainerURL,
      ServiceURL,
      StorageURL,
      SharedKeyCredential,
      uploadStreamToBlockBlob
    } = require("@azure/storage-blob");
    const account = "xxx";
    const accountKey = "xxx";
    const sharedKeyCredential = new SharedKeyCredential(account, accountKey);
    const pipeline = StorageURL.newPipeline(sharedKeyCredential);
    const serviceURL = new ServiceURL(`https://${account}.blob.core.windows.net`,pipeline);
    const containerName = `sandy-speaks`;
    const containerURL = ContainerURL.fromServiceURL(serviceURL, containerName);
    const subscriptionKey = "eaf9b09c7eea4809b001cae2bf5f47b7" ;
    
    let text = req.body.text; //from http request
    
    function textToSpeech(subscriptionKey, saveAudio) {
        let options = {
            method: 'POST',
            uri: 'https://eastus.api.cognitive.microsoft.com/sts/v1.0/issueToken',
            headers: {
                'Ocp-Apim-Subscription-Key': subscriptionKey
            }
        };
        
        function getToken(error, response, body) {
            context.log("Getting your token...\n")
            if (!error && response.statusCode == 200) {
                
                saveAudio(body)
            }
            else {
              context.log(error);
            }
        }
        request(options, getToken)
    }
    
    function saveAudio(accessToken) {
       
        let xml_body = xmlbuilder.create('speak')
          .att('version', '1.0')
          .att('xml:lang', 'en-us')
          .ele('voice')
          .att('xml:lang', 'en-us')
          .att('name', 'Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)')
          .txt(text)
          .end();
       
        let body = xml_body.toString();
    
        let options = {
            method: 'POST',
            baseUrl: 'https://eastus.tts.speech.microsoft.com/',
            url: 'cognitiveservices/v1',
            headers: {
                'Authorization': 'Bearer ' + accessToken,
                'cache-control': 'no-cache',
                'User-Agent': 'tts',
                'X-Microsoft-OutputFormat': 'audio-16khz-32kbitrate-mono-mp3',
                'Content-Type': 'application/ssml+xml'
            },
            body: body
        };
        
        function convertText(error, response,body){
         
          if (!error && response.statusCode == 200) {
            context.log("Converting text-to-speech. Please hold...\n")
          }
          else {
            context.log(error);
          }
          context.log("Your file is ready.\n")
          
          //save to blob storage
             const uuidv4 = require('uuid/v4'); 
            let filename = uuidv4()+'.mp3';
            context.log(filename)
            const blobName = filename;
            const fileBuffer = body;
            const blobURL = BlobURL.fromContainerURL(containerURL, blobName);
            context.log(blobURL)
            const blockBlobURL = BlockBlobURL.fromBlobURL(blobURL);
            const Stream = require('stream')
            const readableStream = new Stream.Readable()
            readableStream._read = () => {} 
            readableStream.push(fileBuffer)
            readableStream.push(null)
          
            uploadStreamToBlockBlob(
                Aborter.timeout(30 * 60 * 60 * 1000), 
                readableStream,
                blockBlobURL,
                4 * 1024 * 1024,
                20,
                {
                  progress: ev => context.log(ev)
                }
              );
              context.log("uploadStreamToBlockBlob success");  
          //end save to blob storage
          context.done(null,{res:blobURL})
          
        }
        
      request(options, convertText);
        
    }
    
    textToSpeech(subscriptionKey, saveAudio);
    
    }
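
    Two details in the code above could explain the corrupted file (assumptions worth verifying): the request library decodes the response body as a UTF-8 string unless encoding: null is passed, which mangles binary audio, and uploadStreamToBlockBlob returns a promise, so context.done can fire before the upload finishes. A sketch of both fixes:

      // In the TTS request options: get the body back as a Buffer.
      let options = {
          method: 'POST',
          baseUrl: 'https://eastus.tts.speech.microsoft.com/',
          url: 'cognitiveservices/v1',
          encoding: null, // request now yields a Buffer instead of a utf8 string
          headers: { /* same headers as above */ },
          body: body
      };

      // In convertText: wait for the upload before completing the function.
      async function convertText(error, response, body) {
          // ... build readableStream from the Buffer exactly as above ...
          await uploadStreamToBlockBlob(/* same arguments as above */);
          context.done(null, { res: blobURL });
      }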


    Donnie Kerr


    • Edited by DonnieKerr Saturday, February 16, 2019 7:39 PM
    Saturday, February 16, 2019 7:37 PM
  • This works....

        fetch(
          "https://australiaeast.tts.speech.microsoft.com/" +
            "cognitiveservices/v1",
          {
            method: "POST",
            headers: {
              Authorization: "Bearer " + this.accessToken,
              "cache-control": "no-cache",
              "User-Agent": "tts",
              // "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
              "X-Microsoft-OutputFormat": "audio-16khz-32kbitrate-mono-mp3",
              "Content-Type": "application/ssml+xml"
            },
            //   responseType: 'stream',
            body: body
          }
        )
          .then(response => {
            return response.arrayBuffer();
          })
          .then(data => {
            debugger;
            let blob = new Blob([data], { type: "audio/mpeg" });
            FileUtil.downloadBlob(blob, "test.mp3");
          })
          .catch(function(err) {
            alert(err);
          });
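
    For in-page playback instead of a download, the FileUtil.downloadBlob step can be swapped for an object URL (a sketch; assumes an <audio id="player"> element on the page):

        .then(data => {
          let blob = new Blob([data], { type: "audio/mpeg" });
          let player = document.getElementById("player");
          player.src = URL.createObjectURL(blob);
          return player.play(); // a rejected promise here means playback was blocked
        })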

    • Proposed as answer by Vaibhav T Friday, December 27, 2019 1:56 PM
    Monday, October 28, 2019 1:13 AM
  • Thanks mate! This works.

    The mistake was in my own code.

    Friday, December 27, 2019 1:57 PM