Speech To Text

Here is a report on various ways to make your computer recognize speech.

 


    Voice Recognition

    This article makes a good point

    Apple set the bar when it bought Siri and included it in the iPhone 4S launch in 2011. As its promotional videos show, Siri makes the iPhone a personal assistant by combining voice recognition with task automation.

    Using voice alone, Siri enables users to compose and send text messages and emails, schedule meetings, ask for directions, set reminders, and so on.

    Additionally, Siri added semantic technology so it understands search requests spoken in plain English, such as "What is the largest city in Texas?"

    Google has voice recognition built-in.

    Microsoft Voice Recognition

    The Windows Speech Recognition (WSR) introduced with Windows Vista required users to write their own macros, so it was mostly used by very advanced users.

    After Microsoft bought the "voice portal" company TellMe in 2007, Microsoft put voice command technology in Windows Phone 7 and 8, but not on desktops.

    Intel worked with Nuance to bring the technology behind Dragon NaturallySpeaking and the Sync system in Ford cars to the new Dragon Assistant app for Windows Ultrabooks.

    Poor Man's Speech to Text

    Many laptops come with a microphone built in.

    The computer can be made to recognize your voice if you are willing to jump among 3 programs: Sound Recorder, Windows Explorer, Run command window.

    The Run command window would run a custom batch file. The steps are:

    1. Create a sound file. Windows desktop users can use SoundRecorder.exe. Its default format is .wma. To save .wav files instead, rather than going to Windows Start > All Programs > Accessories, go to Windows Start and type in the Search box:

      soundrecorder /file outputfile.wav
      

      The Sound Recorder GUI says there is a maximum length of 60 seconds. But this video shows a workaround: after the initial 60 seconds is reached, select from the menu Edit > Copy, then File > New. Click No to the pop-up. Then select Edit > Paste Insert to add an extra 60 seconds to the Length, repeating Paste Insert for each further 60 seconds.

      After recording is stopped, by default files are named outputfile.wav and saved to folder C:\Users\%USERNAME%\Documents.

      Ideally, the program would chunk the sound rather than sending a long file.

    2. Run a command to obtain text from the speech file.
    3. Format response to strip out meta data so only the speech text shows.
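    The chunking idea mentioned in step 1 can be sketched in Python using the standard wave module. This is only an illustration, not part of the batch file approach described here: the function name split_wav, the 10 second default, and the output naming scheme are all assumptions.

```python
import wave


def split_wav(src_path, chunk_seconds=10, prefix="chunk"):
    """Split a WAV file into consecutive pieces of at most chunk_seconds each.

    Returns the list of file names written (hypothetical naming scheme:
    <prefix>_000.wav, <prefix>_001.wav, ...).
    """
    out_paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the recording
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

    Each piece could then be sent as its own request, at the cost of possibly cutting a word at a chunk boundary.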

    AT&T Speech to Text

    AT&T's speech capabilities have been improving for decades. Available from the Developer Portal is the SDK ATTSpeechKit.zip for iOS and Android.

    AT&T's Watson℠ speech engine differentiates itself from competitors by offering a robust library of speech contexts optimized for specific applications.

    One of these is specified in requests as the value for header X-SpeechContext.

    BLAH: Unfortunately, one cannot tweak the voice recognition algorithms to improve accuracy for individual users.

    OAuth 2.0 Authentication

    AT&T's Speech to Text web service works over all networks and uses OAuth for access authentication. This is demonstrated by a batch command file that invokes cURL (the example uses Unix-style \ line continuations; a Windows batch file uses ^ instead).
    TOOL: cURL.exe (the build with SSL support, for https) emulates a web browser from within a Run command window.

    curl "https://api.att.com/oauth/token" \
        --insecure \
        --header "Accept: application/x-www-form-urlencoded" \
        --header "Content-Type: application/x-www-form-urlencoded" \
        --data "client_id=YOUR_APP_KEY&client_secret=YOUR_APP_SECRET&grant_type=client_credentials&scope=SPEECH" \
        --proxy "https://proxy.if.you.use.one.com:8080"
    

    --insecure tells cURL to skip verification of the server's SSL certificate, so HTTPS connections succeed even when the certificate cannot be validated.

    TIP: I've found that -k is needed to make the request work.

    Even when certificate verification is skipped, each request carries your own values for YOUR_APP_KEY and YOUR_APP_SECRET, which are issued when a registered developer registers the app.

    The response includes a refresh token for use when the original access token expires. Since the expiry_on field is 0, the token won't expire. Nevertheless, it is still wise to add logic to handle the expiry case, in case the access token expiry policy changes in the future (AT&T would probably change it without advance notice, most likely during a major release). Using the refresh token to obtain a new access token is very similar to the original token request. The payload would look like this:

    client_id=ABCDEF0123456789ABCDEF0123456789&client_secret=ABCDEF0123456789&grant_type=refresh_token&refresh_token=ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789
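    The two payloads differ only in the grant type and its extra field, which a short Python sketch makes visible. The function names here are hypothetical; the field names and the SPEECH scope are taken from the cURL example and the payload above. Nothing is sent over the network.

```python
from urllib.parse import urlencode

TOKEN_URL = "https://api.att.com/oauth/token"  # endpoint from the cURL example


def client_credentials_payload(app_key, app_secret, scope="SPEECH"):
    """Form-encoded body for the initial access token request."""
    return urlencode({
        "client_id": app_key,
        "client_secret": app_secret,
        "grant_type": "client_credentials",
        "scope": scope,
    })


def refresh_payload(app_key, app_secret, refresh_token):
    """Form-encoded body for exchanging a refresh token for a new access token."""
    return urlencode({
        "client_id": app_key,
        "client_secret": app_secret,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    })
```

    Either string would be posted to TOKEN_URL with Content-Type application/x-www-form-urlencoded, as in the cURL command.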

    Requests

    The Access_Token returned during authentication is inserted in every subsequent client request. Each request sends an audio file to the server and gets text back, as demonstrated by this batch command:

    curl "https://api.att.com/rest/2/SpeechToText" \
        --insecure \
        --request POST \
        --header "Authorization: Bearer <Access_Token>" \
        --header "Accept: application/json" \
        --header "Content-Length: 3444" \
        --header "X-SpeechContext: Generic" \
        --header "Content-type: audio/wav" \
        --data-binary "@<audio_file>" \
        --proxy "https://proxy.if.you.use.one.com:8080"
    

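    The content length (3444 in the example above) is the size of the audio file in bytes. On Windows, the dir command lists it, and cURL also sets Content-Length itself from the --data-binary file when the header is omitted.

    Alternatively, "Transfer-Encoding: chunked" specifies using the standard HTTP 1.1 streaming mechanism in 512 bit chunks.

    --data-binary specifies the HTTP body, which contains only audio data.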

    The Speech API does not support audio data nested in MIME multipart documents. MIME is the format used by file uploads in HTML forms.
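    As a rough Python sketch of how the request headers fit together, the function below assembles them for one audio file and takes the content length from the file size. The function name and its parameters are assumptions for illustration; the header names, URL, and the Generic context come from the cURL example. It builds the headers only and sends nothing.

```python
import os

SPEECH_URL = "https://api.att.com/rest/2/SpeechToText"  # from the cURL example


def speech_request_headers(access_token, audio_path,
                           content_type="audio/wav",
                           context="Generic", chunked=False):
    """Return the HTTP headers for posting one raw audio file."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
        "X-SpeechContext": context,
        "Content-Type": content_type,
    }
    if chunked:
        # Streaming: the body length is not declared up front.
        headers["Transfer-Encoding"] = "chunked"
    else:
        # Fixed length: the size of the audio file in bytes.
        headers["Content-Length"] = str(os.path.getsize(audio_path))
    return headers
```

    The body of the request would then be the unmodified bytes of the audio file, with no multipart wrapper.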

    Multiple Requests

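    Several cURL requests can be issued from within a batch command file by looping over a set of files (inside a batch file the loop variable is written %%f; typed directly at the prompt it is %f):

    for %%f in (*.xml) do curl -F importDataFile=@%%f http://...ImportServlet 1>%%f.out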
    

    Individual commands are expanded thus for "some.xml":

    curl -F importDataFile=@some.xml http://...ImportServlet 1>some.xml.out
    

    Rather than .wav, ATT prefers the .amr (Adaptive Multi-Rate) codec developed by Ericsson, which uses the Algebraic Code Excited Linear Prediction (ACELP) algorithm designed to efficiently compress human speech audio recordings on 3G cell phones for MMS (Multi-Media Messaging).

    More specifically, AMR narrowband, 12.2 kbit/sec, 8 kHz sampling.

    Get from Github the sample C# RESTful web program as run from here.

    The response looks like this:

    {"Recognition":{"Status":"OK","ResponseId":"71b9410cd5259ec81e49abf892eabd44","NBest":[{"WordScores":[0.58,0.05],"Confidence":0.315,"Grade":"confirm","ResultText":"You ...","Words":["You","..."],"LanguageId":"en-US","Hypothesis":"you um"}]}}

    To strip out meta data and only present the ResultText:
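    One way to do this is a few lines of Python, shown here as a minimal sketch. It assumes the key names are Recognition, Status, NBest and ResultText, with no embedded spaces, and the function name result_texts is hypothetical.

```python
import json


def result_texts(response_body):
    """Return the ResultText of every hypothesis in a recognition response.

    Returns an empty list when the status is not "OK".
    """
    recognition = json.loads(response_body).get("Recognition", {})
    if recognition.get("Status") != "OK":
        return []
    return [hyp.get("ResultText", "") for hyp in recognition.get("NBest", [])]
```

    A caller that wants only the top hypothesis would take the first element of the returned list, after checking that the list is not empty.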

    Reformatting Speech Files

    The ATT Speech API recognizes several file formats: { ".amr", "audio/amr" }, { ".wav", "audio/wav" }, {".awb", "audio/amr-wb"}, {".spx", "audio/x-speex"}
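    These pairs map naturally onto a lookup table for choosing the Content-type header from a file name. The following Python sketch uses only the four pairs listed above; the function name content_type_for is an assumption.

```python
import os

# Extension to MIME type pairs, as listed above.
CONTENT_TYPES = {
    ".amr": "audio/amr",
    ".wav": "audio/wav",
    ".awb": "audio/amr-wb",
    ".spx": "audio/x-speex",
}


def content_type_for(path):
    """Return the Content-type value for an audio file, by extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return CONTENT_TYPES[ext]
    except KeyError:
        raise ValueError(f"unsupported audio format: {ext or path}")
```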

    WARNING: The ATT Speech API only processes wav (Microsoft Waveform) files containing 16-bit linear PCM, not 8-bit. Other encodings produce errors such as:

    "Text": "audiostream-wav: only pcm linear 16 supported (found pcm linear 8)",
    "Text": "RIFF/WAVE coding [85] is not ALAW, ULAW or PCM",

    A .wav file can optionally contain a RIFF header in addition to raw PCM (Pulse Code Modulation) audio bits with sampling (Project) rate of 8kHz (8000 Hz) or 16kHz.

    WARNING: The ATT API requires format to be mono (not stereo).
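    Before uploading, a .wav file can be checked locally against these constraints (mono, 16-bit PCM, 8 kHz or 16 kHz). The Python sketch below uses the standard wave module; the function name wav_problems and the wording of its messages are assumptions, and the limits are the ones stated on this page rather than an authoritative list.

```python
import wave


def wav_problems(path):
    """Return a list of reasons the file may be rejected; empty if none found.

    wave.open itself raises wave.Error for files that are not
    uncompressed PCM, so those never reach the checks below.
    """
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels (mono required)")
        if w.getsampwidth() != 2:
            problems.append(f"{8 * w.getsampwidth()}-bit samples (16-bit required)")
        if w.getframerate() not in (8000, 16000):
            problems.append(f"{w.getframerate()} Hz (8000 or 16000 expected)")
    return problems
```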

    TOOL: sox from SourceForge can reformat to mono.

    TOOL: Cool Edit from www.syntrillium.com can change both non-audio data and PCM bits in a .wav file.

    NOTE: AMR files used by Nokia and Ericsson phones contain a "#!AMR\n" header. There is also a 3gpp standard AMR format. http://www.connactivity.com/~eaw/amrwork/ published a python script to convert between them.
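    Telling the two AMR variants apart only requires reading the first six bytes of the file. A minimal Python sketch, with a hypothetical function name:

```python
AMR_MAGIC = b"#!AMR\n"  # header used by the phone-style .amr files noted above


def has_amr_header(path):
    """Return True if the file starts with the "#!AMR\\n" header."""
    with open(path, "rb") as f:
        return f.read(len(AMR_MAGIC)) == AMR_MAGIC
```

    A file for which this returns False may be in the other AMR format or may not be AMR at all; the check does not distinguish those cases.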

    Conversion to .wav is necessary to edit an .amr file:

    ffmpeg -i josie-ring.amr josie-ring.wav
    

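    After editing, convert the wave file to a full size music mp3 (the bitrate needs the k suffix, since current ffmpeg versions read a bare 128 as 128 bits per second):

    ffmpeg -i josie-ring.wav -ab 128k -ac 2 -ar 44100 josie-ring.mp3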
    


    AT&T Speech to Text

    AT&T's speech engine is powered by Watson.

    In-line hints are pre-pended to voice data.

    In-line grammar sends in whole grammar set.

    PLS (Pronunciation Lexicon Specification)

    SRGS (Speech Recognition Grammar Specification)

    Demo from ivee (talking alarm clocks) Jonathan David Ger

    Brent

