My recent experience with voice control on the Pi got me to thinking. Why wasn't this a rising star and constantly being talked about? The idea of talking to your house is so compelling that there must be hundreds of implementations out there.
Well, there are, and none of them work very well.
I described my experience with Jasper; it just didn't live up to the hype. So I went looking and did some experimenting. Everyone talks about how good Siri is, but my experience with it is far less than stellar; every phone I've tried it on misunderstands me about six times out of ten. Google's implementation seems to work better, giving me about an 80% success rate. Both of these are stellar compared to the several software packages I tried out, with the absolute worst being CMU Sphinx, the engine Jasper is based on.
Remember, I'm looking at this as a way to control my house with a little computer, not to dictate letters while wearing a headset, so let me talk a bit about methods. No, I'm not going to bore the heck out of you with a dissertation on the theory of voice recognition; I want what everyone else wants: I want it to work. There are basically two methods of doing speech recognition right now: local and distributed. By local I mean totally on one machine; distributed means the sound is sent over the internet and decoded somewhere else. Google's voice API is an example of distributed, and CMU Sphinx is an example of local.
What we all want is for it to operate like Star Trek:
"Computer."
Nice clear beep
"Turn on the porch lights"
Nice clear acknowledgement, and maybe a, "Porch light is now on."
I went through the entire process of bringing up CMU Sphinx <link>, and when I tried it, I got back something on the order of, "Burn under the blight." To be fair, Sphinx can be trained and its accuracy will shoot way up, but that takes considerable effort and time; the default recognition files just don't cut it. Especially when I tried the same phrase through Google's voice interface and got 100%, yes totally accurate, results. The problem with Google's interface is that it only works in the Chrome browser. Yes, there are tools out there that use the Google voice API, notably VoiceCommand by Steve Hickson <link>, but expect them to quit working soon. Google has ended version 2 of the interface, and version 3 limits how many requests can be made and requires a special key. Thus ends a really cool possibility; I hope they bring it back soon.
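For what it's worth, there is a middle ground short of full training: clamp down on what Sphinx is allowed to recognize. A command vocabulary of a dozen phrases is a lot easier to get right than all of English. Here's a minimal sketch, assuming the pocketsphinx Python bindings; the model paths and the commands.gram grammar file are my own stand-ins, not anything from my actual setup:

from pocketsphinx import Decoder

# commands.gram contains something like:
#   #JSGF V1.0;
#   grammar commands;
#   public <command> = turn (on | off) the porch light;

config = Decoder.default_config()
config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')        # acoustic model
config.set_string('-dict', '/usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict')
config.set_string('-jsgf', 'commands.gram')  # restrict recognition to the tiny grammar

decoder = Decoder(config)
decoder.start_utt()
with open('porch_light_on.raw', 'rb') as f:  # 16 kHz, 16-bit, mono PCM
    decoder.process_raw(f.read(), False, True)
decoder.end_utt()

if decoder.hyp() is not None:
    print decoder.hyp().hypstr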
So, the local possibilities are inaccurate, and the distributed ones are accurate, but the one everyone was using is likely to disappear. There are other distributed solutions; I brought up code taken from Nexiwave <link> and tested it. There was darn near a 100% success rate. The problem was delay. Since I was using a free account, I was shuffled to the bottom of the queue (correctly and expectedly), so the response took maybe three seconds to come back. Now, three seconds seems like a small price to pay, but try it out with a watch to see how uncomfortable that feels in real use. It's not that Nexiwave is slow; it's that the doggone internet takes time to send data and get back a response. I didn't open a paid account to see if it was any better; this was just an experiment.
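By the way, you don't have to count seconds in your head; it's easy to put a stopwatch around the request itself. A quick sketch (the URL here is a stand-in, not the real Nexiwave endpoint):

import time
import requests

start = time.time()
# Stand-in URL; substitute the real service endpoint
r = requests.post('https://example.com/transcribe',
                  files={'mediaFileData': open('porch_light_on.wav', 'rb')})
elapsed = time.time() - start

print "Round trip took %.2f seconds" % elapsed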
But, think about it a bit. "Computer," one thousand and one, one thousand and two, one thousand and three, "Yes." Then the command, "Turn on the porch light," and so on. It would be cool and fun to show off, but do you really want to do it that way? Plus, it would require the software to run continuously to catch the occasional "Computer" command initiation. Be real: if you're going to have to push a button to start a command sequence, you might as well push a button to do the entire action. Remember, you have to have a command initiator, or something like, "Hey Jeff, get your hand out of the garbage disposal, it could turn on," could be a disaster. A button somewhere labeled "Garbage Disposal" would be much simpler and safer.
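And wiring that button is about the easiest thing you can do on a Pi. A minimal sketch, assuming the RPi.GPIO package and a normally open button between BCM pin 17 and ground; the pin number and the switch_disposal() stub are just placeholders:

import RPi.GPIO as GPIO

BUTTON_PIN = 17  # any free GPIO; button wired from this pin to ground

def switch_disposal():
    # Placeholder for whatever actually drives the relay
    print "Garbage disposal toggled"

GPIO.setmode(GPIO.BCM)
GPIO.setup(BUTTON_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)  # internal pull-up; pressed reads LOW

try:
    while True:
        GPIO.wait_for_edge(BUTTON_PIN, GPIO.FALLING)  # block until the button is pressed
        switch_disposal()
finally:
    GPIO.cleanup()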
Don't talk to me about Dragon NaturallySpeaking from Nuance <link>. That tool is just unbelievable. It is capable of taking dictation at full speed with totally amazing accuracy, but it only runs on machines much larger than a Pi, and not at all under Linux. Even their development version is built for Windows server machines. Microsoft has a good speech recognition system built right into the OS, and under Windows 8 it is incredible, especially at no additional cost. But there aren't many Raspberry Pi machines running Windows 8.
Thus, I don't have a solution. The most compelling one was Nexiwave, but the delays are annoying and I don't think it would work out long term. Here's the source I used to interface with it:
I took this directly from their site and posted it here because it is hard to find, and I don't think they care if I advertise for them. All I did to make it work was sign up for a free account and enter my particulars in the fields up at the top. It worked on the first try; a simple and easy interface. It would be relatively easy to adapt this to a voice control system on my Pi if I decided to go that way. Which I may do for control in the dark of my bedroom, where I don't want to search for a remote that may be behind the side table.

#!/usr/bin/python
# Copyright 2012 Nexiwave Canada. All rights reserved.
# Nexiwave Canada PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
import sys, os, json, urllib2, urllib, time
# You will need python-requests package. It makes things much easier.
import requests
# Change these:
# Login details:
USERNAME = "user@myemail.com"
PASSWORD = "XYZ"
def transcribe_audio_file(filename):
    """Transcribe an audio file using Nexiwave"""
    url = 'https://api.nexiwave.com/SpeechIndexing/file/storage/' + USERNAME + '/recording/?authData.passwd=' + PASSWORD + '&auto-redirect=true&response=application/json'
    # To receive transcript in plain text, instead of html format, comment this line out (for SMS, for example)
    url = url + '&transcriptFormat=html'
    # Ready to send:
    sys.stderr.write("Send audio for transcript with " + url + "\n")
    r = requests.post(url, files={'mediaFileData': open(filename, 'rb')})
    data = r.json()
    transcript = data['text']
    # Perform your magic here:
    print "Transcript for " + filename + "=" + transcript

if __name__ == '__main__':
    # Change this to your own
    filename = "/data/audio/test.wav"
    transcribe_audio_file(filename)
The audio file I sent was my usual, "Porch light on," and it decoded it exactly on the first try. I tried a few others, and they all worked equally well. Which brings up another item: sound on the Raspberry Pi. Frankly, unless you're dealing with digital files and streams, it sucks. There isn't enough filtering on the Pi to keep audio hum out of things. The amplified speakers I was using had a constant low-level hum (regular ol' 60 hertz hum), and it would get into the audio captured from the USB microphone as well. This could have been reduced by an expensive power supply with very good filtering, or maybe not; I didn't try.
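If you want to try this yourself, one way to grab a clip from a USB microphone is to let ALSA's arecord do the work and call it from Python. A sketch, assuming the mic shows up as card 1 (check with arecord -l); the device name and the three-second duration are assumptions:

import subprocess

# Record 3 seconds of 16 kHz, 16-bit, mono audio from the USB mic (card 1, device 0)
subprocess.call(['arecord', '-D', 'plughw:1,0',
                 '-f', 'S16_LE', '-r', '16000', '-c', '1',
                 '-d', '3', 'porch_light_on.wav'])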
To add insult to an already injurious process, ALSA (Advanced Linux Sound Architecture) is the single most confusing sound implementation I've ever seen. It was constructed by sound purists and technology students, so it is filled with special cases, odd syntax, devices that mostly work, and so on. The documentation is full of 'try this'. What? I love experimenting, but I sort of like documentation that actually has information in it. PulseAudio is another possibility, but I'll approach that some other time; maybe a few weeks after hell freezes over. ALSA was bad enough. But if you're going to experiment with sound under Linux, you'll have to deal with ALSA at some point. Especially if you actually want to turn the volume up or down.
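Case in point, the volume. If you'd rather not decode amixer's syntax, the pyalsaaudio package at least makes it one line of Python. A sketch, assuming your card exposes a control named 'Master'; on USB audio it's often 'PCM' or 'Mic' instead:

import alsaaudio

mixer = alsaaudio.Mixer('Master')  # control name varies by card; try 'PCM' or 'Mic'
print "Volume is", mixer.getvolume()
mixer.setvolume(75)                # percent, 0-100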
I think I'm going to do some research on remote control ergonomics. There's got to be a cool and actually useful way to turn on the porch lights.