Someone asked me about voice recognition the other day, so I thought it sounded like a fun little project to master. Here’s my go at it.
Unfortunately, I didn’t have much luck with it. Please post a comment if you get more working than I did.
Which Package to Install
I did a little research, and found that Wikipedia has a nice list of open source speech recognition programs. While it’s not huge, it was a good place to start. I chose Julius because it looked the most promising.
Installing Julius
Since Julius is in the repositories, installing it was easy! I just installed it from the Software Center.
I went a step further and installed the VoxForge acoustic files from the “More Info” screen.
From what I can tell, there is no GUI for Julius (although the Simon project might be a frontend for it). You should find it installed on the command line, though:
$ which julius
/usr/bin/julius
$ julius -help
Running Julius
Looking at the options, my first attempt was:
$ julius -input mic
ERROR: m_chkparam: you should specify at least one LM to run Julius!
The next thing I found was the VoxForge quickstart. I downloaded the tarball, and extracted it:
$ tar -xzvf julius-3.5.2-quickstart-linux.tgz
$ cd julius-3.5.2-quickstart-linux/
$ julius -input mic -C julian.jconf
That was closer, but it gave me this message at the end of all the output:
------
### read waveform input
Stat: adin_oss: device name = /dev/dsp (application default)
Error: adin_oss: failed to open /dev/dsp
failed to begin input stream
Adding padsp in front of the command fixed that problem:
$ padsp julius -input mic -C julian.jconf
I still got warnings though…
### read waveform input
Stat: adin_oss: device name = /dev/dsp (application default)
Stat: adin_oss: sampling rate = 16000Hz
Stat: adin_oss: going to set latency to 50 msec
Stat: adin_oss: audio I/O Latency = 32 msec (fragment size = 512 samples)
STAT: AD-in thread created
<<< please speak >>>Warning: adin_oss: no data fragment after 300 msec?
Warning: adin_oss: no data fragment after 300 msec?
Warning: adin_oss: no data fragment after 300 msec?
If you open the Sound Settings, the warnings go away. I thought that was kind of flaky, but it worked. Unfortunately, the output was a little cryptic, and didn’t give me the feedback that I needed. This is what I got when I said, “Hello”:
pass1_best: <s> DIAL EIGHT
pass1_best_wordseq: 0 3 5
pass1_best_phonemeseq: sil | d ay ax l | ey t
pass1_best_score: -3177.784424
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 13 generated, 13 pushed, 5 nodes popped in 109
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 1.000 0.997 1.000
score1: -3393.694580
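Most of that output is diagnostic. The line that carries the final answer is sentence1, the result of the second pass. As a minimal sketch, assuming the output format shown above, the result lines can be filtered out with sed. The function name julius_sentences is made up for illustration:

```shell
# Print only the final recognition results from Julius output read on stdin.
# Assumes the "sentenceN: <s> WORDS </s>" line format seen in the log above.
julius_sentences() {
    sed -n '/^sentence[0-9]/{
        s/^sentence[0-9]*: *//
        s/<s> *//
        s/ *<\/s>//
        p
    }'
}

# Possible usage (not run here, needs Julius and a microphone):
#   padsp julius -input mic -C julian.jconf | julius_sentences
```

For the log above, this would reduce everything to the single line DIAL OH, which makes it easier to see that Julius heard something other than “Hello”.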
Running from a Recording
I probably could have used Audacity much more easily, but since I was already on the command line, I decided to keep it there with the arecord program. I used this line to record:
$ arecord -r 16000 > test.wav
I played it back, and it sounded kind of rough, but I decided to try it anyway:
$ mplayer test.wav
Next, I ran it through julius:
$ ls test.wav > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf
Unfortunately, mplayer could play the file, but julius could not open it for some reason.
### read waveform input
Error: adin_file: bytes per second != 32000 (16000)
Error: adin_file: error in parsing wav header at test.wav
Error: adin_file: failed to read speech data: "test.wav"
0 files processed
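The numbers in that error are a useful clue. Julius wanted 32000 bytes per second, which is what 16000 samples per second of 16-bit mono audio comes to, while the file header said 16000, or one byte per sample. That would fit arecord falling back to an 8-bit sample format because no -f option was given. The sketch below shows the arithmetic and reads the byte-rate field out of a WAV header. Both function names are invented here, and the header offset assumes the common 44-byte WAV layout:

```shell
# Byte rate = sample rate * channels * bits per sample / 8.
# Arguments: rate_hz channels bits_per_sample
expected_byte_rate() {
    echo $(( $1 * $2 * $3 / 8 ))
}

# Read the byte-rate field from a WAV file.
# Assumes the canonical 44-byte header, where bytes 28-31 hold the
# value in little-endian order; unusual WAV files may differ.
wav_byte_rate() {
    set -- $(od -An -tu1 -j28 -N4 "$1")
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# Possible usage:
#   expected_byte_rate 16000 1 16
#   wav_byte_rate test.wav
```

If the two values disagree, recording with an explicit 16-bit format, for example arecord -f S16_LE -r 16000 -c 1 test.wav, may be enough to avoid the conversion step altogether.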
So, I found an example that used sox to convert it. I had to install sox with apt-get:
$ sudo apt-get install sox
Then, I converted the file and ran it like this:
$ sox test.wav -r 16000 -b 32 -c 1 test.s32
$ ls test.s32 > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf
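For comparison, here is a sketch of a conversion that targets the format the earlier error message asked for (16 kHz, mono, 16-bit, which gives 32000 bytes per second) instead of 32-bit samples. The helper name to_julius_wav is hypothetical, and this has not been checked against this particular Julius version:

```shell
# Convert any file sox can read into a 16 kHz, mono, 16-bit WAV.
# to_julius_wav is a made-up helper name. The sox output options used
# are -r (sample rate), -c (channels) and -b (bits per sample).
to_julius_wav() {
    sox "$1" -r 16000 -c 1 -b 16 "$2"
}

# Possible usage (needs sox installed):
#   to_julius_wav test.wav test16.wav
#   ls test16.wav > test.txt
#   julius -input rawfile -filelist test.txt -C julian.jconf
```

Keeping the .wav extension means the file keeps its header, so Julius can check the format itself instead of having to guess how raw samples are laid out.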
Still, this is the only output that I got:
### Recognition: 1st pass (LR beam)
...........................................................................................................................pass1_best: <s>
pass1_best_wordseq: 0
pass1_best_phonemeseq: sil
pass1_best_score: -2712.263916
### Recognition: 2nd pass (RL heuristic best-first)
WARNING: IW-triphone for word head "l-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "ow-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "t-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "uw-ow+t" not found, fallback to pseudo {ow+t}
WARNING: 00 _default: hypothesis stack exhausted, terminate search now
STAT: 00 _default: 0 sentences have been found
WARNING: 00 _default: got no candidates, search failed
STAT: 00 _default: 147 generated, 147 pushed, 147 nodes popped in 123
<search failed>
------
### read waveform input
1 files processed
I used Audacity to clean up the file. The Noise Removal improved it somewhat, but it still wasn’t good quality. Here’s the output after that:
### read waveform input
Stat: adin_file: input speechfile: test.wav
STAT: 30000 samples (1.88 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
..........................................................................................................................................................................................pass1_best: <s> DIAL OH </s>
pass1_best_wordseq: 0 3 5 1
pass1_best_phonemeseq: sil | d ay ax l | ow | sil
pass1_best_score: -5237.150391
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 27 generated, 27 pushed, 5 nodes popped in 186
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.978 0.987 1.000
score1: -5225.757324
------
### read waveform input
1 files processed
I also tried creating a file from scratch in Audacity, and Julius still didn’t recognize what I said:
### read waveform input
Stat: adin_file: input speechfile: test.wav
STAT: 21176 samples (1.32 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
..................................................................................................................................pass1_best: <s> DIAL OH
pass1_best_wordseq: 0 3 5
pass1_best_phonemeseq: sil | d ay ax l | ow
pass1_best_score: -3417.226318
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 23 generated, 23 pushed, 5 nodes popped in 130
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.911 1.000 1.000
score1: -3453.692871
------
### read waveform input
1 files processed
Running on YouTube Videos
Next, I wanted to try recognizing speech from a good recording. So, let’s find a good YouTube video to run through Julius.
I tried clive, but it failed for some reason:
$ sudo apt-get install clive
$ clive -cnrf best http://www.youtube.com/watch?v=dePLd9HAYjQ
fetch http://www.youtube.com/watch?v=dePLd9HAYjQ ...done.
error: no match: `(?-xism:url_encoded_fmt_stream_map=(.*?)&)'
So, I went back to my tried and true Video Downloader Firefox extension. Here is the first video that I tried:
For God So Loved The World (song and hymn history)
I converted the flv file to a wav like this:
$ ffmpeg -i youtube.flv -vn -acodec pcm_s16le -ar 16000 -ac 1 -f wav test.wav
And, I ran it through Julius like this:
$ ls test.wav > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf
The end result was a segmentation fault!
I tried another one: Psalm 119 King James Holy Bible
This one also gave me a segmentation fault.
Another: Job 41 (King James Holy Bible)
This one gave me this message:
....trace_backptr: sentence length exceeded ( > 150)
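The “sentence length exceeded ( > 150)” message suggests that Julius ran into a built-in limit on how long one recognized sentence can get, which seems plausible when several minutes of audio are fed in as a single utterance. One workaround that might help is to cut the recording into short pieces first. The sketch below uses ffmpeg’s segment muxer for that. The function name, the chunk_NNN.wav and chunks.txt file names, and the 10-second default are all assumptions made for illustration, and it is not verified that this avoids the segmentation faults:

```shell
# Cut a long WAV into fixed-length pieces and write a file list for Julius.
# Assumptions: ffmpeg is built with the segment muxer; chunk_NNN.wav and
# chunks.txt are made-up names; 10 seconds is an arbitrary default.
split_for_julius() {
    input=$1
    seconds=${2:-10}
    ffmpeg -loglevel error -i "$input" -f segment \
        -segment_time "$seconds" -c copy chunk_%03d.wav &&
        ls chunk_*.wav > chunks.txt
}

# Possible usage (needs ffmpeg and Julius):
#   split_for_julius test.wav 10
#   julius -input rawfile -filelist chunks.txt -C julian.jconf
```

Even with shorter files, the quickstart grammar seems to know only a small set of dialing phrases, judging by the DIAL EIGHT and DIAL OH results earlier, so it is unlikely to transcribe a hymn or a psalm correctly.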
VoxForge Example
If you want to play with the VoxForge addon package, you can look at the readme file that should be located here:
/usr/share/doc/julius-voxforge/examples/README
Here are all the files installed with it:
$ dpkg -L julius-voxforge
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/julius-voxforge
/usr/share/doc/julius-voxforge/copyright
/usr/share/doc/julius-voxforge/examples
/usr/share/doc/julius-voxforge/examples/controlapp
/usr/share/doc/julius-voxforge/examples/controlapp/mediaplayer.grammar
/usr/share/doc/julius-voxforge/examples/controlapp/command.py
/usr/share/doc/julius-voxforge/examples/controlapp/mediaplayer.voca
/usr/share/doc/julius-voxforge/examples/controlapp/README.controlapp
/usr/share/doc/julius-voxforge/examples/README
/usr/share/doc/julius-voxforge/examples/sample.grammar
/usr/share/doc/julius-voxforge/examples/sample.voca
/usr/share/doc/julius-voxforge/examples/julian.jconf.gz
/usr/share/doc/julius-voxforge/dict.gz
/usr/share/doc/julius-voxforge/changelog.Debian.gz
/usr/share/julius-voxforge
/usr/share/julius-voxforge/acoustic
/usr/share/julius-voxforge/acoustic/hmmdefs
/usr/share/julius-voxforge/acoustic/macros
/usr/share/julius-voxforge/acoustic/tiedlist
Resources