Speech Recognition – Linux Sagas

Someone asked me about voice recognition the other day, so I thought it sounded like a fun little project to master. Here’s my go at it.

Unfortunately, I didn’t have much luck with it. Please post a comment if you get more working than I did.

Which Package to Install

I did a little research, and found that Wikipedia has a nice list of open source speech recognition programs. While it’s not huge, it was a good place to start. I chose Julius because it looked the most promising.

Installing Julius

Since Julius is in the repositories, installing it was easy! I just installed it from the Software Center.

I went a step further and installed the VoxForge acoustic files from the “More Info” screen.

From what I can tell, there is no GUI for julius (although the Simon project might be a front end for it). You should find it installed on the command line, though:

$ which julius
/usr/bin/julius
$ julius -help

Running Julius

Looking at the options, my first attempt was:

$ julius -input mic
ERROR: m_chkparam: you should specify at least one LM to run Julius!

The next thing I found was the VoxForge quickstart. I downloaded the tarball, and extracted it:

$ tar -xzvf julius-3.5.2-quickstart-linux.tgz
$ cd julius-3.5.2-quickstart-linux/
$ julius -input mic -C julian.jconf

That was closer, but it gave me this message at the end of all the output:


------
### read waveform input
Stat: adin_oss: device name = /dev/dsp (application default)
Error: adin_oss: failed to open /dev/dsp
failed to begin input stream

Adding padsp in front of the command fixed that problem:

$ padsp julius -input mic -C julian.jconf
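
The /dev/dsp failure makes sense on a PulseAudio desktop: /dev/dsp is the old OSS audio device, and padsp is the PulseAudio wrapper that emulates it for a single program. Here is a minimal sketch that adds the wrapper only when the device is missing. The name oss_wrapper is my own invention, not part of Julius or PulseAudio.

```shell
# Print "padsp" when the OSS device is missing, nothing otherwise.
# oss_wrapper is a hypothetical helper; the device path defaults to /dev/dsp.
oss_wrapper() {
    dev=${1:-/dev/dsp}
    if [ ! -e "$dev" ]; then
        echo padsp
    fi
}
```

It could be used as `$(oss_wrapper) julius -input mic -C julian.jconf`, which falls back to the plain julius command on systems that still have /dev/dsp.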

I still got warnings though…


### read waveform input
Stat: adin_oss: device name = /dev/dsp (application default)
Stat: adin_oss: sampling rate = 16000Hz
Stat: adin_oss: going to set latency to 50 msec
Stat: adin_oss: audio I/O Latency = 32 msec (fragment size = 512 samples)
STAT: AD-in thread created
<<< please speak >>>Warning: adin_oss: no data fragment after 300 msec?
Warning: adin_oss: no data fragment after 300 msec?
Warning: adin_oss: no data fragment after 300 msec?

If you open the Sound Settings, the warnings go away. I thought that was kind of flaky, but it worked. Unfortunately, the output was a little cryptic and didn’t give me the feedback that I needed. This is what I got when I said, “Hello”:


pass1_best: <s> DIAL EIGHT
pass1_best_wordseq: 0 3 5
pass1_best_phonemeseq: sil | d ay ax l | ey t
pass1_best_score: -3177.784424
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 13 generated, 13 pushed, 5 nodes popped in 109
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 1.000 0.997 1.000
score1: -3393.694580
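
Most of that output is search bookkeeping. The line that matters is sentence1, the result after the second pass. A small filter makes the feedback easier to read. This is only a sketch, and it assumes that final results always start with sentenceN: as they do in the output above. The name julius_sentences is made up for this example.

```shell
# Keep only the final hypotheses from Julius output on stdin,
# stripping the "sentenceN:" label and the <s> ... </s> markers.
# julius_sentences is a hypothetical helper name.
julius_sentences() {
    sed -n 's/^sentence[0-9]*: *//p' |
        sed -e 's/<s> *//' -e 's/ *<\/s>//'
}
```

Piping the Julius output through julius_sentences should then leave just the recognized words, one result per line.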

Running from a Recording

I probably could have used audacity much more easily, but since I was already on the command line, I decided to keep it there with the arecord program. I used this line to record:

$ arecord -r 16000 > test.wav

I played it back and it sounded kind of rough, but I decided to try it anyway:

$ mplayer test.wav

Next, I ran it through julius:


$ ls test.wav > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf

Unfortunately, mplayer could play the file, but julius could not open it for some reason.


### read waveform input
Error: adin_file: bytes per second != 32000 (16000)
Error: adin_file: error in parsing wav header at test.wav
Error: adin_file: failed to read speech data: "test.wav"
0 files processed
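
To see the number Julius is complaining about, the byte rate can be read straight out of the WAV header. This is a minimal sketch, assuming a canonical 44-byte header where the byte rate sits at offsets 28 to 31 as a little-endian integer. Files with extra chunks before the format data would need a real parser. The name wav_byte_rate is hypothetical.

```shell
# Print the byte rate stored in a canonical 44-byte WAV header.
# Assumption: the "fmt " chunk starts at offset 12, so the byte rate
# occupies offsets 28-31 as a little-endian 32-bit integer.
wav_byte_rate() {
    # Read the four bytes as unsigned decimals into $1..$4.
    set -- $(od -An -tu1 -j28 -N4 "$1")
    # Combine them, least significant byte first.
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

Running `wav_byte_rate test.wav` should print 32000 for 16 kHz, 16-bit, mono audio, which is the value the error message asks for.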

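The error is a clue: 32000 bytes per second is what 16 kHz, 16-bit, mono audio works out to, and my file had only 16000, which suggests 8-bit samples (arecord records 8-bit audio unless you pass -f). So, I found an example that used sox to convert it. I had to install sox with apt-get:

$ sudo apt-get install sox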

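Then, I converted the file and ran it like this. (In hindsight, -b 32 asks sox for 32-bit samples, while the error above suggests Julius wants 16-bit audio, so -b 16 would probably have been the better choice.)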

$ sox test.wav -r 16000 -b 32 -c 1 test.s32
$ ls test.s32 > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf

Still, this is the only output that I got:


### Recognition: 1st pass (LR beam)
...........................................................................................................................pass1_best: <s>
pass1_best_wordseq: 0
pass1_best_phonemeseq: sil
pass1_best_score: -2712.263916
### Recognition: 2nd pass (RL heuristic best-first)
WARNING: IW-triphone for word head "l-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "ow-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "t-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "uw-ow+t" not found, fallback to pseudo {ow+t}
WARNING: 00 _default: hypothesis stack exhausted, terminate search now
STAT: 00 _default: 0 sentences have been found
WARNING: 00 _default: got no candidates, search failed
STAT: 00 _default: 147 generated, 147 pushed, 147 nodes popped in 123
<search failed>
------
### read waveform input
1 files processed

I used audacity to clean up the file. The Noise Removal effect improved it somewhat, but it still wasn’t good quality. Here’s the output after that:


### read waveform input
Stat: adin_file: input speechfile: test.wav
STAT: 30000 samples (1.88 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
..........................................................................................................................................................................................pass1_best: <s> DIAL OH </s>
pass1_best_wordseq: 0 3 5 1
pass1_best_phonemeseq: sil | d ay ax l | ow | sil
pass1_best_score: -5237.150391
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 27 generated, 27 pushed, 5 nodes popped in 186
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.978 0.987 1.000
score1: -5225.757324
------
### read waveform input
1 files processed

I also tried creating a file from scratch in audacity, and I still couldn’t get it:


### read waveform input
Stat: adin_file: input speechfile: test.wav
STAT: 21176 samples (1.32 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
..................................................................................................................................pass1_best: <s> DIAL OH
pass1_best_wordseq: 0 3 5
pass1_best_phonemeseq: sil | d ay ax l | ow
pass1_best_score: -3417.226318
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 23 generated, 23 pushed, 5 nodes popped in 130
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.911 1.000 1.000
score1: -3453.692871
------
### read waveform input
1 files processed
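
As far as I understand the output, cmscore1 holds one confidence value per word in sentence1. The scores stay close to 1.000 even when the answer is wrong, so they seem to measure how well the audio fits the candidates Julius considered rather than whether the result is correct. Here is a sketch that pairs each result with its lowest word score, assuming the sentenceN and cmscoreN lines appear in that order as they do above. The name julius_summary is made up.

```shell
# Print "<lowest cmscore><TAB><sentence>" for each result on stdin.
# Assumption: every cmscoreN line follows its matching sentenceN line.
# julius_summary is a hypothetical helper name.
julius_summary() {
    awk '
        /^sentence[0-9]+:/ {
            sub(/^sentence[0-9]+: */, "")
            sentence = $0
        }
        /^cmscore[0-9]+:/ {
            min = $2
            for (i = 3; i <= NF; i++)
                if ($i + 0 < min + 0) min = $i
            print min "\t" sentence
        }
    '
}
```

Results with a low minimum score would be the first ones to double-check by ear.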

Running on YouTube Videos

Next, I wanted to try a clean, good-quality recording, so I looked for a YouTube video to run through julius.

I tried clive, but it failed for some reason:

$ sudo apt-get install clive

$ clive -cnrf best http://www.youtube.com/watch?v=dePLd9HAYjQ
fetch http://www.youtube.com/watch?v=dePLd9HAYjQ ...done.
error: no match: `(?-xism:url_encoded_fmt_stream_map=(.*?)&)'

So, I went back to my tried and true Video Downloader Firefox extension. Here is the first video that I tried:

For God So Loved The World (song and hymn history)

I converted the flv file to a wav like this:

$ ffmpeg -i youtube.flv -vn -acodec pcm_s16le -ar 16000 -ac 1 -f wav test.wav

And, I ran it through Julius like this:

$ ls test.wav > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf

The end result was a segmentation fault!

I tried another one: Psalm 119 King James Holy Bible

This one also gave me a segmentation fault.

Another: Job 41 (King James Holy Bible)

This one gave me this message:

....trace_backptr: sentence length exceeded ( > 150)

VoxForge Example

If you want to play with the VoxForge addon package, you can look at the readme file that should be located here:

/usr/share/doc/julius-voxforge/examples/README

Here are all the files installed with it:

$ dpkg -L julius-voxforge
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/julius-voxforge
/usr/share/doc/julius-voxforge/copyright
/usr/share/doc/julius-voxforge/examples
/usr/share/doc/julius-voxforge/examples/controlapp
/usr/share/doc/julius-voxforge/examples/controlapp/mediaplayer.grammar
/usr/share/doc/julius-voxforge/examples/controlapp/command.py
/usr/share/doc/julius-voxforge/examples/controlapp/mediaplayer.voca
/usr/share/doc/julius-voxforge/examples/controlapp/README.controlapp
/usr/share/doc/julius-voxforge/examples/README
/usr/share/doc/julius-voxforge/examples/sample.grammar
/usr/share/doc/julius-voxforge/examples/sample.voca
/usr/share/doc/julius-voxforge/examples/julian.jconf.gz
/usr/share/doc/julius-voxforge/dict.gz
/usr/share/doc/julius-voxforge/changelog.Debian.gz
/usr/share/julius-voxforge
/usr/share/julius-voxforge/acoustic
/usr/share/julius-voxforge/acoustic/hmmdefs
/usr/share/julius-voxforge/acoustic/macros
/usr/share/julius-voxforge/acoustic/tiedlist
