Voice Recognition in Ubuntu



Someone asked me about voice recognition the other day, so I thought it sounded like a fun little project to master.  Here’s my go at it.

Unfortunately, I didn’t have much luck with it.  Please post a comment if you get more working than I did.

Which Package to Install

I did a little research, and found that Wikipedia has a nice list of open source speech recognition programs.  While it’s not huge, it was a good place to start.  I chose Julius because it looked the most promising.

Installing Julius

Since Julius is in the repositories, installing it was easy!  I just installed it from the Software Center.

Installing Julius with the Software Center

I went a step further and installed the VoxForge acoustic files from the “More Info” screen.

From what I can tell, there is no GUI for Julius (although the Simon project might be a frontend for it).  You should find it installed on the command line, though:

$ which julius
/usr/bin/julius
$ julius -help

Running Julius

Looking at the options, my first attempt was:

$ julius -input mic
ERROR: m_chkparam: you should specify at least one LM to run Julius!

The next thing I found was the VoxForge quickstart.  I downloaded the tarball, and extracted it:

$ tar -xzvf julius-3.5.2-quickstart-linux.tgz
$ cd julius-3.5.2-quickstart-linux/
$ julius -input mic -C julian.jconf

That was closer, but it gave me this message at the end of all the output:


------
### read waveform input
Stat: adin_oss: device name = /dev/dsp (application default)
Error: adin_oss: failed to open /dev/dsp
failed to begin input stream

Adding padsp in front of the command fixed that problem:

$ padsp julius -input mic -C julian.jconf

I still got warnings though…


### read waveform input
Stat: adin_oss: device name = /dev/dsp (application default)
Stat: adin_oss: sampling rate = 16000Hz
Stat: adin_oss: going to set latency to 50 msec
Stat: adin_oss: audio I/O Latency = 32 msec (fragment size = 512 samples)
STAT: AD-in thread created
<<< please speak >>>Warning: adin_oss: no data fragment after 300 msec?
Warning: adin_oss: no data fragment after 300 msec?
Warning: adin_oss: no data fragment after 300 msec?

If you open the Sound Settings, the warnings go away.  I thought that was kind of flaky, but it worked.  Unfortunately, the output was a little cryptic, and didn’t give me the feedback that I needed.  This is what I got when I said, “Hello”:


pass1_best: <s> DIAL EIGHT
pass1_best_wordseq: 0 3 5
pass1_best_phonemeseq: sil | d ay ax l | ey t
pass1_best_score: -3177.784424
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 13 generated, 13 pushed, 5 nodes popped in 109
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 1.000 0.997 1.000
score1: -3393.694580
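
The lines that matter most in that dump are `sentence1:` (the final hypothesis) and `cmscore1:` (which appears to be a per-word confidence score).  As a minimal sketch, the shell function below filters Julius output down to those two fields.  The name `julius_summary` is made up for illustration and is not part of Julius; the sample text is copied from the run above.

```shell
#!/bin/sh
# julius_summary: hypothetical helper (not part of Julius) that reads Julius
# output on stdin and prints only the final sentence and its cmscore line.
julius_summary() {
    sed -n \
        -e 's/^sentence1: *//p' \
        -e 's/^cmscore1: */  confidence: /p'
}

# Example using lines from the output shown above.
julius_summary <<'EOF'
pass1_best: <s> DIAL EIGHT
pass1_best_wordseq: 0 3 5
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
cmscore1: 1.000 1.000 0.997 1.000
score1: -3393.694580
EOF
```

In a live session the same filter could presumably sit at the end of the pipeline, as in `padsp julius -input mic -C julian.jconf | julius_summary`, although output buffering may delay what you see.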

Running from a Recording

I probably could have used audacity much more easily, but since I was already on the command line, I decided to keep it there with the arecord program.  I used this line to record:

$ arecord -r 16000 > test.wav

I played it back, and it sounded kind of rough, but let’s try it anyway:

$ mplayer test.wav

Next, I ran it through julius:


$ ls test.wav > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf

Unfortunately, mplayer could play the file, but julius could not open it for some reason.


### read waveform input
Error: adin_file: bytes per second != 32000 (16000)
Error: adin_file: error in parsing wav header at test.wav
Error: adin_file: failed to read speech data: "test.wav"
0 files processed
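
The numbers in that error are just arithmetic: bytes per second = sample rate × channels × (bits per sample ÷ 8).  Julius is asking for 32000, which is 16 kHz mono at 16 bits, while the file reports 16000, which is what 8-bit samples would give (arecord records unsigned 8-bit when no `-f` format is given).  The sketch below shows the calculation and a way to read the field from a file.  Both helper names are invented here, and `wav_byte_rate` assumes a plain 44-byte RIFF header and a little-endian machine.

```shell
#!/bin/sh
# Both helpers are illustrative; the names are not from Julius or ALSA.

# expected_byte_rate RATE CHANNELS BITS -> bytes per second of PCM audio
expected_byte_rate() {
    echo $(( $1 * $2 * $3 / 8 ))
}

# wav_byte_rate FILE -> the byte-rate field of a WAV header.
# Assumes a canonical RIFF header (field at byte offset 28) and a
# little-endian machine, since od decodes integers in host byte order.
wav_byte_rate() {
    printf '%s\n' "$(od -An -tu4 -j28 -N4 "$1" | tr -d '[:space:]')"
}

expected_byte_rate 16000 1 16   # 16-bit mono at 16 kHz: prints 32000
expected_byte_rate 16000 1 8    # 8-bit mono at 16 kHz: prints 16000
```

Re-recording with an explicit 16-bit format, for example `arecord -f S16_LE -r 16000 -c 1 test.wav`, should produce a file whose `wav_byte_rate` is 32000.  Whether that alone is enough for a good recognition result is not something this post tested.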

So, I found an example that used sox to convert it.  I had to install sox with apt-get …

$ sudo apt-get install sox

Then, I converted the file and ran it like this:

$ sox test.wav -r 16000 -b 32 -c 1 test.s32
$ ls test.s32 > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf

Still, this is the only output that I got:


### Recognition: 1st pass (LR beam)
...........................................................................................................................pass1_best: <s>
pass1_best_wordseq: 0
pass1_best_phonemeseq: sil
pass1_best_score: -2712.263916
### Recognition: 2nd pass (RL heuristic best-first)
WARNING: IW-triphone for word head "l-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "ow-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "t-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "uw-ow+t" not found, fallback to pseudo {ow+t}
WARNING: 00 _default: hypothesis stack exhausted, terminate search now
STAT: 00 _default: 0 sentences have been found
WARNING: 00 _default: got no candidates, search failed
STAT: 00 _default: 147 generated, 147 pushed, 147 nodes popped in 123
<search failed>
------
### read waveform input
1 files processed

I used audacity to clean up the file.  The Noise Removal effect improved it somewhat, but it still wasn’t good quality.  Here’s the output after that:


### read waveform input
Stat: adin_file: input speechfile: test.wav
STAT: 30000 samples (1.88 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
..........................................................................................................................................................................................pass1_best: <s> DIAL OH </s>
pass1_best_wordseq: 0 3 5 1
pass1_best_phonemeseq: sil | d ay ax l | ow | sil
pass1_best_score: -5237.150391
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 27 generated, 27 pushed, 5 nodes popped in 186
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.978 0.987 1.000
score1: -5225.757324
------
### read waveform input
1 files processed

I also tried creating a file from scratch in audacity, and I still couldn’t get it:


### read waveform input
Stat: adin_file: input speechfile: test.wav
STAT: 21176 samples (1.32 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
..................................................................................................................................pass1_best: <s> DIAL OH
pass1_best_wordseq: 0 3 5
pass1_best_phonemeseq: sil | d ay ax l | ow
pass1_best_score: -3417.226318
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 23 generated, 23 pushed, 5 nodes popped in 130
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.911 1.000 1.000
score1: -3453.692871
------
### read waveform input
1 files processed

Running on YouTube Videos

Next, I wanted to try recognition on a good-quality recording.  So, let’s find a good YouTube video to run through julius.

I tried clive, but it failed for some reason:

$ sudo apt-get install clive

$ clive -cnrf best http://www.youtube.com/watch?v=dePLd9HAYjQ
fetch http://www.youtube.com/watch?v=dePLd9HAYjQ ...done.
error: no match: `(?-xism:url_encoded_fmt_stream_map=(.*?)&)'

So, I went back to my tried and true Video Downloader Firefox extension.  Here is the first video that I tried:

For God So Loved The World (song and hymn history) 

I converted the flv file to a wav like this:

$ ffmpeg -i youtube.flv -vn -acodec pcm_s16le -ar 16000 -ac 1 -f wav test.wav

And, I ran it through Julius like this:

$ ls test.wav > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf

The end result was a segmentation fault!

I tried another one: Psalm 119 King James Holy Bible 

This one also gave me a segmentation fault.

Another: Job 41 (King James Holy Bible) 

This one gave me this message:
....trace_backptr: sentence length exceeded ( > 150)
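
These recordings were presumably far longer than the one- or two-second clips used earlier, and the `sentence length exceeded ( > 150)` message suggests Julius was treating a long stretch as a single utterance.  One thing that might be worth trying (an untested assumption, not a confirmed fix) is cutting the audio into short clips first.  The sketch below only prints the ffmpeg commands rather than running them.  The helper name `print_split_cmds` and the `chunk_NNN.wav` naming are made up for illustration, and durations are whole seconds.

```shell
#!/bin/sh
# print_split_cmds INPUT TOTAL_SECONDS CHUNK_SECONDS
# Hypothetical helper: prints one ffmpeg command per chunk (dry run only).
# The audio options mirror the conversion command used above.
print_split_cmds() {
    src=$1
    total=$2
    chunk=$3
    start=0
    n=0
    while [ "$start" -lt "$total" ]; do
        printf 'ffmpeg -i %s -ss %d -t %d -vn -acodec pcm_s16le -ar 16000 -ac 1 -f wav chunk_%03d.wav\n' \
            "$src" "$start" "$chunk" "$n"
        start=$(( start + chunk ))
        n=$(( n + 1 ))
    done
}

# A 25-second input in 10-second pieces yields three commands.
print_split_cmds youtube.flv 25 10
```

The printed lines can be reviewed and then piped to `sh`, after which `ls chunk_*.wav > test.txt` would build the file list for `-filelist`.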

VoxForge Example

If you want to play with the VoxForge addon package, you can look at the readme file that should be located here:

/usr/share/doc/julius-voxforge/examples/README

Here are all the files installed with it:
$ dpkg -L julius-voxforge
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/julius-voxforge
/usr/share/doc/julius-voxforge/copyright
/usr/share/doc/julius-voxforge/examples
/usr/share/doc/julius-voxforge/examples/controlapp
/usr/share/doc/julius-voxforge/examples/controlapp/mediaplayer.grammar
/usr/share/doc/julius-voxforge/examples/controlapp/command.py
/usr/share/doc/julius-voxforge/examples/controlapp/mediaplayer.voca
/usr/share/doc/julius-voxforge/examples/controlapp/README.controlapp
/usr/share/doc/julius-voxforge/examples/README
/usr/share/doc/julius-voxforge/examples/sample.grammar
/usr/share/doc/julius-voxforge/examples/sample.voca
/usr/share/doc/julius-voxforge/examples/julian.jconf.gz
/usr/share/doc/julius-voxforge/dict.gz
/usr/share/doc/julius-voxforge/changelog.Debian.gz
/usr/share/julius-voxforge
/usr/share/julius-voxforge/acoustic
/usr/share/julius-voxforge/acoustic/hmmdefs
/usr/share/julius-voxforge/acoustic/macros
/usr/share/julius-voxforge/acoustic/tiedlist
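
Julius grammars come as a pair of files like the `sample.grammar` and `sample.voca` listed above: the `.grammar` file holds sentence rules over word categories, and the `.voca` file lists the words and their phonemes for each category.  As a rough sketch, the function below writes a tiny dialing grammar using only the three words and phoneme strings that showed up in the recognition output earlier (DIAL, EIGHT, OH).  The category names, the `demo` file name and the helper itself are invented here, so check the layout against the packaged samples before relying on it.

```shell
#!/bin/sh
# write_demo_grammar DIR: hypothetical helper that writes demo.grammar and
# demo.voca into DIR. Category names (NS_B, NS_E, DIAL_V, DIGIT) are
# illustrative; the phonemes are copied from the Julius output shown earlier.
write_demo_grammar() {
    dir=$1
    cat > "$dir/demo.grammar" <<'EOF'
S : NS_B DIAL_V DIGIT NS_E
EOF
    cat > "$dir/demo.voca" <<'EOF'
% NS_B
<s>     sil

% NS_E
</s>    sil

% DIAL_V
DIAL    d ay ax l

% DIGIT
EIGHT   ey t
OH      ow
EOF
}

# Write both files into the current directory.
write_demo_grammar .
```

The Julius distribution includes a grammar compiler, `mkdfa.pl` (the packaged name may differ), that turns such a pair into the `.dfa` and `.dict` files a jconf refers to.  The exact invocation is best confirmed in the README mentioned above.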

6 Comments

  1. 0800peter says:

    while trying to identify text the end of a mp3 file from webradio recording i tried julian jet without much success.
    installation was pretty easy
    http://www.voxforge.org/home/downloads#QuickStart%20Anchor
    unpack and run somewhat like

    ./julian -input stdin -C julian.jconf < "bla.wav"

    works if bla.wav is made with sox like

    sox bla.mp3 --channels 1 --rate 16k bla.wav

    in my case from the big mp3 file just the last 3 seconds without trailing silence are necessary, so

    sox musicfile.mp3 --channels 1 --rate 16k bla.wav reverse trim 0 9 vad trim 0 3 reverse

    generates what i want to input in julius

    but the output is even if there is clear speech not realy recognising any words.

    if using the mic input as in the quickstart is default , it worked at my eeepc 701 running under knoppix. just speak, adjust the mic sensitivity with aumix and enjoy julius recognizing the difference between noise music(ignore it) and speech (incorrectly recognize it)

    anyway maybe some tweaking will make it someday

  2. Nick says:

    The important thing for accurate speech recognition is a grammar. A grammar describes the possible recognition outcomes that the decoder looks for. That’s how speech recognition systems, including julius, work.

    The default voxforge demo has a very limited grammar; it is useful only for dialing numbers, not for large-vocabulary transcription.

    For large-vocabulary transcription it makes sense to use the Pocketsphinx engine from CMUSphinx. It has very large language models available, which allow you to transcribe broadcast news and other texts on a variety of topics in US English with high accuracy.

    For more details on the concepts of speech recognition, see the tutorial

    http://cmusphinx.sourceforge.net/wiki/tutorial

    And visit CMUSphinx page

    http://cmusphinx.sourceforge.net

  3. selvi says:

    hi, I am trying to play a .raw file in ubuntu 12.04 by using this command

    play -t raw -r 16000 -s -w an251-fash-b.raw
    but i got the following error

    play: invalid option -- w
    play: SoX v14.3.2

    play FAIL sox: invalid option

    Plz can anyone solve this.
    Thanks in advance

    • Nick says:

      Hi Selvi

      In recent sox the command line options have changed. You need to use -2 (2 bytes or 16-bit samples) instead of -w (old option for 16-bit samples)

  4. reddy says:

    I WANT MAKE MY SPEECH RECOGNITION ROBOT .I INSTALLED JULIUS ,HTK, AUDACITY.I HAVE ARDUINO.HOW TO SEND COMMUNICATE WITH ARDUINO USING JULIAN.EXPLAIN CLEARLY.
