Someone asked me about voice recognition the other day, so I thought it sounded like a fun little project to master. Here’s my go at it.
Unfortunately, I didn’t have much luck with it. Please post a comment if you get more working than I did.
Which Package to Install
I did a little research, and found that Wikipedia has a nice list of open source speech recognition programs. While it’s not huge, it was a good place to start. I chose Julius because it looked the most promising.
Installing Julius
Since Julius is in the repositories, installing it was easy! I just installed it from the Software Center.
I went a step further and installed the VoxForge acoustic files from the “More Info” screen.
From what I can tell, there is no GUI for Julius (although the Simon project might be a front end for it). You should find it installed on the command line though:
$ which julius
/usr/bin/julius
$ julius -help
Running Julius
Looking at the options, my first attempt was:
$ julius -input mic
ERROR: m_chkparam: you should specify at least one LM to run Julius!
The next thing I found was the VoxForge quickstart. I downloaded the tarball, and extracted it:
$ tar -xzvf julius-3.5.2-quickstart-linux.tgz
$ cd julius-3.5.2-quickstart-linux/
$ julius -input mic -C julian.jconf
That was closer, but it gave me this message at the end of all the output:
------
### read waveform input
Stat: adin_oss: device name = /dev/dsp (application default)
Error: adin_oss: failed to open /dev/dsp
failed to begin input stream
Julius was trying to open the OSS device /dev/dsp. Adding padsp (the PulseAudio wrapper for OSS applications) in front of the command fixed that problem:
$ padsp julius -input mic -C julian.jconf
I still got warnings though…
### read waveform input
Stat: adin_oss: device name = /dev/dsp (application default)
Stat: adin_oss: sampling rate = 16000Hz
Stat: adin_oss: going to set latency to 50 msec
Stat: adin_oss: audio I/O Latency = 32 msec (fragment size = 512 samples)
STAT: AD-in thread created
<<< please speak >>>Warning: adin_oss: no data fragment after 300 msec?
Warning: adin_oss: no data fragment after 300 msec?
Warning: adin_oss: no data fragment after 300 msec?
If you open the Sound Settings, the warnings go away. I thought that was kind of flaky, but it worked. Unfortunately, the output was a little cryptic, and didn’t give me the feedback that I needed. This is what I got when I said, “Hello”:
pass1_best: <s> DIAL EIGHT
pass1_best_wordseq: 0 3 5
pass1_best_phonemeseq: sil | d ay ax l | ey t
pass1_best_score: -3177.784424
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 13 generated, 13 pushed, 5 nodes popped in 109
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 1.000 0.997 1.000
score1: -3393.694580
Running from a Recording
I probably could have used Audacity much more easily, but since I was already on the command line, I decided to stay there and use the arecord program. I used this line to record:
$ arecord -r 16000 > test.wav
I played it back and it sounded kind of rough, but we’ll try it:
$ mplayer test.wav
Next, I ran it through julius:
$ ls test.wav > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf
Unfortunately, mplayer could play the file, but julius could not open it for some reason.
### read waveform input
Error: adin_file: bytes per second != 32000 (16000)
Error: adin_file: error in parsing wav header at test.wav
Error: adin_file: failed to read speech data: "test.wav"
0 files processed
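The numbers in that error are worth a second look: 16000 bytes per second is what a 16 kHz mono file with 8-bit samples produces, and 32000 is what the same file would produce with 16-bit samples. Since arecord falls back to 8-bit unsigned samples when no format is given, that is a plausible cause. Here is a minimal sketch for checking a WAV header before handing the file to Julius. It uses only Python's standard wave module; the function names are made up for this example, and the 16 kHz / mono / 16-bit target is an assumption read off the error message rather than taken from the Julius documentation.

```python
import wave


def describe_wav(path):
    """Return (sample_rate, channels, bits_per_sample, bytes_per_second)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_bytes = wav.getsampwidth()
    # Bytes per second is simply rate x channels x bytes per sample.
    return rate, channels, sample_bytes * 8, rate * channels * sample_bytes


def matches_expected_format(path, rate=16000, channels=1, bits=16):
    """True if the header matches the format assumed here (16 kHz, mono, 16-bit)."""
    actual_rate, actual_channels, actual_bits, _ = describe_wav(path)
    return (actual_rate, actual_channels, actual_bits) == (rate, channels, bits)
```

Calling describe_wav("test.wav") on a recording like the one above should make the mismatch visible at a glance, without having to decode the Julius error.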
So, I found an example that used sox to convert it. I had to install sox with apt-get …
sudo apt-get install sox
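Then, I converted the file and ran it like this (note that -b 32 requests 32-bit samples, whereas the 32000 bytes per second in the error message corresponds to 16-bit samples at 16 kHz, so -b 16 may have been the better choice):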
$ sox test.wav -r 16000 -b 32 -c 1 test.s32
$ ls test.s32 > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf
Still, this is the only output that I got:
### Recognition: 1st pass (LR beam)
...........................................................................................................................pass1_best: <s>
pass1_best_wordseq: 0
pass1_best_phonemeseq: sil
pass1_best_score: -2712.263916
### Recognition: 2nd pass (RL heuristic best-first)
WARNING: IW-triphone for word head "l-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "ow-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "t-ow+t" not found, fallback to pseudo {ow+t}
WARNING: IW-triphone for word head "uw-ow+t" not found, fallback to pseudo {ow+t}
WARNING: 00 _default: hypothesis stack exhausted, terminate search now
STAT: 00 _default: 0 sentences have been found
WARNING: 00 _default: got no candidates, search failed
STAT: 00 _default: 147 generated, 147 pushed, 147 nodes popped in 123
<search failed>
------
### read waveform input
1 files processed
I used Audacity to clean up the file. The Noise Removal effect improved it somewhat, but it still wasn’t good quality. Here’s the output after that:
### read waveform input
Stat: adin_file: input speechfile: test.wav
STAT: 30000 samples (1.88 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
..........................................................................................................................................................................................pass1_best: <s> DIAL OH </s>
pass1_best_wordseq: 0 3 5 1
pass1_best_phonemeseq: sil | d ay ax l | ow | sil
pass1_best_score: -5237.150391
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 27 generated, 27 pushed, 5 nodes popped in 186
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.978 0.987 1.000
score1: -5225.757324
------
### read waveform input
1 files processed
I also tried creating a file from scratch in Audacity, and Julius still didn’t recognize it correctly:
### read waveform input
Stat: adin_file: input speechfile: test.wav
STAT: 21176 samples (1.32 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
..................................................................................................................................pass1_best: <s> DIAL OH
pass1_best_wordseq: 0 3 5
pass1_best_phonemeseq: sil | d ay ax l | ow
pass1_best_score: -3417.226318
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 23 generated, 23 pushed, 5 nodes popped in 130
sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.911 1.000 1.000
score1: -3453.692871
------
### read waveform input
1 files processed
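Since the Julius output is cryptic, it can help to pull out just the final sentence and its per-word confidence scores. The sketch below does that with a couple of regular expressions. It assumes the output has been captured as text with one message per line, and that each sentenceN: line is followed by a matching cmscoreN: line, as in the runs above. The function name is made up for this example and is not part of Julius.

```python
import re


def parse_julius_results(output):
    """Return one list of (word, confidence) pairs per recognized sentence."""
    results = []
    words = None
    for line in output.splitlines():
        line = line.strip()
        sentence = re.match(r"sentence\d+:\s*(.*)", line)
        if sentence:
            # Remember the words until the matching cmscore line shows up.
            words = sentence.group(1).split()
            continue
        scores = re.match(r"cmscore\d+:\s*(.*)", line)
        if scores and words is not None:
            confidences = [float(s) for s in scores.group(1).split()]
            results.append(list(zip(words, confidences)))
            words = None
    return results


if __name__ == "__main__":
    sample = """sentence1: <s> DIAL OH </s>
wseq1: 0 3 5 1
phseq1: sil | d ay ax l | ow | sil
cmscore1: 1.000 0.911 1.000 1.000
score1: -3453.692871"""
    for word, confidence in parse_julius_results(sample)[0]:
        print(word, confidence)
```

A run that ends in a failed search produces no sentence lines, so the function returns an empty list for it.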
Running on YouTube Videos
Next, I wanted to try Julius on a good-quality recording. So, let’s find a good YouTube video to run through Julius.
I tried clive, but it failed for some reason:
$ sudo apt-get install clive
$ clive -cnrf best http://www.youtube.com/watch?v=dePLd9HAYjQ
fetch http://www.youtube.com/watch?v=dePLd9HAYjQ ...done.
error: no match: `(?-xism:url_encoded_fmt_stream_map=(.*?)&)'
So, I went back to my tried and true Video Downloader Firefox extension. Here is the first video that I tried:
For God So Loved The World (song and hymn history)
I converted the flv file to a wav like this:
ffmpeg -i youtube.flv -vn -acodec pcm_s16le -ar 16000 -ac 1 -f wav test.wav
And, I ran it through Julius like this:
$ ls test.wav > test.txt
$ julius -input rawfile -filelist test.txt -C julian.jconf
The end result was a segmentation fault!
I tried another one: Psalm 119 King James Holy Bible
This one also gave me a segmentation fault.
Another: Job 41 (King James Holy Bible)
....trace_backptr: sentence length exceeded ( > 150)
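The “sentence length exceeded” message hints that Julius was handed far more speech than it expects in a single utterance. One thing that might be worth trying is cutting the recording into short pieces and listing them all in the file passed to -filelist. I have not tested this, so treat it as an assumption rather than a known fix. The sketch below uses Python's standard wave module; the function names, the chunk file names, and the 5-second default are arbitrary choices for illustration.

```python
import os
import wave


def split_wav(path, out_dir, chunk_seconds=5.0):
    """Split a PCM WAV file into consecutive chunks and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = os.path.join(out_dir, "chunk%04d.wav" % index)
            with wave.open(chunk_path, "wb") as dst:
                # Same rate, channels and sample width as the source file;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths


def write_filelist(paths, list_path):
    """Write one path per line, the layout used for test.txt above."""
    with open(list_path, "w") as handle:
        handle.write("\n".join(paths) + "\n")
```

Fixed-length cuts will slice through words, so splitting on silence would probably work better, but this is enough to see whether shorter inputs avoid the crash.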
VoxForge Example
If you want to play with the VoxForge addon package, you can look at the readme file that should be located here:
/usr/share/doc/julius-voxforge/examples/README
$ dpkg -L julius-voxforge
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/julius-voxforge
/usr/share/doc/julius-voxforge/copyright
/usr/share/doc/julius-voxforge/examples
/usr/share/doc/julius-voxforge/examples/controlapp
/usr/share/doc/julius-voxforge/examples/controlapp/mediaplayer.grammar
/usr/share/doc/julius-voxforge/examples/controlapp/command.py
/usr/share/doc/julius-voxforge/examples/controlapp/mediaplayer.voca
/usr/share/doc/julius-voxforge/examples/controlapp/README.controlapp
/usr/share/doc/julius-voxforge/examples/README
/usr/share/doc/julius-voxforge/examples/sample.grammar
/usr/share/doc/julius-voxforge/examples/sample.voca
/usr/share/doc/julius-voxforge/examples/julian.jconf.gz
/usr/share/doc/julius-voxforge/dict.gz
/usr/share/doc/julius-voxforge/changelog.Debian.gz
/usr/share/julius-voxforge
/usr/share/julius-voxforge/acoustic
/usr/share/julius-voxforge/acoustic/hmmdefs
/usr/share/julius-voxforge/acoustic/macros
/usr/share/julius-voxforge/acoustic/tiedlist
Resources
- VoxForge Quickstart
- Julius Forums: how do i run the julius executable file in the julius folder
- “The Julius Book” Version 4.1.5 PDF Format
- VoxForge: Running Julian Live
- LaunchPad Answers: How do I install Simon
- VoxForge: Running Julius on 64-bit Ubuntu 10.04
- StackOverflow: ffmpeg 0.5 flv to wav conversion creates wav files that other programs won’t open
- clive
while trying to identify text at the end of an mp3 file from a webradio recording i tried julian, so far without much success.
installation was pretty easy
http://www.voxforge.org/home/downloads#QuickStart%20Anchor
unpack and run somewhat like
./julian -input stdin -C julian.jconf < "bla.wav"
works if bla.wav is made with sox like
sox bla.mp3 --channels 1 --rate 16k bla.wav
in my case from the big mp3 file just the last 3 seconds without trailing silence are necessary, so
sox musicfile.mp3 --channels 1 --rate 16k bla.wav reverse trim 0 9 vad trim 0 3 reverse
generates what i want to input in julius
but even when there is clear speech, the output is not really recognising any words.
with the mic input, which is the default in the quickstart, it worked on my eeepc 701 running knoppix. just speak, adjust the mic sensitivity with aumix and enjoy julius recognizing the difference between noise, music (ignores it) and speech (incorrectly recognizes it)
anyway, maybe some tweaking will make it work someday
The important thing for accurate speech recognition is a grammar. A grammar describes the possible recognition outcomes that the decoder searches for. That’s how speech recognition systems, including Julius, work.
The default VoxForge demo has a very limited grammar; it is useful only for dialing numbers, not for large-vocabulary transcription.
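To make that concrete, here is a toy sketch of what “limited grammar” means in practice. The grammar below is hypothetical: it only imitates the style of a dialing demo and is not the actual VoxForge grammar or vocabulary. It is also not how Julius decodes audio. The point is only that the set of sentences a grammar can produce is finite, and anything outside that set, such as “hello”, can never come out of the recognizer.

```python
from itertools import product

# Hypothetical toy grammar: each slot lists the words allowed at that position.
TOY_GRAMMAR = [
    ["DIAL", "CALL"],
    ["ZERO", "OH", "ONE", "TWO", "THREE", "FOUR",
     "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"],
]


def allowed_sentences(grammar):
    """Enumerate every word sequence the grammar can produce."""
    return {" ".join(words) for words in product(*grammar)}


def can_be_recognized(phrase, grammar):
    """True only if the phrase is one of the grammar's sentences."""
    return phrase.upper() in allowed_sentences(grammar)
```

With this toy grammar, can_be_recognized("dial oh", TOY_GRAMMAR) is True and can_be_recognized("hello", TOY_GRAMMAR) is False, which matches the post above: saying “Hello” came back as “DIAL OH”, the nearest thing the demo grammar allows.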
For large-vocabulary transcription it makes sense to use the Pocketsphinx engine from CMUSphinx. It has very large language models available, which allow you to transcribe broadcast news and other texts on a variety of topics in US English with high accuracy.
For more details on the concepts of speech recognition, see the tutorial
http://cmusphinx.sourceforge.net/wiki/tutorial
And visit the CMUSphinx page
http://cmusphinx.sourceforge.net
hi, I am trying to play a .raw file in ubuntu 12.04 by using this command
play -t raw -r 16000 -s -w an251-fash-b.raw
but i got the following error
play: invalid option -- w
play: SoX v14.3.2
play FAIL sox: invalid option
Please, can anyone solve this?
Thanks in advance
Hi Selvi
In recent versions of sox the command-line options have changed. You need to use -2 (2 bytes, i.e. 16-bit samples) instead of -w (the old option for 16-bit samples). The option -b 16 means the same thing.
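A .raw file has no header, which is why play needs the sample rate and sample size spelled out on the command line. As an alternative, the raw samples can be wrapped in a WAV header once, after which most players need no extra flags. The sketch below uses Python's standard wave module and assumes the file holds 16-bit signed little-endian mono samples at 16 kHz. The rate and sample size come from the command above; the byte order is my assumption, so if the result sounds like noise the file may be big-endian. The function name is made up for this example.

```python
import wave


def raw_to_wav(raw_path, wav_path, rate=16000, channels=1, sample_bytes=2):
    """Copy headerless PCM samples into a WAV file with the given format."""
    with open(raw_path, "rb") as raw:
        pcm = raw.read()
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_bytes)  # 2 bytes = 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(pcm)            # samples are copied unchanged
```

For the file in the question, that would be raw_to_wav("an251-fash-b.raw", "an251-fash-b.wav").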
I want to make my own speech recognition robot. I installed Julius, HTK, and Audacity, and I have an Arduino. How do I communicate with the Arduino using Julian? Please explain clearly.
reddy,
I can’t give you complete instructions, but maybe a few thoughts will help a little.
I don’t know if Audacity will help you much. It is good for editing sound, but unless you need to clean up the sound before you process it, you shouldn’t need it.
Hack a Day has an article in which they recommend a different project called μSpeech:
http://hackaday.com/2012/09/22/speech-recognition-on-an-arduino/
uSpeech is located here: https://github.com/arjo129/uSpeech
Instructables has an article with a different approach:
http://www.instructables.com/id/Speech-Recognition-with-Arduino/
I hope those links help,
Stephen
Hi Guys
I need your help urgently. I installed the HTK software on Ubuntu 14.04 from the command line. How can I use the graphical interface of the HTK software? Using the graphical interface is very important for my project. Please let me know the Linux command-line syntax to start the graphical interface of the HTK software.
Kind regards
Firesenbet