Synthesizing an A Cappella Song with Festival Speech Synthesis

I have played with Festival before.  It easily generates speech from written text, and it seems fairly full-featured.  But I have always wanted to add pitch.  Could I make it sing the words?

I think I found my answer:

Festival Singing Synthesis

So, to try it out, I decided to try to generate the first phrase of this song:

When I Survey Sheet Music

Festival uses an XML format to describe how the notes match up to the words.  So, for the first part, the melody, I created survey1.xml to contain this:


<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"
 "Singing.v0_1.dtd"
[]>
<SINGING BPM="30">
<PITCH NOTE="F3"><DURATION BEATS="0.6">When</DURATION></PITCH>
<PITCH NOTE="F3"><DURATION BEATS="0.3">I</DURATION></PITCH>
<PITCH NOTE="G3"><DURATION BEATS="0.3">Sur</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.6">vey</DURATION></PITCH>
<PITCH NOTE="G3"><DURATION BEATS="0.3">The</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">a</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.6">Won</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">dra</DURATION></PITCH>
<PITCH NOTE="G3"><DURATION BEATS="0.3">as</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">cross</DURATION></PITCH>
</SINGING>
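
Typing one PITCH/DURATION pair per syllable gets tedious, so here is a minimal sketch of generating the same markup from a compact list. The function name sing_xml and the NOTE:BEATS:SYLLABLE argument format are my own invention for illustration, not part of Festival; the output simply mirrors the structure of the file above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Festival): print a singing-mode XML
# document on stdout.
# Usage: sing_xml BPM NOTE:BEATS:SYLLABLE [NOTE:BEATS:SYLLABLE ...]
sing_xml() {
    bpm=$1
    shift
    # Fixed header, copied from the hand-written file.
    printf '%s\n' '<?xml version="1.0"?>'
    printf '%s\n' '<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"'
    printf '%s\n' ' "Singing.v0_1.dtd"'
    printf '%s\n' '[]>'
    printf '<SINGING BPM="%s">\n' "$bpm"
    for spec in "$@"; do
        note=${spec%%:*}    # text before the first colon
        rest=${spec#*:}     # everything after the first colon
        beats=${rest%%:*}   # text between the two colons
        word=${rest#*:}     # text after the second colon
        printf '<PITCH NOTE="%s"><DURATION BEATS="%s">%s</DURATION></PITCH>\n' \
            "$note" "$beats" "$word"
    done
    printf '%s\n' '</SINGING>'
}

# Example (start of the melody):
#   sing_xml 30 F3:0.6:When F3:0.3:I G3:0.3:Sur A3:0.6:vey > survey1.xml
```

Syllables containing a colon or XML special characters are not handled; for hymn lyrics that has not been a problem.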

To play it, I could use the command:

(tts "survey1.xml" 'singing)

Here’s the full output:


skp@pecan:~/Downloads$ festival

Festival Speech Synthesis System 2.1:release November 2010
Copyright (C) University of Edinburgh, 1996-2010. All rights reserved.

clunits: Copyright (C) University of Edinburgh and CMU 1997-2010
clustergen_engine: Copyright (C) CMU 2005-2010
hts_engine:
The HMM-based speech synthesis system (HTS)
hts_engine API version 1.04 (http://hts-engine.sourceforge.net/)
Copyright (C) 2001-2010 Nagoya Institute of Technology
 2001-2008 Tokyo Institute of Technology
All rights reserved.
For details type `(festival_warranty)'
festival> (tts "survey1.xml" 'singing)
nil
festival>

Next, I wanted harmony.  So, I created a second file: survey2.xml.  It contained this:


<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"
 "Singing.v0_1.dtd"
[]>
<SINGING BPM="30">
<PITCH NOTE="C3"><DURATION BEATS="0.6">When</DURATION></PITCH>
<PITCH NOTE="C3"><DURATION BEATS="0.3">I</DURATION></PITCH>
<PITCH NOTE="E3"><DURATION BEATS="0.3">Sur</DURATION></PITCH>
<PITCH NOTE="F3"><DURATION BEATS="0.6">vey</DURATION></PITCH>
<PITCH NOTE="G3"><DURATION BEATS="0.3">The</DURATION></PITCH>
<PITCH NOTE="F#3"><DURATION BEATS="0.3">a</DURATION></PITCH>
<PITCH NOTE="G3"><DURATION BEATS="0.6">Won</DURATION></PITCH>
<PITCH NOTE="F3"><DURATION BEATS="0.3">dra</DURATION></PITCH>
<PITCH NOTE="E3"><DURATION BEATS="0.3">as</DURATION></PITCH>
<PITCH NOTE="F3"><DURATION BEATS="0.3">cross</DURATION></PITCH>
</SINGING>
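
For the parts to line up when they are mixed, each file needs the same BPM and the same total number of beats. A quick sanity check is to add up the BEATS values, as in the sketch below. It assumes one beat lasts 60/BPM seconds (so 2 seconds at BPM 30) and that each DURATION element sits on its own line, as in the files above. The function name song_seconds is hypothetical. Under those assumptions both files here total 3.9 beats, or about 7.8 seconds.

```shell
#!/bin/sh
# Hypothetical helper: print the nominal length, in seconds, of a singing
# XML file. Assumes one beat lasts 60/BPM seconds and one DURATION per line.
song_seconds() {
    awk '
        # Pick up the tempo from the SINGING element.
        match($0, /BPM="[0-9.]+"/) {
            bpm = substr($0, RSTART + 5, RLENGTH - 6) + 0
        }
        # Add the BEATS attribute of each DURATION element to the total.
        match($0, /BEATS="[0-9.]+"/) {
            beats += substr($0, RSTART + 7, RLENGTH - 8)
        }
        END { if (bpm > 0) printf "%.1f\n", beats * 60 / bpm }
    ' "$1"
}

# Example: compare the two parts before recording them.
#   song_seconds survey1.xml
#   song_seconds survey2.xml
```

This only measures what the markup asks for; the synthesized audio may run slightly longer or shorter.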

The documented command to write this to a WAV file is "text2wave".  Unfortunately, that just returns an error:


skp@pecan:~/Downloads$ text2wave -mode singing survey1.xml -o survey1.wav
SIOD ERROR: wrong type of argument to get_c_val

That didn’t work, so I dropped back to just recording it with arecord.  To get arecord to record from my soundcard rather than my microphone, I had to create a loopback device.  This command did the trick:

sudo modprobe snd-aloop
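
The card number the loopback device receives depends on what other sound hardware is present; on my machine it came up as card 1. As a rough sketch, the index can be looked up instead of assumed. This relies on the loopback card appearing in /proc/asound/cards with the id "Loopback", which is how it shows up for me but is an assumption about your system. The function name loopback_card is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the ALSA card index of the snd-aloop device.
# Reads /proc/asound/cards by default; pass another file to test the parsing.
# Assumes the card is listed with the id "Loopback", e.g.:
#    1 [Loopback       ]: Loopback - Loopback
loopback_card() {
    awk '$2 == "[Loopback" { print $1; exit }' "${1:-/proc/asound/cards}"
}

# Example: record from the loopback card, whatever its number is.
#   arecord -D "hw:$(loopback_card),1,0" -f cd survey1.wav
```

If the function prints nothing, the module is probably not loaded.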

Next, in my volume control, I had to select output to my new loopback device:

Selecting the loopback device

Finally, I threw together this little script to start recording, generate the singing, and close the recording:


#!/bin/sh

arecord -D hw:1,1,0 -f cd survey1.wav &
pid=$!
festival <<!
(tts_file "survey1.xml" 'singing)
!
kill $pid

Or, this script does all of my parts for me:


#!/bin/sh

# Let the shell expand the glob directly; parsing ls output is fragile.
for f in survey?.xml
do
 arecord -D hw:1,1,0 -f cd "$f.wav" &
 pid=$!
 festival <<!
 (tts_file "$f" 'singing)
!
 kill $pid
done

Now, to mix the output down to a single file, I needed to install the speech-tools package:

sudo apt-get install speech-tools

Then, this command mixed all of the parts into a single song:

ch_wave -o survey_full.wav -pc longest survey?.xml.wav

Flinger

Several places refer to a more advanced variant of Festival, called Flinger, that is designed specifically for generating singing.

I had trouble finding where to download Flinger.  I think this might be the place, but it requires registration.  I’ll save this for a follow-up post:

https://www.cslu.ogi.edu/tts/download/data/
