Getting started

This guide provides a quick walk-through of TrulyNatural concepts and command-line tools to help you get started.

Prerequisites

  1. Install the TrulyNatural SDK using the provided installer executable.
  2. Open a terminal with a command-line prompt.
  3. Add ~/Sensory/TrulyNaturalSDK/7.6.1/bin to your shell PATH variable, and change your working directory to ~/Sensory/TrulyNaturalSDK/7.6.1

    Configure your shell environment
    export PATH="${PATH}:${HOME}/Sensory/TrulyNaturalSDK/7.6.1/bin"
    cd ~/Sensory/TrulyNaturalSDK/7.6.1
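
    If your PATH is set up correctly, running snsr-eval without arguments should print a brief usage summary rather than a "command not found" error (see also the tip in the next section):

    % snsr-eval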
    

Wake words

  1. Let's start by running a simple wake word on live audio input.

    The snsr-eval utility can run all recognition and VAD model types. Use the -t (task) flag to specify the path to the wake word model.

    Start snsr-eval and say "voice genie" a number of times:

    % snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr
      1275   1920 voicegenie
      5070   5730 voicegenie
      10395  10980 voicegenie
      ^C
    

    The output shows the start and end times in milliseconds since the start of the recording, and the phrase the keyword spotter detected. snsr-eval runs until you interrupt it with ^C.

    The spot-voicegenie-enUS-6.5.1-m.snsr model file includes everything needed to run this wake word, including reasonable default settings.

    For wake words, there is one configuration option that you might want to adjust: operating-point, which controls the recognition sensitivity. You can set it on the command line with the -s option:

    % snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr -s operating-point=21
      1080   1605 voicegenie
      2610   3180 voicegenie
      5460   5970 voicegenie
    ^C
    

    Increase snsr-eval output verbosity with one or more -v flags:

    % snsr-eval -vt model/spot-voicegenie-enUS-6.5.1-m.snsr
    Using live audio from default capture device. ^C to stop.
    1410   1995 (0.999) voicegenie
    ^C
    
    % snsr-eval -vvt model/spot-voicegenie-enUS-6.5.1-m.snsr
    Using live audio from default capture device. ^C to stop.
    Using operating point 8.
    Available operating points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.
    Available vocabulary:
    1: "voicegenie"
    phrase:
      495   1020 (0.9951) voicegenie
    words:
      495   1020 (0.9951) voicegenie
    
    ^C
    

    The value in parentheses is the recognition score.

    You can also reduce the output verbosity with one or more -l flags. snsr-eval -ll reports only the recognition result, which makes it suitable for scripted batch testing.

    Tip

    Run snsr-eval, or any of the command-line tools, without arguments to see a brief usage summary and a list of available options.
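
    Once you have recordings on disk (file input is shown in the next step), the -ll verbosity reduction lends itself to scripted batch testing. A minimal sketch; the data/audio paths are illustrative and exact output at this verbosity level may differ:

    # Report only the recognition results for each recording, one file at a time.
    for f in data/audio/*.wav; do
        echo "== ${f}"
        snsr-eval -llt model/spot-voicegenie-enUS-6.5.1-m.snsr "${f}"
    done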

  2. To recognize pre-recorded audio, specify an audio file either in RIFF WAV format or as a headerless binary file containing raw audio samples. Most models require 16-bit LPCM encoding sampled at 16 kHz.

    % snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr \
        data/audio/voice-genie-set-cruise-control.wav
      2310   2910 voicegenie
    

    snsr-eval ends once it has processed the entire file.

    If you specify multiple files, snsr-eval concatenates them in order and evaluates the model on the result. Recognition timestamps reflect this concatenation.

    % snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr \
        data/audio/voice-genie-set-cruise-control.wav \
        data/audio/voice-genie-set-cruise-control.wav
      2310   2910 voicegenie
      9075   9645 voicegenie
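
    If your recordings are not already 16-bit LPCM at 16 kHz, convert them before evaluation. A minimal sketch using sox, which is not part of the SDK (assumes sox is installed; the file names are illustrative):

    # Resample to 16 kHz, 16-bit, mono LPCM WAV for use with snsr-eval.
    sox input-48k-stereo.wav -r 16000 -b 16 -c 1 input-16k.wav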
    
  3. snsr-eval also runs command sets. These are keyword spotters with more than one active word or phrase, optimized for a low false reject rate.

    Let's inspect the sample music control model vocabulary:

    % snsr-eval -vvt model/spot-music-enUS-1.2.0-m.snsr /dev/null
    Using operating point 17.
    Available operating points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20.
    Available vocabulary:
      1: "play_music"
      2: "previous_song"
      3: "stop_music"
      4: "next_song"
      5: "pause_music"
    

    Command sets work just like wake words:

    % snsr-eval -t model/spot-music-enUS-1.2.0-m.snsr \
        data/audio/voice-genie-music.wav
      5160   5865 play_music
      8055   8790 next_song
     14820  15705 stop_music
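
    As with wake words, you can adjust the recognition sensitivity of a command set through its operating-point setting. A sketch only: the value 10 is an arbitrary illustration, the verbose listing above shows the available operating points, and the detections you get will vary with the setting:

    % snsr-eval -t model/spot-music-enUS-1.2.0-m.snsr \
        -s operating-point=10 \
        data/audio/voice-genie-music.wav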
    
  4. Running an adapting wake word uses the same recipe as regular wake words:

    % snsr-eval -vt model/ca-voicegenie-enUS-1.1.0.snsr
    Using live audio from default capture device. ^C to stop.
          [^adapt-started] on worker thread
      8295   8790 (1 sv) voice_genie
          [^adapt-started] on worker thread
    15645  16245 (1 sv) voice_genie
    16515 [^adapted]
    16515 [^new-user] user1/voice_genie
          [^adapt-started] on worker thread
    20445  21060 (0.9678 sv) user1/voice_genie
    21225 [^adapted]
    ^C
    

    Note how the result changes once the model has adapted to your speech. If you have a second speaker say "voice genie" a couple of times, you should see user2/voice_genie for their utterances.

    The model adaptations do not persist and will be lost when the model is reloaded. We can address that by specifying a cache-file to hold them. Say "voice genie" a number of times, then stop snsr-eval with ^C:

    % snsr-eval -vt model/ca-voicegenie-enUS-1.1.0.snsr \
        -s cache-file=voice-genie.cache
    Using live audio from default capture device. ^C to stop.
          [^adapt-started] on worker thread
      1485   2070 (1 sv) voice_genie
          [^adapt-started] on worker thread
      4740   5355 (1 sv) voice_genie
      5460 [^adapted]
      5460 [^new-user] user1/voice_genie
          [^adapt-started] on worker thread
      9285   9855 (0.9951 sv) user1/voice_genie
     10140 [^adapted]
    ^C
    

    snsr-eval loads voice-genie.cache when restarting. Note that the very first detection is already reported as user1/voice_genie:

    % snsr-eval -vt model/ca-voicegenie-enUS-1.1.0.snsr \
        -s cache-file=voice-genie.cache
    Using live audio from default capture device. ^C to stop.
          [^adapt-started] on worker thread
      1185   1815 (0.9971 sv) user1/voice_genie
      2085 [^adapted]
    ^C
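
    To discard the adapted users and start over, remove the cache file before the next run. A plain shell sketch, assuming nothing else references the file:

    # With no cache file present, the next run starts from the unadapted base model.
    rm voice-genie.cache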
    

Wake word enrollment

EFT models adapt a specific wake word phrase to a speaker's voice. UDT models create new wake words for arbitrary phrases from a handful of speaker-specific recordings.

  1. Let's start with UDT. The spot-enroll utility does EFT and UDT enrollment from recordings. Most models require four recordings of the same phrase for optimal performance. UDT models are less likely to trigger on the same phrase said by another speaker.

    Enroll three phrases into a single model
    % spot-enroll -vt model/udt-universal-3.67.1.0.snsr \
        -o udt-kws.snsr \
        +armadillo-1 data/enrollments/armadillo-1-{0,1,2,3}.wav \
        +armadillo-6 data/enrollments/armadillo-6-{0,1,2,3}.wav \
        +jackalope-1 data/enrollments/jackalope-1-{0,1,2,3}.wav \
        +jackalope-4 data/enrollments/jackalope-4-{0,1,2,3}.wav \
        +terminator-2 data/enrollments/terminator-2-{0,1,2,3}.wav \
        +terminator-6 data/enrollments/terminator-6-{0,1,2,3}.wav
    Adapting: 100% complete.
    Enrolled model saved to "udt-kws.snsr"
    

    We've just created a new udt-kws.snsr wake word model that spots three different phrases, with two unique speakers per phrase. We can run the model with snsr-eval, using different recordings from the ones used for enrollment:

    % snsr-eval -t udt-kws.snsr \
        data/enrollments/armadillo-1-0-c.wav \
        data/enrollments/armadillo-6-0.wav \
        data/enrollments/jackalope-1-0-c.wav \
        data/enrollments/jackalope-4-0.wav \
        data/enrollments/terminator-2-0.wav \
        data/enrollments/terminator-6-0.wav
       330    945 armadillo-1
      4485   5265 armadillo-6
      6795   7320 jackalope-1
     10365  10890 jackalope-4
     13245  13950 terminator-2
     15510  16215 terminator-6
    

    The model identifies the phrases and the speakers correctly.
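
    You can also inspect the enrolled vocabulary with the same /dev/null trick used for the music command set above; the exact listing depends on what you enrolled:

    % snsr-eval -vvt udt-kws.snsr /dev/null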

  2. EFT models use the same tools and API for enrollment as UDT. We'll use live-enroll in this example. It is the interactive version of spot-enroll; the main difference is that it prompts you to repeat recordings that aren't usable instead of reporting an enrollment error.

    Run the example below, then say "hello blue genie" when prompted.

    % live-enroll -vt model/eft-hbg-enUS-23.0.0.9.snsr \
        -o eft-hbg.snsr +user-1
    
    Say the enrollment phrase (1/4) for "user-1"
    Recording:   4.29 s
    Preliminary enrollment checks passed.
    
    Say the enrollment phrase (2/4) for "user-1"
    Recording:   3.23 s
    Preliminary enrollment checks passed.
    
    Say the enrollment phrase (3/4) for "user-1" with context,
      for example: "<phrase> will it rain tomorrow?"
    Recording:   3.15 s
    Preliminary enrollment checks passed.
    
    Say the enrollment phrase (4/4) for "user-1" with context,
      for example: "<phrase> will it rain tomorrow?"
    Recording:   3.90 s
    Preliminary enrollment checks passed.
    Adapting: 100% complete.
    Enrolled model saved to "eft-hbg.snsr"
    

    As before, test with snsr-eval:

    % snsr-eval -t eft-hbg.snsr
    Using live audio from default capture device. ^C to stop.
       675   1500 (0.8921 sv) user-1/HBG
      3480   4500 (0.8245 sv) user-1/HBG
      7020   8160 (0.877 sv) user-1/HBG
    ^C
    

    The value in parentheses is the speaker verification (sv) score.
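
    Since spot-enroll performs EFT enrollment from recordings as well, you can enroll the same phrase non-interactively. A sketch, assuming you already have four usable recordings of "hello blue genie" (the file names below are hypothetical). Note the prompts above: the third and fourth recordings should include the phrase in context.

    % spot-enroll -vt model/eft-hbg-enUS-23.0.0.9.snsr \
        -o eft-hbg.snsr \
        +user-1 path/to/hello-blue-genie-{0,1,2,3}.wav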

Templates

Templates are models that add new behaviors to wake word, LVCSR, and STT models via composition. Templates have slots that we fill with models of the required type.

  1. Let's say we would like to reduce false accepts of the music command set used above by requiring its commands to be preceded by a low false-accept wake word like "voice genie". We can do this with the tpl-spot-sequential template:

    % snsr-eval -t model/tpl-spot-sequential-1.5.0.snsr \
        -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
        -f 1 model/spot-music-enUS-1.2.0-m.snsr \
        data/audio/voice-genie-music.wav
      5175   5865 play_music
    

    snsr-eval's -f slot filename option loads the named file into the specified slot. The tpl-spot-sequential documentation lists the slots the template supports and the types of models it expects for each.

    As expected, the new recognizer spots only the command directly following "voice genie".

    snsr-eval supports on-the-fly model composition, but what if we have code that already works with spot-music-enUS-1.2.0-m.snsr that we don't want to modify? Enter snsr-edit, which supports composition and setting changes and can save the result as a new, self-contained model:

    % snsr-edit -vvt model/tpl-spot-sequential-1.5.0.snsr \
        -o vg-music.snsr \
        -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
        -f 1 model/spot-music-enUS-1.2.0-m.snsr
    Loading "model/tpl-spot-sequential-1.5.0.snsr" as the template model.
    Loading "model/spot-voicegenie-enUS-6.5.1-m.snsr" into setting "0".
    Loading "model/spot-music-enUS-1.2.0-m.snsr" into setting "1".
    Output written to "vg-music.snsr".
    
    % snsr-eval -t vg-music.snsr data/audio/voice-genie-music.wav
      5175   5865 play_music
    

    tpl-spot-sequential has a loop setting that changes this behavior. Let's give that a try:

    % snsr-eval -t vg-music.snsr -s loop=1 data/audio/voice-genie-music.wav
      5175   5865 play_music
      8055   8790 next_song
    

    This recognizes the first two music commands, but not "stop music", as the gap between "next song" and "stop music" exceeds the listen-window, which is five seconds:

    % snsr-edit -t model/spot-music-enUS-1.2.0-m.snsr -q listen-window
    listen-window = 5
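
    Since snsr-edit also saves setting changes, you could bake loop mode into the composed model so callers don't have to pass -s each time. A sketch only: the output file name is illustrative, and persisting the loop setting this way is an assumption to verify against the template documentation:

    % snsr-edit -t vg-music.snsr -o vg-music-loop.snsr -s loop=1
    % snsr-eval -t vg-music-loop.snsr data/audio/voice-genie-music.wav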
    
  2. We can run two keyword spotters simultaneously with tpl-spot-concurrent:

    % snsr-eval -t model/tpl-spot-concurrent-1.5.0.snsr \
        -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
        -f 1 model/spot-music-enUS-1.2.0-m.snsr \
        data/audio/voice-genie-music.wav
      4485   5085 voicegenie
      5160   5865 play_music
      8055   8790 next_song
     14820  15705 stop_music
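
    As with the sequential template, snsr-edit can save this composition as a single self-contained model. A sketch; the output file name is illustrative:

    % snsr-edit -vvo vg-music-concurrent.snsr \
        -t model/tpl-spot-concurrent-1.5.0.snsr \
        -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
        -f 1 model/spot-music-enUS-1.2.0-m.snsr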
    

Speech to text (STT)

TrulyNatural STT includes support for modern transformer-based end-to-end recognizers suitable for transcription tasks.

  1. STT models use the same tools and API as wake words. Let's run a sample audio file through stt-enUS-automotive-medium-2.3.15-pnc.snsr with snsr-eval:

    % snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
        data/audio/voice-genie-set-cruise-control.wav
    P   2040   2960 God Boice jeni
    P   2240   3560 Voice. Genie said the creek
    P   2240   3960 Voice. Genie set the cruise control
    P   2240   4200 Voice. Genie set the cruise control to
    P   2240   4720 Voice. Genie set the cruise control to further
    P   2240   5200 Voice. Genie set the cruise control to fifty five. Mrs
    P   2240   5600 Voice Genie set the cruise control to fifty five miles back
    P   2240   5800 Voice Genie set the cruise control to fifty five miles per hour
    P   2240   5840 Voice Genie set the cruise control to fifty five miles per hour
    NLU intent: set_cruise_control (0.9969) = voice genie set the cruise control to 55 miles per hour
    NLU entity:   number (0.9931) = 55
    NLU entity:   speed_unit (0.9942) = miles per hour
      2240   5840 Voice Genie set the cruise control to fifty five miles per hour.
    

    Partial or interim hypotheses are shown prefixed with P. These provide useful feedback for live transcription tasks, but are less interesting when recognizing from a file. You can suppress them by setting partial-result-interval = 0:

    % snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
        -s partial-result-interval=0 \
        data/audio/voice-genie-set-cruise-control.wav
    NLU intent: set_cruise_control (0.9969) = voice genie set the cruise control to 55 miles per hour
    NLU entity:   number (0.9931) = 55
    NLU entity:   speed_unit (0.9942) = miles per hour
      2240   5840 Voice Genie set the cruise control to fifty five miles per hour.
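
    If you never want interim hypotheses from this model, you could use snsr-edit to save a copy with the setting baked in, since snsr-edit records setting changes in the model it writes. A sketch; the output file name is illustrative:

    % snsr-edit -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
        -o stt-no-partials.snsr \
        -s partial-result-interval=0
    % snsr-eval -t stt-no-partials.snsr \
        data/audio/voice-genie-set-cruise-control.wav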
    
  2. STT and LVCSR models produce a final recognition hypothesis only at end-of-file or when a VAD signals that speech has ended. For convenience, snsr-eval has an -a option that adds the tpl-vad-lvcsr template when you use live audio with an STT model that does not include a VAD.

    % snsr-eval -lt model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
    ERROR: With live audio LVCSR and STT models require a VAD. You can add one with the -a flag.
    
    % snsr-eval -alt model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
    NLU intent: no_command (0.9995) = the quick brown fox jumped over the lazy dog's back
    The quick Brown Fox jumped over the lazy dog's back.
    ^C
    

    Create a new model that includes a VAD:

    % snsr-edit -vvo vad-stt.snsr \
        -t model/tpl-vad-lvcsr-3.17.0.snsr \
        -f 0 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
    Loading "model/tpl-vad-lvcsr-3.17.0.snsr" as the template model.
    Loading "model/stt-enUS-automotive-medium-2.3.14-pnc.snsr" into setting "0".
    Output written to "vad-stt.snsr".
    
    % snsr-eval -lt vad-stt.snsr
    NLU intent: no_command (0.9995) = the quick brown fox jumped over the lazy dog's back
    The quick Brown Fox jumped over the lazy dog's back.
    
  3. Creating an STT model that's gated by a wake word is also easy with tpl-opt-spot-vad-lvcsr:

    % snsr-edit -vvo vg-vad-stt.snsr \
        -t model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr \
        -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
        -f 1 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
        -s include-wake-word-audio=1
    Loading "model/tpl-opt-spot-vad-lvcsr-1.24.0.snsr" as the template model.
    Loading "model/spot-voicegenie-enUS-6.5.1-m.snsr" into setting "0".
    Loading "model/stt-enUS-automotive-medium-2.3.14-pnc.snsr" into setting "1".
    Output written to "vg-vad-stt.snsr".
    

    include-wake-word-audio = 1 includes the wake word in the audio seen by the STT recognizer, but configures the STT to elide it from the recognition hypothesis. This improves recognition accuracy when there is no pause between the wake word and the STT command.

    Test with snsr-eval
    % snsr-eval -lt vg-vad-stt.snsr \
        data/audio/voice-genie-set-cruise-control.wav
    NLU intent: set_cruise_control (0.9968) = set the cruise control to 55 miles per hour
    NLU entity:   number (0.9937) = 55
    NLU entity:   speed_unit (0.9936) = miles per hour
    Set the cruise control to fifty five miles per hour.
    
    Test with snsr-eval and live audio
    % snsr-eval -t vg-vad-stt.snsr
    P   7180   7340 Said
    P   7220   7860 Said the radio
    P   7220   8220 Set the radio tonight
    P   7260   8580 Set the radio to ninety one
    P   7260   9060 Set the radio to ninety one point. Three
    P   7260   9260 Set the radio to ninety one point. Five
    P   7260   9620 Set the radio to ninety one point. Five f
    NLU intent: set_radio (0.9674) = set the radio to 91.5 FM
    NLU entity:   radio_station (0.9688) = 91.5 FM
      7260   9860 Set the radio to ninety one point. Five F. M.
    ^C
    

LVCSR (tnl)

Use grammar-based recognition for command-and-control tasks with small to medium-sized vocabularies on devices that aren't powerful enough for STT. VoiceHub provides a convenient interface for creating these models.

  1. Let's build a model that recognizes the example audio files in data/enrollments/. We'll use a grammar file from data/grammars/:

    % snsr-edit -vvo commands.snsr \
        -t model/lvcsr-build-enUS-2.7.3.snsr \
        -f grammar-stream data/grammars/enrollments-nlu-slot.txt \
        -s partial-result-interval=0
    Loading "model/lvcsr-build-enUS-2.7.3.snsr" as the template model.
    Loading "data/grammars/enrollments-nlu-slot.txt" into setting "grammar-stream".
    Output written to "commands.snsr".
    
    % snsr-eval -t commands.snsr data/enrollments/armadillo-1-0-c.wav
    NLU intent: calculate (0) =  18 percent of 643
    NLU entity:   percent (0) = 18
    NLU entity:   number (0) = 643
      375   3195 armadillo 18 percent of 643
    

    We set grammar-stream to the contents of data/grammars/enrollments-nlu-slot.txt to define which sentences the recognizer will accept, and partial-result-interval = 0 to suppress interim results.

    data/grammars/enrollments-nlu-slot.txt
    # LVCSR grammar specification for test utterances in data/enrollments/
    # Includes lightweight NLU slot markup.
    #
    # In a tpl-spot-vad-lvcsr pipeline the prefix would be consumed by the spotter.
    prefix = armadillo | jackalope | terminator;
    
    # Numbers used in the intent rule below
    number = 18 | 643 | 20 | 6;
    
    # Places
    place = target | winco | susan's house | gas;
    
    # Dates
    date = friday | tomorrow | next week;
    
    # List of known utterances in the *-c.wav files.
    intent =
    {calculate {percent $number} percent of {number}} |
    {call call the nearest {place}} |
    {navigate how far away is {place}} |
    {avcontrol {action play} {type more} songs by this artist} |
    {avcontrol {action record} a {type video}} |
    {startTimer start a timer for {number} minutes {unit :minutes}} |
    {navigate i'm running low on {place}} |
    {calendar {action cancel} all my {type meetings} on {date}} |
    {navigate directions to {place}} |
    {messaging do i have any new texts {action :query} {type :texts}} |
    {calendar {type open} my calendar to {date}} |
    {alarm {type set} an alarm for {number} am {date}};
    
    # Match the prefix and zero or one of the sentences.
    # <s> and </s> are sentence start and end markers that
    # match silence and small amounts of extraneous speech.
    g = <s> $prefix $intent? </s>;
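
    Grammar-based LVCSR models behave like STT models on live audio: they need a VAD to decide when an utterance has ended. A sketch, assuming the -a convenience flag described in the STT section applies to grammar models as well (the error message shown there suggests it does):

    % snsr-eval -alt commands.snsr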
    
  2. Let's combine the wake word we previously enrolled with this LVCSR model and tpl-opt-spot-vad-lvcsr:

    % snsr-edit -vvo ww-commands.snsr \
        -t model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr \
        -f 0 udt-kws.snsr \
        -f 1 commands.snsr \
        -s include-wake-word-audio=1
    Loading "model/tpl-opt-spot-vad-lvcsr-1.24.0.snsr" as the template model.
    Loading "udt-kws.snsr" into setting "0".
    Loading "commands.snsr" into setting "1".
    Output written to "ww-commands.snsr".
    

    And then run it over all the *-c.wav test recordings in data/enrollments/:

    % snsr-eval -t ww-commands.snsr data/enrollments/*-c.wav
    NLU intent: calculate (0) =  18 percent of 643
    NLU entity:   percent (0) = 18
    NLU entity:   number (0) = 643
      375   3195 armadillo 18 percent of 643
    NLU intent: call (0) = call the nearest target
    NLU entity:   place (0) = target
      4695   6360 armadillo call the nearest target
    NLU intent: navigate (0) = how far away is winco
    NLU entity:   place (0) = winco
      7680   9495 armadillo how far away is winco
    NLU intent: avcontrol (0) =  record a video
    NLU entity:   action (0) = record
    NLU entity:   type (0) = video
    14535  16095 armadillo record a video
    NLU intent: startTimer (0) = start a timer for 20 minutes minutes
    NLU entity:   number (0) = 20
    NLU entity:   unit (0) = minutes
    17640  19905 armadillo start a timer for 20 minutes minutes
    NLU intent: navigate (0) = i'm running low on gas
    NLU entity:   place (0) = gas
    21060  22935 jackalope i'm running low on gas
    NLU intent: calendar (0) =  cancel all my meetings on friday
    NLU entity:   action (0) = cancel
    NLU entity:   type (0) = meetings
    NLU entity:   date (0) = friday
    24315  26655 jackalope cancel all my meetings on friday
    NLU intent: navigate (0) = directions to susan's house
    NLU entity:   place (0) = susan's house
    27915  30195 jackalope directions to susan's house
    NLU intent: messaging (0) = do i have any new texts query texts
    NLU entity:   action (0) = query
    NLU entity:   type (0) = texts
    31500  33525 jackalope do i have any new texts query texts
    NLU intent: calendar (0) =  open my calendar to next week
    NLU entity:   type (0) = open
    NLU entity:   date (0) = next week
    34695  36975 jackalope open my calendar to next week
    NLU intent: alarm (0) =  set an alarm for 6 am tomorrow
    NLU entity:   type (0) = set
    NLU entity:   number (0) = 6
    NLU entity:   date (0) = tomorrow
    38160  40665 jackalope set an alarm for 6 am tomorrow