Frequently Asked Questions¶

Is this SDK thread-safe?¶

Yes, as long as Session and Stream handles are not shared between threads. The number of handles per thread are limited only by system resources.

If you need to share one of these handles across threads, you must provide application-level mutual exclusion locking.

Note

There is just one exception to this requirement: You may call stop on a Session handle from a different thread than the one run is executing on.

If you replace the dynamic memory allocator with config and CONFIG_ALLOC the new allocator implementation must be thread-safe. Use allocLock to add thread-safety to an allocator that is not.

How do I diagnose wake word audio issues?¶

Create a new wake word model from the tpl-spot-debug-type template. See the notes and example.

How do I enroll Fixed Trigger models?¶

EFT models use the same API, and follow the same enrollment recipe as UDT models.

Replace the UDT model udt-universal-3.67.1.0.snsr in any of the examples with an EFT enrollment model such as eft-hbg-enUS-23.0.0.9.snsr.

Can I run two wake word models at the same time?¶

Yes, see tpl-spot-concurrent.

What is a Command Set?¶

Command sets are phrase spotters with more than one phrase. These are frequently tuned to have a limited listen-window.

Command set recognizers have task-type == phrasespot and can be used as a drop-in replacement for any wake word. No code changes are required.

Most command sets are tuned for use after an always-listening keyword spotter. The tpl-spot-sequential template provides a convenient way to build such a model.

Can I create a trigger-to-search model?¶

Yes. Create a new phrase spot model from the tpl-spot-vad template.

How do I improve the user experience for wake words in poor audio environments?¶

Use a spotter model with Smart Wake Word support. See low-fr-operating-point and duration-ms.

Can I use models from the beta releases?¶

Yes. This release is compatible with older models, but it requires a modification to the task requirement sanity checks.

Use "~0.5.0 || 1.0.0" instead of "1.0.0", for example:

C/C++Java

snsrRequire(session, SNSR_TASK_VERSION,  "~0.5.0 || 1.0.0");

session.require(Snsr.TASK_VERSION,  "~0.5.0 || 1.0.0");

The models included in the v6.0.0 release use task-version values of 1.0.0. This makes these models incompatible with 5.0.0-beta releases.

How can I reduce application code size?¶

By default, any applications linked against the TrulyNatural library can run any model (.snsr) file supported by the library. You can reduce the overall code size of an application by limiting the library capabilities to only the models of interest.

Use snsr-edit with the -i flag to create custom initialization code that references only the modules used by the models included in your application. For example:

% snsr-edit -v -i -t spot-voicegenie-enUS-6.5.1-m.snsr
Output written to "snsr-custom-init.c".

This creates a custom initialization file, snsr-custom-init.c, that references only the code modules used by spot-voicegenie-enUS-6.5.1-m.snsr. Add this file to your application, and compile with -DSNSR_USE_SUBSET This will replace all calls to snsrNew with a variant that initializes only the required modules.

You can further reduce code size by linking at the function instead of the module level. See sample/c/Makefile for compiler and linker flag examples (-ffunction-sections).

How do I spot phrases on a Real-Time Operating System (RTOS) with a custom audio driver and no filesystem?¶

You should implement a new custom stream similar to data-stream.c which is used in spot-data-stream.c. This shows how to make a custom stream which should encapsulate your audio driver functionality, and which your Session can pull data from.

An alternative is pushing data onto a stream. See spot-data.c. You can take data chunks of any size (perhaps provided by your audio driver) and push them onto a stream to be read by an Session.

Can I avoid dynamic memory allocation?¶

You can avoid all calls to malloc(), realloc(), and free() by replacing the memory allocator with CONFIG_ALLOC.

For embedded use we recommend allocTLSF. Use this with one or more pre-defined read-write memory segments that remain valid for the lifetime of the application.

How do I display international characters in results?¶

On Windows systems, when using Sensory STT models with snsr-eval v7.3.0 or earlier, international characters such as Chinese (zhCN) may appear as garbled symbols such as "Σ╜á σÑ╜ σÉù" instead of correct UTF-8 characters "您好吗".

This is a display encoding issue, not an issue with the recognition output itself.

Solution Options

Set Console Code Page to UTF-8

Before running snsr-eval.exe, run the following command in the Windows Command Prompt:
```
chcp 65001
```
This sets the console's code page to UTF-8, enabling correct display of international characters.

snsr-eval v7.4.0 and later does this before writing any output.
Enable System-Wide UTF-8 Support (Recommended for Long-Term Use)
- Open Settings > Time & Language > Administrative Language Settings
- Under Change system locale, check: "Beta: Use Unicode UTF-8 for worldwide language support"
- Save your changes and restart your computer to apply them
This setting ensures that all applications and the console will handle UTF-8 properly by default.

How do I improve wake word performance?¶

Contact Sensory if interested in pursuing these customizations. There may be additional cost involved. Not all combinations may be possible depending on platform and trigger specification.

How to measure real-time factor and MIPS¶

To measure the real-time factor, time how long it takes to run the spotter over a long audio file. Then, real time factor = (run time in seconds) / (length of audio in seconds).
To measure the MIPS on your device, use a profiler like perf when running the spotter over an audio file. Then, MIPS = (No. of instructions) / (length of audio in seconds * 1000000).

What if the spotter runs too slow, or consumes too many cycles?¶

You could explore one of these options to see an improvement: Try multi-threaded, frame-stacked, or little-big spotters. You may also want to get a smaller spotter model, which uses less CPU (in proportion to its size) with a small reduction in FA and FR performance. Contact Sensory to see if these options are right for you.

What if the spotter consumes too much memory?¶

Contact Sensory for a smaller model.
If your platform runs code directly from ROM, consider converting the spotter to compiled-in code. This will run from read-only code space, and reduce heap requirements. Use the snsr-edit tool to create a C source file from any spotter model. See fromCode and examples spot-data-stream.c and spot-data.c

What is a little-big spotter?¶

A little-big spotter does sequential recognition by first running a low-power spotter. When this spots, it re-processes the audio with a high-power state-of-the-art spotter. This reduces average CPU cycles (and hence power) required to run a spotter with a small increase in latency. This one combined model has the behavior of a high-power spotter.

What is a frame-stacked spotter?¶

Frame stacked spotters reduce the CPU load by 30-45%, in exchange for a small reduction in FA and FR performance.The resolution of time alignments is also reduced by a factor of two.

What is a multi-threaded spotter?¶

Multi-threaded spotters speed up execution on CPUs with more than one core.

Key Factors to consider¶

Are you willing to trade increased latency for fewer compute cycles?
Are you willing to distribute computation across multiple cores?
Yes to both 1 and 2?

How do I use Large Vocabulary Continuous Speech Recognition? tnl¶

This TrulyNatural release includes three different ways of running a speech-to-text recognizer: without audio segmentation, with VAD audio segmentation, and with wake word gated VAD.

Note

The ^result callback only happens when a VAD endpoint is detected, or the end of the input stream is reached. For applications with live audio recognition, LVCSR recognizers should always be used with a VAD, such as tpl-opt-spot-vad-lvcsr, tpl-spot-vad-lvcsr, or tpl-vad-lvcsr.

LVCSR without audio segmentation¶

The stt-enUS-automotive-medium-2.3.15-pnc.snsr model included in this distribution is a generic broad-domain US English speech-to-text recognizer with a special domain focus on automotive commands.

% bin/snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    data/enrollments/armadillo-1-3-c.wav
P     40    200 Im
P     80    640 Armadillo
P    120   1120 Armadillo playing
P    120   1520 Armadillo play marsa
P    120   1880 Armadillo play more songs by
P    120   2320 Armadillo play more songs by this art
P    120   2600 Armadillo play more songs by this artist
P    120   2640 Armadillo play more songs by this artist
NLU intent: music_player (0.9849) = armadillo play more songs by this artist
   120   2640 Armadillo play more songs by this artist.

Preliminary or partial results above are prefixed with P. Suppress these by setting the partial-result-interval to 0:

% bin/snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -s partial-result-interval=0 \
    data/enrollments/armadillo-1-3-c.wav
NLU intent: music_player (0.9849) = armadillo play more songs by this artist
   120   2640 Armadillo play more songs by this artist.

LVCSR with VAD-segmented audio¶

Large vocabulary recognizers perform better when used with a Voice Activity Detector that removes extraneous leading and trailing silence.

Create such a VAD-lvcsr model using the tpl-vad-lvcsr template:

% bin/snsr-edit -t model/tpl-vad-lvcsr-3.17.0.snsr \
    -f 0 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -o vad-stt-enUS-automotive-medium-pnc.snsr

Evaluate using snsr-eval:

% bin/snsr-eval -t vad-stt-enUS-automotive-medium-pnc.snsr \
    data/enrollments/armadillo-1-0-c.wav
P    230    830 Armadilla
P    270   1150 Armadillo, eight
P    310   1630 Armadillo, eighteen percent
P    310   1910 Armadillo. Eighteen percent of s
P    310   2430 Armadillo, eighteen percent of six hundred
P    310   2790 Armadillo, eighteen percent of six hundred and forty
P    310   3150 Armadillo, eighteen percent of six hundred forty three
NLU intent: no_command (0.9765) = armadillo eighteen percent of 643
NLU entity:   number (0.9564) = 643
   310   3190 Armadillo, eighteen percent of six hundred forty three.

LVCSR following a wake word¶

The tpl-spot-vad-lvcsr template provides a way to start a large-vocabulary recognizer with a spotted wake word. In this example we'll enroll a wake-word, then use the enrolled spotter with the broad-domain recognizer.

Create an enrolled spotter for "jackalope":

% spot-enroll -vt model/udt-universal-3.67.1.0.snsr \
    +jackalope \
    data/enrollments/jackalope-1-0.wav \
    data/enrollments/jackalope-1-1.wav \
    data/enrollments/jackalope-1-4.wav \
    data/enrollments/jackalope-1-3.wav
Adapting: 100% complete.
Enrolled model saved to "enrolled-sv.snsr"

Combine the enrolled spotter and the broad-domain recognizer using the tpl-spot-vad-lvcsr-3.23.0.snsr template:

% snsr-edit -vt model/tpl-spot-vad-lvcsr-3.23.0.snsr \
  -f 0 enrolled-sv.snsr \
  -f 1 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
  -s include-leading-silence=1 \
  -o jackalope-stt-enUS-automotive-medium-pnc.snsr
Saved edited model to "jackalope-stt-enUS-automotive-medium-pnc.snsr".

Evaluate using snsr-eval. Note that the wake word is not included in the LVCSR transcription.

% snsr-eval -t jackalope-stt-enUS-automotive-medium-pnc.snsr \
    data/enrollments/jackalope-1-2-c.wav
P   1050   1530 Directions
P   1050   1930 Directions to sus
P   1050   2370 Directions to Susan's house
P   1050   2530 Directions to Susan's house
NLU intent: navigation (0.9973) = directions to susan's house
NLU entity:   navigation_location (0.9811) = susan's house
  1050   2530 Directions to Susan's house.

LVCSR with lightweight NLU parsing¶

The included LVCSR and STT models support a lightweight natural language mark-up. This can significantly simplify application code that has to interpret recognition results. See grammar-based recognition for a description of the grammar syntax.

NLU with custom grammar recognizers¶

% snsr-eval -t model/lvcsr-build-enUS-12.13.1-5MB.snsr \
    -s partial-result-interval=0 \
    -f grammar-stream data/grammars/enrollments-nlu-slot.txt \
    data/enrollments/armadillo-1-4-c.wav
NLU intent: avcontrol (0) =  record a video
NLU entity:   action (0) = record
NLU entity:   type (0) = video
   435   1995 armadillo record a video

NLU with broad-domain recognizers¶

In TrulyNatural v6.16.0 and later, NLU parsing is a separate processing step that occurs after the ^result event. NLU parsing includes a special . symbol that matches any input word. This allows crafting of more robust island parsers that can be used with free-form recognition results from a broad-domain model.

This small example detects a small set of microwave control commands using lvcsr-lib-enUS-1.2.0.snsr.

Note

The stt-enUS-automotive-medium-2.3.15-pnc.snsr model includes machine-learned NLU processing for automotive command tasks. If you use nlu-grammar-stream with this model the grammar-based NLU will override the machine-learned NLU parsing.

tiny-microwave.nlu

# Microwave command NLU post-processor grammar
# tiny-microwave.nlu

# power level setting, "fifty percent". don't capture optional "power"
power = ~s.percent power?;

# timer duration, "two minutes and ten seconds"
duration = ~s.timer;

# defrost command: the word "defrost" followed by
# zero or more power or duration values, both captured
# .* matches any input word sequence
defrost = defrost ( .* ({power} | {duration}) .* )* ;

# default action matches any input and discards it
default = .:*;

# set clock time: the word "clock" or "time" followed by
# a time ("seven twenty nine pm").
# ignore spurious words before and after the time specification
clock = (clock | time) .* {time ~s.time} .*;

# list of all the actions we've defined, captured
action = {defrost} | {clock} | {default};

# match any one of the actions, ignoring unknown words before
# and after
nlu = <s> .* $action .* </s>;

Build and run a recognizer with live input.

% snsr-eval -vat model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -t model/lvcsr-lib-enUS-1.2.0.snsr \
    -f nlu-grammar-stream tiny-microwave.nlu \
    -s partial-result-interval=0
Using live audio from default capture device. ^C to stop.

# "Defrost my soup for 15 minutes at 30% power"
Using live audio from default capture device. ^C to stop.
  4035   8835 [^end] VAD speech region.
NLU intent: defrost (0) = defrost my soup for 15 minutes at thirty percent power
NLU entity:   duration (0) = 15 minutes
NLU entity:   power (0) = thirty percent power
  4310   8470 (0.4805) Defrost my soup for fifteen minutes at thirty percent power.

# "Could you set the clock to 3:43 pm?"
 48165  51810 [^end] VAD speech region.
NLU intent: clock (0) = clock to 15:43
NLU entity:   time (0) = 15:43
 48360  51360 (0.163) Could you set the clock to three? Forty three P? M.

Dealing with NLU parse ambiguity¶

It is possible to get more than one valid parse result if the NLU grammar introduces ambiguity. The NLU processor scores these alternates and returns the best hypotheses in order, up to nlu-match-max. During the ^nlu-slot callback, nlu-match-count reports the number of alternates available, with nlu-match-index the current alternate.

nlu-match-max defaults to 1 for best compatibility with earlier releases.

Warning

Resolving NLU ambiguity can be expensive both in terms of computation and heap memory use.

Avoid using patterns that match arbitrary input in multiple ways:

g = <s> {left .*} {right .*} </s>;

This example uses two NLU grammars: system.nlu for basic functionality provided by a product, and app.nlu to extend NLU processing for a plug-in application. If the application duplicates some of the system NLU actions we need these reported for the system to take appropriate action.

system.nlu

# system.nlu
volume = volume: {volume-level ~s.percent};
preset = preset: number:? ~s.number-integer-0-9;
system = {volume} | {preset};
# :/-0.1 adds a small weight bias towards the ~app class, so
# ~app will outscore $system for identical matches
plugin = :/-0.1 ~app;
action = {system} | {plugin};
nlu = <s> $action </s>;

app.nlu

# app.nlu
media-control = ~s.control.media;
preset = preset: ( one | two | three | four | five );
nlu = {media-control} | {preset};

Build and run a recognizer with live input. Set the value for nlu-match-max to allow up to ten alternate matches.

% snsr-eval -vvat model/stt-enUS-automotive-medium-2.3.14-pnc.snsr \
    -t model/lvcsr-lib-enUS-1.2.0.snsr \
    -f nlu-grammar-stream system.nlu \
    -f nlu-grammar-stream.app app.nlu \
    -s partial-result-interval=0 \
    -s nlu-match-max=10
Using live audio from default capture device. ^C to stop.

# "volume 50%"
# in system grammar
  5235 [^begin]
  4710   6645 [^end] VAD speech region.
NLU intent: system (0) =  fifty percent
NLU entity:   volume.volume-level (0) = fifty percent
NLU  1/1 nlu-slot-value.system (0) = { volume { volume-level fifty percent } }
NLU  1/1 nlu-slot-value.system.volume (0) = { volume-level fifty percent }
NLU  1/1 nlu-slot-value.system.volume.volume-level (0) = fifty percent
phrase:
  4990   6270 (0.8939) Volume. Fifty percent.
words:
  4990   5470 (0.8955) Volume.
  5550   5870 (0.9986) Fifty
  5950   6270 (0.9996) percent.

 # "fast forward"
# in plugin grammar
 17070 [^begin]
 16545  17940 [^end] VAD speech region.
NLU intent: plugin (0) =  fast forward
NLU entity:   media-control (0) = fast forward
NLU  1/1 nlu-slot-value.plugin (0) = { media-control fast forward }
NLU  1/1 nlu-slot-value.plugin.media-control (0) = fast forward
phrase:
 16860  17540 (0.7646) Fast forward.
words:
 16860  17100 (0.9913) Fast
 17220  17540 (0.7713) forward.

# "preset 5"
# in both system and plugin grammars, but plugin reported first
# due to the weight bias
 22290 [^begin]
 21765  23325 [^end] VAD speech region.
NLU intent: plugin (0) =  five
NLU entity:   preset (0) = five
NLU  1/2 nlu-slot-value.plugin (0) = { preset five }
NLU  1/2 nlu-slot-value.plugin.preset (0) = five
NLU intent: system (0) =  five
NLU entity:   preset (0) = five
NLU  2/2 nlu-slot-value.system (0) = { preset five }
NLU  2/2 nlu-slot-value.system.preset (0) = five
phrase:
 22040  22920 (0.9432) Preset. Five.
words:
 22040  22480 (0.9443) Preset.
 22680  22920 (0.9988) Five.

How do I take action on an NLU result?¶

You can think of an intent as specifying which function or method you should call to perform an action. Entities identify parts of the utterance that include additional detail. For example, a call_contact intent might have a contact_name entity that specifies who to call.

Register a handler for ^nlu-intent
In this handler,
- Retrieve nlu-intent-name as a string.
- Map this intent name to an action. Do this by comparing the intent name to all valid intent names for which you want to perform an action.
- If the matched action requires additional data, retrieve the expected nlu-entity-value by name.
- Call a function (specified by the intent value) with zero or more arguments specified by the entity values.
- Return from the intent event handler with OK.