API overview

This is a brief overview of the API design goals, the SDK's conceptual model, and the two supported audio processing modes.

Design goals

The TrulyNatural SDK API is a result of these design goals:

- Pure C implementation.
  - Lowest common denominator, widest toolchain availability.
  - No C++ runtime overhead.
  - Fast.
- Simple API.
  - Small footprint: limited number of functions and data types.
  - Generic, independent of the inference task.
  - Fundamental data types only: floating point, integer, strings, streams, and opaque object instance handles.
  - Makes it easier to provide bindings for languages other than C.
- Flexible configuration.
  - Hide complexity,
  - but still allow for fine-grained configuration if needed.
  - Settings are indexed by string names; documented settings define the public API.
- One self-contained model per task.
  - Model includes a flow graph that specifies how various low-level internal modules (feature extractors, acoustic models, etc.) connect and interact.
  - Includes all required module configurations.
- Run on a wide variety of platforms, including ones without file system support.

There is a significant downside to these design choices: Discoverability is very limited. You cannot determine model behavior from function or method names alone. You must refer to the model type documentation for expected task behavior and available settings.

Conceptual model

This library uses a dataflow approach to evaluate speech recognition tasks. It uses inversion of control: The SDK invokes event handlers to report results and control task flow.
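As a rough illustration of inversion of control, the sketch below shows a toy engine that owns the processing loop and calls back into application code. Every name here (Engine, engineSetHandler, engineRun, RC_OK, RC_STOP) is hypothetical and chosen for this example only; none of it is the SDK's actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the SDK's API. */
typedef enum { RC_OK = 0, RC_STOP = 1 } Rc;

/* An event handler receives the event payload plus caller-supplied state. */
typedef Rc (*EventHandler)(const char *text, void *userData);

typedef struct {
    EventHandler onResult;
    void *userData;
} Engine;

/* Register the handler the engine will invoke for each result. */
static void engineSetHandler(Engine *e, EventHandler h, void *userData) {
    assert(e != NULL);
    e->onResult = h;
    e->userData = userData;
}

/* The engine owns the loop. It reports each item through the handler and
 * stops early when the handler asks it to. Returns the number of items
 * processed. */
static size_t engineRun(Engine *e, const char *const *items, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        if (e->onResult && e->onResult(items[i], e->userData) == RC_STOP) {
            return i + 1;
        }
    }
    return n;
}

/* Application code: count results and request a stop on the word "stop". */
static Rc countUntilStop(const char *text, void *userData) {
    int *count = (int *)userData;
    (*count)++;
    return strcmp(text, "stop") == 0 ? RC_STOP : RC_OK;
}
```

The application never iterates over the results itself. It registers countUntilStop, calls engineRun once, and steers the flow only through the handler's return value.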

The API contains two primary data types: a Session, used for model inference, and a Stream, an abstraction for input and output.

Sessions hold the entire state of a model instance, and use streams for all input and output. There is, for example, a single load function to load a model into a session, yet it supports loading from a named file, an open FILE * handle, a memory segment, the code segment, and compressed assets on Android.
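To make the idea of one load function serving many sources concrete, here is a minimal sketch of a stream interface built on a function pointer. The Stream layout and the streamFromFile, streamFromMemory, and loadModel names are assumptions made for this example; the SDK's real stream types are opaque and may work quite differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stream interface, not the SDK's actual types. */
typedef struct Stream Stream;
struct Stream {
    size_t (*read)(Stream *s, void *buf, size_t n);
    void *ctx;       /* FILE * for file streams, base pointer for memory */
    size_t pos, len; /* used by the memory stream only */
};

static size_t fileRead(Stream *s, void *buf, size_t n) {
    return fread(buf, 1, n, (FILE *)s->ctx);
}

static size_t memRead(Stream *s, void *buf, size_t n) {
    size_t left = s->len - s->pos;
    if (n > left) n = left;
    memcpy(buf, (const char *)s->ctx + s->pos, n);
    s->pos += n;
    return n;
}

static Stream streamFromFile(FILE *f) {
    Stream s = { fileRead, f, 0, 0 };
    return s;
}

static Stream streamFromMemory(const void *base, size_t len) {
    Stream s = { memRead, (void *)base, 0, len };
    return s;
}

/* One loader for every source: it sees only the Stream interface.
 * Copies up to cap bytes into dst and returns the number of bytes read. */
static size_t loadModel(Stream *s, char *dst, size_t cap) {
    size_t total = 0, got;
    assert(s != NULL && s->read != NULL);
    while (total < cap && (got = s->read(s, dst + total, cap - total)) > 0) {
        total += got;
    }
    return total;
}
```

Because loadModel depends only on the read callback, supporting another source (for example, a platform asset) would mean adding one constructor, not another loader.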

Models (snsr files) define flow pipelines and session behavior. These contain the serialized content¹ of the session flow graph, including all binary models and configurations. Think of these as hierarchical key-value databases. Once loaded into a Session, you can query or change the setting keys with generic getter and setter functions.
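The string-keyed getter and setter idea can be sketched with a small lookup table. This is a toy, flat store with slash-separated key names standing in for hierarchy; the Settings type, the setDouble and getDouble functions, and the key "spotter/threshold" are all invented for illustration and are not documented SDK settings.

```c
#include <assert.h>
#include <string.h>

#define MAX_SETTINGS 8

/* Hypothetical settings store; keys are borrowed pointers, so callers
 * must keep them alive (string literals work). */
typedef struct { const char *key; double value; } Setting;
typedef struct { Setting items[MAX_SETTINGS]; int count; } Settings;

/* Generic setter: updates an existing key or appends a new one.
 * Returns 0 on success, -1 if the table is full. */
static int setDouble(Settings *s, const char *key, double value) {
    int i;
    assert(key != NULL);
    for (i = 0; i < s->count; i++) {
        if (strcmp(s->items[i].key, key) == 0) {
            s->items[i].value = value;
            return 0;
        }
    }
    if (s->count == MAX_SETTINGS) return -1;
    s->items[s->count].key = key;
    s->items[s->count].value = value;
    s->count++;
    return 0;
}

/* Generic getter: returns 0 and writes *out, or -1 for an unknown key. */
static int getDouble(const Settings *s, const char *key, double *out) {
    int i;
    assert(key != NULL);
    for (i = 0; i < s->count; i++) {
        if (strcmp(s->items[i].key, key) == 0) {
            *out = s->items[i].value;
            return 0;
        }
    }
    return -1;
}
```

The trade-off matches the discoverability note earlier in this overview: the function names say nothing about which keys exist, so the valid key names have to come from documentation.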

Processing modes

We support two modes for audio processing:

- Pull mode, where the run function reads audio from the configured input stream. This blocks on read until new data are available. The run function returns only when the stream runs out of data (for example the end of a file), an event handler tells it to stop, or an error occurs.
- Push mode, where the application repeatedly calls the push function with small chunks of the audio data. The push function returns once it has processed or buffered these data. The application eventually calls stop to flush and process any buffered data.
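The difference between the two modes comes down to who owns the loop. The toy processor below illustrates this under simplifying assumptions: it only counts fixed-size frames, and the names Proc, procPush, procStop, procRun, and ArraySource are hypothetical, not SDK functions. In this sketch the pull-mode loop is built on top of the push path; whether the SDK is structured that way is not stated here.

```c
#include <assert.h>
#include <stddef.h>

#define FRAME 4 /* samples per processing frame (arbitrary for this sketch) */

typedef struct {
    short buf[FRAME];
    size_t fill;   /* samples currently buffered */
    size_t frames; /* frames processed so far */
} Proc;

static void processFrame(Proc *p) { p->frames++; p->fill = 0; }

/* Push mode: the caller hands over chunks of any size. Whole frames are
 * processed immediately; the remainder stays buffered. */
static void procPush(Proc *p, const short *samples, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        p->buf[p->fill++] = samples[i];
        if (p->fill == FRAME) processFrame(p);
    }
}

/* Stop: flush and process whatever partial frame is still buffered. */
static void procStop(Proc *p) {
    if (p->fill > 0) processFrame(p);
}

/* Pull mode: the processor owns the loop and reads from a source until
 * the source reports no more data. */
typedef size_t (*ReadFn)(void *ctx, short *dst, size_t max);

static void procRun(Proc *p, ReadFn readFn, void *ctx) {
    short chunk[3];
    size_t got;
    assert(readFn != NULL);
    while ((got = readFn(ctx, chunk, 3)) > 0) procPush(p, chunk, got);
    procStop(p);
}

/* A finite in-memory source standing in for a file or audio device. */
typedef struct { const short *data; size_t len, pos; } ArraySource;

static size_t arrayRead(void *ctx, short *dst, size_t max) {
    ArraySource *src = (ArraySource *)ctx;
    size_t n = src->len - src->pos;
    size_t i;
    if (n > max) n = max;
    for (i = 0; i < n; i++) dst[i] = src->data[src->pos + i];
    src->pos += n;
    return n;
}
```

With ten samples, both paths process three frames: two full frames plus one partial frame flushed at the end. In push mode the application decides when to call procPush and procStop; in pull mode a single procRun call does not return until the source is exhausted.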

Model evaluation typically follows this recipe:

Pull mode

1. Create a new session instance with new.
2. Load a task model into the instance.
3. Set the input source stream.
4. Register one or more event handlers.
5. Enter the main loop by calling run. The library will process the input streams and invoke event handlers at appropriate times. The main loop continues until a terminating condition is reached, such as an event handler returning an error code.
6. Release the session instance.

Examples: live-spot.c, evalUDT.java

Push mode

1. Create a new session instance with new.
2. Load a task model into the instance.
3. Register one or more event handlers.
4. Process audio segments by calling push repeatedly. This will invoke event handlers before push returns.
5. Call stop once to flush any buffered audio.
6. Release the session instance.

Example: push-audio.c

This version of TrulyNatural SDK API supports two language bindings: C and Java. There's a one-to-one² mapping between C functions and Java methods.

¹ Similar in concept to protocol buffers, but with streamed unpacking into native data structures in RAM, no need for accessor functions, and additional features such as conversion to code for running from the text segment.
² Memory management and error handling differ.