Skip to main content

Modules Guide

Complete guide to Audio Capture, Speech-to-Text, and Language Models.

Overview

ElysGenAI provides three integrated modules:

Microphone → Audio Capture → STT → Transcribed Text → LLM → Generated Response

Modules:

  1. Audio Capture - Microphone input and routing
  2. STT - Speech-to-text using Whisper
  3. LLM - Language models using Phi-3-mini

Audio Capture System

Quick Start

// Get subsystem
UERP_AudioCaptureSubsystem* AudioSubsystem =
GetGameInstance()->GetSubsystem<UERP_AudioCaptureSubsystem>();

// Start capturing
AudioSubsystem->StartCapture();

Consumer Pattern

Implement IERP_AudioConsumer to receive audio:

UCLASS()
class UMyAudioConsumer : public UObject, public IERP_AudioConsumer
{
GENERATED_BODY()

public:
virtual void OnAudioDataReceived_Implementation(const FERP_AudioBuffer& Buffer) override
{
// Process audio
ProcessAudio(Buffer.AudioData);
}

virtual FString GetConsumerName_Implementation() const override
{
return TEXT("MyAudioConsumer");
}
};

Register Consumer

void UMyComponent::BeginPlay()
{
Super::BeginPlay();

UERP_AudioCaptureSubsystem* AudioSubsystem =
GetGameInstance()->GetSubsystem<UERP_AudioCaptureSubsystem>();

if (AudioSubsystem)
{
AudioSubsystem->RegisterConsumer(this);
}
}

void UMyComponent::EndPlay(const EEndPlayReason::Type Reason)
{
if (UERP_AudioCaptureSubsystem* AudioSubsystem =
GetGameInstance()->GetSubsystem<UERP_AudioCaptureSubsystem>())
{
AudioSubsystem->UnregisterConsumer(this);
}

Super::EndPlay(Reason);
}

Push-to-Talk Modes

UENUM(BlueprintType)
enum class EElysPushToTalkMode : uint8
{
AlwaysOn, // Continuous capture
PushToTalk, // Hold key to talk
PushToMute // Hold key to mute
};

Configuration:

AudioSubsystem->SetPushToTalkMode(EElysPushToTalkMode::PushToTalk);
AudioSubsystem->SetPushToTalkActive(true); // Key pressed
AudioSubsystem->SetPushToTalkActive(false); // Key released

Input Binding:

// In PlayerController
void AMyPlayerController::SetupInputComponent()
{
Super::SetupInputComponent();

InputComponent->BindAction("VoiceChat", IE_Pressed, this, &AMyPlayerController::StartVoiceChat);
InputComponent->BindAction("VoiceChat", IE_Released, this, &AMyPlayerController::StopVoiceChat);
}

void AMyPlayerController::StartVoiceChat()
{
auto* AudioSubsystem = GetGameInstance()->GetSubsystem<UERP_AudioCaptureSubsystem>();
AudioSubsystem->SetPushToTalkActive(true);
}

void AMyPlayerController::StopVoiceChat()
{
auto* AudioSubsystem = GetGameInstance()->GetSubsystem<UERP_AudioCaptureSubsystem>();
AudioSubsystem->SetPushToTalkActive(false);
}

Audio Formats

USTRUCT(BlueprintType)
struct FERP_AudioFormat
{
int32 SampleRate; // 16000 (STT), 48000 (voice chat)
int32 NumChannels; // 1 (mono), 2 (stereo)
int32 BitDepth; // 16
};

STT Default: 16 kHz, mono, 16-bit. Voice Chat Default: 48 kHz, stereo, 16-bit.

Mute Control

// Mute microphone
AudioSubsystem->SetMuted(true);

// Check mute status
bool bIsMuted = AudioSubsystem->IsMuted();

Speech-to-Text (STT)

Quick Setup

// 1. Add component to Actor
UPROPERTY(VisibleAnywhere, BlueprintReadOnly)
UERP_STTComponent* STTComponent;

// 2. Bind event
STTComponent->OnTranscriptionComplete.AddDynamic(
this, &AMyActor::OnTranscriptionReceived);

// 3. Start listening
STTComponent->StartListening();

// 4. Handle results
void AMyActor::OnTranscriptionReceived(const FERP_STTResult& Result)
{
UE_LOG(LogTemp, Log, TEXT("Transcription: %s"), *Result.TranscribedText);
}

Component API

StartListening:

UFUNCTION(BlueprintCallable, Category="ElysGenAI|STT")
void StartListening();

StopListening:

UFUNCTION(BlueprintCallable, Category="ElysGenAI|STT")
void StopListening();

IsListening:

UFUNCTION(BlueprintPure, Category="ElysGenAI|STT")
bool IsListening() const;

SetLanguageCode:

UFUNCTION(BlueprintCallable, Category="ElysGenAI|STT")
void SetLanguageCode(const FString& LanguageCode);

Configuration

Component Properties:

UPROPERTY(EditAnywhere, BlueprintReadWrite, Category="ElysGenAI|STT")
bool bAutoStartListening = false;

UPROPERTY(EditAnywhere, BlueprintReadWrite, Category="ElysGenAI|STT")
FString LanguageCode = TEXT("en");

UPROPERTY(EditAnywhere, BlueprintReadWrite, Category="ElysGenAI|STT")
bool bEnableVAD = true; // Voice activity detection

UPROPERTY(EditAnywhere, BlueprintReadWrite, Category="ElysGenAI|STT")
float MinConfidence = 0.5f;

Project Settings: Project Settings → Elys GenAI Framework → STT

  • Backend: Whisper
  • Model Path: (empty = use bundled)
  • NumThreads: 4 (match CPU cores)
  • Enable VAD: true

Events

OnTranscriptionComplete:

DECLARE_DYNAMIC_MULTICAST_DELEGATE_OneParam(
FElysSTTResultDelegate,
const FERP_STTResult&, Result
);

UPROPERTY(BlueprintAssignable, Category="ElysGenAI|STT")
FElysSTTResultDelegate OnTranscriptionComplete;

Result Structure:

USTRUCT(BlueprintType)
struct FERP_STTResult
{
UPROPERTY(BlueprintReadOnly)
FString TranscribedText;

UPROPERTY(BlueprintReadOnly)
float Confidence; // 0.0-1.0

UPROPERTY(BlueprintReadOnly)
FString Language;

UPROPERTY(BlueprintReadOnly)
bool bIsFinal;
};

Whisper Models

| Model   | Size    | Use Case                     |
|---------|---------|------------------------------|
| tiny    | ~75 MB  | Testing, prototyping         |
| base.en | ~74 MB  | Recommended: English only    |
| base    | ~142 MB | Multilingual (99+ languages) |
| small   | ~466 MB | Better accuracy              |
| medium  | ~1.5 GB | High accuracy                |
| large   | ~3 GB   | Best accuracy                |

Supported Languages: en, es, fr, de, it, pt, nl, ru, zh, ja, ko, ar, hi, and 87+ more

Voice Activity Detection (VAD)

VAD filters silence automatically:

STTComponent->SetEnableVAD(true);  // Enabled by default

// Adjust sensitivity (0.0-1.0)
// Higher = more aggressive filtering
STTComponent->SetVADThreshold(0.5f);

When to adjust:

  • Noisy environment: Increase threshold (0.6-0.8)
  • Quiet environment: Decrease threshold (0.3-0.5)

Performance Tuning

Thread Count:

// Project Settings → STT → NumThreads
// Set to CPU core count for best performance
NumThreads = 4; // For quad-core CPU

Buffer Duration:

// Project Settings → Audio → Buffer Duration
BufferDuration = 100ms; // Lower = less latency, higher = better accuracy

Language Models (LLM)

Quick Setup

// 1. Add component
UPROPERTY(VisibleAnywhere)
UERP_LLMComponent* LLMComponent;

// 2. Set system prompt
LLMComponent->SetSystemPrompt(TEXT("You are a friendly merchant NPC."));

// 3. Bind event
LLMComponent->OnGenerationComplete.AddDynamic(
this, &AMyNPC::OnDialogueGenerated);

// 4. Send message
LLMComponent->SendMessage(TEXT("What are you selling?"));

// 5. Handle response
void AMyNPC::OnDialogueGenerated(const FERP_LLMResult& Result)
{
DisplayDialogue(Result.GeneratedText);
}

Component API

SendMessage:

UFUNCTION(BlueprintCallable, Category="ElysGenAI|LLM")
void SendMessage(const FString& Message);

SetSystemPrompt:

UFUNCTION(BlueprintCallable, Category="ElysGenAI|LLM")
void SetSystemPrompt(const FString& Prompt);

ClearHistory:

UFUNCTION(BlueprintCallable, Category="ElysGenAI|LLM")
void ClearHistory();

Configuration

Component Properties:

UPROPERTY(EditAnywhere, Category="ElysGenAI|LLM")
FString SystemPrompt = TEXT("You are a helpful assistant.");

UPROPERTY(EditAnywhere, Category="ElysGenAI|LLM")
int32 MaxTokens = 256;

UPROPERTY(EditAnywhere, Category="ElysGenAI|LLM")
float Temperature = 0.7f; // 0.0-2.0

UPROPERTY(EditAnywhere, Category="ElysGenAI|LLM")
int32 MaxHistoryMessages = 20;

Project Settings: Project Settings → Elys GenAI Framework → LLM

  • Backend: LlamaCpp
  • Model Path: (empty = use bundled Phi-3)
  • ContextLength: 4096 tokens
  • NumThreads: 4

Events

OnGenerationComplete:

UPROPERTY(BlueprintAssignable)
FERP_LLMResultDelegate OnGenerationComplete;

Result Structure:

USTRUCT(BlueprintType)
struct FERP_LLMResult
{
UPROPERTY(BlueprintReadOnly)
FString GeneratedText;

UPROPERTY(BlueprintReadOnly)
int32 TokenCount;

UPROPERTY(BlueprintReadOnly)
EElysLLMFinishReason FinishReason; // Completed, Length, Stop
};

OnTokenGenerated (Streaming):

UPROPERTY(BlueprintAssignable)
FERP_LLMTokenDelegate OnTokenGenerated;

Use for typewriter effects:

LLMComponent->OnTokenGenerated.AddDynamic(this, &AMyNPC::OnToken);

void AMyNPC::OnToken(const FString& Token)
{
DialogueText += Token;
UpdateDialogueUI(DialogueText);
}

Bundled Model: Phi-3-mini

Specs:

  • Size: ~2.7GB (Q4 quantized)
  • Context: 4096 tokens
  • License: MIT
  • Speed: ~20 tokens/sec (CPU)

Use Cases:

  • NPC dialogue
  • Quest generation
  • Item descriptions
  • Dynamic storytelling

Temperature Guide

Controls creativity/randomness:

// 0.0-0.3: Factual, deterministic (game mechanics, tutorials)
LLMComponent->SetTemperature(0.3f);

// 0.4-0.7: Balanced (NPC dialogue, descriptions)
LLMComponent->SetTemperature(0.7f);

// 0.8-1.5: Creative (storytelling, humor)
LLMComponent->SetTemperature(1.2f);

System Prompt Best Practices

Clear Instructions:

FString SystemPrompt = TEXT(
"You are a wise wizard NPC named Gandor. "
"Keep responses under 50 words. "
"Speak in archaic English. "
"Never break character."
);

Few-Shot Examples:

FString SystemPrompt = TEXT(
"You are a merchant. Examples:\n"
"Player: 'What do you sell?'\n"
"You: 'Potions, weapons, and armor!'\n"
"Player: 'How much for a sword?'\n"
"You: '100 gold pieces.'"
);

Combined Examples

Voice-to-Dialogue Pipeline

// 1. Capture audio → 2. Transcribe → 3. Generate response

UCLASS()
class AMyNPC : public AActor
{
GENERATED_BODY()

public:
UPROPERTY(VisibleAnywhere)
UERP_STTComponent* STTComponent;

UPROPERTY(VisibleAnywhere)
UERP_LLMComponent* LLMComponent;

protected:
virtual void BeginPlay() override
{
Super::BeginPlay();

// Setup STT
STTComponent->SetLanguageCode(TEXT("en"));
STTComponent->SetAutoStartListening(true);
STTComponent->OnTranscriptionComplete.AddDynamic(
this, &AMyNPC::OnPlayerSpoke);

// Setup LLM
LLMComponent->SetSystemPrompt(TEXT("You are a friendly merchant."));
LLMComponent->OnGenerationComplete.AddDynamic(
this, &AMyNPC::OnDialogueGenerated);
}

UFUNCTION()
void OnPlayerSpoke(const FERP_STTResult& Result)
{
// Send transcription to LLM
LLMComponent->SendMessage(Result.TranscribedText);
}

UFUNCTION()
void OnDialogueGenerated(const FERP_LLMResult& Result)
{
// Display NPC response
DisplayDialogue(Result.GeneratedText);
}
};

Multi-Consumer Audio Routing

// Route audio to both STT and voice chat

UCLASS()
class AMyPlayerController : public APlayerController
{
GENERATED_BODY()

public:
UPROPERTY(VisibleAnywhere)
UERP_STTComponent* STTComponent;

UPROPERTY(VisibleAnywhere)
UVoiceChatComponent* VoiceChatComponent;

protected:
virtual void BeginPlay() override
{
Super::BeginPlay();

// Both components automatically register as audio consumers
// Audio flows to both simultaneously
STTComponent->StartListening();
VoiceChatComponent->StartTransmitting();
}
};

Settings Reference

Audio Settings

| Setting         | Default  | Description               |
|-----------------|----------|---------------------------|
| Sample Rate     | 16000 Hz | Audio capture sample rate |
| Channels        | 1 (Mono) | Audio channels            |
| Bit Depth       | 16       | Audio bit depth           |
| Buffer Duration | 100 ms   | Audio buffer size         |

STT Settings

| Setting     | Default   | Description              |
|-------------|-----------|--------------------------|
| STT Backend | Whisper   | Backend implementation   |
| Model Path  | (bundled) | Path to Whisper model    |
| Language    | en        | Target language code     |
| Enable VAD  | true      | Voice activity detection |
| Num Threads | 4         | Inference threads        |

LLM Settings

| Setting        | Default   | Description               |
|----------------|-----------|---------------------------|
| LLM Backend    | LlamaCpp  | Backend implementation    |
| Model Path     | (bundled) | Path to Phi-3 model       |
| Context Length | 4096      | Maximum context tokens    |
| Temperature    | 0.7       | Sampling temperature      |
| Max Tokens     | 512       | Maximum generation length |
| Num Threads    | 4         | Inference threads         |

Next Steps