Pocketsphinx (recognizing voice) + recording

Hi, all nice people.

I’m building an application with the MATRIX Voice. It uses Mosquitto (Paho MQTT) to send instructions from a PC to the Raspberry Pi for recording, turning LEDs on, etc.

The main point is that I want to do this while the device is also recognizing speech through pocketsphinx.

I’ve thought about a few ways to do it. My current plan is to implement a circular buffer and pass the recorded data (from the matrix-hal read function) to the pocketsphinx thread, using a producer-consumer scheme with semaphores.
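To make the producer-consumer idea concrete, here is a minimal sketch of a blocking chunk queue built on POSIX semaphores. It is not the cbuffer.h API used further down: the names ChunkRing, ring_init, ring_put, ring_get, kChunkSamples and kSlots are all hypothetical, and the slot count of 8 is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <pthread.h>
#include <semaphore.h>

// Hypothetical names and sizes, chosen only for this illustration.
constexpr size_t kChunkSamples = 4096;  // samples per chunk
constexpr size_t kSlots = 8;            // chunks the ring can hold

struct ChunkRing {
    int16_t data[kSlots][kChunkSamples];
    size_t head;            // next slot to write
    size_t tail;            // next slot to read
    sem_t free_slots;       // counts empty slots
    sem_t used_slots;       // counts filled slots
    pthread_mutex_t lock;   // protects head/tail and the slot copies
};

void ring_init(ChunkRing *r) {
    r->head = 0;
    r->tail = 0;
    sem_init(&r->free_slots, 0, kSlots);
    sem_init(&r->used_slots, 0, 0);
    pthread_mutex_init(&r->lock, nullptr);
}

// Producer side: blocks while the ring is full.
void ring_put(ChunkRing *r, const int16_t *chunk) {
    assert(r != nullptr && chunk != nullptr);
    sem_wait(&r->free_slots);
    pthread_mutex_lock(&r->lock);
    std::memcpy(r->data[r->head], chunk, sizeof r->data[0]);
    r->head = (r->head + 1) % kSlots;
    pthread_mutex_unlock(&r->lock);
    sem_post(&r->used_slots);
}

// Consumer side: blocks while the ring is empty.
void ring_get(ChunkRing *r, int16_t *chunk) {
    assert(r != nullptr && chunk != nullptr);
    sem_wait(&r->used_slots);
    pthread_mutex_lock(&r->lock);
    std::memcpy(chunk, r->data[r->tail], sizeof r->data[0]);
    r->tail = (r->tail + 1) % kSlots;
    pthread_mutex_unlock(&r->lock);
    sem_post(&r->free_slots);
}
```

With this shape the microphone thread would call ring_put once per full chunk and the recognizer thread would call ring_get in its loop. If the recognizer falls behind, the producer blocks instead of overwriting audio, which may or may not be what you want for live capture.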

Has anyone tried something similar? Is it possible to implement? I’m having some issues…

This is the producer thread (all the synchronization and circular-buffer handling is solved by using virtual memory).

void *t_microx(void *arg){
    
    MicrophoneArray microphone_array;
    microphone_array.Setup(&bus);  //bus object is global variable

    MicrophoneCore microphone_core(microphone_array);
    microphone_core.Setup(&bus);

    microphone_array.SetSamplingRate(44100);
    //if (p_mic.gain > 0)
    microphone_array.SetGain(1);
    microphone_array.ShowConfiguration();
    microphone_array.CalculateDelays(0, 0, 1000, 320 * 1000);

    while(1){
        microphone_array.Read();
        for (uint32_t s = 0; s < microphone_array.NumberOfSamples(); s++){
            buffer[samples] = microphone_array.Beam(s);
            samples++;
            if (sem_trywait(&s_mic_stop) == 0) {
                return NULL;  // stop requested: leave the thread (a plain break only exits the for loop)
            }
            if (samples >= 4096) {
                cb_put(&q, buffer, 4096);
                samples = 0;
            }
        }
    }
}

This is the loop function in the consumer thread (cb_t *q is the pointer to the circular buffer in virtual memory):

const char * recognize_from_microphone(cb_t *q){
    int16_t buffer[4096];
    ps_start_utt(ps);                                // mark the start of the utterance
    utt_started = FALSE;                             // clear the utt_started flag

    while(1) {
        size_t k = cb_get(q, buffer, 4096);
        ps_process_raw(ps, (const int16_t *)buffer, k, FALSE, FALSE);
        in_speech = ps_get_in_speech(ps);   // test to see if speech is being detected

        if (in_speech && !utt_started) {     // if speech has started and utt_started flag is false
            utt_started = TRUE;              // then set the flag
        }

        if (!in_speech && utt_started) {     // if speech has ended and the utt_started flag is true
            ps_end_utt(ps);                  // then mark the end of the utterance
            hyp = ps_get_hyp(ps, NULL);      // query pocketsphinx for the "hypothesis" of the decoded statement
            return hyp;                      // return the hypothesis to the caller
        }
    }
}

The samples seem to be arriving fine in the sphinx thread, but the ‘in_speech’ variable is always ‘1’. Also, the ‘ps_process_raw’ function is taking around 1 second to execute. Any help is much appreciated.
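To put that 1 second in context: with the 4096-sample chunks and the 44100 Hz sampling rate from the producer thread above, each chunk holds only about 93 ms of audio, so a call that takes about 1 second is running roughly ten times slower than real time. A small sketch of the arithmetic (chunk_ms and real_time_factor are hypothetical helper names):

```cpp
#include <cassert>
#include <cstddef>

// Milliseconds of audio held by `samples` mono samples at `rate_hz`.
double chunk_ms(size_t samples, double rate_hz) {
    assert(rate_hz > 0.0);
    return 1000.0 * static_cast<double>(samples) / rate_hz;
}

// Processing time divided by audio time.
// Values above 1.0 mean the decoder cannot keep up with live input.
double real_time_factor(double processing_ms, size_t samples, double rate_hz) {
    return processing_ms / chunk_ms(samples, rate_hz);
}
```

For comparison, the same 4096 samples at 16000 Hz would cover 256 ms of audio.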

Cheers.

I made some progress. I changed the buffer size and switched the model to the one provided by the MATRIX team.
Here is the deal: I’ve tried the pocketsphinx demo and it works well, but when I try to do it directly with samples received through the matrix-hal library, it doesn’t recognize even a simple word. That seems strange to me, as the code below is practically the same as demo_pocketsphinx.

The buffer is receiving samples fine and on time, with no waiting or blocking, at 16 kHz, which is the rate the demo is configured to work with.

#include <stdio.h>
#include <string.h>
#include <pocketsphinx/pocketsphinx.h>
#include <sphinxbase/ad.h>
#include <sphinxbase/err.h>
#include "CBuffer/cbuffer.h"
#include <time.h>
#include <sys/select.h>   // select() and struct timeval, used by sleep_msec

extern "C" cb_t q;

static const arg_t cont_args_def[] = {
    POCKETSPHINX_OPTIONS,
    /* Argument file. */
    {"-argfile", ARG_STRING, NULL, "Argument file giving extra arguments."},
    {"-adcdev", ARG_STRING, NULL, "Name of audio device to use for input."},
    {"-infile", ARG_STRING, NULL, "Audio file to transcribe."},
    {"-inmic", ARG_BOOLEAN, "no", "Transcribe audio from microphone."},
    {"-time", ARG_BOOLEAN, "no", "Print word times in file transcription."},
    CMDLN_EMPTY_OPTION};

const char * recognize_from_microphone();

ps_decoder_t *ps;		//	Decoder structure
cmd_ln_t *config;		//	Configuration for decoder

uint8 utt_started;      //	Flags for utterance started and speech is producing
char const *hyp;                   //	Hypothesis for given speech
char const *decoded_speech;

void *t_sphinx (void *arg) {

	config = cmd_ln_init(NULL,
			cont_args_def, TRUE,
			"-lm", "/home/pi/modelo_mv/6706.lm",
			"-dict", "/home/pi/modelo_mv/6706.dic",
			"-kws_threshold", "1e-20",
			"-keyphrase", "MATRIX",
			NULL);


	ps_default_search_args(config);
	ps = ps_init(config);                                                        // initialize the pocketsphinx decoder

	while(1){
		decoded_speech = recognize_from_microphone();         		  // call the function to capture and decode speech
		printf("You Said: %s\n", decoded_speech);								// send decoded speech to screen
	}

	ps_free(ps);          // not reached while the loop above runs forever
	return NULL;
}

static void sleep_msec(int32 ms) {
	struct timeval tmo;
	tmo.tv_sec = 0;
	tmo.tv_usec = ms * 1000;

	select(0, NULL, NULL, NULL, &tmo);
}

const char * recognize_from_microphone(){
	int16_t buffer[2046];
	if (ps_start_utt(ps) < 0) E_FATAL("Failed to start utterance\n");
	utt_started = FALSE;
	printf("Ready...\n");

	clock_t start, end;
	double diff_t;
	int k, j = 0;   // j counts loop iterations between debug prints; it must start at 0

	while(1) {
		k = cb_get(&q, buffer, 2046);   // capture the number of frames in the audio buffer

		start = clock();
		ps_process_raw(ps, (const int16_t *)buffer, k, FALSE, FALSE);  // send the audio buffer to the pocketsphinx decoder
		const int8_t in_speech = ps_get_in_speech(ps);            // test to see if speech is being detected
		end = clock();

		diff_t = ((double) (end - start)) / CLOCKS_PER_SEC;
		if (j ==40){
			printf("%d\t",in_speech);
			printf("%f\n",diff_t);
			j=0;
		}
		else
			j++;

		if (in_speech && !utt_started) {             // if speech has started and utt_started flag is false
			utt_started = TRUE;                      // then set the flag
			printf("Listening...\n");
		}

		if (!in_speech && utt_started) {             // if speech has ended and the utt_started flag is true
			ps_end_utt(ps);                          // then mark the end of the utterance
			hyp = ps_get_hyp(ps, NULL);              // query pocketsphinx for the "hypothesis" of the decoded statement
			return hyp;                              // return the hypothesis to the caller
		}
	    sleep_msec(10);
	}

}
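One caveat about the timing printed in the loop above: clock() reports processor time used by the whole process, not elapsed time, so with the reader thread running alongside it may not reflect how long the ps_process_raw call really took. A minimal sketch of a wall-clock measurement, assuming a C++11 compiler (wall_ms is a hypothetical helper name):

```cpp
#include <chrono>

// Run `fn` once and return the elapsed wall-clock time in milliseconds.
template <typename F>
double wall_ms(F &&fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

In the loop it could wrap the decoder call, for example `double ms = wall_ms([&] { ps_process_raw(ps, buffer, k, FALSE, FALSE); });`, and that value can then be compared with the duration of the audio in the chunk.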

Reader thread:

void *t_microx(void *arg){
	MicrophoneArray microphone_array;
	microphone_array.Setup(&bus);

	MicrophoneCore microphone_core(microphone_array);
	microphone_core.Setup(&bus);

	microphone_array.SetSamplingRate(16000);
	//if (p_mic.gain > 0)
	microphone_array.SetGain(1.2);
	microphone_array.ShowConfiguration();
	microphone_array.CalculateDelays(0, 0, 1000,
			320 * 1000);
	mv_mic mv;

	int16_t buffer[2046];
	uint16_t samples = 0;

	while(1){
		microphone_array.Read();
		for (uint32_t s = 0; s < microphone_array.NumberOfSamples(); s++){
			buffer[samples] = microphone_array.Beam(s);
			samples++;
			if (sem_trywait(&s_mic_stop) == 0) {
				return NULL;  // stop requested: leave the thread (a plain break only exits the for loop)
			}
			if(samples>=2046){
				cb_put(&q, buffer, 2046);
				samples = 0;
			}
		}
	}
}
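One way to check whether the samples reaching the decoder are really comparable to what the demo gets is to dump a few seconds of them to a WAV file and listen to it, or feed it to the demo as a file. This is only a debugging sketch: write_wav16_mono, put_u16 and put_u32 are hypothetical helpers, and the sample data is written in host byte order, which assumes a little-endian machine such as the Raspberry Pi.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Store integers as little-endian bytes, as the WAV header requires.
static void put_u16(uint8_t *p, uint16_t v) {
    p[0] = v & 0xFF;
    p[1] = (v >> 8) & 0xFF;
}

static void put_u32(uint8_t *p, uint32_t v) {
    for (int i = 0; i < 4; i++) p[i] = (v >> (8 * i)) & 0xFF;
}

// Write 16-bit mono PCM samples to `path` as a WAV file. Returns true on success.
bool write_wav16_mono(const char *path, const int16_t *samples,
                      uint32_t count, uint32_t rate_hz) {
    FILE *f = fopen(path, "wb");
    if (!f) return false;

    const uint32_t data_bytes = count * 2;
    uint8_t h[44];
    memcpy(h, "RIFF", 4);
    put_u32(h + 4, 36 + data_bytes);   // size of everything after this field
    memcpy(h + 8, "WAVEfmt ", 8);
    put_u32(h + 16, 16);               // size of the fmt chunk
    put_u16(h + 20, 1);                // format tag 1 = PCM
    put_u16(h + 22, 1);                // one channel
    put_u32(h + 24, rate_hz);          // sample rate
    put_u32(h + 28, rate_hz * 2);      // bytes per second
    put_u16(h + 32, 2);                // bytes per sample frame
    put_u16(h + 34, 16);               // bits per sample
    memcpy(h + 36, "data", 4);
    put_u32(h + 40, data_bytes);

    bool ok = fwrite(h, 1, sizeof h, f) == sizeof h;
    // Samples are written as stored in memory (little-endian host assumed).
    ok = ok && fwrite(samples, 2, count, f) == count;
    ok = (fclose(f) == 0) && ok;
    return ok;
}
```

Calling it on the data that comes out of cb_get, with the same rate passed to SetSamplingRate, would show whether the audio is clean, at the expected speed, and at a reasonable level.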