3.3.5 VAD decision

3GPP TS 26.094, Release 17: Adaptive Multi-Rate (AMR) speech codec; Mandatory speech CODEC speech processing functions; Voice Activity Detector (VAD)

Power of the input frame is calculated as follows:

$$\mathrm{pow\_sum} = \sum_{i=-40}^{119} s(i)\,s(i), \qquad (3.7)$$

where the samples s(i) of the input frame are pointed to by the new_speech pointer of the speech encoder. If the power of the input frame (pow_sum) is lower than the constant POW_PITCH_THR, the last pitch flag is set to zero. If pow_sum is lower than the constant POW_COMPLEX_THR, the last complex_low flag is set to zero.
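The power gating described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the threshold values are placeholders chosen for the example, and the function names frame_power and gate_flags are hypothetical.

```python
# Placeholder thresholds: illustrative values only, NOT taken from the specification.
POW_PITCH_THR = 1000.0
POW_COMPLEX_THR = 500.0


def frame_power(samples):
    """Return pow_sum, the sum of squared samples of one input frame."""
    return sum(s * s for s in samples)


def gate_flags(pow_sum, pitch_flag, complex_low_flag):
    """Clear the most recent pitch / complex_low flags for low-power frames."""
    if pow_sum < POW_PITCH_THR:
        pitch_flag = 0
    if pow_sum < POW_COMPLEX_THR:
        complex_low_flag = 0
    return pitch_flag, complex_low_flag
```

With these placeholder thresholds, a frame whose power lies between the two constants loses only its pitch flag, while a very quiet frame loses both flags.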

The difference between the signal levels of the input frame and background noise estimate is calculated as follows:

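$$\mathrm{snr\_sum} = \sum_{n=1}^{9} \mathrm{MAX}\!\left(1.0,\ \frac{\mathrm{level}[n]}{\mathrm{bckr\_est}[n]}\right)^{2}, \qquad (3.8)$$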

where:

level[n] signal level at band n

bckr_est[n] level of background noise estimate at band n

The VAD decision is made by comparing the variable snr_sum to a threshold. The threshold (vad_thr) is tuned to give the desired sensitivity at each background noise level: the higher the noise level, the lower the threshold. In particular, a low threshold is needed at high background noise levels to detect speech reliably enough, although the probability of detecting noise as speech also increases.

Average level of background noise is calculated by adding noise estimates at each band:

$$\mathrm{noise\_level} = \sum_{n=1}^{9} \mathrm{bckr\_est}[n] \qquad (3.9)$$

Threshold is calculated using average noise level as follows:

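$$\mathrm{vad\_thr} = \mathrm{VAD\_SLOPE} \cdot (\mathrm{noise\_level} - \mathrm{VAD\_P1}) + \mathrm{VAD\_THR\_HIGH}, \qquad (3.10)$$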

where VAD_SLOPE, VAD_P1, and VAD_THR_HIGH are constants.

The variable vadreg indicates intermediate VAD decision and it is calculated as follows:

if (snr_sum > vad_thr)
    vadreg = 1
else
    vadreg = 0
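The intermediate decision can be sketched as below. This is a hedged illustration only: the constant values are placeholders rather than the specification's, the function names are hypothetical, and two forms are assumptions made for the example, namely that snr_sum is a sum of squared per-band level-to-noise ratios clamped below at 1.0, and that the threshold is a linear function of noise_level built from VAD_SLOPE, VAD_P1 and VAD_THR_HIGH.

```python
# Placeholder constants: illustrative values only, NOT the specification's.
VAD_SLOPE = -0.5       # negative, so the threshold falls as noise rises
VAD_P1 = 0.0
VAD_THR_HIGH = 1260.0


def snr_sum_of(level, bckr_est):
    """Assumed form: squared per-band ratio, clamped below at 1.0."""
    return sum(max(1.0, lv / bg) ** 2 for lv, bg in zip(level, bckr_est))


def vad_threshold(bckr_est):
    """Assumed linear threshold driven by the summed noise estimate."""
    noise_level = sum(bckr_est)
    return VAD_SLOPE * (noise_level - VAD_P1) + VAD_THR_HIGH


def intermediate_vad(level, bckr_est):
    """Return vadreg: 1 if snr_sum exceeds vad_thr, else 0."""
    return 1 if snr_sum_of(level, bckr_est) > vad_threshold(bckr_est) else 0
```

Because the slope is negative in this sketch, doubling the noise estimate lowers the threshold, which matches the stated behaviour that higher noise levels use lower thresholds.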

3.3.5.1 Hangover addition

Before the final VAD flag is given, a hangover is added. The hangover addition helps to detect low-power endings of speech bursts, which are subjectively important but difficult to detect. A long hangover is also added if the signal has been found to be of a very complex nature for a long time (2 seconds), since the VAD is not likely to work reliably for such a signal.

The VAD flag is set to "1" if fewer than hang_len frames with a "0" decision have elapsed since burst_len consecutive "1" decisions were detected. The variables hang_len and burst_len are set depending on the average noise level (noise_level). The VAD flag is also controlled by complex_hang_count, which indicates that the signal is too complex for the VAD and should not be used with a comfort noise generation (CNG) algorithm. The filtered correlation value corr_hp is also used as an activity indication after the VAD has indicated noise for a while (200 ms). This helps in situations where the VAD noise estimate has adapted to a rather stationary signal that is still too complex to sound good with CNG.

The power of the input frame is compared to a threshold (VAD_POW_LOW). If the power is lower, the VAD flag is set to "0" and no hangover is added. The VAD_flag is calculated as follows:

if (noise_level > HANG_NOISE_THR) {
    burst_len = BURST_LEN_HIGH_NOISE
    hang_len = HANG_LEN_HIGH_NOISE
} else {
    burst_len = BURST_LEN_LOW_NOISE
    hang_len = HANG_LEN_LOW_NOISE
}

if (complex_hang_timer > CVAD_HANG_LIMIT) {
    if (complex_hang_count < CVAD_HANG_LENGTH) {
        complex_hang_count = CVAD_HANG_LENGTH
    }
}

if (pow_sum < VAD_POW_LOW) {
    burst_count = 0
    hang_count = 0
    complex_hang_count = 0
    complex_hang_timer = 0
    VAD_flag = 0
    goto Exit
}

VAD_flag = 0

if (complex_hang_count != 0) {
    burst_count = BURST_LEN_HIGH_NOISE
    complex_hang_count = complex_hang_count - 1
    VAD_flag = 1
    goto Exit
} else {
    if ((the 10 last out of 11 vadreg values are all zero) AND
        (corr_hp > CVAD_THRESH_IN_NOISE)) {
        VAD_flag = 1
        goto Exit
    }
}

if (vadreg == 1) {
    burst_count = burst_count + 1
    if (burst_count >= burst_len) {
        hang_count = hang_len
    }
    VAD_flag = 1
} else {
    burst_count = 0
    if (hang_count > 0) {
        hang_count = hang_count - 1
        VAD_flag = 1
    }
}

Label Exit
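The hangover logic above can be exercised with a small state machine such as the following sketch. All constant values are placeholders for illustration, not the specification's. The HangoverState class, the hangover function and the recent_vadreg_all_zero argument (standing in for the test on the recent vadreg history) are hypothetical names. The maintenance of complex_hang_timer happens elsewhere and is not modelled here.

```python
from dataclasses import dataclass

# Placeholder constants: illustrative values only, NOT the specification's.
HANG_NOISE_THR = 100
BURST_LEN_HIGH_NOISE, HANG_LEN_HIGH_NOISE = 4, 7
BURST_LEN_LOW_NOISE, HANG_LEN_LOW_NOISE = 5, 4
CVAD_HANG_LIMIT, CVAD_HANG_LENGTH = 100, 250
VAD_POW_LOW = 15000
CVAD_THRESH_IN_NOISE = 0.65


@dataclass
class HangoverState:
    burst_count: int = 0
    hang_count: int = 0
    complex_hang_count: int = 0
    complex_hang_timer: int = 0


def hangover(st, vadreg, recent_vadreg_all_zero, noise_level, pow_sum, corr_hp):
    """Return the final VAD flag for one frame and update the state in place."""
    if noise_level > HANG_NOISE_THR:
        burst_len, hang_len = BURST_LEN_HIGH_NOISE, HANG_LEN_HIGH_NOISE
    else:
        burst_len, hang_len = BURST_LEN_LOW_NOISE, HANG_LEN_LOW_NOISE

    # Long-lasting complex signal: make sure the complex hangover is armed.
    if st.complex_hang_timer > CVAD_HANG_LIMIT:
        st.complex_hang_count = max(st.complex_hang_count, CVAD_HANG_LENGTH)

    # Very low frame power: reset everything, no hangover.
    if pow_sum < VAD_POW_LOW:
        st.burst_count = st.hang_count = 0
        st.complex_hang_count = st.complex_hang_timer = 0
        return 0

    if st.complex_hang_count != 0:
        st.burst_count = BURST_LEN_HIGH_NOISE
        st.complex_hang_count -= 1
        return 1
    if recent_vadreg_all_zero and corr_hp > CVAD_THRESH_IN_NOISE:
        return 1

    if vadreg == 1:
        st.burst_count += 1
        if st.burst_count >= burst_len:
            st.hang_count = hang_len
        return 1
    st.burst_count = 0
    if st.hang_count > 0:
        st.hang_count -= 1
        return 1
    return 0
```

With these placeholder values, a burst of five active frames in low noise is followed by four hangover frames flagged "1" before the flag drops to "0", whereas a shorter burst receives no hangover.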

3.3.5.2 Background noise estimation

The background noise estimate (bckr_est[n]) is updated using the amplitude levels of the previous frame. The update is thus delayed by one frame to prevent undetected starts of speech bursts from corrupting the noise estimate. If the internal VAD decision is "1" or if pitch has been detected, the noise estimate is not updated upwards. The update speed for the current frame is selected as follows:

if ((vadreg for the last 4 frames has been zero) AND
    (pitch for the last 4 frames has been zero) AND
    (we are not in complex signal hangover))
    alpha_up = ALPHA_UP1
    alpha_down = ALPHA_DOWN1
else
    if ((stat_count == 0) AND (not in complex signal hangover))
        alpha_up = ALPHA_UP2
        alpha_down = ALPHA_DOWN2
    else
        alpha_up = 0
        alpha_down = ALPHA3

The variable stat_count indicates stationarity; its purpose is explained later in this clause. The variables alpha_up and alpha_down define the update speed upwards and downwards, respectively. The update speed for each band n is selected as follows:

if (bckr_est_m[n] < level_{m-1}[n])
    alpha = alpha_up
else
    alpha = alpha_down

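Finally, the noise estimate is updated as follows:

$$\mathrm{bckr\_est}_{m+1}[n] = (1.0 - \mathrm{alpha}) \cdot \mathrm{bckr\_est}_{m}[n] + \mathrm{alpha} \cdot \mathrm{level}_{m-1}[n], \qquad (3.11)$$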

where:

n index of the frequency band

m index of the frame

The level of the background noise estimate (bckr_est[n]) is limited to the range between the constants NOISE_MIN and NOISE_MAX.
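The update-speed selection and the per-band update can be sketched as follows. This is an illustration under stated assumptions: the constant values are placeholders, the names select_alphas, update_noise_estimate and quiet_4_frames (true when both vadreg and pitch have been zero for the last 4 frames) are hypothetical, and the update is assumed to be a first-order recursion that moves the estimate towards the previous frame's level.

```python
# Placeholder constants: illustrative values only, NOT the specification's.
ALPHA_UP1, ALPHA_DOWN1 = 0.0625, 0.25
ALPHA_UP2, ALPHA_DOWN2 = 0.03125, 0.125
ALPHA3 = 0.25
NOISE_MIN, NOISE_MAX = 40.0, 16000.0


def select_alphas(quiet_4_frames, stat_count, in_complex_hangover):
    """Pick (alpha_up, alpha_down) for the current frame."""
    if quiet_4_frames and not in_complex_hangover:
        return ALPHA_UP1, ALPHA_DOWN1
    if stat_count == 0 and not in_complex_hangover:
        return ALPHA_UP2, ALPHA_DOWN2
    return 0.0, ALPHA3  # upward updates are disabled


def update_noise_estimate(bckr_est, prev_level, alpha_up, alpha_down):
    """Assumed first-order update towards the previous frame's band levels."""
    updated = []
    for est, lev in zip(bckr_est, prev_level):
        alpha = alpha_up if est < lev else alpha_down
        est = (1.0 - alpha) * est + alpha * lev
        # Keep the estimate between NOISE_MIN and NOISE_MAX.
        updated.append(min(max(est, NOISE_MIN), NOISE_MAX))
    return updated
```

Setting alpha_up to zero freezes the estimate in every band whose level lies above it, so during speech or pitched segments the estimate can only move downwards.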

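If the level of background noise increases suddenly, vadreg will be set to "1" and the background noise estimate is not updated upwards. To recover from this situation, the update of the background noise estimate is enabled if the intermediate VAD decision (vadreg) is "1" for a long enough time and the spectrum is stationary. Stationarity (stat_rat) is estimated using the following equation:

$$\mathrm{stat\_rat} = \sum_{n=1}^{9} \frac{\mathrm{MAX}\big(\mathrm{STAT\_THR\_LEVEL},\ \mathrm{MAX}(\mathrm{ave\_level}_{m}[n],\ \mathrm{level}_{m}[n])\big)}{\mathrm{MAX}\big(\mathrm{STAT\_THR\_LEVEL},\ \mathrm{MIN}(\mathrm{ave\_level}_{m}[n],\ \mathrm{level}_{m}[n])\big)} \qquad (3.12)$$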

If the stationarity estimate (stat_rat) is higher than a threshold, the stationarity counter (stat_count) is set to the initial value defined by the constant STAT_COUNT. The counter is also initialised if pitch, tone or a complex_warning is detected. If the signal is not stationary but speech has been detected (the VAD decision is "1"), stat_count is decreased by one in each frame until it reaches zero.

if (complex_warning) {
    if (stat_count < CAD_MIN_STAT_COUNT)
        stat_count = CAD_MIN_STAT_COUNT
}

if ((8 last vadreg flags have been zero) OR (2 last pitch flags have been one) OR (5 last tone flags have been one))
    stat_count = STAT_COUNT
else
    if (stat_rat > STAT_THR)
        stat_count = STAT_COUNT
    else
        if ((vadreg) AND (stat_count != 0))
            stat_count = stat_count - 1
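A compact sketch of the counter update is given below. The constant values are placeholders, and the function name update_stat_count together with the three boolean history arguments (standing in for the tests on the recent vadreg, pitch and tone flags) are hypothetical; stat_rat is passed in as an already computed value.

```python
# Placeholder constants: illustrative values only, NOT the specification's.
STAT_COUNT = 20
STAT_THR = 1.5
CAD_MIN_STAT_COUNT = 5


def update_stat_count(stat_count, stat_rat, vadreg, complex_warning,
                      vadreg_quiet, pitch_recent, tone_recent):
    """Return the new value of stat_count for one frame."""
    # A complex-signal warning keeps the counter at or above a minimum.
    if complex_warning and stat_count < CAD_MIN_STAT_COUNT:
        stat_count = CAD_MIN_STAT_COUNT

    # Recent silence, pitch or tone reinitialises the counter.
    if vadreg_quiet or pitch_recent or tone_recent:
        return STAT_COUNT
    if stat_rat > STAT_THR:
        return STAT_COUNT
    # Otherwise count down while speech is flagged, stopping at zero.
    if vadreg and stat_count != 0:
        stat_count -= 1
    return stat_count
```

The counter only reaches zero after STAT_COUNT consecutive frames in which vadreg is "1" and none of the reset conditions fire.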

The average signal levels (ave_level[n]) are calculated as follows:

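$$\mathrm{ave\_level}_{m+1}[n] = (1.0 - \mathrm{alpha}) \cdot \mathrm{ave\_level}_{m}[n] + \mathrm{alpha} \cdot \mathrm{level}_{m}[n] \qquad (3.13)$$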

The update speed (alpha) for the previous equation is selected as follows:

if (stat_count == STAT_COUNT)
    alpha = 1.0
else if (vadreg == 1)
    alpha = ALPHA5
else
    alpha = ALPHA4
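The averaging step can be sketched as below, assuming the average follows the same first-order recursive form as the noise estimate. The constant values are placeholders and the function name update_ave_level is hypothetical.

```python
# Placeholder constants: illustrative values only, NOT the specification's.
STAT_COUNT = 20
ALPHA4 = 0.25
ALPHA5 = 0.5


def update_ave_level(ave_level, level, stat_count, vadreg):
    """Assumed first-order average of the band levels with a selectable speed."""
    if stat_count == STAT_COUNT:
        alpha = 1.0      # counter was just reinitialised: follow the level directly
    elif vadreg == 1:
        alpha = ALPHA5
    else:
        alpha = ALPHA4
    return [(1.0 - alpha) * avg + alpha * lev
            for avg, lev in zip(ave_level, level)]
```

With alpha equal to 1.0 the average is simply replaced by the current level, so the stationarity measure starts afresh whenever the counter is reinitialised.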