3 Technical Description
3GPP TS 26.194, Release 17: Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Voice Activity Detector (VAD)
3.1 Definitions, symbols and abbreviations
3.1.1 Definitions
For the purposes of the present document, the terms and definitions given in TR 21.905 [5] and the following apply. A term defined in the present document takes precedence over the definition of the same term, if any, in TR 21.905 [5].
frame: Time interval of 20 ms corresponding to the time segmentation of the speech transcoder.
3.1.2 Symbols
For the purposes of this TS, the following symbols apply.
3.1.2.1 Variables
bckr_est[n] background noise estimate at the frequency band "n"
burst_count counts length of a speech burst, used by VAD hangover addition
hang_count hangover counter, used by VAD hangover addition
level[n] signal level at the frequency band "n"
new_speech pointer of the speech encoder, points to a buffer containing the last received samples of a speech frame [2]
noise_level estimated noise level
pow_sum input power
s(i) samples of the input frame
snr_sum measure between input frame and noise estimate
speech_level estimated speech level
stat_count stationary counter
stat_rat measure indicating stationarity of the input frame
tone_flag flag indicating the presence of a tone
vad_thr VAD threshold
VAD_flag Boolean VAD flag
vadreg intermediate VAD decision
3.1.2.2 Constants
ALPHA_UP1 constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA_DOWN1 constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA_UP2 constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA_DOWN2 constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA3 constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA4 constant for updating average signal level (see subclause 3.3.3.2)
ALPHA5 constant for updating average signal level (see subclause 3.3.3.2)
BURST_HIGH constant for controlling VAD hangover addition (see subclause 3.3.3.1)
BURST_P1 constant for controlling VAD hangover addition (see subclause 3.3.3.1)
BURST_SLOPE constant for controlling VAD hangover addition (see subclause 3.3.3.1)
COEFF3 coefficient for the filter bank (see subclause 3.3.1)
COEFF5_1 coefficient for the filter bank (see subclause 3.3.1)
COEFF5_2 coefficient for the filter bank (see subclause 3.3.1)
HANG_HIGH constant for controlling VAD hangover addition (see subclause 3.3.3.1)
HANG_LOW constant for controlling VAD hangover addition (see subclause 3.3.3.1)
HANG_P1 constant for controlling VAD hangover addition (see subclause 3.3.3.1)
HANG_SLOPE constant for controlling VAD hangover addition (see subclause 3.3.3.1)
FRAME_LEN size of a speech frame, 256 samples (20 ms)
MIN_SPEECH_LEVEL1 constant for speech estimation (see subclause 3.3.3.3)
MIN_SPEECH_LEVEL2 constant for speech estimation (see subclause 3.3.3.3)
MIN_SPEECH_SNR constant for VAD threshold adaptation (see subclause 3.3.3)
NO_P1 constant for VAD threshold adaptation (see subclause 3.3.3)
NO_SLOPE constant for VAD threshold adaptation (see subclause 3.3.3)
NOISE_MAX maximum value for noise estimate (see subclause 3.3.3.2)
NOISE_MIN minimum value for noise estimate (see subclause 3.3.3.2)
POW_TONE_THR threshold for tone detection (see subclause 3.3.3)
SP_ACTIVITY_COUNT constant for speech estimation (see subclause 3.3.3.3)
SP_ALPHA_DOWN constant for speech estimation (see subclause 3.3.3.3)
SP_ALPHA_UP constant for speech estimation (see subclause 3.3.3.3)
SP_CH_MAX constant for VAD threshold adaptation (see subclause 3.3.3)
SP_CH_MIN constant for VAD threshold adaptation (see subclause 3.3.3)
SP_EST_COUNT constant for speech estimation (see subclause 3.3.3.3)
SP_P1 constant for VAD threshold adaptation (see subclause 3.3.3)
SP_SLOPE constant for VAD threshold adaptation (see subclause 3.3.3)
STAT_COUNT threshold for stationary detection (see subclause 3.3.3.2)
STAT_THR threshold for stationary detection (see subclause 3.3.3.2)
STAT_THR_LEVEL threshold for stationary detection (see subclause 3.3.3.2)
THR_HIGH constant for VAD threshold adaptation (see subclause 3.3.3)
TONE_THR threshold for tone detection (see subclause 3.3.2)
VAD_POW_LOW constant for controlling VAD hangover addition (see subclause 3.3.3.1)
3.1.2.3 Functions
+ Addition
– Subtraction
* Multiplication
/ Division
| x | absolute value of x
AND Boolean AND
OR Boolean OR
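MIN(x,y) = $\begin{cases} x, & x \le y \\ y, & x > y \end{cases}$
MAX(x,y) = $\begin{cases} x, & x \ge y \\ y, & x < y \end{cases}$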
3.1.3 Abbreviations
For the purposes of the present document, the abbreviations given in TR 21.905 [5] and the following apply. An abbreviation defined in the present document takes precedence over the definition of the same abbreviation, if any, in TR 21.905 [5].
ANSI American National Standards Institute
DTX Discontinuous Transmission
VAD Voice Activity Detector
CNG Comfort Noise Generation
3.2 General
The function of the VAD algorithm is to indicate whether each 20 ms frame contains signals that should be transmitted, e.g. speech, music or information tones. The output of the VAD algorithm is a Boolean flag (VAD_flag) indicating presence of such signals.
3.3 Functional description
The block diagram of the VAD algorithm is depicted in Figure 1. The VAD algorithm uses parameters of the speech encoder to compute the Boolean VAD flag (VAD_flag). The input frame for the VAD is sampled at 12.8 kHz and thus contains 256 samples per 20 ms frame. Samples of the input frame (s(i)) are divided into sub-bands, and the level of the signal (level[n]) in each band is calculated. The inputs to the tone detection function are the normalized open-loop pitch gains, which are calculated by the open-loop pitch analysis of the speech encoder. The tone detection function computes a flag (tone_flag) which indicates the presence of a signalling tone, voiced speech, or other strongly periodic signal. The background noise level (bckr_est[n]) is estimated in each band based on the VAD decision, signal stationarity and the tone flag (tone_flag). The intermediate VAD decision is calculated by comparing the input SNR (level[n]/bckr_est[n]) to an adaptive threshold. The threshold is adapted based on the noise and long-term speech level estimates. Finally, the VAD flag is calculated by adding hangover to the intermediate VAD decision.
Figure 1: Simplified block diagram of the VAD algorithm
3.3.1 Filter bank and computation of sub-band levels
The input signal is divided into frequency bands using a 12-band filter bank (Figure 2). Cut-off frequencies for the filter bank are shown in Table 1.
Table 1. Cut-off frequencies for the filter bank
Band number | Frequencies
1 | 0 – 200 Hz
2 | 200 – 400 Hz
3 | 400 – 600 Hz
4 | 600 – 800 Hz
5 | 800 – 1200 Hz
6 | 1200 – 1600 Hz
7 | 1600 – 2000 Hz
8 | 2000 – 2400 Hz
9 | 2400 – 3200 Hz
10 | 3200 – 4000 Hz
11 | 4000 – 4800 Hz
12 | 4800 – 6400 Hz
Input for the filter bank is a speech frame pointed to by the new_speech pointer of the speech encoder [1]. Input values for the filter bank are scaled down by one bit. This ensures safe scaling, i.e. saturation cannot occur during calculation of the filter bank.
Figure 2: Filter bank
The filter bank consists of 5th and 3rd order filter blocks. Each filter block divides the input into high-pass and low-pass parts and decimates the sampling frequency by 2. The 5th order filter block is calculated as follows:
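$x_{lp}(i) = 0.5 \cdot \left( A_1(x(2i)) + A_2(x(2i+1)) \right)$ (1a)
$x_{hp}(i) = 0.5 \cdot \left( A_1(x(2i)) - A_2(x(2i+1)) \right)$ (1b)
where
x(i) input signal for a filter block
$x_{lp}(i)$ low-pass component
$x_{hp}(i)$ high-pass component
The 3rd order filter block is calculated as follows:
$x_{lp}(i) = 0.5 \cdot \left( x(2i+1) + A_3(x(2i)) \right)$ (2a)
$x_{hp}(i) = 0.5 \cdot \left( x(2i+1) - A_3(x(2i)) \right)$ (2b)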
The filters $A_1()$, $A_2()$, and $A_3()$ are first order direct form all-pass filters, whose transfer function is given by:
$A(z) = \dfrac{C + z^{-1}}{1 + C z^{-1}}$, (3)
where C is the filter coefficient.
Coefficients for the all-pass filters $A_1()$, $A_2()$, and $A_3()$ are COEFF5_1, COEFF5_2, and COEFF3, respectively.
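As an illustration of the filter block structure, the following floating-point sketch implements a first-order all-pass section and a 5th order band-splitting block. It assumes the usual first-order all-pass form A(z) = (C + z^-1)/(1 + C z^-1), and assumes that the low-pass output is half the sum, and the high-pass output half the difference, of the even samples filtered by the first all-pass and the odd samples filtered by the second. The class and function names are hypothetical, and the coefficient values used in any call are placeholders, not the values of COEFF5_1 and COEFF5_2. The fixed-point scaling of the codec is not modelled.

```python
class AllPass1:
    """First-order direct-form all-pass, A(z) = (C + z^-1) / (1 + C z^-1)."""

    def __init__(self, c):
        self.c = c
        self.w = 0.0  # state w(n-1); kept between calls so frames chain together

    def step(self, x):
        w = x - self.c * self.w   # w(n) = x(n) - C * w(n-1)
        y = self.w + self.c * w   # y(n) = w(n-1) + C * w(n)
        self.w = w
        return y


def split_band_5th(x, a1, a2):
    """Split x into low-pass and high-pass halves, each decimated by 2.

    a1 filters the even-indexed samples, a2 the odd-indexed samples
    (an assumed reading of the 5th order block).
    """
    lp, hp = [], []
    for i in range(len(x) // 2):
        even = a1.step(x[2 * i])
        odd = a2.step(x[2 * i + 1])
        lp.append(0.5 * (even + odd))
        hp.append(0.5 * (even - odd))
    return lp, hp
```

For a constant (DC) input, both all-pass branches settle to the input value, so the low-pass output converges to the input and the high-pass output to zero, which is a quick sanity check of the structure.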
Signal level is calculated at the output of the filter bank at each frequency band as follows:
, (4)
where:
n index for the frequency band
sample i at the output of the filter bank at frequency band n
=
=
Negative sample indices refer to the previous frame.
3.3.2 Tone detection
The purpose of the tone detection function is to detect information tones, vowel sounds and other periodic signals. The tone detection uses the normalized open-loop pitch gains (ol_gain), which are received from the speech encoder. If the pitch gain is higher than the constant TONE_THR, a tone is detected and the tone flag is set:
if (ol_gain > TONE_THR)
    tone_flag = 1
The open-loop pitch search, and correspondingly the tone flag, is computed twice in each frame, except in the 6.60 kbit/s mode, where it is computed only once.
3.3.3 VAD decision
The block diagram of the VAD decision algorithm is shown in Figure 3.
Figure 3: Simplified block diagram of the VAD decision algorithm
Power of the input frame is calculated as follows:
$\mathrm{frame\_pow} = \sum_{i=0}^{\mathrm{FRAME\_LEN}-1} s(i)^2$, (5)
where samples s(i) of the input frame are pointed to by the new_speech pointer of the speech encoder. The variable pow_sum is the sum of the powers of the current and previous frames. If pow_sum is lower than the constant POW_TONE_THR, tone_flag is set to zero.
The difference between the signal levels of the input frame and the background noise estimate is calculated as follows:
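$\mathrm{snr\_sum} = \sum_{n=1}^{12} \mathrm{MAX}\!\left(1.0, \dfrac{\mathrm{level}[n]}{\mathrm{bckr\_est}[n]}\right)^{2}$, (6)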
where:
level[n] signal level at band n
bckr_est[n] level of background noise estimate at band n
The VAD decision is made by comparing the variable snr_sum to a threshold. The threshold (vad_thr) is adapted to obtain the desired sensitivity depending on the estimated speech and background noise levels.
Average background noise level is calculated by adding noise estimates at each band except the lowest band:
$\mathrm{noise\_level} = \sum_{n=2}^{12} \mathrm{bckr\_est}[n]$ (7)
If the SNR is lower than the threshold (MIN_SPEECH_SNR), the speech level is increased as follows:
if (speech_level/noise_level < MIN_SPEECH_SNR)
    speech_level = MIN_SPEECH_SNR * noise_level
Logarithmic value for noise estimate is calculated as follows:
(8)
Before logarithmic value from the speech estimate is calculated, MIN_SPEECH_SNR*noise_level is subtracted from the speech level to correct its value in low SNR situations.
(9)
Threshold for VAD decision is calculated as follows:
vad_thr = NO_SLOPE * (ilog2_noise_level – NO_P1) + THR_HIGH + MIN(SP_CH_MAX,
MAX(SP_CH_MIN, SP_CH_MIN + SP_SLOPE * (ilog2_speech_level – SP_P1))), (10)
where NO_SLOPE, SP_SLOPE, NO_P1, SP_P1, THR_HIGH, SP_CH_MAX and SP_CH_MIN are constants.
The variable vadreg indicates intermediate VAD decision and it is calculated as follows:
if (snr_sum > vad_thr)
    vadreg = 1
else
    vadreg = 0
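The threshold adaptation of equation (10) and the comparison that yields vadreg can be sketched as follows. This is a minimal floating-point illustration, not the fixed-point reference code. The function names are hypothetical, and the values in PLACEHOLDER_CONSTANTS are arbitrary numbers chosen only to make the example runnable; they are not the values of the constants defined for the codec.

```python
# Arbitrary placeholder values (assumption), NOT the codec's constants.
PLACEHOLDER_CONSTANTS = {
    "NO_SLOPE": -2.0, "NO_P1": 10.0, "THR_HIGH": 100.0,
    "SP_CH_MIN": -5.0, "SP_CH_MAX": 5.0, "SP_SLOPE": 0.5, "SP_P1": 20.0,
}


def vad_threshold(ilog2_noise_level, ilog2_speech_level, k):
    """Equation (10): noise-dependent term plus a clamped speech-dependent term."""
    speech_term = k["SP_CH_MIN"] + k["SP_SLOPE"] * (ilog2_speech_level - k["SP_P1"])
    # Clamp the speech-dependent correction to [SP_CH_MIN, SP_CH_MAX].
    speech_term = min(k["SP_CH_MAX"], max(k["SP_CH_MIN"], speech_term))
    return k["NO_SLOPE"] * (ilog2_noise_level - k["NO_P1"]) + k["THR_HIGH"] + speech_term


def intermediate_decision(snr_sum, vad_thr):
    """vadreg is 1 when snr_sum exceeds the adaptive threshold, else 0."""
    return 1 if snr_sum > vad_thr else 0
```

The clamp means that the speech level can shift the threshold by at most SP_CH_MAX – SP_CH_MIN, whereas the noise-dependent term is not limited in this expression.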
3.3.3.1 Hangover addition
Before the final VAD flag is given, a hangover is added. The hangover addition helps to detect low power endings of speech bursts, which are subjectively important but difficult to detect.
The VAD flag is set to "1" if fewer than hang_len frames with a "0" decision have elapsed since burst_len consecutive "1" decisions were detected. The variables hang_len and burst_len are computed using vad_thr as follows:
hang_len = MAX(HANG_LOW, (HANG_SLOPE * (vad_thr – HANG_P1) + HANG_HIGH)) (11)
burst_len = BURST_SLOPE * (vad_thr – BURST_P1) + BURST_HIGH (12)
The power of the input frame is compared to a threshold (VAD_POW_LOW). If the power is lower, the VAD flag is set to "0" and no hangover is added. The VAD_flag is calculated as follows:
VAD_flag = 0
if (pow_sum < VAD_POW_LOW)
    burst_count = 0
    hang_count = 0
else
    if (vadreg = 1)
        burst_count = burst_count + 1
        if (burst_count >= burst_len)
            hang_count = hang_len
        VAD_flag = 1
    else
        burst_count = 0
        if (hang_count > 0)
            hang_count = hang_count – 1
            VAD_flag = 1
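The hangover logic is a small state machine over burst_count and hang_count, which the following sketch makes explicit. It assumes a particular nesting of the pseudocode: VAD_flag is set whenever vadreg is 1, the hangover counter is reloaded only once a burst of at least burst_len frames has been seen, and the counter is consumed on frames where vadreg is 0. The class name and the argument values in any call are hypothetical.

```python
class HangoverAdder:
    """Adds hangover to the intermediate VAD decision (illustrative sketch)."""

    def __init__(self):
        self.burst_count = 0  # length of the current run of vadreg == 1
        self.hang_count = 0   # remaining hangover frames

    def update(self, vadreg, pow_sum, burst_len, hang_len, vad_pow_low):
        # Very low input power: clear state, no hangover, flag is 0.
        if pow_sum < vad_pow_low:
            self.burst_count = 0
            self.hang_count = 0
            return 0
        if vadreg == 1:
            self.burst_count += 1
            # A long enough burst (re)arms the hangover counter.
            if self.burst_count >= burst_len:
                self.hang_count = hang_len
            return 1
        # vadreg == 0: the burst ends; spend hangover frames if any remain.
        self.burst_count = 0
        if self.hang_count > 0:
            self.hang_count -= 1
            return 1
        return 0
```

With burst_len = 3 and hang_len = 2, a run of three active frames followed by inactive frames keeps the flag at 1 for two further frames, while a single isolated active frame is followed by no hangover at all.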
3.3.3.2 Background noise estimation
The background noise estimate (bckr_est[n]) is updated using the amplitude levels of the previous frame. The update is thus delayed by one frame, so that an undetected start of a speech burst does not corrupt the noise estimate. The update speed for the current frame is selected using the intermediate VAD decisions (vadreg) and the stationarity counter (stat_count) as follows:
if (vadreg for the last 4 frames has been zero)
    alpha_up = ALPHA_UP1
    alpha_down = ALPHA_DOWN1
else if (stat_count = 0)
    alpha_up = ALPHA_UP2
    alpha_down = ALPHA_DOWN2
else
    alpha_up = 0
    alpha_down = ALPHA3
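The variable stat_count indicates stationarity; its purpose is explained later in this subclause. The variables alpha_up and alpha_down define the update speed upwards and downwards, respectively. The update speed for each band "n" is selected as follows:
if ($\mathrm{bckr\_est}_m[n]$ < $\mathrm{level}_{m-1}[n]$)
    alpha[n] = alpha_up
else
    alpha[n] = alpha_down
Finally, the noise estimate is updated as follows:
$\mathrm{bckr\_est}_{m+1}[n] = (1.0 - \mathrm{alpha}[n]) \cdot \mathrm{bckr\_est}_m[n] + \mathrm{alpha}[n] \cdot \mathrm{level}_{m-1}[n]$, (13)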
where:
n index of the frequency band
m index of the frame
Level of the background estimate (bckr_est[n]) is limited between constants NOISE_MIN and NOISE_MAX.
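The selection of the update speed and the limited noise update can be sketched as below. The sketch assumes that equation (13) is a first-order recursion that moves each band's estimate towards the level of the previous frame by a fraction alpha[n], and that the upward speed is used when the estimate is below that level. The function names are hypothetical, and PLACEHOLDER_ALPHAS holds arbitrary values, not the codec's constants.

```python
# Arbitrary placeholder values (assumption), NOT the codec's constants.
PLACEHOLDER_ALPHAS = {
    "ALPHA_UP1": 0.25, "ALPHA_DOWN1": 0.5,
    "ALPHA_UP2": 0.125, "ALPHA_DOWN2": 0.25,
    "ALPHA3": 0.0625,
}


def select_update_speed(vadreg_history, stat_count, k):
    """Return (alpha_up, alpha_down); vadreg_history[-1] is the newest decision."""
    last4 = vadreg_history[-4:]
    if len(last4) == 4 and all(v == 0 for v in last4):
        return k["ALPHA_UP1"], k["ALPHA_DOWN1"]
    if stat_count == 0:
        return k["ALPHA_UP2"], k["ALPHA_DOWN2"]
    return 0.0, k["ALPHA3"]  # upward updates disabled


def update_noise_estimate(bckr_est, prev_level, alpha_up, alpha_down,
                          noise_min, noise_max):
    """Per-band recursive update, limited to [noise_min, noise_max]."""
    updated = []
    for est, lev in zip(bckr_est, prev_level):
        alpha = alpha_up if est < lev else alpha_down
        new_est = (1.0 - alpha) * est + alpha * lev
        updated.append(min(noise_max, max(noise_min, new_est)))
    return updated
```

In the last branch of the selection the upward speed is zero, so during detected, non-stationary activity the estimate can only decay, which is what protects it from being pulled up by speech.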
If the level of the background noise increases suddenly, vadreg will be set to "1" and the background noise estimate is not normally updated upwards. To recover from this situation, the update of the background noise estimate is enabled if the intermediate VAD decision (vadreg) is "1" for a long enough time and the spectrum is stationary. Stationarity (stat_rat) is estimated using the following equation:
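$\mathrm{stat\_rat} = \sum_{n=1}^{12} \dfrac{\mathrm{MAX}(\mathrm{STAT\_THR\_LEVEL}, \mathrm{MAX}(\mathrm{ave\_level}_m[n], \mathrm{level}_m[n]))}{\mathrm{MAX}(\mathrm{STAT\_THR\_LEVEL}, \mathrm{MIN}(\mathrm{ave\_level}_m[n], \mathrm{level}_m[n]))}$, (14)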
where:
STAT_THR_LEVEL a constant
n index of the frequency band
m index of the frame
ave_level average level of the input signal
If the stationarity estimate (stat_rat) is higher than a threshold, the stationarity counter (stat_count) is set to the initial value defined by the constant STAT_COUNT. If stat_rat does not exceed the threshold, i.e. the signal is stationary, but speech has been detected (the intermediate VAD decision is "1"), stat_count is decreased by one in each frame until it reaches zero.
if (5 last tone flags have been one)
    stat_count = STAT_COUNT
else
    if (8 last internal VAD decisions have been zero) OR (stat_rat > STAT_THR)
        stat_count = STAT_COUNT
    else
        if (vadreg) AND (stat_count ≠ 0)
            stat_count = stat_count – 1
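The counter update above can be written as a single function, sketched here under some assumptions: the tone flag and vadreg histories are kept as lists with the newest value last, the current vadreg is the last history entry, and a history shorter than the required length is treated as not satisfying the "last N" condition. The function name and the numeric arguments used in any call are hypothetical.

```python
def update_stat_count(stat_count, tone_history, vadreg_history,
                      stat_rat, stat_thr, stat_count_init):
    """Return the new stat_count; stat_count_init plays the role of STAT_COUNT."""
    last5 = tone_history[-5:]
    if len(last5) == 5 and all(f == 1 for f in last5):
        return stat_count_init  # persistent tone: block upward noise updates
    last8 = vadreg_history[-8:]
    if (len(last8) == 8 and all(v == 0 for v in last8)) or stat_rat > stat_thr:
        return stat_count_init
    # Low stat_rat with speech detected: count down towards zero.
    if vadreg_history and vadreg_history[-1] == 1 and stat_count != 0:
        return stat_count - 1
    return stat_count
```

Once the counter reaches zero it stays there until one of the reset conditions fires, and a zero counter is what selects the ALPHA_UP2/ALPHA_DOWN2 update speeds in the noise estimation.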
The average signal levels (ave_level[n]) are calculated as follows:
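$\mathrm{ave\_level}_{m+1}[n] = (1.0 - \mathrm{alpha}) \cdot \mathrm{ave\_level}_m[n] + \mathrm{alpha} \cdot \mathrm{level}_m[n]$ (15)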
The update speed (alpha) for the previous equation is selected as follows:
if (stat_count = STAT_COUNT)
    alpha = 1.0
else if (vadreg = 1)
    alpha = ALPHA5
else
    alpha = ALPHA4
3.3.3.3 Speech level estimation
First, full-band input level is calculated by summing input levels in each band except the lowest band as follows:
$\mathrm{in\_level} = \sum_{n=2}^{12} \mathrm{level}[n]$ (16)
A frame is assumed to contain speech if its level is high enough (above MIN_SPEECH_LEVEL1), and either the intermediate VAD flag (vadreg) is set or the input level is higher than the current speech level estimate. The maximum level (sp_max) over SP_EST_COUNT frames is searched. If more than SP_ACTIVITY_COUNT speech frames occur within SP_EST_COUNT frames, and sp_max exceeds MIN_SPEECH_LEVEL2, the speech level estimate is updated towards the maximum signal level (sp_max). The pseudocode for the speech level estimation is as follows:
if (SP_ACTIVITY_COUNT > SP_EST_COUNT – sp_est_cnt + sp_max_cnt)
    sp_est_cnt = 0
    sp_max_cnt = 0
    sp_max = 0
sp_est_cnt = sp_est_cnt + 1
if (in_level > MIN_SPEECH_LEVEL1) AND ((vadreg = 1) OR (in_level > speech_level))
    sp_max_cnt = sp_max_cnt + 1
    sp_max = MAX(sp_max, in_level)
    if (sp_max_cnt > SP_ACTIVITY_COUNT)
        if (sp_max > MIN_SPEECH_LEVEL2)
            if (sp_max > speech_level)
                speech_level = speech_level + SP_ALPHA_UP * (sp_max – speech_level)
            else
                speech_level = speech_level + SP_ALPHA_DOWN * (sp_max – speech_level)
        sp_max_cnt = 0
        sp_max = 0
        sp_est_cnt = 0
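The speech level estimation can be sketched as a small stateful class. The nesting assumed here is that the activity check and the level update happen only on frames classified as speech, and that the three counters are cleared after every completed estimation window, whether or not sp_max was high enough to update the estimate. The class name is hypothetical, and the constants passed in are placeholders supplied by the caller, not the codec's values.

```python
class SpeechLevelEstimator:
    """Tracks speech_level from per-frame full-band input levels (sketch)."""

    def __init__(self, constants, speech_level):
        self.k = constants            # dict of the named constants
        self.speech_level = speech_level
        self.sp_est_cnt = 0           # frames seen in the current window
        self.sp_max_cnt = 0           # speech frames seen in the window
        self.sp_max = 0.0             # maximum level among those frames

    def _reset(self):
        self.sp_est_cnt = 0
        self.sp_max_cnt = 0
        self.sp_max = 0.0

    def update(self, in_level, vadreg):
        k = self.k
        # Restart the window if the activity count can no longer be reached.
        if k["SP_ACTIVITY_COUNT"] > k["SP_EST_COUNT"] - self.sp_est_cnt + self.sp_max_cnt:
            self._reset()
        self.sp_est_cnt += 1
        is_speech = in_level > k["MIN_SPEECH_LEVEL1"] and (
            vadreg == 1 or in_level > self.speech_level)
        if is_speech:
            self.sp_max_cnt += 1
            self.sp_max = max(self.sp_max, in_level)
            if self.sp_max_cnt > k["SP_ACTIVITY_COUNT"]:
                if self.sp_max > k["MIN_SPEECH_LEVEL2"]:
                    rising = self.sp_max > self.speech_level
                    alpha = k["SP_ALPHA_UP"] if rising else k["SP_ALPHA_DOWN"]
                    self.speech_level += alpha * (self.sp_max - self.speech_level)
                self._reset()
        return self.speech_level
```

Separate upward and downward speeds allow the estimate to follow a louder talker at a different rate than it decays for a quieter one.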