6 Computational description overview

3GPP TS 46.042 Half rate speech; Voice Activity Detector (VAD) for half rate speech traffic channels (Release 17)


The computational details necessary for the fixed point implementation of the speech transcoding and DTX functions are given in the form of an American National Standards Institute (ANSI) C program contained in GSM 06.06 [5]. This clause provides an overview of the modules which describe the computation of the VAD algorithm.

6.1 VAD modules

The computational description of the VAD is divided into three ANSI C modules. These modules are:

‑ vad_reset;

‑ vad_algorithm;

‑ periodicity_update.

The vad_reset module sets the VAD variables to their initial values.

The vad_algorithm module is divided into nine sub‑modules which correspond to the blocks of figure 1 in the high level description of the VAD algorithm. The vad_algorithm module can be called as soon as the acf[0..8] and rc[1..4] variables are known. This means that the VAD computation can take place after the Autocorrelation Fixed point LAttice Technique (AFLAT) routine in the speech encoder (GSM 06.20 [2]). The vad_algorithm module also requires the value of the ptch variable calculated in the previous frame.

The ptch variable is calculated by the periodicity_update module from the lags[1..4] variable. The individual lag values are calculated for each subframe in the LTP routine of the speech encoder (GSM 06.20 [2]). The periodicity_update module is called after the current 20 ms signal frame has been encoded.
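The call order described above can be sketched as a per-frame routine. This is a minimal sketch only. The names vad_sketch_state, vad_reset_stub, vad_algorithm_stub, periodicity_update_stub and process_frame, and all their signatures, are hypothetical stand-ins for the three modules, not the GSM 06.06 interfaces. The bodies are placeholders: in particular, the decision rule and the periodicity rule are NOT those of this specification. What the sketch shows is the data flow. vad_algorithm runs on the current frame's acf[0..8] and rc[1..4] but still sees the ptch of the previous frame, and ptch is refreshed only after the frame has been encoded.

```c
#include <assert.h>

/* Hypothetical state: only the cross-frame ptch flag is modelled. */
typedef struct {
    int ptch;            /* periodicity flag from the previous frame */
    int last_ptch_used;  /* trace only: ptch seen by the last VAD call */
} vad_sketch_state;

/* Stand-in for vad_reset: set the variables to their initial values. */
static void vad_reset_stub(vad_sketch_state *st) {
    st->ptch = 0;
    st->last_ptch_used = -1;
}

/* Stand-in for vad_algorithm: consumes acf[0..8] and rc[1..4] of the
 * current frame plus ptch of the previous frame.
 * PLACEHOLDER decision rule, not the rule of this specification. */
static int vad_algorithm_stub(vad_sketch_state *st,
                              const long acf[9], const double rc[5]) {
    (void)rc;                       /* the placeholder ignores rc[1..4] */
    st->last_ptch_used = st->ptch;  /* record which ptch was visible */
    return acf[0] > 0;
}

/* Stand-in for periodicity_update: derives ptch from lags[1..4].
 * PLACEHOLDER rule (all four lags equal), not the rule of this specification. */
static void periodicity_update_stub(vad_sketch_state *st, const int lags[5]) {
    st->ptch = lags[1] == lags[2] && lags[2] == lags[3] && lags[3] == lags[4];
}

/* One 20 ms frame in the order described above; the encoder itself is
 * represented only by its outputs acf, rc and lags. */
static int process_frame(vad_sketch_state *st, const long acf[9],
                         const double rc[5], const int lags[5]) {
    int vad_flag;
    assert(st);
    /* After AFLAT: acf[0..8] and rc[1..4] are known, and ptch still
     * holds the value computed in the previous frame. */
    vad_flag = vad_algorithm_stub(st, acf, rc);
    /* ... rest of the encoder, including the per-subframe LTP lags ... */
    /* After the frame is encoded: refresh ptch for the next frame. */
    periodicity_update_stub(st, lags);
    return vad_flag;
}
```

After vad_reset_stub, the first frame sees the reset value of ptch. Every later frame sees the value derived from the lags of the frame before it.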

6.2 Pseudo‑floating point arithmetic

All the arithmetic operations follow the precision and format used in the computational description of the speech codec in GSM 06.06 [5]. To increase the precision within the fixed point implementation, a pseudo‑floating point representation of some variables is used. This applies to the following variables (and related constants) of the VAD algorithm:

‑ pvad: Energy of filtered signal;

‑ thvad: Threshold of the VAD decision;

‑ acf0: Energy of the input signal.

For the representation of these variables, two 16‑bit integers are needed:

‑ one for the exponent (e_pvad, e_thvad, e_acf0);

‑ one for the mantissa (m_pvad, m_thvad, m_acf0).

The value e_pvad is the exponent of the lowest power of 2 that is greater than or equal to the actual value of pvad, and m_pvad is an integer that is always greater than or equal to 16384 (normalized mantissa). This means that the value of pvad is:

$$ \mathrm{pvad} = \mathrm{m\_pvad} \cdot 2^{\,\mathrm{e\_pvad} - 15} $$

This scheme provides a large dynamic range for pvad while always keeping a precision of 16 bits. Comparisons are simple: because the mantissas are normalized, the exponents of two variables are compared first, and the mantissas only when the exponents are equal. The VAD algorithm needs only one pseudo-floating point addition and one pseudo-floating point multiplication. All the computations on the pseudo-floating point variables use the simple 16- and 32-bit arithmetic operations defined in the detailed description of the speech codec.
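The representation above can be sketched in C. This is a minimal sketch, not the GSM 06.06 code. The type pfloat and the functions pf_normalize, pf_less, pf_mult and pf_value are hypothetical, and the sketch assumes value = m · 2^(e - 15) with 16384 <= m <= 32767. pf_normalize chooses e so that 2^(e-1) <= x < 2^e. This matches the "greater or equal" rule except when x is an exact power of 2, where it avoids a mantissa of 32768 that would not fit in 16 bits. pf_value converts to double for inspection only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pseudo-floating point type: value = m * 2^(e - 15),
 * with a normalized mantissa 16384 <= m <= 32767 (positive values only). */
typedef struct {
    int16_t e;   /* exponent */
    int16_t m;   /* normalized mantissa */
} pfloat;

/* Normalize a positive integer: choose e with 2^(e-1) <= x < 2^e and keep
 * the 15 most significant bits of x in m (lower bits are truncated). */
static pfloat pf_normalize(uint32_t x) {
    pfloat r;
    int e = 0;
    assert(x > 0);
    while (e < 32 && (x >> e) != 0)
        e++;                                   /* e = bit length of x */
    if (e > 15)
        r.m = (int16_t)(x >> (e - 15));
    else
        r.m = (int16_t)(x << (15 - e));
    r.e = (int16_t)e;
    return r;
}

/* a < b for normalized values: compare exponents, then mantissas on a tie. */
static int pf_less(pfloat a, pfloat b) {
    if (a.e != b.e)
        return a.e < b.e;
    return a.m < b.m;
}

/* Product of two normalized values. m_a * m_b lies in [2^28, 2^30),
 * so a single right shift of 15 or 14 bits renormalizes it. */
static pfloat pf_mult(pfloat a, pfloat b) {
    int32_t p = (int32_t)a.m * (int32_t)b.m;
    pfloat r;
    if (p >= ((int32_t)1 << 29)) {
        r.m = (int16_t)(p >> 15);
        r.e = (int16_t)(a.e + b.e);
    } else {
        r.m = (int16_t)(p >> 14);
        r.e = (int16_t)(a.e + b.e - 1);
    }
    return r;
}

/* Convert to double for inspection only (no libm needed). */
static double pf_value(pfloat x) {
    double v = x.m;
    int k = x.e - 15;
    for (; k > 0; k--) v *= 2.0;
    for (; k < 0; k++) v *= 0.5;
    return v;
}
```

For example, pf_normalize(210000) gives e = 18 and m = 26250. Multiplying that by 2 (e = 2, m = 16384) gives e = 19 and m = 26250, that is 420000. Because a product of two normalized mantissas always lies in [2^28, 2^30), at most one extra shift is needed to renormalize it.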

Some constants are also needed in this pseudo-floating point format, and symbolic names in capital letters are used for their exponents and mantissas. Table 8 lists these constants with their symbolic names and numerical values.

Table 8: List of floating point constants

Constant	Exponent	Mantissa
pth	E_PTH = 18	M_PTH = 26250
margin	E_MARGIN = 27	M_MARGIN = 27343
plev	E_PLEV = 20	M_PLEV = 17500

Annex A (informative):
VAD performance

In the optimization of a VAD, a trade‑off has to be made between speech clipping, which reduces the subjective performance of the system, and the mean channel activity factor. The benefit of DTX is increased as the activity factor is reduced. However, in general, a reduction of the activity factor will be associated with a greater risk of audible speech clipping.

In the optimization process, emphasis has been placed on avoiding unnecessary speech clipping. However, it has been found that a VAD with virtually no audible clipping would result in a high activity and little DTX advantage. The VAD specified in the present document introduces audible and possibly objectionable clipping in certain cases, mainly for low input levels and low signal to noise ratios.

An indication of the mean channel activity in DTX mode is given in table A.1. The figure quoted is the average calculated over a large number of conversations covering factors such as different talkers, noise characteristics and locations. It should be noted that the actual activity of a particular talker in a specific conversation may vary considerably from the figure given in the table. This is due to both talker behaviour and the level dependency of the VAD (the channel activity has been found to decrease by about 0.5% per dB of level reduction). However, as mentioned above, a decreased speech input level increases the risk of objectionable clipping.

Table A.1: Mean channel activity factor in DTX mode

Channel activity factor
60%

Annex B (informative):
Simplified block filtering operation

Consider an 8th order transversal filter with coefficients a[0..8]. When a signal s[n] is passed through the filter, its output is:

$$ s'_n = -\sum_{i=0}^{8} a[i]\,s[n-i] \tag{1} $$

If we apply block filtering over 20 ms frames, then this equation becomes:

$$ s'_n = -\sum_{i=0}^{\min(8,n)} a[i]\,s[n-i], \qquad 0 \le n \le 167,\; 0 \le n-i \le 159 \tag{2} $$

If the energy of the filtered signal is then obtained for every 20 ms frame, the equation for this is:

$$ \mathrm{pvad} = \sum_{n=0}^{167} \left( -\sum_{i=0}^{\min(8,n)} a[i]\,s[n-i] \right)^{2}, \qquad 0 \le n-i \le 159 \tag{3} $$

We know that:

$$ \mathrm{acf}[i] = \sum_{n=i}^{159} s[n]\,s[n-i], \qquad i = 0,\ldots,8 \tag{4} $$

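Expanding the square in equation (3) gives terms a[i]*a[k]*s[n-i]*s[n-k]. Summed over n, these sample products depend only on the lag |i-k| and are equal to acf[|i-k|] from equation (4). Grouping the coefficient products by lag, with each nonzero lag occurring twice (once for i > k and once for i < k), we arrive at the equations: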

$$ \mathrm{pvad} = r[0]\,\mathrm{acf}[0] + 2\sum_{i=1}^{8} r[i]\,\mathrm{acf}[i] \tag{5} $$

where:

$$ r[i] = \sum_{k=0}^{8-i} a[k]\,a[k+i], \qquad i = 0,\ldots,8 \tag{6} $$
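Equations (3) and (5) can be checked against each other numerically. The following is a sketch in double precision, whereas the GSM 06.06 code uses fixed point. The function names pvad_direct and pvad_from_acf and the macros FRAME_LEN and ORDER are illustrative and are not taken from the reference code. pvad_direct filters the 160-sample frame into 168 output samples and sums their squares, as in equation (3). pvad_from_acf uses only acf[0..8] and r[0..8], as in equations (4) to (6).

```c
#include <assert.h>

#define FRAME_LEN 160  /* 20 ms frame at 8 kHz: s[0..159] */
#define ORDER 8        /* filter coefficients a[0..8] */

/* Equation (3): block-filter the frame into 168 samples and sum their squares. */
static double pvad_direct(const double a[ORDER + 1], const double s[FRAME_LEN]) {
    double energy = 0.0;
    for (int n = 0; n < FRAME_LEN + ORDER; n++) {          /* n = 0..167 */
        double y = 0.0;
        for (int i = 0; i <= ORDER && i <= n; i++) {       /* i = 0..min(8,n) */
            if (n - i < FRAME_LEN)                         /* 0 <= n-i <= 159 */
                y -= a[i] * s[n - i];
        }
        energy += y * y;
    }
    return energy;
}

/* Equations (4) to (6): the same energy from acf[0..8] and r[0..8] only. */
static double pvad_from_acf(const double a[ORDER + 1], const double s[FRAME_LEN]) {
    double acf[ORDER + 1], r[ORDER + 1];
    double pvad;
    for (int i = 0; i <= ORDER; i++) {
        acf[i] = 0.0;
        for (int n = i; n < FRAME_LEN; n++)
            acf[i] += s[n] * s[n - i];                     /* equation (4) */
        r[i] = 0.0;
        for (int k = 0; k <= ORDER - i; k++)
            r[i] += a[k] * a[k + i];                       /* equation (6) */
    }
    pvad = r[0] * acf[0];
    for (int i = 1; i <= ORDER; i++)
        pvad += 2.0 * r[i] * acf[i];                       /* equation (5) */
    return pvad;
}
```

For any coefficients a[0..8] and frame s[0..159], the two results differ only by floating point rounding. This is what allows pvad to be obtained from the nine autocorrelation values rather than from the filtered samples.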

Annex C (informative):
Pole frequency calculation

This annex describes the algorithm used to determine whether the pole frequency for a second order analysis of the signal frame is less than 385 Hz.

The filter coefficients for a second order synthesis filter are calculated from the first two unquantized reflection coefficients rc[1..2] obtained from the speech encoder. If the filter coefficients a[0..2] are defined such that the synthesis filter response is given by:

$$ H(z) = \frac{1}{a[0] + a[1]\,z^{-1} + a[2]\,z^{-2}} \tag{1} $$

Then the positions of the poles in the Z‑plane are given by the solutions to the following quadratic:

$$ a[0]\,z^{2} + a[1]\,z + a[2] = 0, \qquad a[0] = 1 \tag{2} $$

The positions of the poles, z, are therefore:

$$ z = \mathrm{re} \pm j\sqrt{\mathrm{im}}, \qquad j^{2} = -1 \tag{3} $$

where:

$$ \mathrm{re} = -\frac{a[1]}{2} \tag{4} $$

$$ \mathrm{im} = \frac{4\,a[2] - a[1]^{2}}{4} \tag{5} $$

If im is negative, the poles lie on the real axis of the Z-plane; the signal is not a tone and the algorithm terminates. If re is negative, the poles lie in the left half of the Z-plane and the pole frequency is greater than 2000 Hz, so the prediction error test is performed.

If both im and re are positive, the poles are a complex-conjugate pair in the right half of the Z-plane, and the pole frequency in Hz is related to re and im by the following expression, in which the factor 4000/pi equals fs/(2*pi) for the 8 kHz sampling rate:

$$ \mathrm{freq} = \arctan\!\left(\frac{\sqrt{\mathrm{im}}}{\mathrm{re}}\right) \cdot \frac{4000}{\pi} \tag{6} $$

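Having ensured that both im and re are positive, the test for a pole frequency below 385 Hz follows from equation (6). Since the arctangent is increasing, freq < 385 Hz is equivalent to sqrt(im)/re < tan(pi*385/4000). Squaring both positive sides and substituting equations (4) and (5), for which im/re^2 = (4*a[2] - a[1]^2)/a[1]^2, gives: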

$$ \frac{4\,a[2] - a[1]^{2}}{a[1]^{2}} < \tan^{2}\!\left(\frac{385\,\pi}{4000}\right) \tag{7} $$

$$ \frac{4\,a[2] - a[1]^{2}}{a[1]^{2}} < 0.0973 \tag{8} $$

If this test is true, the signal is not a tone and the algorithm terminates; otherwise, the prediction error test is performed.
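The decision procedure of this annex can be sketched as follows. The sketch takes a[1] and a[2] directly, so the conversion from rc[1..2] is not shown. It uses double precision rather than the fixed point arithmetic of GSM 06.06, and the names low_pole_test and pole_outcome are hypothetical. The annex does not say what happens when im or re is exactly zero. As an assumption, the sketch treats im = 0 as real poles and re = 0 (a pole at exactly 2000 Hz) as proceeding to the prediction error test. Equation (8) is evaluated after multiplying through by a[1]^2, which is positive whenever re is, to avoid a division.

```c
#include <assert.h>

/* Outcomes of the low pole frequency test (hypothetical names). */
typedef enum {
    NOT_A_TONE,            /* algorithm terminates */
    PREDICTION_ERROR_TEST  /* continue with the prediction error test */
} pole_outcome;

/* Synthesis filter 1/(1 + a1*z^-1 + a2*z^-2), i.e. a[0] = 1.
 * Doubles are used here; the reference code is fixed point. */
static pole_outcome low_pole_test(double a1, double a2) {
    double re = -a1 / 2.0;                   /* equation (4) */
    double im = (4.0 * a2 - a1 * a1) / 4.0;  /* equation (5) */

    /* Real poles. ASSUMPTION: im == 0 (double real pole) treated like im < 0. */
    if (im <= 0.0)
        return NOT_A_TONE;
    /* Left half plane: pole frequency above 2000 Hz.
     * ASSUMPTION: re == 0 (exactly 2000 Hz) treated the same way. */
    if (re <= 0.0)
        return PREDICTION_ERROR_TEST;
    /* Equation (8) multiplied through by a1^2 (> 0 here) to avoid dividing. */
    if (4.0 * a2 - a1 * a1 < 0.0973 * a1 * a1)
        return NOT_A_TONE;                   /* pole frequency below 385 Hz */
    return PREDICTION_ERROR_TEST;
}
```

For example, a[1] = -1.9 and a[2] = 0.95 give re = 0.95 and im = 0.0475, so (4*a[2] - a[1]^2)/a[1]^2 is about 0.053, which is below 0.0973. By equation (6) the pole frequency is about 287 Hz, and the sketch returns NOT_A_TONE.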

Annex D (informative):
Change history

Change history
SMG No.	TDoc. No.	CR. No.	Section affected	New version	Subject/Comments
SMG#15				4.1.1	ETSI Publication
SMG#20				5.0.1	Release 1996 version
SMG#27				6.0.0	Release 1997 version
SMG#29				7.0.0	Release 1998 version
				7.0.1	Version update to 7.0.1 for Publication
SMG#31				8.0.0	Release 1999 version
				8.0.1	Update to Version 8.0.1 for Publication

Change history
Date	TSG #	TSG Doc.	CR	Rev	Subject/Comment	Old	New
03-2001	11				Version for Release 4		4.0.0
06-2002	16				Version for Release 5	4.0.0	5.0.0
12-2004	26				Version for Release 6	5.0.0	6.0.0
06-2007	36				Version for Release 7	6.0.0	7.0.0
12-2008	42				Version for Release 8	7.0.0	8.0.0
12-2009	46				Version for Release 9	8.0.0	9.0.0
03-2011	51				Version for Release 10	9.0.0	10.0.0
09-2012	57				Version for Release 11	10.0.0	11.0.0
09-2014	65				Version for Release 12	11.0.0	12.0.0
12-2015	70				Version for Release 13	12.0.0	13.0.0

Change history
Date	Meeting	TDoc	CR	Rev	Cat	Subject/Comment	New version
03-2017	SA#75					Version for Release 14	14.0.0
06-2018	SA#80	–	–	–	–	Version for Release 15	15.0.0
2020-07	–	–	–	–	–	Update to Rel-16 version (MCC)	16.0.0
2022-04	–	–	–	–	–	Update to Rel-17 version (MCC)	17.0.0