6.1.4 Post-processing

3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description (Release 15)


The decoded signal is passed through several post-processing blocks. First, adaptive post-filtering is applied to enhance the formant and harmonic structure of the signal. In a second step, a bass post-filter treats the low frequencies.

6.1.4.1 Adaptive post-filtering

The post-filtering is similar to ITU-T G.729 post-processing, with the main difference that it is performed at 12.8 or 16 kHz. The adaptive post-filter is a cascade of three filters: a long-term post-filter, $H_p(z)$, a short-term post-filter, $H_f(z)$, and a tilt compensation filter, $H_t(z)$, followed by an adaptive gain control procedure. The post-filter coefficients are updated in every subframe.

The post-filtering process is organized as follows. First, the reconstructed signal, $\hat{s}(n)$, is inverse-filtered through $\hat{A}(z/\gamma_n)$ to produce the residual signal, $\hat{r}(n)$. This signal is used to compute the delay, $T$, and gain, $g_l$, of the long-term post-filter $H_p(z)$. The signal, $\hat{r}(n)$, is then filtered through the long-term post-filter, $H_p(z)$, and the synthesis filter, $1/[g_f \hat{A}(z/\gamma_d)]$. Finally, the output signal of the synthesis filter is passed through the tilt compensation filter, $H_t(z)$, to generate the post-filtered reconstructed signal, $s_f(n)$. Adaptive gain control is then applied to $s_f(n)$ to match its energy to the energy of $\hat{s}(n)$. The post-filter parameters $\gamma_n$ and $\gamma_d$ are described in detail in subclauses 6.1.4.1.3 and 6.1.4.1.4.

The long-term post-filter is only applied for NB modes and is bypassed for WB and SWB. In WB and SWB cases, the post-filtering consists of a cascade of only two filters: a short-term post-filter, $H_f(z)$ (see subclause 6.1.4.1.2), and a tilt compensation filter, $H_t(z)$ (see subclause 6.1.4.1.5), followed by an adaptive gain control procedure (see subclause 6.1.4.1.6).

At 9.6 kbit/s NB decoding, the long-term post-filter, $H_p(z)$, is active only for clean speech, when the level of background noise is less than 20 dB. It is also deactivated for UC mode.

6.1.4.1.1 Long-term post-filter

The long-term post-filter is given by:

$$H_p(z) = \frac{1}{1 + \gamma_p g_l}\left(1 + \gamma_p g_l z^{-T}\right)$$ (1513)

where $T$ is the pitch delay, and $g_l$ is the gain coefficient. Note that $g_l$ is bounded by 1, and is set to zero if the long-term prediction gain is less than 3 dB. The factor $\gamma_p$ controls the amount of long-term post-filtering. The long-term delay and the gain are computed from the residual signal, $\hat{r}(n)$, obtained by filtering $\hat{s}(n)$ through $\hat{A}(z/\gamma_n)$, which is the numerator of the short-term post-filter (see subclause 6.1.4.1.2). That is

$$\hat{r}(n) = \hat{s}(n) + \sum_{i=1}^{16} \gamma_n^i \hat{a}_i \hat{s}(n-i)$$ (1514)

The long-term delay is computed using a two-pass procedure. The first pass selects the best integer pitch delay, $T_0$, in the range $[\lfloor T_1 \rfloor - 1, \lfloor T_1 \rfloor + 1]$, where $\lfloor T_1 \rfloor$ is the integer part of the (transmitted) fractional pitch lag, $T_1$, in the first subframe. The best integer, $T_0$, is the one that maximizes the correlation

$$R(k) = \sum_{n=0}^{63} \hat{r}(n)\,\hat{r}(n-k)$$ (1515)

The second pass chooses the best fractional pitch delay, $T$, with resolution 1/8 around $T_0$. This is done by finding the delay with the highest pseudo-normalized correlation

$$R'(k) = \frac{\sum_{n=0}^{63} \hat{r}(n)\,\hat{r}_k(n)}{\sqrt{\sum_{n=0}^{63} \hat{r}_k(n)\,\hat{r}_k(n)}}$$ (1516)

where $\hat{r}_k(n)$ is the residual signal at a fractional delay, $k$. The fractional delayed signal, $\hat{r}_k(n)$, is first computed using an interpolation filter of length 33. Once the optimal fractional delay, $T$, is found, $\hat{r}_T(n)$ is recomputed with a longer interpolation filter of length 129. The new signal replaces the previous one only if the longer filter increases the value of $R'(T)$. Then, the corresponding correlation, $R'(T)$, is normalized with the square root of the energy of $\hat{r}(n)$. The squared value of this normalized correlation is then used to determine if the long-term post-filter should be disabled. That is, if

$$\frac{R'(T)^2}{\sum_{n=0}^{63} \hat{r}^2(n)} < 0.5$$ (1517)

the long-term post-filter is disabled by setting $g_l = 0$. Otherwise, the value of $g_l$ is computed as

$$g_l = \frac{\sum_{n=0}^{63} \hat{r}(n)\,\hat{r}_T(n)}{\sum_{n=0}^{63} \hat{r}_T(n)\,\hat{r}_T(n)}$$, constrained by $0 \le g_l \le 1$ (1518)
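The two decisions described above (integer delay search, then gain computation with the 3 dB check) can be sketched as follows. This is a minimal illustration under assumptions, not the reference implementation: the function names are hypothetical, the search half-width of one sample is taken from the G.729 procedure that this subclause follows, and the fractional interpolation step is left out. The 0.5 threshold is the squared normalized correlation that corresponds to a long-term prediction gain of 3 dB.

```python
def best_integer_delay(res, start, length, t1_int, span=1):
    """Pick the integer delay k in [t1_int - span, t1_int + span] that
    maximizes R(k) = sum_n r(n) * r(n - k) over one subframe.

    res    -- residual history followed by the current subframe
    start  -- index in `res` of the first sample of the current subframe
    span   -- search half-width (assumption: 1, as in G.729)
    """
    best_k, best_corr = t1_int, float("-inf")
    for k in range(t1_int - span, t1_int + span + 1):
        corr = sum(res[start + n] * res[start + n - k] for n in range(length))
        if corr > best_corr:
            best_k, best_corr = k, corr
    return best_k


def long_term_gain(res, res_delayed):
    """Return g_l for one subframe, or 0.0 when the filter is disabled.

    res         -- residual r(n) of the current subframe
    res_delayed -- residual at the selected delay, r_T(n)
    """
    num = sum(a * b for a, b in zip(res, res_delayed))   # cross-correlation
    den = sum(b * b for b in res_delayed)                # energy of r_T(n)
    ener = sum(a * a for a in res)                       # energy of r(n)
    if num <= 0.0 or den <= 0.0 or ener <= 0.0:
        return 0.0
    # Squared normalized correlation below 0.5 means prediction gain < 3 dB.
    if (num * num) / (den * ener) < 0.5:
        return 0.0
    return min(num / den, 1.0)                           # g_l is bounded by 1
```

Clamping at 1 keeps the filter from amplifying the pitch pulse of the previous cycle, and returning 0 when the correlation is weak leaves unvoiced or noisy subframes untouched.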

6.1.4.1.2 Short-term post-filter

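The short-term post-filter is given by

$$H_f(z) = \frac{1}{g_f}\,\frac{\hat{A}(z/\gamma_n)}{\hat{A}(z/\gamma_d)} = \frac{1}{g_f}\,\frac{1 + \sum_{i=1}^{16} \gamma_n^i \hat{a}_i z^{-i}}{1 + \sum_{i=1}^{16} \gamma_d^i \hat{a}_i z^{-i}}$$ (1519)

where $\hat{A}(z)$ is the quantized LP analysis filter (LP analysis is not done at the decoder) and the factors $\gamma_n$ and $\gamma_d$ control the amount of short-term post-filtering. The gain, $g_f$, is calculated on the truncated impulse response, $h_f(n)$, of the filter $\hat{A}(z/\gamma_n)/\hat{A}(z/\gamma_d)$ and is given by

$$g_f = \sum_{n=0}^{19} |h_f(n)|$$ (1520)

Note that the gain, $g_f$, will be modified according to the noise level as explained in the next clause.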
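As an illustration of the gain computation, the sketch below builds the weighted numerator and denominator, runs a unit impulse through the resulting pole-zero filter, and sums the absolute values of the truncated response. It is a sketch under assumptions: the helper names are hypothetical, the truncation length of 20 samples is taken from G.729 rather than from the text above, and the LP coefficient list is expected to start with 1.

```python
def weight_lp(a, gamma):
    """Coefficients of A(z/gamma), given A(z) as [1, a_1, ..., a_M]."""
    return [c * gamma ** i for i, c in enumerate(a)]


def impulse_response(num, den, length):
    """First `length` samples of the impulse response of num(z)/den(z).

    Assumes den[0] == 1.
    """
    h = []
    for n in range(length):
        acc = num[n] if n < len(num) else 0.0      # zero part, driven by an impulse
        for i in range(1, min(n, len(den) - 1) + 1):
            acc -= den[i] * h[n - i]               # pole part (feedback)
        h.append(acc)
    return h


def short_term_gain(a, gamma_n, gamma_d, length=20):
    """Return (g_f, h_f) with g_f = sum |h_f(n)| over the truncated response.

    length -- truncation length (assumption: 20 samples, as in G.729)
    """
    h_f = impulse_response(weight_lp(a, gamma_n), weight_lp(a, gamma_d), length)
    return sum(abs(v) for v in h_f), h_f
```

Dividing the filter output by the returned gain bounds the peak amplification of the formant emphasis, which is why the same truncated response is reused for the tilt estimate.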

6.1.4.1.3 Post-filter NB parameters

In the ITU-T G.729 codec, the post-filter parameters ,and have fixed values. If a variable, called the long-term normalized noise gain, , is less than 25.0 and an active signal is detected, has a value limited in the range [0.55, 0.70] and has a value limited in the range [0.65, 0.75] as expressed by

()

Otherwise (not an active signal or ≥ 25.0), = 0.1 and = 0.15.

In the case of the GSC mode the post-filter parameters , and are set to 1.

The long‑term normalized noise gain, , is updated only when in UC mode and when no signal activity is detected (). The update is performed as

()

where is the normalized gain of random excitation in the UC mode, calculated as

(1523)

In the equation above, is the quantized gain of the random excitation, , used in TC mode, which has been quantized with 7 bits in the logarithmic energy domain. The modified value of in equation (1523) is not filtered. The modified value of is computed as

()

where the factor is derived from as follows

, constrained by ()

Thus, the short-term post-filter, described in subclause 6.1.4.1.2, is used with the modified value of gain, , and not . These modifications help to diminish the effect of post-filtering in noisy conditions.

6.1.4.1.4 Post-filter WB and SWB parameters

The post-filter parameters , for WB and SWB have fixed values, which depend on decoding mode. The filter may operate at both internal sampling frequencies 12.8 kHz and 16 kHz. In case of 12.8 kHz internal frequency the parameters take the default value = 0.7, = 0.75

Table 157: Post-filter WB and SWB parameters for 12.8 kHz

Mode	Parameter value
Inactive and AMR-WB IO, clean speech	0.7
< 13.2 kbit/s, clean speech	0.80
< 24.4 kbit/s, clean speech	0.75
≤ 32 kbit/s, clean speech	0.72
< 15.85 kbit/s, noisy speech	0.75
≤ 32 kbit/s, noisy speech	0.7
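The selection in Table 157 can be written as a small lookup. The sketch below is only one possible reading of the table: the function and argument names are hypothetical, the thresholds are applied in table order, and giving the inactive and AMR-WB IO row precedence over the bit-rate rows for clean speech is an assumption. Bit rates above 32 kbit/s are not covered by the table, so the sketch rejects them.

```python
def wb_swb_param_12k8(bitrate_kbps, clean, inactive_or_amrwb_io=False):
    """Look up the Table 157 value for the 12.8 kHz internal sampling rate.

    bitrate_kbps         -- decoder bit rate in kbit/s
    clean                -- True for clean speech, False for noisy speech
    inactive_or_amrwb_io -- True for inactive frames or AMR-WB IO decoding
                            (assumed to take precedence for clean speech)
    """
    if clean:
        if inactive_or_amrwb_io:
            return 0.7
        if bitrate_kbps < 13.2:
            return 0.80
        if bitrate_kbps < 24.4:
            return 0.75
        if bitrate_kbps <= 32.0:
            return 0.72
    else:
        if bitrate_kbps < 15.85:
            return 0.75
        if bitrate_kbps <= 32.0:
            return 0.7
    raise ValueError("bit rate not covered by Table 157")
```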

In case of 16 kHz internal frequency, noisy speech (the level of background noise is less than 20) and for any mode not depicted in the table below the parameters take the default value = 0.76, = 0.76.

Table 158: Post-filter WB and SWB parameters for 16 kHz

Mode	Parameter value
13.2 kbit/s	0.82
16.4 kbit/s	0.80
24.4 kbit/s	0.78
32 kbit/s	0.78

6.1.4.1.5 Tilt compensation

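The filter $H_t(z)$ compensates for the tilt in the short-term post-filter $H_f(z)$ and is given by

$$H_t(z) = \frac{1}{g_t}\left(1 + \gamma_t k'_1 z^{-1}\right)$$ (1526)

where $\gamma_t k'_1$ is a tilt factor with $k'_1$ being the first reflection coefficient, calculated from $h_f(n)$ as

$$k'_1 = -\frac{r_h(1)}{r_h(0)}$$ (1527)

where

$$r_h(i) = \sum_{j=0}^{19-i} h_f(j)\,h_f(j+i)$$ (1528)

The gain term $g_t = 1 - |\gamma_t k'_1|$ compensates for the decreasing effect of $g_f$ in $H_f(z)$. Furthermore, it has been shown that the product $H_f(z) H_t(z)$ has generally no gain. Two values are used for $\gamma_t$ depending on the sign of $k'_1$. If $k'_1$ is negative, $\gamma_t$ = 0.9, and if $k'_1$ is positive, $\gamma_t$ = 0.2.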
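The tilt compensation can be sketched as below, assuming the G.729-style first-order form $H_t(z) = (1 + \gamma_t k'_1 z^{-1})/g_t$ with $g_t = 1 - |\gamma_t k'_1|$ and a reflection coefficient obtained from the autocorrelation of the truncated impulse response of the short-term post-filter. The function names are hypothetical; the two tilt factors 0.9 and 0.2 are the values given in the text.

```python
def tilt_parameters(h_f):
    """Return (mu, g_t) where mu = gamma_t * k1 is the tilt factor.

    h_f -- truncated impulse response of the short-term post-filter
    """
    r0 = sum(v * v for v in h_f)
    r1 = sum(h_f[j] * h_f[j + 1] for j in range(len(h_f) - 1))
    k1 = -r1 / r0 if r0 > 0.0 else 0.0      # first reflection coefficient
    gamma_t = 0.9 if k1 < 0.0 else 0.2      # stronger correction for negative k1
    mu = gamma_t * k1
    return mu, 1.0 - abs(mu)


def apply_tilt(x, mu, g_t, prev=0.0):
    """Filter x through (1 + mu * z^-1) / g_t; `prev` is the last input sample."""
    out = []
    for v in x:
        out.append((v + mu * prev) / g_t)
        prev = v
    return out
```

With a negative tilt factor the filter has unit gain at DC, because $(1 + \mu)/(1 - |\mu|) = 1$ when $\mu < 0$, so the low-pass tilt introduced by the formant emphasis is reduced without changing the level of the low frequencies.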

6.1.4.1.6 Adaptive gain control

Adaptive gain control is used to compensate for gain differences between the synthesized signal, , and the post-filtered signal, . A gain factor, , for the current subframe is computed by

()

Then, the post-filtered signal, , is scaled as

, for n = 0,…,63 ()

where is a continuous gain, updated on a sample-by-sample basis for NB input as

for NB input

, for n = 0,…,63 ()

for SWB TBE input

, for n = 0,…,63 ()

The initial value of is used. Then, for each new subframe, is set equal to of the previous subframe.

For NB signals, the post-filtered synthesized signal, , is used instead of for signal de‑emphasis, as described in subclause 6.3.
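The sample-by-sample gain smoothing can be illustrated with the sketch below. It rests on two assumptions that are not confirmed by the text above: the subframe gain factor is taken as the square root of the energy ratio between the synthesized and the post-filtered signal, and the continuous gain follows a first-order recursion with a hypothetical smoothing constant `beta`. The actual constants for NB and SWB TBE input may differ.

```python
import math


def agc_subframe(s_in, s_out, g_prev, beta=0.9):
    """Scale one post-filtered subframe towards the energy of the input.

    s_in   -- synthesized signal before post-filtering
    s_out  -- post-filtered signal of the same subframe
    g_prev -- continuous gain at the end of the previous subframe
    beta   -- smoothing constant (hypothetical value, for illustration only)
    Returns (scaled subframe, gain to carry over, subframe gain factor).
    """
    e_in = sum(v * v for v in s_in)
    e_out = sum(v * v for v in s_out)
    target = math.sqrt(e_in / e_out) if e_out > 0.0 else 1.0
    g = g_prev
    scaled = []
    for v in s_out:
        g = beta * g + (1.0 - beta) * target   # move gradually towards the target
        scaled.append(g * v)
    return scaled, g, target
```

Carrying the last gain value into the next subframe avoids a step in the gain at subframe boundaries, which would otherwise be audible as a discontinuity.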

6.1.4.2 Bass post-filter

This clause describes the functionality of the bass post-filter, a low-frequency pitch enhancement procedure, which is closely related to the corresponding procedures in [11].

The main difference compared to the previous standards is that the last step of post-filtering is performed in the frequency domain. The reason for this is a different method of resampling from the internal sampling frequency to the output sampling frequency. Instead of time-domain resampling (see clause 7.6 in [25]), complex low-delay filter bank (CLDFB) synthesis is used (see subclause 6.9).

The filter is applied to all LP-based modes up to 32 kbit/s, except for NB noisy speech (the level of background noise is greater than 20 dB).

The bass post-filter uses two-band decomposition and adaptive filtering is applied only to the lower band. This results in a total post-processing that is mostly targeted at frequencies near the first harmonics of the synthesized signal.

Figure 91: Block diagram of bass post-filter

Figure 91 shows the block diagram of the low-band pitch enhancer. Note that this is a simplified block diagram, which is equivalent to adding the low-pass filtered enhanced signal to the high-pass filtered signal (see subclause 6.1.3 in [11]). The decoded signal is first processed through an adaptive pitch enhancer module, leading to an enhanced (full-band) signal. Subtracting the decoded signal from it yields an enhancement signal. CLDFB analysis (see subclause 5.1.2) is then applied to transform the enhancement signal into the frequency domain, where it is filtered through a low-pass filter to obtain its low-band part. The enhanced signal after post-processing is then obtained by adding this low-band enhancement signal to the decoded signal transformed into the frequency domain. Resampling to the output sampling frequency and conversion to a time-domain signal, both performed by CLDFB synthesis, are not part of the bass post-filter and are applied for all modes (see subclause 6.9).

The object of the pitch enhancer module is to reduce the inter-harmonic noise in the decoded signal, which is achieved here by a time-varying linear filter described by the following equation:

$$\hat{s}_f(n) = (1 - \alpha)\,\hat{s}(n) + \alpha\,s_p(n)$$ (1533)

where $\hat{s}_f(n)$ is the output signal of the pitch enhancer and $\alpha$ is a coefficient that controls the inter-harmonic attenuation. The signal $s_p(n)$ is the two-sided long-term prediction signal that is computed in each subframe as

$$s_p(n) = 0.5\,\hat{s}(n - T) + 0.5\,\hat{s}(n + T)$$ (1534)

where $T$ is the pitch period of the decoded signal $\hat{s}(n)$. Parameters $T$ and $\alpha$ vary in time. With a value of $\alpha = 0.5$, the gain of the filter described by equation (1533) is exactly 0 at frequencies $1/(2T)$, $3/(2T)$, $5/(2T)$, etc.; i.e., at the mid-points between DC and the harmonic frequencies $1/T$, $2/T$, $3/T$, etc. When $\alpha$ approaches 0, the attenuation between the harmonics produced by the filter of equation (1533) decreases.
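The inter-harmonic attenuation can be checked numerically with a short sketch. It assumes that the pitch enhancer has the form "output = (1 - alpha) * s(n) + alpha * s_p(n)" with the two-sided prediction "s_p(n) = 0.5 * s(n - T) + 0.5 * s(n + T)"; the function name and buffer layout are hypothetical. The frequency response of this filter is $(1 - \alpha) + \alpha \cos(\omega T)$, which is zero at $\omega T = \pi$ when $\alpha = 0.5$ and equal to 1 at every harmonic of the pitch.

```python
import math


def pitch_enhance(sig, start, length, T, alpha):
    """Apply the two-sided pitch enhancer to one subframe.

    sig   -- buffer holding at least T past samples before `start` and
             T look-ahead samples after the subframe
    start -- index of the first sample of the subframe
    T     -- integer pitch lag in samples
    alpha -- inter-harmonic attenuation coefficient
    """
    out = []
    for n in range(start, start + length):
        s_p = 0.5 * sig[n - T] + 0.5 * sig[n + T]   # two-sided prediction
        out.append((1.0 - alpha) * sig[n] + alpha * s_p)
    return out


# A tone half-way between DC and the first harmonic is cancelled,
# while a tone at the pitch frequency passes unchanged.
T = 40
mid = [math.cos(2.0 * math.pi * n / (2 * T)) for n in range(200)]
harm = [math.cos(2.0 * math.pi * n / T) for n in range(200)]
mid_out = pitch_enhance(mid, 60, 64, T, 0.5)
harm_out = pitch_enhance(harm, 60, 64, T, 0.5)
```

The look-ahead of T samples is what makes the prediction two-sided; the text that follows discusses why the future half of the prediction is less reliable when the pitch lag changes.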

The pitch lag parameter, $T$, is the received closed-loop pitch lag of the respective subframe. However, this parameter is only accurate for the part of the two-sided long-term prediction of equation (1534) that predicts from the past pitch cycle. The prediction from the future pitch cycle may be less accurate, especially if the pitch lag value is not constant.

Thus, in order to improve the prediction accuracy, in case of voiced onset frames it is preferable to make use of the pitch lag value of the subframe containing the future pitch cycle, i.e., of that subframe whose closed-loop pitch lag points into the present subframe. This requires the availability of pitch lag parameters of a frame following the current frame.

The pitch lag parameter, $T$, is further enhanced by means of a pitch tracker, which makes the pitch contour smoother and avoids pitch doublings.

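The factor $\alpha$ is computed as follows. First, the correlation between the signal and the predicted signal is given by

$$C_p = \sum_{n=0}^{63} \hat{s}(n)\,s_p(n)$$ (1535)

and the energy of the predicted signal is given by

$$E_p = \sum_{n=0}^{63} s_p(n)\,s_p(n)$$ (1536)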
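The two quantities above can be accumulated in a single pass over the subframe, as in the following sketch. It assumes the two-sided prediction "s_p(n) = 0.5 * s(n - T) + 0.5 * s(n + T)" and uses hypothetical names. The ratio C_p / E_p is the least-squares gain that minimizes the energy of the error between the signal and the scaled prediction.

```python
def prediction_stats(sig, start, length, T):
    """Return (C_p, E_p) for one subframe.

    sig   -- decoded signal with T samples of history and T of look-ahead
    start -- index of the first sample of the subframe
    T     -- integer pitch lag in samples
    """
    c_p = 0.0
    e_p = 0.0
    for n in range(start, start + length):
        s_p = 0.5 * sig[n - T] + 0.5 * sig[n + T]   # two-sided prediction
        c_p += sig[n] * s_p                         # correlation with the signal
        e_p += s_p * s_p                            # energy of the prediction
    return c_p, e_p


def least_squares_gain(c_p, e_p):
    """Gain k minimizing sum (s(n) - k * s_p(n))^2; 0 if the prediction is silent."""
    return c_p / e_p if e_p > 0.0 else 0.0
```

For a perfectly periodic signal the prediction equals the signal, so the ratio is 1; it drops towards 0 as the signal becomes less predictable from its neighbouring pitch cycles.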

The factor is given by

, constrained by , ()

where is the mean prediction error energy in dB in the present subframe and k₁ takes values of 0.5 or 1 depending on the operating point. The mean prediction error energy, is updated as follows. The long-term prediction error is first computed by

()

where k₂ equals C_p/E_p or 1 depending on the operating point, and then emphasized in the low frequencies using the relation

()

The energy of the emphasized error signal is then computed in dB as

()

and the mean error energy is then updated in every subframe by

()

with initial value .

The factor is further adapted to a measure of signal stationarity, which limits the level of inter‑harmonic attenuation when the signal is not in a steady-state mode. The adaptation is based on the stability factor of the current frame, and a recursively smoothed version of stability factor defined as

()

The factor , defined in equation (1537), is finally scaled as

()

Since larger portions of noise are aurally masked when the signal rapidly changes, the above adaptation gives a better balance between attenuation of quantization noise and signal degradation.

At 16.4 and 24.4 kbit/s, the factor $\alpha$ is adjusted by a decoded gain adjustment, which is quantized at the encoder (see subclause 5.2.4) and transmitted in the bitstream using 2 bits.

()