6.9 Common post-processing
3GPP TS 26.445: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description (Release 15)
6.9.1 Comfort noise addition
In this clause, we describe a post-processing technique for enhancing the quality of noisy speech coded and transmitted at bit-rates up to 13.2 kbps. At such low bit-rates, the coding of noisy speech, i.e. speech recorded with background noise, is usually not as efficient as the coding of clean speech. The decoded synthesis is usually prone to artifacts as the two different kinds of sources – the noise and the speech – cannot be efficiently coded by a coding scheme relying on a single-source model.
The comfort noise addition (CNA) consists in modelling and synthesizing the background noise at the decoder side, thereby requiring no side information. It is achieved by estimating the level and spectral shape of the background noise at the decoder side and by artificially generating a comfort noise in the frequency domain. In principle, the noise estimation and generation in CNA is therefore similar to the FD-CNG presented in clause 6.7.3. However, a noticeable difference is that FD-CNG is applied in DTX operation only, whereas CNA can be used whenever noisy speech is coded at bit-rates up to 13.2 kbps. The generated noise is added to the decoded audio signal and helps mask coding artifacts.
6.9.1.1 Noisy speech detection
The CNA should be triggered in noisy speech scenarios only, i.e., not in clean speech or clean music situations. To this end, a noisy speech detector is used in the decoder. It estimates the long-term SNR by separately adapting long-term estimates of the noise and the speech/music energies, depending on a VAD decision.
The VAD decision is deduced directly from the information decoded from the bitstream. It is 0 if the current frame is a SID frame, a zero frame, or an IC (Inactive Coding mode, see clause 5.1.13) frame. It is 1 otherwise.
The long-term noise estimate $\bar{E}_N$ and the long-term speech/music estimate $\bar{E}_S$ are initialized with -20 dB and +25 dB, respectively. When $f_{VAD} = 0$, the long-term noise energy is updated on a frame-by-frame basis (equation 2086) by recursively averaging the total estimated noise energy $10\log_{10}\big(\sum_{p=0}^{N_{part}-1} L_p \, N^{[CNG]}(p)\big)$, where $N^{[CNG]}(p)$ refers to the noise energy spectrum estimated in the decoder to apply FD-CNG, $N_{part}$ is the number of spectral partitions, and $L_p$ corresponds to the size of each partition (see clause 6.7.3.2.2). Otherwise, i.e. if $f_{VAD} = 1$, the long-term speech/music energy is updated on a frame-by-frame basis by recursively averaging the decoded frame energy $10\log_{10}\big(\frac{1}{L_{frame}}\sum_{n=0}^{L_{frame}-1} \hat{s}^2(n)\big)$, where $L_{frame}$ denotes the frame size in samples and $\hat{s}(n)$ is the output frame of the core decoder at the CELP sampling rate. Furthermore, the long-term noise estimate $\bar{E}_N$ is lower limited for each frame.
The flag for noisy speech detection $f_{NS}$ is set to 1 if the long-term SNR is smaller than 28 dB, i.e.
$f_{NS} = 1$ if $\bar{E}_S - \bar{E}_N < 28$ dB, and $f_{NS} = 0$ otherwise.
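The detector above can be sketched in a few lines. The recursive-averaging constant ALPHA and all identifiers are illustrative assumptions; only the initial values (-20 dB, +25 dB) and the 28 dB threshold are taken from this clause.

```python
ALPHA = 0.01          # assumed smoothing constant for the long-term averages
SNR_THRESHOLD = 28.0  # flag is set when the long-term SNR drops below 28 dB

class NoisySpeechDetector:
    def __init__(self):
        self.e_noise = -20.0   # long-term noise estimate [dB]
        self.e_speech = 25.0   # long-term speech/music estimate [dB]

    def update(self, vad, frame_energy_db):
        # vad == 0: SID, zero or IC frame -> update the noise estimate;
        # vad == 1: active frame -> update the speech/music estimate.
        if vad == 0:
            self.e_noise += ALPHA * (frame_energy_db - self.e_noise)
        else:
            self.e_speech += ALPHA * (frame_energy_db - self.e_speech)
        # noisy speech flag: long-term SNR below the threshold
        return 1 if (self.e_speech - self.e_noise) < SNR_THRESHOLD else 0
```

With the given initialization, the detector starts with a 45 dB long-term SNR (flag 0) and raises the flag once the noise estimate converges close enough to the speech estimate.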
6.9.1.2 Noise estimation for CNA
To be able to produce an artificial noise resembling the actual input background noise in terms of spectro-temporal characteristics, the CNA needs an estimate of the noise spectrum in each FFT bin.
6.9.1.2.1 CNA noise estimation in DTX-on mode when FD-CNG is triggered
In DTX-on mode and provided that FD-CNG is triggered, the FD-CNG noise levels can be used directly. As described in clause 6.7.3, they are obtained by capturing the fine spectral structure of the background noise present during active phases, while updating only the spectral envelope of the noise during inactive parts with the help of the SID information.
6.9.1.2.2 CNA noise estimation in DTX-on mode when LP-CNG is triggered
To enable tracking of the noise spectrum when LP-CNG is triggered in DTX-on mode, the FD-CNG noise estimation algorithm (see clause 6.7.3.2.2) is applied at the output of the LP-CNG during inactive frames, yielding a noise estimate in each spectral partition. Following the technique described in clause 6.7.3.2.3.1, these partition-wise estimates are then interpolated to yield a full-resolution FFT power spectrum, which overwrites the current FD-CNG levels $N^{[CNG]}(j)$ in each FFT bin $j$.
6.9.1.2.3 CNA noise estimation in DTX-off mode
In DTX-off mode, the noise estimates are obtained by applying the FD-CNG noise estimation algorithm at the output of the core decoder when $f_{VAD} = 0$ only, i.e. during speech pauses. As in the previous clause, the interpolation technique described in clause 6.7.3.2.3.1 is then used to obtain a full-resolution FFT power spectrum, which overwrites the current FD-CNG levels $N^{[CNG]}(j)$.
6.9.1.3 Noise generation in the FFT domain and addition in the time domain
In CNA, when the current frame is not an MDCT-based TCX frame, random noise is generated in the FFT domain, separately for the real and imaginary parts. This is the same approach as in the FD-CNG (see clause 6.7.3.3.2). The noise is then transformed to the time domain by an inverse FFT and added to the decoder output using the overlap-add method.
The level of added comfort noise should be limited to preserve intelligibility and quality. The comfort noise is hence scaled to reach a pre-determined target noise level. Typically, the decoded audio signal exhibits a higher SNR than the original input signal, especially at low bit-rates where the coding artifacts are most severe. This attenuation of the noise level stems from the source-model paradigm of speech coding, which expects speech as input; for non-speech components the source model is not entirely appropriate and cannot reproduce their full energy. Hence, the amount of additional comfort noise is adjusted to roughly compensate for the noise attenuation inherently introduced by the coding process. The assumed amount of noise attenuation is chosen depending on the bandwidth and the bit-rate, as shown in the tables below.
Table 176: Assumed noise attenuation level for EVS primary modes
| Bandwidth | NB | | | | WB | | | | SWB | |
| Bit-rates [kbps] | | 8 | 9.6 | 13.2 | | 8 | 9.6 | 13.2 | | 13.2 |
| Noise attenuation [dB] | -3.5 | -3 | -2.5 | -2 | -3 | -2.5 | -1.5 | -2.5 | -2 | -1 |
Table 177: Assumed noise attenuation level for EVS AMR-WB IO modes
| Bandwidth | AMR-WB IO | |
| Bit-rates [kbps] | 6.60 | 8.85 |
| Noise attenuation [dB] | -4 | -3 |
The energy of the random noise is adjusted for each FFT bin $j$ (equation 2089) such that the comfort noise reaches the target level derived from the assumed noise attenuation, weighted by a factor (equation 2090) that can be interpreted as the likelihood of being in a noisy speech situation. This factor is used as a soft decision to reject clean speech or music situations, in which the noisy speech detection flag $f_{NS}$ becomes zero (see clause 6.9.1.1).
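The noise generation and level adjustment can be sketched as follows. The one-entry attenuation table is an excerpt of Table 176, and the scaling rule (restoring the fraction of noise energy removed by the codec, weighted by the noisy-speech likelihood) is an illustrative assumption, not the exact formula of equations (2089) and (2090).

```python
import math
import random

# Excerpt of Table 176: assumed noise attenuation for WB at 13.2 kbps.
ATTENUATION_DB = {("WB", 13.2): -1.5}

def cna_noise(noise_psd, bandwidth, bitrate, likelihood, rng=None):
    """Generate comfort-noise FFT coefficients, one complex value per bin."""
    rng = rng or random.Random(0)
    att = ATTENUATION_DB[(bandwidth, bitrate)]
    gain = 1.0 - 10.0 ** (att / 10.0)   # fraction of noise energy to add back
    out = []
    for n_j in noise_psd:               # n_j: estimated noise energy in bin j
        target = likelihood * gain * n_j
        amp = math.sqrt(target / 2.0)   # split energy over real and imaginary
        out.append(complex(rng.gauss(0.0, amp), rng.gauss(0.0, amp)))
    return out
```

With likelihood 0 (clean speech or music detected), the generated noise vanishes, mirroring the soft-decision rejection described above.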
6.9.1.4 Noise generation and addition in the MDCT domain
If the current frame is an MDCT-based TCX frame, the comfort noise addition is performed directly in the MDCT domain. The random noise adjustment for each MDCT bin is derived from the FFT-based comfort noise adjustment. The adjusted random noise is subsequently added to the MDCT bins as the last step before the inverse transformation to the time domain.
6.9.2 Long term prediction processing
For the TCX coding mode and bitrates up to 48 kbps, LTP post filtering is applied to the output signal, using the LTP parameters transmitted in the bitstream.
6.9.2.1 Decoding LTP parameters
If LTP is active, the integer pitch lag, the fractional pitch lag and the LTP gain are decoded from the transmitted indices (equation 2093). If LTP is not active, the LTP gain is set to zero.
On the encoder side the pitch lag is computed at the LTP sampling rate; it therefore has to be converted to the output sampling rate first. For the 48 kbps bitrate, the LTP gain is additionally reduced.
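The lag conversion can be sketched as follows; the quarter-sample fractional resolution FRAC_RES and the function name are assumptions for illustration, not taken from the specification.

```python
FRAC_RES = 4  # assumed fractional pitch resolution: quarter-sample steps

def convert_pitch(pit_int, pit_fr, fs_ltp, fs_out):
    """Rescale an integer+fractional pitch lag from the LTP sampling rate
    to the output sampling rate, returning new integer and fractional parts."""
    total = (pit_int * FRAC_RES + pit_fr) * fs_out // fs_ltp
    return total // FRAC_RES, total % FRAC_RES
```

For example, doubling the sampling rate doubles the lag expressed in output samples.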
6.9.2.2 LTP post filtering
For long-term prediction with fractional pitch lags polyphase FIR interpolation filters are used to interpolate between past synthesis samples. For each combination of LTP sampling rate and output sampling rate a different set of filter coefficients is used. The index of the interpolation filter to use is determined according to the following table:
Table 178: LTP index of the interpolation filter
| Output rate \ LTP rate | | | |
| | 0 | 4 | 8 |
| | 1 | 5 | 9 |
| | 2 | 6 | 10 |
| | 3 | 7 | 11 |
The predicted signal is computed by filtering the past synthesis signal with the selected FIR filter. The filtered range of the past synthesis signal is determined by the integer part of the pitch lag. The polyphase index of the filter is determined by the fractional part of the pitch lag.
The filtered signal is computed by low-pass filtering the current synthesis signal with polyphase index 0 of the selected interpolation filter, so that its frequency response matches that of the predicted signal.
Both the predicted signal and the filtered signal are multiplied with the LTP gain. The filtered signal is then subtracted from the synthesis signal, and the predicted signal is added to it.
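A minimal sketch of this core operation, under simplifying assumptions: the caller supplies tiny stand-in FIR taps instead of the spec's polyphase interpolation filters, border handling is naive, and all identifiers are illustrative.

```python
def ltp_postfilter(syn, past, pit_int, pit_fr_taps, lp_taps, gain):
    """out[n] = syn[n] - gain * lowpass(syn)[n] + gain * pred[n].

    pred[n] interpolates the past output at integer lag pit_int using
    pit_fr_taps (the polyphase branch selected by the fractional lag);
    lp_taps stands in for polyphase branch 0, used to low-pass the current
    synthesis so it matches the frequency response of the predicted signal.
    """
    buf = past + syn            # past output followed by the current frame
    off = len(past)
    out = []
    for n in range(len(syn)):
        pred = sum(t * buf[off + n - pit_int - i]
                   for i, t in enumerate(pit_fr_taps))
        filt = sum(t * buf[off + n - i] for i, t in enumerate(lp_taps))
        out.append(syn[n] - gain * filt + gain * pred)
    return out
```

With zero gain the filter is transparent; with unit gain and trivial taps the output becomes the signal delayed by the pitch lag, illustrating the prediction path.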
If both the LTP gain and the pitch lag are the same as in the previous frame, the full frame can be processed in this way.
However, if the gain and/or the pitch lag have changed compared to the previous frame, a 5 ms transition is used to smooth the parameter change. If no delay compensation is needed, the transition starts at the beginning of the frame. If a delay needs to be compensated, the transition starts at the corresponding offset from the beginning of the frame. In that case the signal part before the transition is processed using the LTP parameters of the previous frame.
If the LTP gain of the previous frame is zero (i.e. LTP was inactive in the previous frame), a linear fade-in is used for the gain in the transition region.
If the LTP gain of the current frame is zero (LTP is inactive, but was active in the previous frame), a linear fade-out is used for the gain in the transition region, using the LTP parameters of the previous frame.
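The two linear ramps can be sketched as per-sample gain factors over the transition region; the function name and the discrete ramp shape are illustrative assumptions.

```python
def fade_gain(gain, transition_len, fade_in=True):
    """Per-sample LTP gain ramp across the transition region.

    fade_in=True:  previous gain was zero, ramp 0 -> gain (LTP switching on).
    fade_in=False: current gain is zero, ramp gain -> 0 (LTP switching off).
    """
    if fade_in:
        return [gain * (n + 1) / transition_len for n in range(transition_len)]
    return [gain * (transition_len - n) / transition_len
            for n in range(transition_len)]
```

At a 5 ms transition the region length would be 5 ms worth of output samples (e.g. 240 samples at 48 kHz).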
If LTP is active in previous and current frame and LTP parameters have changed, a zero input response is used to smooth the transition.
The LPC coefficients for zero-input LP filtering are computed from the past 20 ms of LTP output before the beginning of the transition, using the autocorrelation method and the Levinson-Durbin algorithm as described in clause 5.1.9. The zero input response is then computed by LP synthesis filtering with zero input and by applying a linear fade-out to the second half of the transition region.
Finally, the output signal in the transition region is computed by LTP filtering using the current frame parameters and subtracting the zero input response.
6.9.3 Complex low delay filter bank synthesis
The analysis stage of the CLDFB is described in clause 5.1.2.1. The synthesis stage transforms the time-frequency matrix of complex coefficients back to the time domain. The combination of analysis and synthesis is used for sample rate conversions; adaptive sample rate conversions, including sample rate changes in the signal flow, are also handled by the CLDFB.
The sample rate of the reconstructed output signal depends on the number of bands used for the synthesis stage. If the number of synthesis bands exceeds the number of bands used in the analysis stage, the coefficients of the missing bands are initialized to zero before synthesizing.
For the synthesis operation, a demodulated vector is computed for each time step over the sub-bands, using a modulation identical to the one defined for the analysis operation (see clause 5.1.2.1). The vector is then windowed by the filter bank prototype to prepare the overlap-add operation. The ten most recent windowed vectors are then combined in an overlap-add operation to reconstruct the time signal from the CLDFB coefficients.
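The final overlap-add step can be illustrated as follows, assuming windowed vectors of length ten times the number of bands and a hop of one band length (identifiers are illustrative, and windowing/demodulation are left to the caller).

```python
from collections import deque

def cldfb_overlap_add(windowed_vectors, num_bands):
    """Overlap-add of windowed vectors (length 10 * num_bands, hop num_bands)."""
    history = deque(maxlen=10)      # the ten most recent windowed vectors
    output = []
    for vec in windowed_vectors:
        history.appendleft(vec)
        frame = [0.0] * num_bands
        for age, past_vec in enumerate(history):
            # a vector from `age` steps ago contributes its segment `age`
            for n in range(num_bands):
                frame[n] += past_vec[age * num_bands + n]
        output.extend(frame)
    return output
```

Each output frame of num_bands samples sums one segment from each of the ten most recent windowed vectors, which is the overlap-add behaviour described above.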
6.9.4 High pass filtering
At the final stage, the signal is high-pass filtered to generate the final output signal. The high-pass operation is identical to the one used in the pre-processing of the EVS encoder, as described in clause 5.1.1.
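The actual filter is specified in clause 5.1.1; as a stand-in, a first-order DC-blocking high-pass illustrates the operation (the coefficient r is an assumed value, not the spec's filter design).

```python
def highpass(x, r=0.992):
    """First-order DC-blocking high-pass: y[n] = x[n] - x[n-1] + r * y[n-1]."""
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for s in x:
        out = s - x_prev + r * y_prev
        y.append(out)
        x_prev, y_prev = s, out
    return y
```

A constant (DC) input decays toward zero at the output, which is the defining property of such a high-pass stage.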