5.4.6 Handling of multiple frame losses and muting

3GPP TS 26.447, Codec for Enhanced Voice Services (EVS); Error concealment of lost packets (Release 17)


5.4.6.1 TCX MDCT

5.4.6.1.1 Background level tracing for rates 48, 96 and 128 kbps

A background noise level is traced in the time domain using a simplified version of the minimum statistics algorithm [7]. The tracing depends on the class transmitted in the bitstream: it is performed for UC only.

In contrast to the FD-CNG, which also makes use of the minimum statistics approach (see [5], subclause 4.4.3), the noise level estimation is not carried out for each spectral band separately, but directly in the time domain. The background level tracing therefore delivers an estimate of the total noise level. Furthermore, the bias compensation is disregarded in this application. Tracing of the noise level is hence achieved by computing a smoothed version of the decoder output frame amplitude and by searching for the minimum smoothed amplitude over a sliding temporal window.

If denotes the frame size in samples, denotes the sample index, denotes the frame index, and is the output frame of the core decoder at the TCX sampling rate, the current total frame level is computed as follows:

. (204)

It is first lower-limited by 0.01 and smoothed with a first-order recursive low-pass process, i.e.

, (205)

where

(206)

is an optimal smoothing parameter which depends on the signal level and the background tracing level in the previous frame. The tracing level for the current frame is obtained by searching for the minimum in a buffer containing the last 50 values of the smoothed level:

. (207)

At initialization, the buffer is filled with the value 0.01 and the smooth signal level is initialized as
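The tracing loop described in this subclause can be sketched as follows. This is a minimal illustration, not the normative algorithm: the class name `BackgroundLevelTracer` is hypothetical, the frame level is assumed to be the RMS of the frame (equation (204) is not reproduced here), the smoothing parameter of equation (206) is replaced by a fixed placeholder value, and the initial smoothed level is assumed to equal the floor value 0.01.

```python
import math
from collections import deque

LEVEL_FLOOR = 0.01   # lower limit and buffer fill value stated in the text
BUFFER_LEN = 50      # number of smoothed levels searched for the minimum


class BackgroundLevelTracer:
    """Hypothetical sketch of time-domain background level tracing."""

    def __init__(self, initial_smoothed_level=LEVEL_FLOOR, smoothing=0.95):
        # Assumptions: the initial smoothed level and the fixed smoothing
        # factor are placeholders; the text derives the smoothing parameter
        # adaptively from the signal level and the previous tracing level.
        self.buffer = deque([LEVEL_FLOOR] * BUFFER_LEN, maxlen=BUFFER_LEN)
        self.smoothed = initial_smoothed_level
        self.smoothing = smoothing
        self.trace = LEVEL_FLOOR

    def update(self, frame):
        # Assumption: the total frame level is taken as the RMS amplitude.
        level = math.sqrt(sum(s * s for s in frame) / len(frame))
        level = max(level, LEVEL_FLOOR)          # lower-limit by 0.01
        a = self.smoothing
        # First-order recursive low-pass smoothing of the frame level.
        self.smoothed = a * self.smoothed + (1.0 - a) * level
        # Keep the last 50 smoothed levels and trace their minimum.
        self.buffer.append(self.smoothed)
        self.trace = min(self.buffer)
        return self.trace
```

Because the buffer starts filled with 0.01, the traced level in this sketch can only rise above the floor once 50 frames have been processed, which mirrors the sliding-window minimum search described above.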

5.4.6.1.2 TCX time domain concealment

In the case of TCX time domain concealment as stated in subclause 5.4.2.2, the following applies.

5.4.6.1.2.1 Fading to background level

At rates 9.6, 16.4 and 24.4 kbps the fading is identical to what is described in subclause 5.3.4.2.1.

At rates 48, 96 and 128 kbps the fading is identical to what is described in subclause 5.3.4.2.1, except that the target level in the time domain is not derived from the FFT provided by CNG but is obtained from the background level tracing as described in subclause 5.4.6.1.1.

5.4.6.1.2.2 Fading to background spectral shape

No fading of the LPC is applied.

5.4.6.1.3 MDCT frame repetition with sign scrambling

In the case of TCX frequency domain concealment, i.e. frame repetition with sign scrambling as stated in subclause 5.4.2.3 and/or tonal concealment using phase prediction as stated in subclause 5.4.2.4, the following applies.

5.4.6.1.3.1 Fading to background level

The time domain signal is faded towards a target background noise level as described in equations (107) and (107a). The initial gain is 1. The derivation of the damping factor is outlined in subclause 5.4.6.1.4.

At rates 9.6, 16.4 and 24.4 kbps the target level is derived during the first lost frame based on the background noise spectrum derived by CNG during clean channel decoding (section 4.3 of [5]) as stated in subclause 5.3.4.2.1 under a).

At rates 48, 96 and 128 kbps the target level is obtained from the background level tracing as described in subclause 5.4.6.1.1.

The gain compensation for the LPC synthesis / de-emphasis as given in equation (109) is applied, see also subsection 5.2.5.

5.4.6.1.3.2 Fading to background spectral shape

The fading to background spectral shape is achieved by the following fading procedures, taking place in parallel:

a) The excitation itself is faded, in the frequency domain and prior to the FDNS, towards white noise on which a tilt is applied.

b) The excitation is shaped by FDNS towards a previously measured background shape.

c) The LTP is faded out.

5.4.6.1.3.2.1 Fading the excitation to noise

For 9.6, 16.4 and 24.4 kbps, the sign scrambled excitation (input to FDNS, see subclause 5.4.2.3) is faded towards white noise, on which a tilt is applied prior to the fading procedure. The method is based on the following parameters: the last received excitation spectrum, a noise tilt compensation factor (derived similarly to the clean channel operation) and a damping factor.

The tilt factor is given by

(208)

Subsequently a tilt vector is derived as

(209)

The noise given by equation (123) is then multiplied by the tilt vector to obtain a target noise vector with the desired tilt:

(210)

The energy of this target noise vector is derived as

(211)

and the energy of the last excitation is derived as

. (212)

The excitation is then derived as follows:

(213)

with and is given by equation (122). The fading speed is controlled by the damping factor, as described in subclause 5.4.6.1.4.

5.4.6.1.3.2.2 Shaping the excitation towards the background shape

The excitation is shaped towards a target spectral shape by altering the LPC coefficients. The fading from the last good LPC coefficients to the target LPC coefficients is performed in the LSF domain as follows:

(214)

where: are the LPC coefficients in the LSF domain of the current frame;

are the LPC coefficients in the LSF domain of the previous frame;

are the target LPC coefficients, derived according to equation (111);

is the fading factor as described in subclause 5.3.4.2.3, but limited to a minimum value of 0.8.
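As an illustration of the LSF-domain fading, the following sketch moves the previous frame's LSF vector towards the target vector. It assumes that equation (214) is a convex combination weighted by the fading factor, which is a common form for such fades but is not confirmed by the text reproduced here. The function name `fade_lsf` and the example values are hypothetical.

```python
def fade_lsf(prev_lsf, target_lsf, fading_factor, min_factor=0.8):
    """Fade LSFs from the previous frame towards the target shape.

    Assumption: current = alpha * previous + (1 - alpha) * target,
    with alpha limited to a minimum of 0.8 as stated in the text.
    """
    alpha = max(fading_factor, min_factor)
    return [alpha * p + (1.0 - alpha) * t
            for p, t in zip(prev_lsf, target_lsf)]


# Hypothetical two-coefficient example; a fading factor of 0.5 is raised to 0.8.
current = fade_lsf([1000.0, 2000.0], [500.0, 2500.0], fading_factor=0.5)
```

With the lower limit of 0.8, each lost frame moves the LSFs by at most one fifth of the remaining distance to the target, so the spectral shape changes gradually.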

void (215)

For 9.6, 16.4 and 24.4 kbps, the target spectral shape of the excitation is derived during the first lost frame based on the background noise spectrum derived by CNG during clean channel decoding (see section 4.3 of [5]). Its derivation is performed as described in subclause 5.3.4.2.2 for the harmonic excitation.

For 48, 96 and 128 kbps, the target spectral shape of the excitation is the short term mean of the last three LPC coefficient sets. Its derivation is performed as described in subclause 5.3.4.2.2 for the innovative excitation.

The achieved LPC is converted into FDNS parameters as follows:

(216)

(217)

where are the LPC coefficients. The two signals are zero-padded to a length of 128, after which a complex Fourier transform of length 128 is applied to them to obtain the real part and the imaginary part (see [5], subsection 5.1.4). The FDNS parameters are finally obtained as:

(218)

5.4.6.1.3.2.3 LTP fade-out

The LTP continues to run during concealment. The LTP lag is kept constant. The LTP gain is faded towards zero as follows:

(219)

where: is the LTP gain of the current frame;

is the LTP gain of the previous frame;

is the damping factor; its derivation is outlined in subclause 5.4.6.1.4.
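The LTP gain fade-out can be illustrated with a short sketch. It assumes that equation (219) multiplies the previous frame's gain by the damping factor, which matches the parameter list above but is not reproduced in the text. The function name `fade_ltp_gain`, the starting gain and the damping values are hypothetical.

```python
def fade_ltp_gain(prev_gain, damping):
    """Assumed form of the fade: current gain = damping * previous gain."""
    return damping * prev_gain


# Hypothetical run over four consecutive lost frames; the lag is kept
# constant during concealment, so only the gain changes.
gain = 0.6
history = []
for damping in (1.0, 0.9, 0.8, 0.7):
    gain = fade_ltp_gain(gain, damping)
    history.append(gain)
```

Since the damping factor does not exceed 1 in this sketch, the gain never increases from one lost frame to the next and approaches zero over a long burst.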

5.4.6.1.4 Fading speed

Several algorithms use a time-varying damping factor for fade-out, cross-fade etc. Depending on the application, either the damping factor or the cumulative damping factor is needed.

The damping factor depends on the number of lost frames and the ISF stability factor. The ISF stability factor is already computed in the clean channel. With the lost frame having the index 0, it is derived as follows:

(220)

(221)

(222)

The cumulative damping factor is initialized with 1 during clean-channel decoding and derived as follows during concealment:

(223)
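The relation between the per-frame damping factor and the cumulative damping factor can be sketched as a running product. This assumes that equation (223) multiplies the previous cumulative value by the current damping factor. The class name `DampingState` and the damping values in the example are hypothetical; the actual per-frame values follow from equations (220) to (222), which are not reproduced here.

```python
class DampingState:
    """Hypothetical tracker for the cumulative damping factor."""

    def __init__(self):
        self.cumulative = 1.0   # value held during clean-channel decoding

    def good_frame(self):
        # A correctly received frame resets the cumulative factor to 1.
        self.cumulative = 1.0

    def lost_frame(self, damping):
        # Assumption: cumulative factor is the product of all damping
        # factors since the last good frame.
        self.cumulative *= damping
        return self.cumulative


state = DampingState()
trace = [state.lost_frame(d) for d in (1.0, 0.8, 0.8)]  # hypothetical values
state.good_frame()
```

An algorithm that scales relative to the previous frame would use the per-frame factor, while one that scales relative to the last good frame would use the cumulative factor.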

5.4.6.1.5 Waveform adjustment

The fade-out is performed as described in subclause 5.4.6.1.3, except that no LPC gain compensation (see subclause 5.2.5) takes place.

5.4.6.2 HQ MDCT

5.4.6.2.1 Burst loss handling for 8 kHz audio output sampling rate

The burst loss handling for 8 kHz audio sampling rate is described as part of the HQ MDCT PLC method description for 8 kHz signals, see clauses 5.4.3.3 and 5.4.3.4.

5.4.6.2.2 Burst loss handling for audio output sampling rates larger than or equal to 16 kHz

In case the audio output sampling rate exceeds 8 kHz and the current frame loss is the first loss after a good HQ MDCT frame, the PLC method is selected according to the method described in subclause 5.4.3.2. If, however, the current frame loss is at least the second consecutive loss after a preceding good HQ MDCT frame, then the procedure described in this clause applies.

In case the current frame loss is the second loss in a row and the PLC method according to subclause 5.4.3.6 was applied for the first bad frame, Phase ECU according to subclause 5.4.3.5 is applied with the following adaptations: Transient analysis and spectrum analysis are carried out with the previous synthesis signal of the last good HQ MDCT frame. The offset in number of samples since the last good frame is accordingly incremented by.

Otherwise, in case the current frame loss is the second loss in a row and if Phase ECU was applied for the first frame loss, Phase ECU according to subclause 5.4.3.5 is applied with the adaptation that no spectral analysis is carried out and that transient analysis relies on previously calculated and stored parameters. Details are described in subclause 5.4.3.5.1.

In case the current frame loss is the third or more in a row, Phase ECU is applied according to subclause 5.4.3.5 with the adaptation that no spectral analysis is carried out and that transient analysis relies on previously calculated and stored parameters (that were calculated based on the synthesis of the last good HQ MDCT frame). The operation of the Phase ECU is modified in response to the frame loss burst condition. Specifically, magnitude and phase of the substitution frame spectrum are adjusted in order to mitigate potential quality losses that might otherwise arise from too periodic or tonal sounds. With increasing loss burst length, the magnitude spectrum is adjusted by gradually increasing attenuation. At the same time, the phase spectrum is dithered to an increasing degree. Further details are described in subclause 5.4.3.5.1.
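The case distinction for the second and later consecutive losses can be summarized in a small decision sketch. All names here (`BurstPlan`, `plan_phase_ecu`, the method labels) are hypothetical and only mirror the prose of this subclause. The sketch assumes that the burst-dependent magnitude attenuation and phase dithering start with the third consecutive loss, which is how the text is read here. Cases the text does not cover raise an error instead of guessing.

```python
from dataclasses import dataclass

PHASE_ECU = "phase_ecu"      # hypothetical label for subclause 5.4.3.5
OTHER_PLC = "plc_5_4_3_6"    # hypothetical label for subclause 5.4.3.6


@dataclass(frozen=True)
class BurstPlan:
    spectral_analysis: bool   # run spectrum analysis on last good synthesis
    transient_source: str     # "last_good_synthesis" or "stored_parameters"
    burst_adaptation: bool    # attenuate magnitudes and dither phases


def plan_phase_ecu(consecutive_losses, first_loss_method):
    """Return how Phase ECU would be run for the current lost frame."""
    if consecutive_losses < 2:
        # The first loss is handled by the selection in subclause 5.4.3.2.
        raise ValueError("first loss is not covered by this procedure")
    if consecutive_losses == 2:
        if first_loss_method == OTHER_PLC:
            # Analyses run on the synthesis of the last good frame.
            return BurstPlan(True, "last_good_synthesis", False)
        if first_loss_method == PHASE_ECU:
            # Reuse parameters stored during the first lost frame.
            return BurstPlan(False, "stored_parameters", False)
        raise ValueError("case not described in the text")
    # Third or later loss: stored parameters plus burst adaptations.
    return BurstPlan(False, "stored_parameters", True)
```

For example, `plan_phase_ecu(2, OTHER_PLC)` requests a fresh spectral analysis, whereas `plan_phase_ecu(5, PHASE_ECU)` reuses stored parameters and enables the burst adaptations.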

A special feature is the long-term muting behaviour in case of long loss bursts with many consecutive lost frames. In that case, the quality of the audio signal reconstructed by Phase ECU might still suffer from tonal artefacts, despite the phase randomization performed. At the same time, too strong a magnitude attenuation could lead to quality impairments, as it could be perceived as signal drop-outs. The feature avoids such impairments to a large degree by gradually superposing the substitution signal of the Phase ECU with a noise signal whose frequency characteristic is a low-resolution spectral representation of a previously received good frame. With an increasing number of frame losses in a row, the substitution signal of the Phase ECU is gradually attenuated. At the same time, the frame energy loss is compensated for through the addition of a noise signal with spectral characteristics similar to those of the last received good frame, but with a certain degree of low-pass behaviour. For very long frame loss bursts () the additional noise contribution is faded out in order to enforce a muting characteristic of the decoder. Further details of the long-term muting feature are described in subclauses 5.4.3.5.1 and 5.4.3.5.3.