6.7.3 Decoding for FD-CNG

3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description, Release 15

In FD-CNG, the comfort noise is generated in the frequency domain. Based on the information provided by the SID frames, the amplitude of the random sequences can be individually computed in each band such that the spectrum of the generated comfort noise resembles the spectrum of the actual background noise in the input signal.

Unfortunately, the limited number of parameters transmitted in the SID frames allows only the smooth spectral envelope of the background noise to be reproduced. Hence, it cannot capture the fine spectral structure of the noise. At the output of a DTX system, the discrepancy between the smooth spectrum of the reconstructed comfort noise and the spectrum of the actual background noise can become very audible at the transitions between active frames (involving regular coding and decoding of a noisy speech portion of the signal) and the comfort noise frames. Since the fine spectral structure of the background noise cannot be transmitted efficiently from the encoder to the decoder, it is highly desirable to recover this information directly at the decoder side. This can be carried out using a noise estimator.

Note that noise-only frames are considered as inactive frames in a DTX system. Therefore, the noise estimation in the decoder must operate during active phases only, i.e., on noisy speech contents. In FD-CNG, the decoder uses the same noise estimation algorithm as the encoder, but with a significantly higher spectral resolution.

6.7.3.1 Decoding SID frames in FD-CNG

The decoded SID parameters describe the energy of the background noise in the spectral partitions defined in subclause 5.6.3.6. The first parameters capture the spectral energy of the noise in FFT bins covering the core bandwidth. The remaining parameters capture the spectral energy of the noise in CLDFB bands above the core bandwidth.

6.7.3.1.1 SID parameters decoding

The SID parameters are decoded using MSVQ decoding and global gain adjustment.

Seven indices are decoded from the bitstream. The first six indices are used for MSVQ decoding and correspond to the six stages of the MSVQ. The first index is encoded on 7 bits and the next five indices are each encoded on 6 bits. The last index, encoded on 7 bits, is used for decoding the global gain.
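The bit allocation above amounts to 7 + 5 x 6 + 7 = 44 bits. A minimal sketch of the index extraction follows; the function name, the back-to-back packing and the MSB-first bit order are assumptions made for illustration and are not taken from this specification.

```python
# Bit widths of the seven SID indices as stated above:
# six MSVQ stage indices (7, 6, 6, 6, 6, 6 bits) followed by a 7-bit gain index.
SID_INDEX_BITS = (7, 6, 6, 6, 6, 6, 7)


def read_sid_indices(bits):
    """Split a sequence of 0/1 values into the seven SID indices.

    Assumption (hypothetical): indices are packed back to back, MSB first.
    Returns (msvq_indices, gain_index).
    """
    if len(bits) < sum(SID_INDEX_BITS):
        raise ValueError("not enough bits for the FD-CNG SID indices")
    pos = 0
    indices = []
    for width in SID_INDEX_BITS:
        value = 0
        for bit in bits[pos:pos + width]:
            value = (value << 1) | (bit & 1)
        indices.append(value)
        pos += width
    return indices[:6], indices[6]
```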

The MSVQ decoder output is given by

, (1972)

where is the -th coefficient of the -th vector in the codebook of stage .
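As an illustration of multi-stage decoding in general, the sketch below sums, coefficient by coefficient, the codevector selected in each stage. The function name and the toy codebooks are hypothetical; the actual EVS codebook tables and the global gain adjustment are not reproduced here.

```python
def msvq_decode(codebooks, indices):
    """Sum the selected codevector of every stage, coefficient by coefficient.

    codebooks[k][j][i]: i-th coefficient of the j-th vector of stage k.
    indices[k]: decoded index for stage k.
    """
    if len(codebooks) != len(indices):
        raise ValueError("one index per stage is required")
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for stage, index in zip(codebooks, indices):
        vector = stage[index]
        for i in range(dim):
            out[i] += vector[i]
    return out


# Toy two-stage codebooks (hypothetical values, not the EVS tables).
toy_codebooks = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, 0.5, 0.5], [-1.0, 0.0, 1.0]],
]
decoded = msvq_decode(toy_codebooks, [1, 1])
```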

The decoded global gain is given by

. (1973)

The SID parameters are then obtained as

. (1974)

Finally, the last band parameter is adjusted if the encoded last band size differs from the decoded last band size.

6.7.3.1.2 SID parameters interpolation

The SID parameters are interpolated using linear interpolation in the log domain. The interpolation is carried out separately for the FFT partitions and the CLDFB partitions.

, (1975)

where

(1976)

denotes the centre bin in each spectral partition, and

(1977)

is the multiplicative increment.
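Linear interpolation in the log domain is equivalent to multiplying by a constant ratio from one bin to the next between two partition centres. The sketch below illustrates this under stated assumptions: the function name is hypothetical, the partition levels are strictly positive, and bins outside the first and last centre simply hold the nearest level, which this subclause does not confirm.

```python
def interpolate_log_domain(levels, centres, n_bins):
    """Spread partition levels over bins by linear interpolation in the log domain.

    levels[i]  : strictly positive energy of partition i
    centres[i] : centre bin of partition i (strictly increasing integers)
    Assumption: bins outside [centres[0], centres[-1]] hold the nearest level.
    """
    out = [0.0] * n_bins
    for j in range(0, min(centres[0], n_bins)):
        out[j] = levels[0]
    for i in range(1, len(levels)):
        span = centres[i] - centres[i - 1]
        # Multiplicative increment: constant ratio between consecutive bins.
        delta = (levels[i] / levels[i - 1]) ** (1.0 / span)
        value = levels[i - 1]
        for j in range(centres[i - 1], centres[i]):
            if j < n_bins:
                out[j] = value
            value *= delta
    for j in range(centres[-1], n_bins):
        out[j] = levels[-1]
    return out
```

With levels 1 and 100 centred on bins 2 and 4, the bin halfway between the centres receives the geometric mean, 10.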

6.7.3.1.3 LPC estimation from the interpolated SID parameters

A set of LPC coefficients is estimated from the SID spectrum in order to update excitation and LPC related memories.

A noise floor is first added to the interpolated SID parameters and a pre-emphasis function is then applied in the frequency domain

, (1978)

where the pre-emphasis factor is 0.68 at 12.8 kHz and 0.72 at 16 kHz.

The noise estimates are then transformed using an inverse FFT, producing autocorrelation coefficients

. (1979)

Then the first autocorrelation coefficient is adjusted

. (1980)

Finally, the LPC coefficients are estimated from the autocorrelation coefficients using the Levinson-Durbin algorithm (see subclause 5.1.9.4).
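The chain from a noise power spectrum to LPC coefficients can be sketched as follows. This is an illustration under assumptions, not the normative procedure: the pre-emphasis is modelled as the power response of a first-order filter 1 - mu z^-1, the spectrum is given on a full, symmetric frequency circle, and the noise floor and the adjustment of the first autocorrelation coefficient are omitted. All function names are hypothetical.

```python
import math


def preemphasise_power_spectrum(power, mu):
    """Multiply P(j) by |1 - mu*exp(-jw)|^2 at w = 2*pi*j/N (full-circle spectrum)."""
    n = len(power)
    return [p * (1.0 + mu * mu - 2.0 * mu * math.cos(2.0 * math.pi * j / n))
            for j, p in enumerate(power)]


def autocorrelation(power, n_lags):
    """Inverse DFT of a real, symmetric power spectrum (Wiener-Khinchin)."""
    n = len(power)
    return [sum(p * math.cos(2.0 * math.pi * j * k / n)
                for j, p in enumerate(power)) / n
            for k in range(n_lags)]


def levinson_durbin(r, order):
    """Return (a, err) with A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                      # reflection coefficient of stage m
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k                  # updated prediction error energy
    return a, err


flat = [1.0] * 16                           # white-noise power spectrum
r = autocorrelation(preemphasise_power_spectrum(flat, 0.68), 3)
lpc, err = levinson_durbin(r, 2)
```

For a flat input spectrum, the pre-emphasized autocorrelation reduces to r(0) = 1 + mu^2, r(1) = -mu and r(2) = 0, which gives a quick sanity check of the first two functions.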

6.7.3.2 Noise tracking during active frames in FD-CNG

At the decoder side, the noise estimator is applied at the output of the core decoder during active frames. To achieve a trade-off between spectral resolution and computational complexity, spectral energies are averaged among groups of spectral bands called partitions, just like in the encoder (see subclause 5.6.3.1). However, the size of each partition is significantly smaller in the decoder compared to the encoder, yielding thereby a finer quantization of the frequency axis in the decoder. Moreover, the decoder-side noise estimation operates solely in the FFT domain and covers only the core bandwidth. Hence FFT partitions are formed, but no CLDFB partitions.

6.7.3.2.1 Spectral partition energies

The output of the core decoder is first transformed by an FFT whose size depends on the sampling rate of the core decoder output. Then partitions are formed as follows:

, (1981)

where is the FFT transform of the core output signal. The following table lists the number of partitions and their upper boundaries for the different FD-CNG configurations at the decoder, as a function of bandwidths and bit-rates.

Table : FD-CNG decoder parameters

Bandwidth | Bit-rates (kbps) | Number of partitions | Upper partition boundaries (Hz)
NB | | 56 | 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575, 600, 625, 650, 675, 700, 725, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975, 1000, 1075, 1175, 1275, 1375, 1475, 1600, 1725, 1850, 2000, 2150, 2325, 2500, 2700, 2925, 3150, 3400, 3975
WB/SWB/FB | | 62 | 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575, 600, 625, 650, 675, 700, 725, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975, 1000, 1075, 1175, 1275, 1375, 1475, 1600, 1725, 1850, 2000, 2150, 2325, 2500, 2700, 2925, 3150, 3375, 3700, 4050, 4400, 4800, 5300, 5800, 6375
WB/SWB/FB | | 61 | 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575, 600, 625, 650, 675, 700, 725, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975, 1000, 1075, 1175, 1275, 1375, 1475, 1600, 1725, 1850, 2000, 2150, 2325, 2500, 2700, 2925, 3150, 3400, 3700, 4400, 5300, 6400, 7700, 7975

For each partition, the tabulated boundary corresponds to the frequency of the last band in the i-th partition. The indices of the first and last bands in each spectral partition can be derived as a function of the FFT size and the sampling rate of the core decoder as follows:

, (1982)

, (1983)

where the frequency of the first band in the first spectral partition is 50 Hz. Hence the FD-CNG generates comfort noise above 50 Hz only.
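A minimal sketch of how partition limits and partition energies might be derived from a list of upper boundaries follows. The mapping of a frequency f to bin round(f x FFT size / sampling rate), the rule that each partition starts one bin after the previous one ends, the plain averaging of squared magnitudes, and the function names are all assumptions made for illustration. The usage note assumes an FFT size of 512 at 12.8 kHz, i.e. a 25 Hz bin spacing, which matches the 25 Hz grid of the boundaries but is not stated in this subclause.

```python
def partition_bounds(upper_hz, fs_hz, n_fft, f_min_hz=50.0):
    """Return (first_bin, last_bin) per partition.

    Assumption: a frequency f maps to bin round(f * n_fft / fs_hz), and each
    partition starts one bin after the previous partition ends.
    """
    bounds = []
    first = int(round(f_min_hz * n_fft / fs_hz))
    for f_max in upper_hz:
        last = int(round(f_max * n_fft / fs_hz))
        bounds.append((first, last))
        first = last + 1
    return bounds


def partition_energies(spectrum, bounds):
    """Average |X(j)|^2 over the bins of each partition."""
    energies = []
    for first, last in bounds:
        total = sum(abs(spectrum[j]) ** 2 for j in range(first, last + 1))
        energies.append(total / (last - first + 1))
    return energies
```

Under these assumptions, the boundaries 50, 75 and 100 Hz each map to a single-bin partition (bins 2, 3 and 4), while the wider partitions at higher frequencies group several bins.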

6.7.3.2.2 FD-CNG noise estimation

In FD-CNG, encoder and decoder rely on the same noise estimator, except that the number of partitions differs. As in subclause 5.6.3.2, the input partition energies are first processed by a non-linear transform before the noise tracking algorithm is applied. The inverse transform is then used to recover the original dynamic range. In the sequel, the resulting decoder-side noise estimates are referred to as the shaping parameters. They are used in the next subclause.

6.7.3.2.3 Noise shaping in FD-CNG

Note that the shaping parameters computed directly at the decoder differ from the SID parameters which are transmitted via SID frames. Both sets are computed from FFT partitions covering the core bandwidth, but the decoder benefits from a significantly higher number of spectral partitions. In fact, the high-resolution noise estimates obtained at the decoder capture information about the fine spectral structure of the background noise. However, the decoder-side noise estimates cannot adapt to changes in the actual background noise during inactive phases. In contrast, the SID frames deliver new information about the spectral envelope at regular intervals during inactive phases. The FD-CNG therefore combines these two sources of information in an effort to reproduce the fine spectral structure captured from the background noise present during active phases, while updating only the spectral envelope of the comfort noise during inactive parts with the help of the SID information.

6.7.3.2.3.1 Conversion to a lower spectral resolution

Interpolation is first applied to the shaping parameters to obtain a full-resolution FFT power spectrum as follows:

, (1984)

where

(1985)

denote the center FFT bins in each spectral partition, and is the multiplicative increment for the interpolation. The above corresponds in fact to a linear interpolation in the log domain of the FFT shaping partitions.

The full-resolution spectrum is subsequently converted again to a lower resolution based on the SID spectral partitions (see subclause 5.6.3.6). The resulting noise energy spectrum exhibits therefore the same spectral resolution as the SID parameters. Hence, both sets are comparable and can be combined in the next subclause.

6.7.3.2.3.2 Combining SID and shaping parameters

Comparing the low-resolution noise estimates obtained from the encoder (via SID frames) and from the decoder, the full-resolution noise spectrum can now be scaled to yield a full-resolution noise power spectrum as follows:

, (1986)

where the bin limits refer to the first and last FFT bin of the i-th SID partition (see subclause 5.6.3.6). The full-resolution noise power spectrum is recomputed for each active frame. It can be used to accurately adjust the level of comfort noise in each FFT bin during SID frames or zero frames, as shown in the next subclause.
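One way to read the combination described above is a per-partition rescaling: within each SID partition, the full-resolution spectrum is multiplied by the ratio of the SID level to the decoder-side low-resolution level, so that the fine structure is kept while the partition average follows the SID information. The sketch below assumes this ratio form; the function names and the handling of a zero decoder-side level are hypothetical.

```python
def average_over_partitions(spectrum, bounds):
    """Average the spectrum over each (first_bin, last_bin) partition."""
    return [sum(spectrum[first:last + 1]) / (last - first + 1)
            for first, last in bounds]


def scale_full_resolution(full_res, sid_levels, shaping_low_res, bounds):
    """Scale each SID partition of the full-resolution spectrum.

    full_res[j]        : decoder-side full-resolution noise spectrum
    sid_levels[i]      : SID parameter of partition i (from the encoder)
    shaping_low_res[i] : decoder-side estimate averaged over partition i
    bounds[i]          : (first_bin, last_bin) of SID partition i
    Assumption: a partition whose decoder-side level is zero is set flat to
    the SID level, since no ratio can be formed.
    """
    out = list(full_res)
    for (first, last), sid, low in zip(bounds, sid_levels, shaping_low_res):
        for j in range(first, last + 1):
            out[j] = full_res[j] * sid / low if low > 0.0 else sid
    return out
```

After scaling, averaging the result over the SID partitions returns the SID levels, while the relative shape inside each partition is unchanged.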

6.7.3.3 Noise generation for SID or zero frames in FD-CNG

6.7.3.3.1 Update of the noise levels for FD-CNG

During the first non-active frame following an active frame, the low-resolution shaping parameters are recomputed from the full-resolution noise spectrum by averaging over the SID partitions, as in subclause 5.6.3.1.1.

If an SID frame occurs while the noise estimator (subclause 6.7.3.2.2) is still in its initialization phase, the interpolated SID parameters (see subclause 6.7.3.1.2) are used as comfort noise levels.

If an SID frame occurs once the noise estimator has left the initialization phase, the comfort noise levels are computed for FFT bins by combining the noise estimates from encoder and decoder as described in subclause 6.7.3.2.3.2, while the CLDFB levels are obtained directly by interpolating the SID parameters corresponding to the CLDFB partitions, i.e.,

, (1987)

where the SID parameters are decoded from the SID frames, the low-resolution shaping parameters are computed during the first non-active frame following an active frame, as explained above, and the full-resolution noise spectrum is obtained by interpolating the decoder-side noise estimates during active frames (see subclause 6.7.3.2.3.1).

6.7.3.3.2 Comfort noise generation in the frequency domain

The FFT noise spectrum corresponding to the first parameters in the array can finally be used to generate some random Gaussian noise of zero mean and variance separately for the real and imaginary parts of the FFT coefficients.

The second part of the CNG spectrum, i.e., corresponds to the CLDFB noise levels for frequencies above the core bandwidth. For each CLDFB time slot, some random Gaussian noise of zero mean and variance is generated separately for the real and imaginary parts of the CLDFB coefficients corresponding to frequencies above the core bandwidth.
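The generation of random coefficients from a set of noise levels can be sketched as follows. The split of each level into equal variances (level/2) for the real and imaginary parts, so that the expected power of the complex coefficient equals the level, is an assumption made for illustration, as is the function name. A seeded generator is used only to make the example reproducible.

```python
import math
import random


def generate_noise_coefficients(levels, rng):
    """Draw one complex coefficient per level.

    Assumption: real and imaginary parts each get variance level/2, so that the
    expected power of the complex coefficient equals the level.
    """
    coeffs = []
    for level in levels:
        sigma = math.sqrt(level / 2.0)
        coeffs.append(complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)))
    return coeffs


# Example: three bins with increasing noise level.
example = generate_noise_coefficients([0.5, 1.0, 2.0], random.Random(0))
```

The same routine would apply to FFT bins (once per frame) and to CLDFB bands (once per time slot), with the respective levels as input.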

6.7.3.3.3 Comfort noise generation in the time domain

The FFT coefficients obtained after comfort noise generation in the frequency domain are transformed by an inverse FFT, producing a time-domain CNG signal. This signal is then windowed using a sine-based window, defined piecewise as follows:

, for . ()

, for . ()

, for . ()

, for . ()

, for . ()

An overlap-add method is finally applied on the current windowed CNG signal and the previous windowed CNG signal. The final FD-CNG frame corresponds to.

To avoid discontinuities at transitions from active frames to inactive frames, a cross-fading mechanism is employed. Several approaches are described in the following subclauses, depending on the bitrate and the previous encoding mode.

6.7.3.3.3.1 Transitions from MDCT to FD-CNG at 9.6 kbps, 16.4 kbps and 24.4 kbps

At 9.6 kbps, 16.4 kbps and 24.4 kbps and when the previous frame was active and encoded with an MDCT-based coding mode, a cross-fading operation is performed using the MDCT window.

First, the left part of the FD-CNG frame is not windowed after the inverse FFT and there is no overlap-add operation with the previous (missing) FD-CNG frame.

Instead, the left part of the FD-CNG frame is windowed using the complementary version of the MDCT window used in the right part of the previous MDCT-based frame (see subclause 6.2.4). An overlap-add method is then applied on the MDCT-windowed FD-CNG frame and the previous MDCT-based frame.

6.7.3.3.3.2 Transitions from ACELP to FD-CNG at 9.6 kbps, 16.4 kbps and 24.4 kbps

At 9.6 kbps, 16.4 kbps and 24.4 kbps and when the previous frame was active and encoded with ACELP, a cross-fading operation is performed using an extrapolation of the previous ACELP frame.

A random excitation with the same energy as the last half of the excitation of the previous ACELP frame is computed

()

where is the total excitation of the previous ACELP frame (see subclause 6.1.1.2), is a random Gaussian noise with zero mean and is the length of the ACELP frame.
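The energy matching can be illustrated by the sketch below, which scales a Gaussian sequence so that its per-sample energy equals that of the last half of the previous excitation. The exact normalization, the number of generated samples and the function name are assumptions, since they are not recoverable from the text above.

```python
import math
import random


def random_excitation(prev_exc, n_out, rng):
    """Gaussian excitation whose per-sample energy matches the last half of prev_exc.

    Assumptions: energies are compared per sample, and n_out (the number of
    samples to extrapolate) is chosen by the caller.
    """
    half = prev_exc[len(prev_exc) // 2:]
    target = sum(x * x for x in half) / len(half)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    actual = sum(x * x for x in noise) / n_out
    gain = math.sqrt(target / actual) if actual > 0.0 else 0.0
    return [gain * x for x in noise]
```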

The random excitation is then filtered through the same LPC synthesis filter and de-emphasis filter as used in the last subframe of the previous ACELP frame (see subclause 6.1.3), producing the extrapolated ACELP signal.

Finally, the extrapolated ACELP signal is windowed and the overlap-add method is applied as if the extrapolated ACELP signal were the previous FD-CNG frame.

6.7.3.3.3.3 Transitions from active to FD-CNG at bitrates <= 8 kbps and at 13.2 kbps

At bitrates <= 8 kbps and at 13.2 kbps, an FD-CNG frame is converted to a combination of an excitation signal and a set of LPC coefficients. It then becomes very similar to an LP-CNG frame, and can be further processed as such.

First, the left part of the FD-CNG frame is not windowed after the inverse FFT and there is no overlap-add operation with the previous (missing) FD-CNG frame.

The time-domain FD-CNG signal is then pre-emphasized and filtered through the LP analysis filter (see subclause 6.7.3.1.3), producing an excitation signal.

The set of LPC coefficients associated with the excitation is. These LPC coefficients are then used to synthesize the final CNG signal using the excitation and the filter memories from the previous frame, as is done in LP-CNG.

6.7.3.3.4 FD-CNG decoder memory update

The LPC coefficients are converted to LSP and to LSF. The LPC, LSP and LSF are then used to update all LPC/LSP/LSF-related memories.

The FD-CNG time-domain output is used to update all signal domain memories.

The FD-CNG time-domain signal is pre-emphasized and the pre-emphasized signal is then used to update all pre-emphasized signal domain memories.

The pre-emphasized signal is filtered through the LP analysis filter and the obtained residual is then used to update all excitation domain memories.
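The two filtering steps used for the memory updates can be sketched as follows: a first-order pre-emphasis followed by LP analysis filtering to obtain the residual. The function names and the explicit memory arguments are hypothetical; the filter convention A(z) = 1 + a(1) z^-1 + ... is assumed, and the pre-emphasis factor would be the one given in subclause 6.7.3.1.3.

```python
def pre_emphasis(signal, mu, prev_sample=0.0):
    """y[n] = x[n] - mu * x[n-1]; prev_sample is the filter memory."""
    out = []
    for x in signal:
        out.append(x - mu * prev_sample)
        prev_sample = x
    return out


def lp_residual(signal, lpc, memory=None):
    """e[n] = sum_i lpc[i] * x[n-i], with lpc[0] == 1.

    memory holds the last len(lpc)-1 samples before signal (zeros if omitted).
    """
    order = len(lpc) - 1
    history = list(memory) if memory is not None else [0.0] * order
    padded = history + list(signal)
    return [sum(lpc[i] * padded[n + order - i] for i in range(order + 1))
            for n in range(len(signal))]
```

For example, filtering the decaying sequence 1, 0.5, 0.25 through A(z) = 1 - 0.5 z^-1 leaves a single impulse as residual, since the sequence is exactly predictable from its previous sample.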

All the other memories are updated similarly to LP-CNG (see subclause 6.7.2.2).