6.8.1 Decoding and speech synthesis

3GPP TS 26.445, Release 15: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description

6.8.1.1 Excitation decoding

The decoding process is performed in the following order:

Decoding of LP filter parameters: The received ISP quantization indices are used to reconstruct the quantized ISP vector. The interpolation described in subclause 5.7.2.6 is performed to obtain 4 interpolated ISP vectors (corresponding to the 4 subframes). For each subframe, the interpolated ISP vector is converted to the LP filter coefficient domain, and the resulting coefficients are used to synthesize the reconstructed speech in that subframe.

The following steps are repeated for each subframe:

1. Decoding of the adaptive codebook vector: The received pitch index (adaptive codebook index) is used to find the integer and fractional parts of the pitch lag. The adaptive codebook vector is found by interpolating the past excitation (at the pitch delay) using the FIR filter described in subclause 5.7. The received adaptive filter index indicates which of the two forms of the filtered adaptive codebook is used.

2. Decoding of the innovative vector: The received algebraic codebook index is used to extract the positions and amplitudes (signs) of the excitation pulses and to find the algebraic codevector. If the integer part of the pitch lag is less than the subframe size of 64, the pitch sharpening procedure is applied. This modifies the algebraic codevector by filtering it through the adaptive prefilter, which consists of two parts: a periodicity enhancement part, which depends on the integer part of the pitch lag, and a tilt part, whose coefficient is related to the voicing of the previous subframe and is bounded by [0.0, 0.5].

3. Decoding of the adaptive and innovative codebook gains: The received index gives the fixed codebook gain correction factor. The estimated fixed codebook gain is found as described in subclause 5.8. First, the predicted energy for every subframe is found by

(1994)

and then the mean innovation energy is found by

()

The predicted gain is found by

()

The quantized fixed codebook gain is given by

()

4. Computing the reconstructed speech: The following steps are for n = 0, …, 63. The total excitation is constructed by:

()
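As a rough illustration of step 4, the sketch below assumes the conventional CELP form in which the total excitation is the gain-scaled sum of the adaptive and innovative codevectors. The function name and the symbols `v`, `c`, `g_p`, and `g_c` are illustrative and are not taken from this specification.

```python
SUBFRAME_LEN = 64  # subframe size stated in the text (n = 0, ..., 63)


def total_excitation(v, c, g_p, g_c):
    """Assumed CELP form: u(n) = g_p * v(n) + g_c * c(n).

    v   : adaptive codebook vector (hypothetical name)
    c   : innovative (algebraic) codevector after pitch sharpening
    g_p : adaptive codebook gain
    g_c : quantized fixed codebook gain
    """
    if len(v) != SUBFRAME_LEN or len(c) != SUBFRAME_LEN:
        raise ValueError("expected 64-sample subframe vectors")
    return [g_p * v[n] + g_c * c[n] for n in range(SUBFRAME_LEN)]
```

The function is called once per subframe, after both codevectors and both gains have been decoded.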

6.8.1.2 Excitation post-processing

Before the synthesis, the excitation signal is post-processed to form the updated excitation signal, as follows.

6.8.1.2.1 Anti-sparseness processing

This is the same as described in subclause 6.1.1.3.1.

6.8.1.2.2 Gain smoothing for noise enhancement

This is the same as described in subclause 6.1.1.3.2.

6.8.1.2.3 Pitch enhancer

This is the same as described in subclause 6.1.1.3.3.

6.8.1.3 Synthesis filtering

Once the excitation post-processing is done, the modified excitation is passed through the synthesis filter, as described in subclause 6.1.3, to obtain the decoded synthesis for the current frame. Based on the content bandwidth in the decoded synthesis signal, an output mode is determined (e.g., NB or WB). If the output mode is determined to be NB, then the content above 4 kHz is attenuated using CLDFB synthesis (e.g., as described in clause 6.9.3) and, subsequently, high frequency synthesis (6.8.3) is not performed on the bandlimited content.

6.8.1.4 Music and Unvoiced/inactive Post-processing

6.8.1.4.1 Music post processing

Most of the music post processing is the same as in subclause 6.1.1.3.4. The main difference is that a first synthesis is computed and a first-stage classification is derived from this synthesis, as described in subclause 5.3.1 of [6]. If the synthesis is classified as unvoiced, or the content is INACTIVE (VAD == 0), or the long-term background noise energy, as defined below, is greater than or equal to 15 dB, the AMR-IO decoder goes through the unvoiced and inactive post processing path as described in subclause 6.8.1.4.2.

The long-term background noise energy is updated for an INACTIVE frame as:

()

where the updated quantity is the long-term background noise energy. It is updated only when the current frame is classified as INACTIVE. The pitch lag value, T', over which the background noise energy is computed, is given by

()

The equation uses the fractional pitch lag at subframe i, the frame length, and the subframe length. If none of the conditions above for the unvoiced and inactive path holds, the music post processing is entered as described below.
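The path selection described in this subclause can be sketched as a simple predicate. This is only an illustration of the stated conditions; the function and argument names are hypothetical.

```python
NOISE_THRESHOLD_DB = 15.0  # threshold stated in the text


def select_post_processing(is_unvoiced, vad, lt_noise_db):
    """Return the post processing path implied by the stated conditions.

    is_unvoiced : first-stage classification result (hypothetical flag)
    vad         : voice activity decision, 0 meaning INACTIVE
    lt_noise_db : long-term background noise energy in dB
    """
    inactive = (vad == 0)
    if is_unvoiced or inactive or lt_noise_db >= NOISE_THRESHOLD_DB:
        return "unvoiced_inactive"
    return "music"
```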

6.8.1.4.1.1 Excitation buffering and extrapolation

This is the same as described in subclause 6.1.1.3.4.1.

6.8.1.4.1.2 Windowing and frequency transform

This is the same as described in subclause 6.1.1.3.4.2.

6.8.1.4.1.3 Energy per band and per bin analysis

This is the same as described in subclause 6.1.1.3.4.3.

6.8.1.4.1.4 Excitation type classification

This is the same as described in subclause 6.1.1.3.4.4.

6.8.1.4.1.5 Inter-tone noise reduction in the excitation domain

This is the same as described in subclause 6.1.1.3.4.5.

6.8.1.4.1.6 Inter-tone quantization noise estimation

This is the same as described in subclause 6.1.1.3.4.6.

6.8.1.4.1.7 Increasing spectral dynamic of the excitation

This is the same as described in subclause 6.1.1.3.4.7.

6.8.1.4.1.8 Per bin normalization of the spectrum energy

This is the same as described in subclause 6.1.1.3.4.8.

6.8.1.4.1.9 Smoothing of the scaled energy spectrum along the frequency axis and the time axis

This is the same as described in subclause 6.1.1.3.4.9.

6.8.1.4.1.10 Application of the weighting mask to the enhanced concatenated excitation spectrum

This is the same as described in subclause 6.1.1.3.4.10.

6.8.1.4.1.11 Inverse frequency transform and overwriting of the current excitation

This is the same as described in subclause 6.1.1.3.4.11.

6.8.1.4.2 Unvoiced and inactive post processing

When the classifier described in subclause 5.3.1 of [6] considers the synthesis as unvoiced or inactive and containing background noise, the unvoiced and inactive post processing module is used to determine a cut-off frequency above which the time-domain contributions are no longer used. The content above this cut-off frequency is then replaced with random noise, giving a smoother rendering of the synthesis. This post processing module is used when the local attack flag (laf, as defined in subclause 5.3.1 of [6]) is set to 0, the coding type is INACTIVE, and the bitrate is less than or equal to 12650 bps. It is also used at 6600 bps if the synthesis is classified as UNVOICED or VOICED_TRANSITION.
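One literal reading of the activation conditions in the paragraph above is sketched below. The grouping of the conditions, the function name, and the string labels are assumptions made for illustration only.

```python
def use_unvoiced_inactive_pp(laf, coder_type, synth_class, bitrate_bps):
    """Return True when the module would be used under this reading.

    laf         : local attack flag (0 means no local attack)
    coder_type  : coding type label, e.g. "INACTIVE" (hypothetical labels)
    synth_class : classification of the synthesis (hypothetical labels)
    bitrate_bps : bitrate in bits per second
    """
    # First stated case: no local attack, INACTIVE coding type, <= 12650 bps.
    if laf == 0 and coder_type == "INACTIVE" and bitrate_bps <= 12650:
        return True
    # Second stated case: 6600 bps with an UNVOICED or VOICED_TRANSITION synthesis.
    return bitrate_bps == 6600 and synth_class in ("UNVOICED", "VOICED_TRANSITION")
```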

When the synthesis is considered INACTIVE and the energy of the synthesis, as defined in subclause 5.3.1 of [6], is greater than -3 dB, the LP filter coefficients used for the synthesis filtering described below in subclause 6.8.1.5 are smoothed between the past and current frames as follows:

()

where the previous-frame term represents the LP filter of the previous frame, which is updated at the end of the post processing.

6.8.1.4.2.1 Frequency transform

During the frequency-domain modification phase, the excitation needs to be represented in the transform domain. The time-to-frequency conversion is achieved with a type II DCT giving a resolution of 25 Hz. The frequency representation of the time-domain CELP excitation is given below:

()

where the input is the time-domain excitation and L is the frame length, equal to 256 samples for the corresponding inner sampling frequency of 12.8 kHz.
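A minimal sketch of a type II DCT on one frame is given below. The orthonormal scaling is an assumption, since the scaling used by the codec is not recoverable from the text above; the names are illustrative. With L = 256 and a 12.8 kHz inner sampling frequency, each bin spans 6400 / 256 = 25 Hz, which matches the stated resolution.

```python
import math

FRAME_LEN = 256  # frame length L stated in the text
FS_HZ = 12800    # inner sampling frequency stated in the text

# Width of one DCT bin: the 0..FS/2 range is split into FRAME_LEN bins.
BIN_WIDTH_HZ = (FS_HZ / 2) / FRAME_LEN


def dct_ii(x):
    """Type II DCT with orthonormal scaling (the scaling is an assumption)."""
    n_len = len(x)
    out = []
    for k in range(n_len):
        acc = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / n_len)
                  for n in range(n_len))
        scale = math.sqrt(1.0 / n_len) if k == 0 else math.sqrt(2.0 / n_len)
        out.append(scale * acc)
    return out
```

A direct O(L²) sum is used here for readability; a real decoder would use a fast transform.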

6.8.1.4.2.2 Energy per band analysis

Before any modification to the excitation, the energy per band is computed and kept in memory for energy adjustment after the excitation spectrum reshaping. The energy can be computed as follows:

()

where the cumulative frequency bins per band and the number of bins per band are defined as:

and

The low-frequency bands correspond to the critical audio bands, but the frequency bands above 3700 Hz are slightly narrower to better match the possible spectral energy variation in those bands.
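The band analysis can be sketched as follows, assuming the band energy is the sum of the squared bins in the band. The actual band tables are not reproduced here, so `bins_per_band` is a hypothetical input.

```python
def energy_per_band(spectrum, bins_per_band):
    """Sum of squared bins in each band (assumed energy definition).

    spectrum      : DCT coefficients of the excitation
    bins_per_band : number of bins in each band (hypothetical table)
    """
    energies = []
    start = 0  # cumulative bin offset of the current band
    for width in bins_per_band:
        band = spectrum[start:start + width]
        energies.append(sum(b * b for b in band))
        start += width
    return energies
```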

6.8.1.4.2.3 Excitation modification

6.8.1.4.2.3.1 Cut off frequency of the temporal contribution

To achieve transparent switching between the non-modified excitation and the modified excitation for unvoiced and inactive signals, it is preferable to keep at least the lower frequencies of the temporal contribution. The frequency at which the temporal contribution stops being used, the cut-off frequency, has a minimum value of 1.2 kHz. This means that the first 1.2 kHz of the decoded excitation is always kept and, depending on the pitch value, the cut-off frequency can be higher. The 8th harmonic is computed from the lowest pitch of all subframes, and the temporal contribution is kept up to this 8th harmonic. The estimate is performed as follows:

()

where T is the decoded subframe pitch.

All bands are then checked to find the band in which the 8th harmonic is located, by searching for the highest frequency band for which the following inequality still holds:

()

where the frequency band is defined as:

The index of that band indicates the band where the 8th harmonic is likely located. The final cut-off frequency is computed as the higher of 1.2 kHz and the last frequency of the band in which the 8th harmonic is located, using the following relation:

()
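The cut-off selection can be sketched as below. Several points are assumptions: "lowest pitch" is read as the smallest decoded pitch lag, the 8th harmonic is taken as 8 times the sampling frequency divided by that lag, the band test is "harmonic not below the band's lower edge", and the band edges passed in are hypothetical because the band table is not reproduced here.

```python
FS_HZ = 12800.0         # inner sampling frequency
MIN_CUTOFF_HZ = 1200.0  # minimum cut-off frequency stated in the text


def eighth_harmonic_hz(subframe_pitch_lags):
    """Assumed estimate: 8th harmonic of the pitch with the smallest lag."""
    return 8.0 * FS_HZ / min(subframe_pitch_lags)


def cutoff_frequency_hz(subframe_pitch_lags, band_upper_edges_hz):
    """Higher of 1.2 kHz and the top of the band holding the 8th harmonic."""
    h8 = eighth_harmonic_hz(subframe_pitch_lags)
    cutoff = MIN_CUTOFF_HZ
    lower = 0.0
    for upper in band_upper_edges_hz:
        # Keep the highest band whose lower edge the harmonic still reaches.
        if h8 >= lower:
            cutoff = max(MIN_CUTOFF_HZ, upper)
        lower = upper
    return cutoff
```

With long pitch lags the 8th harmonic falls below 1.2 kHz and the minimum cut-off applies; with short lags the cut-off moves up to the top of the band that contains the harmonic.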

6.8.1.4.2.3.2 Normalization and noise fill

For unvoiced and inactive frames, the frequency bins below the cut-off frequency are normalized to the range [0, 4]:

()

The frequency bins above the cut-off frequency are zeroed. Then, a simple noise fill is performed to add noise over all the frequency bins at a constant level. The function describing the noise addition is defined as:

()

where the output of the random number generator is limited to the range -1 to 1 as:

()
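The normalization and noise fill can be sketched as follows. The exact normalization rule and noise level are not recoverable from the text above, so this sketch assumes that the largest magnitude below the cut-off is mapped to 4 and takes the noise level as a hypothetical parameter.

```python
import random


def normalize_and_noise_fill(spectrum, cutoff_bin, noise_level, rng):
    """Normalize bins below the cut-off, zero the rest, then add noise.

    spectrum    : DCT coefficients of the excitation
    cutoff_bin  : index of the first bin above the cut-off frequency
    noise_level : constant noise amplitude (hypothetical parameter)
    rng         : random.Random instance producing values in [-1, 1]
    """
    peak = max((abs(b) for b in spectrum[:cutoff_bin]), default=0.0)
    scale = 4.0 / peak if peak > 0.0 else 0.0  # assumed: peak magnitude maps to 4
    shaped = [b * scale if j < cutoff_bin else 0.0
              for j, b in enumerate(spectrum)]
    # Noise is added over all bins, below and above the cut-off.
    return [s + noise_level * rng.uniform(-1.0, 1.0) for s in shaped]
```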

6.8.1.4.2.3.3 Energy per band analysis of the modified excitation spectrum

The energy per band after the spectrum reshaping is calculated again with exactly the same method as described in subclause 6.8.1.4.2.2.

6.8.1.4.2.3.4 Amplification of high frequencies

An amplification factor compensates for the poor high-frequency energy matching of the LP filter at low bit rates. It is based on the voice factor and computed as follows:

()

where is given by:

()

and is defined in subclause 6.1.1.3.2.

The amplification factor is applied linearly between 6 kHz and 6.4 kHz as follows:

()

6.8.1.4.2.3.5 Energy matching

The energy matching consists of adjusting the energy per band after the excitation spectrum modification to its initial value. For each band i, the gain to apply to all bins in the band to match the energy of the original excitation is defined as:

()

For a specific band i, the denormalized spectral excitation can be written as:

()

where the cumulative frequency bins per band and the number of bins per band are defined in subclause 6.8.1.4.2.2.
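Energy matching can be sketched as below, assuming the band energy is a sum of squared bins, so that the matching gain is the square root of the energy ratio. The band widths passed in are hypothetical, as the band tables are not reproduced here.

```python
import math


def match_band_energy(modified, bins_per_band, original_energies):
    """Scale each band of the modified spectrum to its original energy.

    modified          : modified excitation spectrum
    bins_per_band     : number of bins in each band (hypothetical table)
    original_energies : per-band energies stored before the modification
    """
    out = []
    start = 0  # cumulative bin offset of the current band
    for width, e_orig in zip(bins_per_band, original_energies):
        band = modified[start:start + width]
        e_mod = sum(b * b for b in band)
        # Assumed gain: square root of the energy ratio; 0 for an empty band.
        gain = math.sqrt(e_orig / e_mod) if e_mod > 0.0 else 0.0
        out.extend(gain * b for b in band)
        start += width
    return out
```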

6.8.1.4.2.4 Inverse frequency transform

After the frequency-domain processing is completed, an inverse frequency-to-time transform is performed to find the temporal excitation. The frequency-to-time conversion is achieved with the same type II DCT as used for the time-to-frequency conversion. The modified time-domain excitation is obtained as below:

()

where the input is the frequency representation of the modified excitation and L is the frame length, equal to 256 samples.
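As a sketch of the round trip, the code below assumes an orthonormal type II DCT, for which the inverse uses the same cosine basis and scaling with the roles of the time and frequency indices swapped. The scaling and the function names are assumptions, not the codec's definition.

```python
import math


def _scale(k, n_len):
    """Orthonormal DCT scaling for coefficient k (assumed scaling)."""
    return math.sqrt(1.0 / n_len) if k == 0 else math.sqrt(2.0 / n_len)


def dct_ii(x):
    """Forward orthonormal type II DCT."""
    n_len = len(x)
    return [_scale(k, n_len) * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / n_len)
                                   for n in range(n_len))
            for k in range(n_len)]


def inverse_dct_ii(coeffs):
    """Inverse of the orthonormal type II DCT (same basis, summed over k)."""
    n_len = len(coeffs)
    return [sum(_scale(k, n_len) * coeffs[k]
                * math.cos(math.pi * (n + 0.5) * k / n_len)
                for k in range(n_len))
            for n in range(n_len)]
```

Applying `inverse_dct_ii` to an unmodified `dct_ii` output returns the original frame up to rounding error, so any difference in the decoded excitation comes from the spectral modifications alone.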

6.8.1.5 Synthesis filtering and overwriting the current CELP synthesis

Once the excitation modification is done, the modified excitation is passed through the synthesis filter, as described in subclause 6.1.3, to obtain a modified synthesis for the current frame. This modified synthesis is then used to overwrite the decoded synthesis.

6.8.1.6 Formant post-filter

The decoded synthesis is post-filtered as described in subclause 6.1.4.1.

6.8.1.7 Comfort noise addition

For frames exhibiting a high background noise level (background noise level >= 15), comfort noise is added for bitrates of 8.85 kbps and below. The comfort noise addition is described in subclause 6.9.1.

6.8.1.8 Bass post-filter

This is the same as described in subclause 6.1.4.2.