6.8.1 Decoding and speech synthesis
6.8.1.1 Excitation decoding
The decoding process is performed in the following order:
Decoding of LP filter parameters: The received indices of ISP quantization are used to reconstruct the quantized ISP vector. The interpolation described in subclause 5.7.2.6 is performed to obtain 4 interpolated ISP vectors (corresponding to 4 subframes). For each subframe, the interpolated ISP vector is converted to the LP filter coefficient domain, which is used for synthesizing the reconstructed speech in the subframe.
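As an illustration only, the per-subframe interpolation can be pictured as a weighted combination of the ISP vectors of the previous and current frames; the weights in the sketch below are hypothetical placeholders, not the interpolation weights of subclause 5.7.2.6, and the ISP-to-LP conversion is not shown.

#define M_ORDER 16   /* LP order assumed for this sketch */
#define N_SUBFR 4    /* subframes per frame */

/* Illustrative per-subframe ISP interpolation (weights are hypothetical). */
void interpolate_isps(const float isp_old[M_ORDER], const float isp_new[M_ORDER],
                      float isp_subfr[N_SUBFR][M_ORDER])
{
    const float w_new[N_SUBFR] = { 0.25f, 0.50f, 0.75f, 1.00f };  /* hypothetical weights */

    for (int s = 0; s < N_SUBFR; s++)
        for (int i = 0; i < M_ORDER; i++)
            isp_subfr[s][i] = (1.0f - w_new[s]) * isp_old[i] + w_new[s] * isp_new[i];

    /* each isp_subfr[s] is then converted to LP coefficients for the synthesis
       of subframe s (conversion not shown) */
}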
The following steps are repeated for each subframe:
1. Decoding of the adaptive codebook vector: The received pitch index (adaptive codebook index) is used to find the integer and fractional parts of the pitch lag. The adaptive codebook vector v'(n) is found by interpolating the past excitation (at the pitch delay) using the FIR filter described in subclause 5.7. The received adaptive filter index is used to find out whether the filtered adaptive codebook is v(n) = v'(n) or v(n) = 0.18 v'(n-1) + 0.64 v'(n) + 0.18 v'(n+1).
2. Decoding of the innovative vector: The received algebraic codebook index is used to extract the positions and amplitudes (signs) of the excitation pulses and to find the algebraic codevector c(n). If the integer part of the pitch lag is less than the subframe size 64, the pitch sharpening procedure is applied, which translates into modifying c(n) by filtering it through the adaptive pre-filter F(z), which consists of two parts: a periodicity enhancement part 1/(1 - 0.85 z^{-T}) and a tilt part (1 - \beta_1 z^{-1}), where T is the integer part of the pitch lag and \beta_1 is related to the voicing of the previous subframe and is bounded by [0.0, 0.5].
3. Decoding of the adaptive and innovative codebook gains: The received index gives the fixed codebook gain correction factor \hat{\gamma}. The estimated fixed codebook gain g'_c is found as described in subclause 5.8. First, the predicted energy for every subframe is found by
\tilde{E}(n) = \sum_{i=1}^{4} b_i \hat{R}(n-i)	(1994)
and then the mean innovation energy is found by
E_c = 10 \log \left( \frac{1}{64} \sum_{n=0}^{63} c^2(n) \right)
The predicted gain is found by
g'_c = 10^{0.05 \left( \tilde{E}(n) + \bar{E} - E_c \right)}
where \bar{E} is the mean innovation energy constant used in the gain estimation of subclause 5.8. The quantized fixed codebook gain is given by
\hat{g}_c = \hat{\gamma} \, g'_c
4. Computing the reconstructed speech: The following steps are performed for n = 0, …, 63. The total excitation is constructed by:
u(n) = \hat{g}_p v(n) + \hat{g}_c c(n)
where \hat{g}_p is the decoded adaptive codebook gain. An illustrative sketch of the gain decoding of step 3 and of this excitation construction is given below.
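For illustration only, the following C sketch mirrors the gain decoding and excitation construction described above. It assumes a subframe length of 64, that the correction factor and adaptive codebook gain have already been decoded from the received indices, and that the MA prediction coefficients b[] and past quantized energies past_qua_en[] are available from the decoder state; the constant MEAN_ENER is an assumption. It is not the reference implementation.

#include <math.h>

#define L_SUBFR   64      /* subframe length at 12.8 kHz            */
#define MEAN_ENER 30.0f   /* mean innovation energy in dB (assumed) */

/* Predicted energy from the MA prediction of past quantized energies
   (b[] and past_qua_en[] are assumed to come from the gain decoder state). */
static float predicted_energy(const float b[4], const float past_qua_en[4])
{
    float e = 0.0f;
    for (int i = 0; i < 4; i++)
        e += b[i] * past_qua_en[i];
    return e;
}

/* Decode the fixed codebook gain from the correction factor and build u(n). */
void build_excitation(const float v[L_SUBFR],    /* adaptive codebook vector  */
                      const float c[L_SUBFR],    /* algebraic codevector      */
                      float gp,                  /* decoded adaptive cb gain  */
                      float gamma_hat,           /* decoded correction factor */
                      const float b[4], const float past_qua_en[4],
                      float u[L_SUBFR])          /* output: total excitation  */
{
    /* mean innovation energy of the codevector, in dB */
    float ener_c = 0.0f;
    for (int n = 0; n < L_SUBFR; n++)
        ener_c += c[n] * c[n];
    ener_c = 10.0f * log10f(ener_c / (float)L_SUBFR);

    /* predicted (estimated) fixed codebook gain */
    float g_c_pred = powf(10.0f, 0.05f * (predicted_energy(b, past_qua_en)
                                          + MEAN_ENER - ener_c));

    /* quantized fixed codebook gain and total excitation */
    float gc = gamma_hat * g_c_pred;
    for (int n = 0; n < L_SUBFR; n++)
        u[n] = gp * v[n] + gc * c[n];
}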
6.8.1.2 Excitation post-processing
Before the synthesis, a post-processing of the excitation signal is performed to form the updated excitation signal as follows.
6.8.1.2.1 Anti-sparseness processing
This is the same as described in subclause 6.1.1.3.1
6.8.1.2.2 Gain smoothing for noise enhancement
This is the same as described in subclause 6.1.1.3.2
6.8.1.2.3 Pitch enhancer
This is the same as described in subclause 6.1.1.3.3.
6.8.1.3 Synthesis filtering
Once the excitation post-processing is done, the modified excitation is passed through the synthesis filter, as described in subclause 6.1.3, to obtain the decoded synthesis for the current frame. Based on the content bandwidth in the decoded synthesis signal, an output mode is determined (e.g., NB or WB). If the output mode is determined to be NB, then the content above 4 kHz is attenuated using CLDFB synthesis (e.g., as described in clause 6.9.3) and, subsequently, high frequency synthesis (6.8.3) is not performed on the bandlimited content.
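As a minimal illustration of the synthesis filtering step (not the implementation of subclause 6.1.3), the excitation is passed through the all-pole filter 1/Â(z) formed by the LP coefficients of each subframe. The sketch below assumes coefficients in the convention a[0] = 1 and an input length of at least the LP order.

#define M_LP 16   /* LP order assumed for this sketch */

/* Direct-form all-pole synthesis filter 1/A(z):
   s(n) = u(n) - sum_{i=1..M_LP} a[i] * s(n-i)
   mem[] holds the last M_LP synthesis samples (most recent last) and is updated.
   Assumes length >= M_LP (e.g. a 64-sample subframe). */
void synthesis_filter(const float a[M_LP + 1], const float *u, float *s,
                      int length, float mem[M_LP])
{
    for (int n = 0; n < length; n++) {
        float acc = u[n];
        for (int i = 1; i <= M_LP; i++) {
            float past = (n - i >= 0) ? s[n - i] : mem[M_LP + n - i];
            acc -= a[i] * past;
        }
        s[n] = acc;
    }
    /* update filter memory with the last M_LP output samples */
    for (int i = 0; i < M_LP; i++)
        mem[i] = s[length - M_LP + i];
}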
6.8.1.4 Music and Unvoiced/inactive Post-processing
6.8.1.4.1 Music post processing
Most of the music post processing is the same as in subclause 6.1.1.3.4. The main difference relates to the fact that a first synthesis is computed and a first-stage classification is derived from this synthesis as described in subclause 5.3.1 of [6]. If the synthesis is classified as unvoiced, or the content is INACTIVE (VAD == 0), or the long-term background noise energy as defined below is greater than or equal to 15 dB, the AMR-WB IO decoder goes through the unvoiced and inactive post processing path as described in subclause 6.8.1.4.2.
The long-term background noise energy is updated in the case of an INACTIVE frame as:
()
The long-term background noise energy is updated only when the current frame is classified as INACTIVE. The pitch lag value T', over which the background noise energy is computed, is given by
()
where d_fr^[i] is the fractional pitch lag at subframe i, L_frame is the frame length and L_subfr is the subframe length. Otherwise, the music post processing is entered as described below.
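Since the update equation is not reproduced above, the following sketch only illustrates the general pattern of such a long-term update: a first-order recursive smoothing of the frame noise energy during INACTIVE frames, together with a pitch lag value assumed here to be the average of the subframe pitch lags. The forgetting factor, the dB representation and the averaging rule are assumptions, not values from this specification.

/* Illustrative long-term background noise energy update (all values assumed).
   e_bg_db    : long-term background noise energy in dB (updated in place)
   e_frame_db : energy of the current frame in dB
   inactive   : non-zero when the current frame is classified as INACTIVE */
void update_lt_noise_energy(float *e_bg_db, float e_frame_db, int inactive)
{
    const float alpha = 0.9f;   /* hypothetical forgetting factor */

    if (inactive)               /* updated only on INACTIVE frames */
        *e_bg_db = alpha * (*e_bg_db) + (1.0f - alpha) * e_frame_db;
}

/* Pitch lag value over the frame, assumed here to be the average of the
   fractional pitch lags d_fr[i] of the n_subfr subframes. */
float average_pitch_lag(const float d_fr[], int n_subfr)
{
    float sum = 0.0f;
    for (int i = 0; i < n_subfr; i++)
        sum += d_fr[i];
    return sum / (float)n_subfr;
}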
6.8.1.4.1.1 Excitation buffering and extrapolation
This is the same as described in subclause 6.1.1.3.4.1
6.8.1.4.1.2 Windowing and frequency transform
This is the same as described in subclause 6.1.1.3.4.2
6.8.1.4.1.3 Energy per band and per bin analysis
This is the same as described in subclause 6.1.1.3.4.3
6.8.1.4.1.4 Excitation type classification
This is the same as described in subclause 6.1.1.3.4.4
6.8.1.4.1.5 Inter-tone noise reduction in the excitation domain
This is the same as described in subclause 6.1.1.3.4.5
6.8.1.4.1.6 Inter-tone quantization noise estimation
This is the same as described in subclause 6.1.1.3.4.6
6.8.1.4.1.7 Increasing spectral dynamic of the excitation
This is the same as described in subclause 6.1.1.3.4.7
6.8.1.4.1.8 Per bin normalization of the spectrum energy
This is the same as described in subclause 6.1.1.3.4.8
6.8.1.4.1.9 Smoothing of the scaled energy spectrum along the frequency axis and the time axis
This is the same as described in subclause 6.1.1.3.4.9
6.8.1.4.1.10 Application of the weighting mask to the enhanced concatenated excitation spectrum
This is the same as described in subclause 6.1.1.3.4.10
6.8.1.4.1.11 Inverse frequency transform and overwriting of the current excitation
This is the same as described in subclause 6.1.1.3.4.11
6.8.1.4.2 Unvoiced and inactive post processing
When the classifier described in subclause 5.3.1 of [6] considers the synthesis as unvoiced or inactive and containing background noise, the unvoiced and inactive post processing module is used to determine a cut-off frequency where the time-domain contributions should stop. Then the content above this cut-off frequency is replaced with random noise, giving a smoother rendering of the synthesis. This post processing module is used when the local attack flag (laf, as defined in subclause 5.3.1 of [6]) is set to 0, the coding type is INACTIVE and the bitrate is lower than or equal to 12650 bps. It is also used at 6600 bps if the synthesis is classified as UNVOICED or VOICED_TRANSITION.
When the synthesis is considered as INACTIVE and the energy of the synthesis as defined in subclause 5.3.1 of [6] is greater than -3 dB, the LP filter coefficients that will be used for the synthesis filtering, as described below in subclause 6.8.1.5, are smoothed between the past and current frames as follows:
()
where the LP filter of the previous frame enters the smoothing. At the end of the post processing, the previous-frame LP filter is updated for the next frame.
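The smoothing equation itself is not reproduced above; as an illustration only, a first-order cross-fade between the previous-frame and current-frame LP coefficient vectors is sketched below, with a hypothetical smoothing factor.

#define M_LP 16   /* LP order assumed for this sketch */

/* Illustrative LP coefficient smoothing between frames (factor is assumed). */
void smooth_lp_coefficients(float a_cur[M_LP + 1], const float a_prev[M_LP + 1])
{
    const float beta = 0.9f;  /* hypothetical weight on the previous frame */

    for (int i = 0; i <= M_LP; i++)
        a_cur[i] = beta * a_prev[i] + (1.0f - beta) * a_cur[i];
}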
6.8.1.4.2.1 Frequency transform
During the frequency-domain modification phase, the excitation needs to be represented in the transform domain. The time-to-frequency conversion is achieved with a type II DCT giving a resolution of 25 Hz. The frequency representation of the time-domain CELP excitation is given below:
()
where the transform input is the time-domain excitation and L is the frame length; its value is 256 samples for a corresponding inner sampling frequency of 12.8 kHz.
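For illustration, a direct (O(L^2)) type II DCT of the excitation frame is sketched below; the orthonormal scaling used here is an assumption, since the exact normalization of the transform is not reproduced above, and in practice a fast algorithm would be used.

#include <math.h>

#define L_FRAME 256                      /* frame length at 12.8 kHz */
#define PI_F    3.14159265358979f

/* Direct type II DCT of the time-domain excitation (orthonormal scaling assumed).
   Resolution: 12800 Hz / (2 * 256) = 25 Hz per bin. */
void dct2_forward(const float e_td[L_FRAME], float f_e[L_FRAME])
{
    for (int k = 0; k < L_FRAME; k++) {
        float acc = 0.0f;
        for (int n = 0; n < L_FRAME; n++)
            acc += e_td[n] * cosf(PI_F * (n + 0.5f) * k / (float)L_FRAME);

        /* assumed orthonormal scaling: 1/sqrt(L) for k = 0, sqrt(2/L) otherwise */
        f_e[k] = acc * ((k == 0) ? (1.0f / sqrtf((float)L_FRAME))
                                 : sqrtf(2.0f / (float)L_FRAME));
    }
}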
6.8.1.4.2.2 Energy per band analysis
Before any modification to the excitation, the energy per band is computed and kept in memory for energy adjustment after the excitation spectrum reshaping. The energy can be computed as follows:
()
where the band boundaries are given by the cumulative number of frequency bins per band and the number of bins per band. The low-frequency bands correspond to the critical audio bands, but the frequency bands above 3700 Hz are a little shorter to better match the possible spectral energy variation in those bands.
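As an illustration only, the per-band energy can be accumulated as below; the band boundary tables cb_band[] and n_bins[] and the number of bands are hypothetical placeholders for the band definitions referred to above.

#define L_FRAME 256
#define N_BANDS 20   /* number of bands assumed for this sketch */

/* Energy per band of the excitation spectrum f_e[].
   cb_band[i] : index of the first bin of band i (cumulative bins, hypothetical table)
   n_bins[i]  : number of bins in band i (hypothetical table) */
void energy_per_band(const float f_e[L_FRAME],
                     const int cb_band[N_BANDS], const int n_bins[N_BANDS],
                     float e_band[N_BANDS])
{
    for (int i = 0; i < N_BANDS; i++) {
        float e = 0.0f;
        for (int k = cb_band[i]; k < cb_band[i] + n_bins[i]; k++)
            e += f_e[k] * f_e[k];
        e_band[i] = e;
    }
}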
6.8.1.4.2.3 Excitation modification
6.8.1.4.2.3.1 Cut off frequency of the temporal contribution
To achieve transparent switching between the non-modified excitation and the modified excitation for unvoiced and inactive signals, it is preferable to keep at least the lower frequencies of the temporal contribution. The frequency where the temporal contribution stops being used, the cut-off frequency, has a minimum value of 1.2 kHz. This means that the first 1.2 kHz of the decoded excitation is always kept and, depending on the pitch value, this cut-off frequency can be higher. The 8th harmonic is computed from the lowest pitch of all subframes and the temporal contribution is kept up to this 8th harmonic. The estimate is performed as follows:
()
where T is the decoded subframe pitch.
For all bands, a verification is made to find the band in which the 8th harmonic is located by searching for the highest frequency band for which the following inequality is still verified:
()
where the frequency band is defined as:
The index of that band indicates the band where the 8th harmonic is likely located. The final cut-off frequency is computed as the higher frequency between 1.2 kHz and the last frequency of the frequency band in which the 8th harmonic is located, using the following relation:
()
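Purely as an illustration of the logic described above (the constants, the band table and the comparison rule below are assumptions, not values from this subclause), the cut-off frequency selection could look as follows, with FS_CELP the 12.8 kHz core sampling rate and band_top[] a hypothetical table of band upper edges in Hz.

#define N_BANDS 20          /* number of bands assumed for this sketch */
#define FS_CELP 12800.0f    /* core sampling rate in Hz                */

/* Select the cut-off frequency of the temporal contribution:
   keep at least 1.2 kHz, and extend up to the band containing the 8th
   harmonic of the lowest subframe pitch.
   pitch_min : lowest decoded pitch lag (in samples) over all subframes
   band_top  : hypothetical table of band upper frequencies in Hz */
float cutoff_frequency(float pitch_min, const float band_top[N_BANDS])
{
    /* frequency of the 8th harmonic of the lowest pitch */
    float f_8th = 8.0f * FS_CELP / pitch_min;

    /* highest band whose upper edge still lies below or at the 8th harmonic
       (comparison rule assumed) */
    int i_8th = 0;
    for (int i = 0; i < N_BANDS; i++)
        if (band_top[i] <= f_8th)
            i_8th = i;

    /* cut-off is the larger of 1.2 kHz and the top of that band */
    float f_cut = band_top[i_8th];
    return (f_cut > 1200.0f) ? f_cut : 1200.0f;
}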
6.8.1.4.2.3.2 Normalization and noise fill
For unvoiced and inactive frames, the frequency bins below the cut-off frequency are normalized between [0, 4]:
()
The frequency bins above the cut-off frequency are zeroed. Then, a simple noise fill is performed to add noise over all the frequency bins at a constant level. The function describing the noise addition is defined as:
()
where the random number generator used is limited between -1 and 1 as:
()
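The normalization and noise-fill functions themselves are not reproduced above; the sketch below only illustrates the general operations described in the text: scaling the kept bins so that their magnitudes lie in [0, 4], zeroing the bins above the cut-off, and adding a constant-level contribution from a generator bounded to [-1, 1]. The normalization rule, the noise level and the random generator are assumptions.

#include <stdlib.h>
#include <math.h>

#define L_FRAME 256

/* bounded random value in [-1, 1] (illustrative generator, not the one of the spec) */
static float rand_pm1(void)
{
    return 2.0f * ((float)rand() / (float)RAND_MAX) - 1.0f;
}

/* Reshape the excitation spectrum for unvoiced/inactive frames (sketch).
   k_cut       : first bin above the cut-off frequency
   noise_level : constant noise-fill level (hypothetical) */
void reshape_spectrum(float f_e[L_FRAME], int k_cut, float noise_level)
{
    /* peak magnitude of the kept bins, used for an assumed [0, 4] normalization */
    float peak = 1e-6f;
    for (int k = 0; k < k_cut; k++)
        if (fabsf(f_e[k]) > peak)
            peak = fabsf(f_e[k]);

    for (int k = 0; k < k_cut; k++)
        f_e[k] = 4.0f * f_e[k] / peak;        /* kept bins: magnitudes in [0, 4]    */

    for (int k = k_cut; k < L_FRAME; k++)
        f_e[k] = 0.0f;                        /* bins above the cut-off are zeroed  */

    for (int k = 0; k < L_FRAME; k++)
        f_e[k] += noise_level * rand_pm1();   /* constant-level noise over all bins */
}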
6.8.1.4.2.3.3 Energy per band analysis of the modified excitation spectrum
The energy per band after the spectrum reshaping is calculated again with exactly the same method as described in subclause 6.8.1.4.2.2.
6.8.1.4.2.3.4 Amplification of high frequencies
An amplification factor compensates for the poor energy matching of the LP filter in the high frequencies at low bit rates. It is based on the voice factor and is computed as follows:
()
where the voice factor is given by:
()
and the underlying voicing measure is defined in subclause 6.1.1.3.2.
The amplification factor is applied linearly between 6 kHz and 6.4 kHz as follows:
()
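The gain formula itself is not reproduced above; the following sketch only illustrates the linear application of an already-computed amplification factor between 6 kHz and 6.4 kHz, assuming 25 Hz bins (bins 240 to 255 of the 256-bin spectrum) and a ramp from unity at 6 kHz, which is an assumed interpolation law.

#define L_FRAME  256
#define BIN_6KHZ 240   /* 6.0 kHz / 25 Hz per bin */

/* Apply the high-frequency amplification factor linearly between 6 kHz and 6.4 kHz.
   The gain ramps from 1.0 at 6 kHz up to g_amp at 6.4 kHz (interpolation law assumed). */
void amplify_high_frequencies(float f_e[L_FRAME], float g_amp)
{
    for (int k = BIN_6KHZ; k < L_FRAME; k++) {
        float w = (float)(k - BIN_6KHZ) / (float)(L_FRAME - BIN_6KHZ);
        f_e[k] *= 1.0f + w * (g_amp - 1.0f);   /* linear ramp of the gain */
    }
}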
6.8.1.4.2.3.5 Energy matching
The energy matching consists of adjusting the energy per band after the excitation spectrum modification to its initial value. For each band i, the gain applied to all bins of the band to match the energy of the original excitation is defined as:
()
For a specific band i, the denormalized spectral excitation can be written as:
()
where the cumulative number of frequency bins per band and the number of bins per band are defined in subclause 6.8.1.4.2.2.
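As an illustration of the matching step (the exact gain law is not reproduced above; a square-root energy-ratio gain is assumed here), each band can be rescaled as below, reusing the hypothetical band tables of the earlier energy-per-band sketch.

#include <math.h>

#define L_FRAME 256
#define N_BANDS 20   /* number of bands assumed for this sketch */

/* Rescale each band of the modified spectrum so that its energy matches the
   energy measured before the modification (sqrt energy-ratio gain assumed).
   e_orig[i]           : energy of band i before modification
   e_mod[i]            : energy of band i after modification
   cb_band[], n_bins[] : hypothetical band boundary tables */
void match_band_energies(float f_e[L_FRAME],
                         const float e_orig[N_BANDS], const float e_mod[N_BANDS],
                         const int cb_band[N_BANDS], const int n_bins[N_BANDS])
{
    for (int i = 0; i < N_BANDS; i++) {
        float gain = sqrtf(e_orig[i] / (e_mod[i] + 1e-12f));
        for (int k = cb_band[i]; k < cb_band[i] + n_bins[i]; k++)
            f_e[k] *= gain;
    }
}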
6.8.1.4.2.4 Inverse frequency transform
After the frequency-domain modification is completed, an inverse frequency-to-time transform is performed in order to recover the temporal excitation. The frequency-to-time conversion is achieved with the same type II DCT as used for the time-to-frequency conversion. The modified time-domain excitation is obtained as below:
()
where the transform input is the frequency representation of the modified excitation and L is the frame length, equal to 256 samples.
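For symmetry with the forward transform sketch above, an inverse of the assumed orthonormal type II DCT (i.e. a direct type III DCT) is shown below; the normalization remains an assumption.

#include <math.h>

#define L_FRAME 256
#define PI_F    3.14159265358979f

/* Inverse of the orthonormal type II DCT assumed in the forward sketch. */
void dct2_inverse(const float f_e[L_FRAME], float e_td[L_FRAME])
{
    for (int n = 0; n < L_FRAME; n++) {
        float acc = f_e[0] / sqrtf((float)L_FRAME);   /* k = 0 term */
        for (int k = 1; k < L_FRAME; k++)
            acc += sqrtf(2.0f / (float)L_FRAME) * f_e[k]
                   * cosf(PI_F * (n + 0.5f) * k / (float)L_FRAME);
        e_td[n] = acc;                                /* modified time-domain excitation */
    }
}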
6.8.1.5 Synthesis filtering and overwriting the current CELP synthesis
Once the excitation modification is done, the modified excitation is passed through the synthesis filter, as described in subclause 6.1.3, to obtain a modified synthesis for the current frame. This modified synthesis is then used to overwrite the decoded synthesis.
6.8.1.6 Formant post-filter
The decoded synthesis is post-filtered as described in subclause 6.1.4.1.
6.8.1.7 Comfort noise addition
For frames exhibiting a high background noise level (long-term background noise level greater than or equal to 15 dB), comfort noise is added for bitrates of 8.85 kbps and below. The comfort noise addition is described in subclause 6.9.1.
6.8.1.8 Bass post-filter
This is the same as described in subclause 6.1.4.2