6.1 Decoding and speech synthesis

3GPP TS 26.190: Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Transcoding functions


The decoding process is performed in the following order:

Decoding of LP filter parameters: The received indices of ISP quantization are used to reconstruct the quantized ISP vector. The interpolation described in Section 5.2.6 is performed to obtain 4 interpolated ISP vectors (corresponding to 4 subframes). For each subframe, the interpolated ISP vector is converted to LP filter coefficient domain a_k, which is used for synthesizing the reconstructed speech in the subframe.

The following steps are repeated for each subframe:

1. Decoding of the adaptive codebook vector: The received pitch index (adaptive codebook index) is used to find the integer and fractional parts of the pitch lag. The adaptive codebook vector v(n) is found by interpolating the past excitation u(n) (at the pitch delay) using the FIR filter described in Section 5.7. The received adaptive filter index is used to find out whether the filtered adaptive codebook is v₁(n) = v(n) or v₂(n) = 0.18v(n+1) + 0.64v(n) + 0.18v(n−1).

2. Decoding of the innovative vector: The received algebraic codebook index is used to extract the positions and amplitudes (signs) of the excitation pulses and to find the algebraic codevector c(n). If the integer part of the pitch lag is less than the subframe size 64, the pitch sharpening procedure is applied, which modifies c(n) by filtering it through the adaptive prefilter F(z). F(z) consists of two parts: a periodicity enhancement part 1/(1 − 0.85z^(−T)) and a tilt part (1 − β₁z^(−1)), where T is the integer part of the pitch lag and β₁ is related to the voicing of the previous subframe and is bounded by [0.0, 0.5].
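The pitch sharpening step can be illustrated with a short Python sketch. It assumes the prefilter has the form F(z) = (1 − β₁z^(−1))/(1 − 0.85z^(−T)), that the filter starts from zero state at the beginning of each subframe, and that T is at least 1. The function name `pitch_sharpen` and its arguments are hypothetical and are not taken from the specification or its reference code.

```python
SUBFRAME = 64  # subframe size in samples


def pitch_sharpen(c, T, beta1, alpha=0.85):
    """Filter codevector c through (1 - beta1*z^-1) / (1 - alpha*z^-T).

    Assumption: zero filter state at the start of the subframe.
    The filter is only applied when T is shorter than the subframe.
    """
    if len(c) != SUBFRAME:
        raise ValueError("codevector must hold one subframe of 64 samples")
    if T < 1:
        raise ValueError("pitch lag must be a positive integer")
    if not 0.0 <= beta1 <= 0.5:
        raise ValueError("beta1 is bounded by [0.0, 0.5]")
    if T >= SUBFRAME:
        return list(c)  # no sharpening for long lags

    # Tilt part (FIR): y(n) = c(n) - beta1 * c(n-1)
    out = [c[n] - (beta1 * c[n - 1] if n > 0 else 0.0)
           for n in range(SUBFRAME)]

    # Periodicity enhancement (IIR): y(n) += alpha * y(n-T)
    for n in range(T, SUBFRAME):
        out[n] += alpha * out[n - T]
    return out
```

With a single unit pulse at n = 0 and T = 20, the output repeats the pulse every 20 samples with amplitudes 1, 0.85, 0.85², and so on, which is the periodicity enhancement the text describes.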

3. Decoding of the adaptive and innovative codebook gains: The received index gives the fixed codebook gain correction factor $\hat{\gamma}$. The estimated fixed codebook gain g’_c is found as described in Section 5.8. First, the predicted energy for every subframe n is found by

$\tilde{E}(n) = \sum_{i=1}^{4} b_i \hat{R}(n-i)$    (59)

and then the mean innovation energy is found by

$E_i = 10 \log_{10}\left( \frac{1}{N} \sum_{i=0}^{N-1} c^2(i) \right)$    (60)

The predicted gain is found by

$g'_c = 10^{0.05\,(\tilde{E}(n) + \bar{E} - E_i)}$    (61)

The quantized fixed codebook gain is given by

$\hat{g}_c = \hat{\gamma}\, g'_c$    (62)
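As a rough illustration of the gain decoding chain, the sketch below assumes the usual moving-average prediction scheme: a 4-tap predictor over past quantized prediction errors, a fixed mean energy offset, and a final scaling by the received correction factor. The coefficient values (0.5, 0.4, 0.3, 0.2) and the 30 dB mean energy are assumptions based on the gain quantization described in Section 5.8 and are not stated in this clause. The names `decode_fixed_gain`, `B` and `E_MEAN_DB` are hypothetical.

```python
import math

# Assumed MA prediction coefficients b_1..b_4 (see Section 5.8).
B = (0.5, 0.4, 0.3, 0.2)
# Assumed mean innovation energy offset in dB (see Section 5.8).
E_MEAN_DB = 30.0


def decode_fixed_gain(c, gamma_hat, past_errors):
    """Return the quantized fixed codebook gain for one subframe.

    c           -- algebraic codevector (N samples)
    gamma_hat   -- decoded gain correction factor
    past_errors -- quantized prediction errors of the 4 previous
                   subframes in dB, most recent first
    """
    if len(past_errors) != len(B):
        raise ValueError("need exactly 4 past prediction errors")
    energy = sum(x * x for x in c) / len(c)
    if energy <= 0.0:
        raise ValueError("codevector has no energy")

    e_pred = sum(b * r for b, r in zip(B, past_errors))  # predicted energy
    e_innov = 10.0 * math.log10(energy)                  # innovation energy
    g_pred = 10.0 ** (0.05 * (e_pred + E_MEAN_DB - e_innov))
    return gamma_hat * g_pred
```

For a codevector of unit mean energy and an all-zero prediction history, the predicted gain reduces to 10^(0.05 · 30), so the decoded gain is simply the correction factor times that constant.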

4. Computing the reconstructed speech: The following steps are for n = 0, …, 63. The total excitation is constructed by:

$u(n) = \hat{g}_p v(n) + \hat{g}_c c(n)$    (63)

Before the speech synthesis, a post-processing of excitation elements is performed.

5. Anti-sparseness processing (6.60 and 8.85 kbit/s modes): An adaptive anti-sparseness post-processing procedure is applied to the fixed codebook vector c(n) in order to reduce perceptual artifacts arising from the sparseness of the algebraic fixed codebook vectors with only a few non-zero samples per subframe. The anti-sparseness processing consists of circular convolution of the fixed codebook vector with an impulse response. Three pre-stored impulse responses are used and a number impNr=0,1,2 is set to select one of them. A value of 2 corresponds to no modification, a value of 1 corresponds to medium modification, while a value of 0 corresponds to strong modification. The selection of the impulse response is performed adaptively from the adaptive and fixed codebook gains. The following procedure is employed:

Detect onset by comparing the fixed codebook gain to the previous fixed codebook gain. If the current value is more than three times the previous value an onset is detected.

If not onset and impNr=0, the median filtered value of the current and the previous 4 adaptive codebook gains is computed. If this value is less than 0.6, impNr=0.

If not onset, the impNr-value is restricted to increase by one step from the previous subframe.

If an onset is declared, the impNr-value is increased by one if it is less than 2.

In case of 8.85 kbit/s mode, the impNr-value is increased by one.
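Two building blocks of the procedure above can be sketched in Python: the onset test and the circular convolution itself. The impulse responses are pre-stored tables in the codec and are not reproduced here; the sketch takes the response as an argument, and the function names `is_onset` and `circular_convolve` are hypothetical. A unit impulse leaves the codevector unchanged, which corresponds to the "no modification" case.

```python
def is_onset(g_c, g_c_prev):
    """Onset test: current fixed codebook gain exceeds 3x the previous."""
    return g_c > 3.0 * g_c_prev


def circular_convolve(c, h):
    """Circularly convolve codevector c with impulse response h.

    Both sequences must have the same length L. Output sample k is
    sum over i of c[i] * h[(k - i) mod L].
    """
    length = len(c)
    if len(h) != length:
        raise ValueError("c and h must have the same length")
    out = [0.0] * length
    for i, ci in enumerate(c):
        if ci == 0.0:
            continue  # sparse codevector: skip zero samples
        for j, hj in enumerate(h):
            out[(i + j) % length] += ci * hj
    return out
```

Because the convolution is circular, energy that would spill past the end of the subframe wraps around to its beginning, so the processed vector keeps the subframe length.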

6. Noise enhancer: A nonlinear gain smoothing technique is applied to the fixed codebook gain in order to enhance excitation in noise. Based on the stability and voicing of the speech segment, the gain of the fixed codebook is smoothed in order to reduce fluctuation in the energy of the excitation in case of stationary signals. This improves the performance in case of stationary background noise.

The voicing factor is given by λ = 0.5(1 − r_v) with r_v = (E_v − E_c)/(E_v + E_c), where E_v and E_c are the energies of the scaled pitch codevector and scaled innovation codevector, respectively. Since the value of r_v is between −1 and 1, the value of λ is between 0 and 1. The factor λ is related to the amount of unvoicing, with a value of 0 for purely voiced segments and a value of 1 for purely unvoiced segments.

A stability factor θ is computed based on a distance measure between the adjacent LP filters. Here, the factor θ is related to the ISP distance measure and is bounded by 0 ≤ θ ≤ 1, with larger values of θ corresponding to more stable signals.

Finally, a gain smoothing factor S_m is given by

$S_m = \lambda\theta$    (64)

The value of S_m approaches 1 for unvoiced and stable signals, which is the case of stationary background noise signals. For purely voiced signals or for unstable signals, the value of S_m approaches 0.

An initial modified gain g₀ is computed by comparing the fixed codebook gain ĝ_c to a threshold given by the initial modified gain from the previous subframe, g_-1. If ĝ_c is larger than or equal to g_-1, then g₀ is computed by decrementing ĝ_c by 1.5 dB, bounded by g₀ ≥ g_-1. If ĝ_c is smaller than g_-1, then g₀ is computed by incrementing ĝ_c by 1.5 dB, bounded by g₀ ≤ g_-1.

Finally, the gain is updated with the value of the smoothed gain as follows

$\hat{g}_c = S_m g_0 + (1 - S_m)\,\hat{g}_c$    (65)
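The noise enhancer can be sketched as follows. The sketch assumes the smoothing factor is the product of the voicing and stability factors, and that the final gain is a blend of the initial modified gain and the decoded gain weighted by S_m. Both forms are assumptions consistent with the limiting behaviour described above. The 1.5 dB step is applied as an amplitude ratio of 10^(1.5/20). The function name `smooth_fixed_gain` is hypothetical.

```python
STEP_1_5_DB = 10.0 ** (1.5 / 20.0)  # 1.5 dB as an amplitude ratio (~1.19)


def smooth_fixed_gain(g_c, g_prev, e_v, e_c, theta):
    """Return (smoothed gain, g0) for one subframe.

    g_c    -- decoded fixed codebook gain
    g_prev -- initial modified gain g0 of the previous subframe
    e_v    -- energy of the scaled pitch codevector
    e_c    -- energy of the scaled innovation codevector
    theta  -- stability factor, 0 <= theta <= 1
    """
    if e_v + e_c <= 0.0:
        raise ValueError("total excitation energy must be positive")
    r_v = (e_v - e_c) / (e_v + e_c)
    lam = 0.5 * (1.0 - r_v)   # 0 = purely voiced, 1 = purely unvoiced
    s_m = lam * theta         # assumed form of the smoothing factor

    if g_c >= g_prev:
        g0 = max(g_c / STEP_1_5_DB, g_prev)  # step down, not below g_prev
    else:
        g0 = min(g_c * STEP_1_5_DB, g_prev)  # step up, not above g_prev

    # Assumed update: blend g0 and g_c according to s_m.
    return s_m * g0 + (1.0 - s_m) * g_c, g0
```

The returned g0 would be stored by the caller and passed back as `g_prev` for the next subframe. For a purely voiced subframe (e_c = 0) the smoothing factor is 0 and the decoded gain passes through unchanged.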

7. Pitch enhancer: A pitch enhancer procedure modifies the total excitation by filtering the fixed codebook excitation through an innovation filter whose frequency response emphasizes the higher frequencies more than the lower frequencies, and whose coefficients are related to the periodicity in the signal. A filter of the form

$F_{inno}(z) = -c_{pe}\,z + 1 - c_{pe}\,z^{-1}$    (66)

is used, where c_pe = 0.125(1 + r_v), with r_v = (E_v − E_c)/(E_v + E_c) as described above. The filtered fixed codevector is given by

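$c'(n) = c(n) - c_{pe}\,\bigl(c(n+1) + c(n-1)\bigr)$    (67)

and the updated excitation is given by

$u(n) = \hat{g}_p v(n) + \hat{g}_c c'(n)$    (68)

The above procedure can be done in one step by updating the excitation as follows

$u(n) = u(n) - \hat{g}_c c_{pe}\,\bigl(c(n+1) + c(n-1)\bigr)$    (69)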
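The one-step form of the pitch enhancer lends itself to a compact sketch. It assumes the innovation filter is the symmetric three-tap filter with coefficients −c_pe, 1, −c_pe, and that codevector samples outside the subframe are treated as zero. Both points are assumptions for illustration, and the function name `pitch_enhance` is hypothetical.

```python
def pitch_enhance(u, c, g_c, e_v, e_c):
    """Apply the pitch enhancer to the total excitation in one step.

    u   -- total excitation for the subframe
    c   -- fixed codebook vector for the subframe
    g_c -- fixed codebook gain
    e_v -- energy of the scaled pitch codevector
    e_c -- energy of the scaled innovation codevector
    """
    if len(u) != len(c):
        raise ValueError("u and c must have the same length")
    if e_v + e_c <= 0.0:
        raise ValueError("total excitation energy must be positive")
    r_v = (e_v - e_c) / (e_v + e_c)
    c_pe = 0.125 * (1.0 + r_v)  # 0 for unvoiced, 0.25 for voiced

    n_samples = len(c)
    out = []
    for n in range(n_samples):
        # Assumption: samples outside the subframe count as zero.
        before = c[n - 1] if n > 0 else 0.0
        after = c[n + 1] if n < n_samples - 1 else 0.0
        out.append(u[n] - g_c * c_pe * (before + after))
    return out
```

For a purely unvoiced subframe (e_v = 0) the coefficient c_pe is 0 and the excitation is returned unchanged; the strongest high-frequency emphasis occurs for purely voiced subframes.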

8. Post-processing of excitation elements (6.60 and 8.85 kbit/s modes): A post‑processing of excitation elements procedure is applied to the total excitation by emphasizing the contribution of the adaptive codebook vector:

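$\hat{u}(n) = \begin{cases} u(n) + 0.25\,\beta\,\hat{g}_p v(n), & \hat{g}_p > 0.5 \\ u(n), & \hat{g}_p \le 0.5 \end{cases}$    (70)

Adaptive gain control (AGC) is used to compensate for the gain difference between the non‑emphasized excitation u(n) and the emphasized excitation û(n). The gain scaling factor η for the emphasized excitation is computed by:

$\eta = \begin{cases} \sqrt{\dfrac{\sum_{n=0}^{63} u^2(n)}{\sum_{n=0}^{63} \hat{u}^2(n)}}, & \hat{g}_p > 0.5 \\ 1.0, & \hat{g}_p \le 0.5 \end{cases}$    (71)

The gain‑scaled emphasized excitation signal û'(n) is given by:

$\hat{u}'(n) = \hat{u}(n)\,\eta$    (72)

The reconstructed speech for the subframe of size 64 is given by

$\hat{s}(n) = \hat{u}'(n) - \sum_{i=1}^{16} \hat{a}_i\,\hat{s}(n-i), \quad n = 0, \ldots, 63$    (73)

where â_i are the interpolated LP filter coefficients.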
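The last two operations, gain compensation and LP synthesis, can be sketched as below. The sketch assumes the AGC factor is the square root of the energy ratio between the non-emphasized and emphasized excitation, and that synthesis is the all-pole recursion s(n) = x(n) − Σ a_i s(n−i) with filter memory carried over from the previous subframe. The function names `agc_scale` and `lp_synthesis` and the memory layout are hypothetical.

```python
import math


def agc_scale(u, u_emph):
    """Scale u_emph so that its energy matches the energy of u."""
    e_ref = sum(x * x for x in u)
    e_emph = sum(x * x for x in u_emph)
    if e_emph == 0.0:
        return list(u_emph)  # nothing to scale
    eta = math.sqrt(e_ref / e_emph)
    return [eta * x for x in u_emph]


def lp_synthesis(exc, a, memory):
    """Run excitation exc through the all-pole filter 1/A(z).

    a      -- coefficients a_1..a_M of A(z) = 1 + sum a_i z^-i
    memory -- at least M previous output samples, most recent last
    """
    order = len(a)
    if len(memory) < order:
        raise ValueError("memory must hold at least M past samples")
    history = list(memory)
    out = []
    for x in exc:
        # history[-1 - i] is s(n - 1 - i), paired with a_(i+1)
        s = x - sum(a[i] * history[-1 - i] for i in range(order))
        out.append(s)
        history.append(s)
    return out
```

In a decoder loop, the last M samples of each synthesized subframe would be kept and passed as `memory` for the next one, so that the filter state is continuous across subframe boundaries.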

The synthesized speech is then passed through adaptive post-processing, which is described in the following section.