6.1.2 TCX mode decoding and signal synthesis

3GPP TS 26.290, Release 17: Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions


The TCX decoder is shown in Figure 13.

Figure 13: Block diagram of the TCX decoder

Figure 13 shows a block diagram of the TCX decoder including the following two cases:

Case 1: Packet-erasure concealment in TCX-256, i.e. when the TCX frame length is 256 samples and the related packet is lost (BFI_TCX = (1)), as shown in Figure 13-a.

Case 2: Normal TCX decoding, possibly with partial packet losses, as shown in Figure 13-b.

In Case 1, no information is available to decode the 256-sample TCX frame. The TCX synthesis is found by processing the past excitation delayed by T, where T = pitch_tcx is a pitch lag estimated in the previously decoded TCX frame, with a non-linear filter roughly equivalent to the LP synthesis filter $1/\hat{A}(z)$. A non-linear filter is used instead of $1/\hat{A}(z)$ to avoid clicks in the synthesis. With $W(z)$ denoting the perceptual weighting filter that defines the TCX target domain, this filter is decomposed into 3 steps:

Step 1: filtering by

$$\frac{W(z)}{\hat{A}(z)}$$

to map the excitation delayed by T into the TCX target domain;

Step 2: applying a limiter (the amplitude is limited to ±rms_wsyn);

Step 3: filtering by

$$\frac{1}{W(z)}$$

to find the synthesis. Note that the buffer OVLP_TCX is set to zero in this case.
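The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the normative procedure: all filters run with zero initial state (a real decoder carries filter memories across frames), the excitation delayed by T is built by repeating the last pitch_tcx samples of a hypothetical past_exc buffer, and the coefficient lists a_hat (for the quantized LP filter $\hat{A}(z)$) and w_num/w_den (numerator and denominator of $W(z)$, assumed here to be the perceptual weighting filter defining the TCX target domain) are hypothetical inputs.

```python
def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r


def iir_filter(b, a, x):
    """Zero-state direct-form filter B(z)/A(z); requires a[0] == 1."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y.append(acc)
    return y


def periodic_excitation(past_exc, pitch_tcx, frame_len=256):
    """Past excitation delayed by T = pitch_tcx, extended over one frame."""
    if not 0 < pitch_tcx <= len(past_exc):
        raise ValueError("pitch_tcx must be within the past excitation buffer")
    exc = list(past_exc)
    start = len(exc)
    for n in range(frame_len):
        exc.append(exc[start + n - pitch_tcx])  # repeats with period pitch_tcx
    return exc[start:]


def conceal_tcx256(past_exc, pitch_tcx, a_hat, w_num, w_den, rms_wsyn, frame_len=256):
    """Hypothetical Case 1 concealment: W/A_hat, limiter, then 1/W."""
    delayed = periodic_excitation(past_exc, pitch_tcx, frame_len)
    # Step 1: W(z)/A_hat(z) = w_num / (w_den * a_hat), into the TCX target domain
    target = iir_filter(w_num, poly_mul(w_den, a_hat), delayed)
    # Step 2: limiter, amplitude clipped to [-rms_wsyn, +rms_wsyn]
    limited = [max(-rms_wsyn, min(rms_wsyn, v)) for v in target]
    # Step 3: 1/W(z) = w_den / w_num, back to the synthesis domain
    return iir_filter(w_den, w_num, limited)
```

With a very large rms_wsyn the limiter never acts, Steps 1 and 3 cancel the weighting, and the cascade reduces to $1/\hat{A}(z)$; this is why the filter is only "roughly" equivalent to the LP synthesis filter, the clipping in Step 2 being its only non-linear part.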

Decoding of the algebraic VQ parameters

In Case 2, TCX decoding involves decoding the algebraic VQ parameters describing each quantized block of the scaled spectrum X’, where X’ is as described in Step 2 of Section 5.3.5.7. Recall that X’ has dimension N, where N = 288, 576 and 1152 for TCX-256, 512 and 1024 respectively, and that each block B’_k has dimension 8. The number K of blocks B’_k is thus 36, 72 and 144 for TCX-256, 512 and 1024 respectively. The algebraic VQ parameters for each block B’_k are described in Step 5 of Section 5.3.5.7. For each block B’_k, three sets of binary indices are sent by the encoder:

a) the codebook index n_k, transmitted in unary code as described in Step 5 of Section 5.3.5.7;

b) the rank I_k of a selected lattice point c in a so-called base codebook, which indicates what permutation has to be applied to a specific leader (see Step 5 of Section 5.3.5.7) to obtain a lattice point c;

c) and, if the quantized block (a lattice point) was not in the base codebook, the 8 indices of the Voronoi extension index vector k calculated in sub-step V1 of Step 5 in Section; from the Voronoi extension indices, an extension vector z can be computed as in reference [7]. The number of bits in each component of index vector k is given by the extension order r, which can be obtained from the unary code value of index n_k. The scaling factor M of the Voronoi extension is given by M = 2^r.

Then, from the scaling factor M, the Voronoi extension vector z (a lattice point in RE₈) and the lattice point c in the base codebook (also a lattice point in RE₈), each quantized scaled block can be computed as

$$Y_k = M\,c + z$$

When there is no Voronoi extension (i.e. n_k < 5, M = 1 and z = 0), the base codebook is either codebook Q₀, Q₂, Q₃ or Q₄ from reference [6]. No bits are then required to transmit vector k. Otherwise, when Voronoi extension is used because n_k is large enough (n_k ≥ 5), only Q₃ or Q₄ from reference [6] is used as a base codebook. The selection of Q₃ or Q₄ is implicit in the codebook index value n_k, as described in Step 5 of Section 5.3.5.7.
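The reconstruction of one block from n_k, the base-codebook point c and the Voronoi extension vector z can be sketched as below. Decoding c from its rank I_k and z from the index vector k (reference [7]) is not shown; c and z are taken as inputs. The mapping from n_k to M and to the base codebook is an assumption borrowed from common RE8 lattice VQ implementations (each extension order r adds 2 to the codebook index, so n_k = 5, 7, … use Q₃ and n_k = 6, 8, … use Q₄); this section only states that r and the Q₃/Q₄ choice are implicit in n_k. The function names are hypothetical.

```python
def voronoi_params(n_k):
    """Return (M, base) for codebook index n_k, with M = 2**r.

    ASSUMPTION: r increases by one for every 2 added to the codebook index
    above 4, so the base codebook ends up as Q3 (odd n_k) or Q4 (even n_k).
    """
    if n_k < 0 or n_k == 1:
        raise ValueError(f"invalid codebook index {n_k}")
    m, base = 1, n_k
    while base > 4:
        m *= 2      # one more extension order r, so M doubles
        base -= 2
    return m, base


def reconstruct_block(n_k, c, z):
    """Quantized 8-dimensional block Y_k = M*c + z."""
    if len(c) != 8 or len(z) != 8:
        raise ValueError("RE8 blocks have dimension 8")
    m, _ = voronoi_params(n_k)
    if m == 1 and any(z):
        raise ValueError("no Voronoi extension for n_k < 5, so z must be 0")
    return [m * ci + zi for ci, zi in zip(c, z)]
```

For example, n_k = 5 gives M = 2 with base codebook Q₃, so c = (2, 2, 0, 0, 0, 0, 0, 0) and z = (0, 0, 2, 2, 0, 0, 0, 0) reconstruct to (4, 4, 2, 2, 0, 0, 0, 0).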

Decoding of the noise-fill parameter

The noise fill-in level $\sigma_{noise}$ is decoded by inverting the 3-bit uniform scalar quantization performed at the encoder in Step 4 of Section 5.3.5.7. For an index $0 \le idx_1 \le 7$, $\sigma_{noise}$ is given by $\sigma_{noise} = 0.1 \cdot (8 - idx_1)$. However, the index idx₁ may not be available. This is the case when BFI_TCX = (1) in TCX-256, (1 X) in TCX-512 and (X 1 X X) in TCX-1024, with X representing an arbitrary binary value. In this case, $\sigma_{noise}$ is set to its maximal value, i.e. $\sigma_{noise} = 0.8$.

Comfort noise is injected into the subvectors B_k that were quantized to zero and that correspond to frequencies above Fs/24 kHz. More precisely, Z is initialized as Z = Y and, for $K/6 \le k \le K$ only, if $Y_k = (0, 0, \ldots, 0)$, $Z_k$ is replaced by the 8-dimensional vector:

$$\sigma_{noise} \cdot \left[\cos\theta_1,\ \sin\theta_1,\ \cos\theta_2,\ \sin\theta_2,\ \cos\theta_3,\ \sin\theta_3,\ \cos\theta_4,\ \sin\theta_4\right],$$

where the phases $\theta_1$, $\theta_2$, $\theta_3$ and $\theta_4$ are randomly selected.
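A minimal sketch of the noise-fill decoding follows. It assumes 0-based block indices k = 0, …, K−1, phases drawn uniformly in [0, 2π) (the text only says they are randomly selected), and Python's random.Random as a stand-in for the decoder's generator; idx1 = None models an unavailable index. The names decode_noise_level and noise_fill are hypothetical.

```python
import math
import random


def decode_noise_level(idx1):
    """sigma_noise = 0.1*(8 - idx1); maximal value 0.8 if idx1 is unavailable."""
    if idx1 is None:
        return 0.8
    if not 0 <= idx1 <= 7:
        raise ValueError("idx1 must be a 3-bit index")
    return 0.1 * (8 - idx1)


def noise_fill(y_blocks, sigma_noise, rng=None):
    """Return Z: a copy of Y where all-zero blocks with k >= K/6 get noise."""
    rng = rng or random.Random()
    k_total = len(y_blocks)
    z = [list(block) for block in y_blocks]
    for k in range(k_total // 6, k_total):       # ASSUMPTION: 0-based k
        if all(v == 0 for v in y_blocks[k]):
            vec = []
            for _ in range(4):                   # 4 complex bins per block
                theta = rng.uniform(0.0, 2.0 * math.pi)
                vec += [sigma_noise * math.cos(theta), sigma_noise * math.sin(theta)]
            z[k] = vec
    return z
```

Each (cos θ, sin θ) pair is the real and imaginary part of one complex spectral coefficient, so every filled coefficient has magnitude $\sigma_{noise}$ and a random phase.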

Low-frequency de-emphasis

After decoding the algebraic VQ parameters and noise-fill parameter, we obtain the quantized pre-shaped TCX spectrum X’. De-shaping is then applied as in Section 5.3.5.6.

Estimation of the dominant pitch value

The estimation of the dominant pitch is performed so that the next frame to be decoded can be properly extrapolated if it is a TCX-256 frame and the related packet is lost. This estimation assumes that the peak of maximal magnitude in the spectrum of the TCX target corresponds to the dominant pitch. The search for the maximum M is restricted to frequencies below Fs/64 kHz:

$$M = \max_{i=1,\ldots,N/32} \left( (X'_{2i})^2 + (X'_{2i+1})^2 \right)$$

and the minimal index $1 \le i_{max} \le N/32$ such that $(X'_{2i_{max}})^2 + (X'_{2i_{max}+1})^2 = M$ is also found. The dominant pitch is then estimated in number of samples as $T_{est} = N / i_{max}$ (this value may not be an integer). Recall that the dominant pitch is calculated for packet-erasure concealment in TCX-256. To avoid buffering problems (the excitation buffer is limited to 256 samples), if $T_{est} > 256$ samples, pitch_tcx is set to 256; otherwise, if $T_{est} \le 256$, repeating a short pitch period several times within 256 samples is avoided by setting pitch_tcx to

$$\text{pitch\_tcx} = \max \left\{ \lfloor n\,T_{est} \rfloor \mid n \text{ integer},\ n > 0,\ n\,T_{est} \le 256 \right\}$$

where $\lfloor \cdot \rfloor$ denotes rounding to the nearest integer towards $-\infty$.
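The pitch estimation and the pitch_tcx rule can be sketched as follows. Because $T_{est} = N/i_{max}$, the condition $n\,T_{est} \le 256$ is equivalent to $n N \le 256\, i_{max}$, so the sketch evaluates the floor with integer arithmetic and avoids rounding trouble when $n\,T_{est}$ is exactly 256. The function name and the packed-list input are illustrative assumptions.

```python
def estimate_pitch_tcx(x_spec, max_lag=256):
    """Return (pitch_tcx, T_est) from the packed TCX spectrum X' of length N."""
    size = len(x_spec)                       # N
    i_max, best = 1, -1.0
    for i in range(1, size // 32 + 1):       # frequencies below Fs/64 kHz
        energy = x_spec[2 * i] ** 2 + x_spec[2 * i + 1] ** 2
        if energy > best:                    # strict '>' keeps the minimal index
            i_max, best = i, energy
    t_est = size / i_max
    if size > max_lag * i_max:               # T_est > 256
        return max_lag, t_est
    mult = (max_lag * i_max) // size         # largest n with n*T_est <= 256
    return (mult * size) // i_max, t_est     # floor(mult * T_est)
```

For N = 288 and a peak at $i_{max} = 5$, $T_{est} = 57.6$; the largest multiple not exceeding 256 is $4 \times 57.6 = 230.4$, so pitch_tcx = 230.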

Inverse transform

To obtain the quantized perceptual signal, an inverse transform is applied to the de-shaped spectrum X’. The transform used at the encoder and decoder is the discrete Fourier transform, implemented as an FFT at the encoder and an inverse FFT at the decoder. Recall that, due to the ordering used at the TCX encoder, the transform coefficients $X' = (X'_0, \ldots, X'_{N-1})$ are such that:

X’₀ corresponds to the DC coefficient,

X’₁ corresponds to the Nyquist frequency, and

the coefficients X’_2k and X’_2k+1, for k = 1, …, N/2 − 1, are the real and imaginary parts of the Fourier component of normalized angular frequency $k\pi/(N/2)$, i.e. of frequency $\frac{k}{N/2} \cdot \frac{F_s}{4}$ kHz.

X’₁ is always forced to 0. After this zeroing, the time-domain TCX target signal x’_w is found by applying an inverse FFT to the de-shaped spectrum X’. Rescaling is applied in the next subsection to obtain the total quantized weighted signal prior to windowing and overlap-add.
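The coefficient ordering can be made concrete with a naive inverse DFT. This sketch is O(N²) for readability (a decoder would use an inverse FFT), assumes the common 1/N normalization and the $e^{+j2\pi kn/N}$ inverse kernel, and drops the Nyquist term X’₁ because it is forced to zero. The function name inverse_packed_dft is hypothetical.

```python
import math


def inverse_packed_dft(x_packed):
    """Time signal x'_w from the packed spectrum X' (DC, Nyquist, then Re/Im pairs)."""
    size = len(x_packed)
    out = []
    for t in range(size):
        acc = x_packed[0]                    # DC term X'_0
        # X'_1 (Nyquist) is forced to 0, so it contributes nothing
        for k in range(1, size // 2):
            re, im = x_packed[2 * k], x_packed[2 * k + 1]
            ang = 2.0 * math.pi * k * t / size
            # bins k and N-k are complex conjugates for a real signal
            acc += 2.0 * (re * math.cos(ang) - im * math.sin(ang))
        out.append(acc / size)
    return out
```

Setting X’₀ = N/2 and X’₆ = N/2 (the real part of bin k = 3) yields $0.5 + \cos(2\pi \cdot 3t/N)$, whatever value is stored in X’₁.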

Decoding of the global TCX gain and scaling

The (global) TCX gain g_TCX is decoded by inverting the 7-bit logarithmic quantization performed in the TCX encoder as in Section 5.2.5.10. First, the r.m.s. value of the TCX target signal x’_w is computed as:

$$rms = \sqrt{\frac{1}{N}\left(x'^2_{w,0} + x'^2_{w,1} + \cdots + x'^2_{w,N-1}\right)}$$

From the received 7-bit index $0 \le idx_2 \le 127$, the TCX gain is given by:

The (logarithmic) quantization step is around 0.71 dB.

This gain is used to scale x’_w into x_w. Note that, owing to the mode extrapolation and the gain repetition strategy, the index idx₂ is available in case of frame loss. However, in case of partial packet losses (1 loss for TCX-512 and up to 2 losses for TCX-1024), the least significant bit of idx₂ may be set by default to 0 in the demultiplexer.
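The r.m.s. computation and the size of the quantization step can be checked with the sketch below. The gain formula itself is not reconstructed; the sketch assumes only that one index step scales the gain by $10^{1/28}$, which gives 20/28 ≈ 0.714 dB per step, consistent with the step of around 0.71 dB stated above. It also shows that forcing the least significant bit of idx₂ to 0 costs at most one step. The constant STEPS_PER_DECADE and all function names are hypothetical.

```python
import math

STEPS_PER_DECADE = 28  # ASSUMPTION: gain proportional to 10**(idx2 / 28)


def tcx_rms(x_w):
    """r.m.s. of the TCX target x'_w over its N samples."""
    return math.sqrt(sum(v * v for v in x_w) / len(x_w))


def step_db():
    """Amplitude ratio between consecutive indices, in dB."""
    return 20.0 / STEPS_PER_DECADE


def lsb_default(idx2):
    """Index with its least significant bit forced to 0 (partial packet loss)."""
    return idx2 & ~1


def gain_error_db(idx2):
    """Gain error in dB caused by forcing the LSB of idx2 to 0."""
    return (idx2 - lsb_default(idx2)) * step_db()
```

Under this assumption the 128 indices span 127 × 20/28 ≈ 90.7 dB, and an odd index decoded with its LSB cleared is off by a single step.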

Windowing and overlap

Since the TCX encoder employs windowing with overlap and weighted ZIR removal prior to transform coding of the target signal, the reconstructed TCX target signal $x = (x_0, x_1, \ldots, x_{N-1})$ is actually found by overlap-add. The overlap-add depends on the type of the previously decoded frame (ACELP or TCX). The TCX target signal is first multiplied by a window $w = [w_0\ w_1\ \ldots\ w_{N-1}]$, whose shape is described in Section 5.3.5.4.

Then, the overlap from the previously decoded frame (OVLP_TCX) is added to the present windowed signal x. The length of the overlap OVLP_TCX depends on the past TCX frame length and on the mode of the past frame (ACELP or TCX).
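A sketch of the windowing and overlap-add is given below. It assumes that the past overlap OVLP_TCX is added to the start of the current windowed frame, and that the samples beyond the frame length (N minus 256, 512 or 1024, i.e. 32, 64 or 128) form the overlap handed to the next frame. The window of Section 5.3.5.4 is passed in rather than reproduced, and window_and_overlap is a hypothetical name.

```python
def window_and_overlap(x_target, window, ovlp_prev, frame_len):
    """Window the N-sample TCX target, add the past overlap, split output/overlap.

    ASSUMPTIONS: OVLP_TCX is aligned with the start of the current window, and
    the N - frame_len samples after the frame are the next frame's overlap.
    """
    if len(x_target) != len(window):
        raise ValueError("target and window must both have N samples")
    if len(ovlp_prev) > len(x_target):
        raise ValueError("overlap longer than the frame")
    y = [xi * wi for xi, wi in zip(x_target, window)]
    for i, v in enumerate(ovlp_prev):
        y[i] += v                      # overlap-add with the previous frame
    return y[:frame_len], y[frame_len:]
```

In Case 1 the buffer OVLP_TCX is zero, which the sketch handles by passing an all-zero (or empty) ovlp_prev.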

Computation of the synthesis signal

The reconstructed TCX target is then filtered through the zero-state inverse perceptual filter to find the synthesis signal which will be applied to the synthesis filter. The excitation is also calculated to update the ACELP adaptive codebook and to allow switching from TCX to ACELP in a subsequent frame. Note that the length of the TCX synthesis is given by the TCX frame length (without the overlap): 256, 512 or 1024 samples.
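The excitation used to update the ACELP adaptive codebook can be obtained from the synthesis by inverse LP filtering with $\hat{A}(z)$, since the synthesis is the excitation passed through $1/\hat{A}(z)$. The sketch below assumes zero filter state and a hypothetical coefficient list a_hat = [1, a_1, …, a_p]; it is an illustration of that relation, not the decoder's exact update procedure.

```python
def fir_filter(b, x):
    """Zero-state FIR filtering: y[n] = sum_i b[i] * x[n - i]."""
    return [sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
            for n in range(len(x))]


def all_pole_filter(a, x):
    """Zero-state all-pole filtering 1/A(z); requires a[0] == 1."""
    y = []
    for n, xn in enumerate(x):
        y.append(xn - sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0))
    return y


def excitation_from_synthesis(synth, a_hat):
    """Excitation u = A_hat(z) s, the LP residual of the synthesis (zero state)."""
    return fir_filter(a_hat, synth)
```

Passing the resulting excitation back through all_pole_filter with the same a_hat returns the synthesis, which is the consistency an adaptive-codebook update relies on when the next frame switches to ACELP.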