5.3.5 TCX Excitation encoder

3GPP TS 26.290, Release 17: Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions

This section presents the details of the TCX encoder, which is one of the possible modes to encode the mono, low-frequency signal in the 0-Fs/4 kHz band. Section 5.3.5.1 first presents the block diagram of the TCX encoder. Then, the details of each module are given in sections 5.3.5.2 to 5.3.5.13.

5.3.5.1 TCX encoder block diagram

Figure 6 shows a block diagram of the TCX encoding mode. The TCX encoding principle is similar for TCX frames of 256, 512 and 1024 samples, with a few differences mostly involving the windowing and filter interpolation. The input audio signal is first filtered through a time-varying weighting filter (same perceptual filter as in AMR-WB) to obtain a weighted signal x. The weighting filter coefficients are interpolated in the ISP domain as in Section 5.3.2.6. The interpolation is linear, and the beginning and end of the interpolation depend on the refresh rate of the LPC filter. The LPC filter is transmitted only once per TCX frame. For longer frames (TCX512 and TCX1024) the interpolated LPC filters will be farther apart than in the case of TCX256 or ACELP frames.

Continuing in Figure 6, if the past frame was an ACELP frame, the zero-input response (ZIR) of the weighting filter is removed from the weighted signal, using the filter state at the end of the previous (ACELP) frame. The signal is then windowed (the window shape is described in Section 5.3.5.4) and a transform is applied to the windowed signal. In the transform domain, the signal is first pre-shaped, to minimize coding noise artefacts at low frequencies, and then quantized using a specific lattice quantizer. Specifically, an 8-dimensional multi-rate lattice quantizer is used, based on an extension of the Gosset lattice.

After quantization, the inverse pre-shaping function is applied to the spectrum which is then inverse transformed to provide a quantized time-domain signal. The gain for that frame is then rescaled to optimize the correlation with the original weighted signal. After gain rescaling, a window is again applied to the quantized signal to minimize the block effects due to quantizing in the transform domain. Overlap-and-add is used with the previous frame if it was also in TCX mode. Finally, the excitation signal is found through inverse filtering with proper filter memory updating. This TCX excitation is in the same "domain" as the ACELP (AMR-WB) excitation.

Figure 6: Principle of TCX encoding

Each module of Figure 6 will now be detailed in the following subsections.

5.3.5.2 Computation of the target signal for transform coding

To obtain the weighted signal, the input frame of audio samples is filtered with a perceptual filter having the following transfer function:

W(z) = Â(z/γ1) / (1 − α z^-1)

Here, Â(z) is the quantized LP filter, interpolated at every 64-sample sub-frame in the ISP domain as in Section 5.3.2.6, and Â(z/γ1) is the weighted version of that filter. The denominator of W(z), 1 − α z^-1, is a constant polynomial of order 1, which is equal to the numerator of the pre-emphasis filter in Section 5.3.1.

5.3.5.3 Zero-input response subtraction

If the previous encoded frame was ACELP, then the zero-input response (ZIR) of the combination of the weighting filter and synthesis filter is removed from the weighted signal. The ZIR is truncated to 128 samples and windowed in such a way that its amplitude decreases monotonically to zero after 128 samples. The truncated ZIR is computed through the following steps:

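Using the filter states at the end of the previous frame, compute the ZIR of the following transfer function (the weighting filter in cascade with the synthesis filter) over 2 consecutive subframes (128 samples duration):

H(z) = Â(z/γ1) / ( Â(z) (1 − α z^-1) )

where Â(z), the quantized LP filter, and Â(z/γ1), its weighted version, are as defined in Section 5.3.5.2, and 1 − α z^-1 is the denominator of the weighting filter.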

Then, calling z(n) the truncated ZIR of H(z) (truncated to the first 2N samples, where N=64 is the subframe length), compute zw(n), the windowed ZIR such that it is always forced to zero at the last sample:

zw(n) = z(n)*w(n) for n = 0 to 2*N-1

where

w(n) = 1 for n = 0 to N-1

and w(n) = (2*N - n) / N for n = N to 2*N-1

The shape of w(n) is shown in Figure 7 below, for a value of N= 64.

Figure 7: Shape of window to truncate the ZIR

After computing zw(n), it is removed from the first 2*N samples of the weighted signal x(n). This removal of the ZIR from the past frame is performed only when the past frame was in ACELP mode.
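The windowing and removal of the truncated ZIR can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes the ramp of the window is (2*N - n) / N, and the function names `zir_window` and `remove_windowed_zir` are hypothetical. The ZIR samples z(n) are taken as an input; computing them from the filter states is not shown.

```python
def zir_window(n_sub=64):
    """Return w(n) for n = 0 .. 2N-1: flat over the first subframe,
    then a linear decay over the second subframe."""
    flat = [1.0] * n_sub
    # Assumed ramp: (2N - n) / N for n = N .. 2N-1.
    ramp = [(2 * n_sub - n) / n_sub for n in range(n_sub, 2 * n_sub)]
    return flat + ramp


def remove_windowed_zir(x, z, n_sub=64):
    """Subtract zw(n) = z(n) * w(n) from the first 2N samples of the
    weighted signal x; the remaining samples are left unchanged."""
    w = zir_window(n_sub)
    out = list(x)
    for n in range(2 * n_sub):
        out[n] -= z[n] * w[n]
    return out
```

In an encoder following this section, the subtraction would be applied only when the previous frame was coded in ACELP mode.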

5.3.5.4 Windowing of target signal

In TCX mode, windowing is applied prior to the transform, and after the inverse transform, in order to apply overlap-and-add to minimize the framing effects due to quantization.

To smooth the transition between ACELP and TCX modes, proper care has to be given to windowing and overlap of successive frames. Figure 8 shows the window shapes depending on the TCX frame length and the type of the previous frame (ACELP or TCX).

Figure 8: Target signal windowing in TCX coding

The window is defined as the concatenation of the following three sub-windows:

w1(n) = sin(2πn / (4 L1)) for n = 0, …, L1-1

w2(n) = 1 for n = 0, …, L-L1-1

w3(n) = sin(2πn / (4 L2)) for n = L2, …, 2L2-1

The constants L1, L2 and L are defined as follows.

L1 = 0 when the previous frame is a 256-sample ACELP frame

L1 = 32 when the previous frame is a 256-sample TCX frame

L1 = 64 when the previous frame is a 512-sample TCX frame

L1 = 128 when the previous frame is a 1024-sample TCX frame

Additionally:

For 256-sample TCX: L = 256 and L2 = 32

For 512-sample TCX: L = 512 and L2 = 64

For 1024-sample TCX: L = 1024 and L2 = 128

We note again that all these window types are applied to the weighted signal, only when the present frame is a TCX frame. Frames of type ACELP are encoded as in AMR-WB encoding (i.e. through analysis-by-synthesis encoding of the excitation signal, so as to minimize the error in the target signal – the target signal is essentially the weighted signal from which the zero-input response of the weighting filter is removed).
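The construction of the TCX window from its three sub-windows can be sketched as follows. This is an illustrative sketch only: it assumes the argument of the sine is 2πn / (4 L1) (respectively 2πn / (4 L2)), and the function name `tcx_window` and the string labels for the previous frame type are hypothetical.

```python
import math

# L2 (overlap into the look-ahead) for each TCX frame length L.
L2_BY_FRAME = {256: 32, 512: 64, 1024: 128}
# L1 for each previous-frame type; the string keys are labels chosen
# for this sketch only.
L1_BY_PREVIOUS = {"ACELP": 0, "TCX256": 32, "TCX512": 64, "TCX1024": 128}


def tcx_window(frame_len, previous):
    """Concatenate w1 (rising, L1 samples), w2 (flat, L - L1 samples)
    and w3 (falling, L2 samples)."""
    L = frame_len
    L1 = L1_BY_PREVIOUS[previous]
    L2 = L2_BY_FRAME[frame_len]
    # range(L1) is empty when L1 = 0, so no division by zero occurs.
    w1 = [math.sin(2 * math.pi * n / (4 * L1)) for n in range(L1)]
    w2 = [1.0] * (L - L1)
    w3 = [math.sin(2 * math.pi * n / (4 * L2)) for n in range(L2, 2 * L2)]
    return w1 + w2 + w3
```

The total window length is L + L2 samples, i.e. the frame plus its look-ahead, whatever the value of L1.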

5.3.5.5 Transform

After windowing, the signal is mapped to the frequency domain through a Discrete Fourier Transform (DFT), defined as:

where LTOT is the number of samples in the DFT. LTOT depends on the frame length (256, 512 or 1024 samples, plus the lookahead which is a function of the frame length).

An FFT is used to accelerate the computation of the Fourier coefficients. A radix-9 FFT is used to adapt to the frame length, which is not a power of 2. Including the overlap in the windowing described in Section 5.3.5.4, the number of samples at the input of the FFT is, respectively, LTOT = 288 for 256-sample TCX frames (256 samples in the frame plus 32 samples in the lookahead), LTOT = 576 for 512-sample TCX frames (512 samples in the frame plus 64 samples in the lookahead), and LTOT = 1152 for 1024-sample TCX frames (1024 samples in the frame plus 128 samples in the lookahead).
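As a quick check of why a radix-9 stage is needed, the short sketch below computes LTOT for each frame length and splits it into a power of two times an odd factor. The helper names are hypothetical, and the look-ahead is taken as one eighth of the frame length, which matches the three values listed above.

```python
def fft_length(frame_len):
    """LTOT = frame length plus look-ahead (frame_len / 8 samples)."""
    return frame_len + frame_len // 8


def split_power_of_two(n):
    """Return (p, q) such that n = 2**p * q with q odd."""
    p = 0
    while n % 2 == 0:
        n //= 2
        p += 1
    return p, n


for frame_len in (256, 512, 1024):
    ltot = fft_length(frame_len)
    print(frame_len, ltot, split_power_of_two(ltot))
```

In all three cases the odd factor is 9, so each transform length is a power of two multiplied by 9.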

5.3.5.6 Spectrum pre-shaping

Once the Fourier spectrum (FFT) is computed, an adaptive low-frequency emphasis module is applied to the spectrum, to minimize the perceived distortion in the lower frequencies. The inverse low-frequency emphasis will be applied at the decoder, as well as in the encoder to allow obtaining the excitation signal necessary to encode the next frames. The adaptive low-frequency emphasis is applied only on the first quarter of the spectrum, as follows.

First, we call X the transformed signal at the output of the transform (FFT) in Figure 6. The Fourier coefficient at Nyquist frequency is systematically set to 0. Then, if LTOT is the number of samples in the FFT (LTOT is thus the window length), the K= LTOT /2 complex-valued Fourier coefficients are grouped in blocks of four consecutive coefficients, forming 8-dimensional real-valued blocks. This block size of 8 is chosen to coincide with the 8-dimensional lattice quantizer used for spectral quantization. The energy of each block is computed, up to the first quarter of the spectrum. The energy Emax and position index I of the block with maximum energy are stored. Then, we calculate a factor for each 8-dimensional block with position index m smaller than I, as follows:

– calculate the energy Em of the 8-dimensional block at position index m

– compute the ratio Rm = Emax / Em

– compute the value (Rm) ¼

– if Rm > 10, then set Rm = 10 (maximum gain of 20 dB)

– also, if Rm > Rm-1 then Rm = Rm-1

This last condition ensures that the ratio function Rm decreases monotonically. Further, limiting the ratio Rm to be smaller than or equal to 10 means that no spectral components in the low-frequency emphasis function will be modified by more than 20 dB.

After computing the ratio Rm = (Emax / Em)¼ for all blocks with position index smaller than I (and with the limiting conditions described above), we then apply these ratios as a gain for each corresponding block. This has the effect of increasing the energy of blocks with relatively low energy compared to the block with maximum energy Emax. Applying this procedure prior to quantization has the effect of shaping the coding noise in the lower band, such that low energy components before the first spectral peak will be better encoded.
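The adaptive low-frequency emphasis can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference code: the spectrum is passed as a flat list of real values (8 per block), the running limit for the first block starts at 10, and a small constant `eps` is added to each block energy to avoid division by zero. The function name `low_freq_emphasis` is hypothetical.

```python
def low_freq_emphasis(x, eps=1e-12):
    """Apply the adaptive low-frequency emphasis to the real-valued
    spectrum x. Returns (shaped spectrum, list of gains Rm for m < I)."""
    n_blocks = (len(x) // 4) // 8        # blocks in the first quarter
    # Block energies; eps is an assumption of this sketch.
    energy = [sum(v * v for v in x[8 * m: 8 * m + 8]) + eps
              for m in range(n_blocks)]
    i_max = max(range(n_blocks), key=lambda m: energy[m])
    e_max = energy[i_max]

    y = list(x)
    gains = []
    prev = 10.0                          # assumed start value (20 dB cap)
    for m in range(i_max):
        r = (e_max / energy[m]) ** 0.25
        r = min(r, 10.0, prev)           # cap at 10 and keep Rm <= Rm-1
        prev = r
        gains.append(r)
        for j in range(8 * m, 8 * m + 8):
            y[j] *= r
    return y, gains
```

Blocks at or above the position I of the maximum-energy block are left unchanged.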

5.3.5.7 Split multi-rate lattice VQ

To quantize the pre-shaped spectrum X of the weighted signal in TCX mode, a method based on lattice quantizers is used. Specifically, the spectrum is quantized in 8-dimensional blocks using vector codebooks composed of subsets of the Gosset lattice, referred to as the RE8 lattice (see [6]). All points of a given lattice can be generated from the so-called generator matrix G of the lattice, as c = k G, where k is a row vector with integer values and c is the generated lattice point. To form a vector codebook at a given rate, only lattice points inside a sphere (in 8 dimensions) of a given radius are taken. Multi-rate codebooks can thus be formed by taking subsets of different radii.

In lattice quantization, the operation of finding the nearest neighbour of an input vector x among all codebook points is reduced to a few simple operations, involving rounding the components of a vector and verifying a few constraints. Hence, no exhaustive search is carried out as in stochastic quantization, which uses stored tables. Once the best lattice codebook point is determined, further calculations are also necessary to compute the binary index that will be sent to the decoder. The larger the components of the input vector x, the more bits will be required to encode the index of its nearest neighbour in the lattice codebook. Hence, to remain within a pre-defined bit budget, a gain-shape approach has to be used, where the input vector is first scaled down, i.e. divided by a gain which has to be estimated, then quantized in the lattice, then scaled up again to produce the quantization result. To reduce computation complexity, the binary indices will actually only be calculated if a given TCX mode is retained as the best mode for a frame.

For simplicity, we let N be the length of the DFT. Since the transform used to obtain X is a Discrete Fourier Transform, there are N/2+1 Fourier coefficients including X(N/2) at Nyquist frequency. In the quantization process, coefficient X(N/2) is always set to 0, so there are exactly N/2 Fourier coefficients to quantize. Then, all coefficients of X are complex, except X(0) which is real.

To be quantized using the RE8 lattice codebooks, the pre-shaped spectrum X is split into consecutive blocks of 8 real values (4 consecutive complex coefficients). There are K = N/8 such blocks in the whole spectrum. We call Bk the kth block, with k = 0, 1, …, K-1. To remain within the total bit budget, the spectrum X will have to be divided by a global gain g prior to quantization, and multiplied by the quantized global gain after each block Bk is encoded using the RE8 lattice. We call X’ = X/g the scaled spectrum and B’k = Bk / g the kth scaled block. Thus, the parameters sent to the decoder to encode the TCX spectrum X are the global gain g and the index of the nearest neighbour of each block Bk within the lattice codebook.

The index of the nearest neighbour in the lattice is actually composed of three parts: 1) a codebook index, which essentially represents the bit allocation for each 8-dimensional vector; 2) a vector index, which uniquely identifies a lattice vector in a so-called base codebook C; and 3) an extension index k, which is used to extend the base codebook when the selected point in the lattice is not in the base codebook C. The extension used, called the Voronoi extension, will be described in Step 5 below.

These parameters are encoded using the 5 Steps described below.

Step 1 Find the energy Ek of each block Bk:

and obtain from Ek a first estimate of the bit budget using the starting assumption that the global gain g equals 1 (i.e. that the spectrum X is quantized without scaling first):

The formula for Rk(1) is based on the properties of the underlying RE8 lattice, and the method used for encoding the index of a lattice point selected by the quantizer. These properties and encoding method will be described in Steps 3 and 5.

Unless the energy of the frame is very small, the block energies Ek will be too large to ensure that the total bit consumption (sum of all Rk(1)) remains within the total bit budget for the frame. Hence, it is necessary to estimate a gain g so that the quantization of X’ = X/g in the RE8 lattice will produce a set of indices that stay within the bit budget. This gain estimation is performed in Step 2.

Step 2 The estimation of the global gain g for the TCX frame is performed in an iteration, as follows.

Initialisation: Set fac = 128, offset = 0 and nbits_max = 0.95*(NB_BITS_ – K)

Iteration: Do the following block of operations NITER times (here, NITER = 10).

1- offset = offset + fac

2-

3- if nbits <= nbits_max, then offset = offset – fac

fac = fac / 2

After the iteration, the global gain is estimated as

The scaled spectrum can then be obtained as X’ = X/g. The inputs to the lattice quantizer described in Step 3 are the scaled blocks B’k = Bk / g, each an 8-dimensional vector of real components. The assumption is that the total number of bits used to quantize the blocks B’k into the lattice codebook will be close to the bit budget.
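The iteration of Step 2 is a bisection on the variable offset, and can be sketched as follows. This sketch rests on two assumptions that are not stated in the text above: the bit count nbits in sub-step 2 is taken as the sum over all blocks of max(Rk(1) – offset, 0), and fac is halved at every iteration. The function name `estimate_offset` is hypothetical, and the conversion from the final offset to the gain g is not shown.

```python
def estimate_offset(r1, nbits_max, n_iter=10):
    """Bisection on 'offset' as in Step 2.

    r1        -- per-block bit estimates Rk(1) from Step 1
    nbits_max -- bit budget for the frame
    """
    fac = 128.0
    offset = 0.0
    for _ in range(n_iter):
        offset += fac
        # Assumption: a block whose estimate falls below the offset
        # is counted as costing 0 bits.
        nbits = sum(max(r - offset, 0.0) for r in r1)
        if nbits <= nbits_max:
            offset -= fac                # step was not needed, undo it
        fac /= 2                         # assumed: halved every iteration
    return offset
```

With NITER = 10 and fac starting at 128, the final step size is 0.25, which sets the resolution of the offset.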

Step 3 In this step, each 8-dimensional block B’k of the scaled spectrum X’ = X/g is rounded as a point in the RE8 lattice, to produce its quantized version, B̂’k. Before looking at the quantization procedure, it is worthwhile to look at the properties of this lattice. RE8 is defined as follows:

RE8 = 2D8 ∪ {2D8 + (1,1,1,1,1,1,1,1)}

that is, as the union of the 2D8 lattice and a version of 2D8 shifted by the vector (1,1,1,1,1,1,1,1). Therefore, searching for the nearest neighbour in the lattice RE8 is equivalent to searching for the nearest neighbour in the lattice 2D8, then searching for the nearest neighbour in the lattice 2D8 + (1,1,1,1,1,1,1,1), and finally selecting the best of those two lattice points. The lattice 2D8 is just the D8 lattice scaled by a factor of 2, with the D8 lattice defined as:

D8 = {(x1, …, x8) ∈ Z^8 | x1 + … + x8 is even}

That is, the lattice points in D8 are all integers, with the constraint that the sum of all components is even. This also implies that the sum of the components of a lattice point in 2D8 is an integer multiple of 4.

From this definition of RE8, it is straightforward to develop a fast algorithm to search for the nearest neighbour of an 8-dimensional block B’k among all lattice points in RE8. This is done by applying the following operations. We note that the components of B’k are floating point values. The result of the quantization, B̂’k, will be a vector of integers.

1. zk = 0.5 * B’k

2. Round each component of zk to the nearest integer, to generate the integer vector round(zk)

3. y1k = 2 * round(zk)

4. calculate S as the sum of the components of y1k

5. If S is not an integer multiple of 4 (negative values are possible), then modify one of its components as follows:

– find the position I where abs(zk(i)- y1k(i)) is the highest

– if zk(I)- y1k(I) < 0, then y1k(I) = y1k(I) – 2

– if zk(I)- y1k(I) > 0, then y1k(I) = y1k(I) + 2

6. zk = 0.5 * (B’k − 1.0) where 1.0 denotes a vector with all 1’s

7. Round each component of zk to the nearest integer, to generate the integer vector round(zk)

8. y2k = 2 * round(zk)

9. calculate S as the sum of the components of y2k

10. If S is not an integer multiple of 4 (negative values are possible), then modify one of its components as follows:

– find the position I where abs(zk(i)- y2k(i)) is the highest

– if zk(I)- y2k(I) < 0, then y2k(I) = y2k(I) – 2

– if zk(I)- y2k(I) > 0, then y2k(I) = y2k(I) + 2

11. y2k = y2k + 1.0

12. Compute the squared errors e1k = (B’k − y1k)² and e2k = (B’k − y2k)², summed over the 8 components

13. If e1k < e2k, then the best lattice point (nearest neighbour in the lattice) is y1k; otherwise the best lattice point is y2k.

This is noted as B̂’k = ck, where ck is the best lattice point as selected above.

Through this quantization procedure, the scaling gain g, estimated in Step 2, is left unquantized. The gain will be quantized only after being recomputed as in Section 5.3.5.10, to obtain the quantized gain ĝ. The quantized spectrum will then be obtained by multiplying the quantized scaled blocks B̂’k by ĝ.

We note that after this lattice quantization step, the indices of the selected lattice points are not known. The indices will only be computed if a particular TCX mode is selected instead of an ACELP mode. (See Step 5 for the lattice index computation)
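The nearest-neighbour search of Step 3 can be sketched as follows. This is a hedged illustration rather than the reference implementation: the function names are hypothetical, rounding ties are resolved away from zero, the component to correct is chosen from the rounding error measured against the input block, and the candidate with the smaller squared error is selected.

```python
import math


def _round_half_away(v):
    """Nearest integer, ties away from zero (Python's round() uses
    ties-to-even, which is not what is wanted here)."""
    return int(math.floor(abs(v) + 0.5)) * (1 if v >= 0 else -1)


def nearest_2d8(x):
    """Nearest point of 2D8: even integers whose sum is a multiple of 4."""
    y = [2 * _round_half_away(0.5 * v) for v in x]
    if sum(y) % 4 != 0:
        # Move the worst-rounded component by 2 towards the input.
        err = [v - yi for v, yi in zip(x, y)]
        i = max(range(8), key=lambda j: abs(err[j]))
        y[i] += 2 if err[i] > 0 else -2
    return y


def nearest_re8(x):
    """Nearest point of RE8 = 2D8 united with 2D8 + (1,...,1)."""
    y1 = nearest_2d8(x)
    y2 = [yi + 1 for yi in nearest_2d8([v - 1.0 for v in x])]
    e1 = sum((v - yi) ** 2 for v, yi in zip(x, y1))
    e2 = sum((v - yi) ** 2 for v, yi in zip(x, y2))
    return y1 if e1 < e2 else y2
```

Only rounding, one parity check and one comparison are needed per block, which is why no stored codebook has to be searched.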

Step 4 A last step in the quantization procedure is the determination and quantization of a comfort noise factor. Comfort noise enhances the perceived quality in transform-based coders, which is the case for the TCX modes. Comfort noise will be added only to unquantized spectral components in the upper-half of the spectrum (Fs/8 kHz and above). Taking again K as the total number of 8-dimensional blocks in the spectrum, the comfort noise factor is calculated as follows:

Initialisation: Set nbits = 0, n = 1 and take the offset value at the end of the iteration in Step 2 above.

Iteration : For k = K / 2 to K-1, do

1. tmp = Rk(1) – offset (with Rk(1) as calculated in Step 1)

2. if (tmp < 5), then nbits = nbits + tmp and n = n + 1

Noise factor calculation:

Set nbits = nbits/ n, and evaluate the noise factor as

The noise factor lies between 0 and 1.

Noise level quantization

The comfort noise factor is quantized using 3 bits in the range from 0.8 to 1.0.

Step 5 In Step 3, each scaled block B’k was rounded as a point in the RE8 lattice. The result is B̂’k, the quantized version of B’k. If the corresponding TCX mode is actually retained as the best encoding mode for that frame (in open-loop fashion as in Section 5.2.4 or in closed-loop fashion as in Section 5.2.3), then an index has to be computed for each B̂’k for transmission to the decoder. The computation of these indices is described in this final Step.

The calculation of an index for a given point in the RE8 lattice is based on two basic principles:

1- All points in the lattice RE8 lie on concentric spheres of radius √(8m) with m = 0, 1, 2, 3, etc., and each lattice point on a given sphere can be generated by permuting the coordinates of reference points called leaders. There are very few leaders on a sphere, compared to the total number of lattice points which lie on the sphere. Codebooks of different bit rates can be constructed by including only spheres up to a given number m. See reference [6] for more details, where codebooks Q0, Q1, Q2, Q3, Q4, and Q5 are constructed with respectively 0, 4, 8, 12, 16 and 20 bits. Hence, codebook Qn requires 4n bits to index any point in that codebook.

2- From a base codebook C (i.e. a codebook containing all lattice points from a given set of spheres up to a number m), an extended codebook can be generated by multiplying the elements of C by a factor M, and adding a second-stage codebook called the Voronoi extension. This construction is given by y = M z + v, where M is the scale factor, z is a point in the base codebook and v is the Voronoi extension. The extension is computed in such a way that any point y = M z + v is also a lattice point in RE8. The extended codebook includes lattice points that extend further out from the origin than the base codebook.

The base codebook C in the present TCX modes can be either codebook Q0, Q2, Q3 or Q4 from reference [6]. When a given lattice point is not included in these base codebooks, the Voronoi extension is applied, using this time only the Q3 or Q4 part of the base codebook. Note that here, Q2 ⊂ Q3 but Q3 ⊄ Q4.

Then, the calculation of the index for each lattice point (quantization result in Step 3) is done according to the following operations.

Verify if the quantized block B̂’k is in the base codebook C. Here, this implies verifying if B̂’k is an element of Q0, Q2, Q3 or Q4 from [6]. If B̂’k is in C, the index used to encode B̂’k is thus the codebook number nk plus the index Ik of the codevector B̂’k in Qnk. The codebook number is encoded as a unary code, as follows:

Q0 → unary code for nk is 0

Q2 → unary code for nk is 10

Q3 → unary code for nk is 110

Q4 → unary code for nk is 1110

The terminating "0" in this unary code will indicate to the decoder the separation between the successive blocks in the bit stream.

The index Ik indicates the rank of B̂’k, i.e. the permutation to be applied to a specific leader to obtain B̂’k (see [6]). Note that if the codebook number nk = 0, then Ik uses no bits. Otherwise, the index Ik uses 4nk bits. Hence, a total of 5nk bits are required to index any lattice point in the base codebook: nk bits for the unary code specifying the codebook number (with the exception that 1 bit is required for Q0), and 4nk bits to index the lattice point in that codebook.

If B̂’k is not in the base codebook, then apply the Voronoi extension through the following sub-steps, using this time only Q3 or Q4 as the base codebook.

V0 Set the extension order r = 1 and the scale factor M = 2^r = 2.

V1 Compute the Voronoi index k of the lattice point B̂’k. The Voronoi index k depends on the extension order r and the scale factor M. The Voronoi index is computed via modulo operations such that k depends only on the relative position of B̂’k in a scaled and translated Voronoi region:

k = modM(B̂’k G^-1)

Here, G is the generator matrix and modM(·) is the component-wise modulo-M operation. Hence, the Voronoi index k is a vector of integers with each component in the interval 0 to M – 1.

V2 Compute the Voronoi codevector v from the Voronoi index k. This can be implemented using an algorithm described in [7].

V3 Compute the difference vector w = B̂’k − v. This difference vector w always belongs to the scaled lattice MΛ, where Λ is the lattice RE8. Compute z = w/M, i.e., apply the inverse scaling to the difference vector w. The codevector z belongs to the lattice Λ since w belongs to MΛ.

V4 Verify if z is in the base codebook C (i.e. in Q3 or Q4)

If z is not in C, increment the extension order r by 1, multiply the scale factor M by 2, and go back to sub-step V1.

Otherwise, if z is in C, then we have found an extension order r and a scaling factor M = 2^r sufficiently large to encode the index of B̂’k. The index is formed of three parts: 1) the codebook index nk as a unary code defined below; 2) the rank Ik of z in the corresponding base codebook (either Q3 or Q4); and 3) the 8 indices of the Voronoi index vector k calculated in sub-step V1, where each index requires exactly r bits (r is the Voronoi extension order, initialised in sub-step V0).

The codebook index nk is encoded in unary code as follows:

nk = 11110 when the base codebook is Q3 and the Voronoi extension order is r = 1

nk = 111110 when the base codebook is Q4 and the Voronoi extension order is r = 1

nk = 1111110 when the base codebook is Q3 and the Voronoi extension order is r = 2

nk = 11111110 when the base codebook is Q4 and the Voronoi extension order is r = 2

etc.

The lattice point B̂’k is then described as

B̂’k = M z + v
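The bit accounting for one block can be sketched as follows. The sketch assumes, from the unary codes listed above, that the codebook number for base codebook Qn with extension order r is nk = n + 2r (with r = 0 meaning no extension); this relation and the function names are assumptions of the sketch, not statements of the specification.

```python
def codebook_number(base, r=0):
    """nk for base codebook Q<base> and Voronoi extension order r
    (r = 0 means the point is in the base codebook)."""
    if r == 0 and base not in (0, 2, 3, 4):
        raise ValueError("base codebook must be Q0, Q2, Q3 or Q4")
    if r > 0 and base not in (3, 4):
        raise ValueError("the Voronoi extension only uses Q3 or Q4")
    return base + 2 * r


def unary_code(nk):
    """'0' for Q0, otherwise nk - 1 ones and a terminating zero."""
    return "0" if nk == 0 else "1" * (nk - 1) + "0"


def index_bits(base, r=0):
    """Return (unary bits, rank bits, Voronoi bits) for one block."""
    nk = codebook_number(base, r)
    return len(unary_code(nk)), 4 * base, 8 * r
```

Under this assumption, the three parts always add up to 5nk bits (1 bit for Q0): for example, Q3 with r = 1 gives 5 + 12 + 8 = 25 bits.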

The packetisation of these indices into transmission packets will be described in Section 5.6.1.

5.3.5.8 Spectrum de-shaping

Spectrum de-shaping is applied to the quantized spectrum prior to applying the inverse FFT. The de-shaping is done according to the following steps:

– calculate the energy Em of the 8-dimensional block at position index m

– compute the ratio Rm = Emax / Em

– compute the value (Rm) ½

– if Rm > 10, then set Rm = 10 (maximum gain of 20 dB)

– also, if Rm > Rm-1 then Rm = Rm-1

After computing the ratio Rm = (Emax / Em)½ for all blocks with position index smaller than I (and with the limiting conditions described above), we then divide each block by the corresponding ratio. Note that if we neglect the effects of quantization, this de-shaping is the inverse of the pre-shaping function as applied in Section 5.3.5.6.
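The change of exponent from ¼ in Section 5.3.5.6 to ½ here can be checked with a short derivation. It is a sketch that neglects quantization and the two limiting conditions, and assumes that the de-shaping energies are measured on the pre-shaped spectrum. Writing E_m for the energy of block m before pre-shaping, E'_m for its energy after pre-shaping, and R_m for the pre-shaping gain (the block with energy E_max is not modified):

```latex
\begin{align*}
E'_m &= R_m^2 \, E_m
      = \left(\frac{E_{\max}}{E_m}\right)^{1/2} E_m
      = \left(E_{\max} \, E_m\right)^{1/2}, \\
\left(\frac{E_{\max}}{E'_m}\right)^{1/2}
     &= \left(\frac{E_{\max}}{\left(E_{\max} \, E_m\right)^{1/2}}\right)^{1/2}
      = \left(\frac{E_{\max}}{E_m}\right)^{1/4}
      = R_m .
\end{align*}
```

The ratio computed with exponent ½ on the pre-shaped blocks therefore equals the gain applied with exponent ¼ on the original blocks, so dividing by it restores the original block.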

5.3.5.9 Inverse transform

The quantized spectrum is inverse transformed to obtain the time-domain quantized signal x̂. The inverse DFT is applied, as defined by:

where LTOT is the number of samples in the TCX frame, as defined in Section 5.3.5.5. An Inverse FFT is used to optimize the computation time of the inverse DFT.

5.3.5.10 Gain optimization and quantization

The global gain estimated in Section 5.3.5.7 to scale the spectrum prior to the multi-rate lattice quantization is not guaranteed to maximize the correlation between the original weighted signal x and the quantized weighted signal x̂. Thus, after the inverse transform of the quantized spectrum (Section 5.3.5.9), the optimal gain between x and x̂ is computed as follows:

with LTOT as defined previously. Then, the gain g* is quantized on a logarithmic scale to a 7-bit index, using the following procedure. The procedure is purely algebraic and does not require storing a gain codebook.

1. Calculate the energy of the quantized weighted signal:

2. Compute the RMS value: (known also at the decoder)

3. Set G = g* × rms (normalization step)

4. Calculate the index as where denotes removing the fractional part of x (rounding towards 0).

5. If index < 0, then set index =0, and if index >127, then set index = 127.

The quantized gain can be calculated as follows, both at the encoder and at the decoder, since the decoder can calculate the value of rms locally:

5.3.5.11 Windowing for overlap-and-add

After gain scaling, the quantized weighted signal is windowed again, according to the TCX frame length and the mode of the previous frame. The window shapes are as shown in Figure 8 and defined in Section 5.3.5.4.

To reconstruct the complete quantized weighted signal, overlap-and-add is applied between the memory of the past frame and the beginning of the present frame corresponding to the non-flat portion of the window. Recall that if the past frame was in ACELP mode, the memory of the past frame corresponds to the windowed, truncated ZIR of the perceptual filter, as calculated in Section 5.3.5.3.

5.3.5.12 Memory update

The samples in the lookahead (windowed portion to the right of the TCX frames in Figure 8) are kept in memory for the overlap-and-add procedure in the next TCX frame.

5.3.5.13 Excitation signal computation

The excitation signal is finally computed by filtering the quantized weighted signal through the inverse weighting filter with zero-memory. The excitation is needed at the encoder in particular to update the long-term predictor memory.