5.3.2 Time-to-frequency transformations

3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description, Release 15

5.3.2.1 Transform sizes and MDCT configurations

The MDCT encoder operates with a frame of length at the input sampling frequency .

5.3.2.2 Long block transformation (ALDO window)

The window used in long block transformation is an asymmetrical low-delay optimized (ALDO) window. The ALDO window is stored in ROM in two versions, one at 48 kHz and one at 25.6 kHz; the ALDO window at the other input sampling frequencies (8, 12.8, 16 and 32 kHz) is obtained by on-the-fly decimation during the folding process described in subclause 5.3.2.2.1.

The ALDO window has a time support of 40 ms and its definition is the same at 48 and 25.6 kHz. To simplify notations the sampling frequency (48 or 25.6 kHz) associated with the ALDO window is not included in the following text. This window, of length , is structured in 4 segments:

(877)

where is the frame length (20 ms) and is the length of the last segment with a weight of 0 (5.625 ms). The different segments consist of an increasing segment , a constant segment with a weight of 1, a decreasing segment , a constant segment with a weight of 0, as illustrated in figure 53. Note that in practice the parts corresponding to weights of 1 and 0 are not stored explicitly.

Figure 53 also illustrates the time alignment of input signal frames at the MDCT encoder. As the fourth segment of the ALDO window has by definition a weight of 0, the frame of new input samples, , is time-aligned so that its end coincides with the end of the third ALDO window segment ; this alignment saves samples of lookahead (5.625 ms). The previous frame of input signal, , is partially weighted by the first ALDO window segment, .

Figure 53: ALDO window

The ALDO window was obtained at 48 and 25.6 kHz by a two-step process: first an initial window was obtained, then this initial window was regularized in order to ensure perfect reconstruction. The initial window is defined in 4 segments with an increasing segment , a constant segment with a weight of 1 identical to , a decreasing segment , a constant segment with a weight of 0 identical to . The ALDO window is therefore defined by spelling out the initial window and the regularization term .

ALDO window at 48 kHz

At 48 kHz, the 20 ms frame has samples. The first segment of the ALDO window is given by

(878)

where the first segment of the initial window is

(879)

and the regularization factor is

, (880)

is the length of the first segment (690 samples at 48 kHz), is a constant. The second segment of the ALDO window is . The third segment of the ALDO window is given by

(881)

where the third segment of the initial window is

(882)

and the regularization factor is

, (883)

is the length of the third segment (420 samples at 48 kHz), is a constant. The fourth segment of the ALDO window is (270 samples at 48 kHz).

ALDO window at 25.6 kHz

The ALDO window at 25.6 kHz is defined as the ALDO window at 48 kHz, except the following parameters are used:

It follows that and .
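The segment lengths quoted above can be cross-checked from the durations given in this subclause. The following is a minimal sketch, not part of the normative text. It assumes the increasing slope lasts 14.375 ms (8.75 ms + 5.625 ms, as implied by subclause 5.3.2.5.1), the decreasing slope 8.75 ms, the zero segment 5.625 ms and the whole window 40 ms. The function name is hypothetical, and the 25.6 kHz values are derived from these durations rather than copied from the parameter list.

```python
# Segment durations in microseconds, taken from the text of this subclause.
# ASSUMPTION: long slope = 8.75 ms + 5.625 ms = 14.375 ms (see 5.3.2.5.1).
TOTAL_US, RISE_US, FALL_US, ZERO_US = 40_000, 14_375, 8_750, 5_625


def aldo_segment_lengths(fs_hz):
    """Return (rise, ones, fall, zeros) segment lengths in samples."""
    def to_samples(duration_us):
        samples, remainder = divmod(duration_us * fs_hz, 1_000_000)
        if remainder:
            raise ValueError("duration is not a whole number of samples")
        return samples

    total = to_samples(TOTAL_US)
    rise = to_samples(RISE_US)
    fall = to_samples(FALL_US)
    zeros = to_samples(ZERO_US)
    ones = total - rise - fall - zeros  # the flat part is whatever remains
    return rise, ones, fall, zeros


for rate in (48_000, 25_600):
    print(rate, aldo_segment_lengths(rate))
```

At 48 kHz this reproduces the 690, 420 and 270 sample lengths stated above, and the constant segment of weight 1 comes out at 540 samples.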

ALDO window at other sampling frequencies

The ALDO window at 48 kHz is used to derive on the fly the ALDO window at 8, 16, and 32 kHz. Similarly, the ALDO window at 25.6 kHz is used to derive on the fly the ALDO window at 12.8 kHz. Details on this on-the-fly decimation are provided in subclause 5.3.2.2.1.

5.3.2.2.1 Folding and on-the-fly window decimation

The folding operations and window decimation (when applicable) are combined in the same process. To achieve perfect reconstruction, the ALDO windows at 48 and 25.6 kHz are irregularly decimated. The decimation consists of selecting a subset of coefficients from the initial window (48 or 25.6 kHz) at specific indices chosen to preserve perfect reconstruction; it is combined with the folding process for efficiency.

The folding process is illustrated in figure 54. The MDCT folding is performed by dividing the 40 ms time support of the ALDO window into 4 sections delimited by dotted vertical lines.

Figure 54: Folding with ALDO window.

For each frame of new input samples, a block of length is folded into a block , . The ALDO window at 48 or 25.6 kHz is defined at a sampling frequency corresponding to two frames of length at 48 or 25.6 kHz () . The ratio between and is called the decimation factor.

The folded frame is given for by:

, (884)

, (885)

where the ratio is the decimation factor and is an offset that depends on the decimation factor. The ALDO window used for decimation, decimation factor and offset are given in table 77.

Table 77: ALDO window used for decimation, decimation factor and offset parameters.

| MDCT core sampling frequency (kHz) | ALDO window used for decimation (kHz) | Decimation factor | Offset |
|---|---|---|---|
| 8 | 48 | 6 | 2 |
| 12.8 | 25.6 | 2 | 0 |
| 16 | 48 | 3 | 1 |
| 25.6 | 25.6 | 1 | 0 |
| 32 | 48 | 3 | 1 |
| 48 | 48 | 1 | 0 |
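Table 77 can be held as a small lookup, as in the following sketch. The names `TABLE_77`, `decimation_is_integer_ratio` and `strided_indices` are hypothetical. The strided index pattern is only the simplest reading of "decimation factor" and "offset"; it is an assumption, and the exact irregular index selection is the one fixed by equations (884) and (885).

```python
# Table 77 as data: core rate (Hz) -> (stored ALDO window rate (Hz), factor, offset)
TABLE_77 = {
    8_000: (48_000, 6, 2),
    12_800: (25_600, 2, 0),
    16_000: (48_000, 3, 1),
    25_600: (25_600, 1, 0),
    32_000: (48_000, 3, 1),
    48_000: (48_000, 1, 0),
}


def decimation_is_integer_ratio(core_hz):
    """True when core rate times decimation factor equals the stored window rate."""
    window_hz, factor, _offset = TABLE_77[core_hz]
    return core_hz * factor == window_hz


def strided_indices(core_hz, count):
    """HYPOTHETICAL helper: indices offset + factor * n into the stored window.

    ASSUMPTION: plain strided selection; the normative indices are those of
    equations (884)-(885), which this sketch does not reproduce.
    """
    _window_hz, factor, offset = TABLE_77[core_hz]
    return [offset + factor * n for n in range(count)]


special_cases = [rate for rate in TABLE_77 if not decimation_is_integer_ratio(rate)]
print(special_cases)
```

Only the 32 kHz entry fails the integer-ratio check, which matches its handling as a special case: its factor and offset are those of the 48 to 16 kHz decimation.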

The 32 kHz sampling frequency is a special case, as the ratio between 48 and 32 kHz is not an integer. For this 32 kHz case, in order to achieve perfect reconstruction, the ALDO window decimated from 48 to 16 kHz is applied to samples with even indices, while the samples with odd indices are weighted by a complementary window . For the 32 kHz case, the folded frame is given for by:

, (886)

, (887)

, (888)

, (889)

where is the frame length at 16 kHz, is the length of the MDCT core frame at 32 kHz, is the decimation factor and is the offset.

The complementary window of size is stored in ROM. To obtain the window , the ALDO window at 48 kHz is decimated by 3 (with an offset of 1); the resulting decimated window at 16 kHz is treated as if it were obtained from a 32 kHz window decimated by 2 (without any offset), so that half of the samples of the 32 kHz window are known; the other half is obtained by linear interpolation between known samples. The combination of the 16 kHz window obtained by decimation and is such that perfect reconstruction is ensured for the overall 32 kHz window.

5.3.2.2.2 eDCT

The , are transformed to frequency domain by an eDCT, which is built upon a discrete cosine transform type IV (DCTIV) but the eDCT requires less storage and has lower complexity.

The original L-point DCTIV formula is:

(890)

where is the windowed input signal of the current audio frame and is the -th DCT spectral component. This formula can be rewritten as:

(891)

where the values are given by

(892)

and , , and .

Hence, the eDCT is computed using the following steps which lead to less storage and lower complexity:

1) Pre-processing

Apply the twiddle factors to the time domain data , so as to obtain twiddled signal :

(893)

Pre-rotate the twiddled signal by using a symmetric rotation factor:

,

where in the rotation factor may also be expressed in the following form:

(894)

which satisfies conditions of , and , ; therefore, in the implementation, only one data table of cosine values covering L/2 points needs to be stored, and is a constant.

2) Fast Fourier Transform

Perform a Fast Fourier Transform (FFT) of L/2 points on the pre-rotated data .

3) In-place fixed rotation compensation

The FFT data , of is rotated in-place by multiplying with a fixed rotation compensation factor :

(895)

4) Post-processing

The data , is finally post-rotated with a symmetric rotation factor:

, ,

where in the rotation factor may also be expressed in the following form:

(896)

In the specific implementation, only one cosine data table of L/2 values needs to be stored, i.e. , which is the same as in the pre-processing, and is a constant.

5) Obtain frequency domain data

Then the real parts of the post-rotated data are expressed as , , which represent the odd-numbered frequency bins of the frequency domain data; and the frequency-reversed imaginary parts of the post-rotated data are expressed as , , which represent the even-numbered frequency bins of the frequency domain data.

(897)

where , .
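The principle behind steps 1) to 5), namely computing an L-point DCTIV from a single complex transform of L/2 points with a pre-rotation and a post-rotation that share one table, can be illustrated as follows. This is a textbook factorization given as a sketch under stated assumptions, not the eDCT of equations (893) to (897): the DCTIV is taken without scaling, a naive DFT stands in for the FFT, no separate fixed rotation compensation is applied, and the assignment of real and imaginary parts to even and odd bins follows the textbook convention rather than the one described above. All function names are hypothetical.

```python
import cmath
import math


def dct4_direct(x):
    """Unscaled DCT-IV by its definition, O(L^2). ASSUMPTION: no normalization."""
    L = len(x)
    return [sum(x[n] * math.cos(math.pi / L * (n + 0.5) * (k + 0.5))
                for n in range(L))
            for k in range(L)]


def dft(z):
    """Naive complex DFT, standing in for an FFT of len(z) points."""
    M = len(z)
    return [sum(z[m] * cmath.exp(-2j * math.pi * m * k / M) for m in range(M))
            for k in range(M)]


def dct4_via_half_length_dft(x):
    """DCT-IV of even length L using one complex transform of L/2 points."""
    L = len(x)
    if L % 2:
        raise ValueError("L must be even")
    M = L // 2
    # One rotation table serves both the pre- and the post-rotation.
    rot = [cmath.exp(-1j * math.pi * (8 * m + 1) / (8 * L)) for m in range(M)]
    # Pre-processing: pair even-indexed samples with time-reversed odd-indexed ones.
    z = [(x[2 * m] + 1j * x[L - 1 - 2 * m]) * rot[m] for m in range(M)]
    Z = dft(z)
    y = [0.0] * L
    for k in range(M):
        w = Z[k] * rot[k]            # post-rotation
        y[2 * k] = w.real            # even bins from the real parts
        y[L - 1 - 2 * k] = -w.imag   # odd bins from the reversed imaginary parts
    return y


x = [math.sin(0.3 * n) + 0.1 * n for n in range(16)]
error = max(abs(a - b) for a, b in zip(dct4_direct(x), dct4_via_half_length_dft(x)))
print(error < 1e-9)
```

Sharing one rotation table between pre- and post-processing mirrors the storage argument made in the text: only L/2 table entries are needed.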

5.3.2.3 Transient location dependent overlap and transform length

Besides the ALDO window, three more overlap shapes are used for windowing:

  • FULL: 8.75 milliseconds overlap described in subclause 5.3.2.5, used for transition from transform with long ALDO window to short transform
  • HALF: 3.75 milliseconds symmetric sine overlap
  • MINIMAL: 1.25 milliseconds symmetric sine overlap

For long MDCT based TCX (TCX20) transformation that does not follow ACELP, 9 combinations of one overlap shape (ALDO, HALF, MINIMAL) on the left side of the window and another overlap shape (ALDO, HALF, MINIMAL) on the right side of the window are possible. For HQ MDCT, 4 combinations of one overlap shape (ALDO, FULL, HALF, MINIMAL) on the left side of the window and ALDO on the right side of the window are possible.

For short MDCT based TCX transformation (TCX10 or TCX5), 7 of the 9 possible combinations of one symmetric overlap shape (FULL, HALF, MINIMAL) on the left side of the window and another symmetric overlap shape (FULL, HALF, MINIMAL) on the right side of the window are used (see table 80).

The overlap length and the transform length of the TCX are dependent on the existence of a transient and its location and are chosen so that a transient is mostly contained in only one window as shown in table 78.

Table 78: Coding of the overlap and the transform length based on the transient position

| Attack index | Overlap with the first window of the following frame | Short/Long Transform decision (binary coded): 0 – Long, 1 – Short | Binary code for the overlap width | Overlap code |
|---|---|---|---|---|
| none | ALDO | 0 | 0 | 00 |
| -2 | FULL | 1 | 0 | 10 |
| -1 | FULL | 1 | 0 | 10 |
| 0 | FULL | 1 | 0 | 10 |
| 1 | FULL | 1 | 0 | 10 |
| 2 | MINIMAL | 1 | 10 | 110 |
| 3 | HALF | 1 | 11 | 111 |
| 4 | HALF | 1 | 11 | 111 |
| 5 | MINIMAL | 1 | 10 | 110 |
| 6 | MINIMAL | 0 | 10 | 010 |
| 7 | HALF | 0 | 11 | 011 |
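As an illustration, table 78 can be expressed as a lookup in which the overlap code is the short/long bit followed by the overlap width bits, which holds for every row of the table. The sketch below is not normative. The names `TABLE_78` and `overlap_code` are hypothetical, and the `short_allowed` flag is an assumed way of modelling the restriction on the short transform at bitrates below 48 kbps described later in this subclause.

```python
# Table 78: attack index -> (overlap with next frame, short/long bit, width bits).
# None stands for the "none" row (no attack index set).
TABLE_78 = {
    None: ("ALDO", "0", "0"),
    -2: ("FULL", "1", "0"),
    -1: ("FULL", "1", "0"),
    0: ("FULL", "1", "0"),
    1: ("FULL", "1", "0"),
    2: ("MINIMAL", "1", "10"),
    3: ("HALF", "1", "11"),
    4: ("HALF", "1", "11"),
    5: ("MINIMAL", "1", "10"),
    6: ("MINIMAL", "0", "10"),
    7: ("HALF", "0", "11"),
}


def overlap_code(attack_index, short_allowed=True):
    """Return (overlap shape, overlap code) for an attack index.

    ASSUMPTION: when the short transform is not allowed, every row that
    would select it falls back to the ALDO window and overlap code 00.
    """
    overlap, short_bit, width_bits = TABLE_78[attack_index]
    if short_bit == "1" and not short_allowed:
        return "ALDO", "00"
    return overlap, short_bit + width_bits


print(overlap_code(3))
print(overlap_code(2, short_allowed=False))
```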

If the transient detector described in subclause 5.1.8 does not detect a transient, but there is an increase of energy at high frequencies in the current frame relative to the previous frame ( is the high frequency energy for the current frame and is the high frequency energy for the previous frame, as defined in subclause 5.1.2.2):

(898)

then the short transform is chosen, and the overlap depends on the attack index set in the transient detector; HALF overlap is chosen if the transient detector did not set an attack index. The transient detector described in subclause 5.1.8 essentially returns the index of the last attack, with the restriction that if there are multiple transients then MINIMAL overlap is preferred over HALF overlap, which is preferred over FULL overlap. If an attack at position 2 or 6 is not strong enough, then HALF overlap is chosen instead of MINIMAL overlap.

For bitrates below 48 kbps the short transform length is not allowed; an attack index lower than 6 therefore triggers the ALDO window and the overlap code 00.

If the previous frame was coded using ACELP a specific configuration is used as described in subclause 5.4.2.2.

Depending on the right overlap (current overlap code) and the left overlap (previous overlap code) in the current frame the configuration of the transform lengths in the current frame is chosen as presented in table 79.

Table 79: Transform lengths decision table

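| Current overlap code | Previous overlap code 00/10 | Previous overlap code 111/011 | Previous overlap code 110/010 |
|---|---|---|---|
| 00 | TCX20 | TCX20 | TCX20 |
| 10 | 2xTCX5, TCX10 | 2xTCX5, TCX10 | 2xTCX5, TCX10 |
| 111 | 2xTCX5, TCX10 | 4xTCX5 | 4xTCX5 |
| 110 | 2xTCX5, TCX10 | 4xTCX5 | 4xTCX5 |
| 010 | TCX20 | TCX20 | TCX20 |
| 011 | TCX20 | TCX20 | TCX20 |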
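The decision of table 79 reduces to a few comparisons, as the following sketch shows. The function name `transform_lengths` is hypothetical, and the returned strings simply repeat the table entries; the order of the transforms within the frame is given by table 80, not by this function.

```python
VALID_CODES = ("00", "10", "111", "110", "010", "011")


def transform_lengths(prev_code, curr_code):
    """Return the table 79 entry for a previous/current overlap code pair."""
    if prev_code not in VALID_CODES or curr_code not in VALID_CODES:
        raise ValueError("unknown overlap code")
    if curr_code in ("00", "010", "011"):
        return "TCX20"                      # long transform
    if curr_code == "10" or prev_code in ("00", "10"):
        return "2xTCX5, TCX10"
    return "4xTCX5"                         # current 111/110 after a short overlap


print(transform_lengths("00", "111"))
print(transform_lengths("011", "110"))
```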

If the current frame is classified as a transition frame (current overlap code is 10, 111 or 110), the detailed configuration of the transform and overlap lengths is presented in table 80.

Table 80: Transform and overlap length configuration for transient frame

| Curr. code | Prev. code 00/10 | Prev. code 111/011 | Prev. code 110/010 |
|---|---|---|---|
| 10 | FULL,TCX5,MIN.,TCX5,MIN.,TCX10,FULL | HALF,TCX5,MIN.,TCX5,HALF,TCX10,FULL | MIN.,TCX5,MIN.,TCX5,MIN.,TCX10,FULL |
| 111 | FULL,TCX10,HALF,TCX5,MIN.,TCX5,HALF | HALF,TCX5,MIN.,TCX5,HALF,TCX5,MIN.,TCX5,HALF | MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,HALF |
| 110 | FULL,TCX10,MIN.,TCX5,MIN.,TCX5,MIN. | HALF,TCX5,MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,MIN. | MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,MIN. |

In table 79 and table 80, the value 00/10 for previous overlap code indicates that the same configuration is used if the previous overlap code is 00 as well as when it is 10. The meaning of values 111/011 and 110/010 follows the same logic.

As an example, “HALF, TCX5, MIN., TCX5, HALF, TCX10, FULL” from table 80 indicates that the first TCX5 has HALF overlap on the left side and MINIMAL overlap on the right side, followed by the second TCX5 with MINIMAL overlap on the left side and HALF overlap on the right side, followed by TCX10 with HALF overlap on the left side and FULL overlap on the right side. The TCX10 with HALF overlap on the left side and FULL overlap on the right is one example of the window described in subclause 5.3.2.5.2.
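The reading rule for table 80 entries, in which overlaps and transforms alternate and each transform takes the overlap before it as its left overlap and the one after it as its right overlap, can be sketched as below. The names `TABLE_80`, `DURATION_MS` and `parse_configuration` are hypothetical. The durations of 5 ms for TCX5 and 10 ms for TCX10 are assumed from the transform names and the 20 ms frame.

```python
# Table 80: (current code, previous code group) -> configuration string
TABLE_80 = {
    ("10", "00/10"): "FULL,TCX5,MIN.,TCX5,MIN.,TCX10,FULL",
    ("10", "111/011"): "HALF,TCX5,MIN.,TCX5,HALF,TCX10,FULL",
    ("10", "110/010"): "MIN.,TCX5,MIN.,TCX5,MIN.,TCX10,FULL",
    ("111", "00/10"): "FULL,TCX10,HALF,TCX5,MIN.,TCX5,HALF",
    ("111", "111/011"): "HALF,TCX5,MIN.,TCX5,HALF,TCX5,MIN.,TCX5,HALF",
    ("111", "110/010"): "MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,HALF",
    ("110", "00/10"): "FULL,TCX10,MIN.,TCX5,MIN.,TCX5,MIN.",
    ("110", "111/011"): "HALF,TCX5,MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,MIN.",
    ("110", "110/010"): "MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,MIN.,TCX5,MIN.",
}
DURATION_MS = {"TCX5": 5, "TCX10": 10}  # ASSUMPTION: nominal sub-frame durations


def parse_configuration(text):
    """Split an entry into (left overlap, transform, right overlap) triples."""
    tokens = text.split(",")
    overlaps, transforms = tokens[0::2], tokens[1::2]
    return [(overlaps[i], name, overlaps[i + 1]) for i, name in enumerate(transforms)]


for key, entry in TABLE_80.items():
    windows = parse_configuration(entry)
    total_ms = sum(DURATION_MS[name] for _left, name, _right in windows)
    print(key, total_ms)
```

Every entry adds up to one 20 ms frame, begins with the overlap implied by the previous code group and ends with the overlap signalled by the current code.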

5.3.2.4 Short block transformation

5.3.2.4.1 Short window transform in TDA domain

5.3.2.4.1.1 Transient detection

Figure 55: Transient detection algorithm

The block diagram of the transient detection algorithm is shown in figure 55. The input signal is first high-pass filtered; the high-pass filter serves as a precaution against undesired low-frequency components. A first order IIR filter is used, and it is given by:

(899)

The same filter is used for all sampling frequencies.

The output of the high-pass filter is obtained according to:

(900)

where denotes the 20-ms frame length. The high-pass filtered signal is sectioned into four sub-frames, each corresponding to 5 ms, samples each.

The energy of each sub-frame,, is computed according to:

(901)

The signal’s long-term energy corresponding to each sub-frame,, is updated according to the following equation:

(902)

In the above equation, the forgetting factor is set to 0.25, and the convention that from the previous frame is used. The high-pass filter,, states as well as are saved for processing in the next frame.

During switching, when core has changed or extension layer has changed, then is initialized with the energy of.

For each sub-frame , a comparison between the short-term energy and the long-term energy is performed. A transient is detected whenever the energy ratio is above a certain threshold. Formally, a transient is detected whenever:

(903)

where is the energy ratio threshold and is set according to table 81.

Table 81: Threshold table for transient detector

| Extension layer | Condition | Threshold |
|---|---|---|
| Generic Mode | Inactive | 10 |
| Generic Mode | Not Inactive | 13.5 |
| Otherwise | SWB, | |
| Otherwise | Otherwise | 6 |

For the first coded frame with an extension layer using the Generic mode, the transient detection result obtained from equation (903) is set to 0.

In general, the time-frequency transform is applied on a 40-ms frame; therefore, a transient will affect two consecutive frames. To overcome this, a hangover for detected transient is used. A transient detected at a certain frame will also trigger a transient at the next frame.

The output of the transient detector is a flag, denoted IsTransient. The flag is set to the logical value TRUE if a transient is detected, or FALSE otherwise.

If there is a change in extension layer, then the hangover and filter memories are reset before the start of the transient detector.
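The overall flow of the detector (high-pass filtering, four sub-frame energies, long-term energy tracking, threshold comparison and one-frame hangover) can be sketched as follows. This is an illustration under several assumptions, not a transcription of equations (899) to (903). The filter coefficients `b0` and `a1` are placeholders, the long-term energy update form and its initialization from the first sub-frame are assumed, and the names `DetectorState` and `is_transient` are hypothetical. Only the forgetting factor of 0.25, the four sub-frames and the hangover behaviour are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DetectorState:
    x_prev: float = 0.0            # high-pass filter input memory
    y_prev: float = 0.0            # high-pass filter output memory
    e_lt: Optional[float] = None   # long-term energy, None until initialized
    hangover: bool = False         # transient detected in the previous frame


def is_transient(frame, state, threshold, alpha=0.25, b0=0.75, a1=0.5):
    """Return the IsTransient flag for one frame and update the state."""
    # ASSUMPTION: placeholder first-order high-pass, y = b0*(x - x_prev) + a1*y_prev.
    filtered = []
    for x in frame:
        y = b0 * (x - state.x_prev) + a1 * state.y_prev
        state.x_prev, state.y_prev = x, y
        filtered.append(y)

    sub_len = len(frame) // 4
    detected = False
    for b in range(4):
        energy = sum(v * v for v in filtered[b * sub_len:(b + 1) * sub_len])
        if state.e_lt is None:
            state.e_lt = energy            # ASSUMPTION: simple initialization
        if energy > threshold * state.e_lt:
            detected = True
        # ASSUMPTION: recursive update with forgetting factor alpha.
        state.e_lt = alpha * state.e_lt + (1.0 - alpha) * energy

    flag = detected or state.hangover      # hangover: previous frame's transient
    state.hangover = detected
    return flag


state = DetectorState()
quiet = [0.01 * (-1) ** i for i in range(80)]
onset = [(0.01 if i < 40 else 1.0) * (-1) ** i for i in range(80)]
loud = [1.0 * (-1) ** i for i in range(80)]
print([is_transient(f, state, threshold=6.0) for f in (quiet, onset, loud, loud)])
```

In this toy sequence the onset frame is flagged, the following frame is flagged by the hangover alone, and the flag clears once the signal is stationary again.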

Figure 56: Transient detection algorithm for NB

The block diagram of the transient detection algorithm for NB is shown in figure 56. Most of the algorithm is the same as for WB and SWB. In the Compute block energy block, the energy of each sub-frame for ,, is computed according to:

(904)

where EHP(0) is the energy of the 4th sub-frame in the previous frame.

The Transient information refinement block performs an additional verification and checks whether a frame that has been determined to be transient is truly a transient frame. This prevents a transient determination error caused by the high-pass filtering block removing energy at low frequencies. The operation of the Transient information refinement block is described for the case where the current frame is detected to be transient.

Figure 57: Transient information refinement for NB

The position detected as a transient is one of the 4 sub-frames in frame n. The number of blocks included in region L and the number of blocks included in region H vary depending on which of these 4 sub-frames is detected as the transient, as shown in figure 57.

First, the ratio of the average of the short-term energy in the region H (E_high) to the average of the short-term energy in the region L (E_low) is calculated. Region H includes the block which was previously detected to be transient and blocks existing thereafter. Region L includes the blocks prior to region H.

Next, the ratio of the short-term energy of the frame n before the high pass filtering (E_in) to the short-term energy of the frame n after the high pass filtering (E_out) is calculated.

If these ratios meet the following conditions, frame n is determined to be a normal frame instead of a transient frame.

if ( (E_high/E_low < 2.0f) && (E_high/E_low > 0.7f) && (E_in/E_out > Thres) )
    IsTransient = 0;

5.3.2.4.1.2 Short window transform

When the flag IsTransient is set then the transform is switched from a long transform to a series of short transforms as is depicted in figure 58. This subclause describes the short transform.

Figure 58: Adaptive time-frequency transform

Windowing and time-domain aliasing are done as described in subclause 5.3.2.2. The time-domain aliased signal is then reordered: in order to keep the temporal coherence of the input signal, the output of the time-domain aliasing operation needs to be reordered before further processing. This reordering is necessary because, without it, the basis functions of the resulting filter bank would have incoherent time and frequency responses. The reordering is done according to equation (905), and consists of shuffling the upper and lower halves of the TDA output signal .

(905)

This reordering is only conceptual and in reality no computations are involved. Higher time resolution is obtained by zero-padding the signal and dividing the resulting signal into four overlapped equal length sub-frames. The amount of zero-padding is equal to on each side of the signal. The segments are 50% overlapped and each segment has a length equal to. The two inner segments are post-windowed using a sine window of length. The windows for outer segments are constructed using half a sine window.

This operation is depicted in figure 59; each resulting post-windowed segment is further processed by applying the modified discrete cosine transform (MDCT) (i.e., time aliasing + DCTIV). The output of the MDCT for each segment represents the signal spectrum at different time instants, thus, in case of transients, a higher time resolution is used.

The length of the output of each of the four MDCTs is half the length of the input segment, i.e.,, therefore this operation does not add additional redundancy.
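The segment geometry can be deduced from the constraints stated above: four equal-length segments with 50% overlap, each MDCT producing half as many coefficients as its input segment, and no added redundancy. The sketch below works this out for a TDA output of length L; the resulting values (segment length L/2, zero-padding L/8 per side) are derived from those constraints and are an assumption as far as the symbols of this subclause are concerned. The function names are hypothetical and the post-windowing is not included.

```python
def short_segment_layout(L):
    """Return (pad, seg_len, starts) for four 50% overlapped segments."""
    if L % 8:
        raise ValueError("L must be a multiple of 8")
    seg_len = L // 2        # four MDCTs of seg_len/2 outputs give L coefficients
    hop = seg_len // 2      # 50% overlap between consecutive segments
    span = seg_len + 3 * hop
    pad = (span - L) // 2   # zeros needed on each side of the signal
    starts = [i * hop for i in range(4)]
    return pad, seg_len, starts


def split_into_segments(signal):
    """Zero-pad the signal and cut it into the four overlapped segments."""
    pad, seg_len, starts = short_segment_layout(len(signal))
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [padded[s:s + seg_len] for s in starts]


pad, seg_len, starts = short_segment_layout(960)
print(pad, seg_len, starts)
```

For a 20 ms frame at 48 kHz (L = 960) this gives 120 zeros on each side and four segments of 480 samples, so the four transforms together deliver 960 coefficients.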

Figure 59: Higher time resolution, division into four segments

5.3.2.4.1.3 Interleaving

After the short transform described in subclause 5.3.2.4.1.2, the coefficients of the four equivalent 5-ms transforms are interleaved. Table 82 shows the band lengths used for interleaving. For each interleave band, coefficients covering the interleave band length are read from the first transform, , then the second transform, , and so on until the fourth transform. This is repeated for all interleave bands. The interleaved bands are written to the vector .

Table 82: Band widths for transient mode interleaving

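| Band | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| WB | 16 | 16 | 16 | 16 | 16 | | | | |
| SWB | 16 | 16 | 16 | 16 | 24 | 24 | 24 | 24 | |
| FB | 16 | 16 | 16 | 16 | 24 | 24 | 24 | 32 | 32 |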
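The interleaving described above can be sketched as follows, with the band lengths of table 82 held as lists. The names `BAND_LENGTHS` and `interleave` are hypothetical, and each transform is represented as a plain list of coefficients.

```python
# Table 82: interleave band lengths per bandwidth
BAND_LENGTHS = {
    "WB": [16, 16, 16, 16, 16],
    "SWB": [16, 16, 16, 16, 24, 24, 24, 24],
    "FB": [16, 16, 16, 16, 24, 24, 24, 32, 32],
}


def interleave(transforms, band_lengths):
    """Interleave the coefficients of the short transforms band by band."""
    out = []
    start = 0
    for width in band_lengths:
        # For this band, read `width` coefficients from each transform in turn.
        for coeffs in transforms:
            out.extend(coeffs[start:start + width])
        start += width
    return out


# Toy input: coefficient k of transform t is labelled (t, k).
transforms = [[(t, k) for k in range(80)] for t in range(4)]
vector = interleave(transforms, BAND_LENGTHS["WB"])
print(len(vector), vector[0], vector[16], vector[64])
```

With the WB band lengths, the output starts with 16 coefficients of the first transform, followed by the same 16 positions of the second, third and fourth transforms, before moving on to the next band.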

5.3.2.4.2 Short window transform for MDCT based TCX

After choosing the transform and overlap length configuration for a transient frame as described in subclause 5.3.2.3, each sub-frame (TCX5 or TCX10) is windowed and transformed using the MDCT, which is implemented using TDA and DCTIV.

After the TNS described in subclause 5.3.3.2.2, the MDCT bins of 2 TCX5 sub-frames are interleaved:

(906)

5.3.2.5 Special window transitions

The overlap mode FULL is used for transitions between long ALDO windows and short symmetric windows (HALF, MINIMAL) for short block configurations (TCX10, TCX5) described in subclause 5.3.2.3. The transition window is derived from the ALDO and sine window overlaps.

5.3.2.5.1 ALDO to short transition

The left overlap of the transition window has a length of 8.75 ms (FULL overlap) and is derived from the ALDO window. The left slope of the ALDO analysis window (long slope) is first shortened to 8.75 ms by removing the first 5.625 ms. Then the first 1.25 ms of the remaining slope is multiplied by the 1.25 ms MINIMAL overlap slope to smooth the edge. The resulting 8.75 ms slope is depicted in figure 60.

For the right part of the transition window HALF or MINIMAL overlap is used.
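The construction of the FULL overlap slope can be sketched as below. The sketch takes the long ALDO slope as an input list, because the ALDO coefficients themselves are not reproduced here. The sine slope formula used for the MINIMAL overlap is an assumption (a generic rising sine sampled at half-sample offsets), and the names `sine_slope` and `full_overlap_slope` are hypothetical. In the usage lines a sine slope merely stands in for the real ALDO slope.

```python
import math


def sine_slope(n):
    """ASSUMPTION: generic rising sine slope of n samples."""
    return [math.sin(math.pi * (i + 0.5) / (2 * n)) for i in range(n)]


def full_overlap_slope(aldo_long_slope, fs_hz=48_000):
    """Derive the 8.75 ms FULL overlap slope from the long ALDO slope."""
    drop = 5_625 * fs_hz // 1_000_000     # first 5.625 ms are removed
    smooth = 1_250 * fs_hz // 1_000_000   # next 1.25 ms are smoothed
    slope = list(aldo_long_slope[drop:])
    for i, weight in enumerate(sine_slope(smooth)):
        slope[i] *= weight                # multiply by the MINIMAL overlap slope
    return slope


stand_in = sine_slope(690)                # placeholder for the 14.375 ms ALDO slope
slope = full_overlap_slope(stand_in)
print(len(slope))
```

At 48 kHz the 690-sample long slope loses its first 270 samples, leaving 420 samples (8.75 ms), of which the first 60 are attenuated.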

Figure 60: Window shape transition ALDO to Short

5.3.2.5.2 Short to ALDO transition

HALF or MINIMAL overlap is used for the left part of the transition window.

For the right part of the transition window, the right slope of the ALDO analysis window (short slope) is used, which has a length of 8.75 ms (see figure 61).

Figure 61: Window shape transition Short to ALDO

Modified Discrete Sine Transform

The MDST is computed in the same way as the MDCT defined in the previous subclauses, with two exceptions:

  • The TDAC equation is described by:

(907)

  • The DCTIV given by equation (890) is replaced by a DSTIV:

(908)

The MDST transformation always follows the MDCT parameters such as transform size or windowing.
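Since the MDST shares transform size and windowing with the MDCT, a DSTIV can reuse a DCTIV routine: the DSTIV of a block equals the DCTIV of the time-reversed block with the sign of every odd-indexed output inverted. The sketch below checks this identity numerically. It assumes unscaled transforms, which may differ from the normalization of equations (890) and (908), and the function names are hypothetical.

```python
import math


def dct4(x):
    """Unscaled DCT-IV by its definition. ASSUMPTION: no normalization."""
    L = len(x)
    return [sum(x[n] * math.cos(math.pi / L * (n + 0.5) * (k + 0.5))
                for n in range(L))
            for k in range(L)]


def dst4(x):
    """Unscaled DST-IV by its definition. ASSUMPTION: no normalization."""
    L = len(x)
    return [sum(x[n] * math.sin(math.pi / L * (n + 0.5) * (k + 0.5))
                for n in range(L))
            for k in range(L)]


def dst4_via_dct4(x):
    """DST-IV computed as a sign-alternated DCT-IV of the reversed input."""
    y = dct4(x[::-1])
    return [(-1) ** k * value for k, value in enumerate(y)]


x = [math.cos(0.7 * n) - 0.05 * n for n in range(12)]
error = max(abs(a - b) for a, b in zip(dst4(x), dst4_via_dct4(x)))
print(error < 1e-9)
```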