6.2.4 Frequency-to-time transformation

3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description (Release 15)

6.2.4.1 Long block transformation (ALDO window)

6.2.4.1.1 eDCT

The IDCTIV is identical to the DCTIV and is given by the following equation, with omitted normalization:

(1881)
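As an informal illustration (not part of the normative text), the following Python sketch assumes the standard unnormalized DCT-IV kernel cos(π/L · (n + ½)(k + ½)). The function names and the explicit 2/L scale factor are assumptions introduced here; the exact symbols of equation (1881) are not reproduced. The sketch checks numerically that applying the same transform twice returns the input up to that scale factor, which is why the inverse can reuse the forward routine.

```python
import math

def dct_iv(x):
    """Unnormalized DCT-IV: y[k] = sum_n x[n] * cos(pi/L * (n + 1/2) * (k + 1/2))."""
    L = len(x)
    return [
        sum(x[n] * math.cos(math.pi / L * (n + 0.5) * (k + 0.5)) for n in range(L))
        for k in range(L)
    ]

def idct_iv(y):
    """Same kernel as dct_iv; the 2/L factor restores the omitted normalization."""
    L = len(y)
    return [2.0 / L * v for v in dct_iv(y)]

x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 3.0]
roundtrip = idct_iv(dct_iv(x))
```

A direct O(L²) evaluation is used only for readability; a practical decoder would use a fast algorithm.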

6.2.4.1.2 Unfolding and windowing

The frame , , coming from the inverse eDCT transform is unfolded in order to obtain two frames that can be used for overlap-add with the previous unfolded frame, which removes the aliasing introduced by the folding process at the encoder.

As with the folding done at the encoder, the unfolding and window decimation operations are combined in a single process, which automatically resamples the ALDO windows at 48 and 25.6 kHz while keeping the perfect reconstruction conditions. The decimation factor and offset parameters are the same as those used in the encoder.

The frame issued from the eDCT inverse transform is unfolded into a block of length . The ALDO window is stored at a sampling rate corresponding to two frames of length (). The ratio between and is called the decimation factor (). The unfolding and windowing process is illustrated in figure 109.

Figure 109: Unfolding and windowing with ALDO window.

The unfolded frame is obtained for :

, (1882)

, (1883)

, (1884)

, (1885)

where is the time-reversed version of the ALDO window used in the encoder

, (1886)

is the decimation factor and is the offset.

For the 32 kHz case, to preserve perfect reconstruction, the ALDO window decimated from 48 to 16 kHz is applied to every second sample; the other samples are weighted by a complementary window . For this 32 kHz case, the unfolded frames are given by:

, (1887)

, (1888)

, (1889)

, (1890)

, (1891)

, (1892)

, (1893)

, (1894)

where is the length of the 16 kHz frame, is the length of the MDCT core frame at 32 kHz, is the decimation factor and is the offset.

6.2.4.1.3 Overlap-add

Finally, the output full-band signal is constructed by overlap-adding the signals for two successive frames:

()
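The aliasing cancellation achieved by overlap-add can be checked numerically with a generic MDCT. The following sketch is an assumption-laden illustration only: it uses a symmetric sine window instead of the ALDO window, no decimation, a direct (slow) MDCT/IMDCT pair, and function names chosen here. It shows that adding the first half of the current inverse-transformed block to the stored second half of the previous block reconstructs the input.

```python
import math

def sine_window(N):
    # Symmetric window of length 2N satisfying w[n]^2 + w[n+N]^2 = 1.
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def _kernel(n, k, N):
    return math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

def mdct(block, w):
    N = len(block) // 2
    return [sum(w[n] * block[n] * _kernel(n, k, N) for n in range(2 * N))
            for k in range(N)]

def imdct(X, w):
    N = len(X)
    return [2.0 / N * w[n] * sum(X[k] * _kernel(n, k, N) for k in range(N))
            for n in range(2 * N)]

def overlap_add(spectra, w):
    N = len(spectra[0])
    memory = [0.0] * N                 # second half of the previous block
    out = []
    for X in spectra:
        y = imdct(X, w)
        out.extend(y[n] + memory[n] for n in range(N))
        memory = y[N:]                 # kept for the next frame
    return out

N = 8
w = sine_window(N)
signal = [math.sin(0.3 * n) + 0.1 * n for n in range(5 * N)]
padded = [0.0] * N + signal
spectra = [mdct(padded[i * N:i * N + 2 * N], w) for i in range(5)]
output = overlap_add(spectra, w)       # output[N:] matches signal[:4 * N]
```

The last N input samples are not reconstructed because their aliasing partner (the following block) has not been processed yet; this corresponds to the memory carried to the next frame.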

6.2.4.1.4 Pre-echo attenuation

A typical artefact in transform coding, known as pre-echo, is observed especially when the signal energy grows suddenly, as in speech onsets or percussive music. Its origin is as follows. The quantization noise in the frequency domain is translated into the time domain by the inverse MDCT transform and the overlap-add operation, so the quantization noise is spread uniformly over the MDCT synthesis window. In case of an onset, the part of the input signal preceding the onset often has a very low energy compared to the energy of the onset part. Since the quantization noise level depends on the mean energy of the frame, it can be quite high over the whole synthesis window. In this case, the signal-to-noise ratio (SNR) is very low (often negative) in the low-energy part, and the quantization noise can be audible before the onset as an extra artificial signal called pre-echo.

To prevent the pre-echo artefact, an attenuation scheme is necessary when there is a significant energy increase (attack or onset) in some part of the synthesis window, and the pre-echo reduction has to be performed in the low-energy part of the synthesis window preceding the onset. In the following, this low-energy part preceding an onset is referred to as the "pre-echo zone". On the other hand, the signal energy after pre-echo reduction should not be lower than the mean energy in the preceding frames. However, if the preceding frames have a low-frequency spectrum, then, because the pre-echo often has a white-noise-like spectrum, the pre-echo is still audible in the higher frequencies even if the energy of the pre-echo zone is reduced to the level of the previous frames.

To improve the pre-echo reduction, an adaptive spectral shaping filter is applied in the pre-echo zone up to the detected attack or onset to eliminate undesirable higher-frequency pre-echo noise. This adaptive spectral shaping filter is realized by a two-band filterbank: the decoded signal is decomposed into two sub-signals according to a frequency criterion to obtain two sub-bands, and a pre-echo attenuation factor is calculated in the determined pre-echo zone for each sample in both sub-bands. The attenuation factors of the sub-bands, which determine the spectral response of the filter, are computed as a function of several parameters of the full-band and sub-band signals as detailed below. The pre-echo attenuation is made in the sub-bands by applying these attenuation factors in the pre-echo zone. Finally, the two attenuated sub-bands are combined to obtain the pre-echo attenuated decoded signal. The pre-echo attenuation is activated for received frames when the previous frame was also received and when the bitrate is not higher than 32 kbit/s.

A pre-echo in the current frame can be caused by a sharp onset in the current or the next frame, as the MDCT analysis window covers these two consecutive frames. An onset in the next frame can be detected by analysing the memory of the inverse MDCT that will be used in the next frame in the overlap-add operation. The discrimination of pre-echo/non-pre-echo zones and the attenuation factor computation are based on two signals of the inverse MDCT transform: the decoded output full-band signal , and the first un-windowed memory of the inverse MDCT , that will be used in the overlap-add operation to synthesize the output content of the next frame. The pre-echo reduction is done in the pre-echo zones preceding the onsets.

Decomposition in two sub-bands

The decoded signal is decomposed into a lower and an upper frequency band sub-signal. These signals are computed by applying an adaptive zero-delay FIR filter with transfer function in low-band, with = 0.25 in the current frame and 0 otherwise; the high-band is given by the complementary filter. The first, lower band sub-signal is obtained by filtering the full-band signal with the low-pass filter

(1896)

and the second, higher band sub-signal is obtained by subtracting the lower band sub-signal from the decoded signal:

()

For the memory part only the higher-band component is computed as

()
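The complementary two-band split can be sketched as follows. Since the transfer function is not legible in this text, the sketch assumes a symmetric three-tap kernel (α, 1 − 2α, α), which is one zero-delay FIR low-pass consistent with the description (with α = 0 it reduces to the identity). The kernel shape, the edge handling by sample replication, and the function name are all assumptions.

```python
def split_two_bands(x, alpha=0.25):
    """Return (low, high) such that low[i] + high[i] == x[i] for every sample."""
    n = len(x)
    low = []
    for i in range(n):
        prev = x[i - 1] if i > 0 else x[i]        # assumed: replicate edge sample
        nxt = x[i + 1] if i < n - 1 else x[i]     # assumed: replicate edge sample
        low.append(alpha * prev + (1.0 - 2.0 * alpha) * x[i] + alpha * nxt)
    # The higher band is the complement of the lower band.
    high = [xi - li for xi, li in zip(x, low)]
    return low, high

flat_low, flat_high = split_two_bands([3.0] * 6)                 # constant input
alt_low, alt_high = split_two_bands([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```

With α = 0.25 a constant input stays entirely in the lower band, while an input alternating in sign at every sample ends up in the higher band (away from the frame edges).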

Discrimination procedure of pre-echo/non-pre-echo zones

The discrimination procedure between pre-echo zones and non-pre-echo zones is based on the concatenated signal formed from , and , . This signal is divided in sub-blocks and its temporal envelope is computed.

The current frame part of the concatenated signal, , is divided into sub-blocks of samples where =8 (2.5 ms sub-blocks). The temporal envelope of this signal is computed as successive sub-block energies.

, ()

The memory part of the concatenated signal forms one sub-block, its energy is computed as

()
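A minimal sketch of the temporal envelope computation is given below. It assumes that each sub-block energy is a plain sum of squared samples without scaling, and that the memory part contributes one additional sub-block at the end; the function names are hypothetical.

```python
def subblock_energies(frame, num_subblocks=8):
    """Energy (sum of squares) of each of the equal-length sub-blocks of a frame."""
    size = len(frame) // num_subblocks
    return [sum(s * s for s in frame[k * size:(k + 1) * size])
            for k in range(num_subblocks)]

def envelope_with_memory(frame, memory, num_subblocks=8):
    """Envelope of the concatenated signal: frame sub-blocks, then the memory block."""
    env = subblock_energies(frame, num_subblocks)
    env.append(sum(s * s for s in memory))
    return env

frame = [1.0] * 14 + [10.0, 10.0]      # toy frame: onset in the last sub-block
env = envelope_with_memory(frame, [3.0, 4.0])
```

For a 20 ms frame, eight sub-blocks correspond to the 2.5 ms sub-block duration stated above.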

The energies of the first half and of the first ¾ of the samples of each sub-block of the current frame are also memorized:

, ()

, ()

The temporal envelope of the higher band in the current frame is also computed:

, ()

Then , is modified as follows:

()

In this paragraph, index is used for samples, and index is used for sub-blocks.

In the concatenated signal the sub-block with maximal energy, including the memory sub-block, is also searched:

()

The transition of the temporal envelope to a high-energy zone is detected in the sub-block with the index given by:

()

Note that when =0 either no pre-echo attenuation is made or the pre-echo attenuation of the previous frame is finished on the first samples of the current frame.

The zero-crossing rate , is also computed for each sub-block. A zero-crossing is detected when the product of two consecutive samples is less than or equal to 0. The parameter , is defined as the number of times the following condition is satisfied:

, ()

The zero crossings between two consecutive sub-blocks count for the next sub-block. The zero crossing rate of the memory part is also computed.
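The counting rule can be sketched as below. The sketch assumes that the sample preceding the frame is supplied by the caller (`prev_sample`, a hypothetical parameter) and that a crossing is attributed to the sub-block containing the later of the two samples, so that a crossing at a sub-block border counts for the next sub-block.

```python
def zero_crossings_per_subblock(frame, prev_sample, num_subblocks=8):
    """Count sample pairs whose product is <= 0, per sub-block."""
    size = len(frame) // num_subblocks
    counts = [0] * num_subblocks
    last = prev_sample
    for n, s in enumerate(frame[:size * num_subblocks]):
        if last * s <= 0:
            counts[n // size] += 1     # attributed to the sub-block of sample n
        last = s
    return counts

counts = zero_crossings_per_subblock(
    [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0], prev_sample=1.0, num_subblocks=2)
```

Note that, because the condition includes equality, runs of exactly zero-valued samples are counted as crossings at every sample.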

The maximum length without zero crossing , is also stored for each sub-block. A period without zero crossing that covers a sub-block border is taken into account for the previous sub-block.

The maximal energy is compared to that of the preceding sub-blocks:

, ()

The low-energy sub-blocks preceding the sub-block in which a transition has been detected, with > 16, are determined as the pre-echo zone. However, in the following cases the sub-block is considered as a non-pre-echo zone:

if

or

if and

or

if and

where, computed in the previous frame and memorised,

()

and = 10 for narrowband signals and 16 otherwise.

Even the previous sub-blocks are considered as non-pre-echo zone if their energy is higher than .

The pre-echo attenuation of low energy sub-block determined as pre-echo zone is made by multiplying the two sub-band signals, the lower band and the higher band by attenuation factors and respectively, where and are determined as a function of the temporal envelope of the concatenated signal ,.

For each sample of the pre-echo zone sub-blocks, these gains are set to 0.01 if > 32 and to 0.1 otherwise. For the other sub-blocks, the initial gains are set to 1, they form the non-pre-echo zone. Following this is set as the index of the first non-echo sub-block (where the initial gain is equal to 1).
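The thresholds 16 and 32 above can be illustrated with the following simplified sketch. It assumes that the compared quantity is the ratio of the maximal sub-block energy to the energy of the sub-block under test, and it treats the maximum-energy sub-block as the transition sub-block. The exception rules listed above (zero-crossing and previous-frame conditions), the false alarm check and the minimal gain values are deliberately not modelled; all names are hypothetical.

```python
def initial_gains(env, zone_ratio=16.0, strong_ratio=32.0):
    """env: sub-block energies of the concatenated signal, memory sub-block last.

    Returns (gains, first_non_echo) for the current-frame sub-blocks only.
    """
    k_max = max(range(len(env)), key=lambda k: env[k])
    max_en = env[k_max]
    gains = []
    for k, en in enumerate(env[:-1]):
        ratio = max_en / en if en > 0 else float("inf")
        if k < k_max and ratio > zone_ratio:
            gains.append(0.01 if ratio > strong_ratio else 0.1)   # pre-echo zone
        else:
            gains.append(1.0)                                     # non-pre-echo zone
    first_non_echo = next((k for k, g in enumerate(gains) if g == 1.0), len(gains))
    return gains, first_non_echo

gains, first_non_echo = initial_gains([1, 1, 1, 1, 1, 1, 40, 100, 5])
```

In this toy envelope the first six sub-blocks are more than 32 times weaker than the maximum and receive the gain 0.01, while the sub-block just before the maximum is within a factor of 16 and is left unattenuated.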

A false alarm detection is made at this point. If the last pre-echo attenuation gain in the previous frame is higher than 0.5 and in the current frame only one sub-block has attenuation gain of 0.1 and the other gains are 0, is set to 0.

The initial pre-echo attenuation gains also depend on the energy of the previous frame: a minimal attenuation value for each pre-echo zone sub-block and for both sub-bands is also fixed as a function of the temporal envelope of the reconstructed signal of the previous frame. This value is fixed in such a way that the attenuated sub-block energy in the sub-band cannot be lower than the pre-echo attenuation gain compensated mean energy of the previous frame in that sub-band, in order to preserve the background noise energy. In the lower band:

()

for and where was computed in the previous frame as:

()

However is set to 1 if or .

In a similar way the initial pre-echo attenuation gain for the higher band signal is computed as:

()

for where

()

Note that the initial attenuation gains in both the lower band and the higher band are identical for all samples of a sub-block.

Before applying the pre-echo attenuation gains, the position of the onset is refined. If the onset was detected in the current frame, the sub-blocks from index to are each divided into sub-sub-blocks, where =4 if the sampling frequency is 8 kHz and =8 otherwise. If = 0 only the first sub-block is considered.

The energy of these sub-sub-blocks is computed:

, ()

where

()

and

()

When the onset was detected in the future memory part , only the first samples are examined and and

, ()

The maximum of these values is searched:

, ()

The values are compared to adaptive thresholds. The first one is independent of the sub-sub-block index:

()

The second one is computed as:

()

where

()

If and this value is modified as:

()

Initially the starting position of the onset for both the lower and the higher band is the beginning of the sub-block . This position is delayed by samples by sub-sub-blocks as long as . The pre-echo attenuation gain of these samples moved from the non-pre-echo zone to the pre-echo zone is set equal to the gain of the last sample of the original pre-echo zone, and respectively in the 2 sub-bands. In the following, these new samples in the pre-echo zone are considered as part of the last pre-echo zone sub-block (index ); the length of this sub-block can therefore be longer than .

To avoid false pre-echo detection, the energies of the last 2 or 3 sub-blocks preceding the onset are verified for both the full-band and the high-band signals: the regression coefficient for these sub-block energies is computed by the least squares estimation technique and compared to thresholds. If at least one regression coefficient is lower than its threshold, the pre-echo attenuation is inhibited. In effect, it is checked whether the sub-blocks preceding the onset have stable or increasing energy, which is always true for pre-echoes. For easy comparison to the threshold, the regression coefficients are normalised by the sub-band energies when the threshold is different from 0. If the threshold is 0, only the sign of the regression coefficient is checked and no normalisation is needed.

When the onset is detected in the first or second sub-block this verification is not possible.

When the onset is detected in the third sub-block only the high-band regression coefficient is computed and compared to the threshold . As only the sign is checked here no normalization is needed for the regression coefficient:

If < the pre-echo attenuation in the pre-echo zone is inhibited.

When the onset is detected in the fourth or later sub-block both the full-band regression coefficient and the normalized high-band regression coefficient are computed on the last 3 sub-blocks preceding the onset and they are compared to the thresholds and respectively. Let’s note the index of the sub-block where the onset is detected .

In the full-band only the sign is checked, no normalization of the regression coefficient is needed:

In the higher band the normalized regression coefficient is estimated as:

The comparison is equivalent to :

If the < or < the pre-echo attenuation in the pre-echo zone is inhibited.
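The stability check can be illustrated with an ordinary least-squares slope over the sub-block energies preceding the onset. This sketch covers only the sign check against a threshold of 0; the normalisation used for non-zero thresholds and the actual threshold values are not reproduced, and the function names are hypothetical.

```python
def ls_slope(values):
    """Least-squares slope of values plotted against their indices 0..len-1."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def inhibit_attenuation(energies, threshold=0.0):
    """True when the energies before the onset are decreasing (slope below threshold)."""
    return ls_slope(energies) < threshold

rising = inhibit_attenuation([1.0, 2.0, 3.0])     # energy increases towards the onset
falling = inhibit_attenuation([4.0, 1.0, 1.0])    # energy decays: not a pre-echo
```

For three equally spaced points the slope reduces to half the difference between the last and the first value, so only the outer two energies decide the sign.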

The pre-echo attenuation functions and are stair-like: the gain is constant within a sub-block. To avoid annoying noise due to this discontinuity, the final pre-echo attenuation gain for the lower band is obtained by linear smoothing of the initial pre-echo attenuation gain, introducing intermediate levels between the gains of consecutive sub-blocks. For narrowband signals = 20; for other bandwidths = 4. This smoothing is done before the detected onset position and at the beginning of each sub-block. For the first sub-block, the smoothing is done between the memorized last gain value of the previous frame and the gain of the first sub-block of the current frame. If the onset position is detected in the next frame, no smoothing is done at the end of the frame; it is done at the beginning of the next frame instead. For example, at the beginning of sub-block , and if the gains determined for the sub-blocks and are and respectively, for wideband signals (= 4) the gains are smoothed in the following way:

Before smoothing:

index

gain

After smoothing:

index

gain

In the higher band no smoothing is necessary, .
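Because the before/after tables are not legible here, the following sketch shows only one plausible reading of the linear smoothing: the first samples of each sub-block ramp from the previous gain to the new gain through a given number of equally spaced intermediate levels (4 in the wideband example). The placement of the ramp, the step size 1/(levels + 1), the omission of the special handling around the onset, and the function name are assumptions.

```python
def smooth_gains(gains, subblock_len, prev_gain, levels=4):
    """Expand per-sub-block gains to per-sample gains with a linear ramp at each start."""
    out = []
    last = prev_gain                      # memorized last gain of the previous frame
    for g in gains:
        block = [g] * subblock_len
        for i in range(min(levels, subblock_len)):
            # Intermediate level i + 1 out of levels + 1 steps between last and g.
            block[i] = last + (i + 1) / (levels + 1) * (g - last)
        out.extend(block)
        last = g
    return out

per_sample = smooth_gains([0.1, 1.0], subblock_len=8, prev_gain=0.1, levels=4)
```

In this example the second sub-block starts with the intermediate gains 0.28, 0.46, 0.64 and 0.82 before reaching 1.0, instead of jumping directly from 0.1 to 1.0.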

In both sub-bands, the pre-echo is attenuated in the pre-echo zone by applying these gains to the sub-signals:

()

()

The final pre-echo attenuated synthesized signal is obtained by combining the two attenuated sub-signals:

()
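The final step can be sketched as a per-sample weighted sum of the two sub-signals. The function and argument names below are hypothetical; the sketch simply assumes one gain per sample and per sub-band, as described above.

```python
def attenuate_and_combine(low, high, g_low, g_high):
    """Apply per-sample gains to both sub-bands and add the results."""
    return [gl * l + gh * h for l, h, gl, gh in zip(low, high, g_low, g_high)]

low = [0.5, 0.5, 0.5, 0.5]
high = [0.25, -0.25, 0.25, -0.25]
unchanged = attenuate_and_combine(low, high, [1.0] * 4, [1.0] * 4)
shaped = attenuate_and_combine(low, high, [1.0] * 4, [0.0] * 4)
```

With unit gains the two sub-bands add back to the decoded signal, since the higher band was defined as the complement of the lower band; attenuating the higher band more strongly than the lower band gives the spectral shaping effect.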

6.2.4.2 Transient location dependent overlap and transform length

The configuration for the overlap and transform lengths depends on the overlap code for the current frame and on the overlap code for the previous frame as described in subclause 5.3.2.3, where the overlap code is obtained as described in subclause 6.2.2.2.1.

6.2.4.3 Short block transformation

6.2.4.3.1 Short window transform in TDA domain

This processing is done when the Transient mode is selected. The spectrum is first de-interleaved into four spectra with m = 0,…,3. This operation is the inverse of the interleaving performed in the encoder, see subclause 5.3.2.4.1.3.

The four spectra corresponding to short 5-ms transforms are first transformed to the time-aliased domain using the short inverse DCTIV transforms. The obtained signals are denoted and are each of a length.

The obtained time-domain aliased signals are further expanded into the time domain by using the inverse time-domain aliasing operation. This operation can be seen as a pseudo-inverse of the matrix used in equation (8) (with replaced by).

Formally, this is performed according to:

()

The length of the resulting signal for each sub-frame index is equal to double the length of the input spectrum, i.e., .
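The relation between time-domain aliasing and its inverse can be illustrated with a generic folding/unfolding pair. The sign and ordering convention below is the common textbook one for a block split into four quarters (a, b, c, d); it is an assumption and may differ from the matrix of equation (8). The function names are hypothetical.

```python
def fold(block):
    """Time-domain aliasing: 2N samples (a, b, c, d) -> N samples (-c_r - d, a - b_r)."""
    q = len(block) // 4
    a, b, c, d = (block[i * q:(i + 1) * q] for i in range(4))
    u = [-cr - dv for cr, dv in zip(reversed(c), d)]
    v = [av - br for av, br in zip(a, reversed(b))]
    return u + v

def unfold(folded):
    """Inverse time-domain aliasing: N samples (u, v) -> 2N samples (v, -v_r, -u_r, -u)."""
    h = len(folded) // 2
    u, v = folded[:h], folded[h:]
    return (v + [-s for s in reversed(v)]
            + [-s for s in reversed(u)] + [-s for s in u])

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
aliased = unfold(fold(x))   # (a - b_r, b - a_r, c + d_r, d + c_r)
```

The unfolded block has twice the length of the folded one and still contains mirrored aliasing terms; these cancel only after windowing and overlap-add with the neighbouring blocks.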

Figure 110: Algorithm for inverse transform in the case of transient mode.

The resulting time domain aliased signals for each sub-frame are windowed using the same configuration of windows as in the encoder. The resulting windowed signals are overlap-added. Note that the window for the first (m = 0) and last (m = 3) sub-frames is zero, owing to the zero padding used in the encoder. These two frame edges do not need to be computed and are effectively dropped. The resulting signal of the overlap-add operations of all sub-frames is reordered using the inverse of the operation performed in the encoder, which leads to the signal , . An overview of these operations is shown in figure 110. Then windowing and overlap-add are performed on the same as for the long window transform in subclause 5.3.2.2.

6.2.4.3.2 Short window transform for MDCT based TCX

After choosing the transform and overlap length configuration for a transient frame as described in subclause 6.2.4.2, each sub-frame (TCX5 or TCX10) is windowed and transformed using the inverse MDCT, which is implemented using the IDCTIV and the inverse TDA.

6.2.4.4 Special window transitions

The overlap mode FULL is used for transitions between long ALDO windows and short symmetric windows (HALF, MINIMAL) for short block configurations (TCX10, TCX5) described in subclause 6.2.4.2. The transition window is derived from the ALDO and sine window overlaps.

6.2.4.4.1 ALDO to short transition

The left part of the transition window uses the left slope of the ALDO synthesis window (short slope), which has a length of 8.75 ms (see figure 111).

For the right part of the transition window HALF or MINIMAL overlap is used.

Figure 111: ALDO to short transition

6.2.4.4.2 Short to ALDO transition

HALF or MINIMAL overlap is used for the left part of the transition window.

The right overlap of the transition window has a length of 8.75 ms (FULL overlap) and is derived from the ALDO window. The right slope of the ALDO synthesis window (long slope) is first shortened to 8.75 ms by removing the last 5.625 ms. Then the last 1.25 ms of the remaining slope is multiplied by the 1.25 ms MINIMAL overlap slope to smooth the edge. The resulting 8.75 ms slope is depicted in figure 112.

Figure 112: Short to ALDO transition
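The construction of the shortened right slope can be sketched as follows. The two slope shapes used in the example are cosine-shaped stand-ins, not the actual ALDO or MINIMAL overlap tables, and the function names are hypothetical; only the lengths (8.75 ms kept, 5.625 ms removed, 1.25 ms smoothed) are taken from the text.

```python
import math

def ms_to_samples(ms, fs):
    return int(round(ms * fs / 1000.0))

def short_to_aldo_slope(aldo_long_slope, minimal_slope, fs):
    """Truncate the long slope to 8.75 ms and taper its last 1.25 ms."""
    keep = ms_to_samples(8.75, fs)
    fade = ms_to_samples(1.25, fs)
    if len(aldo_long_slope) != keep + ms_to_samples(5.625, fs):
        raise ValueError("long slope must span 14.375 ms")
    if len(minimal_slope) != fade:
        raise ValueError("MINIMAL slope must span 1.25 ms")
    slope = list(aldo_long_slope[:keep])          # drop the last 5.625 ms
    for i in range(fade):
        slope[keep - fade + i] *= minimal_slope[i]   # smooth the truncated edge
    return slope

fs = 48000
# Placeholder decreasing slopes (assumed shapes, for illustration only).
long_slope = [math.cos(math.pi * (i + 0.5) / (2 * 690)) for i in range(690)]
minimal = [math.cos(math.pi * (i + 0.5) / (2 * 60)) for i in range(60)]
slope = short_to_aldo_slope(long_slope, minimal, fs)
```

At 48 kHz the three durations correspond to 420, 270 and 60 samples respectively, so the resulting slope has 420 samples and only its last 60 samples differ from the original long slope.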

6.2.4.5 Low Rate MDCT Synthesis

In addition to the full MDCT synthesis of length , which gives an output signal at the configured output sampling rate , a further MDCT synthesis of the lower part of the spectrum is performed to obtain an output signal at the CELP sampling rate . The low rate output is required for switching from TCX to ACELP. An MDCT synthesis of length is performed on the lowest MDCT coefficients. In case the output sampling rate is set lower than the CELP sampling rate (so that ), the spectrum is padded with zeroes to the required length . The low rate synthesis transform is performed as defined above with the only difference being the transform length.
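The selection of the coefficients for the low rate synthesis can be sketched as a truncation with optional zero padding. The function name and the example lengths are illustrative assumptions; the inverse transform itself is unchanged and is not shown.

```python
def low_rate_spectrum(coeffs, celp_len):
    """Keep the lowest celp_len MDCT coefficients, zero-padding if fewer are available."""
    low = list(coeffs[:celp_len])
    low.extend([0.0] * (celp_len - len(low)))   # no-op when enough coefficients exist
    return low

full = [float(k) for k in range(10)]
truncated = low_rate_spectrum(full, 4)          # output rate above the CELP rate
padded = low_rate_spectrum(full[:3], 5)         # output rate below the CELP rate
```

For instance, with 20 ms frames a CELP sampling rate of 12.8 kHz corresponds to a transform length of 256 coefficients, which are taken from the bottom of a longer spectrum or completed with zeroes when the output spectrum is shorter.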