5.4.3 HQ MDCT

3GPP TS 26.447, Codec for Enhanced Voice Services (EVS); Error concealment of lost packets (Release 17)

5.4.3.1 Preliminary signal analysis of past synthesis

The buffer containing the past decoded signal is analysed in a preliminary step to prepare the PLC selection method described in clause 5.4.3.2 and the MDCT concealment described in clause 5.4.3.6.

5.4.3.1.1 Resampling to 8 kHz

The last 2 frames of the previous synthesis signal are resampled to 8 kHz using a zero-delay low-pass FIR filter with a cutoff frequency of 4 kHz. The FIR filter order is 20, 40 and 60 for a sampling frequency of 16, 32 and 48 kHz, respectively. The FIR filter coefficients are denoted at 16 kHz, at 32 kHz and at 48 kHz.

Low-pass filtering and downsampling steps are jointly performed with a polyphase approach; the resampled signal at 8kHz, , can be computed using the relationship based on the past synthesis , , at respectively 16, 32 and 48 kHz:

(162)

Note that in the above summations, the past synthesis outside the last 2 frames is by convention considered to be zero. For instance, at 16 kHz, it is considered that when or .
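The joint filtering and downsampling described above can be illustrated with the following C sketch. It is an illustration under stated assumptions, not the codec's implementation: the function name, the argument layout and the symmetric odd-length filter are hypothetical, and the actual filter coefficients of equation (162) are not reproduced. The sketch evaluates the FIR filter only at the output instants that survive decimation, which is the saving a polyphase formulation provides, centres the filter on each output sample (zero delay) and treats samples outside the buffer as zero.

```c
#include <stddef.h>

/* Decimate x (n_in samples) by an integer factor with a symmetric FIR
 * low-pass h of even order (order + 1 taps).
 * Assumptions of this sketch: the filter is centred on each retained
 * sample, so no delay is introduced, and input samples outside
 * [0, n_in) are taken as zero. Writes n_in / factor samples to y. */
void decimate_zero_delay(const float *x, size_t n_in,
                         const float *h, int order, int factor,
                         float *y)
{
    int half = order / 2;
    size_t n_out = n_in / (size_t)factor;

    for (size_t n = 0; n < n_out; n++) {
        long centre = (long)(n * (size_t)factor);
        float acc = 0.0f;

        /* Only every factor-th filter output is ever computed. */
        for (int k = -half; k <= half; k++) {
            long idx = centre + k;
            if (idx >= 0 && idx < (long)n_in)
                acc += h[k + half] * x[idx];
        }
        y[n] = acc;
    }
}
```

With factor set to 2, 4 or 6 and order set to 20, 40 or 60, the call pattern matches the three sampling rates named in this clause.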

5.4.3.1.2 Pitch search by cross-correlation

The past synthesis signal resampled to 8 kHz and of length 40 ms, , is used to perform an open-loop pitch search as follows:

– The target signal is defined as the last 6 ms segment from the 40 ms buffer at 8 kHz:

– A search vector of the same length (6 ms), , with sliding starting point is used. The search range covers 33 ms when the voicing parameter indicates a voiced segment (i.e. =1) and 28 ms otherwise; therefore the pitch search range is adapted depending on the voicing indicator , to use a longer search range in case of voiced signals. The cross-correlation is computed for each index as:

(163)

To minimize computational complexity the term is pre-computed and the term is updated incrementally by removing the first term and adding a new term in each iteration.

For each index , the maximum correlation and maximum location are updated as follows: If , and , with the initial conditions and ; this loop is stopped whenever =0 and .

The pitch is then defined as , which corresponds to the time offset with respect to the beginning of the target signal (i.e. 34 ms after the beginning of the past synthesis ).

5.4.3.2 PLC method selection

If the last good frame prior to a loss was coded with HQ MDCT, several specifically optimized PLC methods are available. One of them is selected based on the second level criteria described in this subclause.

The criteria evaluated in this second level PLC method selection are:

– Output sampling rate
The output sampling rate on which the second level PLC method selection depends is one of {8000 Hz, 16000 Hz, 32000 Hz, 48000 Hz}.

– Bit rate
The bit rate on which the second level PLC method selection depends is one of the supported bit rates of the EVS default operation mode [5].

– Voicing
The voicing parameter on which the second level PLC method selection depends is a binary parameter.

– Correlation
The correlation parameter on which the second level PLC method selection depends is computed as in clause 5.4.3.1.2 and is a correlation coefficient in the range [0…1].

– Transient condition
The transient condition on which the second level PLC method selection depends is a vector of dimension 2 of binary parameters indicating a transient condition in the last good frame or in the frame before. The determination of the transient condition for a given HQ MDCT frame is specified in [5], subclause 5.3.2.4.1.1.

– Spectral envelope stability based speech/music classification
The spectral envelope stability based speech/music classification on which the second level PLC method selection depends is a binary parameter. This parameter is a post-processed instance of the envelope stability parameter that is specified in [5], subclause 6.2.3.2.1.3.2.3 (Noise level adjustment). The spectral envelope stability based speech/music classification is calculated during the decoding of the preceding good HQ MDCT frame and stored for use in the context of the PLC method selection during a bad frame.

The post-processing of this parameter is a Markov smoother with:

– {speech, music} as hidden states,

– the normalized envelope stability parameter,

(164)

and its reverse
as direct state observation likelihoods for music and, respectively, speech,

– and the transition probabilities
for going from speech or, respectively, music state to speech state, and
for going from speech or, respectively, music state to music state.

For each good HQ MDCT frame the following sequence of operations is executed:

1) Calculation of the normalized envelope stability parameter and its reverse .

2) Calculation of a priori likelihoods for speech and music states based on the state likelihoods for the instant of the previous (good) frame and the transition probabilities:

(165)

3) Element-wise multiplication of the vector of a priori likelihoods with the vector of direct state observation likelihoods for music and, respectively, speech:

(166)

Subsequent normalization yields the vector of state likelihoods of the current frame:

(167)

4) Finally, the index of the largest element of the state likelihood vector is identified and taken as speech/music classification result for the present frame.

(168)

5) The state likelihood vector of the current frame is stored for subsequent use in the next good HQ MDCT frame.
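Steps 1) to 5) above amount to one forward step of a two-state Markov smoother. The following C sketch shows one such step. It is a sketch under assumptions: the transition probabilities are passed in by the caller because their values are not reproduced here, the "reverse" of the normalized envelope stability is assumed to be one minus its value, and the names state_lik and markov_smooth_step are hypothetical.

```c
/* State indices used in this sketch: 0 = speech, 1 = music. */
typedef struct {
    float p[2];   /* state likelihoods, stored between good frames */
} state_lik;

/* One smoother step for a good HQ MDCT frame.
 * trans[i][j] is the probability of going from state i to state j
 * (caller-supplied, hypothetical layout).
 * stab is the normalized envelope stability in [0, 1]; it serves as
 * the observation likelihood for music, 1 - stab (assumed) for speech.
 * Returns the classification: 0 = speech, 1 = music. */
int markov_smooth_step(state_lik *s, const float trans[2][2], float stab)
{
    float prior[2], post[2], sum;

    /* Step 2: a priori likelihoods from the stored state likelihoods. */
    prior[0] = s->p[0] * trans[0][0] + s->p[1] * trans[1][0];
    prior[1] = s->p[0] * trans[0][1] + s->p[1] * trans[1][1];

    /* Step 3: element-wise product with the observation likelihoods. */
    post[0] = prior[0] * (1.0f - stab);
    post[1] = prior[1] * stab;

    /* Normalization; fall back to equal likelihoods if both vanish. */
    sum = post[0] + post[1];
    if (sum > 0.0f) {
        post[0] /= sum;
        post[1] /= sum;
    } else {
        post[0] = post[1] = 0.5f;
    }

    /* Step 5: store for the next good frame. */
    s->p[0] = post[0];
    s->p[1] = post[1];

    /* Step 4: index of the largest element. */
    return post[1] > post[0];
}
```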

With the above-specified parameters the second level PLC method selection is performed as follows:

– First, if the output sampling rate equals 8000 Hz, the PLC methods specified in clauses 5.4.3.3 and 5.4.3.4 are applied.

– Otherwise (if the output sampling rate is equal to or exceeds 16000 Hz), then:

– in case the bit rate is less than or equal to 48 kbps and

– if the voicing parameter is set or the correlation parameter exceeds 0.85, then

– the frame loss concealment method specified in subclause 5.4.3.6 is applied;

– otherwise,

– the frame loss concealment method specified in subclause 5.4.3.5 is applied.

– otherwise (in case the bit rate is larger than 48 kbps), then:

– the frame loss concealment method specified in subclause 5.4.3.6 is applied under the same condition as above (for bit rates less than or equal to 48 kbps), except when the spectral envelope stability based speech/music classification indicates music, in which case this frame loss concealment method is only applied if the correlation parameter is below 0.6 or if the voicing parameter is set;

– otherwise, if the above condition is not satisfied, the frame loss concealment method specified in subclause 5.4.3.5 is applied.

– However, in addition to the conditions specified above, the frame loss concealment method specified in subclause 5.4.3.6 is only applied provided that the current frame is the first bad frame following a good frame and that the transient condition vector either does not indicate a transient in the previous frame or indicates a transient in the frame before the previous frame. If this provision is not satisfied, the frame loss concealment method specified in subclause 5.4.3.5 is applied.
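The selection rules above can be summarized as a small decision function. The following C sketch is a literal reading of the list, not code taken from the codec: the enum, the function name select_plc and its argument names are hypothetical, and the bit rate is assumed to be given in bit/s.

```c
#include <stdbool.h>

typedef enum {
    PLC_NB_MDCT_REPETITION,  /* clauses 5.4.3.3 and 5.4.3.4 */
    PLC_PHASE_ECU,           /* clause 5.4.3.5 */
    PLC_SINUSOIDAL_MDCT      /* clause 5.4.3.6 */
} plc_method;

/* transient_prev:  transient flagged in the previous (last good) frame
 * transient_prev2: transient flagged in the frame before that
 * is_music:        speech/music classification result
 * first_bad_frame: current frame is the first bad frame after a good one */
plc_method select_plc(int fs_hz, long bitrate, bool voiced, float corr,
                      bool transient_prev, bool transient_prev2,
                      bool is_music, bool first_bad_frame)
{
    bool candidate;
    bool allowed;

    if (fs_hz == 8000)
        return PLC_NB_MDCT_REPETITION;

    if (bitrate <= 48000 || !is_music)
        candidate = voiced || corr > 0.85f;
    else
        candidate = voiced || corr < 0.6f;   /* music above 48 kbps */

    /* Additional provision for the method of clause 5.4.3.6. */
    allowed = first_bad_frame && (!transient_prev || transient_prev2);

    return (candidate && allowed) ? PLC_SINUSOIDAL_MDCT : PLC_PHASE_ECU;
}
```

For example, a first lost frame at 16 kHz and 24.4 kbps with a correlation of 0.9 and no transient history selects the method of clause 5.4.3.6, while any later frame of the same burst falls back to Phase ECU.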

The decoding of HQ MDCT for NB includes the following modules:

– a frequency domain packet loss concealment (PLC) block,

– a spectrum decoding block,

– a memory update block,

– an IMDCT block,

– and a time-domain PLC block.

If it is determined that there is an erased frame, the erased frame is concealed using a PLC method. The bad frame indicator (BFI) set to 1 indicates that a current frame is erased, or that no useful information exists for that frame. Similarly, the Prev_BFI flag set to 1 indicates that a previous frame has been erased.

Figure 2 shows the block diagram for packet loss concealment of NB signals for the MDCT mode. A frequency-domain approach operates on the frequency domain signal such as the input to the IMDCT block in the figure. A time-domain approach operates on the time domain signal after the IMDCT block. When a frame erasure occurs, the spectral coefficients of the current frame are estimated. To accomplish this using the frequency-domain approach, the synthesized spectral coefficients of the last good frame are repeated for the current frame with signal modification such as a gain scaling and a random sign changing. In the time-domain approach, an additional PLC operation is added to enhance the performance of the frequency-domain approach depending on the input signal characteristics. For this additional operation, the appropriate packet loss concealment tool, either the phase matching tool or the repetition and smoothing tool is selected.

Figure 2: Block diagram for NB PLC for MDCT mode

5.4.3.3 MDCT frame repetition with random sign and gain scaling

When a first frame erasure occurs, packet loss concealment is performed as follows. In order to conceal the erasure, the signal characteristics of a decoded signal are used, which results in a classification of the characteristics of the decoded signal into a stationary and normal frame. A current frame is determined to be transient using the frame type (is_transient) which is transmitted from the encoder. The energy difference (energy_diff) is used to determine if the current frame is stationary, and is represented by the following equation. The energy difference indicates the absolute value of a normalized energy difference between energy E_curr of the current frame and a moving average E_MA of per-frame energy. E_MA will be updated to E_{MA_old} in the next frame.

(169)

Where,

(170)

(171)

Depending on the frame type and characteristics, scaling and a random sign are used when the spectral coefficients are repeated for the current erased frame.

if ( is_transient == 0 ) {
    if ( energy_diff < ED_THRES ) {
        /* Stationary frame */
        Repeating the spectral coefficients of the last good frame without scaling;
    }
    else {
        /* Non-stationary frame */
        Repeating the spectral coefficients of the last good frame with 3dB scale-down;
    }
}
else {
    if ( st->old_is_transient[1] == 1 ) {
        Repeating the spectral coefficients of the last good frame with 3dB scale-down;
    }
    else {
        Repeating the spectral coefficients of the last good frame with 3dB scale-down;
        Use random sign from the 2nd band (8th spectral coefficient);
    }
}

When multiple erasures have occurred, an adaptive fade-out by regression method is used. In this adaptive fade-out by regression, a grouped average norm value of an erased frame is predicted using K grouped average norm values of the previous good frame through regression analysis.

Figure 3 illustrates the structure of grouped sub-bands when the regression analysis is applied to a narrowband (supported up to 4.0 kHz) signal. Grouped average norm values obtained from grouped sub-bands form a vector, which is referred to as an average vector of grouped norms. K grouped average norm values of each grouped sub-band (GSb) are used for the regression analysis.

Figure 3: Structure of grouped sub-bands

Figure 4 illustrates the concept of a linear regression analysis and a non-linear regression analysis. Of the two methods, the linear regression analysis is applied to the adaptive fade-out. The ‘average of norms’ is an average norm value obtained by grouping several bands and is the target to which the regression analysis is applied. A linear regression analysis is performed when the quantized value of the norm is used for an average norm value of a previous frame. ‘Number of PGF’ indicates the number of previous good frames used for the regression analysis and is a variable. The linear regression analysis is represented by equations (172) and (173).

(172)

(173)

As in equation (172), when a linear equation is used, the upcoming transition (y) is predicted by obtaining a and b. In this equation, the independent variable can be a frame index. In equation (173), a and b are obtained by an inverse matrix. Gauss-Jordan elimination is a simple method of obtaining an inverse matrix.
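The regression step can be illustrated with the following C sketch. It assumes a line of the form y = a·x + b, with the slope a and intercept b named as in the pseudo-code of this clause, and solves the 2×2 normal equations in closed form instead of by Gauss-Jordan elimination. The function names linreg_fit and predict_norm, and the argument last_norm, are hypothetical.

```c
/* Least-squares fit of y = a*x + b through m points by solving
 *   [ m      sum x   ] [b]   [ sum y  ]
 *   [ sum x  sum x^2 ] [a] = [ sum xy ]
 * Returns 0 on success, -1 if the system is singular. */
int linreg_fit(const float *x, const float *y, int m, float *a, float *b)
{
    float sx = 0.0f, sxx = 0.0f, sy = 0.0f, sxy = 0.0f;
    float det;

    for (int k = 0; k < m; k++) {
        sx  += x[k];
        sxx += x[k] * x[k];
        sy  += y[k];
        sxy += x[k] * y[k];
    }

    det = (float)m * sxx - sx * sx;
    if (det == 0.0f)
        return -1;

    *a = ((float)m * sxy - sx * sy) / det;
    *b = (sxx * sy - sx * sxy) / det;
    return 0;
}

/* Extrapolate the grouped average norm for an erased frame.
 * A rising trend (a > 0) is not followed; the norm of the previous
 * good frame (last_norm) is kept instead, so the fade-out never
 * increases the level. */
float predict_norm(float a, float b, float last_norm,
                   int num_pgf, int nbLostCmpt)
{
    if (a > 0.0f)
        return last_norm;
    return b + a * (float)(nbLostCmpt - 1 + num_pgf);
}
```

With frame indices 0 to num_pgf − 1 for the previous good frames, the first lost frame is evaluated at index num_pgf, the second at num_pgf + 1, and so on.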

Figure 4: The concept of a linear regression analysis and a non-linear regression analysis

Figure 5 is a block diagram of a packet loss concealment block with adaptive fade-out. Referring to Figure 5, the signal characteristic determiner determines the characteristics of a signal by using a decoded signal and classifies the characteristics of the decoded signal into transient and normal frames. A method of determining a transient frame is now described. The current frame classification of transient is determined by two parameters: the frame type (is_transient) which is transmitted from the encoder, and the energy difference (energy_diff), which is represented by Equation (169).

if ( energy_diff < ED_THRES && is_transient == 0 ) {
    /* Not Transient */
    num_pgf = 4;
}
else {
    num_pgf = 2;
}

In the above context, ED_THRES denotes a threshold and is set to 1.0. According to the result of the transient determination, the number of PGFs (num_pgf), referred to in the subclause on regression analysis, can be controlled for packet loss concealment.

Another parameter for packet loss concealment is a scaling method of burst erasure duration. The same energy difference value is used for the duration of a single burst.

if ( (energy_diff < ED_THRES) && (is_transient == 0) ) {
    /* Not Transient */
    mute_start = 5;
    random_start = 2;
}
else {
    mute_start = 2;
    random_start = 2;
}

If the erased current frame is determined not to be transient, then during a burst erasure the frames from the fifth frame of the burst onwards are forcibly scaled down by a fixed value of 3 dB, regardless of the regression analysis of the decoded spectral coefficients of the previous frame. Otherwise, if the erased current frame is determined to be transient, the frames from the second frame of the burst onwards are forcibly scaled down by a fixed value of 3 dB, regardless of the regression analysis of the decoded spectral coefficients of the previous frame.

Because regression analysis is performed only when a burst erasure has occurred, it is performed when nbLostCmpt, the number of contiguous erased frames, equals two, that is, at the second contiguous erased frame.

if ( nbLostCmpt == 2 ) {
    regression_analysis();
}

Even though an nth frame is a good frame, if the (n−1)th and (n+1)th frames are erased frames, a totally different signal is generated in the overlapping process. Thus, when erasures occur in a non-consecutive order (an erased frame, a good frame, and an erased frame), nbLostCmpt of the third frame (the second erasure) is 1 but is forcibly increased by 1. As a result, nbLostCmpt is 2, a burst erasure is assumed to have occurred, and the regression analysis is used.

if ( prev_old_bfi == 1 && nbLostCmpt == 1 && output_frame_org == L_FRAME8k )
{
    nbLostCmpt++;
}

In the above context, prev_old_bfi denotes the frame erasure information of the second previous frame. This process is applicable when the current frame is an erased frame. To reduce complexity, the regression analyzer block forms each group by grouping 8 or 2 bands and applies the regression analysis to the mean vector of grouped norms. The number of previous good frames for the regression analysis is set to either 2 or 4 and is controlled by the result of the signal characteristic determiner block. In addition, the number of rows of the matrix for the regression analysis is set to 2. As a result of the regression analysis, an average norm value of each group is predicted for an erased frame by calculating the values a and b from the linear regression analysis equation (173). In this block the calculated value a can be adjusted to a predetermined range as follows. In EVS, a is always limited to non-positive values. In the following pseudo-code, norm_values is an average norm value of each group in the previous good frame and norm_p is a predicted average norm value of each group.

if ( a > 0 ) {
    a = 0;
    norm_p[i] = norm_values[0];
}
else {
    norm_p[i] = b + a * (nbLostCmpt - 1 + num_pgf);
}

With this modified value of a, the average norm value of each group is predicted. When the predicted norm is larger than zero and the norm of the previous frame is non-zero, the gain calculator block calculates a gain between the average norm value of each group that is predicted for the erased frame and an average norm value of each group in the previous good frame. Otherwise, the gain is scaled down by 3 dB from the initial value of 1.0.

The calculated gain is also adjusted to a predetermined range. In EVS, the maximum value of the gain is 1.0.

The scaler block applies gain scaling to the previous good frame to predict spectral coefficients of the erased frame. The scaler block also applies adaptive muting to the erased frame and a random sign to predicted spectral coefficients according to characteristics of an input signal, which is also controlled by the results of the signal characteristic determiner block.

The value of mute_start specifies that muting is forced when bfi_cnt is equal to or greater than mute_start during continuous frame erasures. The value of random_start, which controls the random sign, is interpreted in the same way.

According to a method of applying adaptive muting, spectral coefficients are forcibly down-scaled by 3dB. In addition, the sign of each of the spectral coefficients is randomly modified to reduce modulation noise generated due to repetition of spectral coefficients in each frame.

In addition, the random sign is applied to frequency bands equal to or higher than the second frequency band, as it should be better to use the sign of a spectral coefficient that is identical to that of the previous frame in a very low frequency band (0~200Hz for the first band). Accordingly, a sharp change in the signal can be smoothed, and an erased frame accurately restored to be adaptive to the characteristics of the signal, in particular, a transient characteristic.

Figure 5: Block diagram of a packet loss concealment block with adaptive fade-out

5.4.3.4 MDCT frame repetition with sign prediction

An analysis of the sign change of the MDCT coefficients in the received frames is continuously performed. The analysis of and is performed on 4-dimensional bands up to 1.6 kHz ( MDCT coefficients divided into bands).

Two 16-dimensional state variables, used to determine the sign of the reconstructed MDCT vector, and hold the number of sign switches between consecutive frames. The analysis takes also into account signal dynamics (measured by a transient detector), to decide on the reliability of past data. Updates for both state variables are done only for, if the values are set to zero.

Within a sub-band, the first state variable is incremented whenever the sign of the corresponding MDCT coefficients switches:

(174)

The second state variable accumulates the number of sign switches over consecutive frames:

(175)

When a frame is lost, the missing MDCT vector is reconstructed by copying the last available coefficients. The sign of the reconstructed vector can be preserved or changed on a sub-band basis (every 4 coefficients). Inside a band, the decision whether to change the sign or not is based on comparing the second state variable to a pre-determined threshold as follows (wherein a sign flip or reversal is indicated by -1 and preservation of the sign is indicated by +1):

(176)

The threshold is adjusted to the past decision of the transient detector. The sequential decision logic is illustrated in Table 10.

Table 10: Sign extrapolation decision logic

If any of frames or contains transient	Apply random sign to the copied coefficients
If frames or are good, but frame contains transient	Apply sign extrapolation with
If frames ,, and are good	Apply sign extrapolation with

5.4.3.5 Phase ECU

Phase ECU is a frame loss concealment method especially suitable for general audio and music signals. It provides a smooth and faithful time evolution of the reconstructed signal for a lost frame, wherein the audible impact of a frame loss is minimized.

Phase ECU is a frame loss concealment technique that operates with a sinusoidal model under the assumption that the audio signal is composed of a limited number of individual sinusoidal components. The general principle of Phase ECU comprises sinusoidal analysis of a previously received good HQ MDCT coded frame of the audio signal (analysis frame), wherein the sinusoidal analysis involves identifying frequencies of sinusoidal components of the audio signal. Further, a sinusoidal model is applied on this previously synthesized frame, wherein it is used as a prototype frame in order to create a substitution frame for a lost audio frame. The creation of the substitution frame is done by time-evolving the identified sinusoidal components of the prototype frame, up to the time instance of the lost audio frame, in response to the corresponding identified sinusoidal frequencies.

In more detail, Phase ECU operation comprises the steps of sinusoidal analysis, described in subclause 5.4.3.5.2, and application of the sinusoidal model based on a prototype frame of the earlier synthesized signal in order to generate a substitution frame for the lost audio frame, described in subclause 5.4.3.5.3. In addition, and prior to this basic Phase ECU operation, a transient analysis step is carried out, described in subclause 5.4.3.5.1. Its purpose is to detect audio signal and burst frame loss conditions under which the basic Phase ECU operation is adapted in order to ensure maximum reconstruction signal quality.

5.4.3.5.1 Transient analysis

The purpose of the calculations in transient analysis is the detection of properties of the previously reconstructed good signal frame or the frame loss statistics that could lead to suboptimal signal reconstruction quality with the Phase ECU. Upon such detected conditions phase and magnitude of the substitution frame are selectively adjusted in order to mitigate potential quality degradations. Conditions under which such adjustments are carried out are detected transients or burst losses with several consecutive frame losses. The result of the transient analysis is phase and magnitude modification factors corresponding to such adjustments.

Transient analysis is performed on each lost frame, and the following steps are performed for the first lost frame or for the second lost frame in case the first lost frame was handled with the method according to subclause 5.4.3.6. For subsequent lost frames transient analysis relies on previously calculated and stored parameters (that were calculated based on the synthesis of the last good HQ MDCT frame). For these losses transient analysis adjusts magnitude spectrum attenuation factors and phase dithering degrees in response to detected transient or burst loss condition.

The transient analysis is performed in the frequency domain. Two FFTs are performed on a left and a right part of the analysis frame buffer which contains the previous synthesis

(177)

where is the length of the transient analysis, set to 64, 128, or 192 for WB, SWB, and FB, and is a Hamming window of corresponding length. The resulting FFT spectrum is split into bands according to Table 11 that approximately follow the size of the human auditory critical bands, and the energy in each band is calculated.

Table 11: Band start of Phase ECU

Band index	0	1	2	3	4	5	6	7	8
Band start	1	3	6	10	16	32	64	128	192

(178)

Next the ratio of these energies is calculated as

(179)

This means that the transient detection is made frequency selectively for each frequency band. The gain is then compared with an upper and a lower threshold for onset or respectively offset detection. If or is fulfilled, then band contains a transient and is set to 1. The gains are set to 1. If a band has a transient, then gain is updated to:

(180)

The gains for the first lost frame are saved into.

The derivation of magnitude and phase modification factors in response to a detected burst loss condition is described in the following. The variable is set to 1, is set to 0, and. An average energy of each band is calculated:

(181)

corresponds to a low-resolution spectrum of the last good frame. This spectrum is used as part of the burst loss handling feature of Phase ECU. It is used for a spectrally shaped additive noise signal to which the substitution signal is pulled in case of burst frame losses.

For subsequent lost frames, the gain is updated according to:

(182)

where

(183)

Here is the number of consecutive lost frames and is the envelope stability feature described in [5] subclause 6.2.3.2.1.3.3, where the range endpoints 0 and 1 represent speech and music respectively. If then .

The attenuation factors and are updated as:

(184)

Through variable the concealment method is modified by selectively adjusting the magnitude of the substitution frame spectrum, based on the frequency domain transient detector status, see equation (189).

The scaling factor is used to scale the spectrally shaped additive noise signal such that, except for the incorporated gradual muting behaviour through factor , it compensates for the energy loss caused by the attenuation with factor . This is an aspect of the long-term muting behaviour which is outlined in subclause 5.4.6.2.2.

For , is further adjusted as and for . This superimposes a low-pass characteristic on the additive noise signal, which avoids unpleasant high-frequency noise in the substitution signal.

The variable is initialized to 0, and for the phase dither is calculated:

(185)

5.4.3.5.2 Spectrum analysis

The spectrum analysis is carried out in the frequency domain. It is only done once for the first lost frame after a good HQ MDCT frame. The buffer with the previous synthesis of the last good HQ MDCT frame (analysis frame) is windowed and passed through an FFT.

(186)

where is a Hamming-rectangular window, and is the length of the FFT, set to 512, 1024, or 1536 for WB, SWB, and FB signals,

(187)

and is the length of the Hamming part, and is 96, 192, or 288 for WB, SWB, and FB.

The spectrum is saved and used for all consecutive frame losses. Then the magnitude spectrum is calculated, and the peaks of this magnitude spectrum are located by a peak picking method. The number of found peaks is, and the peak locations are. The frequency resolution of these peak locations is, however, still insufficient for good Phase ECU performance, since the true frequencies of the sinusoidal model components lie in the vicinity of them. Thus, after the peaks in the magnitude FFT spectrum are found, their positions are further refined to make them available in the highest possible resolution. The refinement is carried out by using parabolic interpolation, which yields the fractional peak locations.

This sinusoidal model is also used in reconstruction of the lost audio frame.
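The parabolic refinement of a peak location can be sketched in C as follows. This is a generic three-point parabolic interpolation, given as an illustration only: whether the codec interpolates on linear or logarithmic magnitudes is not stated here, and the function name refine_peak is hypothetical.

```c
/* Refine a magnitude-spectrum peak found at integer bin k
 * (1 <= k <= n - 2) by fitting a parabola through mag[k-1], mag[k]
 * and mag[k+1] and returning the abscissa of its vertex.
 * If the three points do not form a strict maximum, the integer
 * location is returned unchanged. */
float refine_peak(const float *mag, int k)
{
    float ym = mag[k - 1];
    float y0 = mag[k];
    float yp = mag[k + 1];
    float den = ym - 2.0f * y0 + yp;   /* negative at a true maximum */

    if (den >= 0.0f)
        return (float)k;

    return (float)k + 0.5f * (ym - yp) / den;
}
```

The returned value lies within half a bin of k when mag[k] is the largest of the three samples.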

5.4.3.5.3 Frame reconstruction

The substitution frame for the lost frame is calculated by applying the sinusoidal model on a frame of the previously synthesized good frame signal, where this frame serves as a prototype frame. The previously calculated sinusoidal components of this prototype frame are time evolved to the time instant of the lost frame. For numerical simplicity, this prototype frame and its spectrum are chosen to be identical to the windowed analysis frame and, respectively, its already calculated and saved spectrum (see subclause 5.4.3.5.2). While the exact time evolution of the sinusoids of the windowed prototype frame would require a complex superposition of frequency-shifted, phase-evolved and sampled instances of the spectrum of the used window function, Phase ECU operates with an approximation of the window function spectrum such that it comprises only a region around its main lobe. With this approximation the substitution frame spectrum is composed of strictly non-overlapping portions of the approximated window function spectrum and hence the time-evolution of the sinusoids of the windowed prototype frame reduces to phase-shifting the sinusoidal components of the prototype spectrum in -regions around the corresponding spectral peaks by an amount. Note that this amount merely depends on the respective sinusoidal frequency (peak location) and the time shift between the lost frame and the prototype frame. This is expressed in the following equation. The phase shift is calculated as:

(188)

where is the offset in number of samples since the last good frame. is a variable incremented by for each lost frame, and equals 40, 80, or 120 for WB, SWB, or FB signals.

Note that if either of the last two frames has the flag set, then the number of peaks is set to 0.

Next the spectrum around the spectral peaks is updated and random noise component related to burst loss handling is added

(189)

where, is set according to Table 11, is a random number between -1 and 1, and

(190)

If is non-zero, the amplitude is adjusted, and a small random component is added to the phase

(191)

(192)

The spectral coefficients which have not been updated are also updated in a similar manner but with a randomized phase.

For clarity it is to be noted that the first additive term in equation (189) relates to phase shifting the sinusoidal components of the prototype spectrum. In addition, if is non-zero, the phase is modified with a random component. This avoids quality-degrading tonal sounds due to too strong periodicity and is useful both in case of transients and burst frame loss. In addition, for the same reason the magnitude of the prototype frame spectral coefficients is attenuated with the scaling factor. The second additive term in equation (189) modifies the substitution frame spectral coefficients by an additive noise component, where the magnitude of the additive noise component corresponds to the scaled coefficient of the low-resolution magnitude spectrum of the previous good frame, , whose derivation is described in subclause 5.4.3.5.1. The scaling factor is chosen such that, except for the incorporated gradual muting behaviour, it compensates for the energy loss caused by the attenuation with factor . This is an aspect of the long-term muting behaviour which is outlined in subclause 5.4.6.2.2.

The reconstructed substitution frame spectrum is passed through the IFFT to create a time domain substitution frame.

(193)

where . The signal is zero-extended outside of this range. This signal is then windowed and time-domain aliased as described in [5], clause 5.3.2.2 ( is the number of zero samples in the ALDO window). The resulting windowed and time-domain aliased signal is then overlap-added with the previous frame as described in [5], clause 6.2.4.1.

5.4.3.6 MDCT concealment based on sinusoidal synthesis and adaptive noise filling

The MDCT concealment based on sinusoidal synthesis and adaptive noise injection is illustrated in Figure 6. Note that the resampling to 8 kHz and pitch search are already described in clause 5.4.3.1.

Figure 6: Block diagram of MDCT concealment based on sinusoidal synthesis
and adaptive noise filling

5.4.3.6.1 FFT

The pitch cycle of length is extracted from the resampled past synthesis of length 320 as: . This pitch cycle is analyzed in frequency domain using the following steps:

– The signal is linearly interpolated to a length corresponding to a power of 2 to obtain the segment of length such that where is the rounding upward to the nearest integer. A linear interpolation is applied as follows:

(194)

where is the rounding downward to the nearest integer and .

– The segment is decomposed in frequency domain by FFT of length to obtain the spectrum ,

– The amplitude spectrum is computed for and the overall amplitude is also computed as:

(195)
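The interpolation step above can be illustrated with a minimal C sketch. It is not the normative procedure: the function names `next_pow2` and `interp_linear` are hypothetical, the read position `n * in_len / out_len` is an assumed reading of the stretch factor, and holding the last sample at the end of the segment is also an assumption.

```c
#include <stddef.h>

/* Smallest power of two that is >= n (for n >= 1). */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Linearly interpolate in[0..in_len-1] onto out[0..out_len-1].
 * Output sample n is read at the fractional input position
 * n * in_len / out_len (assumed mapping). Past the last input
 * sample, the last sample is held (assumption). */
static void interp_linear(const float *in, size_t in_len,
                          float *out, size_t out_len)
{
    double step = (double)in_len / (double)out_len;
    for (size_t n = 0; n < out_len; n++) {
        double pos = (double)n * step;
        size_t i0 = (size_t)pos;               /* floor, since pos >= 0 */
        double frac = pos - (double)i0;
        size_t i1 = (i0 + 1 < in_len) ? i0 + 1 : in_len - 1;
        out[n] = (float)((1.0 - frac) * in[i0] + frac * in[i1]);
    }
}
```

For example, a pitch cycle of 100 samples at 8 kHz would be stretched to `next_pow2(100)`, i.e. 128 samples, before the FFT.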

5.4.3.6.2 Selection of sinusoidal components

Sinusoidal components are selected by first detecting the spectral peaks that satisfy the following condition: and . When the binary voicing indication has the value =1, the peak selection is extended to select not only the peak at index meeting the preceding condition but also the neighbouring peaks at index and ; this captures a larger portion of the overall spectral energy and lowers the noise to be re-injected for voiced signals.

The final number of peaks to be kept is to reduce the computational load of the subsequent sinusoidal synthesis. This final selection of peaks is performed by iteratively selecting the peak maximizing among peaks that are not yet selected as long as the conditions =1 or is still met, where the latter condition ensures that 70% of the amplitude spectrum is covered.

For each i-th peak that gets selected, the amplitude , phase and normalized frequency are computed.
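The iterative selection can be sketched in C as follows. This is an illustration under stated assumptions, not the normative algorithm: the peak test used here (a bin larger than both neighbours) stands in for the detection condition, the voicing-dependent behaviour is omitted, and `select_peaks`, `coverage` and `max_peaks` are hypothetical names. The 70% coverage target from the text is passed in as `coverage = 0.7`.

```c
#include <stddef.h>

/* Greedy peak selection on an amplitude spectrum amp[0..n-1].
 * Repeatedly picks the largest not-yet-selected local maximum until
 * the selected amplitudes cover `coverage` times the total amplitude,
 * or until max_peaks peaks have been picked.
 * idx_out must hold at least max_peaks entries.
 * Returns the number of selected peaks. */
static size_t select_peaks(const float *amp, size_t n, float coverage,
                           size_t max_peaks, size_t *idx_out)
{
    float total = 0.0f;
    float covered = 0.0f;
    size_t count = 0;

    for (size_t k = 0; k < n; k++) {
        total += amp[k];
    }

    while (count < max_peaks && covered < coverage * total) {
        size_t best = n;                       /* n means "none found" */
        for (size_t k = 1; k + 1 < n; k++) {
            /* Assumed peak test: strictly larger than both neighbours. */
            int is_peak = amp[k] > amp[k - 1] && amp[k] > amp[k + 1];
            int taken = 0;
            for (size_t j = 0; j < count; j++) {
                if (idx_out[j] == k) {
                    taken = 1;
                }
            }
            if (is_peak && !taken && (best == n || amp[k] > amp[best])) {
                best = k;
            }
        }
        if (best == n) {
            break;                             /* no unselected peak left */
        }
        idx_out[count++] = best;
        covered += amp[best];
    }
    return count;
}
```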

5.4.3.6.3 Sinusoidal synthesis

A segment of length corresponding to 2 frames of 20 ms (40 ms) and 8 kHz to resampling delay is generated from the selected frequency bins as:

, (196)

This sinusoidal synthesis is implemented using an autoregressive filter of order 2. The extra segment length (after the current frame) is used for crossfading with the next decoded frame and to compensate for the resampling delay.
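A second-order autoregressive recursion generates a sinusoid without evaluating a cosine per sample, because cos(ω(n+1) + φ) = 2cos(ω)·cos(ωn + φ) − cos(ω(n−1) + φ). The sketch below shows this recursion for one component; the function name `ar2_sinusoid_add` and its parameter layout are hypothetical and are not taken from the specification. The caller is assumed to initialise the filter with `coef = 2*cos(omega)`, `y_m1 = A*cos(phase - omega)` and `y_m2 = A*cos(phase - 2*omega)`, so that the first output sample equals `A*cos(phase)`.

```c
#include <stddef.h>

/* Add one sinusoid to out[0..len-1] using the order-2 recursion
 *     y[n] = coef * y[n-1] - y[n-2],   coef = 2*cos(omega).
 * y_m1 and y_m2 are the two samples preceding the segment,
 * i.e. y[-1] and y[-2]. */
static void ar2_sinusoid_add(float *out, size_t len,
                             double coef, double y_m1, double y_m2)
{
    double prev1 = y_m1;
    double prev2 = y_m2;
    for (size_t n = 0; n < len; n++) {
        double y = coef * prev1 - prev2;
        out[n] += (float)y;    /* accumulate: one call per selected peak */
        prev2 = prev1;
        prev1 = y;
    }
}
```

Calling the function once per selected peak on a zero-initialised buffer accumulates the sum of sinusoids. With ω = π/3, unit amplitude and zero phase, the arguments are `coef = 1.0`, `y_m1 = 0.5`, `y_m2 = -0.5`, and the output starts 1, 0.5, −0.5, −1.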

5.4.3.6.4 Adaptive noise filling

Frequency components that do not correspond to selected sinusoids below 4 kHz or that are above 4 kHz are re-injected by adaptive noise filling, in particular to compensate for energy loss.

The pitch computed according to clause 5.4.3.1.2 is mapped to the output sampling rate as, where is the decimation factor used in sub-clause 5.4.3.1.1 with =2, 4, 6 for =16, 32 or 48 kHz respectively. For a 20 ms frame length at , a residual signal is computed as:

, (197)

where the residual length is if < and otherwise. Then, if the binary voicing indicator has the value =1, the residual is further scaled down by a factor of 0.25 as , .
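As a small illustration of the two scalar operations in this paragraph, the following C sketch maps a pitch lag found at 8 kHz to the output sampling rate and applies the 0.25 scaling for voiced frames. It assumes the mapping is a plain multiplication by the decimation factor (2, 4 or 6); the function names are hypothetical and the computation of the residual itself, equation (197), is not reproduced.

```c
#include <stddef.h>

/* Decimation factor between the output rate and 8 kHz:
 * 2, 4, 6 for 16, 32, 48 kHz respectively. */
static int decimation_factor(int fs_hz)
{
    return fs_hz / 8000;
}

/* Map a pitch lag measured at 8 kHz to the output rate
 * (assumed to be a multiplication by the decimation factor). */
static int map_pitch(int pitch_8k, int fs_hz)
{
    return pitch_8k * decimation_factor(fs_hz);
}

/* Scale the residual by 0.25 when the frame is flagged as voiced. */
static void scale_residual(float *res, size_t len, int voiced)
{
    if (voiced) {
        for (size_t n = 0; n < len; n++) {
            res[n] *= 0.25f;
        }
    }
}
```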

This residual signal is repeated iteratively by adding blocks of variable length until the length of 2 frames (40 ms) is reached. The start index for the residual repetition is initialized to . In the -th iteration:

– A block length is pseudo-randomly computed by alternating between and , where is a random number between 0 and 1.

– A sine window of length is computed as:

, (198)

This calculation is performed by running an autoregressive filter of order 2.

– Two blocks are extracted from the residual signal:, and , . Note that the blocks overlap with each other and the length of the overlap depends on the value of in the current iteration.

– These two blocks are overlap-added to update the noise vector from the current start index :

(199)

Note that the upper limit of the time interval is actually saturated to .

– The start index is updated:

The iterations stop as soon as .

5.4.3.6.5 Synthesis

The signal is synthesized as:

, (200)

Note that when the binary voicing indicator has the value =1, the noise vector has been scaled down by a factor of 0.25, to avoid artefacts for voiced signals.

This signal is overlap-added with the previously decoded synthesis to ensure signal continuity between frames.

5.4.3.7 Time-domain PLC and OLA

5.4.3.7.1 PLC mode selection

The frequency domain PLC block includes a frequency domain erasure concealment algorithm and operates when the BFI flag is set to 1 and the decoding mode of the previous frame is the frequency domain mode. The frequency domain PLC block generates spectral coefficients of the erased frame by repeating the synthesized spectral coefficients of the previous good frame stored in memory. With these coefficients the IMDCT block generates the time domain signal by performing a time-frequency inverse transform. The conventional OLA block performs a general OLA processing by using the time domain signal of the previous frame, and generates a final time domain signal of the current frame as a result of the general OLA processing.

To achieve an additional quality enhancement taking into account the input signal characteristics, the time-domain PLC introduces two concealment tools, consisting of a phase matching tool and a repetition and smoothing tool. With these tools, an appropriate concealment method is selected by checking the stationarity of the input signal.

Figure 7 shows the two concealment tools and the conventional OLA for the time-domain PLC.

The phase matching block in the figure will be introduced in subclause 5.4.3.7.2 and the repetition and smoothing block in the figure will be introduced in subclause 5.4.3.7.3.

Figure 7: Block diagram of a Time-domain PLC module

Table 12 summarizes the PLC modes that are used for time-domain PLC. There are two tools for the time-domain PLC. Each of these tools has several modes representing the erased frame types. The erased frame types are classified as single erasure frame, burst erasure frame, next good frame after erasure frame, and next good frame after burst erasure.

Table 12: Used PLC modes for Time-domain PLC

Name of tools	Single erasure frame	Burst erasure frame	Next good frame	Next good frame after burst erasures
Phase matching	Phase matching for erased frame	Phase matching for burst erasures	Phase matching for next good frame	Phase matching for next good frame
Repetition & Smoothing	Repetition &smoothing for erased frame	Repetition &smoothing for erased frame	Repetition &smoothing for next good frame	Next good frame after burst erasures

Table 13 summarizes the PLC mode selection method for the PLC mode selection block in Figure 7.

Table 13: PLC mode selection

Parameters	Status of Parameters						Definitions
BFI	1	0	1	1	0	0	Bad frame indicator for the current frame
Prev_BFI	–	1	1	–	1	1	BFI for the previous frame
nbLostCmpt	1	–	–	–	–	>1	The number of contiguous erased frames
Phase_mat_flag	1	–	–	0	0	0	The flag for the Phase matching process (1: used, 0: not used)
Phase_mat_next	0	1	1	0	0	0	The flag for the Phase matching process for burst erasures or next good frame (1: used, 0: not used)
stat_mode_out	–	–	–	(1)^*	(1)^*	0	The flag for Repetition &smoothing process (1: used, 0: not used)
diff_energy	–	–	–	(<0.159063)^*	(<0.159063)^*	0.159063	Energy difference
Selected PLC mode	Phase matching for erased frame	Phase matching for next good frame	Phase matching for burst erasures	Repetition &smoothing for erased frame	Repetition &smoothing for next good frame	Next good frame after burst erasures
Name of tools	Phase matching			Repetition and Smoothing
NOTE: ^* Conditions shown in parentheses are connected by "OR".

The pseudo code to select a PLC mode for the phase matching tool is as follows.

if( (nbLostCmpt==1)&&(phase_mat_flag==1)&&(phase_mat_next==0) ) {
    Phase matching for erased frame ();
}
else if((prev_bfi == 1)&&(bfi == 0) &&(phase_mat_next == 1)) {
    Phase matching for next good frame ();
}
else if((prev_bfi == 1)&&(bfi == 1) &&(phase_mat_next == 1)) {
    Phase matching for burst erasures ();
}
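For readers who want to exercise this decision logic, the pseudo code can be expressed as a compilable C function. This is a sketch that mirrors the pseudo code only; the enum `pm_mode_t`, its values and the function name are hypothetical, and the value `PM_NONE` (none of the phase matching modes applies) is an addition for illustration.

```c
/* Hypothetical labels for the three phase matching modes. */
typedef enum {
    PM_NONE = 0,            /* phase matching tool not selected */
    PM_ERASED_FRAME,        /* phase matching for erased frame */
    PM_NEXT_GOOD_FRAME,     /* phase matching for next good frame */
    PM_BURST_ERASURE        /* phase matching for burst erasures */
} pm_mode_t;

/* Mirrors the three-way test of the pseudo code, in the same order. */
static pm_mode_t select_phase_matching_mode(int bfi, int prev_bfi,
                                            int nb_lost_cmpt,
                                            int phase_mat_flag,
                                            int phase_mat_next)
{
    if (nb_lost_cmpt == 1 && phase_mat_flag == 1 && phase_mat_next == 0) {
        return PM_ERASED_FRAME;
    }
    if (prev_bfi == 1 && bfi == 0 && phase_mat_next == 1) {
        return PM_NEXT_GOOD_FRAME;
    }
    if (prev_bfi == 1 && bfi == 1 && phase_mat_next == 1) {
        return PM_BURST_ERASURE;
    }
    return PM_NONE;
}
```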

With this selection method, the phase matching flag (phase_mat_flag) is determined for every good frame, at the point of the memory update block, and indicates whether phase matching erasure concealment processing is to be used if an erasure occurs in the next frame. To this end, the energy and spectral coefficients of each sub-band are used. The energy is obtained from the norm value. More specifically, when the sub-band having the maximum energy in the current frame belongs to a predetermined low frequency band and the inter-frame energy change is not large, the phase matching flag is set to 1.

The detailed method is as follows. When a sub-band having the maximum energy in the current frame is within the range of 75 Hz to 1000 Hz, a difference between the index of the current frame and the index of a previous frame with respect to a corresponding sub-band is 1 or less, and the current frame is a stationary frame of which an energy change is less than the threshold (ED_THRES_90P), and three past frames stored in the buffer are not transient frames, then phase matching erasure concealment processing will be applied to a next frame to which an erasure has occurred.

if ((Min_ind<5) && (abs(Min_ind - old_Min_ind)<2) && (diff_energy<ED_THRES_90P) && (!bfi) && (!prev_bfi) && (!prev_old_bfi) && (!is_transient) && (!old_is_transient[1])) {
    if((Min_ind==0) && (Max_ind<3)) {
        phase_mat_flag = 0;
    }
    else {
        phase_mat_flag = 1;
    }
}
else {
    phase_mat_flag = 0;
}

The PLC mode selection method for the repetition and smoothing tool and the conventional OLA is as follows.

The stationarity detection of an erased frame is performed by the memory update block. A hysteresis is introduced in this detection to prevent frequent changes of the detected result. The stationarity detection determines whether the current erased frame is stationary from information that includes the stationary mode stat_mode_old of the previous frame and the energy difference diff_energy. Specifically, the stationary mode flag stat_mode_curr of the current frame is set to 1 when the energy difference diff_energy is less than 0.032209. The energy difference (E_d) is given by Equation (169).

If it is determined that the current frame is stationary, the hysteresis application generates a final stationarity parameter stat_mode_out for the current frame by applying the stationarity mode parameter stat_mode_old of the previous frame, to prevent frequent changes in the stationarity information of the current frame. The pseudo code for the hysteresis application is as follows.

/* Apply Hysteresis to prevent frequent mode changing */
if(stat_mode_old == stat_mode_curr)
{
    stat_mode_out = stat_mode_curr;
}
stat_mode_old = stat_mode_curr;
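The effect of the hysteresis is that stat_mode_out only changes after the per-frame decision has been the same for two consecutive frames. The following C sketch combines the threshold test and the hysteresis update in one function to make this visible; the struct `stat_state_t` and the function name are hypothetical, and the threshold is passed as a parameter (0.032209 in the text above).

```c
/* Hypothetical container for the stationarity memory. */
typedef struct {
    int stat_mode_old;   /* per-frame decision of the previous frame */
    int stat_mode_out;   /* decision after hysteresis */
} stat_state_t;

/* Update the stationarity decision for one frame and return
 * the value of stat_mode_out after hysteresis. */
static int update_stationarity(stat_state_t *st, float diff_energy,
                               float threshold)
{
    int stat_mode_curr = (diff_energy < threshold) ? 1 : 0;

    /* The output follows the current decision only if it agrees
     * with the decision of the previous frame. */
    if (st->stat_mode_old == stat_mode_curr) {
        st->stat_mode_out = stat_mode_curr;
    }
    st->stat_mode_old = stat_mode_curr;
    return st->stat_mode_out;
}
```

Starting from a non-stationary state, a single frame below the threshold leaves stat_mode_out at 0; a second consecutive frame below the threshold switches it to 1.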

The operation of the PLC mode selection depends on whether the current frame is an erased frame or the next good frame after an erased frame. For an erased frame (see Table 13), a determination is made whether the input signal is stationary by using various parameters. More specifically, when the previous good frame is stationary and the energy difference is less than the threshold, it is concluded that the input signal is stationary. In this case, the repetition and smoothing processing is performed. If it is determined that the input signal is not stationary, the general OLA processing is performed.

For the next good frame after an erased frame (see Table 13), the determination whether the input signal is stationary is made by using the same parameters and the same method. If the input signal is not stationary, a determination is made whether the previous frame is a burst erasure frame by checking whether the number of consecutive erased frames is greater than one. If this is the case, erasure concealment processing for the next good frame after burst erasures is performed. If the input signal is not stationary and the previous frame is a single (random) erasure, the conventional OLA processing is performed.

If the input signal is stationary, then the erasure concealment processing, i.e. repetition and smoothing processing, on the next good frame is performed in response to the previous frame that is erased. This repetition and smoothing for next good frame has two types of concealment methods. One is repetition and smoothing method for the next good frame after an erased frame, and the other is repetition and smoothing method for the next good frame after burst erasures.

The pseudo code to select a PLC mode for the Repetition and Smoothing tool and the conventional OLA is as follows.

if(BFI == 0 && st->prev_BFI == 1) {
    if((stat_mode_out==1) || (diff_energy<0.032209) ) {
        Repetition &smoothing for next good frame ();
    }
    else if(nbLostCmpt > 1) {
        Next good frame after burst erasures ();
    }
    else {
        Conventional OLA ();
    }
}
else { /* if(BFI == 1) */
    if( (stat_mode_out==1) || (diff_energy<0.032209) ) {
        /* Fall back to conventional OLA if the energy check fails */
        if(Repetition &smoothing for erased frame () ) {
            Conventional OLA ();
        }
    }
    else {
        Conventional OLA ();
    }
}
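The same decision tree can be written as a compilable C function that returns the selected mode. This is a sketch following the pseudo code, with hypothetical enum and function names. Two points are assumptions made for the illustration: a normal good frame (BFI and Prev_BFI both 0) is mapped to conventional OLA, and the fallback from repetition and smoothing for an erased frame to conventional OLA (the energy comparison) is left to the caller.

```c
/* Hypothetical labels for the OLA-related PLC modes. */
typedef enum {
    OLA_CONVENTIONAL = 0,
    OLA_RS_ERASED_FRAME,          /* repetition & smoothing, erased frame */
    OLA_RS_NEXT_GOOD_FRAME,       /* repetition & smoothing, next good frame */
    OLA_NEXT_GOOD_AFTER_BURST     /* next good frame after burst erasures */
} ola_mode_t;

static ola_mode_t select_ola_mode(int bfi, int prev_bfi, int nb_lost_cmpt,
                                  int stat_mode_out, float diff_energy)
{
    /* Stationarity test shared by both branches of the pseudo code. */
    int stationary = (stat_mode_out == 1) || (diff_energy < 0.032209f);

    if (bfi == 0 && prev_bfi == 1) {
        if (stationary) {
            return OLA_RS_NEXT_GOOD_FRAME;
        }
        if (nb_lost_cmpt > 1) {
            return OLA_NEXT_GOOD_AFTER_BURST;
        }
        return OLA_CONVENTIONAL;
    }
    if (bfi == 1) {
        return stationary ? OLA_RS_ERASED_FRAME : OLA_CONVENTIONAL;
    }
    /* Assumption: a good frame following a good frame uses normal OLA. */
    return OLA_CONVENTIONAL;
}
```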

5.4.3.7.2 Phase matching

Figure 8 is a block diagram of the phase matching PLC module. The phase matching tool includes a PLC mode selection block and three phase matching packet loss concealment blocks.

Basically the phase matching error concealment performs phase matching packet loss concealment processing on a current erased frame when the previous good frame has the maximum energy in a predetermined low frequency band and the change in energy is less than a predetermined threshold.

The phase matching tool does not use the conventional OLA block but generates the time domain signal for the current erased frame by copying the phase-matched time domain signal obtained from the previous good frames. Once the phase matching tool is used for an erased frame, the tool shall also be used for the next good frame or subsequent burst erasures. For the next good frame, the phase matching for next good frame tool is used. For subsequent burst erasures, the phase matching tool for burst erasures is used.

The phase matching tool for the next good frame performs phase matching packet loss concealment processing on the current frame when the previous frame is an erasure and when phase matching error concealment processing on the previous frame has been performed.

The phase matching function for burst erasures performs phase matching packet loss concealment processing on the current frame that is part of a burst erasure when the previous frame is an erasure and phase matching error concealment processing on the previous frame has been performed.

Figure 8: Block diagram of a phase matching PLC module

Figure 9 shows a block diagram of the phase matching for erased frame block in Figure 8. In order to use the phase matching tool, phase_mat_flag shall be set to 1. Even when this condition is satisfied, a second condition shall also be satisfied: a correlation scale accA is obtained, and either phase matching erasure concealment processing or general OLA processing is selected depending on whether accA is within a predetermined range. That is, phase matching packet loss concealment processing is performed conditionally, depending on the correlation between the segments in the search range and on the cross-correlation between the search segment and those segments. The correlation scale is given by Equation (201).

(201)

In Equation (201), denotes the number of segments existing in the search range, denotes a cross-correlation used to search for the matching segment having the same length as the search segment ( signal) with respect to the past good frames ( signal) stored in the buffer, and denotes a correlation between segments existing in the past good frames stored in the buffer. Next, it is determined whether the correlation scale is within the predetermined range. If the correlation scale is less than 0.5 or greater than 1.5, the conventional OLA processing is performed on the current frame. Otherwise, phase matching erasure concealment processing is performed on the current erased frame.

The phase matching packet loss concealment processing includes a maximum correlation search block, a copying block, a smoothing block, and a memory update block. The maximum correlation search block searches for a matching segment, which has the maximum correlation to, i.e. is most similar to, a search segment adjacent to a current frame, from a decoded signal in a previous good frame from among N past good frames stored in a buffer. A position index of the matching segment obtained as a result of the search is provided to the copying block.

The copying block copies a predetermined duration, starting from the end of the matching segment, to the current erased frame by referring to the position index of the matching segment. The copied duration corresponds to a window length. When the signal available from the end of the matching segment is shorter than the window length, it is copied repeatedly into the current frame until the window length is filled.

The smoothing block generates a time domain signal for the concealed current frame by performing smoothing processing through OLA to minimize the discontinuity between the current frame and adjacent frames. After smoothing, the memory update for the phase matching is performed in the memory update block.
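The search and copy operations can be illustrated with the following C sketch. It is a simplified illustration rather than the codec's procedure: the similarity measure (squared cross-correlation divided by the candidate energy, restricted to positive correlation) is an assumption chosen to avoid a square root, and the names `find_match`, `copy_after_match` and `limit` are hypothetical. Smoothing and memory update are not shown.

```c
#include <stddef.h>

/* Search hist[] for the start index of the segment most similar to
 * seg[0..seg_len-1]. Candidates are all starts s with
 * s + seg_len <= limit. Similarity is c*c/e with c > 0, where c is the
 * cross-correlation and e the candidate energy (assumed criterion).
 * Returns 0 if no candidate has positive correlation. */
static size_t find_match(const float *hist, size_t limit,
                         const float *seg, size_t seg_len)
{
    size_t best = 0;
    double best_c = 0.0, best_e = 1.0;
    int found = 0;

    for (size_t s = 0; s + seg_len <= limit; s++) {
        double c = 0.0, e = 0.0;
        for (size_t i = 0; i < seg_len; i++) {
            c += (double)hist[s + i] * seg[i];
            e += (double)hist[s + i] * hist[s + i];
        }
        /* Compare c*c/e against best_c*best_c/best_e without dividing. */
        if (c > 0.0 && e > 0.0 &&
            (!found || c * c * best_e > best_c * best_c * e)) {
            best = s;
            best_c = c;
            best_e = e;
            found = 1;
        }
    }
    return best;
}

/* Fill out[0..out_len-1] with the signal that follows the matching
 * segment, i.e. hist[match_end..hist_len-1], repeating it when it is
 * shorter than out_len. Requires match_end < hist_len. */
static void copy_after_match(const float *hist, size_t hist_len,
                             size_t match_end, float *out, size_t out_len)
{
    size_t avail = hist_len - match_end;
    for (size_t n = 0; n < out_len; n++) {
        out[n] = hist[match_end + (n % avail)];
    }
}
```

In a typical use, the search segment is the last `seg_len` samples of the history buffer and `limit` is set to `hist_len - seg_len`, so that the search segment does not match itself.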

Figure 9: Block diagram of the phase matching for erased frame

Figure 10 illustrates the operation of phase matching erasure concealment described in Figure 9. Referring to Figure 10, the decoded signal from a previous frame from among the N past good frames stored in a buffer is searched for a matching segment. When the copy process is completed, the overlapping process on a copied signal and on an Oldauout signal stored in the previous frame n-1 for overlapping is performed at the beginning part of the current frame n by a first overlap duration. The length of the overlap duration is 2 ms. This results in the generation of the final repeated signal.

Figure 10: The operation of phase matching erasure concealment

Phase matching for burst erasures, as shown in Figure 8, is described as follows. This method uses a smoothing process similar to that of phase matching for the first erased frame. Phase matching for burst erasures has neither the maximum correlation search block nor the copying block, as all information needed for these blocks can be reused from phase matching for the erased frame. The only difference in the smoothing block is that the smoothing is done between the signal corresponding to the overlap duration of the copied signal and the Oldauout signal stored in the current frame n for overlapping purposes. This Oldauout is actually a signal copied by the phase matching process in the previous frame.

The phase matching for next good frame in Figure 8 is described as follows.

This method uses the mean_en_high parameter, which denotes the mean energy of the high bands and indicates the similarity of the last good frames. This parameter is calculated by the following equation:

(202)

where is start band index of the determined high bands.

If mean_en_high is larger than 2.0 or smaller than 0.5, oldout_pha_idx is set to 1. oldout_pha_idx is used as a switch that selects which Oldauout memory is used. Two sets of Oldauout are saved in both the phase matching for erased frame block and the phase matching for burst erasures block. The 1st Oldauout is generated from a signal copied by the phase matching process, and the 2nd Oldauout is generated from the time domain signal resulting from the IMDCT. If oldout_pha_idx is set to 1, the high band signal is unstable and the 2nd Oldauout is used for the OLA process in the next good frame. If oldout_pha_idx is set to 0, the high band signal is stable and the 1st Oldauout is used for the OLA process in the next good frame.

if((mean_en_high>2.0)||(mean_en_high<0.5)) {
    oldout_pha_idx = 1;
}
else {
    oldout_pha_idx = 0;
}

5.4.3.7.3 Repetition and smoothing

Figure 11 depicts the repetition and smoothing tool (OLA modes for time-domain PLC).

Figure 11: Repetition and smoothing

Each tool in the block diagram is described as follows. Figure 12 is a block diagram of the conventional OLA method. The conventional OLA method includes a windowing block and an OLA block. Referring to Figure 12, the windowing block performs a windowing process on the IMDCT signal of the current frame to remove time domain aliasing. The case of a window having an overlap duration of less than 50% is described below with reference to Figure 13. The OLA block performs OLA processing on the windowed IMDCT signal.

Figure 13 illustrates the general OLA method with the window format for concealing an erased frame. When an erasure occurs in frequency domain encoding, past spectral coefficients are usually repeated, and thus, it may be impossible to remove time domain aliasing in the erased frame.

Figure 12: Block diagram of conventional OLA

Figure 13: Diagram for describing a windowing of conventional OLA

Figure 14 is a block diagram of the repetition and smoothing method for an erased frame. When the current frame is an erasure, and if a method of repeating past spectral coefficients obtained in the frequency domain is used, and if OLA processing is performed after IMDCT and windowing, a time domain aliasing component in the beginning part of the current frame is modified. Thus perfect reconstruction is not possible, thereby resulting in unexpected noise. The repetition and smoothing method is used to minimize the occurrence of noise even though the original repetition method is used.

The repetition and smoothing method includes a windowing block, a repetition block, and a smoothing block. Referring to Figure 14, the windowing block performs the same operation as the windowing block of Figure 12. The repetition block applies the IMDCT signal of the frame that is two frames previous to the current frame (referred to as "previous old" in Figure 15) to the beginning part of the current erased frame.

The smoothing block consists of the OLA unit and the smoothing unit. The OLA unit performs OLA processing on the signal repeated by the repetition block and the IMDCT signal of the current frame. As a result, the audio output signal of the current frame is generated, and the occurrence of noise in the beginning part of the audio output signal is reduced. When scaling is applied together with the repetition of the spectrum of the previous frame in the frequency domain, the likelihood of noise occurring in the beginning part of the current frame is greatly reduced.

The smoothing unit applies a smoothing window between the signal of the previous frame (old audio output) and the signal of the current frame (referred to as "current audio output") and performs OLA processing. The smoothing window is formed such that the sum of overlap durations between adjacent windows is equal to one. In the EVS codec, the sine wave window is used, and in this case, the window function is represented by Equation (203).

(203)

In Equation (203), denotes the duration of the overlap to be used in the smoothing processing. By performing smoothing processing as described above, when the current frame is an erasure, the discontinuity between the previous frame and the current frame, which may occur by using an IMDCT signal copied from the frame that is two frames previous to the current frame instead of an IMDCT signal stored in the previous frame, is prevented.

After completion of the repetition and smoothing, the energy of an overlapping region is compared with the energy of a non-overlapping region. When the energy of the overlapping region decreases after the packet loss concealment processing, conventional OLA processing is performed. The comparison is made by the operation depicted in Figure 14.

If the energy difference between the overlapping region () and the non-overlapping region () is large as a result of the comparison in the block, conventional OLA processing is performed.

Figure 15 illustrates the repetition and smoothing method with an example window for concealing an erased frame.

Figure 14: Block diagram of repetition & smoothing method for erased frame

Figure 15: Diagram for describing a windowing of repetition & smoothing method for erased frame

Figure 16 is a block diagram of the repetition and smoothing method for the next good frame after an erased frame. This method only includes the smoothing block. The smoothing block applies the smoothing window to the old IMDCT signal and to a current IMDCT signal and performs OLA processing. Likewise, the smoothing window is formed such that a sum of overlap durations between adjacent windows is equal to one. That is, when the previous frame is a first erased frame and a current frame is a good frame, it is difficult to remove time domain aliasing in the overlap duration between an IMDCT signal of the previous frame and an IMDCT signal of the current frame. Thus, noise can be minimized by performing the smoothing processing based on the smoothing window instead of the conventional OLA processing. Figure 17 illustrates the repetition and smoothing method with an example of a window for smoothing the next good frame after an erased frame.

Figure 16: Block diagram of repetition and smoothing method for the next good frame after an erased frame

Figure 17: Diagram for describing a windowing of repetition and smoothing method for the next good frame after an erased frame

If the input signal is stationary and the previous frame is a burst erasure frame, then the repetition and smoothing method for the next good frame after the multiple erased frames as depicted in Figure 18 is used.

This method includes a repetition block, a scaling block, a first smoothing block, and a second smoothing block. Referring to Figure 18, the repetition block copies, to the beginning part of the current frame, the part of the IMDCT signal of the current frame that is used for the next frame. The scaling block adjusts the scale of the current frame to prevent a sudden signal increase. In the EVS codec, the scaling block performs down-scaling by 3 dB.

The first smoothing block applies a smoothing window to the IMDCT signal of the previous frame and the IMDCT signal copied from the future frame and performs OLA processing. Likewise, the smoothing window is formed such that the sum of overlap durations between adjacent windows is equal to one. That is, when the copied signal is used, windowing is necessary to remove the discontinuity which may occur between the previous frame and the current frame, and the old IMDCT signal may be replaced with the signal obtained by the OLA processing of the first smoothing block.

The second smoothing block performs the OLA processing while removing the discontinuity by applying a smoothing window between the old IMDCT signal, which is the replaced signal, and the current IMDCT signal, which is the current frame signal. Likewise, the smoothing window is formed such that the sum of overlap durations between adjacent windows is equal to one. That is, when the previous frame is a burst erasure and the current frame is a good frame, time domain aliasing in the overlap duration between the IMDCT signal of the previous frame and the IMDCT signal of the current frame cannot be removed. In the burst erasure frame, since noise may occur due to a decrease in energy or continuous repetitions, the method of copying a signal from the future frame for overlapping with the current frame is applied. In this case, smoothing processing is performed twice to remove the noise which may occur in the current frame and simultaneously remove the discontinuity which occurs between the previous frame and the current frame. Figure 19 illustrates the repetition and smoothing method with an example window for smoothing the next good frame after burst erasures.

Figure 18: Block diagram of repetition and smoothing method for the next good frame after burst erasures

Figure 19: Diagram for describing a windowing of repetition and smoothing method for the next good frame after burst erasures

Figure 20 is the block diagram of the next good frame after burst erasures shown in Figure 11. Regarding the usage of the future signal, the main operation is the same as that of the repetition and smoothing method for the next good frame after burst erasures shown in Figure 18.

This method includes a repetition block, a scaling block, a smoothing block, and an OLA block. Referring to Figure 20, the repetition block, scaling block, and smoothing block are exactly the same as those of Figure 18. Instead of the second smoothing block, the next good frame after burst erasures uses the OLA between the replaced OldauOut signal and the current IMDCT signal. Figure 21 illustrates the next good frame after burst erasures.

Figure 20: Block diagram of the next good frame after burst erasures

Figure 21: Diagram for describing a windowing of the next good frame after burst erasures