5.4.2 TCX MDCT
3GPP TS 26.447, Codec for Enhanced Voice Services (EVS); Error concealment of lost packets (Release 17)
5.4.2.1 PLC method selection
If the last good frame prior to a loss was coded with the MDCT-based TCX, several specifically optimized PLC methods are available. One of them is selected based on the second-level criteria described in this subclause. The PLC methods are:
– TCX time domain concealment
– MDCT frame repetition with sign scrambling
– tonal MDCT concealment using phase prediction
– non-tonal concealment with waveform adjustment
The criteria evaluated in this second-level PLC method selection are:
– Last MDCT mode: The MDCT mode of the last good frame is obtained by decoding the bitstream in every good frame.
– Number of consecutively lost frames: The number of consecutively lost frames is incremented on each frame loss and is reset when a good frame is received.
– Last unmodified LTP gain: If LTP information is updated in the last good frame, the variable contains the LTP gain, and otherwise it is zero.
– Tonal MDCT peak detection flag: The flag describes whether tonal MDCT concealment using phase prediction should be done. It is set to zero by default and remains zero if one of the following conditions is true:
– the last core or the second last core is not mode TCX20
– the last unmodified LTP gain is bigger than 0.4 and the last pitch is bigger than
– the last pitch differs from the second last pitch
– TNS was active in the last or second last frame
Otherwise, the flag is set to one if the output of the peak detection of tonal components (see subclause 5.4.2.4.2) matches one of the following criteria:
– the number of found peaks is higher than 10; or
– the number of found peaks is higher than 5 and the difference between the 3rd and 2nd last pitch is smaller than 0.5; or
– at least one peak is found and the last good frame was either UNVOICED_TRANSITION or UNVOICED_CLAS and the difference between the 3rd and 2nd last pitch is smaller than 0.5 and the last unmodified LTP gain is .
– Flag enabling non-tonal concealment with waveform adjustment: The flag is set to one if the bit rate is one out of the set of {48 kbps, 96 kbps, 128 kbps}.
– Intelligent gap filling: The intelligent gap filling flag describes whether intelligent gap filling is active (1) or not (0) (see subclause 5.4.2.6).
– TCX_Tonality flag array: An array of tonality flags of the last ten received frames (see subclause 5.4.5.3a).
The PLC method is decided using the criteria listed above. The selection is performed only in the first lost frame after a good frame and is retained in subsequent lost frames.
TCX time domain concealment is selected if:
– flag is zero; and
– is TCX_CORE and and the last good frame was neither UNVOICED_TRANSITION nor UNVOICED_CLAS.
In all other cases, the three MDCT-based concealment methods are selected as described below.
MDCT frame repetition with sign scrambling is selected if:
– is one (in conjunction with tonal MDCT concealment using phase prediction); or
– is zero and non-tonal concealment with waveform adjustment is not active.
Tonal MDCT Concealment using phase prediction is selected if:
– is one
Non-tonal concealment with waveform adjustment is selected if:
– is one, is zero and there is no transition having a larger frame size than a normal TCX20 frame; and
– the lost frame is considered to be a non-tonal frame, which requires that the TCX_Tonality flag array contains five or fewer ones or that one out of the last three frames is not TCX20.
If an MDCT-based PLC mode is selected and is one, some missing information is added by the intelligent gap filling concealment.
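The fully specified parts of the selection criteria can be illustrated with a minimal Python sketch. It is not the reference implementation: the function names, the use of strings for the MDCT modes, the reading of "one out of the last three frames" as "at least one", and the reset value of zero for the loss counter are assumptions made for this example.

```python
# Bit rates (in bit/s) for which the text enables waveform adjustment.
WAVEFORM_ADJUST_BITRATES = {48000, 96000, 128000}


def waveform_adjustment_flag(bitrate_bps):
    """Return 1 if non-tonal concealment with waveform adjustment is enabled."""
    return 1 if bitrate_bps in WAVEFORM_ADJUST_BITRATES else 0


def is_non_tonal_frame(tcx_tonality, last_three_modes):
    """Non-tonal test on the TCX_Tonality array and the recent MDCT modes.

    tcx_tonality: tonality flags (0 or 1) of the last ten received frames.
    last_three_modes: mode names of the last three frames (hypothetical
    string encoding, e.g. "TCX20").
    """
    few_tonal_flags = sum(tcx_tonality) <= 5
    # Assumption: "one out of the last three" is read as "at least one".
    any_not_tcx20 = any(mode != "TCX20" for mode in last_three_modes)
    return few_tonal_flags or any_not_tcx20


def update_lost_frame_count(count, frame_lost):
    """Increment on a lost frame; reset (assumed to zero) on a good frame."""
    return count + 1 if frame_lost else 0
```

Because the method is chosen only in the first lost frame, a caller would evaluate these functions when the counter changes from zero to one and keep the result for the rest of the loss burst.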
5.4.2.2 TCX time domain concealment
The time domain PLC for TCX is called if the core of the last good frame was TCX and if the TCX time domain concealment was chosen in the PLC method selection described in subclause 5.4.2.1. This concealment method is similar to the ACELP-like concealment described in subclause 5.3.1. Because this method operates in the excitation domain, in order to shape the noise towards the vocal tract and to prevent discontinuities, a local LPC analysis is applied to the synthesized time domain signal of the last frame. To improve the LPC analysis, the signal is first filtered with the pre-emphasis filter described in subclause 5.1.3 of [5] to obtain . After that, an LPC analysis is applied on in the same way as in subclause 5.1.5 of [5], but with the frame length and the analysis window, whose first three quarters are a Hamming window and whose last quarter is a cosine window. The residual signal is obtained by filtering through the inverse filter in the same way as in subclause 5.2.2.4.1.1 of [5]. The local LPC parameters and the excitation signal are stored for use in case of multiple frame losses.
5.4.2.2.1 Construction of the periodic part of the excitation
If the last good frame was neither UNVOICED_CLAS nor UNVOICED_TRANSITION in combination with coder type being GENERIC, both a harmonic part and a random part of the excitation have to be generated to conceal erased frames. Otherwise, only a random part has to be generated. The harmonic part of the excitation is constructed by repeating the last pitch period of the previous frame. In the first erased frame after a good frame, if the ISF stability factor is lower than one, the first pitch cycle is first low-pass filtered. The filter depends on the core sampling rate and is an 11-tap linear phase FIR filter. The filter coefficients for core sampling rates lower than or equal to 16000 Hz are:
, (112)
for core sampling rate equal to 25600 Hz
(113)
and for higher core sampling rates
. (114)
The periodic part of the excitation is constructed as described in subclause 5.3.1 including the pitch extrapolation as described in subclause 5.3.1.1 and the glottal pulse resynchronization as described in subclause 5.3.1.3. The pitches used to get the pitch extrapolation are based on the LTP lag and gains coming from the last TCX frames.
The LTP lag and gain are sent in the bitstream as side information. The specific handling described in subclause 5.3.1.5 is used for TCX time domain concealment at all bitrates, together with the specific low-pass filtering of the first pitch cycle described above.
The gain of pitch,, is calculated on as follows:
(115)
where are samples of the pre-emphasized prior time data, is the length of a subframe in samples and is the rounded pitch period equal to the LTP lag of the last good frame. The gain of pitch is limited to the range between zero and one to prevent an unexpected increase of energy. The formed adaptive excitation is attenuated sample by sample throughout the frame, starting with one and ending with the damping factor calculated as in subclause 5.3.4.2.3. To obtain a proper overlap-add in case the next good frame is a valid TCX frame, an additional half frame is created in the same way as described above.
The attenuation strategy of the periodic part of the excitation is the same as done in subclause 5.3.4.2.1.
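The repetition and fade-out of the periodic part can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the gain ramp is assumed to be linear, and the pitch extrapolation, glottal pulse resynchronization and low-pass filtering of the first pitch cycle are left out. The function and parameter names are hypothetical.

```python
def clamp_pitch_gain(gain):
    """Limit the gain of pitch to the range [0, 1], as required by the text."""
    return min(1.0, max(0.0, gain))


def periodic_excitation(past_excitation, pitch_lag, frame_len, damping):
    """Repeat the last pitch period and fade the gain from 1.0 to `damping`.

    past_excitation: excitation samples of the last frame (list of floats).
    pitch_lag: rounded pitch period in samples.
    The linear shape of the ramp is an assumption of this sketch.
    """
    if not 0 < pitch_lag <= len(past_excitation):
        raise ValueError("pitch_lag must fit inside the past excitation")
    last_period = past_excitation[-pitch_lag:]
    out = []
    for n in range(frame_len):
        # Ramp position: 0.0 at the first sample, 1.0 at the last sample.
        t = n / (frame_len - 1) if frame_len > 1 else 1.0
        gain = 1.0 + (damping - 1.0) * t
        out.append(gain * last_period[n % pitch_lag])
    return out
```

The additional half frame needed for the overlap-add could be produced by calling the same routine with a longer `frame_len`, although the exact gain trajectory over that extension is not covered here.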
5.4.2.2.2 Construction of the random part of the excitation
The innovative (non-periodic) part of the excitation is generated by a simple random generator with approximately uniform distribution. If the last good frame was VOICED_CLAS or ONSET, the noise is pre-emphasis filtered as in [19] subclause 5.1.3, but with a pre-emphasis factor of 0.2 for core sampling rates lower than or equal to 16 kHz and 0.6 for all other rates. The filtering is applied to decrease the amount of noisy components in the lower-frequency speech region. Furthermore, to shift the noise towards higher frequencies, the noise is filtered by a 10th-order high-pass FIR filter in the first erased frame after a good frame, provided that the last good frame was neither UNVOICED_CLAS nor UNVOICED_TRANSITION. The filter coefficients are
(116)
for core sampling rates lower than or equal to 16000 Hz,
(117)
for the core sampling rate of 25600 Hz and
(118)
for all other rates. For the second and further lost frames, the noise is composed via a linear interpolation between the fullband and a highpass-filtered version of it as
(119)
where are noise samples generated as described at the beginning of this subclause, are filtered with the high-pass filter above and is a frame-wise cumulative factor of the damping factors. This ensures that the noise fades to fullband noise with a fading speed that depends on the damping factor.
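The interpolation between fullband and high-pass filtered noise can be illustrated with a minimal Python sketch. The exact form of equation (119) is not reproduced here. The weighting below is an assumption chosen to match the stated behaviour, namely that the output approaches the fullband noise as the cumulative factor decays. Treating the cumulative factor as a running product of the per-frame damping factors is also an assumption, and all names are hypothetical.

```python
def update_cumulative_damping(cumulative, damping):
    """Accumulate per-frame damping factors (product form is assumed)."""
    return cumulative * damping


def blend_noise(fullband, highpassed, cumulative):
    """Linear interpolation between high-pass filtered and fullband noise.

    cumulative = 1.0 yields purely high-pass filtered noise,
    cumulative = 0.0 yields purely fullband noise (assumed weighting).
    """
    if len(fullband) != len(highpassed):
        raise ValueError("noise buffers must have the same length")
    a = min(1.0, max(0.0, cumulative))
    return [a * h + (1.0 - a) * f for f, h in zip(fullband, highpassed)]
```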
The innovation gain,, which is used for adjusting the noise level, is calculated as
(120)
where is calculated as in equation (115). However, if the last good frame was neither UNVOICED_CLAS nor UNVOICED_TRANSITION in combination with coder type being GENERIC, is set to zero for calculating and the pitch buffer is reset.
The attenuation strategy of the random part of the excitation is somewhat different from the attenuation of the periodic excitation. The reason is that the pitch excitation converges to zero while the random excitation fades towards the background level described in subclause 5.3.4.2.1. The background level is limited to. The random part of the excitation is attenuated linearly throughout the frame on a sample-by-sample basis, starting with and going to the end-of-frame gain, which is
(121)
where is the gain in the last sample of the noise signal and is the damping factor as calculated in subclause 5.3.4.2.3. Due to the fact that is a relative component, the noise gets normalized. If the last good frame was UNVOICED_CLAS and the coder type is not UNVOICED, the innovative excitation is further attenuated by a factor of 0.8. Otherwise, if the last good frame was not UNVOICED_CLAS and not UNVOICED_TRANSITION, the excitation is further attenuated by .
To obtain a proper overlap-add in case the next good frame is a valid TCX frame, an additional half frame is created in the same way as described above.
5.4.2.2.3 Construction of the total excitation, synthesis and updates
Finally, the random part of the excitation is added to the adaptive excitation to form the total excitation signal. If the last good frame is UNVOICED_CLAS or the last good frame is UNVOICED_TRANSITION and the coder type is GENERIC, only the innovative excitation is used, as mentioned above. The synthesized signal is obtained by filtering the total excitation signal through the LP synthesis filter (see [5] subclause 6.1.3) with the locally calculated LPC parameters, and is post-processed with the de-emphasis filter, which is the inverse of the filter in [5] subclause 5.1.3.
If LTP information is available in the last good frame and is equal to zero then the is reset to zero. Finally, the overlap-add buffers are updated in the same way as in subclause 5.4.5.
5.4.2.3 MDCT frame repetition with sign scrambling
The excitation of the concealed frame (input to FDNS) is derived by sign scrambling of the last received excitation spectrum :
(122)
is the IGF cross over frequency. The is derived as
(123)
For any lost frame following a received frame, the initial value is reset:
(123a)
If the last 2 spectra are coded using TCX5, then the one with smaller energy is chosen.
The spectrum is faded towards noise as described in subclause 5.4.6.1.3.2.1.
5.4.2.4 Tonal MDCT concealment using phase prediction
5.4.2.4.1 Overview
The phase prediction described in subclause 5.4.2.4.3 is performed on the spectral coefficients belonging to tonal components found using the peak detection described in subclause 5.4.2.4.2. For the spectral coefficients not belonging to the tonal components, the sign scrambling is applied as described in subclause 5.4.2.3.
5.4.2.4.2 Peak detection of tonal components
Peak detection is performed if the current frame is lost but the previous frame has been received.
The peaks are first searched in the power spectrum of frame , using predefined thresholds. Based on the location of the peaks in frame , the thresholds for the search in the power spectrum of frame are adapted, whereas frame represents the second 10ms of frame and the first 10ms of frame . Thus, peaks existing in both spectra ( and ) are found. Their exact location is based on the power spectrum of frame .
The power spectra and are obtained as follows:
(124)
where represents the MDST coefficients and represents the MDCT coefficients and being the number of spectral coefficients. A minimum significant value of a spectral line in the power spectrum is assured by this operation:
(125)
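The power spectrum computation can be illustrated with a minimal Python sketch. It assumes the usual form in which the power at each bin is the sum of the squared MDCT and MDST coefficients, followed by a lower bound. The value of the lower bound is not reproduced in the text above and is therefore passed in as a parameter. The names are hypothetical.

```python
def power_spectrum(mdct, mdst, floor):
    """Per-bin power from MDCT and MDST coefficients with a lower bound.

    Assumes P(k) = MDCT(k)^2 + MDST(k)^2; `floor` stands for the minimum
    significant value, which must be taken from the specification.
    """
    if len(mdct) != len(mdst):
        raise ValueError("MDCT and MDST must have the same number of bins")
    return [max(c * c + s * s, floor) for c, s in zip(mdct, mdst)]
```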
and are derived from the time domain signal via MDCT/MDST. is given and is estimated:
(126)
If the change of the pitch lag between the last and the second last frame is larger than or equal to 0.25 or the pitch lag is smaller than 10ms (corresponding to ), the index of the fundamental frequency is set to zero. Otherwise the index of the fundamental frequency is determined as:
. (128)
The 10 strongest peaks are found at the positions . The distances between the peaks are calculated as . The most common among the differences is . If there are at least 3 equal to 1 and if ; or fewer than 5 are equal to , then is not changed. If there are more than 5 equal to , then is set to . Otherwise is set to 0.
An envelope of each power spectrum is calculated using a moving average filter:
. (129)
The filter length depends on the index of the fundamental frequency and is limited to the range [11,23], as shown in Table 8. If the fundamental frequency is not available or not reliable, the filter length FL is set to 15, otherwise:
. (130)
Table 8: Filter length depending on the fundamental frequency
F0 | FL
0 | 15
<= 10 | 11
>= 22 | 23
else |
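The filter length lookup of Table 8 and the moving average envelope can be illustrated with a minimal Python sketch. The value for the "else" row comes from equation (130), which is not reproduced here, so the caller supplies it as `mid_range_length`. The handling of the spectrum edges (a shortened window) is an assumption, and all names are hypothetical.

```python
def filter_length(f0_index, mid_range_length):
    """Filter length FL following the rows of Table 8.

    mid_range_length: value from equation (130) for the "else" row,
    computed by the caller; it is clamped to the range [11, 23].
    """
    if f0_index == 0:
        return 15
    if f0_index <= 10:
        return 11
    if f0_index >= 22:
        return 23
    return min(23, max(11, mid_range_length))


def moving_average_envelope(power, filter_len):
    """Centered moving average of the power spectrum.

    At the spectrum edges the window is shortened to the available bins
    (assumption of this sketch).
    """
    if filter_len < 1 or filter_len % 2 == 0:
        raise ValueError("filter_len must be a positive odd number")
    half = filter_len // 2
    n = len(power)
    envelope = []
    for k in range(n):
        lo = max(0, k - half)
        hi = min(n, k + half + 1)
        envelope.append(sum(power[lo:hi]) / (hi - lo))
    return envelope
```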
The smoothed power spectra are calculated as follows:
(131)
5.4.2.4.2.1 Detection of the peak candidates
If the smoothed spectrum is above the envelope at bin and the smoothed spectrum at bin is bigger than at bins and , is treated as peak candidate and the right and left foot of this peak candidate are searched for.
The right foot is defined as the spectral bin with index , for which
(132)
and
(133)
It is also allowed for an that is true, but only if and if there is a for which:
(134)
and
(135)
The left foot is defined in the same way as the right foot, but on the left side of the bin .
The local maximum is then found between the left and the right foot.
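The candidate test described at the start of this subclause can be illustrated with a minimal Python sketch. It assumes that the neighbouring bins referred to in the text are the bins immediately to the left and right of the tested bin. The foot search of equations (132) to (135) is not included, and the function name is hypothetical.

```python
def peak_candidates(smoothed, envelope):
    """Return the bins that qualify as peak candidates.

    A bin k is a candidate if the smoothed spectrum exceeds the envelope
    at k and is larger than at bins k-1 and k+1 (assumed neighbours).
    """
    if len(smoothed) != len(envelope):
        raise ValueError("spectrum and envelope must have the same length")
    candidates = []
    for k in range(1, len(smoothed) - 1):
        above_envelope = smoothed[k] > envelope[k]
        local_max = smoothed[k] > smoothed[k - 1] and smoothed[k] > smoothed[k + 1]
        if above_envelope and local_max:
            candidates.append(k)
    return candidates
```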
The thresholds for the peak search in are set at positions as:
(136)
If the change of the pitch between the last and the second last frame is smaller than 0.5, then for each
– , for each , being the number of the harmonics of ,
– , for each ,
thresholds are updated as follows:
(137)
with
(138)
For all bins not belonging to peaks or harmonics the threshold is set as:
. (140)
Note: The base threshold 7.59, as given in equation 129, corresponds to . All other thresholds, represented by , are given relative to this base threshold. Thus,
– 0.35 corresponds to
– 0.7 corresponds to
– 1.1 corresponds to
– 1.5 corresponds to
– 16 corresponds to
5.4.2.4.2.2 Final detection of the tonal components
After setting the thresholds as described in subclause 5.4.2.4.2.1, peaks detected in frame are now searched for in the power spectrum of frame .
If the following is fulfilled:
(141)
the right and left foot of the peak is searched for in around . The algorithm for the foot search is the same as the one in subclause 5.4.2.4.2.1.
The local maximum is then found between the left and the right foot.
A tonal component is defined as the set of spectral bins . If two neighboring tonal components would overlap, their surroundings are symmetrically reduced such that each spectral bin belongs only to one tonal component. All tonal components then build the set .
5.4.2.4.3 Phase prediction
For all found tonal components , that include spectrum peaks and their surroundings, as described in subclause 5.4.2.4.2.2, the MDCT phase prediction is used. For all other spectrum coefficients sign scrambling described in subclause 5.4.2.3 is used.
The phases are derived for each bin of a tonal component as:
, (142)
The fractional part is given by:
(143)
with a given in Table 9, depending on the neighboring bins around a spectral peak .
Table 9: Variable a from equation (143)
if |
a |
else |
Where the bandwidth b is 7, the maximum ratio is 44.8 and the constant G is .
The phase shift, being the same for every spectrum bin in , is derived as follows
, (144)
where is the index of the bin closest to the peak and is the fractional part (i.e. distance of the peak from given as the fractional number of bins).
The current phase is estimated for each using:
(145)
where for the first concealed frame and is increased by 1 for every consecutive frame loss. The corresponding MDCT bins are estimated as:
. (146)
5.4.2.5 Non-tonal concealment with waveform adjustment
5.4.2.5.1 Preliminary concealment in frequency domain
The MDCT coefficients of the current lost frame are computed by using the MDCT coefficients of the frame prior to the current lost frame as follows:
The MDCT coefficients at all frequency points of the frame prior to the current lost frame are multiplied by random signs to obtain the MDCT coefficients of all frequency points of the current lost frame. In other words, when the current lost frame is the frame,
(147)
wherein is the MDCT coefficient at the frequency point of the frame, is the total number of the frequency points, and is the random sign at the frequency point .
The obtained MDCT coefficients of the current lost frame are transformed by an IMDCT to produce the initially compensated signal of the current lost frame.
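The preliminary concealment step can be illustrated with a minimal Python sketch. It uses Python's standard pseudo-random generator as a stand-in for the codec's own random sign generator, which is an assumption of this example. The IMDCT that follows is not shown, and the function name is hypothetical.

```python
import random


def sign_scramble(prev_mdct, rng):
    """Multiply every MDCT coefficient of the previous frame by a random sign.

    rng: a random.Random instance; the generator actually used by the
    codec is not reproduced here.
    """
    return [c if rng.random() < 0.5 else -c for c in prev_mdct]
```

A call such as `sign_scramble(prev_mdct, random.Random(1234))` preserves the magnitude of every coefficient and changes only the signs.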
5.4.2.5.2 Waveform adjustment in time domain
Waveform adjustment is performed on the initially compensated signal of the current lost frame to obtain the compensated signal of the current lost frame. The detailed procedure of the waveform adjustment is described as follows:
When the first lost frame occurs, the pitch period of the current lost frame is estimated as follows:
The pitch search is performed over the time-domain signal of the frame prior to the current lost frame by using the autocorrelation method to obtain the pitch period of that frame. The obtained pitch period value is used as the pitch period value of the current lost frame and to compute the maximum of the normalized autocorrelation of the current lost frame. In detail, is searched so that
(148)
achieves the maximum value, then the resulting is the value of the pitch period, denoted by , wherein and are the upper and lower limits for the pitch searching, respectively, and is the frame length, with is the time-domain signal (the signal before TCX long-time prediction and post-processing) over which the pitch search is performed. and are obtained as follows:
(149)
(150)
wherein denotes the rounding operation. Define
(151)
then is the maximum of normalized autocorrelation. When the frame length is not greater than 320, define:
(152)
wherein indicates taking the greatest integer value less than or equal to . Comparing with , the pitch period is reset as in case .
When the frame length is greater than 320, in the procedure of estimating pitch period the following processing is carried out before pitch searching over the time-domain signal of the frame prior to the current lost frame: the time-domain signal of the frame prior to the current lost frame is down-sampled towards a half sampling rate, and the down-sampled time-domain signal is used to replace the original time-domain signal of the frame prior to the current lost frame for the pitch estimate. Accordingly, the searching limits and herein are obtained specifically as follows:
(153)
(154)
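The autocorrelation-based pitch search can be illustrated with a minimal Python sketch. The exact summation limits and normalization of equations (148) and (151) are not reproduced here, so the sketch uses a common normalized cross-correlation as an assumption. The search limits are passed in by the caller, and the half-period check of equation (152) and the down-sampling for long frames are left out. The function name is hypothetical.

```python
import math


def pitch_search(signal, t_min, t_max):
    """Return (lag, correlation) maximizing the normalized autocorrelation.

    signal: time-domain samples of the frame prior to the lost frame.
    t_min, t_max: lower and upper search limits in samples (inclusive).
    The normalization used here is an assumption of this sketch.
    """
    n = len(signal)
    if not 1 <= t_min <= t_max < n:
        raise ValueError("search limits must satisfy 1 <= t_min <= t_max < len(signal)")
    best_lag, best_corr = t_min, -1.0
    for lag in range(t_min, t_max + 1):
        num = sum(signal[i] * signal[i - lag] for i in range(lag, n))
        e_cur = sum(signal[i] ** 2 for i in range(lag, n))
        e_del = sum(signal[i - lag] ** 2 for i in range(lag, n))
        den = math.sqrt(e_cur * e_del)
        corr = num / den if den > 0.0 else 0.0
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr
```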
The following procedure is used to determine whether the pitch period value of the current lost frame estimated by the above method is usable regarding subsequent waveform adjustment:
i. Verify the following conditions to find if any one of them is met. If so, the obtained pitch period value is unusable.
(1) The cross-zero rate of the initially compensated signal of the first lost frame, denoted by , is greater than a threshold , wherein for , and in other cases.
(2) In the frame prior to the current lost frame, the ratio of lower-frequency energy to whole-frame energy, denoted by , is smaller than a threshold of 0.02. This ratio is defined as
(155)
wherein when the current lost frame is TCX20, when the current lost frame is TCX10, is the total number of the frequency points.
(3) In the frame prior to the current lost frame, the spectral tilt, denoted by, is smaller than a threshold , wherein for and otherwise. This spectral tilt is defined as
(156)
wherein is a low-pass filtered signal of the time-domain signal of the prior frame. The low-pass filter is given by:
(157)
(4) In the frame prior to the current lost frame, the cross-zero rate of the second half frame is greater than that of the first half frame by four times.
ii. If none of the above-mentioned conditions (i.e. the conditions (1)-(4)) is met, verify whether the obtained pitch period value is usable according to the following criteria:
(a) When the current lost frame is within a silence segment, the obtained pitch period value is considered to be unusable. The silence segment is identified if the logarithm energy of the frame prior to the current lost frame is smaller than a threshold of 50 or the following two conditions are met simultaneously:
(1) The maximum of normalized autocorrelation mentioned above in the pitch estimate procedure is smaller than 0.9.
(2) The result of the current long-time logarithm energy minus the logarithm energy of the frame prior to the current lost frame is greater than 8.0.
The logarithm energy is defined as:
(158)
where is the time-domain signal used as the final decoder output.
The long-time logarithm energy is defined as follows:
Set an initial value . For each frame, if its logarithm energy is greater than 50 and its cross-zero rate is smaller than 100, the long-time logarithm energy is updated as below:
(159)
where and .
(b) When the current lost frame is not within a silence segment and the maximum of normalized autocorrelation mentioned above is greater than 0.8, the obtained pitch period value is considered to be usable.
(c) When the criteria (a) and (b) are not met and the cross-zero rate of the frame prior to the current lost frame is greater than 100, the obtained pitch period value is considered to be unusable.
(d) When the criteria (a), (b), and (c) are not met and the result of the current long-time logarithm energy minus the logarithm energy of the frame prior to the current lost frame is greater than 6.0, the obtained pitch period value is considered to be unusable.
(e) When the criteria (a), (b), (c), and (d) are not met, and the result of the logarithm energy of the frame prior to the current lost frame minus the current long-time logarithm energy is greater than 1.0 and the maximum of normalized autocorrelation mentioned above is greater than 0.6, the obtained pitch period value is considered to be usable.
(f) When the criteria (a), (b), (c), (d), and (e) are not met, the harmonic characteristic of the frame prior to the current lost frame is verified. When a value representing the harmonic characteristic is smaller than a threshold , the obtained pitch period value is considered to be unusable. When the value is greater than or equal to the threshold , the obtained pitch period value is considered to be usable. In this case, . can be computed as follows:
(160)
wherein is the fundamental frequency point, is the harmonic frequency point of , is the MDCT coefficient at the frequency point . Due to the quantitative relation between the pitch period and the pitch frequency, the value of can be computed with the pitch period value mentioned above. When is not an integer, is computed with its adjacent one or several frequency points by using rounding.
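The cascade of criteria (a) to (f) can be illustrated with a minimal Python sketch. It covers step ii only, so the four conditions of step i are assumed to have been checked beforehand. The harmonic threshold is not reproduced in the text above and is therefore a parameter. All names are hypothetical.

```python
def pitch_usable(log_energy, long_time_log_energy, max_norm_corr,
                 zero_cross_rate, harmonic_value, harmonic_threshold):
    """Evaluate criteria (a) to (f) in order; return True if usable.

    log_energy: logarithm energy of the frame prior to the lost frame.
    harmonic_threshold: threshold for criterion (f), to be taken from
    the specification.
    """
    energy_drop = long_time_log_energy - log_energy
    # (a) silence segment: low energy, or weak correlation with an energy drop
    if log_energy < 50.0 or (max_norm_corr < 0.9 and energy_drop > 8.0):
        return False
    # (b) strong periodicity
    if max_norm_corr > 0.8:
        return True
    # (c) noisy frame
    if zero_cross_rate > 100:
        return False
    # (d) energy well below the long-time level
    if energy_drop > 6.0:
        return False
    # (e) energy above the long-time level with moderate periodicity
    if -energy_drop > 1.0 and max_norm_corr > 0.6:
        return True
    # (f) harmonic characteristic decides
    return harmonic_value >= harmonic_threshold
```

Each criterion is reached only if all earlier ones did not decide, which mirrors the "when the criteria ... are not met" wording of the list.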
When the current lost frame is not the first lost frame, the pitch period of the first lost frame is taken as the estimated pitch period of the current lost frame.
If the pitch period of the current lost frame is not usable, the initially compensated signal of the current lost frame is taken as the compensated signal of the current lost frame; if the pitch period is usable, waveform adjustment is performed on the initially compensated signal with the time-domain signal of the frame prior to the current lost frame, that is, the pitch period is adjusted under certain conditions at first, and then the following are conducted:
It is supposed that the current lost frame is the lost frame, wherein , and when is larger than 4, the initially compensated signal of the current lost frame is taken as the compensated signal of the current lost frame, otherwise the following steps are performed;
(a) A buffer is established with a length of ;
(b) When equals 1, the first samples of the buffer are configured as a first -length signal of the initially compensated signal of the current lost frame, wherein is the pitch period of the current lost frame;
(c) When equals 1, the last pitch period of time-domain signal of the frame prior to the current lost frame and the first -length signal in the buffer are concatenated, and repeatedly copied into the buffer, until the buffer is filled up to obtain a time-domain signal with a length of , and during each copy, if the length of the existing signal in the buffer is , the signal is copied to locations from to of the buffer, wherein , and for the resultant overlapped area with a length of , the signal of the overlapped area is obtained by adding signals of two overlapping parts after windowing respectively; when is larger than 1, the last pitch period of compensated signal of the frame prior to the current lost frame is repeatedly copied into the buffer without overlapping, until the buffer is filled up to obtain a time-domain signal with a length of ;
(d) When is less than 4, the signal in the buffer is taken as the compensated signal of the current lost frame; when equals 4, overlap-add is performed on the signal in the buffer and the initially compensated signal of the current lost frame, and the obtained signal is taken as the compensated signal of the current lost frame.
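The buffer filling of step (c) for the second and later lost frames, where the last pitch period is copied repeatedly without overlapping, can be illustrated with a minimal Python sketch. The buffer length is not reproduced in the text above and is therefore a parameter. The windowed overlap used for the first lost frame and the overlap-add of step (d) are not shown, and the function name is hypothetical.

```python
def fill_buffer_with_last_period(prev_signal, pitch_period, buffer_len):
    """Fill a buffer by repeating the last pitch period without overlap.

    prev_signal: compensated signal of the frame prior to the lost frame.
    pitch_period: pitch period of the current lost frame in samples.
    buffer_len: required buffer length, to be taken from the specification.
    """
    if not 0 < pitch_period <= len(prev_signal):
        raise ValueError("pitch_period must fit inside prev_signal")
    last_period = prev_signal[-pitch_period:]
    return [last_period[n % pitch_period] for n in range(buffer_len)]
```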
For each lost frame without overlap-add processing, an additional signal as a noise is added to the compensated signal of the frame after the compensated signal is obtained. The detailed method of adding additional signal is as follows: firstly, a past signal, namely, the time-domain signal of the frame prior to the first lost frame (in the case of the first lost frame) or the initially compensated signal of the prior lost frame (in the case of the second, third, or fourth lost frame) is passed through a high-pass filter given as follows to obtain an additional signal:
(160a)
secondly, additional-signal gain values of the lost frame are estimated as follows:
(160b)
wherein is updated sample by sample during a series of consecutively lost frames with an initial value of zero at the beginning of the first lost frame and
(160c)
where is the maximum of normalized autocorrelation as described by equation (151); then, the additional signal is multiplied with the estimated additional-signal gain values sample by sample, and the additional signal resulting from multiplication is added to the compensated signal, to obtain a new compensated signal. For each lost frame with overlap-add processing, overlap-add is performed after the additional signal is added to the signal in the buffer.
For the first correctly received frame after the frame loss, if the number of consecutively lost frames is less than 4, a buffer is established with a length of , the last pitch period of compensated signal of the frame prior to the first correctly received frame is repeatedly copied into the buffer without overlapping until the buffer is filled up, overlap-add is performed on the signal in the buffer and the time-domain signal obtained by decoding the first correctly received frame, and the obtained signal is taken as a time-domain signal of the first correctly received frame. The additional signal described above is added to the signal in the buffer before overlap-add.
5.4.2.6 Intelligent gap filling
The intelligent gap filling tool is applied on the constructed signal, generated from one of the three MDCT-based TCX PLC methods, as described in [5], subclause 6.2.2.3.8. However, with increasing number of lost frames, the tiled IGF signal gets further attenuated by changing the IGF gain factor for each scale factor band.
In case of a lost frame, the IGF gain factors calculated in [5] subclause 6.2.2.3.8.3.8 are first limited to a maximum value of 12. After that, the gain factors are changed as follows:
(161)
where is the IGF gain factor at scale factor band and is the number of consecutively lost frames.