5.4.2 TCX MDCT
3GPP TS 26.447, Codec for Enhanced Voice Services (EVS); Error concealment of lost packets, Release 17
5.4.2.1 PLC method selection
If the last good frame prior to a loss was coded with the MDCT-based TCX, several specifically optimized PLC methods are available. One of them is selected based on the second-level criteria described in this subclause. The PLC methods are:
– TCX time domain concealment
– MDCT frame repetition with sign scrambling
– tonal MDCT concealment using phase prediction
– non-tonal concealment with waveform adjustment
The criteria evaluated in this second-level PLC method selection are:
– Last MDCT mode: The MDCT mode of the last good frame is obtained by decoding the bitstream in every good frame.
– Number of consecutively lost frames: The counter is incremented for each lost frame and is reset when a good frame is received.
– Last unmodified LTP gain: If LTP information is updated in the last good frame, the variable contains the LTP gain, and otherwise it is zero.
– Tonal MDCT peak detection flag: The flag describes whether tonal MDCT concealment using phase prediction should be done. It is set to zero by default and remains zero if one of the following conditions is true:
– the last core or the second last core is not mode TCX20
– the last unmodified LTP gain is bigger than 0.4 and the last pitch is bigger than
– the last pitch differs from the second last pitch
– TNS was active in the last or second last frame
Otherwise, the flag is set to one if the output of the peak detection of tonal components (see subclause 5.4.2.4.2) matches one of the following criteria:
– the number of found peaks is higher than 10; or
– the number of found peaks is higher than 5 and the difference between the 3rd and 2nd last pitch is smaller than 0.5; or
– at least one peak is found and the last good frame was either UNVOICED_TRANSITION or UNVOICED_CLAS and the difference between the 3rd and 2nd last pitch is smaller than 0.5 and the last unmodified LTP gain is .
– Flag enabling non-tonal concealment with waveform adjustment: The flag is set to one if the bit rate is one of {48 kbps, 96 kbps, 128 kbps}.
– Intelligent gap filling flag: The flag describes whether intelligent gap filling is active (1) or not (0) (see subclause 5.4.2.6).
– TCX_Tonality flag array: An array of tonality flags of the last ten received frames (see subclause 5.4.5.3a).
The PLC method is chosen by evaluating the criteria listed above. The selection is performed only in the first lost frame after a good frame and is retained in subsequently lost frames.
TCX time domain concealment is selected if:
– flag is zero; and
– is TCX_CORE and and the last good frame was neither UNVOICED_TRANSITION nor UNVOICED_CLAS.
In all other cases, the three MDCT-based concealment methods are selected as described below.
MDCT frame repetition with sign scrambling is selected if:
– is one (in conjunction with tonal MDCT concealment using phase prediction); or
– is zero and non-tonal concealment with waveform adjustment is not active.
Tonal MDCT Concealment using phase prediction is selected if:
– is one
Non-tonal concealment with waveform adjustment is selected if:
– is one, is zero and there is no transition having a larger frame size than a normal TCX20 frame; and
– the lost frame is considered to be a non-tonal frame, which requires that the TCX_Tonality flag array contains five or less ones or one out of the last three frames is not TCX20.
If an MDCT-based PLC mode is selected and the intelligent gap filling flag is one, missing information is added by the intelligent gap filling concealment.
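The choice among the three MDCT-based methods can be illustrated with a minimal sketch. It is not the normative decision logic: the function and parameter names are hypothetical, the assignment of the two flags to the conditions above is inferred from the surrounding text, and the preceding check for TCX time domain concealment is assumed to have already failed.

```python
def is_non_tonal_frame(tonality_flags, last_three_modes):
    """True if at most five tonality flags are set, or if any of the
    last three frames was not coded as TCX20."""
    few_tonal = sum(1 for flag in tonality_flags if flag) <= 5
    non_tcx20 = any(mode != "TCX20" for mode in last_three_modes)
    return few_tonal or non_tcx20


def select_mdct_plc(tonal_flag, waveform_flag, long_transition,
                    tonality_flags, last_three_modes):
    """Return the MDCT-based concealment methods for the first lost frame.

    tonal_flag:      tonal MDCT peak detection flag (assumed mapping)
    waveform_flag:   flag enabling waveform adjustment (assumed mapping)
    long_transition: True if a transition with a frame size larger than a
                     normal TCX20 frame is present (hypothetical input)
    """
    if tonal_flag:
        # Tonal bins use phase prediction; all other bins are sign scrambled.
        return ["tonal_phase_prediction", "sign_scrambling"]
    if (waveform_flag and not long_transition
            and is_non_tonal_frame(tonality_flags, last_three_modes)):
        return ["waveform_adjustment"]
    return ["sign_scrambling"]
```

Because the selection is retained for subsequently lost frames, such a function would only be evaluated in the first lost frame after a good frame.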
5.4.2.2 TCX time domain concealment
The time domain PLC for TCX is called if the core of the last good frame was TCX and if the TCX time domain concealment was chosen in the PLC method selection described in subclause 5.4.2.1. This concealment method has some similarity to the ACELP-like concealment described in subclause 5.3.1. Because this method operates in the excitation domain, in order to shape the noise towards the vocal tract and to prevent discontinuities, a local LPC analysis is applied to the synthesized time domain signal of the last frame. To improve the LPC analysis, the signal is first filtered with the pre-emphasis filter described in subclause 5.1.3 of [5] to obtain the pre-emphasized signal. After that, an LPC analysis is applied to the pre-emphasized signal in the same way as in subclause 5.1.5 of [5], but with its own frame length and analysis window; the first three quarters of the window are a Hamming window and the last quarter is a cosine window. The residual signal is obtained by filtering the pre-emphasized signal through the inverse filter in the same way as in subclause 5.2.2.4.1.1 of [5]. The local LPC parameters and the excitation signal are stored for use in case of multiple frame loss.
5.4.2.2.1 Construction of the periodic part of the excitation
If the last good frame was neither UNVOICED_CLAS nor UNVOICED_TRANSITION in combination with coder type being GENERIC, a harmonic part and a random part of the excitation have to be generated to conceal erased frames. Otherwise, only a random part has to be generated. The harmonic part of the excitation is constructed by repeating the last pitch period of the previous frame. In the first erased frame after a good frame, if the ISF stability factor is lower than one, the first pitch cycle is first low-pass filtered. The filter depends on the core sampling rate and is an 11-tap linear phase FIR filter. The filter coefficients for core sampling rates lower than or equal to 16 000 Hz are:
, (112)
for core sampling rate equal to 25600 Hz
(113)
and for higher core sampling rates
. (114)
The periodic part of the excitation is constructed as described in subclause 5.3.1 including the pitch extrapolation as described in subclause 5.3.1.1 and the glottal pulse resynchronization as described in subclause 5.3.1.3. The pitches used to get the pitch extrapolation are based on the LTP lag and gains coming from the last TCX frames.
The LTP lag and gain are sent in the bitstream as side information. The specific handling described in subclause 5.3.1.5 is used for TCX time domain concealment at all bitrates, together with the low-pass filtering of the first pitch cycle described above.
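The basic idea of repeating the last pitch period can be sketched as follows. This is a simplified illustration only: it assumes an integer pitch lag and leaves out the pitch extrapolation, the glottal pulse resynchronization and the low-pass filtering of the first pitch cycle. The function name and its arguments are hypothetical.

```python
def repeat_last_pitch_cycle(past_excitation, pitch_lag, num_samples):
    """Build a periodic excitation by repeating the last pitch cycle.

    past_excitation: excitation samples of the last good frame
    pitch_lag:       integer pitch period in samples (assumption: the
                     specification also handles fractional lags)
    num_samples:     number of samples to generate
    """
    if not 0 < pitch_lag <= len(past_excitation):
        raise ValueError("pitch_lag must be in 1..len(past_excitation)")
    cycle = list(past_excitation[-pitch_lag:])
    # Sample n of the output is sample (n mod pitch_lag) of the last cycle.
    return [cycle[n % pitch_lag] for n in range(num_samples)]
```

For example, with a past excitation of `[0, 1, 2, 3]` and a pitch lag of 3, the last cycle `[1, 2, 3]` is repeated until the requested length is reached.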
The gain of pitch is calculated on the pre-emphasized signal as follows:
(115)
where are samples of the pre-emphasized prior time data, is the length of a subframe in samples and is the rounded pitch period equal to the LTP lag of the last good frame. The gain of pitch is limited to the range between zero and one to prevent an unexpected increase of energy. The formed adaptive excitation is attenuated sample-by-sample throughout the frame, starting with one and ending with the damping factor calculated as in subclause 5.3.4.2.3. To get a proper overlap-add in case the next good frame is a valid TCX frame, an additional half frame is created in the same way as described above.
The attenuation strategy of the periodic part of the excitation is the same as done in subclause 5.3.4.2.1.
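The gain limiting and the sample-by-sample attenuation can be sketched as below. The sketch assumes a linear ramp that reaches the end gain exactly at the last sample of the frame; the text above only states the start and end values, so the ramp shape is an assumption, as are the function names.

```python
def clamp_pitch_gain(gain):
    """Limit the gain of pitch to the range [0, 1]."""
    return min(max(gain, 0.0), 1.0)


def attenuate_linearly(excitation, start_gain, end_gain):
    """Scale the excitation with a gain ramp from start_gain to end_gain.

    Assumption: the ramp is linear and the last sample is scaled by
    end_gain (the damping factor in the text above).
    """
    n = len(excitation)
    step = (end_gain - start_gain) / (n - 1) if n > 1 else 0.0
    return [x * (start_gain + step * i) for i, x in enumerate(excitation)]
```

For the periodic part, `start_gain` would be one and `end_gain` the damping factor of subclause 5.3.4.2.3.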
5.4.2.2.2 Construction of the random part of the excitation
The innovative (non-periodic) part of the excitation is generated by a simple random generator with approximately uniform distribution. If the last good frame was VOICED_CLAS or ONSET, the noise is pre-emphasis filtered as in [19] subclause 5.1.3, but with a pre-emphasis factor of 0.2 for core sampling rates lower than or equal to 16 kHz and 0.6 for all other rates. The filtering decreases the amount of noisy components in the lower-frequency speech region. Furthermore, to shift the noise towards higher frequencies, the noise is filtered by a 10th-order high-pass FIR filter in the first erased frame after a good frame if the last good frame was neither UNVOICED_CLAS nor UNVOICED_TRANSITION. The filter coefficients are
(116)
for core sampling rates lower than or equal to 16000 Hz,
(117)
for the core sampling rate of 25600 Hz and
(118)
for all other rates. For the second and further lost frames, the noise is composed via a linear interpolation between the fullband noise and a high-pass filtered version of it as
(119)
where are the noise samples generated as described at the beginning of this subclause, are the same samples filtered with the high-pass filter above and is a frame-wise cumulative factor of the damping factors. This ensures that the noise fades to fullband noise at a speed that depends on the damping factor.
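The noise shaping steps can be sketched as follows. The pre-emphasis is written as the first-order filter y[n] = x[n] - factor * x[n-1] with zero initial memory, which is an assumption about the filter state. The interpolation weights are also an assumption: the sketch gives the high-pass filtered noise the weight of the cumulative factor and the fullband noise the remaining weight, which matches the stated behaviour of fading to fullband noise as the cumulative factor decreases. All names are hypothetical.

```python
def pre_emphasis_factor(core_rate_hz):
    """Pre-emphasis factor for the noise, as stated in the text above."""
    return 0.2 if core_rate_hz <= 16000 else 0.6


def pre_emphasize(noise, factor):
    """First-order pre-emphasis y[n] = x[n] - factor * x[n-1].

    Assumption: the filter memory is zero at the start of the frame.
    """
    out = []
    prev = 0.0
    for x in noise:
        out.append(x - factor * prev)
        prev = x
    return out


def blend_noise(fullband, highpassed, cumulative_damping):
    """Linear interpolation between fullband and high-pass filtered noise.

    Assumption: weight cumulative_damping on the high-pass filtered noise,
    so the result approaches fullband noise as the factor goes to zero.
    """
    c = cumulative_damping
    return [(1.0 - c) * f + c * h for f, h in zip(fullband, highpassed)]
```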
The innovation gain, which is used for adjusting the noise level, is calculated as
(120)
where is calculated as in equation (115). However, if the last good frame was neither UNVOICED_CLAS nor UNVOICED_TRANSITION in combination with coder type being GENERIC, is set to zero for calculating and the pitch buffer is reset.
The attenuation strategy of the random part of the excitation is somewhat different from the attenuation of the periodic excitation. The reason is that the pitch excitation converges to zero while the random excitation fades towards the background level described in subclause 5.3.4.2.1. The background level is limited to. The random part of the excitation is attenuated linearly throughout the frame on a sample-by-sample basis starting with and going to the end of the frame gain which is
(121)
where is the gain in the last sample of the noise signal and is the damping factor as calculated in subclause 5.3.4.2.3. Due to the fact that is a relative component, the noise gets normalized. If the last good frame was UNVOICED_CLAS and the coder type is not UNVOICED, the innovative excitation is further attenuated by a factor of 0.8. Otherwise, if the last good frame was not UNVOICED_CLAS and not UNVOICED_TRANSITION, the excitation is further attenuated by .
To get a proper overlap-add in case the next good frame is a valid TCX frame, an additional half frame is created in the same way as described above.
5.4.2.2.3 Construction of the total excitation, synthesis and updates
Finally, the random part of the excitation is added to the adaptive excitation to form the total excitation signal. If the last good frame is UNVOICED_CLAS or the last good frame is UNVOICED_TRANSITION and the coder type is GENERIC, only the innovative excitation is used as mentioned above. The synthesized signal is obtained by filtering the total excitation signal through the LP synthesis filter (see [5] subclause 6.1.3) with the locally calculated LPC parameters and is post-processed with the de-emphasis filter, which is the inverse of the filter in [5] subclause 5.1.3.
If LTP information is available in the last good frame and is equal to zero then the is reset to zero. Finally, the overlap-add buffers are updated as in subclause 5.4.5.
5.4.2.3 MDCT frame repetition with sign scrambling
The excitation of the concealed frame (input to FDNS) is derived by sign scrambling of the last received excitation spectrum :
(122)
is the IGF cross over frequency. The is derived as
(123)
For any lost frame following a received frame, the initial value is reset:
(123a)
If the last 2 spectra are coded using TCX5, then the one with smaller energy is chosen.
The spectrum is faded towards noise as described in subclause 5.4.6.1.3.2.1.
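A minimal sketch of sign scrambling is given below. It keeps the magnitude of every coefficient of the last received spectrum and randomizes only the sign. Python's `random.Random` is swapped in for the random sign generator of equation (123), so the sign sequence differs from the one produced by the specification. The handling of the IGF cross over frequency in equation (122) and the fading towards noise are not modelled. The helper that picks the lower-energy TCX5 spectrum breaks ties in favour of the first spectrum, which is an assumption.

```python
import random


def sign_scramble(spectrum, seed=0):
    """Return a copy of the spectrum with pseudo-randomly flipped signs.

    Assumption: random.Random stands in for the generator of the
    specification; only the magnitudes are guaranteed to match.
    """
    rng = random.Random(seed)
    return [c if rng.random() < 0.5 else -c for c in spectrum]


def lower_energy_spectrum(first, second):
    """Pick the spectrum with the smaller energy (used for TCX5 pairs)."""
    energy_first = sum(c * c for c in first)
    energy_second = sum(c * c for c in second)
    return first if energy_first <= energy_second else second
```

Re-seeding the generator in the first lost frame after a received frame makes the scrambling reproducible, which mirrors the reset of the initial value described above.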
5.4.2.4 Tonal MDCT concealment using phase prediction
5.4.2.4.1 Overview
The phase prediction described in subclause 5.4.2.4.3 is performed on the spectral coefficients belonging to tonal components found using the peak detection described in subclause 5.4.2.4.2. For the spectral coefficients not belonging to the tonal components, the sign scrambling is applied as described in subclause 5.4.2.3.
5.4.2.4.2 Peak detection of tonal components
Peak detection is performed if the current frame is lost but the previous frame has been received.
The peaks are first searched in the power spectrum of frame , using predefined thresholds. Based on the location of the peaks in frame , the thresholds for the search in the power spectrum of frame are adapted, whereas frame represents the second 10ms of frame and the first 10ms of frame . Thus, peaks existing in both spectra ( and ) are found. Their exact location is based on the power spectrum of frame .
The power spectra and are obtained as follows:
(124)
where represents the MDST coefficients and represents the MDCT coefficients and being the number of spectral coefficients. A minimum significant value of a spectral line in the power spectrum is assured by this operation:
(125)
and are derived from the time domain signal via MDCT/MDST. is given and is estimated:
(126)
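The power spectrum computation can be sketched as the sum of the squared MDCT and MDST coefficients per bin, followed by a lower bound. The value of the lower bound is given by equation (125) and is not reproduced here, so `floor` is a hypothetical parameter that the caller has to supply. The estimation of the second spectrum in equation (126) is not modelled.

```python
def power_spectrum(mdct, mdst, floor):
    """Per-bin power from MDCT and MDST coefficients with a lower bound.

    mdct, mdst: sequences of equal length (one value per spectral bin)
    floor:      minimum value of a spectral line (hypothetical parameter,
                to be taken from equation (125))
    """
    return [max(c * c + s * s, floor) for c, s in zip(mdct, mdst)]
```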
If the change of the pitch lag between the last and the second last frame is larger than or equal to 0.25 or the pitch lag is smaller than 10ms (corresponding to ), the index of the fundamental frequency is set to zero. Otherwise the index of the fundamental frequency is determined as:
. (128)
The 10 strongest peaks are found at the positions . The distances between peaks are calculated as . The most common among the differences is . If there are at least 3 equal to 1 and if ; or less than 5 are equal to , then is not changed. If there are more than 5 equal to , then is set to . Otherwise is set to 0.
An envelope of each power spectrum is calculated using a moving average filter:
. (129)
The filter length depends on the index of the fundamental frequency and is limited to the range [11,23], as shown in Table 8. If the fundamental frequency is not available or not reliable, the filter length FL is set to 15, otherwise:
. (130)
Table 8: Filter length depending on the fundamental frequency
F0 | FL
0 | 15
<= 10 | 11
>= 22 | 23
else | see equation (130)
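The filter length selection of Table 8 and the moving average envelope can be sketched as follows. The mid-range filter length of equation (130) is not reproduced, so the caller passes it in as the hypothetical callable `mid_range_length`. The envelope uses a centred window that is truncated at the spectrum edges, which is an assumption about the boundary handling.

```python
def filter_length(f0_index, mid_range_length):
    """Filter length FL as a function of the fundamental frequency index.

    mid_range_length: callable implementing equation (130) for
                      10 < f0_index < 22 (hypothetical, supplied by caller)
    """
    if f0_index == 0:
        return 15          # F0 not available or not reliable
    if f0_index <= 10:
        return 11          # lower limit of the range [11, 23]
    if f0_index >= 22:
        return 23          # upper limit of the range [11, 23]
    return mid_range_length(f0_index)


def moving_average_envelope(power, filter_len):
    """Centred moving average of the power spectrum.

    Assumption: near the edges the window is truncated and the mean is
    taken over the bins that exist.
    """
    half = filter_len // 2
    n = len(power)
    envelope = []
    for k in range(n):
        window = power[max(0, k - half):min(n, k + half + 1)]
        envelope.append(sum(window) / len(window))
    return envelope
```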
The smoothed power spectra are calculated as follows:
(131)
5.4.2.4.2.1 Detection of the peak candidates
If the smoothed spectrum is above the envelope at bin and the smoothed spectrum at bin is bigger than at bins and , is treated as peak candidate and the right and left foot of this peak candidate are searched for.
The right foot is defined as the spectral bin with index , for which
(132)
and
(133)
It is also allowed for an that is true, but only if and if there is a for which:
(134)
and
(135)
The left foot is defined in the same way as the right foot, but on the left side of the bin .
The local maximum is then found between the left and the right foot.
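The first step of the candidate search, before the foot search, can be sketched as a scan for bins where the smoothed spectrum exceeds the envelope and both direct neighbours. The sketch uses strict comparisons and skips the first and last bin, both of which are assumptions. The foot search of equations (132) to (135) is not modelled. The function name is hypothetical.

```python
def peak_candidates(smoothed, envelope):
    """Indices of bins that qualify as peak candidates.

    A bin k qualifies if the smoothed spectrum at k is above the envelope
    at k and larger than the smoothed spectrum at k-1 and k+1.
    """
    return [
        k for k in range(1, len(smoothed) - 1)
        if smoothed[k] > envelope[k]
        and smoothed[k] > smoothed[k - 1]
        and smoothed[k] > smoothed[k + 1]
    ]
```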
The thresholds for the peak search in are set at positions as:
(136)
If the change of the pitch between the last and the second last frame is smaller than 0.5, then for each
– , for each , being the number of the harmonics of ,
– , for each ,
thresholds are updated as follows:
(137)
with
(138)
For all bins not belonging to peaks or harmonics the threshold is set as:
. (140)
Note: The base threshold 7.59, as given in equation 129, corresponds to . All other thresholds, represented by , are given relative to this base threshold. Thus,
– 0.35 corresponds to
– 0.7 corresponds to
– 1.1 corresponds to
– 1.5 corresponds to
– 16 corresponds to
5.4.2.4.2.2 Final detection of the tonal components
After setting the thresholds as described in subclause 5.4.2.4.2.1, peaks detected in frame are now searched for in the power spectrum of frame .
If the following is fulfilled:
(141)
the right and left foot of the peak is searched for in around . The algorithm for the foot search is the same as the one in subclause 5.4.2.4.2.1.
The local maximum is then found between the left and the right foot.
A tonal component is defined as the set of spectral bins . If two neighboring tonal components would overlap, their surroundings are symmetrically reduced such that each spectral bin belongs only to one tonal component. All tonal components then build the set .
5.4.2.4.3 Phase prediction
For all found tonal components , that include spectrum peaks and their surroundings, as described in subclause 5.4.2.4.2.2, the MDCT phase prediction is used. For all other spectrum coefficients sign scrambling described in subclause 5.4.2.3 is used.
The phases are derived for each bin of a tonal component as:
, (142)
The fractional part is given by:
(143)
with a given in Table 9, depending on the neighboring bins around a spectral peak .
Table 9: Variable a from equation (143)
if |
a |
else |
Where the bandwidth b is 7, the maximum ratio is 44.8 and the constant G is .
The phase shift, being the same for every spectrum bin in