5.3.1 General

3GPP TS 26.447, Codec for Enhanced Voice Services (EVS); Error concealment of lost packets (Release 17)

In case of frame erasures, the concealment strategy can be summarized as a convergence of the signal energy and the spectral envelope to the estimated parameters of the background noise, while the periodicity of the signal converges to zero. A frame erasure is signalled to the decoder by setting the bad frame indicator variable for the current frame active. The speed of the convergence depends on the parameters of the last correctly received frame and on the number of consecutive erased frames, and is controlled by an attenuation factor, α. For UNVOICED_CLAS frames, the factor α further depends on the stability of the LP filter. In general, the convergence is slow if the last good received frame is in a stable segment and rapid if the frame is in a transition segment. The values of α are summarized in subclause 5.3.4.1 for the excitation concealment of rates 5.9, 7.2, 8.0, 13.2, 32 and 64 kbps and in subclause 5.3.4.2.3 for rates 9.6, 16.4 and 24.4 kbps. Similar values are also defined for LSF concealment as described in subclause 5.2.

5.3.1.1 Extrapolation of future pitch

In case of a frame loss, the end-of-frame pitch is estimated to keep the adaptive codebook as closely synchronized as possible with the error-free case. If the error-free end-of-frame pitch can be predicted precisely, the recovery after the loss is considerably quicker. The pitch extrapolation assumes that the encoder uses a smooth pitch contour. The information on the estimated end-of-frame pitch is used by the glottal pulse resynchronization tool described in subclause 5.3.1.2.

The pitch extrapolation is done only if the last good frame was classified as UNVOICED TRANSITION, VOICED TRANSITION or VOICED_CLAS, and only if the frame before the loss was a good frame. The extrapolation is based on the pitch lags of the last 5 subframes before the erasure. The history of the pitch gains of the last 6 subframes before the erasure is also needed. The history of the pitch lags and pitch gains is updated after the synthesis of every frame.

First, the difference between the pitch lags is computed:

for (27)

where denotes the last subframe of the previous frame, denotes the second last sub-frame of the previous frame, and so on.

In case the last good frame contained information about future pitch gains and pitch lags, is instead calculated by:

for (28)

Also, if information about future pitch gains and pitch lags was contained in the previous frame, the history of pitch gains is shifted by 2 subframes in a way that the -th pitch gain is moved to the -th pitch gain, for .

Future subframe information might be available if the last good frame was coded with TCX MDCT and there was LTP information available, or the last good frame was coded with ACELP and there was future pitch information transmitted in the bitstream (see subclause 5.3.3.1).

The sum of the differences is computed as

(29)

The position of the maximum absolute difference, , is found.

If the criterion AND is met, pitch prediction is performed. Else no prediction is performed and is used for building the adaptive codebook during concealment.

Pitch prediction is performed by minimizing the following error function:

(30)

where:

is the error function,

are the past adaptive codebook gains,

a and b are the unknown variables which need to be determined,

are the adaptive codebook lags from the past frames,

is the subframe index

The past adaptive codebook gains are multiplied by the weighting vector {1.25, 1.125, 1.0, 0.875, 0.75}.

The function is minimized by differentiating the error function with respect to a and b separately:

(31)

Setting the derivatives with respect to a and b to zero leads to:

(32)

(33)

where

(34)

If no information about future subframes was available in the previous frame, the end-of-frame pitch is determined as:

(35)

If information about future pitch gains and pitch lags was available, the end-of-frame pitch is instead predicted by:

(36)

After this processing, the predicted pitch is limited between and .
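The minimization described above can be illustrated with a short Python sketch. It fits a straight line a + b·i to five past pitch lags, weighting each squared deviation by the corresponding scaled pitch gain, and then evaluates the line at a later subframe index. This is a sketch under stated assumptions, not the normative procedure: the exact error function (30), the closed-form solutions (32) to (34) and the evaluation indices of (35) and (36) are not reproduced here. The weighting by the gain itself rather than its square, the ordering of the weight vector relative to the lag history, and the names `fit_pitch_line`, `extrapolate_pitch` and `eval_index` are all illustrative assumptions.

```python
# Weight vector quoted in the text; its ordering relative to the
# lag history (oldest first here) is an assumption.
GAIN_WEIGHTS = (1.25, 1.125, 1.0, 0.875, 0.75)


def fit_pitch_line(lags, gains):
    """Weighted least-squares fit of lag(i) ~ a + b*i for i = 0..4.

    Minimizes sum_i w_i * (a + b*i - lags[i])**2 with
    w_i = gains[i] * GAIN_WEIGHTS[i] (assumed weighting).
    Returns (a, b), or None if the system is singular.
    """
    w = [g * k for g, k in zip(gains, GAIN_WEIGHTS)]
    s0 = sum(w)
    s1 = sum(wi * i for i, wi in enumerate(w))
    s2 = sum(wi * i * i for i, wi in enumerate(w))
    t0 = sum(wi * d for wi, d in zip(w, lags))
    t1 = sum(wi * i * d for i, (wi, d) in enumerate(zip(w, lags)))
    # Normal equations: a*s0 + b*s1 = t0 and a*s1 + b*s2 = t1.
    det = s0 * s2 - s1 * s1
    if abs(det) < 1e-12:
        return None
    a = (t0 * s2 - t1 * s1) / det
    b = (s0 * t1 - s1 * t0) / det
    return a, b


def extrapolate_pitch(lags, gains, eval_index):
    """Evaluate the fitted line at a hypothetical future subframe index."""
    fit = fit_pitch_line(lags, gains)
    if fit is None:
        return None
    a, b = fit
    return a + b * eval_index
```

For a perfectly linear lag history such as 50, 51, 52, 53, 54, the fit returns a = 50 and b = 1 whatever the gains are, so the extrapolated value simply continues the line. The gains only matter when the lags deviate from a straight line, in which case subframes with higher pitch gain pull the fit more strongly.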

5.3.1.2 Construction of the periodic part of the excitation

For the concealment of erased frames following a correctly received UNVOICED_CLAS frame, no periodic part of the excitation is generated. For the concealment of erased frames following a correctly received frame other than UNVOICED_CLAS, the periodic part of the excitation is constructed by repeating the low-pass filtered last pitch period of the previous frame. The low-pass filter used is a simple 3-tap linear phase FIR filter with the coefficients equal to 0.18, 0.64 and 0.18. The pitch period, Tc, used to select the last pitch pulse, and hence used during the concealment, is defined so that pitch multiples or submultiples can be avoided or reduced. The following logic is used in determining the pitch period, Tc:

if ((T[–1] < 1.8 Ts) AND (T[–1] > 0.6 Ts)) OR (Tcnt >= 5)
    tmp_tc = T[–1]
else
    tmp_tc = Ts
Tc = round(tmp_tc)

Here, T[–1] is the pitch period of the last subframe of the last good received frame and Ts is the pitch period of the last subframe of the last good stable voiced frame with coherent pitch estimates. A stable voiced frame is defined here as a VOICED_CLAS frame, preceded by a frame of voiced type (VOICED TRANSITION, VOICED_CLAS, ONSET). The coherence of pitch is verified by examining whether the closed-loop pitch estimates are reasonably close; i.e. whether the ratio between the 4th subframe pitch, at 12.8 kHz core sampling frequency or at 16 kHz core sampling frequency, and the 2nd subframe pitch, at 12.8 kHz core sampling frequency or at 16 kHz core sampling frequency, is within the interval [0.7, 1.4], and whether the ratio between the 2nd subframe pitch ( or ) and the last subframe pitch of the preceding frame, , is also within that interval, where y = 5 when the core sampling frequency is 12.8 kHz and y = 6 otherwise. The pitch is also assumed coherent if the coding type is transition.

This determination of the pitch period, Tc, implies that if the pitch at the end of the last good frame and the pitch of the last stable frame are close, the pitch of the last good frame is used. Otherwise, this pitch is considered unreliable and the pitch of the last stable frame is used instead to avoid the impact of erroneous pitch estimates at voiced onsets. This logic is valid only if the last stable segment is not too far in the past. Hence, a counter, Tcnt, is defined that limits the effect of the last stable segment. If Tcnt is greater than or equal to 5; i.e. if there are at least 5 frames since the last Ts update, the last good frame pitch is used systematically. Tcnt is reset to 0 every time a stable segment is detected and Ts is updated. The period Tc is then maintained constant during the concealment for the entire erased block.
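The selection logic for Tc can be written as a small Python function. This is only a sketch of the pseudocode given earlier in this subclause. The function and parameter names are illustrative, and rounding half up is an assumption, since the text only says round().

```python
def select_pitch_period(t_last, t_stable, t_cnt):
    """Choose the concealment pitch period Tc.

    t_last   -- T[-1], pitch of the last subframe of the last good frame
    t_stable -- Ts, pitch of the last subframe of the last stable voiced frame
    t_cnt    -- Tcnt, number of frames since Ts was last updated
    """
    close_to_stable = 0.6 * t_stable < t_last < 1.8 * t_stable
    if close_to_stable or t_cnt >= 5:
        tmp_tc = t_last      # last good frame pitch is trusted
    else:
        tmp_tc = t_stable    # fall back to the last stable pitch
    # Rounding half up for positive pitch values (assumed convention).
    return int(tmp_tc + 0.5)
```

With Ts = 58, a last-subframe pitch of 120 lies outside the interval (34.8, 104.4) and is treated as a likely pitch multiple, so Ts is used, unless at least 5 frames have passed since Ts was updated.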

5.3.1.2.1 Particularity of rate 5.9, 7.2, 8.0 and 13.2 kbps

In addition to the UNVOICED_CLAS case, the periodic component of the excitation is not constructed when the last coding mode was GSC AUDIO without a temporal contribution or the last class was INACTIVE without a temporal contribution.

5.3.1.3 Glottal pulse resynchronization

The construction of the periodic part of the excitation, described in the subclause 5.3.1.2, may result in a drift of the glottal pulse position in the concealed frame during voiced segments, since the pitch period used to build the excitation can be different from the encoder pitch period. This will cause the adaptive codebook (or past CELP excitation) to be desynchronized from the actual CELP excitation. Thus, in case a good frame is received after an erased frame, the pitch excitation (or adaptive codebook excitation) will have an error which may persist for several frames and affect the performance of the correctly received frames.

To overcome this problem and improve the decoder convergence, a resynchronization method is used which adjusts the position of the glottal pulses in the concealed frame so that they are synchronized with the estimated true glottal pulse positions. These positions are estimated at the decoder from the pitch extrapolation performed as in subclause 5.3.1.1. The resynchronization procedure is therefore based on estimated phase information: it aligns the maximum pulse in each pitch period of the concealed frame with the estimated position of the glottal pulse.

The starting point is the constructed periodic part of the excitation src_exc, constructed as described in subclause 5.3.1.2. If then samples are removed from src_exc and if then samples are added to src_exc. The samples are added or removed at the locations of the minimum energy, between the estimated locations of the glottal pulses as well as the locations of the minimum energy before the estimated location of the first and after the estimated location of the last glottal pulse. The periodic part of the excitation, modified in such way, is stored into dst_exc.

5.3.1.3.1 Condition to perform resynchronisation

The glottal pulse resynchronisation is performed only if certain conditions are met, indicating that a reliable estimate of the true pulse positions is available and differs from the actual pulse positions. First, the extrapolation of the future pitch as performed in subclause 5.3.1.1 shall have been successful. The pitch period Tc as defined in subclause 5.3.1.2 shall be different from the rounded predicted pitch as defined in subclause 5.3.1.1. The absolute difference between the pitch period Tc and the rounded predicted pitch shall be smaller than . In order to have enough samples for the pulse resynchronization in the periodic part of the excitation constructed by repeating the last pitch period, the relative pitch change shall be greater than a threshold as described below:

(37)

where is the number of subframes as defined in subclause 5.1.2.

If the conditions to perform the glottal pulse resynchronization are not met, the samples from src_exc are simply copied to dst_exc. In some instances where the glottal pulse resynchronization is used, it is implemented in such a way that the caller's src_exc points to the final location of the modified periodic part of the excitation. Its contents are first copied to a temporary buffer, which is treated as src_exc inside the pulse resynchronization, while the final location (src_exc for the caller) is treated as dst_exc inside the pulse resynchronization.

5.3.1.3.2 Performing glottal pulse resynchronization

First the pitch change per sub-frame is calculated as:

(38)

Then the number of samples to be added (to be removed if negative) is calculated as:

(39)

Then the location of the first maximum pulse , among first samples in src_exc is found using simple search for the maximum absolute value.

The index of the last pulse that will be present in dst_exc is calculated as:

(40)

The delta of the samples to be added or removed between consecutive pitch cycles is calculated as:

(41)

The number of samples to be added or removed before the first pulse is calculated as:

(42)

The number of samples to be added or removed before the first pulse is rounded down and the fractional part is kept in memory:

(43)

For each region between 2 pulses the number of samples to be added or removed is calculated as:

, (44)

The number of samples to be added or removed between 2 pulses, taking into account the remaining fractional part from the previous rounding, is rounded down:

(45)

If, due to the added , for some i it happens that , then the values for and are swapped.

The number of samples to be added or removed after the last pulse is calculated as:

(46)

The maximum number of samples to be added or removed among the minimum energy regions is calculated as:

(47)

The location of the minimum energy segment between the first two pulses in src_exc, that has length, is then found by simple search for minimum in the moving average of length . For every consecutive minimum energy segment between two pulses, the position is calculated as:

, (48)

If then the location of the minimum energy segment before the first pulse is calculated using . Otherwise the location of the minimum energy segment before the first pulse in src_exc is found by simple search for minimum in the moving average of length .

If then the location of the minimum energy segment after the last pulse is calculated using . Otherwise the location of the minimum energy segment after the last pulse in src_exc is found by simple search for minimum in the moving average of length .
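The "simple search for minimum in the moving average" used to locate a minimum energy segment can be sketched in Python as follows. The sketch assumes that the moving average is taken over squared sample values, and it searches with a running sum, which has its minimum at the same position as the average. The function name and the explicit `start` and `end` bounds are illustrative and do not correspond to symbols in this subclause.

```python
def min_energy_segment(signal, start, end, length):
    """Return the start index of the lowest-energy window.

    Searches windows signal[pos:pos + length] with
    start <= pos <= end - length and returns the first position
    whose sum of squared samples is minimal.
    """
    if length <= 0 or end - start < length:
        raise ValueError("search range shorter than window length")
    # Energy of the first window.
    energy = sum(x * x for x in signal[start:start + length])
    best_pos, best_energy = start, energy
    for pos in range(start + 1, end - length + 1):
        # Slide the window: add the new sample, drop the oldest one.
        new = signal[pos + length - 1]
        old = signal[pos - 1]
        energy += new * new - old * old
        if energy < best_energy:
            best_pos, best_energy = pos, energy
    return best_pos
```

Restricting `start` and `end` to the region between two pulse positions gives the segment between those pulses. The regions before the first pulse and after the last pulse can be searched in the same way with different bounds.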

If there is going to be just one pulse in dst_exc, that is if is equal to 0, the search for is limited to . then points to the location of the minimum energy segment after the only pulse in dst_exc.

If then samples are added at location for to the signal src_exc and stored in dst_exc, otherwise if then samples are removed at location for from the signal src_exc and stored in dst_exc. There are regions where the samples are added or removed.

5.3.1.4 Construction of the random part of the excitation

The innovative (non-periodic) part of the excitation is generated randomly. A simple random generator with approximately uniform distribution is used. Before adjusting the innovation gain, the randomly generated innovation is scaled to some reference value, fixed here to the unitary energy per sample. At the beginning of an erased block, the innovation gain, g_s, is initialized by using the innovative excitation gains of each subframe of the last good frame

for 4 subframes:

(49)

for 5 subframes:

(49a)

where , , , and are the algebraic codebook gains of the four subframes of the last correctly received frame. The attenuation strategy of the random part of the excitation is somewhat different from the attenuation of the pitch excitation. The reason is that the pitch excitation (and thus the excitation periodicity) is converging to 0 while the random excitation is converging to the CNG excitation energy. The innovation gain attenuation is calculated as

(50)

where is the innovative gain at the beginning of the next frame, is the innovative gain at the beginning of the current frame, is the gain of the excitation used during the comfort noise generation and α is as defined in Table 4. The comfort noise gain, , is given as the square root of the energy as described in subclause 5.4.3.6.4. Similarly to the periodic excitation attenuation, the gain is thus attenuated linearly throughout the frame on a sample-by-sample basis starting with, , and going to the value of that would be achieved at the beginning of the next frame.
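The generation and attenuation of the random excitation can be sketched in Python as follows. Two points are assumptions rather than statements of the specification: the gain update is taken to be a linear interpolation between the current gain and the comfort noise gain, controlled by α, which matches the convergence behaviour described above but is not copied from equation (50); and Python's `random` module stands in for the codec's own random generator. All function names are illustrative.

```python
import math
import random


def unit_energy_noise(n, seed=0):
    """Uniform random vector scaled to unit energy per sample."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    energy = sum(v * v for v in x) / n
    if energy == 0.0:
        return x
    scale = 1.0 / math.sqrt(energy)
    return [v * scale for v in x]


def next_innovation_gain(g_start, g_cng, alpha):
    """Assumed update: move the gain towards the CNG gain by (1 - alpha)."""
    return alpha * g_start + (1.0 - alpha) * g_cng


def apply_gain_ramp(noise, g_start, g_end):
    """Attenuate sample by sample, linearly from g_start towards g_end.

    g_end is the value that would be reached at the first sample of the
    next frame, so the last sample of this frame stays one step short of it.
    """
    step = (g_end - g_start) / len(noise)
    return [(g_start + step * i) * v for i, v in enumerate(noise)]
```

With α = 1 the gain is left unchanged, and with α = 0 it jumps to the comfort noise gain within one frame. Intermediate values give the gradual convergence towards the CNG excitation energy described in the text.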

Finally, if the last correctly received frame is different from UNVOICED_CLAS, the innovation excitation is filtered through a linear phase FIR high-pass filter with coefficients –0.0125, –0.109, 0.7813, –0.109, and –0.0125. To decrease the amount of noisy components during voiced segments, these filter coefficients are multiplied by an adaptive factor equal to , with denoting the voicing factor as defined in equation (1475) in subclause 6.1.1.3.2 of [5]. The random part of the excitation is then added to the adaptive excitation to form the total excitation signal. If the last good frame is UNVOICED_CLAS, only the innovative excitation is used and it is further attenuated by a factor of 0.8. In this case, the past excitation buffer is updated with the innovation excitation, as no periodic part of the excitation is available. If the last good frame is UNVOICED_CLAS or INACTIVE but it is not coded with UC mode signalling non‑stationary unvoiced frame, the innovation excitation is further attenuated by a factor of 0.8.
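The filtering of the innovation excitation can be sketched in Python as follows. The five coefficients are the ones listed above. The `scale` parameter is a placeholder for the adaptive factor derived from the voicing factor, whose formula is not reproduced here, and the zero-padding at the frame edges is an assumption.

```python
HP_COEFFS = (-0.0125, -0.109, 0.7813, -0.109, -0.0125)


def highpass_innovation(innovation, scale=1.0):
    """Apply the symmetric 5-tap FIR filter, centred on each sample.

    scale multiplies every coefficient (placeholder for the adaptive
    factor). Samples outside the frame are treated as zero (assumption).
    """
    taps = [c * scale for c in HP_COEFFS]
    half = len(taps) // 2
    n = len(innovation)
    out = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(taps):
            j = i + k - half
            if 0 <= j < n:
                acc += c * innovation[j]
        out.append(acc)
    return out
```

Because the coefficients are symmetric about the centre tap, the filter has linear phase. Centring the taps on the current sample, as done here, means that no delay is introduced.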

5.3.1.5 Spectral envelope concealment, synthesis and updates

To synthesize the decoded speech, the LP filter parameters shall be obtained. The spectral envelope is gradually moved to an estimated reference envelope, see clause 5.2. The estimated LSF vector is converted to an LSP vector and interpolated with the last frame’s LSP vector for 4 or 5 subframes, depending on the ACELP sampling rate being 12.8 kHz or 16 kHz.

The synthesized signal is obtained by filtering the sum of the adaptive and the random excitation signals through the LP synthesis filter (see clause 6.1.3 of [5]) and is post-processed similarly to the steps performed in clean-channel operation.

As the LSF quantizers use prediction, their memories would not be up to date after normal operation is resumed. To reduce this effect, the quantizers’ memories (moving average and auto-regressive) are estimated and updated at the end of each erased frame.

5.3.1.5.1 Specifics for rates 9.6, 16.4 and 24.4 kbps

The coefficients of the filter used in subclause 5.3.1.2 for low pass filtering of the first pitch cycle are dependent on the sampling rate. The pitch period, tmp_tc, is always equal to T[-1], where T[-1] is the pitch period of the last sub-frame of the last good received frame, and Tc, used to select the last pitch pulse is thus equal to round(T[-1]).

The periodic part of the excitation will be generated further by repeatedly copying the last pitch cycle of the dst_exc for an additional half frame, which is used for correctly updating the overlap-add buffers for MDCT recovery.

In contrast to subclause 5.3.1.5, the two excitation signals are not added together before filtering. The synthesized signal is obtained by filtering the adaptive excitation through the LP synthesis filter based on the LSF interpolation according to formula (19). The random part of the excitation is filtered through the LP filter based on formula (21). The two separate synthesis signals are then added, post-processed and played out as in a correctly received frame. Note that the memories of both LP synthesis filters are initialized with the last known state of the last good frame at the beginning of a frame loss. For consecutive losses, they are updated and stored separately.
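The separate synthesis of the two excitation parts can be sketched in Python as follows. The sketch assumes a direct-form all-pole filter 1/A(z) with A(z) = 1 + Σ a_k z^-k and uses illustrative function names. The LP coefficient sets derived from formulas (19) and (21) are passed in as plain lists, and how they are obtained is outside the scope of the sketch.

```python
def lp_synthesis(excitation, lpc, memory):
    """All-pole synthesis y[n] = x[n] - sum_k lpc[k-1] * y[n-k].

    lpc    -- coefficients a_1..a_M of A(z) (M >= 1)
    memory -- last M output samples, most recent last
    Returns (output, updated_memory).
    """
    order = len(lpc)
    if order < 1 or len(memory) != order:
        raise ValueError("memory length must equal the filter order")
    hist = list(memory)
    out = []
    for x in excitation:
        y = x - sum(lpc[k] * hist[-1 - k] for k in range(order))
        out.append(y)
        hist.append(y)
        hist = hist[-order:]
    return out, hist


def conceal_synthesis(adaptive_exc, random_exc,
                      lpc_adaptive, lpc_random,
                      mem_adaptive, mem_random):
    """Filter each excitation with its own filter and memory, then add."""
    syn_a, mem_adaptive = lp_synthesis(adaptive_exc, lpc_adaptive, mem_adaptive)
    syn_r, mem_random = lp_synthesis(random_exc, lpc_random, mem_random)
    mixed = [a + r for a, r in zip(syn_a, syn_r)]
    return mixed, mem_adaptive, mem_random
```

At the first lost frame both memories would be initialized from the same state of the last good frame. For consecutive losses, the two memories returned by `conceal_synthesis` are passed back in separately, which mirrors the separate update and storage described above.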

5.3.1.6 GSC mode concealment

When the concealment is performed based on the GSC core, the construction of the periodic part of the excitation is performed as described in subclauses 5.3.1.2 and 5.3.1.3. The reconstruction of the periodic part of the excitation corresponds to the time domain contribution of the GSC model. Thus, the reconstructed periodic excitation is converted into the frequency domain using the DCT_IV as described in subclause 5.2.3.5.3.1 of [5] and the spectrum above the last known cut-off frequency is smoothed-out to zero.

Then, the spectral concealment is performed using the last good received band energies. In case of INACTIVE content or active SWB UC mode, the last good decoded spectrum is mixed with random noise in a proportion of 4/5 random noise to 1/5 last decoded spectrum, so that the spectrum becomes noisy quite quickly. In case the last good frame was AUDIO, no noise is added, but the spectrum dynamic is attenuated by 25 %.

The next step adds the spectrum of the reconstructed periodic excitation to the concealed spectrum of the frequency domain contribution and performs the inverse DCT_IV, as done in subclause 5.2.3.5.3.1 of [5], to obtain the final concealed excitation in case of GSC mode.

5.3.1.7 Specifics for AMR-WB IO modes

In case of AMR-WB IO, subclause 5.1.2 is complemented with a few more parameters that allow the interoperable decoder to determine whether the decoded frame more likely contains speech or generic audio, and whether the current frame contains an onset. Generic audio can include music, reverberant speech and background music. To determine with good confidence that the current frame can be categorized as generic audio, two parameters are used: the total frame energy as formulated in subclause 5.1.2, and the statistical deviation of the energy variation history.

First, a mean of the past forty (40) total frame energy variations is calculated using the following relation:

(51)

Then, a statistical deviation of the energy variation history over the last fifteen (15) frames is determined using the following relation:

(52)

The resulting deviation gives an indication on the energy stability of the decoded synthesis. Typically, music has a higher energy stability (lower statistical deviation of the energy variation history) than speech.
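The two statistics can be sketched in Python as follows. Equations (51) and (52) are not reproduced here, so the sketch assumes that the mean is a plain arithmetic mean over the last 40 energy variations and that the statistical deviation is the root mean square deviation of the last 15 variations from that mean. Any additional scaling constant in the specification is not modelled, and the function name is illustrative.

```python
import math


def energy_variation_stats(variations, mean_len=40, dev_len=15):
    """Mean and deviation of the frame energy variation history.

    variations -- energy variations in dB, most recent last
    Returns (mean over the last mean_len values,
             RMS deviation of the last dev_len values from that mean).
    """
    if not variations:
        raise ValueError("empty history")
    recent_for_mean = variations[-mean_len:]
    mean = sum(recent_for_mean) / len(recent_for_mean)
    recent_for_dev = variations[-dev_len:]
    dev = math.sqrt(
        sum((v - mean) ** 2 for v in recent_for_dev) / len(recent_for_dev)
    )
    return mean, dev
```

A history with constant energy variation gives a deviation of zero, which corresponds to the high energy stability that the text associates with music.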

Additionally, the first-step classification is used to evaluate the interval between two frames classified as unvoiced when the coder type is different from INACTIVE. When a frame is classified as unvoiced and the coder type is different from INACTIVE, meaning that the signal is unvoiced but not silence, the unvoiced interval counter is set to 16 if the long-term active content energy, as formulated in subclause 5.1.2, is below 40 dB; otherwise the counter is decreased by 8. The counter is also limited between 0 and 300 for active signal and between 0 and 125 for inactive signal. Note that the distinction between active and inactive signal is deduced from the voice activity detection (VAD) information included in the bitstream.

A long-term average is derived from this unvoiced frame counter as follows for active signal:

(53)

And as follows for inactive signal:

(54)

Furthermore, when the long-term unvoiced average is greater than 100, the deviation is greater than 5 and the difference between the current frame energy and the last frame energy is smaller than 12 dB, the long-term average is modified as follows:

(55)

This long-term average of the number of frames between frames classified as unvoiced is used by the classifier to determine whether the frame should be considered as generic audio. The closer together in time the unvoiced frames are, the more likely the frame has speech characteristics (and the less likely it is generic audio). In the illustrative example, the threshold to decide if a frame is considered as generic audio is defined as follows:

A frame is declared GENERIC AUDIO SOUND if

(56)

The parameter, defined at the beginning of this subclause, is added to not classify large energy variation as generic audio, but to keep it as active content. A flag named local attack flag and used in subclause 6.8.1.3.5 of [6] is derived from variation of energy parameter. The local attack flag is set to 1 when the energy variation is greater than 6 dB and the frame is classified as GENERIC AUDIO SOUND or when the energy variation is greater than 9 dB.
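The rule for the local attack flag can be written as a small Python function. This is a direct transcription of the two thresholds stated above, and the function and parameter names are illustrative.

```python
def local_attack_flag(energy_variation_db, is_generic_audio):
    """Return 1 if the frame is flagged as a local attack, else 0.

    energy_variation_db -- energy variation of the current frame in dB
    is_generic_audio    -- True if the frame is classified as
                           GENERIC AUDIO SOUND
    """
    if energy_variation_db > 9.0:
        return 1
    if energy_variation_db > 6.0 and is_generic_audio:
        return 1
    return 0
```

The lower 6 dB threshold applies only to frames classified as GENERIC AUDIO SOUND. For any other class, the variation has to exceed 9 dB before the flag is set.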

The modification performed on the excitation depends on the classification of the frame, and for some types of frames there is no modification at all. Table 6 summarizes the cases in which a modification can be performed.

Table 6: Signal category for excitation modification

Frame classification	Voice activity detected (Y/N)	Category	Modification (Y/N)
ONSET, VOICED_CLAS, UNVOICED TRANSITION, ARTIFICIAL ONSET	Y (VAD=1)	Active voice	N
GENERIC AUDIO SOUND	Y	Generic audio	Y
VOICED TRANSITION, UNVOICED_CLAS	Y	Active unvoiced	Y
ONSET, VOICED_CLAS, UNVOICED TRANSITION, ARTIFICIAL ONSET, GENERIC AUDIO SOUND, VOICED TRANSITION, UNVOICED_CLAS	N	Inactive content	Y

The output of the second stage classifier will be used to activate or not different post processing based on content category.
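The mapping in Table 6 can be expressed as a lookup in Python. The class names are used as plain strings purely for illustration, and the function name is hypothetical.

```python
ACTIVE_VOICE = {"ONSET", "VOICED_CLAS", "UNVOICED TRANSITION", "ARTIFICIAL ONSET"}
ACTIVE_UNVOICED = {"VOICED TRANSITION", "UNVOICED_CLAS"}
GENERIC_AUDIO = "GENERIC AUDIO SOUND"


def excitation_category(frame_class, vad):
    """Return (category, modification_allowed) according to Table 6.

    frame_class -- frame classification as a string
    vad         -- True if voice activity was detected
    """
    known = ACTIVE_VOICE | ACTIVE_UNVOICED | {GENERIC_AUDIO}
    if frame_class not in known:
        raise ValueError("unknown frame class: " + frame_class)
    if not vad:
        # Without voice activity, every class is inactive content.
        return "Inactive content", True
    if frame_class in ACTIVE_VOICE:
        return "Active voice", False
    if frame_class == GENERIC_AUDIO:
        return "Generic audio", True
    return "Active unvoiced", True
```

Only active voice frames are left unmodified. Every other combination of class and voice activity allows the excitation to be modified.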

5.3.1.8 Reconstructed excitation

The total excitation from layer 1 in each subframe is constructed by

(57)

where is the pre-filtered algebraic code vector. The excitation signal, , is used to update the contents of the adaptive codebook for the next frame. The excitation signal, , is then post‑processed as described in subclause 6.1.1.3 of [5] to obtain the post-processed excitation signal , which is finally used as an input to the synthesis filter . The final steps of synthesis, post‑processing, de-emphasis and resampling are described in subclause 6.1.4 of [5].

5.3.1.8.1 Particularity of rate 5.9, 7.2, 8.0 and 13.2 kbps

In case of GSC based concealment, with or without a time domain contribution, the excitation corresponds directly to the output of subclause 5.3.1.6.