5.3.3 Guided concealment and recovery

3GPP TS 26.447, Release 17: Codec for Enhanced Voice Services (EVS); Error concealment of lost packets

5.3.3.1 Specifics for rate 24.4 kbps

As described in subclause 5.5.4 of [5], the activation flag and differential pitch lag are transmitted as side information to obtain better pitch lag estimates and excitation signal for the future frame to be concealed.

The first bit of the side information is read from the bitstream, yielding the activation flag. If the activation flag equals 0, no further decoding is performed. If the flag equals 1, an additional 4 bits are decoded, yielding the differential pitch lag. With 4 bits, 16 different states are signalled. 15 states are used to represent the differential pitch lag, ranging from -7 to 7. The remaining signalling state is used to signal that the pitch lag difference was outside the ±7 range on the encoder side.
In case the pitch lag difference is inside the signalled valid range of ±7, the differential pitch lag is added to the pitch lag of the last sub-frame. The result is used as an initial pitch lag estimate of the future 1st and 2nd sub-frame. The initial pitch lag estimates are used as an input to the pitch lag extrapolation procedure described in subclause 5.3.1.1. If the initial pitch lag estimates are available, the history of pitch lags used for the pitch extrapolation is updated with the initial pitch lag estimates. In case the criterion in subclause 5.3.1.1 is not met, the initial pitch lag estimate is used instead for building the first and second sub-frame of the adaptive codebook during concealment.

In case the signalled state indicates that the pitch lag difference is outside the valid range of ±7, the pitch extrapolation is performed as if no future pitch lag information were available in the bitstream.
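The decoding steps above can be sketched as follows. This is only an illustrative sketch: the text does not say which of the 16 codewords maps to which differential lag, so the mapping used here (codes 0 to 14 map to -7 to 7, code 15 signals "out of range") is an assumption, as are the function names and the representation of the bitstream as a string of '0'/'1' characters.

```python
from typing import Optional, Tuple

# ASSUMPTION: the codeword reserved for "difference outside +-7" is 15.
OUT_OF_RANGE_CODE = 15


def decode_pitch_side_info(bits: str) -> Tuple[bool, Optional[int], int]:
    """Decode the 1-bit activation flag and, if set, the 4-bit differential lag.

    Returns (active, diff_lag, bits_consumed). diff_lag is None when the flag
    is 0 or when the out-of-range state is signalled.
    """
    active = bits[0] == "1"
    if not active:
        return False, None, 1          # no further decoding
    code = int(bits[1:5], 2)           # 4 bits, 16 states
    if code == OUT_OF_RANGE_CODE:
        return True, None, 5           # treat as "no future lag available"
    return True, code - 7, 5           # ASSUMED mapping: 0..14 -> -7..7


def initial_pitch_estimate(last_subframe_lag: float,
                           diff_lag: Optional[int]) -> Optional[float]:
    """Initial lag estimate for the future 1st and 2nd sub-frame, if any."""
    if diff_lag is None:
        return None
    return last_subframe_lag + diff_lag
```

When `initial_pitch_estimate` returns `None`, the caller would fall back to the plain pitch extrapolation of subclause 5.3.1.1.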

5.3.3.2 Specifics for rates 9.6, 16.4 and 24.4 kbps

As described in subclause 5.5.5 of [5], side information on the activation of the spectral envelope diffuser is transmitted to suppress overly sharp peaks in the LP spectrum at the decoder side. 1 bit is decoded to obtain the activation flag. If the value equals 1, the spectral envelope diffuser is activated; otherwise it is de-activated. When the spectral envelope diffuser is active, the following procedure is performed in the recovery frame.

A modified LSF parameter for the previous frame is calculated by placing the low-order coefficients of the LSF parameter of the concealed frame at equal spacing.

(88)

Then the LSF parameter for the current frame is replaced by the sum of mean vector of the current coder type and the residual LSF vector obtained in the decoding of the current frame, and bandwidth separation is applied.

This bandwidth separation is applied to ensure stability and to suppress overly sharp peaks in the LP spectrum. The distances are wider than those used in the normal LSF decoding process. In case the internal sampling frequency is 12.8 kHz, the distances are as follows:

(89)

Then, LP coefficients are calculated based on those modified LSF parameters, and used instead of LP coefficients obtained in the ordinary decoding process. The procedure for conversion from LSF to LPC is the same as normal decoding process.

5.3.3.3 Energy control during recovery

Precise control of the speech energy is very important in frame erasure concealment. The importance of the energy control becomes more evident when a normal operation is resumed after an erased block of frames. Since VC and GC modes are heavily dependent on prediction, the actual energy cannot be properly estimated at the decoder. In voiced speech segments, the incorrect energy can persist for several consecutive frames, which can be very annoying, especially when this incorrect-valued energy increases.

The goal of the energy control is to minimize energy discontinuities by scaling the synthesized signal so that its energy at the beginning of the recovery frame (the first non-erased frame received following a frame erasure) is similar to the energy of the synthesized signal at the end of the last erased frame. The energy of the synthesized signal in the recovery frame is then made to converge, toward the end of that frame, to the energy corresponding to the received energy parameter, while limiting any increase in energy.

If the available bitrate is sufficiently high, the synthesized speech energy information can be estimated in the encoder and transmitted as a side information to the decoder. In EVS, the energy information is transmitted only at 32 and 64 kb/s, using 5 bits. Further, it is transmitted only in the GC mode. In the TC mode, the energy control is not needed as the TC mode does not make use of the adaptive codebook, and memory-less LSF quantization is used. At lower bitrates, the correct energy is estimated at the decoder.

The energy control for LP-based decoding is triggered in the first non-erased frame following frame erasure for other than TC modes. At frames coded at 7.2 and 8 kb/s, in case that this first non-erased frame is using the Autoregressive (AR) prediction for the LP filter quantization, the energy control is continued in all subsequent frames using the AR prediction. The energy control is then maintained in yet another frame as the synthesis filter can still be affected by the filter coefficients interpolation.

The energy control is done in the synthesized speech signal domain. Even if the energy is controlled in the speech domain, it is the LP excitation signal that is scaled. The synthesis is then repeated to smooth the transitions.

The energy control in a recovery frame is done as follows. Let $g_0$ denote the gain used to scale the 1st sample in the current frame and $g_1$ the gain used at the end of the frame. The excitation signal is then scaled as follows

$u_s(i) = g_{AGC}(i)\,u(i), \quad i = 0, \ldots, L-1$ (90)

where $u_s(i)$ is the scaled excitation, $u(i)$ is the excitation before the scaling, $L$ is the frame length and $g_{AGC}(i)$ is the gain starting from $g_0$ and converging exponentially to $g_1$. That is

$g_{AGC}(i) = f_{AGC}\,g_{AGC}(i-1) + (1 - f_{AGC})\,g_1, \quad i = 0, \ldots, L-1$ (91)

with the initialization $g_{AGC}(-1) = g_0$. The factor $f_{AGC}$ is the attenuation factor set to the value of 0.98. This value has been found experimentally as a compromise of having a smooth transition from the previous (erased) frame on one side, and scaling the last pitch period of the current frame as much as possible to the correct (transmitted) value on the other side. The gains $g_0$ and $g_1$ are defined as

$g_0 = \sqrt{E_{-1} / E_0}$ (92)

$g_1 = \sqrt{E_q / E_1}$ (93)

where $E_{-1}$ is the energy computed at the end of the previous (erased) frame, $E_0$ is the energy at the beginning of the current (recovered) frame, $E_1$ is the energy at the end of the current frame, and $E_q$ is the target energy at the end of the current frame. At higher bitrates, $E_q$ is computed at the encoder, quantized and transmitted. The energy information is quantized using a 5-bit linear quantizer in the range of 0 dB to 96 dB with a step of 3 dB. The quantization index is given by

(94)

The index is limited to the range [0, 31]. At lower bitrates, the target energy $E_q$ is estimated at the decoder.
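The excitation scaling described above (a gain that starts at the beginning-of-frame value and converges exponentially to the end-of-frame value) can be sketched as below. This is an illustration under stated assumptions, not the codec's implementation: the first-order recursion g(i) = f·g(i-1) + (1-f)·g1, initialized with g0, is the assumed form of the exponential convergence; the gains are assumed to be square roots of energy ratios; and all function and parameter names are hypothetical. Energies are assumed to be strictly positive, and the additional gain limits discussed later in this subclause are not applied here.

```python
import math
from typing import List, Sequence, Tuple


def recovery_gains(e_prev_end: float, e_cur_begin: float,
                   e_cur_end: float, e_target: float) -> Tuple[float, float]:
    """Return (g0, g1): gains for the first and the last sample of the frame.

    g0 matches the start of the frame to the end of the erased frame,
    g1 matches the end of the frame to the target energy.
    """
    g0 = math.sqrt(e_prev_end / e_cur_begin)
    g1 = math.sqrt(e_target / e_cur_end)
    return g0, g1


def scale_excitation(u: Sequence[float], g0: float, g1: float,
                     f_agc: float = 0.98) -> List[float]:
    """Scale excitation u with a gain moving exponentially from g0 to g1."""
    g = g0                                   # initialization before sample 0
    scaled = []
    for sample in u:
        g = f_agc * g + (1.0 - f_agc) * g1   # ASSUMED first-order recursion
        scaled.append(g * sample)
    return scaled
```

With `f_agc = 0.98` the gain closes only 2 % of the remaining distance to `g1` at each sample, which keeps the start of the frame close to `g0` and reflects the smooth-transition compromise described above.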

The energy of the synthesized speech at the end of the first non-erased frame, $E_1$, is first computed as follows. $E_1$ is the maximum of the signal energy for frames classified as VOICED_CLAS or ONSET, or the average energy per sample for all other frames. For VOICED_CLAS or ONSET frames, the maximum signal energy is computed pitch-synchronously at the end of the current frame as follows:

$E_1 = \max_{i = L - t_E, \ldots, L-1} \hat{s}^2(i)$ (95)

where $L$ is the frame length at the internal sampling rate, $\hat{s}(i)$ is the local synthesis signal sampled at the internal sampling rate, and the integer pitch period length $t_E$ is the rounded pitch period of the last subframe.

For all other classes, $E_1$ is the average energy per sample of the last half of the current frame, i.e.

$E_1 = \frac{2}{L} \sum_{i = L/2}^{L-1} \hat{s}^2(i)$. (96)

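The energy at the end of the previous frame, $E_{-1}$, is computed similarly using the synthesized speech signal of the previous (last erased) frame. When $E_{-1}$ is computed pitch synchronously (i.e. if the class of the previous frame was VOICED_CLAS or ONSET), it uses the concealment pitch period $T_c$.

When the energy at the beginning of the current frame, $E_0$, is computed pitch synchronously (the class of the current frame is VOICED_CLAS or ONSET), it is done similarly using the rounded pitch value of the first subframe, $t_E$:

$E_0 = \max_{i = 0, \ldots, t_E - 1} \hat{s}^2(i)$ (97)

For other frame classes:

$E_0 = \frac{2}{L} \sum_{i = 0}^{L/2 - 1} \hat{s}^2(i)$. (98)

Here $\hat{s}(i)$ and $L$ are the synthesis signal and the frame length used in equation (95).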
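The two energy measures described above (pitch-synchronous maximum for VOICED_CLAS or ONSET frames, average energy per sample over half a frame otherwise) can be sketched as follows. This is a hedged illustration: the function names, the boolean `pitch_synchronous` argument standing in for the frame class test, the exact window boundaries, and the clamping of the window to the frame are assumptions, not taken from the specification.

```python
from typing import Sequence


def _rounded_pitch(pitch: float, frame_len: int) -> int:
    """Round the pitch period and keep the window inside the frame (assumed)."""
    return max(1, min(frame_len, int(pitch + 0.5)))


def end_of_frame_energy(s: Sequence[float], pitch_synchronous: bool,
                        pitch: float) -> float:
    """Energy at the end of a frame of synthesized speech s."""
    n = len(s)
    if pitch_synchronous:
        t_e = _rounded_pitch(pitch, n)
        return max(x * x for x in s[n - t_e:])   # max over last pitch period
    tail = s[n // 2:]                            # last half of the frame
    return sum(x * x for x in tail) / len(tail)


def start_of_frame_energy(s: Sequence[float], pitch_synchronous: bool,
                          pitch: float) -> float:
    """Energy at the beginning of a frame of synthesized speech s."""
    n = len(s)
    if pitch_synchronous:
        t_e = _rounded_pitch(pitch, n)
        return max(x * x for x in s[:t_e])       # max over first pitch period
    head = s[:n // 2]                            # first half of the frame
    return sum(x * x for x in head) / len(head)
```

In this sketch `end_of_frame_energy` would be applied to the last erased frame and to the recovery frame, and `start_of_frame_energy` to the recovery frame only.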

As mentioned previously, the target energy $E_q$ is transmitted from the encoder, but only at high bitrates. If $E_q$ is not available, it is initialized to the energy at the end of the current frame, $E_1$, and further limited as described below.

The gains and are further limited to a maximum allowed value to prevent too strong energy. This value has been set to 1.2 with the exception of very low energy frames ( < 1.1). In this case, is limited to 1. If is not transmitted, further precautions shall be taken because of the possible mismatch between the excitation signal energy and the LP filter gain.

At 7.2 or 8 kb/s, this is done by upper-limiting the energy by a value , scaled by a factor .

In the recovery frame or in the scaled frames following the recovery frame coded by the GC mode using AR prediction, and if the following conditions are met: 1) the end-frame LP filter is resonant in low frequencies (measured by means of the filter tilt), and 2) the evolution of the transmitted pitch is stable within the frame or the mean of the pitch value over all subframes is lower than 34 samples. In the remaining scaled frames following the recovery frame, the scaling factor and equals to the larger value between and an average energy of recent voiced frames . If is computed pitch synchronously, is the running average of pitch-synchronous energy of previous frames. If is computed as average energy per sample, is the running average of average energy per sample of previous frames.

For other bitrates, when the erasure occurs during a voiced speech segment (i.e. the last good frame before the erasure and the first good frame after the erasure are classified as VOICED TRANSITION, VOICED_CLAS or ONSET) and the LP filter impulse response energy of the first frame after an erasure is twice as high as the LP filter impulse response energy of the last frame before the erasure, the energy of the excitation is adjusted to the gain of the new LP filter as follows:

. (99)

Here is the energy of the LP filter impulse response of the last good frame before the erasure and is the energy of the LP filter of the first good frame after the erasure. The LP filters of the last subframes are used. Further, if (already initialized to ), it is further limited as follows:

(100)

At 9.6 and 13.2 kb/s, there is however one exception to this energy scaling strategy if the LP filter is found resonant in low frequencies and the frame is classified as UNVOICED_CLAS or INACTIVE_CLAS. This situation indicates a possible error in the classification and the energy is scaled as in the case of the 7.2 or 8 kb/s recovery frame.

The following exceptions, all related to transitions in speech signal of good frames following an erasure, further overwrite the computation of . If artificial onset is used in the current frame, is set to , to make the onset energy increase gradually. In the case of a first good frame after an erasure is classified as ONSET, the gain is prevented from being higher than . This precaution is taken to prevent a positive gain adjustment at the beginning of the frame from amplifying the voiced onset at the end of the frame. Finally, during a transition from voiced to unvoiced (i.e. the last good frame being classified as VOICED TRANSITION, VOICED_CLAS or ONSET and the current frame being classified UNVOICED_CLAS) or during a transition from a non-active speech period to an active speech period, the value of is set to .

Additionally, the synthesis energy control is performed also in the erased frames following frames coded at 7.2 or 8 kb/s or using the AMR-WB IO mode. Here the energy control is simpler in the sense that it is just verified that the gain is not increasing. Energies , and are computed similarly as in the recovery frames, but is used instead of , and the gains and are limited to 1.

After the energy control, the speech signal is resynthesized by filtering the scaled excitation signal through the LP synthesis filter. The running energy average is finally updated in good voiced frames as with initialization to .

5.3.3.4 Specifics for rates 32 and 64 kbps

5.3.3.4.1 Adaptive codebook resynchronization and fast recovery (WB)

Fast recovery is an approach where side information with some bit rate overhead is transmitted to arrest error propagation into future frames, thereby improving performance under frame erasures. Side information includes parameters like energy, frame classification information and phase information. Specifically, the phase side information is used to align the glottal pulse position at the decoder to that of the encoder thereby synchronizing the adaptive codebook content. The information on the lost frame which becomes available on receiving the future frame is used to correct the excitation (pitch) memory before synthesizing the correctly received future frame. This helps to significantly contain the error propagation into future frames and improves decoder convergence when good frames are received after the erased frame. The waveform interpolation technique ([5], clause 5.2.3) is used to avoid abrupt changes in the pitch contour between the error concealed lost frame and the memory corrected future frame.

5.3.3.4.1.1 Decoding glottal pulse position

The glottal pulse position information consists of the position in the past, , of the absolute maximum pulse from the beginning of the current frame and its sign. If the first decoded pitch of the current frame is smaller than 128, the received quantized position is used as is; otherwise the received quantized position is multiplied by 2.
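The position decoding rule above amounts to a two-way resolution switch. A minimal sketch, with a hypothetical function name and an integer quantized position assumed:

```python
def decode_pulse_position(quantized_pos: int, first_pitch: float) -> int:
    """Map the received quantized pulse position to a sample position.

    For a first decoded pitch below 128 the position is used as is;
    otherwise it is multiplied by 2.
    """
    if first_pitch < 128:
        return quantized_pos
    return 2 * quantized_pos
```

The pulse sign is carried separately and is not handled in this sketch.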

5.3.3.4.1.2 Performing glottal pulse resynchronization

The goal of the resynchronization is to correct the difference between the target transmitted position of the last glottal pulse in the adaptive codebook of the current frame, and its actual position in the concealed adaptive codebook excitation signal. The position T(0) of the maximum pulse in the concealed adaptive excitation, u(n), from the beginning of the frame is determined as described in the previous subclause. If the decoded maximum pulse position is positive, then the maximum positive pulse in the concealed adaptive codebook excitation from the beginning of the frame is determined. If the decoded maximum pulse position is negative, the maximum negative pulse is determined.

The target position of the absolute maximum pulse with respect to the beginning of the current frame is given by:

(101)

where has been defined as a decoded pulse or estimated as done in subclause 7.11.2.5.2. The error in the pulse position of the last concealed pulse in the frame is found by searching for the pulse closest to the actual pulse, . The error is given by:

(102)

where is the index of the pulse closest to and the difference between the actual pulse and the closest one. If , then no resynchronization is required. If then samples need to be inserted. If , then samples need to be removed. Further, the resynchronization is performed only if and , where is the absolute difference between and the pitch lag of the first subframe in the future frame, or its extrapolated value if it is not available.

The samples that need to be added or deleted are distributed across the pitch cycles in the frame. The minimum energy regions in the different pitch cycles are determined and the sample deletion or insertion is performed in those regions. The number of pitch pulses in the frame is at positions. The number of minimum energy regions is . If Tc ≤ 128, there shall be at least 2 minimum energy regions in the current frame. The minimum energy regions are determined by computing the energy using a sliding 5-sample window. The minimum energy position is set at the middle of the window at which the energy is at a minimum. The search performed between two pitch pulses at position and is restricted between and .
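The minimum energy search with a sliding 5-sample window can be sketched as follows. The search limits between two pitch pulses are passed in as plain indices because their exact definition is not reproduced here; the function name, the half-open `[start, stop)` convention and the choice of the first minimum on ties are assumptions made for illustration.

```python
from typing import Sequence


def min_energy_position(exc: Sequence[float], start: int, stop: int,
                        win: int = 5) -> int:
    """Return the centre of the lowest-energy window inside [start, stop).

    Every candidate window of `win` samples lies fully inside the search
    range; on ties the earliest window is kept.
    """
    if stop - start < win:
        raise ValueError("search range shorter than the window")
    best_start = start
    best_energy = sum(x * x for x in exc[start:start + win])
    for k in range(start + 1, stop - win + 1):
        energy = sum(x * x for x in exc[k:k + win])
        if energy < best_energy:
            best_energy, best_start = energy, k
    return best_start + win // 2     # middle of the window
```

Calling this once per pair of neighbouring pitch pulses would yield one candidate insertion or deletion point per pitch cycle.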

The sample deletion or insertion is performed around , where are the minimum positions described above and is the number of minimum energy regions. The samples to be added or deleted are distributed across the different pitch cycles as follows.

If , then there is only one minimum energy region and all samples are inserted or deleted at .

For , a simple algorithm is used to determine the number of samples to be added or removed at each pitch cycle whereby less samples are added/removed at the beginning and more towards the end of the frame. If the total number of pulses to be removed/added is , and the number of minimum energy regions is , the number of samples to be removed/added per pitch cycle, , is found using the following recursive relation:

(103)

where .

Note that at each stage, if then the values of and are interchanged. The values correspond to pitch cycles starting from the beginning of the frame. corresponds to , corresponds to , …, corresponds to . Since are in increasing order, more samples are added/removed towards the cycles at the end of the frame.

Removing samples is straightforward. Adding samples is performed by copying the last samples after dividing by 20 and inverting the sign. For example, if 5 samples need to be inserted at position, the following is performed:

(104)
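Sample insertion and removal at a minimum energy position can be sketched as below. This is an interpretation of the prose only: the inserted samples are taken to be the samples immediately preceding the insertion point, divided by 20 and sign-inverted, in their original order. The function names are hypothetical, and the sketch lets the list grow or shrink, whereas a decoder working on a fixed-length frame buffer would shift the remaining samples instead.

```python
from typing import List


def insert_samples(exc: List[float], pos: int, n: int) -> List[float]:
    """Insert n low-level samples at index pos.

    ASSUMPTION: the filler is the n samples just before pos, attenuated by
    a factor of 20 and with inverted sign.
    """
    filler = [-x / 20.0 for x in exc[pos - n:pos]]
    return exc[:pos] + filler + exc[pos:]


def remove_samples(exc: List[float], pos: int, n: int) -> List[float]:
    """Remove n samples starting at index pos."""
    return exc[:pos] + exc[pos + n:]
```

Because the filler is 20 times smaller than the copied samples, the inserted segment adds very little energy in a region that was already chosen for its low energy.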

Using the above procedure, the last maximum pulse in the concealed adaptive codebook excitation is forced to be aligned with the actual maximum pulse position at the end of the adaptive codebook frame which is transmitted in the current frame.

5.3.3.4.2 Artificial onset reconstruction

If the frame is classified as ARTIFICIAL ONSET, it means that the lost frame probably contained a voice onset, and the transition mode has not taken care of it (e.g., several consecutive frames were erased, containing the voiced onset frame, but also the following frame usually coded with TC mode). If the position of the glottal pulse of the previous frame is in the bitstream of the first good frame (i.e. 32 and 64 kbps), the onset is reconstructed artificially inside the adaptive codebook.

The lost onset case is the most complicated situation related to the use of the long-term prediction in CELP decoding. The lost onset means that the voiced speech onset happened somewhere during the erased block. In this case, the last good received frame was unvoiced and thus no periodic excitation is found in the excitation buffer. The first good frame after the erased block is however voiced, the excitation buffer at the encoder is highly periodic and the adaptive excitation has been encoded using this periodic past excitation. As this periodic part of the excitation is completely missing at the decoder, it can take several frames to recover from this loss. It is worth emphasizing that this problem does not occur in the EVS codec for single frame erasures, as the frames following frames with voiced onsets are generally coded with TC mode, and TC mode does not make use of inter-frame long-term prediction.

If an ONSET frame is lost (i.e. a VOICED_CLAS good frame arrives after an erasure), but the last good frame before the erasure was UNVOICED_CLAS, a special technique is used to artificially reconstruct the lost onset and to trigger the voiced synthesis. The position of the last glottal pulse in the concealed frame can be available from the first good frame after the erasure (in case the phase information related to the previous frame is transmitted in the bitstream). The artificial onset reconstruction does not affect the concealment of the erased frame; it is only a matter of recovery. However, the last pulse of the erased frame is artificially reconstructed based on the position and sign information available in the first good frame after the erasure. This information consists of the position, , of the maximum pulse from the end of the frame and its sign. The last glottal pulse in the erased frame is thus constructed as a low-pass filtered pulse placed in the memory of the adaptive excitation buffer (previously initialized to zero), and centred at the decoded position, . If the pulse sign is positive, the low-pass filter used is a simple linear phase FIR filter with the impulse response. If the pulse sign is negative, the low-pass filter used is a linear phase FIR filter with the impulse response.

Placing the low-pass filtered glottal pulse at the proper position at the end of the concealed frame significantly improves the performance of the consecutive good frames and accelerates the decoder convergence to actual decoder states.

The energy of the periodic part of the artificial onset excitation is then scaled by the gain corresponding to the quantized and transmitted energy E_q, as described in subclause 5.3.3.3, for frame erasure concealment and divided by the gain of the LP synthesis filter.

(105)

The LP synthesis filter gain being computed as:

(106)

where is the LP synthesis filter impulse response. Finally, the artificial onset gain is reduced by multiplying the periodic part by 0.96.

The LP filter for the output speech synthesis is not interpolated in the case of an artificial onset construction. Instead, the received LP parameters are used for the synthesis of the whole frame.