3GPP TS 26.447, Release 17: Codec for Enhanced Voice Services (EVS); Error concealment of lost packets
18.104.22.168 Specifics for rate 24.4 kbps
As described in subclause 5.5.4 of , the activation flag and differential pitch lag are transmitted as side information to obtain better pitch lag estimates and excitation signal for the future frame to be concealed.
The first bit of the side information is read from the bitstream, yielding the activation flag. If the activation flag equals 0, no further decoding is performed. If the flag equals 1, an additional 4 bits are decoded, yielding the differential pitch lag. With 4 bits, 16 different states are signalled: 15 states represent the differential pitch lag, ranging from -7 to 7, and the remaining state signals that the pitch lag difference was outside the ±7 range on the encoder side.
In case the pitch lag difference is inside the signalled valid range of ±7, the differential pitch lag is added to the pitch lag of the last sub-frame. The result is used as an initial pitch lag estimate of the future 1st and 2nd sub-frame. The initial pitch lag estimates are used as an input to the pitch lag extrapolation procedure described in subclause 22.214.171.124. If the initial pitch lag estimates are available, the history of pitch lags used for the pitch extrapolation is updated with the initial pitch lag estimates. In case the criterion in clause 126.96.36.199 is not met, instead of , the initial pitch lag estimate is used for building the first and second subframe of the adaptive codebook during concealment.
In case the decoded value indicates that the pitch lag difference is outside the valid range of ±7, the pitch extrapolation is performed as if no future pitch lag information were available in the bitstream.
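The decoding steps above can be illustrated with a minimal sketch. The text does not state how the 16 states map to lag values, so the mapping used here (index 0..14 maps to -7..+7, index 15 is the out-of-range state) is an assumption, as are the function name and the bit ordering (most significant bit first).

```python
def decode_pitch_side_info(bits, last_subframe_lag):
    """Decode the activation flag and the differential pitch lag.

    bits: sequence of 0/1 values, MSB first (assumed ordering).
    Returns (active, initial_lag_estimate); the estimate is None when no
    usable pitch lag information is present.
    """
    OUT_OF_RANGE_INDEX = 15  # assumption: last state signals "outside +-7"
    it = iter(bits)
    flag = next(it)
    if flag == 0:
        # Flag not set: no further decoding is performed.
        return (False, None)
    index = 0
    for _ in range(4):
        index = (index << 1) | next(it)
    if index == OUT_OF_RANGE_INDEX:
        # Treated as if no future pitch lag information were available.
        return (True, None)
    diff = index - 7  # assumption: index 0..14 maps to -7..+7
    # The initial estimate serves the future 1st and 2nd sub-frame.
    return (True, last_subframe_lag + diff)
```

With this hypothetical mapping, the bits `1 0 1 1 1` decode to a difference of 0, so the initial estimate equals the lag of the last sub-frame.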
188.8.131.52 Specifics for rates 9.6, 16.4 and 24.4 kbps
As described in subclause 5.5.5 of , side information on the activation of the spectral envelope diffuser is transmitted to suppress overly sharp peaks in the LP spectrum at the decoder side. 1 bit is decoded to obtain the activation flag. If the value equals 1, the spectral envelope diffuser is activated; otherwise it is deactivated. When the spectral envelope diffuser is active, the following procedure is performed at the recovery frame.
A modified LSF parameter for the previous frame is calculated by placing the low-order coefficients of the LSF parameter of the concealed frame at equal spacing.
Then the LSF parameter for the current frame is replaced by the sum of the mean vector of the current coder type and the residual LSF vector obtained in the decoding of the current frame, and bandwidth separation is applied.
This bandwidth separation is applied to ensure stability and to suppress overly sharp peaks in the LP spectrum. The distances are wider than the distances used in the normal LSF decoding process. In case the internal sampling frequency is 12.8 kHz, the distances are as follows:
Then, LP coefficients are calculated based on those modified LSF parameters, and used instead of the LP coefficients obtained in the ordinary decoding process. The procedure for conversion from LSF to LPC is the same as in the normal decoding process.
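As an illustration of the bandwidth separation step, the following sketch enforces a minimum distance between adjacent LSF values. The distance values are not reproduced in the text above, so `min_gap` is a hypothetical parameter, and the single forward pass (with no handling of the upper band edge) is an assumption rather than the specified procedure.

```python
def enforce_min_spacing(lsf, min_gap):
    """Push each LSF up so that adjacent values are at least min_gap apart.

    lsf: ascending list of LSF values in Hz.
    min_gap: hypothetical minimum distance in Hz (not taken from the text).
    """
    out = list(lsf)
    for i in range(1, len(out)):
        if out[i] - out[i - 1] < min_gap:
            # Widen the gap; a wider gap flattens the spectral peak.
            out[i] = out[i - 1] + min_gap
    return out
```

Close LSF pairs correspond to sharp resonances in the LP spectrum, which is why widening the minimum distance limits peak sharpness.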
184.108.40.206 Energy control during recovery
Precise control of the speech energy is very important in frame erasure concealment. The importance of the energy control becomes more evident when a normal operation is resumed after an erased block of frames. Since VC and GC modes are heavily dependent on prediction, the actual energy cannot be properly estimated at the decoder. In voiced speech segments, the incorrect energy can persist for several consecutive frames, which can be very annoying, especially when this incorrect-valued energy increases.
The goal of the energy control is to minimize energy discontinuities by scaling the synthesized signal so that the energy of the signal at the beginning of the recovery frame (the first non-erased frame received following a frame erasure) is similar to the energy of the synthesized signal at the end of the last erased frame. The energy of the synthesized signal in the recovery frame is then made to converge, toward the end of that frame, to the energy corresponding to the received energy parameter, while limiting any increase in energy.
If the available bitrate is sufficiently high, the synthesized speech energy information can be estimated in the encoder and transmitted as side information to the decoder. In EVS, the energy information is transmitted only at 32 and 64 kb/s, using 5 bits. Further, it is transmitted only in the GC mode. In the TC mode, the energy control is not needed as the TC mode does not make use of the adaptive codebook, and memory-less LSF quantization is used. At lower bitrates, the correct energy is estimated at the decoder.
The energy control for LP-based decoding is triggered in the first non-erased frame following a frame erasure for modes other than TC. For frames coded at 7.2 and 8 kb/s, if this first non-erased frame uses Autoregressive (AR) prediction for the LP filter quantization, the energy control is continued in all subsequent frames using AR prediction. The energy control is then maintained for one more frame, as the synthesis filter can still be affected by the interpolation of the filter coefficients.
The energy control is done in the synthesized speech signal domain. Even if the energy is controlled in the speech domain, it is the LP excitation signal that is scaled. The synthesis is then repeated to smooth the transitions.
The energy control in a recovery frame is done as follows. Let $g_0$ denote the gain used to scale the 1st sample in the current frame and $g_1$ the gain used at the end of the frame. The excitation signal is then scaled as follows
$$u_s(i) = g_{AGC}(i)\,u(i), \quad i = 0, \ldots, L-1$$
where $u_s(i)$ is the scaled excitation, $u(i)$ is the excitation before the scaling, $L$ is the frame length and $g_{AGC}(i)$ is the gain starting from $g_0$ and converging exponentially to $g_1$. That is
$$g_{AGC}(i) = f_{AGC}\,g_{AGC}(i-1) + (1 - f_{AGC})\,g_1, \quad i = 0, \ldots, L-1$$
with the initialization $g_{AGC}(-1) = g_0$. The factor $f_{AGC}$ is the attenuation factor set to the value of 0.98. This value has been found experimentally as a compromise between having a smooth transition from the previous (erased) frame on one side, and scaling the last pitch period of the current frame as much as possible to the correct (transmitted) value on the other side. The gains $g_0$ and $g_1$ are defined as
$$g_0 = \sqrt{E_{-1}/E_0}, \qquad g_1 = \sqrt{E_q/E_1}$$
where $E_{-1}$ is the energy computed at the end of the previous (erased) frame, $E_0$ is the energy at the beginning of the current (recovered) frame, $E_1$ is the energy at the end of the current frame, and $E_q$ is the target energy at the end of the current frame. At higher bitrates, $E_q$ is computed at the encoder, quantized and transmitted. The energy information is quantized using a 5-bit linear quantizer in the range of 0 dB to 96 dB with a step of 3 dB. The quantization index is given by
The index is limited to the range [0, 31]. At lower bitrates, $E_q$ is estimated at the decoder.
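The excitation scaling described above can be sketched as follows. The first-order recursion used here is one common way to realize a gain that starts at one value and converges exponentially to another; it, the square-root energy ratios, and all names (`scale_excitation`, `recovery_gains`, `f_agc`) are assumptions for illustration. Only the attenuation factor 0.98 comes from the text.

```python
import math


def recovery_gains(e_prev_end, e_cur_start, e_cur_end, e_target):
    """Start and end gains as square roots of energy ratios (assumed form)."""
    g_start = math.sqrt(e_prev_end / e_cur_start)  # match end of erased frame
    g_end = math.sqrt(e_target / e_cur_end)        # reach the target energy
    return g_start, g_end


def scale_excitation(exc, g_start, g_end, f_agc=0.98):
    """Scale the excitation with a gain moving exponentially to g_end."""
    gain = g_start  # initialization before the first sample
    scaled = []
    for sample in exc:
        # Each step closes 2 % of the remaining distance to g_end.
        gain = f_agc * gain + (1.0 - f_agc) * g_end
        scaled.append(gain * sample)
    return scaled
```

With a factor of 0.98 and a 256-sample frame, the remaining distance to the end gain shrinks to well under 1 % of its initial value, so the last pitch period is scaled almost entirely by the end gain.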
The energy of the synthesized speech at the end of the first non-erased frame is first computed as follows. The energy is the maximum of the signal energy for frames classified as VOICED_CLAS or ONSET, or the average energy per sample for all other frames. For VOICED_CLAS or ONSET frames, the maximum signal energy is computed pitch-synchronously at the end of the current frame as follows:
where L is the frame length at internal sampling rate. Signal is the local synthesis signal sampled at the internal sampling rate. The integer pitch period length is the rounded pitch period of the last subframe, i.e. .
For all other classes, is the average energy per sample of the last half of the current frame, i.e.
is computed similarly using the synthesized speech signal of the previous (last erased) frame. When is computed pitch synchronously (i.e. if the class of the previous frame was VOICED_CLAS or ONSET), it uses the concealment pitch period .
When is computed pitch synchronously (the class of the current frame is VOICED_CLAS or ONSET), it is done similarly using the rounded pitch value of the first subframe:
For other frame classes:
As mentioned previously, is transmitted from the encoder, but only at high bitrates. If is not available, it is initialized to and further limited as described below.
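The two energy measures can be sketched as follows. The exact expressions are not reproduced in the text above, so this sketch assumes that the pitch-synchronous measure is the maximum squared sample over the last pitch period of the frame, and that the other measure is the mean squared sample over the last half of the frame. The function name and arguments are hypothetical.

```python
def frame_end_energy(synth, pitch_period, pitch_synchronous):
    """Energy at the end of a frame of synthesized speech.

    synth: samples of one frame at the internal sampling rate.
    pitch_period: rounded pitch period in samples (assumed >= 1).
    pitch_synchronous: True for VOICED_CLAS or ONSET frames.
    """
    n = len(synth)
    if pitch_synchronous:
        # Assumed form: peak squared amplitude over the last pitch period.
        segment = synth[max(0, n - pitch_period):]
        return max(x * x for x in segment)
    # Average energy per sample of the last half of the frame.
    half = synth[n // 2:]
    return sum(x * x for x in half) / len(half)
```

The energy at the beginning of a frame would be computed the same way on the first pitch period or first half of the frame.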
The gains and are further limited to a maximum allowed value to prevent too strong energy. This value has been set to 1.2 with the exception of very low energy frames ( < 1.1). In this case, is limited to 1. If is not transmitted, further precautions shall be taken because of the possible mismatch between the excitation signal energy and the LP filter gain.
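A minimal sketch of the gain limiting follows. The extracted text does not show which energy is compared against 1.1 or which gain is limited to 1; this sketch assumes that the target energy is tested and that the end-of-frame gain is the one limited. Both choices, and all names, are assumptions.

```python
def limit_gains(g_start, g_end, e_target, max_gain=1.2, low_energy=1.1):
    """Cap both gains at max_gain; cap the end gain at 1 for low energy."""
    g_start = min(g_start, max_gain)
    g_end = min(g_end, max_gain)
    if e_target < low_energy:
        # Assumption: very low energy frames must not be amplified at all.
        g_end = min(g_end, 1.0)
    return g_start, g_end
```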
At 7.2 or 8 kb/s, this is done by upper-limiting the energy by a value , scaled by a factor .
In the recovery frame or in the scaled frames following the recovery frame coded by the GC mode using AR prediction, and if the following conditions are met: 1) the end-frame LP filter is resonant in low frequencies (measured by means of the filter tilt), and 2) the evolution of the transmitted pitch is stable within the frame or the mean of the pitch value over all subframes is lower than 34 samples. In the remaining scaled frames following the recovery frame, the scaling factor and equals the larger value between and an average energy of recent voiced frames . If is computed pitch synchronously, is the running average of pitch-synchronous energy of previous frames. If is computed as average energy per sample, is the running average of average energy per sample of previous frames.
For other bitrates, when the erasure occurs during a voiced speech segment (i.e. the last good frame before the erasure and the first good frame after the erasure are classified as VOICED TRANSITION, VOICED_CLAS or ONSET) and the LP filter impulse response energy of the first frame after an erasure is twice as high as the LP filter impulse response energy of the last frame before the erasure, the energy of the excitation is adjusted to the gain of the new LP filter as follows:
Here is the energy of the LP filter impulse response of the last good frame before the erasure and is the energy of the LP filter of the first good frame after the erasure. The LP filters of the last subframes are used. Further, if (already initialized to ), it is further limited as follows:
At 9.6 and 13.2 kb/s, there is however one exception to this energy scaling strategy if the LP filter is found resonant in low frequencies and the frame is classified as UNVOICED_CLAS or INACTIVE_CLAS. This situation indicates a possible error in the classification and the energy is scaled as in the case of the 7.2 or 8 kb/s recovery frame.
The following exceptions, all related to transitions in the speech signal of good frames following an erasure, further overwrite the computation of . If artificial onset is used in the current frame, is set to , to make the onset energy increase gradually. If the first good frame after an erasure is classified as ONSET, the gain is prevented from being higher than . This precaution is taken to prevent a positive gain adjustment at the beginning of the frame from amplifying the voiced onset at the end of the frame. Finally, during a transition from voiced to unvoiced (i.e. the last good frame being classified as VOICED TRANSITION, VOICED_CLAS or ONSET and the current frame being classified as UNVOICED_CLAS) or during a transition from a non-active speech period to an active speech period, the value of is set to .
Additionally, the synthesis energy control is performed also in the erased frames following frames coded at 7.2 or 8 kb/s or using the AMR-WB IO mode. Here the energy control is simpler in the sense that it is just verified that the gain is not increasing. Energies , and are computed similarly as in the recovery frames, but is used instead of , and the gains and are limited to 1.
After the energy control, the speech signal is resynthesized by filtering the scaled excitation signal through the LP synthesis filter. The running energy average is finally updated in good voiced frames as with initialization to .
220.127.116.11 Specifics for rates 32 and 64 kbps
18.104.22.168.1 Adaptive codebook resynchronization and fast recovery (WB)
Fast recovery is an approach where side information with some bit rate overhead is transmitted to arrest error propagation into future frames, thereby improving performance under frame erasures. Side information includes parameters like energy, frame classification information and phase information. Specifically, the phase side information is used to align the glottal pulse position at the decoder to that of the encoder thereby synchronizing the adaptive codebook content. The information on the lost frame which becomes available on receiving the future frame is used to correct the excitation (pitch) memory before synthesizing the correctly received future frame. This helps to significantly contain the error propagation into future frames and improves decoder convergence when good frames are received after the erased frame. The waveform interpolation technique (, clause 5.2.3) is used to avoid abrupt changes in the pitch contour between the error concealed lost frame and the memory corrected future frame.
22.214.171.124.1.1 Decoding glottal pulse position
The glottal pulse position information consists of the position in the past, , of the absolute maximum pulse from the beginning of the current frame and its sign. If the first decoded pitch of the current frame is smaller than 128, the received quantized position is used as is; otherwise the received quantized position is multiplied by 2.
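The position scaling rule can be written as a short sketch. The function and argument names are hypothetical, and the sign of the pulse is assumed to be handled separately.

```python
def decode_pulse_position(quantized_position, first_pitch):
    """Return the decoded pulse position (in samples, counted into the past)."""
    if first_pitch < 128:
        return quantized_position
    # For long pitch periods the position is carried at half resolution.
    return 2 * quantized_position
```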
126.96.36.199.1.2 Performing glottal pulse resynchronization
The goal of the resynchronization is to correct the difference between the target transmitted position of the last glottal pulse in the adaptive codebook of the current frame, and its actual position in the concealed adaptive codebook excitation signal. The position T(0) of the maximum pulse in the concealed adaptive excitation, u(n), from the beginning of the frame is determined as described in the previous subclause. If the decoded maximum pulse position is positive, then the maximum positive pulse in the concealed adaptive codebook excitation from the beginning of the frame is determined. If the decoded maximum pulse position is negative, the maximum negative pulse is determined.
The target position of the absolute maximum pulse with respect to the beginning of the current frame is given by:
where has been defined as a decoded pulse or estimated as done in subclause 188.8.131.52.2. The error in the pulse position of the last concealed pulse in the frame is found by searching for the pulse closest to the actual pulse, . The error is given by:
where is the index of the pulse closest to and the difference between the actual pulse and the closest one. If , then no resynchronization is required. If then samples need to be inserted. If , then samples need to be removed. Further, the resynchronization is performed only if and , where is the absolute difference between and the pitch lag of the first subframe in the future frame, or its extrapolated value if it is not available.
The samples that need to be added or deleted are distributed across the pitch cycles in the frame. The minimum energy regions in the different pitch cycles are determined and the sample deletion or insertion is performed in those regions. The number of pitch pulses in the frame is at positions. The number of minimum energy regions is . If Tc ≤ 128, there shall be at least 2 minimum energy regions in the current frame. The minimum energy regions are determined by computing the energy using a sliding 5-sample window. The minimum energy position is set at the middle of the window at which the energy is at a minimum. The search performed between two pitch pulses at position and is restricted between and .
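The sliding-window search for a minimum energy position can be sketched as follows. The search limits between two pitch pulses are not reproduced in the text above, so they are passed in as the hypothetical parameters `start` and `end`; keeping the first minimum when several windows tie is likewise an assumption.

```python
def min_energy_position(signal, start, end, window=5):
    """Centre of the lowest-energy window lying fully inside [start, end).

    Returns None when no complete window fits into the search range.
    """
    best_pos = None
    best_energy = float("inf")
    half = window // 2
    for first in range(start, end - window + 1):
        energy = sum(x * x for x in signal[first:first + window])
        if energy < best_energy:
            best_energy = energy
            best_pos = first + half  # middle of the window
    return best_pos
```

Inserting or deleting samples where the energy is lowest keeps the modification away from the pitch pulses, where it would be most audible.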
The sample deletion or insertion is performed around , where are the minimum positions described above and is the number of minimum energy regions. The samples to be added or deleted are distributed across the different pitch cycles as follows.
If , then there is only one minimum energy region and all samples are inserted or deleted at .
For , a simple algorithm is used to determine the number of samples to be added or removed at each pitch cycle, whereby fewer samples are added/removed at the beginning and more towards the end of the frame. If the total number of samples to be removed/added is , and the number of minimum energy regions is , the number of samples to be removed/added per pitch cycle, , is found using the following recursive relation:
Note that at each stage, if then the values of and are interchanged. The values correspond to pitch cycles starting from the beginning of the frame. corresponds to , corresponds to , …, corresponds to . Since are in increasing order, more samples are added/removed towards the cycles at the end of the frame.
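The distribution of samples across pitch cycles can be illustrated with the sketch below. The recursive relation itself is not reproduced in the text above, so the quadratic cumulative allocation used here is an assumption. It is chosen only because it matches the stated properties: the counts sum to the total, grow towards the end of the frame, and reduce to a single count when there is one region. The function name and the rounding rule are also assumptions.

```python
import math


def distribute_samples(total, n_regions):
    """Split `total` samples over n_regions cycles, fewer first, more last."""
    f = 2.0 * total / (n_regions ** 2)
    counts = []
    for i in range(n_regions):
        # Assumed cumulative target after cycle i: quadratic in (i + 1).
        target = ((i + 1) ** 2) / 2.0 * f
        r = int(math.floor(target - sum(counts) + 0.5))  # round half up
        counts.append(r)
        # Interchange with the previous value if the order is decreasing.
        if i > 0 and counts[i] < counts[i - 1]:
            counts[i], counts[i - 1] = counts[i - 1], counts[i]
    return counts
```

For example, 5 samples over 3 regions gives 1, 1 and 3 samples, so most of the correction is applied in the last pitch cycle.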
Removing samples is straightforward. Adding samples is performed by copying the last samples after dividing by 20 and inverting the sign. For example, if 5 samples need to be inserted at position