5.5 Frame erasure concealment side information

3GPP TS 26.445, Release 15: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description

The codec has been designed with emphasis on performance in frame erasure conditions and several techniques limiting the frame erasure propagation have been implemented, namely the TC mode, the safety-net approach for LSF quantization, and the memory-less gain quantization. To further enhance the performance in frame erasure conditions, side information, consisting of concealment/recovery parameters, is sent in the bitstream to the decoder. This supplementary information improves the frame erasure concealment (FEC) and the convergence and recovery of the decoder after erased frames. The detailed concealment and recovery processing is described in [6].

The concealment/recovery parameters that are transmitted to the decoder depend on the bitrate and coding mode. They are described in the following subclauses, together with the configurations in which they are transmitted.

5.5.1 Signal classification parameter

The signal classification parameter is determined based on the classification for FEC, described in subclause 5.1.13.3. The classification uses the following five classes to classify speech signals: UNVOICED, UNVOICED TRANSITION, VOICED TRANSITION, ONSET and VOICED. For the purpose of the FEC classification, inactive signals fall into the UNVOICED category. Although there are five signal classes, they can be encoded with only two bits, because the two TRANSITION classes can be distinguished unambiguously from the class of the preceding frame. The rules for the FEC signal classification are described in subclause 5.1.13.3.3.

The signal classification parameter is not transmitted at the lowest bitrates. Further, it does not need to be transmitted in coding modes that allow the frame to be classified implicitly, e.g. in the UC and VC modes. The signal classification parameter is transmitted in GC and TC modes at 13.2 kb/s, 32 kb/s and 64 kb/s.

5.5.2 Energy information

Precise control of the speech energy is very important in frame erasure concealment. The importance of the energy control becomes more evident when a normal operation is resumed after an erased block of frames. Since VC and GC modes are heavily dependent on prediction, the actual energy cannot be properly estimated at the decoder. In voiced speech segments, the incorrect energy can persist for several consecutive frames, which can be very annoying, especially when this incorrect-valued energy increases.

To better control the energy of the synthesized sound signal at the decoder in case of frame erasure, the energy information is estimated and sent using 5 bits. The goal of the energy control is to minimize energy discontinuities. The synthesized signal is scaled so that its energy at the beginning of the recovery frame (the first non-erased frame received after a frame erasure) is similar to the energy of the synthesized signal at the end of the last erased frame. The energy of the synthesized signal in the recovery frame is then made to converge, toward the end of that frame, to the energy corresponding to the received energy parameter, while limiting any increase in energy.

The energy information $E$ is the maximum of the signal energy for frames classified as VOICED or ONSET, or the average energy per sample for all other frames. For VOICED or ONSET frames, the maximum signal energy is computed pitch-synchronously at the end of the current frame as follows:

$$E = \max_{i = L - t_E, \ldots, L - 1} \hat{s}^2(i)$$

where $L = 256$ is the frame length for the 12.8 kHz internal sampling rate, and $L = 320$ is the frame length for the 16 kHz sampling rate. Signal $\hat{s}(i)$ is the local synthesis signal sampled at 12.8 kHz or 16 kHz depending on the internal sampling rate. The integer pitch period length $t_E$ is the rounded pitch period of the last subframe, i.e. the fourth subframe for the 12.8 kHz core, and the fifth subframe for the 16 kHz core.

For all other classes, $E$ is the average energy per sample of the last two subframes of the current frame, i.e.,

$$E = \frac{1}{128} \sum_{i = L - 128}^{L - 1} \hat{s}^2(i)$$

The energy information is quantized using a 5-bit linear quantizer in the range of 0 dB to 96 dB with a step of 3 dB. The quantization index is given by

()

The index is limited to the range [0,…, 31].
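The energy computation and 5-bit quantization described above can be sketched as follows. This is a minimal illustration, not the codec's reference code: the function names are hypothetical, the subframe length of 64 samples is taken from the 256- and 320-sample frame lengths, and the floor rounding and the small offset that guards the logarithm are assumptions.

```python
import math

SUBFRAME_LEN = 64  # samples per subframe (256/4 at 12.8 kHz, 320/5 at 16 kHz)


def fec_energy(synth, pitch_period, voiced_or_onset):
    """Energy information for one frame of local synthesis samples."""
    frame_len = len(synth)  # 256 at 12.8 kHz, 320 at 16 kHz
    if voiced_or_onset:
        # Pitch-synchronous maximum over the last pitch period of the frame.
        start = max(0, frame_len - pitch_period)
        return max(x * x for x in synth[start:])
    # Average energy per sample over the last two subframes.
    tail = synth[frame_len - 2 * SUBFRAME_LEN:]
    return sum(x * x for x in tail) / len(tail)


def quantize_energy(energy, step_db=3.0, offset=1e-3):
    """5-bit index on a linear dB scale.

    Assumptions: floor rounding, and a small offset so that log10 is
    defined for zero energy.
    """
    level_db = 10.0 * math.log10(energy + offset)
    index = int(math.floor(level_db / step_db))
    return min(max(index, 0), 31)  # index limited to [0, 31]
```

For example, a VOICED frame whose largest sample in the last pitch period has amplitude 100 has an energy of 10000 (40 dB), which this sketch maps to index 13.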

The energy information is sent only in GC mode at 32 and 64 kb/s.

5.5.3 Phase control information

The phase control is particularly important when recovering after a lost voiced segment of a signal. After a block of erased frames, the decoder memories become unsynchronized with the encoder memories. Sending some phase information helps in the re-synchronization of the decoder. The rough position of the last glottal pulse in the previous frame is sent.

The position of the last glottal pulse is searched among the last samples of the previous frame, over a span equal to the integer closed-loop pitch lag of the last subframe of that frame, by looking for the sample with the maximum amplitude of the low-pass filtered LP residual signal. A simple FIR low-pass filter with coefficients 0.25, 0.5 and 0.25 is used.

The position of the last glottal pulse is encoded using 8 bits in the following manner. The precision used to encode the position of the pulse depends on the integer part of the closed-loop pitch lag for the first subframe of the current frame. This is possible because this value is known both at the encoder and the decoder, and is not subject to erasure propagation after one or several frame losses. When this lag is less than 128, the position of the last glottal pulse, relative to the end of the previous frame, is encoded directly with a precision of one sample. When this lag is 128 or greater, the position of the last glottal pulse, relative to the end of the previous frame, is encoded with a precision of two samples by using a simple integer division, i.e., the position is divided by 2. Finally, the information about the sign of the pulse is encoded by incrementing the transmitted index by 128 when the sign of the glottal pulse is negative. The MSB of the 8-bit index thus represents the sign of the last glottal pulse. The inverse procedure is done at the decoder.
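The pulse search and the 8-bit index described above can be sketched as follows. The function names are hypothetical, and several details are assumptions rather than statements from the text: position 0 denotes the last sample of the frame, "maximum amplitude" is read as maximum absolute value, samples outside the frame are treated as zero by the filter, and the 7-bit position field is clamped to 127 so that the MSB stays reserved for the sign.

```python
LP_TAPS = (0.25, 0.5, 0.25)  # FIR low-pass coefficients given in the text


def find_last_pulse(residual, pitch_lag):
    """Return (position, sign) of the last glottal pulse.

    position is counted back from the end of the frame (assumed convention).
    """
    n = len(residual)
    best_pos, best_val = 0, 0.0
    for i in range(max(0, n - pitch_lag), n):
        prev = residual[i - 1] if i > 0 else 0.0
        nxt = residual[i + 1] if i + 1 < n else 0.0
        lp = LP_TAPS[0] * prev + LP_TAPS[1] * residual[i] + LP_TAPS[2] * nxt
        if abs(lp) > abs(best_val):
            best_pos, best_val = n - 1 - i, lp
    return best_pos, (-1 if best_val < 0.0 else 1)


def encode_pulse(position, sign, first_lag):
    """8-bit index: 7 bits of position, MSB set for a negative pulse."""
    index = position if first_lag < 128 else position // 2
    index = min(index, 127)  # assumption: keep the MSB free for the sign
    return index + 128 if sign < 0 else index


def decode_pulse(index, first_lag):
    """Inverse of encode_pulse, up to the two-sample precision."""
    sign = -1 if index >= 128 else 1
    value = index & 0x7F
    position = value if first_lag < 128 else 2 * value
    return position, sign
```

With a first-subframe lag of 160, a negative pulse at position 151 is sent as index 203 and decoded as position 150, which shows the two-sample precision.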

The phase control information is sent only in GC mode at 32 and 64 kb/s.

5.5.4 Pitch lag information

To improve speech quality over error-prone channels, a pitch lag estimate for the next frame is calculated at the encoder and transmitted as side information, so that a better excitation can be generated for a concealed frame. By exploiting the 8.75-ms look-ahead signal used for the frame-end autocorrelation calculation, the pitch lag can be obtained without any additional delay. This tool is active only for ACELP frames in the 24.4 kbps operating mode.

The side information includes an activation flag. For frames classified as ONSET or VOICED under GC or VC mode, the activation flag is set to 1. For all other frames, the activation flag is set to 0.

If the activation flag equals 1, the pitch lag is encoded with 4 bits and transmitted in addition to the activation flag. If the activation flag equals 0, only the activation flag is transmitted as side information.

To estimate the pitch lag in the look-ahead signal, this tool uses an extrapolated LSF vector and the corresponding LP coefficients. For the LSF extrapolation, the mean LSF vector is updated every frame and the extrapolated LSF is calculated as follows.

(1302)

The equation uses the LSF vectors of the last three frames and the mean LSF vector. Its constants depend on the previous coder type and the signal class; the decision rule for the constants used for LSF concealment also applies to them. The extrapolated LSF is converted into LSP parameters and then into extrapolated LP coefficients, using the same procedure as under clean channel conditions.

The LP residual is calculated for the 8.75-ms look-ahead signal with the extrapolated LP coefficients. To keep complexity low, the LP residual is used as the target signal without perceptual weighting when estimating the pitch lag. The pitch lag estimate is obtained by maximizing the following correlation.

(1303)

where L’ is the number of samples of the 8.75-ms look-ahead sub-frame, and the correlation is taken against the past excitation at a delay of k. The search range is limited to a window around the pitch lag of the last sub-frame. The differential pitch lag relative to that lag is then included in the side information using 4 bits.
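The lag search and the side-information layout can be illustrated with the sketch below. It rests on several assumptions that are not stated in the text: the correlation is taken to be a normalized cross-correlation, the search window is taken to be 8 lags below to 7 lags above the last sub-frame lag (16 candidates, matching 4 bits), the delayed excitation is extended periodically when the delay is shorter than the look-ahead segment, and all function names are hypothetical.

```python
import math


def estimate_lookahead_lag(target, past_exc, last_lag, lo=-8, hi=7):
    """Pick the lag in [last_lag + lo, last_lag + hi] with the best score.

    target   : LP residual of the look-ahead segment (L' samples)
    past_exc : past excitation, most recent sample last
    The normalized correlation and the window bounds are assumptions.
    """
    n = len(past_exc)
    best_lag, best_score = last_lag, -math.inf
    for k in range(last_lag + lo, last_lag + hi + 1):
        # Excitation delayed by k, repeated with period k past the frame end.
        delayed = [past_exc[n - k + (i % k)] for i in range(len(target))]
        num = sum(t * d for t, d in zip(target, delayed))
        den = math.sqrt(sum(d * d for d in delayed))
        score = num / den if den > 0.0 else -math.inf
        if score > best_score:
            best_lag, best_score = k, score
    return best_lag


def pack_side_info(active, lag=None, last_lag=None, lo=-8):
    """Bit string: 1 flag bit, plus 4 bits of differential lag when active."""
    if not active:
        return "0"
    return "1" + format(lag - last_lag - lo, "04b")
```

For a pulse train with a period of 40 samples and a last sub-frame lag of 43, the sketch selects lag 40 and writes the bits `10101` (flag 1, differential index 5).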

5.5.5 Spectral envelope diffuser

A frame loss around a speech onset sometimes causes overly sharp peaks in the LP spectrum and a sudden power increase in the concealed signal. The spectral envelope diffuser mitigates the sudden power change and provides better recovery of the concealed signal. The activation flag for the spectral envelope diffuser is encoded with 1 bit and transmitted as side information. This tool is active only at 9.6, 16.4 and 24.4 kbps.

The activation decision is based on a merit function that depends on the LSF improvement counter, the quantized LSF parameters of the previous frame, the extrapolated LSF obtained in the guided PLC for pitch lag at the previous frame, and a modified LSF parameter. The modified LSF parameter is calculated as follows.

(1304)

where is the lowest number of j which satisfies the following equation.

(1305)

(1306)

where is computed as follows. is a threshold which equals to 1900 for 12.8 kHz internal sampling frequency, 2375 for 16 kHz internal sampling frequency.

After being initialized to 0, the LSF improvement counter is computed as follows:

(1307)

If the following four equations are all satisfied, the activation flag is set to 1; otherwise it is set to 0. One threshold used in these equations equals 90 for the 12.8 kHz internal sampling frequency and 112.5 for the 16 kHz internal sampling frequency. Another equals 800 for the 12.8 kHz internal sampling frequency and 1000 for the 16 kHz internal sampling frequency.

(1308)

(1309)

(1310)

(1311)


The activation flag is then updated based on the algebraic codebook gains of the previous and the current frame. If either of the following equations is satisfied, the activation flag is set to 0.

(1312)

These equations involve the minimum value of the algebraic codebook gains of the current frame, the mean value of the algebraic codebook gains of the current frame, and the mean value of the algebraic codebook gains of the previous frame.

The activation flag is further updated using the stability factor of the LSF parameters. If the stability factor is greater than 0.02, the activation flag is set to 0.

Finally, the activation flag is encoded with 1 bit and transmitted as side information.

5.5.6 Tonality flag information

A flag is set to one if the bitrate is one of {48 kbps, 96 kbps, 128 kbps}. For every frame for which this flag is one, two spectral flatness parameters are computed as follows:

Let $n$ be the sequence number of an arbitrary frame and let $F_n$ denote the spectral flatness of the $n$-th frame. $F_n$ is defined as follows:

$$F_n = \frac{G_n}{A_n}, \qquad G_n = \left( \prod_{k=0}^{N-1} |c_n(k)| \right)^{1/N}, \qquad A_n = \frac{1}{N} \sum_{k=0}^{N-1} |c_n(k)|$$ (1312a)

where $G_n$ is the geometric mean of the signal amplitudes, $A_n$ is the arithmetic mean of the signal amplitudes, $c_n(k)$ is the MDCT coefficient at frequency point $k$, and $N$ is the number of the frequency points. The MDCT coefficients are either the original MDCT coefficients or the spectrum-shaped MDCT coefficients.

The original MDCT coefficients and the spectrum-shaped MDCT coefficients are used to compute the two spectral flatness parameters of the frame, one from each set of coefficients. For a frame in TCX20 mode, the spectral flatness is computed using the MDCT coefficients of the whole frame. For a frame in TCX10 mode, the spectral flatness is computed using the MDCT coefficients of the second sub-frame. In both cases, the number of frequency points used is the smaller of the number of MDCT frequency points and 200.

If one of the two flatness parameters is smaller than a threshold, the frame type flag is set to tonal; otherwise, it is set to non-tonal. If the other flatness parameter is smaller than another threshold, the frame type flag is reset to tonal. The resulting frame type flag is then transmitted to the decoder together with the coded bitstream.
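The flatness measure and the two-threshold decision can be sketched as follows. The function names and the threshold values used in the example are hypothetical, since the text does not give the thresholds. The small constant `eps` is an assumption added to keep the logarithm defined for zero-valued coefficients.

```python
import math


def spectral_flatness(mdct, max_points=200, eps=1e-12):
    """Ratio of geometric to arithmetic mean of the MDCT amplitudes.

    Uses at most max_points frequency points, as stated in the text.
    eps is an assumption that guards log(0) and division by zero.
    """
    n = min(len(mdct), max_points)
    amps = [abs(c) for c in mdct[:n]]
    geometric = math.exp(sum(math.log(a + eps) for a in amps) / n)
    arithmetic = sum(amps) / n + eps
    return geometric / arithmetic


def is_tonal(flatness_1, flatness_2, thr_1, thr_2):
    """Frame type decision: tonal if either flatness is below its threshold."""
    tonal = flatness_1 < thr_1   # first decision: tonal or non-tonal
    if flatness_2 < thr_2:       # second test can only reset the flag to tonal
        tonal = True
    return tonal
```

A flat spectrum gives a flatness close to 1, while a spectrum dominated by a single peak gives a value close to 0. With both hypothetical thresholds set to 0.1, only the latter would be flagged as tonal.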