6.7.2 Decoding for LP-CNG
3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description (Release 15)
6.7.2.1 LP-CNG decoding Overview
When the decoder is in the LP-CNG operation, a procedure to synthesize a comfort noise signal is applied.
For each received SID frame, the one bit indicating the bandwidth type of the SID frame is decoded first. A WB SID frame has been received if the bandwidth bit equals “0”; otherwise an SWB SID frame has been received. The LP-CNG decoder operates only in WB mode as long as no SWB SID frame has been received, in which case comfort noise is generated for the low band only. Otherwise, the LP-CNG decoder switches to SWB mode upon receiving an SWB SID frame. Since the transmission of the high-band CN parameters is not synchronized with the transmission of the low-band CN parameters, WB SID frames can be received even when the LP-CNG decoder operates in SWB mode. In that case, the energy parameter for high-band CN synthesis is extrapolated from the low-band CN synthesis signal.
The low-band excitation energy is decoded from each LP-CNG SID frame, and from it a smoothed low-band excitation energy used for low-band CNG synthesis is computed, as described in subclause 6.7.2.1.3. The low-band LSF vector is decoded from each LP-CNG SID frame and converted to an LSP vector, from which a smoothed LSP vector is computed and then converted to LP coefficients to obtain the low-band CNG synthesis filter, as described in subclause 6.7.2.1.4. If a WB LP-CNG SID frame is received, the residual spectral envelope is decoded and a smoothed residual spectral envelope is computed from it, as described in subclause 6.7.2.1.5. A random excitation signal is generated from the smoothed low-band excitation energy and combined with a second excitation signal, generated from the smoothed residual spectral envelope, to form the final excitation signal for the low-band CNG synthesis, as described in subclause 6.7.2.1.5. Low-band comfort noise is synthesized by filtering the final low-band excitation signal through the low-band CNG synthesis filter, as described in subclause 6.7.2.1.6.
Subclause 6.7.2.1.7 describes high-band decoding and synthesis, which apply when the decoder is operating in SWB mode. When an SWB LP-CNG SID frame is received, the high-band energy of the frame is decoded from the SID frame. For other types of received frames, that is, WB LP-CNG SID frames and NO_DATA frames, the high-band energy of the frame is generated locally at the decoder by extrapolating from the smoothed low-band energy of the frame, which is obtained from the low-band CNG synthesis, together with the high-band to low-band energy ratio calculated at the last received SWB LP-CNG SID frame. The high-band energy is further smoothed in each frame before it is used for the final high-band CNG synthesis. For each CN frame, the high-band LSF spectrum used to obtain the high-band CNG synthesis filter is interpolated from the LSF spectrum of the hangover frames. The high-band comfort noise is synthesized for each CN frame by filtering a random excitation through the high-band CNG synthesis filter and then scaling the result to the level corresponding to the smoothed high-band energy. The scaled high-band synthesis signal is finally spectrally flipped to the bandwidth from 12.8 kHz to 14.4 kHz, as described in subclause 6.1.5.1.12. The resulting spectrally flipped high-band synthesis signal is added to the low-band synthesis signal to form the final SWB comfort noise synthesis signal.
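The mode handling outlined in this overview can be sketched as a small state tracker. This is only an illustration under stated assumptions: the class name, the string labels and the return values are hypothetical, and any return to WB mode (for example after an active period) is not modelled because the overview does not describe it.

```python
from dataclasses import dataclass
from typing import Optional

WB, SWB = "WB", "SWB"


@dataclass
class LpCngModeTracker:
    """Hypothetical helper: tracks the LP-CNG decoder mode across CN frames."""
    mode: str = WB  # WB mode until the first SWB SID frame arrives

    def on_frame(self, frame_type: str, bandwidth_bit: Optional[int] = None) -> str:
        """Return how the high-band energy is obtained for this frame.

        frame_type is "SID" or "NO_DATA"; bandwidth_bit is the decoded
        bandwidth bit of a SID frame (0 means WB SID, otherwise SWB SID).
        """
        if frame_type == "SID" and bandwidth_bit != 0:
            # SWB SID frame: switch to SWB mode, energy is decoded directly.
            self.mode = SWB
            return "decoded"
        if self.mode == SWB:
            # WB SID or NO_DATA frame while in SWB mode: extrapolate locally.
            return "extrapolated"
        # WB mode: only low-band comfort noise is generated.
        return "none"
```

For example, a WB SID frame received before any SWB SID frame yields no high-band synthesis, while the same frame received after an SWB SID frame leads to extrapolation.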
6.7.2.1.1 CNG parameter updates in active periods
During actively encoded periods without comfort noise parameters, four buffers of fixed, predetermined size are kept updated with the current actively encoded frame’s LSPs, an LSP domain flag memory, the frame’s excitation energy (in the LP-residual domain) and the current low-frequency spectral envelope of the excitation as:
(1947)
()
()
()
where and
are, respectively, the real and the imaginary parts of the
-th frequency bin as outputted by the FFT of the LP excitation signal,
= 256 is the size of FFT analysis. The attenuation factor
is given by
(1950a)
where is determined by the latest bitrate used for actively encoded frames
, not including the current frame, according to Table 172a.
Table 172a: Attenuation factor selection
| Latest active bitrate [kbps] | |
| | 1.7938412 |
| | 1.3952098 |
| | 1.0962363 |
| | 0.9965784 |
| | 0.9965784 |
These buffers are implemented as circular FIFO (First In, First Out) buffers of the fixed predetermined size to save complexity.
6.7.2.1.2 DTX-hangover based parameter analysis in LP-CNG mode
To provide smoother sounding comfort noise synthesis in transitions from active to inactive (CNG) coding, the 3 bit parameter is decoded from the bit stream and used as the indicator for determining the initial subset of the
sized buffers with (
) parameters from the last active frames. The most recent
number of frames of the stored parameters (
), are used for an additional comfort noise parameter analysis in the very first SID frame after an active speech segment.
Before copying the most recent
vectors to the CNG-analysis buffer
, the past LSPs that were analysed at a sampling frequency different from that of the current SID frame are converted to the current frame’s sampling frequency according to the information available in the flag vector
. The
most recent
values are copied into the analysis vector
and the
most recent
values are copied into the analysis vector
.
An age-weighted average energy of the entries that are less than 103 % of the most recent energy value and greater than 70 % of the most recent energy value is computed as
, further the number of entries in
used for this average calculation is stored as
. The age weights for the
computation are:
()
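The selection and averaging step can be sketched as follows. This is a minimal illustration, not the normative procedure: the function name is hypothetical, the weights are passed in by the caller because their values are not reproduced here, and normalising by the sum of the weights actually used is an assumption.

```python
def age_weighted_energy(energies, age_weights):
    """Age-weighted average of the entries close to the most recent energy.

    energies[0] is the most recent value; age_weights[k] is the weight
    for an entry that is k frames old (values supplied by the caller).
    Returns (average, number_of_entries_used).
    """
    if not energies:
        raise ValueError("at least one energy value is required")
    latest = energies[0]
    weighted_sum = 0.0
    weight_total = 0.0
    used = 0
    for energy, weight in zip(energies, age_weights):
        # Keep entries strictly between 70 % and 103 % of the latest value.
        if 0.70 * latest < energy < 1.03 * latest:
            weighted_sum += weight * energy
            weight_total += weight
            used += 1
    if weight_total == 0.0:
        # Degenerate case (e.g. latest == 0): fall back to the latest value.
        return float(latest), 0
    # Assumption: normalise by the sum of the weights actually used.
    return weighted_sum / weight_total, used
```

With energies `[100, 102, 50, 90, 104]`, the entries 50 and 104 fall outside the 70 % to 103 % window, so three entries contribute to the average.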
Further the vectors corresponding in time to the past residual energies in
used for
are saved in a buffer
. The buffer
is converted into the LSF domain in buffer
.
Two outlier vector indices [] in the
buffer among the
vector entries are found, by analysing the maximum average LSF deviation to a uniform LSF spectrum. An average LSP-vector
is calculated, by computing the LSP-average with exclusion of zero, one or two of the found outlier vectors, depending on the value of
.
The sum of the LSP-deviations with respect to the received SID-frame’s quantized LSPs , is computed as:
()
Further the maximum individual distortion contribution in the summation above is saved as.
If there were no past CN-parameters to analyse or a residual energy step was detected, the received vector is used as the final
vector right away. On the other hand, if there were some past active SAD hangover frames to analyse and no energy step was detected, the
and
are now used to control the vector update over
, of the final CNG LSPs
using the average LSP-vector
as follows:
(1953)
The energy step is detected if it is the first CN frame after an active frame and the energy quantization index decoded from the current SID frame is greater than the previous energy quantization index
by more than 1, where
. Additionally, if there were past CN-parameters, an energy step is detected if the most recent energy value in
is more than four times larger than the smoothed quantized excitation energy
. Further the
vectors that originate from active or WB SID frames among the
vectors corresponding in time to the past residual energies in
used for
are saved in a buffer
. The average envelope of
is computed, and twice the smoothed residual spectral envelope of the previous frame,
, calculated in equation (1958), is subtracted from it. The resulting average envelope is used to initialize the smoothed residual spectral envelope if no energy step is detected.
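The two energy step conditions can be sketched as a single predicate. This is an illustrative reading of the text only: all parameter names are hypothetical, "more than four times larger" is taken as a strict factor-of-four comparison, and combining the two conditions with a logical OR is an assumption.

```python
def energy_step_detected(first_cn_after_active,
                         energy_index,
                         prev_energy_index,
                         have_past_cn_params,
                         latest_buffered_energy,
                         smoothed_excitation_energy):
    """Return True if an energy step is detected (hypothetical helper)."""
    # Condition 1: first CN frame after an active frame, and the decoded
    # energy quantization index exceeds the previous one by more than 1.
    index_jump = first_cn_after_active and (energy_index - prev_energy_index) > 1
    # Condition 2: with past CN parameters available, the most recent
    # buffered energy exceeds four times the smoothed excitation energy.
    buffer_jump = (have_past_cn_params
                   and latest_buffered_energy > 4.0 * smoothed_excitation_energy)
    return bool(index_jump or buffer_jump)
```

When the predicate is true, the received LSP vector would be used directly and the smoothed residual spectral envelope would not be re-initialized from the buffered envelopes.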
When a SID frame is received and there was no energy step detected, first the received and decoded LSP vector is added to the CNG-analysis buffer
in a FIFO manner for a buffer size of up to
, and secondly the decoded residual energy value in the SID frame
is added to the CNG analysis buffer
in a FIFO manner for a buffer size of up to
, then thirdly, if applicable (depending on if the SID frame is of WB type or not), the decoded low frequency envelope of the excitation from the SID frame is added to the CNG analysis buffer
for a buffer size of up to
.
During actively encoded periods, i.e. not including SID frames, the least recent (first added) element in the buffer and the corresponding element in the buffer
are periodically excluded from the buffers, where the period, counted in consecutive actively encoded frames, is given by the decrement factor
. Because the buffers are implemented as circular FIFO buffers, the elements do not actually have to be deleted; instead, the variable
representing the number of valid buffer elements, i.e. elements used for determination of
and
, is given by:
(1953a)
where is the number of valid buffer elements in the very beginning of the actively encoded period,
is a non-negative integer and
is a counter of consecutive actively encoded frames. The variable
, together with a pointer to the most recently added buffer element, determines the valid buffer elements.
6.7.2.1.3 LP-CNG low-band energy decoding
The quantized low-band excitation energy in logarithmic domain is decoded from each LP-CNG SID frame and converted to linear domain using the procedure described in subclause 5.6.2.1.5. The resulting linear domain low-band excitation energy is used to obtain the smoothed low-band excitation energy
used for low-band CNG synthesis in the same way as described in subclause 5.6.2.1.6.
6.7.2.1.4 LP-CNG low-band filter parameters decoding
The quantized LSF vector is found in the same way as described in subclause 6.1.1.1.1. For the two-stage quantizer, two indices define the LSF vector. The index of the first-stage codevector is retrieved and the codevector components are obtained from the 16-dimensional codebook of 16 codevectors. The second-stage index is interpreted as in subclause 6.1.1.1.1 and the corresponding multiple scale lattice codevector is obtained. If the codevector index from the first stage has one of the values 0, 1, 2, 3, 7, 9, 12, 13, 14, 15, the permutations specified in subclause 5.6.2.1.3 are applied to the decoded codevector. The resulting codevector is added to the codevector obtained in the first stage, and the result is the decoded LSF vector. The sampling frequency of the LP-CNG frame can be determined by checking the value of the highest-order LSF coefficient (the last coefficient). If the last decoded LSF coefficient is larger than 6350, the decoded frame has a sampling rate of 16 kHz; otherwise it is sampled at 12.8 kHz and contains either NB or WB LSF data. The smoothed LP synthesis filter, , is then obtained in the same way as described in subclause 5.6.2.1.4.
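The two decisions stated in this subclause, whether the permutation applies and which sampling rate the frame has, can be sketched as follows. The function and constant names are hypothetical, and the threshold 6350 is assumed to be expressed in the same unit as the decoded LSF values (presumably Hz); the permutation itself is specified in subclause 5.6.2.1.3 and is not reproduced here.

```python
# First-stage indices for which the permutation is applied, as listed
# in the text above.
PERMUTED_FIRST_STAGE_INDICES = frozenset({0, 1, 2, 3, 7, 9, 12, 13, 14, 15})

# Threshold on the last LSF coefficient (unit assumed to match the LSFs).
LSF_16K_THRESHOLD = 6350.0


def needs_permutation(first_stage_index):
    """True if the second-stage codevector must be permuted."""
    return first_stage_index in PERMUTED_FIRST_STAGE_INDICES


def cng_sampling_rate_hz(decoded_lsf):
    """Infer the LP-CNG frame sampling rate from the last LSF coefficient."""
    if not decoded_lsf:
        raise ValueError("decoded LSF vector must not be empty")
    return 16000 if decoded_lsf[-1] > LSF_16K_THRESHOLD else 12800
```

A decoded vector whose last coefficient is 7000 would therefore be treated as a 16 kHz frame, and one ending at 6000 as a 12.8 kHz frame.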
6.7.2.1.5 LP-CNG low-band excitation generation
The low-band excitation signal used for CNG synthesis is generated by combining a random excitation and an excitation representing the low frequency spectral details of the excitation signal.
The random excitation is generated for each subframe using a random integer generator, the seed of which is updated by
(1954)
where is the seed value, initially set to 21845, and short[.] limits the value to the interval [–32768; 32767]. The generated random sequence, denoted as
, is scaled for each subframe by
(1955)
where denotes the length of subframe,
is the smoothed quantized excitation energy, as described in subclause 6.7.2.1.3, with some random variation between subframes added. The random variation is added to the smoothed quantized excitation energy by
()
where is a random integer number generated for each subframe using the same equation (1954) with the initial value of 21845. The purpose of this variation is to better model the variance of background noise during inactive signal periods. The scaled random sequence in each subframe,
, is concatenated to form the random excitation signal for the whole frame,
, where
is the frame length.
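The random excitation generation can be sketched as below. Several points are assumptions and not taken from the text above: the seed update is written as a 16-bit linear congruential generator with multiplier 31821 and increment 13849 (constants commonly seen in reference code of this codec family), the scaling is chosen so that the mean-square value of each subframe equals the target energy, and all function names are hypothetical. The per-subframe random variation of the energy is omitted.

```python
import math


def to_int16(value):
    """Wrap an integer to the signed 16-bit range [-32768, 32767]."""
    return ((value + 32768) % 65536) - 32768


def next_seed(seed, multiplier=31821, increment=13849):
    """One seed update; the default constants are assumed, not normative."""
    return to_int16(seed * multiplier + increment)


def scaled_random_subframe(seed, length, target_energy):
    """Generate one subframe of random excitation.

    Returns (samples, updated_seed). The samples are scaled so that
    their mean-square value equals target_energy (assumed convention).
    """
    raw = []
    for _ in range(length):
        seed = next_seed(seed)
        raw.append(float(seed))
    energy = sum(x * x for x in raw)
    gain = math.sqrt(length * target_energy / energy) if energy > 0.0 else 0.0
    return [gain * x for x in raw], seed
```

A frame would be built by calling `scaled_random_subframe` once per subframe, carrying the returned seed forward, and concatenating the results.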
The excitation representing the low-frequency spectral details of the excitation signal is generated from the quantized residual spectral envelope. The quantized residual spectral envelope is recovered from each WB SID frame by inverting the procedure described in subclause 5.6.2.1.6:
()
where is the quantized residual spectral envelope,
is the entry of the residual spectral envelope codebook found with the index decoded from the SID frame,
is the quantized total excitation energy calculated using the corresponding equation in subclause 5.6.2.1.6. A smoothed residual spectral envelope is updated at each CN frame by
through an AR filtering
(1958)
where denotes the smoothed residual spectral envelope from the previous frame. If the current frame is of NO_DATA type, the
from the last received SID frame is used. The FFT spectrum of the random excitation,
, generated in equation (1955) is computed and based on this the low frequency spectral envelope,
, corresponding to the one transmitted in the SID frame is calculated in a similar manner. The difference envelope between the smoothed spectral envelope
and the random excitation envelope
is calculated
()
where is the smoothed quantized excitation energy obtained in subclause 6.7.2.1.3; twice
is compensated to the
before the difference envelope is calculated. Slight random variation is added to the difference envelope.
(1960)
where is a series of random integer numbers generated for each envelope band using the same equation (1954) with the initial value of 21845. A set of 256-point random-phase FFT coefficients is generated such that its low-frequency spectral envelope equals the difference envelope
and the coefficients corresponding to other frequencies are set to 0. An IFFT is applied to these FFT coefficients and a 256-point time-domain sequence
is output.
is resampled to 320 points if the 16 kHz core is in operation.
is scaled for each subframe in a way similar to equation (1955):
()
where is the scaled
with random variation added,
denotes the length of subframe,
is a random integer number generated for each subframe using the same random generator as used in equation (1960),
is the average energy of
calculated as
()
where is the frame length. An energy increase is not allowed if the current frame is the first SID frame after an active burst; in that case, for subframes with
,
is limited to
. The
is the excitation representing the low-frequency spectral details of the excitation signal, which is attenuated and combined with the earlier calculated random excitation
()
where is the frame length,
is the combined excitation signal. The combined excitation
is scaled if its average energy is higher than the smoothed quantized excitation energy
obtained in subclause 6.7.2.1.3.
()
where is the frame length,
is the scaled combined excitation which is the final excitation signal for low-band CNG synthesis.
6.7.2.1.6 LP-CNG low-band synthesis
The low-band comfort noise is synthesized by filtering the scaled combined excitation signal, , obtained in previous subclause 6.7.2.1.5 through the smoothed LP synthesis filter,
, obtained in subclause 6.7.2.1.4.
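The synthesis filtering step can be illustrated with a direct-form all-pole filter. This is a generic sketch, not the reference implementation: the function name is hypothetical, and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_M z^-M for the LP coefficients is an assumption.

```python
def lp_synthesis(excitation, lp_coeffs, memory=None):
    """Filter the excitation through 1/A(z).

    lp_coeffs holds [a_1, ..., a_M] with A(z) = 1 + sum_k a_k z^-k
    (assumed convention). memory holds the last M output samples of the
    previous frame, most recent last. Returns (output, updated_memory).
    """
    order = len(lp_coeffs)
    history = list(memory) if memory is not None else [0.0] * order
    output = []
    for sample in excitation:
        # s(n) = e(n) - sum_k a_k * s(n - k)
        acc = sample
        for k in range(order):
            acc -= lp_coeffs[k] * history[-1 - k]
        output.append(acc)
        history.append(acc)
    return output, history[len(history) - order:]
```

Carrying the returned memory into the next call keeps the synthesis continuous across frame boundaries, which is why the filter memories appear among the per-frame updates.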
6.7.2.1.7 LP-CNG high-band decoding and synthesis
To enable high perceptual quality in the inactive portions of speech on the decoder side, during SWB mode operation of the codec, a high band comfort noise synthesis (SHB-CNG) (12.8 – 14.4 kHz) is added to the low bandwidth (0-12.8 kHz) LP-CNG synthesis output. This also helps to ensure smooth transitions between active and inactive speech.
However, this is done without transmitting any extra parameters from the encoder to the decoder to model the high-band spectral characteristics of the inactive frames. Instead, to model the high-band spectrum (12.8 – 14.4 kHz) of the comfort noise, the high-band LSF parameters of the active speech frames preceding the current inactive frames are used after interpolation, as described below. The hangover setting in the SAD algorithm ensures that the active speech segments used for estimating the spectral characteristics of the inactive frames sufficiently capture the background noise characteristics without significant impact from the talk spurt.
The quantized LSF vectors of order 10 corresponding to active speech high band (subclauses 5.2.4.1.3.1 and 6.1.5.1.3.1) received at the decoder are buffered up to two past active frames (N-1) and (N), denoted by and
where N+M is the current inactive frame. Using these, the LSF vector corresponding to SHB-CNG of (N+M) th inactive frame
is interpolated as
()
where interpolation factor T is computed as
()
using the number of inactive frames M leading up to the current inactive frame (N+M) since the last active frame N.
This interpolated LSF vector is then converted to LPC coefficients and used as the coefficients of LP synthesis filter to generate a synthesized signal. The energy of the high-band signal is obtained for each CN frame by either directly decoding from the SID frame if the SID frame is a SWB SID frame or by extrapolating for other received frame types. If SWB SID frame is received, the high-band energy of the frame which is the quantized high-band log average energy,
, is recovered by
()
where is the high-band energy index decoded from the SWB SID frame. If
is 0,
is set to -15 for a lower noise floor. If WB SID frame or NO_DATA frame is received, the high-band energy of the frame is generated locally at the decoder by extrapolating from the smoothed low-band energy of the frame together with the high-band to low-band energy ratio at the last received SWB LP-CNG SID frame. The smoothed low-band energy of the frame is a weighted average of the low-band energy of the current frame and the smoothed low-band energy of the previous frame. The low-band energy of the current frame which is the log average energy of the low-band signal is calculated from the low-band synthesis signal
()
where is the low-band synthesis signal as obtained in subclause 6.7.2.1.6,
= 640 is the length of the low-band synthesis signal. If the low-band energy
of the current frame deviates from the smoothed low-band energy of the previous frame
by more than 12 dB, a step update flag
is set to 1, indicating that a step update is permitted; otherwise it is set to 0. If the flag
is set to 1, the smoothed low-band energy at the current frame,
, is set to the current frame’s low-band energy
. Otherwise, if the flag
is set to 0, the smoothed low-band energy is updated at the current frame as
(1969)
where denotes the smoothed low-band energy of the previous frame. The high-band energy of the frame,
, for received WB SID or NO_DATA frame is thus extrapolated as the sum of
and
, where
denotes the high-band to low-band energy ratio at the last received SWB SID frame i frames ago. The high-band energy of the frame is then smoothed for final use according to
(1970)
where is the smoothed high-band energy of the current frame,
denotes the smoothed high-band energy of the previous frame,
is the forgetting factor which is set to 0 if
is set to 1 or the current frame is the first frame after an active burst, otherwise
is set to 0.75. The high-band comfort noise is synthesized by filtering a 320-point white noise excitation signal through the LP synthesis filter derived earlier in this subclause. The synthesized comfort noise signal is then level adjusted to match the calculated smoothed high-band energy
. A smoothing period is set up for the first 5 frames after an active burst of more than 3 frames, provided that the core used in the last active frame is not the HQ core. Within the smoothing period, the synthesized comfort noise is not level adjusted to the calculated smoothed high-band energy
, but to an interpolated energy between
and the high-band log average energy calculated at the last active frame. The interpolated energy is calculated as
(1971)
where denotes the interpolated high-band energy of the
-th CN frame in the smoothing period,
denotes the high-band log average energy of the last active frame,
denotes the sine function. Finally, the level-adjusted synthesized high-band comfort noise is spectrally flipped to the bandwidth from 12.8 kHz to 14.4 kHz, as described in subclause 6.1.5.1.12. The resulting high-band synthesis signal is then added to the low-band synthesis signal to form the final SWB comfort noise synthesis signal.
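The energy tracking for WB SID and NO_DATA frames in SWB mode can be sketched as below. The structure follows the text, but several details are assumptions: energies are treated as log-domain values in dB, the weight used in the low-band smoothing is left as a caller-supplied parameter because its value is not reproduced here, the forgetting factor is assumed to multiply the previous smoothed value, and all function names are hypothetical.

```python
def update_smoothed_lowband_energy(current, prev_smoothed, weight_prev,
                                   step_threshold=12.0):
    """Return (smoothed_energy, step_flag) for the current frame."""
    # A deviation of more than 12 dB permits a step update.
    step_flag = abs(current - prev_smoothed) > step_threshold
    if step_flag:
        return current, True
    # Otherwise a weighted average of previous and current values
    # (weight_prev is an assumed, caller-supplied parameter).
    return weight_prev * prev_smoothed + (1.0 - weight_prev) * current, False


def extrapolate_highband_energy(smoothed_lowband, hb_to_lb_ratio):
    """High-band energy as the sum of two log-domain terms."""
    return smoothed_lowband + hb_to_lb_ratio


def smooth_highband_energy(current, prev_smoothed, step_flag,
                           first_frame_after_active):
    """Smooth the high-band energy with forgetting factor 0 or 0.75."""
    alpha = 0.0 if (step_flag or first_frame_after_active) else 0.75
    # Assumed form: alpha weights the previous smoothed value.
    return alpha * prev_smoothed + (1.0 - alpha) * current
```

With a forgetting factor of 0 the smoothed high-band energy jumps directly to the current value, which matches the intent of allowing an immediate update on an energy step or at the first frame after an active burst.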
6.7.2.2 Memory update
When an inactive signal frame is processed, the following updates are performed:
– MA memory of the ISF quantizer is set to zero;
– AR memory of the ISF quantizer is set to mean values (UC mode, WB case);
– phase dispersion memory is set to zero;
– synthesis excitation spectrum tilt is set to zero;
– noise enhancer memory is set to zero;
– class of last received good frame for FEC is set to UNVOICED_CLAS;
– floating point pitch for each subframe is set to the subframe length;
– the low-pass filtered pitch gain for FEC is set to zero;
– the filtered algebraic codebook gain for FEC is set to the square root of the smoothed quantized CNG excitation energy, , from subclause 6.7.2.1.3;
– the excitation buffer memory is updated;
– previous pitch gains are all set to zero;
– previous codebook gain is set to zero;
– active frame counter is set to zero;
– voicing factors used by the bandwidth extension are all set to 1;
– bass post-filter is turned off;
– synthesis filter memories are updated.