5.6.3 Encoding for FD-CNG
3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description, Release 15
To be able to produce an artificial noise resembling the actual input background noise in terms of spectro-temporal characteristics, the FD-CNG makes use of a noise estimation algorithm to track the energy of the background noise present at the encoder input. The noise estimates are then transmitted as parameters in the form of SID frames to update the amplitude of the random sequences generated in each frequency band at the decoder side during inactive phases. Note, however, that the noise estimation is carried out continuously on every frame, i.e., regardless of the speech activity. Therefore, it can deliver some meaningful information about the noise spectrum at any time, in particular at the very beginning of a speech pause.
The FD-CNG noise estimator relies on a hybrid spectral analysis approach. Low frequencies corresponding to the core bandwidth are covered by a high-resolution FFT analysis, whereas the remaining higher frequencies are captured by the CLDFB, which exhibits a significantly lower spectral resolution of 400 Hz.
The size of an SID frame is however very limited in practice. To reduce the number of parameters describing the background noise, the input energies are averaged among groups of spectral bands called partitions in the sequel.
5.6.3.1 Spectral partition energies
The partition energies are computed separately for the FFT and CLDFB bands. The energies corresponding to the FFT partitions and the energies corresponding to the CLDFB partitions are then concatenated into a single array, whose size is the total number of partitions, and which serves as input to the noise estimator described in subclause 5.6.3.2.
5.6.3.1.1 Computation of the FFT partition energies
Partition energies for the frequencies covering the core bandwidth are obtained as
, (1356)
where and
are the average energies in critical band
for the first and second analysis windows, respectively, as explained in subclause 5.1.5.2. The number of FFT partitions
depends on the sampling rate
of the input signal, as shown in Table 133. The de-emphasis spectral weights
are used to compensate for the high-pass filter described in subclause 5.1.4 and are defined as
(1357)
5.6.3.1.2 Computation of the CLDFB partition energies
The partition energies for frequencies above the core bandwidth are computed as
, (1358)
where and
are the indices of the first and last CLDFB bands in the i-th partition, respectively,
is the total energy of the j-th CLDFB band (see subclause 5.1.2.2), and
is a scaling factor computed in subclause 5.1.6.1. The constant 16 refers to the number of time slots in the CLDFB. The number of CLDFB partitions
depends on the configuration used, as described in the next subclause.
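As a rough illustration of the averaging described above, the following Python sketch computes one energy per partition from per-band CLDFB energies. Equation (1358) is not reproduced here, so the function name, the argument names and the multiplicative placement of the scaling factor are assumptions; only the division by the 16 time slots and the averaging over the bands of each partition follow the text.

```python
def cldfb_partition_energies(band_energies, first_band, last_band,
                             scale=1.0, num_slots=16):
    """Average CLDFB band energies over partitions (illustrative sketch).

    band_energies: total energy of each CLDFB band over one frame.
    first_band, last_band: inclusive band indices of each partition.
    scale: stand-in for the scaling factor of subclause 5.1.6.1
           (assumed here to be a plain multiplicative factor).
    num_slots: number of CLDFB time slots per frame (16 in the text).
    """
    energies = []
    for lo, hi in zip(first_band, last_band):
        width = hi - lo + 1                      # number of bands in the partition
        total = sum(band_energies[lo:hi + 1])    # energy summed over the partition
        energies.append(scale * total / (num_slots * width))
    return energies
```

For example, eight bands with a total energy of 16 each, grouped into partitions covering bands 0 to 1 and 2 to 7, yield a partition energy of 1 in both partitions.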
5.6.3.1.3 FD-CNG configurations
The following table lists the number of partitions and their upper boundaries for the different FD-CNG configurations at the encoder, as a function of the input sampling rate.
Table 152: Configurations of the FD-CNG noise estimation at the encoder
| Input sampling rate (kHz) | Number of FFT partitions | Number of CLDFB partitions | Upper boundaries of the FFT partitions (Hz) | Upper boundaries of the CLDFB partitions (Hz) |
| 8 | 17 | 0 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700 | |
| 16 | 20 | 1 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6350 | 8000 |
| 32/48 | 20 | 4 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6350 | 8000, 10000, 12000, 16000 |
For each partition,
corresponds to the frequency of the last band in the i-th partition. The indices
and
of the first and last bands in each spectral partition can be derived as a function of the common processing’s sampling rate 12.8 kHz and FFT size 256 (see subclause 5.1):
, ()
, ()
where the frequency of the first band in the first spectral partition is 50 Hz. Hence the FD-CNG generates comfort noise above 50 Hz only.
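The mapping from partition boundaries to FFT bin indices can be sketched as follows. The two equations above are not reproduced in this text, so this is an assumed reading: bins are spaced by the sampling rate divided by the FFT size (12800 / 256 = 50 Hz), the first partition starts at the 50 Hz bin, and each later partition starts one bin above the last bin of the previous one. The function and argument names are hypothetical.

```python
def partition_bin_indices(upper_freqs_hz, first_freq_hz=50.0,
                          fs_hz=12800.0, fft_size=256):
    """Return (first, last) FFT bin indices for each spectral partition.

    Assumes contiguous partitions; this is a sketch, not the normative rule.
    """
    bin_width = fs_hz / fft_size               # 50 Hz for 12.8 kHz and 256 points
    first, last = [], []
    start = round(first_freq_hz / bin_width)   # bin of the first band
    for f_max in upper_freqs_hz:
        end = round(f_max / bin_width)         # bin of the last band in the partition
        first.append(start)
        last.append(end)
        start = end + 1                        # next partition starts one bin higher
    return first, last
```

With upper boundaries of 100, 200 and 300 Hz, this gives first indices 1, 3, 5 and last indices 2, 4, 6, so bin 0 (DC) is never used.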
5.6.3.2 FD-CNG noise estimation
The FD-CNG relies on a noise estimator to track the energy of the background noise present in the input spectrum. This is mostly based on the minimum statistics algorithm [R. Martin, Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics, 2001].
However, to reduce the dynamic range of the input energies and hence facilitate the fixed-point implementation of the noise estimation algorithm, a non-linear transform is applied before noise estimation (see subclause 5.6.3.2.1). The inverse transform is then used on the resulting noise estimates to recover the original dynamic range (see subclause 5.6.3.2.3). The resulting noise estimates are used in subclause 5.6.3.5 to encode the SID frames.
5.6.3.2.1 Dynamic range compression for the input energies
The input energies are processed by a non-linear function and quantized with 9-bit resolution as follows:
. ()
The log2 function is used because (int)log2 can usually be calculated very quickly (in one cycle) on fixed-point processors using the “norm” function, which determines the number of leading zeros in a fixed-point number.
A constant 1 is added inside the log2 function to ensure that the converted energies remain positive. This is especially important because the noise estimator relies on a statistical model of the noise energy. Performing noise estimation on negative values would strongly violate the model and can result in unexpected behaviour.
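A minimal Python sketch of such a compression is given below, together with the matching expansion of subclause 5.6.3.2.3. It assumes that "9-bit resolution" means 9 fractional bits and that quantization rounds downwards; both points, as well as the function names, are assumptions, since the equation itself is not reproduced here.

```python
import math

def compress_energy(energy, frac_bits=9):
    """Map a non-negative energy to log2(1 + energy), quantized to 2**-frac_bits."""
    step = 1 << frac_bits                       # 512 quantization steps per unit
    return math.floor(step * math.log2(1.0 + energy)) / step

def expand_energy(compressed):
    """Inverse mapping: recover an energy from its compressed value."""
    return 2.0 ** compressed - 1.0
```

An energy of 0 maps to 0 and an energy of 3 maps to 2, so compressed values never become negative. The round trip is exact only up to the quantization step, which corresponds to a relative error well below 1 % on the recovered energy.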
5.6.3.2.2 Noise tracking
The input energy corresponds to an instantaneous power for the i-th partition, referred to as the periodogram in the sequel. The minimum statistics algorithm relies on an optimally smoothed periodogram, which can be considered an estimate of the input power spectral density. From it, the algorithm derives an estimate of the noise power spectral density. As described in the sequel, some additional smoothing of this noise estimate is applied, yielding the smoothed noise estimate introduced in subclause 5.6.3.2.2.5.
5.6.3.2.2.1 Initialization phase
To correctly initialize the noise estimation algorithm, an initialization phase is used as long as the input energy of the first partition grows. Note that the initialization phase is also triggered when a reset of the noise estimation algorithm is judged necessary, as described in subclause 5.6.3.4.
The following applies for each partition during the initialization phase:
. ()
Moreover, the minimum statistics algorithm includes a bias compensation mechanism exploiting statistical moments and
of first and second orders, respectively. During the initialization phase, we have
and
.
It is also necessary to initialize several summed quantities. The total noise energies over the FFT and CLDFB partitions are computed during the initialization phase as
()
and
, ()
respectively. Note that the quantity corresponds to the size of the i-th partition. The total input energies
and
for the FFT and CLDFB partitions are initialized to
and
, respectively, and the total smoothed input energies
and
are initialized with
and
, respectively.
In subclause 5.6.3.2.2.4, some auxiliary arrays,
,
and
are also required to track minima in each spectral partition. They are all filled with the largest possible platform value during the initialization phase.
5.6.3.2.2.2 Optimal smoothing of the input energies
As mentioned earlier, the power spectral density estimate is computed iteratively as a smoothed version of the input energy
, i.e.,
(1365)
where the time-varying optimal smoothing parameter is computed separately for the FFT and CLDFB partitions:
(1366)
where
(1367)
(1368)
are some correction factors,
(1369)
(1370)
impose a lower limit on the optimal smoothing parameter, and
, (1371)
, (1372)
, (1373)
(1374)
denote some summed quantities.
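The sketch below follows the general shape of optimal smoothing in the minimum statistics method cited in subclause 5.6.3.2: a per-partition smoothing parameter that shrinks when the smoothed energy departs from the noise estimate, a global correction factor derived from summed quantities, and a lower limit. It is not a transcription of equations (1365) to (1374), which are not reproduced here. The function name, the constants 0.96, 0.7 and 0.3, and the fixed lower limit 0.05 are illustrative assumptions.

```python
def optimal_smoothing_step(psd_prev, noise_prev, energy, alpha_c_prev,
                           alpha_max=0.96, alpha_min=0.05):
    """One frame of smoothed-periodogram update for a group of partitions.

    psd_prev:   smoothed energies of the previous frame
    noise_prev: noise estimates of the previous frame
    energy:     input energies of the current frame
    Returns the new smoothed energies and the updated correction factor.
    """
    # Correction factor driven by summed smoothed energy vs. summed input energy.
    total_psd = sum(psd_prev)
    total_energy = sum(energy)
    ratio = total_psd / total_energy if total_energy > 0.0 else 1.0
    alpha_c_tilde = 1.0 / (1.0 + (ratio - 1.0) ** 2)
    alpha_c = 0.7 * alpha_c_prev + 0.3 * max(alpha_c_tilde, 0.7)

    psd = []
    for p, n, e in zip(psd_prev, noise_prev, energy):
        # Smooth less when the smoothed energy is far from the noise estimate.
        deviation = (p / n - 1.0) ** 2 if n > 0.0 else 0.0
        alpha = max(alpha_max * alpha_c / (1.0 + deviation), alpha_min)
        psd.append(alpha * p + (1.0 - alpha) * e)
    return psd, alpha_c
```

In a stationary situation (smoothed energy, noise estimate and input energy all equal) the smoothing parameter stays at its maximum and the output is unchanged; a sudden jump in the input lowers the parameter so that the smoothed energy reacts faster.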
5.6.3.2.2.3 Bias compensation
The minimum statistics algorithm essentially consists in tracking the minima of the smoothed periodogram for each partition i over time. However, this method delivers biased estimates and therefore requires the computation of a bias compensation factor, which depends on the variance of the smoothed periodogram.
For each spectral partition, we first estimate the first-order moment of
as
, (1375)
where is a smoothing parameter. The variance of
is then derived as follows:
. (1376)
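The moment and variance estimation can be sketched as below. Equations (1375) and (1376) are not reproduced here, so the sketch assumes the usual construction: first-order and second-order moments tracked by first-order recursive smoothing, and the variance taken as the second moment minus the squared first moment. The function name and the smoothing parameter value 0.8 are hypothetical.

```python
def update_moments(first_moment, second_moment, psd, beta=0.8):
    """Update recursive moment estimates of a smoothed energy for one partition.

    Returns the new first moment, second moment and variance estimate.
    """
    m1 = beta * first_moment + (1.0 - beta) * psd
    m2 = beta * second_moment + (1.0 - beta) * psd * psd
    variance = max(m2 - m1 * m1, 0.0)   # clamp small negative values from rounding
    return m1, m2, variance
```

A constant input leads to a variance estimate close to zero, whereas a fluctuating input leads to a clearly positive estimate, which in turn calls for a larger bias compensation.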
As shown in subclause 5.6.3.2.2.4, the minimum tracking uses in each partition i a window of sub-windows of length
each. Minima are in fact computed over the entire buffer of size
past frames, but also over the last sub-window. The bias compensation factor
for the total window length, and
for a sub-window are given for each partition
as
, (1377)
, (1378)
with
, (1379)
, (1380)
. (1381)
For the sake of robustness, the bias compensation factors are furthermore increased proportionally to the mean of and
among the spectral partitions. The correction factor is obtained for the FFT and CLDFB partitions as
, (1382)
, (1383)
with
, (1384)
(1385)
5.6.3.2.2.4 Minimum tracking
For the sake of simplicity, we provide a description of the minimum tracking algorithm for the FFT partitions only. The CLDFB partitions can be treated in the same way.
The bias compensation factor and the correction factor
computed in subclause 5.6.3.2.2.3 are used to obtain a more accurate estimate of the background noise energy in each partition. They are re-computed after each frame and tracking of the minimum
is carried out in each FFT partition
as
. When a new minimum
is found, a flag
is set to 1 and
is updated as
. Otherwise
is set to zero.
Note thatis set to the maximum possible platform value after processing the last frame of each sub-window, i.e., every
frames.
and
refer therefore to a minimum within the current sub-window for the partition i. In the last frame of the current sub-window, the current minimum
is stored into a buffer
collecting the minima found in the last
sub-windows. The buffer is used at the end of each sub-window to determine the overall minimum among all sub-windows
.
For a frame in between the first and last frames of the current sub-window (i.e., frames 2 to of the sub-window), the overall minimum
is updated as
, and a flag
is set to 1 if a new minimum was found, i.e., if
. The flag
is set to 0 after processing the last frame of each sub-window.
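The sub-window mechanism can be illustrated with the simplified Python sketch below for a single partition. It only covers the buffer of sub-window minima and the overall minimum; the flags, the special handling of the first frame of a sub-window and the bias compensation are left out. The class and attribute names are hypothetical, and `float("inf")` stands in for the largest possible platform value.

```python
class SubWindowMinimumTracker:
    """Track the minimum over a sliding window split into sub-windows."""

    def __init__(self, num_subwindows, subwindow_len):
        self.num_subwindows = num_subwindows
        self.subwindow_len = subwindow_len
        self.buffer = [float("inf")] * num_subwindows  # minima of past sub-windows
        self.write_pos = 0
        self.frame_in_subwindow = 0
        self.current_min = float("inf")   # minimum within the current sub-window
        self.overall_min = float("inf")   # minimum over the whole window

    def update(self, value):
        """Feed one value per frame and return the overall minimum."""
        self.current_min = min(self.current_min, value)
        self.overall_min = min(self.overall_min, value)
        self.frame_in_subwindow += 1
        if self.frame_in_subwindow == self.subwindow_len:
            # Last frame of the sub-window: store its minimum, then rescan the buffer.
            self.buffer[self.write_pos] = self.current_min
            self.write_pos = (self.write_pos + 1) % self.num_subwindows
            self.overall_min = min(self.buffer)
            self.current_min = float("inf")
            self.frame_in_subwindow = 0
        return self.overall_min
```

Because the overall minimum is recomputed from the buffer at the end of every sub-window, an old minimum is forgotten once its sub-window has been overwritten, which lets the estimate rise again when the noise level increases.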
To improve tracking of a time-varying noise, a local minimum among the current sub-window can replace the overall minimum
in the last frame of each sub-window provided that it yields only a moderate increase of
, and if the local minimum was not found in the first or last frame of the sub-window, i.e., if
and
. In this case,
replaces also all values in the buffer
. The search range
for the local minima (and hence the tolerated amount of increase compared to the current overall minimum) lies between 1.1 and 2 and increases as
decreases:
. (1386)
The noise estimate is updated to
after each frame, except for the first frame in each sub-window.
Furthermore, when the smoothed energy of the first spectral partition exceeds the instantaneous energy by a factor of more than 50 for at least two frames in a row, a sudden noise offset is assumed and the noise tracking is modified for all partitions
as:
, (1387)
, (1388)
, (1389)
and
, (1390)
, (1391)
. (1392)
5.6.3.2.2.5 Smoothing of the noise estimates
The main outputs of the noise tracker are the noise estimates. To obtain smoother transitions in the comfort noise, a first-order recursive filter is applied, i.e.
. (1393)
Furthermore, the input energy is averaged over the last 5 frames. This average is used to apply an upper limit on the smoothed noise estimate in each spectral partition.
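A minimal sketch of this post-processing is shown below. It assumes a generic first-order recursive filter with a hypothetical coefficient of 0.95, a zero initial state, and an upper limit equal to the plain average of the last 5 input energies; the coefficient of equation (1393) and the exact form of the limit are not reproduced here, and the class name is an assumption.

```python
from collections import deque

class NoiseEstimateSmoother:
    """Recursive smoothing of noise estimates with an input-energy ceiling."""

    def __init__(self, num_partitions, alpha=0.95, history=5):
        self.alpha = alpha
        self.smoothed = [0.0] * num_partitions
        # One short history of input energies per partition.
        self.recent = [deque(maxlen=history) for _ in range(num_partitions)]

    def update(self, noise_estimates, input_energies):
        for i, (n, e) in enumerate(zip(noise_estimates, input_energies)):
            self.recent[i].append(e)
            ceiling = sum(self.recent[i]) / len(self.recent[i])
            value = self.alpha * self.smoothed[i] + (1.0 - self.alpha) * n
            self.smoothed[i] = min(value, ceiling)   # never exceed the averaged input
        return list(self.smoothed)
```

The ceiling prevents the smoothed noise estimate from exceeding what was actually observed at the input over the last few frames.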
5.6.3.2.3 Dynamic range expansion for the estimated noise energies
The estimated noise energies are processed by a non-linear function to compensate for the dynamic range compression applied in subclause 5.6.3.2.1:
(1394)
5.6.3.3 Adjusting the first SID frame in FD-CNG
Before encoding the first SID frame of a CNG phase (i.e., an SID frame preceded by an active frame), an upper limit is applied to the noise estimates to minimize the risk of generating some noise bursts at the beginning of a CNG phase. To this end, the noise estimate
obtained during the previous inactive frame (i.e., one frame before the last active phase) is used in combination with the input energy of the current frame as follows:
, (1395)
where refers to the partition index,
, and num_active_frames corresponds to the length of the last active phase.
5.6.3.4 FD-CNG resetting mechanism
To signal the need to reset the FD-CNG, a flag is computed based on the detection of a fast increase in noise energy.
If the current total noise energy, as calculated in subclause 5.1.11.1, is larger than that of the previous frame, up to four difference values of the total noise energy over the previous frames are summed, provided that the noise energy increased consecutively in these frames.
If the encoder is out of the initialization phase, four or more consecutive frames with increasing noise energy were detected, and the sum of the last four differences is larger than 5, the reset flag is set to 1. In addition, if the input bandwidth of the current frame is larger than that of the previous frame, the reset flag is also set to 1.
If neither condition holds but the flag was set before, it takes nine more frames before the flag is set to zero.
When the flag is set to one, a reinitialization of the minimum statistics routine is triggered (subclause 5.6.3.2.2.1), and the selection of an SID or NO_DATA frame is forbidden.
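The decision logic above can be read as a small state machine, sketched below in Python. The class, its attribute names and the numeric encoding of the bandwidth (a larger number meaning a wider bandwidth) are assumptions. The hangover is interpreted as keeping the flag at 1 for nine further frames and clearing it on the following one, which is only one possible reading of the text.

```python
class ResetDetector:
    """Detect a fast noise increase or a bandwidth increase (illustrative sketch)."""

    def __init__(self, threshold=5.0, hangover=9):
        self.threshold = threshold
        self.hangover = hangover
        self.prev_energy = None
        self.prev_bandwidth = None
        self.increases = []   # differences over the current run of increasing frames
        self.countdown = 0
        self.flag = 0

    def update(self, total_noise_energy, bandwidth, in_init_phase):
        if self.prev_energy is not None and total_noise_energy > self.prev_energy:
            self.increases.append(total_noise_energy - self.prev_energy)
        else:
            self.increases = []   # the run of consecutive increases is broken
        fast_increase = (not in_init_phase
                         and len(self.increases) >= 4
                         and sum(self.increases[-4:]) > self.threshold)
        wider = (self.prev_bandwidth is not None
                 and bandwidth > self.prev_bandwidth)
        if fast_increase or wider:
            self.flag = 1
            self.countdown = self.hangover
        elif self.flag == 1:
            if self.countdown == 0:
                self.flag = 0
            else:
                self.countdown -= 1
        self.prev_energy = total_noise_energy
        self.prev_bandwidth = bandwidth
        return self.flag
```

A caller would reinitialize the noise tracker and suppress SID and NO_DATA frames for as long as `update` returns 1.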
5.6.3.5 Encoding SID frames in FD-CNG
The CN parameters to be encoded into an FD-CNG SID frame are the noise estimates. Their number depends on the bandwidth and the bitrate, as given in the table below.
Table : Number of CN parameters encoded in a FD-CNG SID frame
|
Bandwidth |
NB |
WB |
SWB |
|
|
Bit-rates [kbps] |
|
|
|
|
|
|
17 |
20 |
21 |
24 |
The noise estimates are first converted to dB
. ()
The noise estimates in dB are then normalized using
. ()
The normalized noise estimates in dB are then quantized using a Multi-Stage Vector Quantizer (MSVQ). The MSVQ has 6 stages, with 7 bits in the first stage and 6 bits in the other stages (37 bits in total). An M-best search algorithm is used, where M = 24 is the number of survivors of a stage that are searched in the next stage. Note that a single set of codebooks is used for all configurations. The vectors in the codebook have a length of 24, and they are simply truncated if
is less than 24.
The MSVQ decoder output is given by
, ()
where are the indices encoded in the bitstream and
is the
-th coefficient of the
-th vector in the codebook of stage
.
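The M-best search and the corresponding decoder can be sketched as follows. The sketch assumes an unweighted squared-error criterion and represents each stage codebook as a list of vectors; the function names are hypothetical, and the real codebooks are not reproduced here. Truncation of the codebook vectors to the number of encoded parameters is done by slicing.

```python
def msvq_search(target, codebooks, num_survivors=24):
    """Return (indices, reconstruction) found by an M-best multi-stage search."""
    dim = len(target)
    # Each survivor holds: (squared error, chosen indices, reconstruction so far).
    survivors = [(sum(t * t for t in target), [], [0.0] * dim)]
    for stage in codebooks:
        candidates = []
        for _, indices, recon in survivors:
            for k, vec in enumerate(stage):
                new_recon = [r + v for r, v in zip(recon, vec[:dim])]
                err = sum((t - r) ** 2 for t, r in zip(target, new_recon))
                candidates.append((err, indices + [k], new_recon))
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:num_survivors]   # keep the M best paths
    _, best_indices, best_recon = survivors[0]
    return best_indices, best_recon

def msvq_decode(indices, codebooks, dim):
    """Sum the selected (truncated) codebook vectors of all stages."""
    out = [0.0] * dim
    for k, stage in zip(indices, codebooks):
        out = [o + v for o, v in zip(out, stage[k][:dim])]
    return out
```

Keeping several survivors per stage lets the search recover from a first-stage choice that looks best in isolation but combines poorly with the later stages, at a cost that grows only linearly with M.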
A global gain is then computed
, ()
where the scale depends on the bandwidth and the bitrate, as described in the table below.
Table : FD-CNG SID global gain scale
| Bandwidth | Bit-rate [kbps] | Scale |
| NB | <=7.2 | -5.5 |
| NB | 8 | -5 |
| NB | 9.6 | -4 |
| NB | 13.2 | -3 |
| WB | <=7.2 | -5.5 |
| WB | 8 | -5 |
| WB | 9.6 | -1.55 |
| WB | 13.2 | -3 |
| WB | 16.4 | -0.6 |
| WB | 24.4 | -0.2 |
| SWB | 13.2 | -3 |
| SWB | 16.4 | -0.8 |
| SWB | 24.4 | -0.25 |
The global gain is then quantized on 7 bits using
, ()
producing the quantized global gain
. ()
The quantized noise estimates are then given by
. ()
Finally, the last band parameter is adjusted in case the encoded last band size is different from the decoded last band size.
5.6.3.6 FD-CNG local CNG synthesis
Noise estimates encoded in the SID frame are then used to locally generate CNG, in order to update the encoder memories. The following table lists the number of partitions used for generating CNG in the FD-CNG encoder, and their upper boundaries, as a function of bandwidths and bit-rates.
Table : Configurations of the FD-CNG local CNG synthesis
| Bandwidth | Bit-rates (kbps) | Number of FFT partitions | Number of CLDFB partitions | Upper boundaries of the FFT partitions (Hz) | Upper boundaries of the CLDFB partitions (Hz) |
| NB | | 17 | 0 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3975 | |
| WB | | 20 | 0 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375 | |
| WB | | 20 | 1 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375 | 8000 |
| WB | | 21 | 0 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375, 7975 | |
| SWB/FB | | 20 | 4 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375 | 8000, 10000, 12000, 14000 |
| SWB/FB | | 21 | 3 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375, 7975 | 10000, 12000, 16000 |
For each partition,
corresponds to the frequency of the last band in the i-th partition. The indices
and
of the first and last bands in each spectral partition can be derived as a function of the core sampling rate
and the FFT size
:
, (1403)
, (1404)
where the frequency of the first band in the first spectral partition is 50 Hz. Hence the FD-CNG generates comfort noise above 50 Hz only.
5.6.3.6.1 SID parameters interpolation
The SID parameters are interpolated using linear interpolation in the log domain, as described in subclause 6.7.3.1.2. The interpolated SID parameters are noted .
5.6.3.6.2 LPC estimation from the interpolated SID parameters
A set of LPC coefficients is estimated from the SID spectrum in order to update excitation and LPC related memories, as described in subclause 6.7.3.1.3. The LPC coefficients are noted .
5.6.3.6.3 FD-CNG encoder comfort noise generation
An FD-CNG time-domain signal is generated similarly to the time-domain CNG signal generated at the decoder side (see subclauses 6.7.3.3.2 and 6.7.3.3.3), except that the interpolated SID parameters are used to generate the noise in the frequency domain (instead of the noise levels
which are not available at the encoder side).
5.6.3.6.4 FD-CNG encoder memory update
The memory update is performed using the FD-CNG time-domain signal and the LPC coefficients, similarly to the decoder memory update (see subclause 6.7.3.3.4).
Additionally, the weighted signal domain memories are updated by filtering the FD-CNG time-domain signal through an LP analysis filter with a weighted version of the LPC coefficients.