5.6.1 Overview
3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description, Release 15
This subclause describes the discontinuous transmission (DTX) scheme and the comfort noise generation (CNG) algorithm. The DTX/CNG operation, which is enabled by a command-line option, reduces the transmission rate by simulating background noise during inactive signal periods. The regular DTX/CNG modes are supported for bit rates up to 24.4 kbps. For higher bit rates, the EVS codec supports a less aggressive DTX/CNG scheme that only switches to CNG for low input signal power.
The reduction of the transmission rate during inactive periods is achieved by coding a set of parameters referred to as comfort noise (CN) parameters. These parameters are used at the decoder to regenerate the background noise as faithfully as possible, preserving the spectral and temporal content of the background noise at the encoder. In the EVS codec, the CNG algorithm reproduces high-quality comfort noise by choosing between a linear prediction-domain based coding mode (LP-CNG) and a frequency-domain based coding mode (FD-CNG), according to the input characteristics. Each of the two coding modes uses a different set of CN parameters. In the LP-CNG mode, four CN parameters are analyzed and encoded: the low-band excitation energy, the low-band signal spectrum, the low-band excitation envelope and the high-band energy, where the high-band energy is only encoded for SWB/FB input. In the FD-CNG mode, the CN parameters consist of a global gain and spectral energies grouped in critical bands. Those parameters are encoded by a vector quantizer for transmission.
When the codec is operated with the DTX/CNG operation, the signal activity detector (SAD) analyses the input signal to determine whether it comprises an active or inactive signal (see SAD decision in subclause 6.2). Based on its analysis, the SAD generates a SAD flag whose state indicates whether the signal is active (SAD flag = 1) or merely background noise (SAD flag = 0). When the SAD flag is 1, the regular encoding and decoding process is performed, as in the default option. When the SAD flag is 0, DTX functions are run at the encoder that transmit either a silence insertion descriptor (SID) frame or a NO_DATA frame. The SID frame contains the CN parameters, which are used to update the statistics of the background noise at the decoder, whereas the NO_DATA frame is empty. The SID frame is always encoded using 48 bits, regardless of the CNG mode in operation.
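The frame-type decision described above can be sketched in C as follows. This is a minimal illustration only: the enum, the function names and the `sid_due` input are hypothetical and are not identifiers from the codec; how the DTX logic decides that a SID frame is due is described in subclause 5.6.1.1.

```c
#include <assert.h>

/* Hypothetical names for illustration; not taken from the EVS source. */
enum frame_type { FRAME_ACTIVE, FRAME_SID, FRAME_NO_DATA };

#define SID_FRAME_BITS 48 /* SID payload size, same for LP-CNG and FD-CNG */

/* sad_flag: 1 = active signal, 0 = background noise.
 * sid_due:  nonzero when the DTX logic requests a CN parameter update. */
enum frame_type dtx_frame_type(int sad_flag, int sid_due)
{
    assert(sad_flag == 0 || sad_flag == 1);
    if (sad_flag == 1)
        return FRAME_ACTIVE; /* regular encoding, as without DTX */
    return sid_due ? FRAME_SID : FRAME_NO_DATA;
}

/* Payload bits of a DTX frame type; returns -1 for active frames,
 * whose size depends on the selected bit rate. */
int dtx_frame_bits(enum frame_type t)
{
    switch (t) {
    case FRAME_SID:     return SID_FRAME_BITS;
    case FRAME_NO_DATA: return 0; /* NO_DATA frames are empty */
    default:            return -1;
    }
}
```

For example, `dtx_frame_type(0, 1)` yields `FRAME_SID`, and `dtx_frame_bits(FRAME_SID)` yields 48.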
Further, hangover logic, as described in subclause 6.2, is used to enhance the quality of SID frames. The hangover logic in the SAD algorithm is such that the encoder waits for a certain number of frames before switching from the active signal to inactive signal. If the background noise contains transients that force the encoder to switch from inactive signal to active signal and then back to inactive signal in a very short time period, no hangover is used.
5.6.1.1 SID update
The CN parameters are transmitted at a fixed or adaptive rate during inactive signal periods; the choice is made by a command-line parameter. By default, the CNG update interval is fixed at 8 frames. It can also be set to another fixed value or to a variable (adaptive) rate. The fixed interval is limited to between 3 and 100 frames. The adaptive interval depends on background noise characteristics such as the current signal-to-noise ratio (SNR) and is limited to between 8 and 50 frames. Generally, at a high SNR, SID frames are transmitted less often, which significantly reduces the average data rate at the cost of only minor quality degradation. At a low SNR, SID frames are transmitted more often so that the comfort noise remains as natural as possible. Thus, increasing SNR implies decreasing SID frame frequency, whereas decreasing SNR implies increasing SID frame frequency.
To determine the adaptive SID transmission rate, the SNR is calculated from two quantities: the long-term energy of the active signal and the long-term energy of the background noise. The DTX algorithm keeps both long-term values up to date to take into account possible changes in the level of the two respective signals, but in any given frame only one of the two energies is updated. If the current frame is classified as VOICED, the DTX module updates only the long-term energy of the active signal. Otherwise, it updates only the long-term energy of the background noise. The adaptive rate calculation is performed in every inactive signal frame after the preamble period. This period is characterized by at least 50 updates of both long-term energies.
The update of the long-term energy of an active signal is performed as follows:
(1313)
and the update of the long-term energy of an inactive signal as
()
where Ef = ||snr(n)|| is the energy of the denoised signal, snr(n), in the current frame. α represents a forgetting factor. Its value is based on the energy evolution. The value of α is set either to 0.99 for slow adaptation, or 0.90 for fast adaptation of the long-term energy level. In case of , fast adaptation is chosen if
in the current frame, otherwise slow adaptation is applied. In case of
, the fast adaptation is chosen if
in the current frame, otherwise slow adaptation is applied.
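The long-term energy tracking described above can be sketched in C as follows. The sketch assumes the usual first-order recursive form for a forgetting factor, new = α·old + (1 − α)·Ef; the exact form used by the codec is not reproduced here. The struct, the function names and the `fast` argument are hypothetical: the caller is expected to choose fast or slow adaptation according to the conditions given in the text. The interpretation that the preamble ends once both energies have received at least 50 updates is also an assumption.

```c
#include <assert.h>

#define ALPHA_SLOW 0.99f /* forgetting factor for slow adaptation */
#define ALPHA_FAST 0.90f /* forgetting factor for fast adaptation */

/* Hypothetical state; names are illustrative only. */
struct dtx_energy_state {
    float lt_active;        /* long-term energy of the active signal */
    float lt_noise;         /* long-term energy of the background noise */
    int   n_active_updates; /* number of updates of lt_active so far */
    int   n_noise_updates;  /* number of updates of lt_noise so far */
};

/* Assumed recursion: new = alpha * old + (1 - alpha) * frame_energy. */
float update_long_term_energy(float lt_energy, float frame_energy, int fast)
{
    float alpha = fast ? ALPHA_FAST : ALPHA_SLOW;
    assert(lt_energy >= 0.0f && frame_energy >= 0.0f);
    return alpha * lt_energy + (1.0f - alpha) * frame_energy;
}

/* Update exactly one of the two long-term energies per frame:
 * the active-signal energy for VOICED frames, the noise energy otherwise. */
void dtx_update_energies(struct dtx_energy_state *st, float ef,
                         int is_voiced, int fast)
{
    if (is_voiced) {
        st->lt_active = update_long_term_energy(st->lt_active, ef, fast);
        st->n_active_updates++;
    } else {
        st->lt_noise = update_long_term_energy(st->lt_noise, ef, fast);
        st->n_noise_updates++;
    }
}

/* Assumed reading of the preamble rule: at least 50 updates of both. */
int dtx_preamble_done(const struct dtx_energy_state *st)
{
    return st->n_active_updates >= 50 && st->n_noise_updates >= 50;
}
```

With slow adaptation, a single frame moves the long-term value by only 1 % of the distance to the frame energy, whereas fast adaptation moves it by 10 %.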
Having estimated the two long-term energies, the SNR value in the logarithmic domain is calculated in every inactive signal frame as
()
The SID transmission rate, rSID, is finally adapted based on the current SNR value. The value of rSID is linearly varied between a minimum value, rMIN, that corresponds to a minimum SNR value, SNRMIN, and a maximum value, rMAX, that corresponds to a maximum SNR value, SNRMAX. That is
rSID = rMIN + (SNR − SNRMIN) · (rMAX − rMIN) / (SNRMAX − SNRMIN)
where rMIN ≤ rSID ≤ rMAX. The values rMIN, SNRMIN, rMAX and SNRMAX are selected as follows:
rMIN = 8, SNRMIN = 36 dB, rMAX = 50, SNRMAX = 51 dB.
Thus, the adaptive rate is limited to between 8 and 50 frames and is updated in every inactive signal frame. If the number of consecutive NO_DATA frames is equal to or greater than the current value of rSID, the next inactive signal frame is denoted as a SID frame. There is one exception to this rule.
- SID frame sent due to detection of abrupt changes in the spectral characteristics of background noise, as described in Section 5.6.1.2.
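The adaptive rate and the SID scheduling rule can be sketched in C as follows. The mapping is the straight line through (SNRMIN, rMIN) and (SNRMAX, rMAX), clamped to the range from rMIN to rMAX. The function names, the rounding to the nearest integer and the `tilt_triggered` flag (standing for the exception of subclause 5.6.1.2) are assumptions made for illustration.

```c
#include <assert.h>

#define R_MIN      8     /* minimum SID interval in frames */
#define R_MAX      50    /* maximum SID interval in frames */
#define SNR_MIN_DB 36.0f /* SNR mapped to R_MIN */
#define SNR_MAX_DB 51.0f /* SNR mapped to R_MAX */

/* Linear mapping of the SNR (in dB) to the SID interval, clamped to
 * [R_MIN, R_MAX]. Rounding to the nearest integer is an assumption. */
int adaptive_sid_interval(float snr_db)
{
    float r;
    if (snr_db <= SNR_MIN_DB)
        return R_MIN;
    if (snr_db >= SNR_MAX_DB)
        return R_MAX;
    r = R_MIN + (snr_db - SNR_MIN_DB) * (R_MAX - R_MIN)
                / (SNR_MAX_DB - SNR_MIN_DB);
    return (int)(r + 0.5f);
}

/* Returns nonzero if the next inactive frame should be a SID frame.
 * tilt_triggered models the spectral-tilt exception (hypothetical flag). */
int sid_frame_due(int consecutive_no_data, int r_sid, int tilt_triggered)
{
    assert(r_sid >= R_MIN && r_sid <= R_MAX);
    return tilt_triggered || consecutive_no_data >= r_sid;
}
```

For instance, an SNR of 43.5 dB lies halfway between the two limits and maps to an interval of 29 frames.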
After the rate rSID is determined, a variation of the long-term energy of the inactive signal is calculated and subjected to a fixed threshold. This is performed in every NO_DATA frame after the preamble period. That is, if the following variation holds
()
the long-term energy of the inactive signal is reset to its SID-frame value, that is, the long-term energy of the inactive signal as updated in every SID frame.
5.6.1.2 Spectral tilt based SID transmission
An adaptive SID update rate that relies only on fluctuations in SNR, as described in Section 5.6.1.1, may sometimes fail to detect significant changes in the background noise characteristics. In some cases, inactive frames that are perceptually different have similar energy characteristics (typically encoded as gain values in the SID frames). For example, background noise in a street (street noise) may have an energy distribution over time that is similar to that of background noise in a crowded space (babble noise), yet these two types of noise are usually perceived very differently by a listener. Spectral tilt is a good measure to capture such changes in the background noise characteristics.
It would be beneficial for a DTX/CNG scheme to track such perceptual changes in the background noise, apart from tracking the SNR as described in Section 5.6.1.1. Hence a scheme to detect a sudden change in spectral tilt of the background noise and trigger a new SID frame indicating the parameters of the new background noise is employed in the encoder.
To ensure that an active talk spurt has no effect on the computation of the spectral tilt of the background noise, this computation is performed in every inactive frame after 5 consecutive inactive frames that immediately follow an active talk spurt.
Linear prediction coefficients (LPCs) are derived by performing linear prediction analysis (Section 5.1.5.1 – Section 5.1.5.3) on the current input frame, which contains background noise and no active speech. The LPCs are then converted to reflection coefficients (RCs) using a backward Levinson-Durbin recursion, as follows. For a given Nth order LPC vector, the Nth reflection coefficient value is derived using the formula
, it is then possible to calculate the lower order LPC vectors
using the following recursion
()
which yields the reflection coefficient vector . The spectral tilt of background noise is indicated by the first reflection coefficient
. A smoothened running average of the spectral tilt of background noise in Kth inactive frame
is computed using a first order IIR filter as follows.
()
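The conversion from LPCs to reflection coefficients can be illustrated with the textbook step-down (backward Levinson-Durbin) recursion. The sketch below assumes the convention A(z) = 1 + Σ a_i z^(-i) with k_m = a_m^(m); the sign convention used in the specification may differ, so the sign of the resulting coefficients should be treated as an assumption. The function name and `MAX_ORDER` are hypothetical.

```c
#include <assert.h>

#define MAX_ORDER 16 /* illustrative upper bound on the LPC order */

/* Step-down recursion: converts a[0..order-1] (holding a_1..a_order,
 * without the leading 1) into reflection coefficients k[0..order-1].
 * Returns 0 on success, -1 for an invalid order or an unstable filter. */
int lpc_to_rc(const float *a, int order, float *k)
{
    float cur[MAX_ORDER], next[MAX_ORDER];
    int m, i;

    assert(a != 0 && k != 0);
    if (order < 1 || order > MAX_ORDER)
        return -1;

    for (i = 0; i < order; i++)
        cur[i] = a[i]; /* cur[i] holds a_{i+1} of the current order m */

    for (m = order; m >= 1; m--) {
        float km = cur[m - 1];          /* k_m = a_m^(m) */
        float denom = 1.0f - km * km;
        k[m - 1] = km;
        if (denom <= 0.0f)
            return -1;                  /* |k_m| >= 1: not minimum phase */
        /* a_i^(m-1) = (a_i^(m) - k_m * a_{m-i}^(m)) / (1 - k_m^2) */
        for (i = 0; i < m - 1; i++)
            next[i] = (cur[i] - km * cur[m - 2 - i]) / denom;
        for (i = 0; i < m - 1; i++)
            cur[i] = next[i];
    }
    return 0;
}
```

After the call, `k[0]` holds the first reflection coefficient, which is the quantity used as the spectral tilt indicator.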
The running average differences from frame to frame are accumulated during each successive inactive frame K as follows:
()
The absolute value of this delta-sum parameter is compared against a fixed threshold of 0.2. If this threshold is exceeded during an inactive frame, indicating a change in the spectral tilt characteristics of the background noise, and if the number of consecutive NO_DATA frames is equal to or greater than 8, a new SID frame is transmitted regardless of the current value of the adaptive SID rate parameter rSID. At this point, parameters
and
are also reset to zero to start a fresh computation of spectral tilt of the background noise that follows this SID frame.
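The spectral-tilt trigger can be sketched in C as follows. Several details are assumptions: the smoothing coefficient of the first-order IIR filter is passed in by the caller as `smooth` because its value is not given here, the filter is written in the common form avg = smooth·avg_prev + (1 − smooth)·k1, and the two parameters that are reset are taken to be the running average and the delta sum. All identifiers are hypothetical.

```c
#include <assert.h>

#define TILT_DELTA_THRESHOLD 0.2f /* threshold on |delta sum| */
#define TILT_GUARD_FRAMES    5    /* inactive frames skipped after speech */
#define TILT_MIN_NO_DATA     8    /* minimum consecutive NO_DATA frames */

/* Hypothetical tracker state. */
struct tilt_tracker {
    float avg;          /* smoothed spectral tilt (first RC) */
    float delta_sum;    /* accumulated frame-to-frame change of avg */
    int   inactive_run; /* inactive frames seen since the last active one */
};

/* Call for every active frame: restarts the guard period. */
void tilt_on_active_frame(struct tilt_tracker *t)
{
    t->inactive_run = 0;
}

/* Call for every inactive frame with the first reflection coefficient k1.
 * Returns 1 if a SID frame should be forced, 0 otherwise. */
int tilt_on_inactive_frame(struct tilt_tracker *t, float k1, float smooth,
                           int consecutive_no_data)
{
    float prev, mag;

    assert(smooth >= 0.0f && smooth < 1.0f);
    if (t->inactive_run < TILT_GUARD_FRAMES) {
        t->inactive_run++; /* still too close to the talk spurt */
        return 0;
    }
    prev = t->avg;
    t->avg = smooth * prev + (1.0f - smooth) * k1; /* assumed IIR form */
    t->delta_sum += t->avg - prev;
    mag = (t->delta_sum < 0.0f) ? -t->delta_sum : t->delta_sum;

    if (mag > TILT_DELTA_THRESHOLD &&
        consecutive_no_data >= TILT_MIN_NO_DATA) {
        t->avg = 0.0f;       /* assumed reset of both parameters */
        t->delta_sum = 0.0f;
        return 1;
    }
    return 0;
}
```

Note that the accumulated change is not cleared when the NO_DATA count is still below 8, so a tilt change detected early can still trigger a SID frame a few frames later.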
5.6.1.3 CNG selector
The CNG selector chooses one of the two CNG modes (FD_CNG or LP_CNG) for generating comfort noise. In case AMR-WB IO mode is used, LP-CNG is always selected. Otherwise, the decision is based on the energy ratio between a high and a low frequency range of the background noise signal and the bandwidth of the signal.
Noise energy estimates for the low frequency range up to 1270 Hz and for the two highest critical bands are estimated by , ()
, ()
except for narrowband signals, where the lowest band is ignored and the highest bands are lower:
, ()
. ()
is the background noise energy per critical band as described in subclause 5.1.11.1. Both values
and
are used to calculate the spectral tilt of the background noise energies and update the memory for
:
()
Depending on the previous CNG mode and the input signal bandwidth (input_bwidth) as detected by the bandwidth detection module (subclause 5.1.6), the CNG mode is changed if the current frame is active, at least the past 20 frames were active, and one of the following cases applies:
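/* tilt denotes the spectral tilt of the background noise energies
   defined above (identifier chosen for illustration) */
if (cng_mode == LP_CNG &&
    ((input_bwidth == NB && tilt > 9.f) ||
     (input_bwidth > NB && tilt > 45.f)))
{
    cng_mode = FD_CNG;
}
else if (cng_mode == FD_CNG &&
         ((input_bwidth == NB && tilt < 2.f) ||
          (input_bwidth > NB && tilt < 10.f)))
{
    cng_mode = LP_CNG;
}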
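The selector logic, including the activity condition stated before the listing, can be wrapped into a self-contained C function as sketched below. The enum definitions, the ordering NB < WB < SWB < FB, the name `tilt` for the spectral tilt of the background noise energies and the two activity arguments are assumptions for illustration. The AMR-WB IO case, in which LP-CNG is always selected, is left to the caller.

```c
#include <assert.h>

/* Hypothetical enums; the ordering NB < WB < SWB < FB is assumed so that
 * "bw > NB" means any bandwidth wider than narrowband. */
enum cng_mode  { LP_CNG, FD_CNG };
enum bandwidth { NB, WB, SWB, FB };

#define MIN_PAST_ACTIVE_FRAMES 20

/* Returns the CNG mode to use from now on.
 * current_active:     nonzero if the current frame is active.
 * past_active_frames: number of consecutive active frames immediately
 *                     preceding the current frame. */
enum cng_mode select_cng_mode(enum cng_mode prev, enum bandwidth bw,
                              float tilt, int current_active,
                              int past_active_frames)
{
    assert(past_active_frames >= 0);

    /* The mode may only change during a sufficiently long active run. */
    if (!current_active || past_active_frames < MIN_PAST_ACTIVE_FRAMES)
        return prev;

    if (prev == LP_CNG &&
        ((bw == NB && tilt > 9.f) || (bw > NB && tilt > 45.f)))
        return FD_CNG;

    if (prev == FD_CNG &&
        ((bw == NB && tilt < 2.f) || (bw > NB && tilt < 10.f)))
        return LP_CNG;

    return prev; /* inside the hysteresis band: keep the current mode */
}
```

The two pairs of thresholds form a hysteresis: for narrowband input, for example, a tilt between 2 and 9 leaves the current mode unchanged, which avoids frequent toggling between LP-CNG and FD-CNG.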