6.3 Switching coding modes in decoding
3GPP TS 26.445: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description (Release 15)
6.3.1 General description
This clause describes all transitions between coding modes in the decoding process, including changes of sampling rate, bit rate and audio bandwidth. The transitions between CELP coding mode and MDCT coding mode within the same bit rate and audio bandwidth are described in 6.3.2 and 6.3.3.
The handling of sample rate changes within the CELP or LP-based coding mode and MDCT-based TCX mode is described in 6.3.4.
The switching between primary and AMR-WB IO modes is described in 6.3.5.
The handling of transitions in the context of bit rate switching is described in 6.3.6.
Finally, the transitions between NB, WB, SWB and FB are described in 6.3.7.
6.3.2 MDCT coding mode to CELP coding mode
When a CELP encoded frame is preceded by an MDCT-based encoded frame, the CELP memories have to be updated before starting the decoding of the CELP frame, similarly to the encoder case (see clause 5.4.2).
Additionally, cross-fading is applied in the time domain at the output sampling rate to avoid any discontinuity between the MDCT-based output and the CELP-based output (including bandwidth extension).
The CELP memory update and the cross-fading depend on the bit rate and on the previous coding mode. In general, three different MDCT to CELP transitions (MC1 to MC3) are supported. The table in clause 5.4.2 indicates which transition mode is used for a specific configuration. The different decoding cases are described in detail in the following sub-clauses.
6.3.2.1 MDCT to CELP transition 1 (MC1)
MC1 is used when the previous frame was decoded with HQ MDCT and the current frame is decoded with CELP. The CELP state variables are reset in the current frame to predetermined (fixed) values. In particular, the following memories are reset to 0 in the CELP decoder:
- Resampling memories of the CELP synthesis
- Pre-emphasis and de-emphasis memories
- LPC synthesis memories
- Past excitation (adaptive codebook memory)
- Bass post-filter memories
The old LPC coefficients and associated representations (LSP, LSF) and the CELP gain quantization memories are reset to predetermined (fixed) values. The CELP decoder in the current frame is forced to operate in Transition Coding (TC) mode, i.e. without using an adaptive codebook from the previous frame. Since the LPC coefficients from the previous frame are not available, only one set of LPC coefficients, corresponding to the end of the frame, is decoded and used for all subframes in the current frame.
To avoid discontinuities at the output sampling rate between the decoded HQ MDCT signal in the previous frame and the decoded CELP signal (including time-domain BWE) in the current frame, cross-fading (i.e. overlap-add) is used.
The decoded HQ MDCT signal could be windowed to compensate for the MDCT synthesis window; however, in practice this compensation is not done. Note that the synthesis of the CELP decoder is more delayed than the synthesis of the MDCT decoder, as shown in Figure 112a; the time difference is denoted here D. In the current CELP frame, the first D samples are replaced by the HQ MDCT synthesis from the previous frame. The unweighted HQ MDCT aliased memory is then overlap-added with the CELP output.
Figure 112a: Overlap-add in MDCT to CELP transition.
6.3.2.2 MDCT to CELP transition 2 (MC2)
As described in clause 6.2.4.5, the MDCT-based TCX decoder generates two time-domain signals, one at the output sampling rate and one at the CELP sampling rate. The signal at the CELP sampling rate is used to update the CELP memories, similarly to the encoder case (see clause 5.4.2.2). At the decoder side, it is also used to update the memories of the CLDFB-based resampler that resamples the decoded CELP signal. The signal at the output sampling rate is the signal actually sent to the decoder output. To avoid discontinuities between this MDCT-based TCX signal and the decoded CELP signal at the output sampling rate, cross-fading is used.
The decoded CELP signal at the output sampling rate is obtained after CELP decoding at the CELP sampling rate, CLDFB based resampling and time-domain bandwidth extension decoding (see clause 6.1). Both the CLDFB based resampling and the time-domain bandwidth extension decoding introduce a delay. Consequently, the decoded CELP frame is delayed compared to the MDCT based TCX frame and an overlap between the two frames is then introduced. This overlap is used to perform a cross-fade and thus to avoid any discontinuities between the two frames. The output signal is given by
()
with is the output signal,
is the MDCT based TCX signal,
is the CELP signal,
is the delay of the TCX LTP postfilter (0.25 ms),
is the delay introduced by the CLDFB resampler and the time-domain BWE (1.25 ms if
and 2.3125 ms otherwise), and
is the frame length at the output sampling rate (20 ms).
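As an illustration only, the delays listed above can be converted to samples at the output sampling rate, and a cross-fade can be sketched as a linear ramp between the two signals. This is a minimal Python sketch, not the normative procedure: the helper names `ms_to_samples` and `linear_crossfade` are hypothetical, and both the linear ramp shape and the region boundaries `start`/`stop` are assumptions, since the exact weighting is defined by the output-signal equation of this clause.

```python
def ms_to_samples(ms, fs_hz):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(ms * fs_hz / 1000.0)


def linear_crossfade(tcx, celp, start, stop):
    """Hypothetical cross-fade between two time-aligned signals.

    Samples before `start` come from the TCX signal, samples from `stop` on
    come from the CELP signal, and a linear ramp is used in between.
    The linear ramp and the region boundaries are assumptions.
    """
    out = []
    for n in range(len(celp)):
        if n < start:
            out.append(tcx[n])
        elif n >= stop:
            out.append(celp[n])
        else:
            g = (n - start) / (stop - start)  # CELP weight rises from 0 towards 1
            out.append((1.0 - g) * tcx[n] + g * celp[n])
    return out


# Delays from the list above, expressed at a 48 kHz output sampling rate.
fs = 48000
d_ltp = ms_to_samples(0.25, fs)      # TCX LTP postfilter delay
d_short = ms_to_samples(1.25, fs)    # CLDFB resampler + BWE delay (first case)
d_long = ms_to_samples(2.3125, fs)   # CLDFB resampler + BWE delay (otherwise)
frame = ms_to_samples(20.0, fs)      # frame length
```

At 48 kHz these evaluate to 12, 60, 111 and 960 samples respectively; all four durations map to whole numbers of samples at 16, 32 and 48 kHz.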
6.3.2.3 MDCT to CELP transition 3 (MC3)
Similarly to MC1, the CELP memories are either reset or extrapolated, in the same way as in the encoder (see clause 5.4.2.3).
To avoid discontinuities between the decoded MDCT-based TCX signal and the decoded CELP signal (including time-domain BWE) at the output sampling rate, cross-fading is used, following the same approach as in clause 6.3.2.1.
6.3.3 CELP coding mode to MDCT coding mode
When an MDCT encoded frame is preceded by a CELP encoded frame, the beginning of the MDCT encoded frame cannot be reconstructed properly, because the aliasing it contains would normally be cancelled by the (missing) previous MDCT encoded frame. As already described in clause 5.4.3, two approaches are used to solve this problem, depending on the MDCT-based coding mode (either MDCT-based TCX or HQ MDCT). The decoder side of these two approaches is described in detail in the following sub-clauses.
6.3.3.1 CELP coding mode to MDCT based TCX coding mode
If MDCT based TCX is used in the current frame and if the previous frame was encoded with CELP, the MDCT based TCX frame length is increased and the left part of the MDCT window is modified, as already described in clause 5.4.3.1.
At the decoder side, the MDCT coefficients are decoded as described in clause 6.2.2. The decoded MDCT coefficients are then transformed back to the time domain as described in clause 6.2.4. Two inverse transforms are performed and two time-domain signals are generated, as described in clause 6.2.4.5. One signal is generated at the CELP sampling rate and follows the previous decoded CELP signal at the CELP sampling rate. The other signal is generated at the output sampling rate and follows the previous decoded CELP signal at the output sampling rate (after CLDFB resampling and time-domain BWE). The decoding of the CELP signal (including the time-domain BWE) is described in clause 6.1.
To avoid any discontinuities that could be introduced in the decoded time-domain signal at the border between the two frames, a smoothing mechanism is applied. This algorithm is described in detail in the following.
Let us first define the following notation. The frame length at the CELP sampling rate is noted.
with
. The decoded MDCT signal (including the windowed portion overlapping with the previous CELP frame) is noted
with
,
is the length of the segment where both the MDCT signal and the CELP signal overlap (it is also equal to the length of the sine window used in the left part of the transition window as described in clause 5.4.3.1), and
is the CELP sampling rate. The LPC synthesis filter used in the last subframe of the previous CELP frame is noted
with
and
is the LPC filter.
A modified CELP signal is first computed by performing an overlap-add operation with the decoded MDCT signal on the overlap region and by artificially compensating the aliasing introduced by the decoded MDCT signal using the decoded CELP signal. The modified CELP signal is defined as follows:
()
with is the sine window used in the left part of the transition window as described in clause 5.4.3.1. Unlike the non-modified CELP signal, the modified CELP signal shows significantly reduced discontinuities with the following MDCT signal (they are even completely suppressed in most cases), thanks to the overlap-add operation. However, the modified CELP signal cannot be used directly to generate the decoder output signal of the current frame, because this would introduce an additional decoder delay equal to the overlap length. Instead, the modified CELP signal is only used to generate a zero-input response (ZIR) of the LPC synthesis filter. This ZIR is then used to modify the decoded MDCT signal, significantly reducing (or, in most cases, removing) the possible discontinuity without introducing any additional decoder delay. The ZIR
of the LPC synthesis filter
is generated by first computing the memory of the LPC synthesis filter as follows:
()
and then computing the zero-input response as follows:
()
with is the number of generated ZIR samples. The ZIR is then windowed such that its amplitude always decreases to 0, producing the windowed ZIR
()
Finally, the windowed ZIR is added to the beginning portion of the decoded MDCT signal, corresponding to the time samples .
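The ZIR step can be illustrated with a minimal Python sketch. It assumes the LPC convention A(z) = 1 + sum_{k=1..M} a_k z^-k, so that the synthesis filter computes y(n) = x(n) - sum_k a_k y(n-k); it takes the filter memory as an input (whatever the memory equation above yields, which this sketch does not reproduce); and it uses a linear ramp as the window that decays to 0. The ramp shape and the function names `lpc_zir` and `smooth_mdct_start` are assumptions, not the normative window.

```python
def lpc_zir(a, memory, n_zir):
    """Zero-input response of the synthesis filter 1/A(z).

    Assumed convention: A(z) = 1 + sum_{k=1..M} a[k-1] * z^-k.
    `memory` holds at least M past filter outputs, oldest first.
    """
    m = len(a)
    state = list(memory[-m:])
    zir = []
    for _ in range(n_zir):
        # Zero input: each output is only the feedback of past outputs.
        y = -sum(a[k] * state[-1 - k] for k in range(m))
        zir.append(y)
        state.append(y)
    return zir


def smooth_mdct_start(mdct, memory, a, n_zir):
    """Add a windowed ZIR to the first n_zir samples of the MDCT signal.

    The linear ramp (n_zir - n) / n_zir is an assumed window whose
    amplitude decreases towards 0, as required by the text.
    """
    zir = lpc_zir(a, memory, n_zir)
    out = list(mdct)
    for n in range(n_zir):
        out[n] += zir[n] * (n_zir - n) / n_zir
    return out
```

For a first-order filter with a = [-0.5] and memory [1.0], the ZIR is 0.5, 0.25, 0.125, 0.0625, and the ramp attenuates it so that the correction vanishes smoothly inside the MDCT frame.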
The same smoothing mechanism is applied to the decoded signals at the output sampling rate, with the exception that the ZIR is not re-computed but obtained by resampling the ZIR computed at the CELP sampling rate. The resampling is performed using linear interpolation as described in clause 5.4.4.4. The resampled ZIR is then added to the decoded MDCT signal at the output sampling rate.
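For the output-rate branch, the ZIR only needs to be resampled. The following minimal Python sketch is a generic linear-interpolation resampler; it assumes that the first input and output samples are time-aligned and that the last input value is held beyond the final input sample, so its alignment details may differ from the interpolator of clause 5.4.4.4. The function name `linear_resample` is hypothetical.

```python
def linear_resample(x, fs_in, fs_out):
    """Resample x from fs_in to fs_out by linear interpolation.

    Assumes the first input and output samples are time-aligned; output
    positions beyond the last input sample hold the last input value.
    """
    n_out = len(x) * fs_out // fs_in
    step = fs_in / fs_out  # distance between output samples, in input samples
    out = []
    for k in range(n_out):
        t = k * step
        i = int(t)
        frac = t - i
        if i + 1 < len(x):
            out.append((1.0 - frac) * x[i] + frac * x[i + 1])
        else:
            out.append(float(x[-1]))
    return out
```

For example, a 64-sample ZIR at 12.8 kHz becomes 160 samples at a 32 kHz output rate.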
The smoothing mechanism described above ensures a smooth transition between the CELP and MDCT signals at the output sampling rate, but only in the CELP bandwidth. Due to the delay introduced by the time-domain BWE, a gap appears in the high frequency region, as explained in clause 6.1.5.1.13.1. To fill this gap, a transition signal is generated as described in clause 6.1.5.1.13.1. This transition signal is long enough to cover not only the gap but also an additional signal portion following the gap. This additional portion is then used to perform a cross-fade with the decoded MDCT signal, ensuring a smooth transition at the output sampling rate.
6.3.3.2 CELP coding mode to HQ MDCT coding mode
When the previous frame is CELP and the current frame is to be coded by HQ MDCT, the current frame is a transition frame in which two types of decoding are used:
- Constrained CELP coding and (when required) simplified time-domain BWE coding
- HQ MDCT coding with a modified window
Constrained CELP decoding means here that CELP is restricted to decode only a subset of CELP parameters, to reuse parameters (LPC coefficients) from the previous CELP frame, and to cover only the first subframe of the current frame. These constraints are set to minimize the bit budget taken by continuing CELP decoding in the current frame, this bit budget being taken out of HQ MDCT decoding.
As shown in Figure 112b, at the decoder side the transition frame includes a gap between the previous output frame (decoded by CELP) and the decoded signal containing only the HQ MDCT contribution. The length of this gap at the decoder is 4.375 ms, which corresponds to 10 ms - 5.625 ms (10 ms for ¼ of the MDCT window support, minus 5.625 ms for the zero segment at the beginning of the ALDO synthesis window). In addition, an overlap period of 1.875 ms is used to attenuate discontinuities between the CELP and HQ MDCT decoded signals. The total transition region between CELP and HQ MDCT in the decoder (grey "decoder" zone in Figure 112b) is therefore 4.375 + 1.875 = 6.25 ms.
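The region lengths of the transition frame can be checked, and converted to samples at the output sampling rate, with a short Python sketch. The function name `transition_regions` is hypothetical; the durations are the 10 ms quarter of the MDCT window support, the 5.625 ms ALDO zero segment, and the 1.875 ms overlap that brings the total to 6.25 ms.

```python
def transition_regions(fs_hz):
    """Return (gap, overlap, total) lengths in samples at fs_hz."""
    gap_ms = 10.0 - 5.625   # 1/4 MDCT window support minus ALDO zero segment
    overlap_ms = 1.875      # CELP / HQ MDCT cross-fade region
    total_ms = gap_ms + overlap_ms

    def to_samples(ms):
        return round(ms * fs_hz / 1000.0)

    return to_samples(gap_ms), to_samples(overlap_ms), to_samples(total_ms)
```

For example, at 48 kHz the gap, overlap and total transition region are 210, 90 and 300 samples, and at 16 kHz they are 70, 30 and 100 samples.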
Figure 112b: Modified MDCT synthesis window in the transition frame (CELP to HQ MDCT).
6.3.3.2.1 Constrained CELP decoding and simplified BWE decoding
The bit budget for CELP and BWE in the current (transition) frame is determined by the CELP core used in the previous frame (12.8 kHz or 16 kHz) and by the decoded audio bandwidth in the current frame, as described in the pseudo-code in clause 5.4.3.2.1. Note that, since the current frame is a transition frame, one bit is used to indicate the type of CELP coding (12.8 kHz or 16 kHz); this bit allows the transition frame to be decoded even when the information about the previous CELP coding type was lost due to frame erasures.
LPC coefficients from the end of the previous frame are reused; constrained CELP decoding only relies on decoding one extra subframe with the same CELP core decoder (12.8 kHz or 16 kHz) as in the previous frame. Subframe decoding similar to clause 6.1 is applied (without LSF decoding). The length of the decoded subframe is 5 ms if CELP is at 12.8 kHz and 4 ms if CELP is at 16 kHz. The decoded CELP synthesis is normally delayed by 1.25 ms at the decoder due to the FIR resampling/delay operation. However, to have enough samples to cover the transition region of 6.25 ms, the resampling memory is resampled with zero delay using the optimized cubic interpolation described in clause 6.3.3.2.1.1.
Note that the cubic interpolation requires two future samples, which are estimated by simply repeating the last value of the FIR memory.
Hence, if CELP is at 12.8 kHz, the CELP synthesis (after FIR resampling and zero-delay resampling) is 6.25 ms long; if CELP is at 16 kHz, the subframe is 1 ms shorter and the CELP synthesis is extended by ringing to obtain a decoded signal of 6.25 ms. The ringing is used only in the overlap region and its perceptual influence is minimal.
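The coverage arithmetic of the paragraph above can be sketched in Python (the function and constant names are hypothetical). One CELP subframe is 64 samples at both internal rates (5 ms at 12.8 kHz, 4 ms at 16 kHz); the zero-delay resampling of the 1.25 ms FIR memory adds 1.25 ms, and any remainder up to 6.25 ms has to come from ringing.

```python
TRANSITION_MS = 6.25   # transition region to cover at the decoder
FIR_MEMORY_MS = 1.25   # FIR resampling memory, resampled with zero delay
SUBFRAME_SAMPLES = 64  # one CELP subframe at 12.8 kHz or 16 kHz


def celp_coverage(internal_fs_hz):
    """Return (ms covered by CELP synthesis, ms that must come from ringing)."""
    subframe_ms = SUBFRAME_SAMPLES * 1000 / internal_fs_hz
    covered_ms = subframe_ms + FIR_MEMORY_MS
    return covered_ms, TRANSITION_MS - covered_ms
```

This reproduces the two cases of the text: 6.25 ms with no ringing at 12.8 kHz, and 5.25 ms plus 1 ms of ringing at 16 kHz.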
When the coded audio bandwidth is higher than the bandwidth of the core CELP coder, BWE decoding is applied. The previous frame is high-pass FIR filtered to obtain the high band, and the pitch lag and gain are decoded and used to repeat 6.25 ms of high-band signal, which is added to the 6.25 ms of decoded CELP signal.
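The high-band repetition can be sketched as a long-term-predictor-style extension, where each new high-band sample is the decoded gain times the sample one pitch lag earlier. This Python sketch assumes the high-pass filtered high band of the previous frame is already available (the FIR high-pass filter itself is not shown); the function name `repeat_high_band` and the exact recursion are assumptions, not the normative BWE.

```python
def repeat_high_band(hb_past, pitch_lag, gain, n_out):
    """Extend a high-band signal by pitch repetition.

    hb_past: high-pass filtered samples of the previous frame (oldest first).
    Each output sample is gain * (the sample pitch_lag samples earlier). The
    history includes already generated samples, so a lag shorter than n_out
    keeps repeating the newly generated signal.
    """
    if not 0 < pitch_lag <= len(hb_past):
        raise ValueError("pitch lag must lie within the available history")
    buf = list(hb_past)
    start = len(buf)
    for n in range(n_out):
        buf.append(gain * buf[start + n - pitch_lag])
    return buf[start:]
```

For 6.25 ms at a 32 kHz output rate, `n_out` would be 200 samples.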
6.3.3.2.1.1 Optimized cubic interpolation
The missing signal at the output sampling frequency is partly available in the memory buffer at the internal sampling frequency (12.8 kHz or 16 kHz). By applying a low-delay, interpolation-based resampling to this memory, a good estimate of the missing signal can be obtained. Third-order (cubic) interpolation is used here: each cubic curve interpolates the output values within 3 input intervals delimited by 4 input samples. Conversely, each input interval can be interpolated using 3 different cubic curves. To further improve the quality of this estimate, each interpolated sample is obtained as a weighted mean of the cubic-interpolated values computed from all the curves covering the time position of the sample to interpolate.
The length of the resampling buffer (input to the cubic interpolation) is 1.25 ms (16 samples at 12.8 kHz sampling rate or 20 samples at 16 kHz sampling rate) plus 2 past samples used as memory for the cubic interpolations of the first 2 intervals. In cubic interpolation, 4 consecutive input samples determine a cubic curve, whose general equation is
$$ f(x) = a x^3 + b x^2 + c x + d $$
To simplify the computation of the coefficients, the temporal indices of the 4 consecutive input samples are always taken as $-1$, $0$, $1$ and $2$, so that they define 3 intervals: $[-1, 0]$, $[0, 1]$ and $[1, 2]$. Denoting the values of these 4 input samples $y_{-1}$, $y_0$, $y_1$ and $y_2$, the coefficients $a$, $b$, $c$ and $d$ can be computed as:
$$ a = \frac{-y_{-1} + 3y_0 - 3y_1 + y_2}{6} \tag{1931a} $$
$$ b = \frac{y_{-1} - 2y_0 + y_1}{2} \tag{1931b} $$
$$ c = \frac{-2y_{-1} - 3y_0 + 6y_1 - y_2}{6} \tag{1931c} $$
$$ d = y_0 \tag{1931d} $$
To obtain the resampled output signal, the output value usually has to be determined between two input samples, i.e. within the interval delimited by these input samples. As mentioned above, in cubic interpolation one cubic curve covers 3 intervals and, conversely, each interval can be covered by 3 different cubic curves: by the central interval $[0, 1]$ of the central cubic curve, by the interval $[1, 2]$ of the previous cubic curve, or by the interval $[-1, 0]$ of the next cubic curve. In the following, the index $i$ corresponds to the beginning of the input interval in which the output interpolated sample is computed, and $x \in [0, 1)$ is the position of the output sample within this interval. Let us denote the coefficients of the cubic curve whose central interval is used $a_i$, $b_i$, $c_i$, $d_i$, the coefficients of the previous cubic curve $a_{i-1}$, $b_{i-1}$, $c_{i-1}$, $d_{i-1}$, and the coefficients of the next cubic curve $a_{i+1}$, $b_{i+1}$, $c_{i+1}$, $d_{i+1}$. This gives 3 possible values for a given time instant $i + x$:
$$ \hat{y}_c(i+x) = a_i x^3 + b_i x^2 + c_i x + d_i \tag{1931e} $$
$$ \hat{y}_p(i+x) = a_{i-1}(x+1)^3 + b_{i-1}(x+1)^2 + c_{i-1}(x+1) + d_{i-1} \tag{1931f} $$
$$ \hat{y}_n(i+x) = a_{i+1}(x-1)^3 + b_{i+1}(x-1)^2 + c_{i+1}(x-1) + d_{i+1} \tag{1931g} $$
The interpolated output value for a given instant $i + x$ is computed as the weighted mean of these 3 possible interpolated values:
$$ \hat{y}(i+x) = w_c\,\hat{y}_c(i+x) + w_p\,\hat{y}_p(i+x) + w_n\,\hat{y}_n(i+x) \tag{1931h} $$
The same weight is used for each interpolated value: $w_c = w_p = w_n = 1/3$. To reduce the complexity, the values of $x/3$, $x^2/3$, $x^3/3$, $(x-1)/3$, $(x-1)^2/3$, $(x-1)^3/3$, $(x+1)/3$, $(x+1)^2/3$ and $(x+1)^3/3$ are tabulated for all values of $x$ needed for the interpolations. Since the weighting by 1/3 is integrated in these tables, only the coefficients $d_i$, $d_{i-1}$ and $d_{i+1}$ need a multiplication by 1/3 when the output value is computed. For example, to upsample from 12.8 kHz to 32 kHz, the required values of $x$ are 0.2, 0.4, 0.6 and 0.8.
The last 2 intervals cannot be covered by 3 cubic curves, since the future samples needed to compute all curves are not available. A simplified interpolation is used instead. For the last-but-one input interval, the central interval of the last possible cubic curve is used to compute the interpolated signal:
$$ \hat{y}(i+x) = a_i x^3 + b_i x^2 + c_i x + d_i \tag{1931i} $$
and for the last input interval, the interval $[1, 2]$ of the same last cubic curve is used:
$$ \hat{y}(i+1+x) = a_i (x+1)^3 + b_i (x+1)^2 + c_i (x+1) + d_i \tag{1931j} $$
where $i$ here denotes the beginning of the last-but-one input interval.
In case of subsampling, the output samples after the last input sample cannot be interpolated, which causes a small delay of up to 3 output samples.
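The three-curve cubic interpolation described in this clause can be sketched in Python as below. This is an illustrative floating-point sketch, not a low-complexity implementation: it evaluates each curve directly instead of using the precomputed x/3, x^2/3, ... tables, and the function names are hypothetical. The coefficients are those of the cubic through the points (-1, y_-1), (0, y_0), (1, y_1), (2, y_2).

```python
def cubic_coeffs(ym1, y0, y1, y2):
    """Coefficients (a, b, c, d) of f(x) = a x^3 + b x^2 + c x + d through
    the points (-1, ym1), (0, y0), (1, y1), (2, y2)."""
    a = (-ym1 + 3.0 * y0 - 3.0 * y1 + y2) / 6.0
    b = (ym1 - 2.0 * y0 + y1) / 2.0
    c = (-2.0 * ym1 - 3.0 * y0 + 6.0 * y1 - y2) / 6.0
    d = y0
    return a, b, c, d


def eval_cubic(coeffs, t):
    """Evaluate the cubic at local position t (Horner form)."""
    a, b, c, d = coeffs
    return ((a * t + b) * t + c) * t + d


def interp_three_curve(y, i, x):
    """Mean (weights 1/3) of the three curves covering interval [i, i+1]."""
    if not 2 <= i <= len(y) - 4:
        raise ValueError("all three curves must exist for interval i")
    prev_c = cubic_coeffs(*y[i - 2:i + 2])  # [i, i+1] is its interval [1, 2]
    cent_c = cubic_coeffs(*y[i - 1:i + 3])  # [i, i+1] is its interval [0, 1]
    next_c = cubic_coeffs(*y[i:i + 4])      # [i, i+1] is its interval [-1, 0]
    return (eval_cubic(prev_c, x + 1.0)
            + eval_cubic(cent_c, x)
            + eval_cubic(next_c, x - 1.0)) / 3.0


def interp_tail(y, x, last=False):
    """Simplified rule for the last two intervals, using the last curve
    (built on y[-4:]): central interval for the last-but-one interval,
    interval [1, 2] for the last interval."""
    coeffs = cubic_coeffs(*y[-4:])
    return eval_cubic(coeffs, x + 1.0 if last else x)
```

A convenient sanity check: for input samples of any cubic polynomial, all three curves coincide with that polynomial, so both functions return its exact values at the requested positions.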
6.3.3.2.2 HQ MDCT decoding with a modified synthesis window
HQ MDCT decoding in the transition frame is identical to clause 6.2.3, except the MDCT synthesis window is modified and the bit budget in the current frame is decreased as described in clause 6.3.3.2.1.
The modified MDCT window is designed to avoid aliasing in the first part of the frame. Its shape also allows cross-fading between the synthesis from constrained CELP and simplified BWE and the synthesis from HQ MDCT.
6.3.3.2.3 Cross-fading
As shown in Figure 112b, the CELP and HQ MDCT decoded signals overlap; the length of this overlap region is 1.875 ms. The HQ MDCT synthesis is already windowed by the modified MDCT window at the decoder. The CELP decoded signal is windowed by the complementary window and added to the HQ MDCT output in the overlap region.
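The overlap-region combination can be sketched in Python as below. The sketch assumes that `w` holds the values of the effective HQ MDCT window over the overlap region (already applied to the MDCT synthesis), rising from 0 to 1, and that the complementary window is 1 - w so that the two weights sum to one. Both the window values and this choice of complement are assumptions rather than the normative window of the transition frame, and the function name is hypothetical.

```python
def overlap_add_complementary(mdct_windowed, celp, w):
    """Combine CELP and HQ MDCT outputs over the overlap region.

    mdct_windowed: HQ MDCT synthesis, already weighted by w at the decoder.
    celp: CELP (plus BWE) synthesis over the same samples.
    w: effective MDCT window values over the overlap (assumed to rise 0 -> 1).
    The CELP signal is weighted by the assumed complement 1 - w.
    """
    if not len(mdct_windowed) == len(celp) == len(w):
        raise ValueError("all inputs must cover the same overlap region")
    return [m + (1.0 - wn) * c for m, c, wn in zip(mdct_windowed, celp, w)]
```

If both decoders produced the same constant signal, the result over the overlap equals that constant, which is the property the complementary weighting is meant to provide.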
6.3.4 Internal sampling rate switching
When changing the internal sampling rate in CELP or MDCT-based TCX, a number of memory and buffer updates need to be performed. These are described in the following sub-clauses.
6.3.4.1 Reset of LPC memory
Same as subclause 5.4.4.1.
6.3.4.2 Conversion of LP filter between 12.8 and 16 kHz internal sampling rates
Same as subclause 5.4.4.2.
6.3.4.3 Extrapolation of LP filter
Same as subclause 5.4.4.3.
6.3.4.4 Update of CELP synthesis memories
Same as subclause 5.4.4.7.
6.3.4.5 Update of CELP decoded past signal
When switching from CELP coding mode to MDCT-based TCX coding mode, the CELP decoded past signal with
is needed as described in subclause 6.3.3.1. In case of internal sampling rate switching, this past signal is first resampled with the method described in subclause 5.4.4.4, before proceeding as described in subclause 6.3.3.1.
6.3.4.6 Post-processing
In case of sampling rate switching, the post-processing module described in clause 6.1.4 has a specific behavior during the transition.
6.3.4.6.1 Adaptive post-filtering
In case of sampling rate switching, and if post-filtering was applied in the previous frame, the memories of the adaptive post-filter are resampled with the linear interpolation described in subclause 5.4.4.4.
If the post-filtering was not employed in the previous frame, the adaptive post-filter is not employed in the first frame after the transition. In such a case, only the memories of the post-filter are populated and updated for the next frame.
Moreover, if the adaptive post-filter was active in the previous frame and is switched off in the current frame, a smoothing mechanism is applied to avoid any discontinuity. This is achieved by computing the zero-input response of the post-filter states and adding it to the zero-memory response of the current decoded frame. First, the first subframe of the current decoded frame is filtered by the corresponding LPC analysis filter, using as memory the previously decoded samples before post-processing. In case of internal sampling rate switching, these past non-post-processed samples are resampled as described in clause 5.4.4.4. The residual is re-synthesized with
using this time as memory the past decoded samples after post-processing. As stated above, the past samples used as memory are likewise resampled in case of internal sampling rate switching. The rest of the current decoded frame is not processed further.
6.3.4.6.2 Bass post filter
At 9.6, 16.4 and 24.4 kbps, the past memory of signal defined in clause 6.1.4.2 is reset in case of sampling rate switching. That means that the Bass post-filter has no effect during the transition.
6.3.4.7 CLDFB
In case of internal sampling rate switching, the states of the analysis CLDFB need to be resampled both for the decoded signal (coming from either the CELP or the MDCT-based TCX decoding module) and for the error signal provided by the bass post-filter as defined in clause 6.1.4.2.
The resampling of the two sets of states is performed using the linear interpolation described in subclause 5.4.4.4.
6.3.5 EVS primary modes and AMR-WB IO
6.3.5.1 Switching from primary modes to AMR-WB IO
In addition to the processing described in subclause 5.4.5.1, the following is done:
- Reset the unvoiced/audio signal improvement memories
- Reset the AMR-WB BWE memories
- The bass post-filter is not employed in the first AMR-WB IO frame
- The formant post-filter is not employed in the first AMR-WB IO frame
6.3.5.2 Switching from AMR-WB IO mode to primary modes
Same as Subclause 5.4.5.2.
6.3.6 Rate switching
When the bit rate changes, the different coding tools are reconfigured at the beginning of the frame. The bit-rate-dependent setup of each tool is described in the corresponding clause. Rate switching does not require any specific handling, except in the following scenarios.
6.3.6.1 Rate switching along with internal sampling rate switching
In case the internal sampling rate changes when switching the bit rate, the processing described in clause 6.3.4 is performed first.
6.3.6.2 Rate switching along with coding mode switching
In case the coding mode changes when switching the bit rate, the processing described in clause 6.3.3 is performed. If the internal sampling rate also changes, the processing of clause 6.3.4 is performed beforehand.
6.3.6.3 Adaptive post-filter reset and smoothing
The adaptive post-filter is reset when it is switched on, and its effect is smoothed when it is switched off, from one frame to the next. The procedure is described in subclause 6.3.4.6.1, where the buffer resampling is performed only in case of internal sampling rate switching.
6.3.7 Bandwidth switching
When rate switching occurs and the bandwidth of the output signal changes from WB to SWB or from SWB to WB, bandwidth switching post-processing is performed to improve the perceptual quality for the end users. Smoothing is applied when switching from WB to SWB, and blind bandwidth extension is used when switching from SWB to WB.
6.3.7.1 Bandwidth switching detector
First, a bandwidth switching detector is used to detect whether bandwidth switching occurs.
Initialize the counter of bandwidth switching from WB to SWB , and initialize the counter of bandwidth switching from SWB to WB
.
The counter is calculated as follows:
- If the counter
, the counter is reset to 0;
- else if the output bandwidth of the current frame is SWB, the total bit rate of the current frame is larger than 9.6 kbps, the bandwidth of the previous frame is WB and the total bit rate of the previous frame is not larger than 9.6 kbps, the counter
is incremented by 1;
- else if the counter
, the counters
and
will be reset as follows:
()
()
The counter is calculated as follows:
- If the counter
, the counter is reset to 0;
- else if the conditions
are all satisfied, the counter
is incremented by 1;
- else if the counter
, the counters
and
will be reset as follows:
()
()
Finally, the counter indicates the switching from super wideband to wideband, and the counter
indicates the switching from wideband to super wideband.
6.3.7.2 Super wideband switching to wideband
TBE mode, multi-mode FD BWE or an MDCT-based scheme is used to generate the SHB signal when switching to wideband.
If the following conditions
()
are satisfied, TBE mode is applied to reconstruct the SHB signal.
Otherwise, if the conditions are satisfied, multi-mode FD BWE algorithm is applied;
If the core is MDCT coding, the MDCT coefficients of the upper band are predicted.
6.3.7.2.1 TBE mode
The following steps are performed when TBE mode is used to generate the SHB signal for wideband output:
- Estimate the high-band LSFs and gain shapes from the corresponding parameters of the previous frame or by looking them up in predetermined tables.
- Reconstruct an initial SHB signal according to the TBE algorithm described in subclause 5.2.6.1.
- Predict a global gain of the initial SHB signal according to the spectral tilt parameter of the current frame and the correlation of the low frequency signal between the current frame and the previous frame.
- Modify the initial SHB signal by the predicted global gain to obtain a final SHB signal.
- Finally, combine the final SHB signal and the low frequency signal to obtain the output signal.
The spectral tilt parameter can be calculated as described in equations (800) and (801), and the correlation of the low frequency signals of the current frame and the previous frame can be the energy ratio between the current frame and the previous frame.
In detail, the algorithm of predicting the global gain is described as follows:
- Classify the signal of the current frame as a fricative signal
or a non-fricative signal
according to the spectral tilt parameter and the correlation of the low frequency signal between the current frame and the previous frame:
If the spectral tilt parameter of the current frame is larger than 5 and the FEC class of the low frequency signal is UNVOICED_CLAS, or the spectral tilt parameter is larger than 10, and if, in addition, either the previous frame is a non-fricative signal and the correlation parameter is larger than a threshold, or the previous frame is a fricative signal and the correlation parameter is less than a threshold, the current frame is classified as a fricative signal. Otherwise, the current frame is classified as a non-fricative signal
.
- For non-fricative signal, the spectral tilt parameter is limited to the range [0.5, 1.0]. For fricative signal, the spectral tilt parameter is limited to not larger than 8. The limited spectral tilt parameter is used as the global gain of the SHB signal.
For some special cases: if the energy of the SHB signal (calculated from the global gain and ) is larger than the energy of the signal in the frequency range [3200, 6400] Hz
, the global gain of the SHB signal is calculated as follows:
()
where is the energy of the initial SHB signal.
If the energy of the SHB signal is less than 0.05 times the energy of the signal in the frequency range [3200, 6400] Hz, the global gain of the SHB signal is calculated as follows:
(1938)
For a non-fricative signal, the global gain is multiplied by 2; for a fricative signal, it is multiplied by 8. The global gain of the SHB signal is then further smoothed as follows:
- If the signal of the current frame and the previous frame are both fricative signal and
(1939)
- else if the energy ratio of the low frequency signal between the current frame and the previous frame is in the range of [0.5, 2], and the modes of the signal of the current frame and the previous frame are both fricative or are both non-fricative
()
- Otherwise
()
where is the energy ratio between the final SHB signal of the previous frame and the initial SHB signal of the current frame.
At the end, fade out the global gain of the SHB signal frame by frame as follows:
()
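The classification and gain-limiting rules of this clause can be sketched in Python. The correlation thresholds are not given here, so `thr_up` and `thr_down` are hypothetical parameters, as are the function names; the tilt conditions (larger than 5 with UNVOICED_CLAS, or larger than 10) and the limits (0.5 to 1.0 for a non-fricative signal, at most 8 for a fricative signal) are taken from the text. The special cases, the multiplication by 2 or 8, the smoothing and the fade-out are not covered by this sketch.

```python
def is_fricative(tilt, unvoiced_clas, prev_fricative, corr, thr_up, thr_down):
    """Fricative / non-fricative decision for the current frame.

    tilt: spectral tilt parameter of the current frame.
    unvoiced_clas: True if the FEC class of the low band is UNVOICED_CLAS.
    prev_fricative: classification of the previous frame.
    corr: low-band correlation (energy ratio) between current and previous frame.
    thr_up, thr_down: hypothetical correlation thresholds.
    """
    tilt_ok = (tilt > 5.0 and unvoiced_clas) or tilt > 10.0
    if prev_fricative:
        history_ok = corr < thr_down
    else:
        history_ok = corr > thr_up
    return tilt_ok and history_ok


def limit_tilt_gain(tilt, fricative):
    """Limit the spectral tilt parameter; the result is used as the global gain."""
    if fricative:
        return min(tilt, 8.0)
    return min(max(tilt, 0.5), 1.0)
```

For example, a frame with tilt 6, class UNVOICED_CLAS, a non-fricative predecessor and a correlation above `thr_up` is classified as fricative, and its tilt-based gain is kept at 6.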
6.3.7.2.2 Multi-mode FD BWE mode
Predict the SHB signal of the current frame, and weight the SHB signal of the current frame and the previous frame to obtain the final SHB signal of the current frame. Then the SHB signal and the low frequency signal are combined to obtain the output signal.
The SHB signal of the current frame is generated as follows:
1) Predict the fine structure of the SHB signal of the current frame as described in subclause 6.1.5.2.1.5.
2) Predict the envelope of the SHB signal of the current frame, and weight the envelopes of the current frame and the previous frame to obtain the final envelope of the current frame.
3) Reconstruct the SHB signal of the current frame by the predicted fine structure and the weighted envelope.
In detail, the algorithm of predicting and weighting the envelopes is described as follows:
- The 224 MDCT coefficients in the 800-6400 Hz frequency range
are split into 7 sharpness bands (32 coefficients per band). Calculate the peak of the magnitudes and the average magnitude in each sharpness band.
- The low frequency signal is classified as NORMAL or HARMONIC according to the ratios between the peak magnitude and the average magnitude of the sharpness bands.
- Predict the initial envelope of the SHB signal according to the average magnitudes, the maximum value of the average magnitudes, the minimum value of the average magnitudes and the tilt parameter of the low frequency signal.
- An energy ratio
between the same parts of the low frequency signal of the current frame and the previous frame is introduced to control the weighting of the predicted envelope. This ratio
reflects the correlation of the current frame and the previous frame.
- Factor
is the weighting factor for the spectral envelope of the current frame, and
is the one for the previous frame.
is initialized to 0.1. Note that
is reset to 0.1 if
equals zero.
- If
, the predicted spectral envelope is weighted according to the energy ratio
()
and , when
where is the envelope of the previous frame. Factor
is incremented by 0.05 and saved for next frame.
Otherwise, the predicted envelope is weighted as follows:
()
Then, fade out the predicted envelope of the SHB signal frame by frame as follows:
()
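The first step of the envelope prediction, splitting the 800 to 6400 Hz MDCT coefficients into sharpness bands and measuring the peak and average magnitude of each band, can be sketched in Python. With 224 coefficients covering 5600 Hz, each coefficient spans 25 Hz. The function and constant names are hypothetical and the input is assumed to be exactly the 224 coefficients of that range; the subsequent NORMAL/HARMONIC decision and the envelope weighting are not covered by this sketch.

```python
N_SHARP_BANDS = 7
SHARP_BAND_LEN = 32  # 7 * 32 = 224 coefficients for 800-6400 Hz


def sharpness_band_stats(coeffs):
    """Return (peak magnitude, average magnitude) for each sharpness band."""
    if len(coeffs) != N_SHARP_BANDS * SHARP_BAND_LEN:
        raise ValueError("expected 224 MDCT coefficients (800-6400 Hz)")
    stats = []
    for b in range(N_SHARP_BANDS):
        band = coeffs[b * SHARP_BAND_LEN:(b + 1) * SHARP_BAND_LEN]
        mags = [abs(v) for v in band]
        stats.append((max(mags), sum(mags) / SHARP_BAND_LEN))
    return stats
```

A band dominated by a single strong peak yields a high peak-to-average ratio, which is the kind of evidence the classification into NORMAL or HARMONIC relies on.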
6.3.7.2.3 MDCT core
For low bit rate core: predict the envelopes of the upper band from the 2 decoded highest frequency envelopes and the average envelope of all the decoded envelopes, and reconstruct the normalized coefficients by random noise. The MDCT coefficients of the upper band are then reconstructed from the envelopes and the normalized coefficients.
For high bit rate core: for transient mode, predict the envelopes by the 20 decoded highest frequency coefficients; and for non-transient mode, predict the envelopes by the 2 decoded highest frequency envelopes. The normalized coefficients are predicted by random noise or by weighting the random noise and the decoded low frequency coefficients. And then the MDCT coefficients of the upper band are reconstructed by the envelopes and the normalized coefficients.
6.3.7.3 Wideband switching to super wideband
In the case of switching from wideband to super wideband with multi-mode FD BWE or the MDCT core, the coefficients in the frequency domain are faded using the factor. When operating with TBE, the global gain in the temporal domain is faded using the same factor
.