5.2.6 Coding of upper band for LP-based Coding Modes

3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description, Release 15

5.2.6.1 Bandwidth extension in time domain

The time-domain bandwidth extension (TBE) module codes the signal content beyond the range of frequencies that are coded by the low band core encoder. The inaccuracies in the representation of the spectral and the temporal information content at higher frequencies in a speech signal are masked more easily than contents at lower frequencies. Consequently, TBE manages to encode the spectral regions beyond what is coded by the core encoder in speech signals with far fewer bits than what is used by the core encoder.

The TBE algorithm is used for coding the high band of clean and noisy speech signals when the low band is coded using the ACELP core. The input time-domain signal of the current frame is divided into two parts: the low band signal, and either the super higher band (SHB) signal for SWB input or the higher band (HB) signal for WB input. SWB TBE encoding of the upper band (6.4 to 14.4 kHz or 8 to 16 kHz) is supported at 9.6 kbps, 13.2 kbps, 16.4 kbps, 24.4 kbps and 32 kbps, whereas WB TBE encoding of the upper band (6 to 8 kHz) is supported at 9.6 kbps and 13.2 kbps. Specifically, at 13.2 kbps and 32 kbps the SWB TBE algorithm is employed as the bandwidth extension algorithm when the speech/music classifier determines that the current frame is active speech, or when GSC coding is selected to encode super wideband noisy speech. Similarly, at 9.6 kbps, 16.4 kbps and 24.4 kbps the TBE algorithm is used to perform bandwidth extension for all ACELP frames. The ACELP/MDCT core coder selection at 9.6 kbps, 16.4 kbps and 24.4 kbps is based on an open-loop decision as described in subclause 5.1.14.1.

In coding the higher frequency bands that extend beyond the narrowband frequency range, bandwidth extension techniques exploit the inherent relationship between the signal structures in these bands. Since the fine signal structure in the higher bands is closely related to that in the lower band, explicit coding of the fine structure of the high band is avoided. Instead, the fine structure is extrapolated from the low band. The high-level architecture of the TBE encoder is shown in Figure 44. In the WB TBE framework, the front-end processing that generates the high band target signal in Figure 44 is replaced by a simple flip-and-decimate-by-4 operation.

Figure 44: Time Domain Bandwidth Extension Encoder. Blocks in dashed lines are activated only at higher bit rates (24.4 kb/s or higher)

5.2.6.1.1 High band target signal generation

The input speech signal, sampled at either 32 kHz or 48 kHz, is processed through the QMF analysis filter bank. This processing is performed as part of the common processing described in subclause 5.1. The filter bank outputs real and imaginary sub-band values, each indexed by a sub-band time index and a sub-band frequency index. The number of sub-bands, L_c, scales with the sample rate F_s of the input signal: L_c = F_s/800 with F_s in Hz. For example, L_c = 40 for a SWB input sampled at 32 kHz, and L_c = 60 for a FB input sampled at 48 kHz.

The spectral flip and down-mix module extracts the high band components from the input speech signal. The high band target signal contains the high frequency components of the input signal that are to be represented by the time-domain bandwidth extension encoder. The frequency range of the high band depends on the coding bandwidth of the low band ACELP core. When the low band ACELP core codes up to a maximum bandwidth of 6.4 kHz, the high band target signal contains input signal components in the 6.4 to 14.4 kHz band. When the low band ACELP core codes up to a maximum bandwidth of 8 kHz, the high band target signal contains input signal components in the 8 to 16 kHz band. In the high band target signal, the input signal components are arranged in a flipped and down-mixed format such that the 0 frequency value in the SWB high band target signal corresponds to the maximum frequency of the band in question (14.4 kHz or 16 kHz). The process of deriving the SWB high band target signal is shown in Figure 45.

Figure 45: Generation of high band target signal for SWB TBE

Similarly, the WB high band target signal, containing signal components from 6 to 8 kHz, is generated as shown in Figure 46. In particular, a simple flip-and-decimate-by-4 operation is performed to extract the 6 to 8 kHz high band from the 16 kHz sampled WB input signal. In fixed-point operation, the input signal is scaled dynamically, based on its spectral tilt, before the flip-and-decimate-by-4 operation. For signals with low energy in the high band (indicated by a spectral tilt >= 0.95), no headroom is provided before decimation, to avoid loss of precision after decimation. For signals with relatively higher energy in the high band (indicated by a spectral tilt < 0.95), a headroom of 3 bits is provided before decimation, to avoid saturation in the decimation operation.
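The fixed-point scaling rule above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and applying the headroom as an arithmetic right shift on integer samples is an assumption, not something this subclause specifies.

```python
TILT_THRESHOLD = 0.95  # from the text: tilt >= 0.95 means low high-band energy


def wb_target_headroom_bits(spectral_tilt):
    """Return the headroom (in bits) provided before flip-and-decimate-by-4."""
    return 0 if spectral_tilt >= TILT_THRESHOLD else 3


def scale_for_decimation(samples, spectral_tilt):
    """Scale integer samples down by the selected headroom.

    Assumption: headroom is applied as an arithmetic right shift.
    """
    shift = wb_target_headroom_bits(spectral_tilt)
    return [s >> shift for s in samples], shift
```

The returned shift would have to be undone (or tracked as a Q-format change) after decimation; how that is done is outside the scope of this sketch.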

Figure 46: Generation of high band target signal for WB TBE

The process of deriving the SWB high band target signal using the complex low-delay filter bank (CLDFB) analysis is described below. The real and imaginary CLDFB coefficients are first flipped as follows:

()

The values m and M depend on the maximum bandwidth coded by the ACELP core, as shown in Table 52. The flipped real and imaginary coefficients are used to generate the high band target signal using the CLDFB synthesis described in subclause 6.9.3.

Table 52: Maximum and minimum frequencies for spectral flipping and downmix

Low band core sample rate	Low band coded bandwidth	m	M	High band target signal band
12.8 kHz	6.4 kHz	14	34	6.4-14.4 kHz
16 kHz	8 kHz	19	39	8-16 kHz
12.8 kHz	6.4 kHz	–	–	6-8 kHz (WB)

5.2.6.1.2 TBE LP analysis

TBE linear prediction analysis is performed on a 33.75 ms high band signal that includes the current 20 ms SHB target frame, 5 ms of past samples and 8.75 ms of look-ahead samples. The SWB high band target signal is generated using the CLDFB synthesis filter bank described in subclause 6.9.3. Twenty milliseconds of the high band target signal are used to populate the shb_old_speech buffer from sample 220 to sample 539, as shown in Figure 47. The WB TBE and SWB TBE LP analyses follow the same steps, described below, except that the buffer lengths differ to reflect the effective upper band coding bandwidth. Steps that are relevant only at low bit rates, or only for WB TBE and not for SWB TBE LP analysis, are identified where applicable.

Figure 47: SWB Highband target signal buffers

The first 220 samples in the shb_old_speech buffer are filled from the memory shb_old_speech_mem. The shb_old_speech_mem is updated as

()
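The buffer arithmetic can be checked with a short sketch. The 16 kHz sample rate is inferred from 20 ms corresponding to 320 samples (220 to 539). The memory update shown here, which keeps the last 220 samples of the filled buffer, is an assumption made for illustration, and the function name is hypothetical.

```python
FS_HZ = 16000  # inferred: 20 ms of target signal occupies 320 samples


def ms_to_samples(ms):
    return round(ms * FS_HZ / 1000)


PAST = ms_to_samples(5.0)        # past samples
FRAME = ms_to_samples(20.0)      # current SHB target frame
LOOKAHEAD = ms_to_samples(8.75)  # look-ahead samples
BUF_LEN = PAST + FRAME + LOOKAHEAD
MEM_LEN = 220                    # samples taken from shb_old_speech_mem


def fill_shb_old_speech(mem, new_target):
    """Build the 540-sample buffer and return it with the next memory.

    Assumption: the next memory is the last MEM_LEN samples of the buffer.
    """
    if len(mem) != MEM_LEN or len(new_target) != FRAME:
        raise ValueError("unexpected buffer lengths")
    buf = list(mem) + list(new_target)
    return buf, buf[-MEM_LEN:]
```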

For linear prediction analysis, a 540-sample LPC analysis frame is derived from the shb_old_speech buffer. A single analysis window is used to calculate 10 autocorrelation coefficients. The window function is shown in Figure 48.

Figure 48: SWB TBE analysis window

The autocorrelation coefficients are calculated according to

()

where the windowed signal is obtained by multiplying the LPC analysis frame by the analysis window.

A bandwidth expansion is applied to the autocorrelation coefficients by multiplying the coefficients by the expansion function:

()

The bandwidth expanded autocorrelation coefficients are used to obtain LP filter coefficients, by solving the following set of equations using the Levinson-Durbin algorithm.

()

It should be noted that
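As a rough illustration of the autocorrelation and Levinson-Durbin steps, the following sketch uses the common textbook convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p with a zero-based coefficient index. The sign and indexing conventions of the equations in this subclause may differ, so both are assumptions, as are the function names. Windowing and the bandwidth expansion of the autocorrelation are omitted.

```python
def autocorrelation(x, order):
    """r[k] = sum_n x[n] * x[n - k] for k = 0..order."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for LP coefficients.

    Returns (a, err) where a[0] == 1 and err is the final
    prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation of the current predictor with lag i
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For a tenth-order analysis the call would be `levinson_durbin(r, 10)` with `r` holding lags 0 to 10.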

5.2.6.1.3 Quantization of linear prediction parameters

First a spectral bandwidth expansion operation is performed on the LPC coefficients by multiplying with bandwidth expansion weights

()

The coefficients are given in Table 53:

Table 53: LPC bandwidth expansion coefficients

k	2	3	4	5	6	7	8	9	10	11
Weight	0.975	0.950625	0.926859375	0.903687891	0.881095693	0.859068301	0.837591593	0.816651804	0.796235509	0.776329621
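The tabulated weights follow a geometric progression: the weight at index k equals 0.975 raised to the power k - 1. The sketch below checks this against the table and applies the weights to a coefficient list. The 1-based indexing with the first coefficient left unscaled is inferred from the index range of the table, and the function name is hypothetical.

```python
TABLE_53 = {
    2: 0.975, 3: 0.950625, 4: 0.926859375, 5: 0.903687891,
    6: 0.881095693, 7: 0.859068301, 8: 0.837591593,
    9: 0.816651804, 10: 0.796235509, 11: 0.776329621,
}


def expansion_weight(k, gamma=0.975):
    """Weight for 1-based coefficient index k (k = 1 gives 1.0)."""
    return gamma ** (k - 1)


def expand_lpc(lpc):
    """Scale LPC coefficients; lpc[0] corresponds to index k = 1."""
    return [c * expansion_weight(k) for k, c in enumerate(lpc, start=1)]
```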

The bandwidth expansion of LP coefficients is performed in WB TBE LP analysis and in low bit rate SWB TBE analysis for bit rates below 13.2 kbps. The bandwidth-expanded LP coefficients are converted to line spectral frequencies.

The LSP weights are calculated from the line spectral frequencies

()

And based on the spacing between adjacent LSF parameters, and

()

where and , otherwise.

5.2.6.1.3.1 LSF quantization

WB TBE LSF quantization uses a single-stage vector quantizer with 2 bits at 9.6 kbps and 8 bits at 13.2 kbps. For low bit rate SWB LSF quantization at 9.6 kbps, and for primary frame encoding in channel aware mode at 13.2 kbps (see subclause 5.8.3.1), a simple 8-bit, 10-dimensional single-stage vector quantizer is used. In SWB TBE encoding at bit rates of 13.2 kbps and above, the 10-dimensional LSF vector is encoded with a scalar quantization (SQ) and extrapolation procedure. The low-frequency half of the LSF vector (the first five LSF coefficients) is scalar quantized with {4, 4, 3, 3, 3} bits. That is, the first two coefficients are quantized with four bits each, using the codebook (CB) in Table 54, while the remaining three coefficients are quantized with three bits each, using the reconstruction values in Table 55.

Table 54: Four bit CB for LSF quantization

Index	Value
0000	0.01798018
0001	0.02359377
0010	0.02790103
0011	0.03181538
0100	0.03579450
0101	0.03974377
0110	0.04364637
0111	0.04754591
1000	0.05181858
1001	0.05624165
1010	0.06022101
1011	0.06419064
1100	0.06889389
1101	0.07539274
1110	0.08504436
1111	0.10014875

Table 55: Three bit CB for LSF quantization

Index	Value
000	0.02070812
001	0.02978384
010	0.03800822
011	0.04548685
100	0.05307309
101	0.06137543
110	0.07216742
111	0.09013262

The output of the SQ is five quantized coefficients, which are also used as input to the extrapolation of the remaining five coefficients.
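The {4, 4, 3, 3, 3} bit allocation can be illustrated with a nearest-neighbour search over the two codebooks. This is a sketch only: the nearest-neighbour criterion is an assumption, the function names are hypothetical, and the text does not state here whether each quantizer input is a raw coefficient or a derived value, so the sketch simply quantizes a list of five input values.

```python
CB_4BIT = [
    0.01798018, 0.02359377, 0.02790103, 0.03181538,
    0.03579450, 0.03974377, 0.04364637, 0.04754591,
    0.05181858, 0.05624165, 0.06022101, 0.06419064,
    0.06889389, 0.07539274, 0.08504436, 0.10014875,
]
CB_3BIT = [
    0.02070812, 0.02978384, 0.03800822, 0.04548685,
    0.05307309, 0.06137543, 0.07216742, 0.09013262,
]
# First two values use the 4-bit CB, the remaining three the 3-bit CB
CODEBOOKS = [CB_4BIT, CB_4BIT, CB_3BIT, CB_3BIT, CB_3BIT]


def nearest_index(value, codebook):
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def quantize_low_half(values):
    """Return (indices, reconstructed values) for five input values."""
    idx = [nearest_index(v, cb) for v, cb in zip(values, CODEBOOKS)]
    return idx, [cb[i] for i, cb in zip(idx, CODEBOOKS)]


TOTAL_BITS = sum(len(cb).bit_length() - 1 for cb in CODEBOOKS)  # 17
```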

The high-frequency half of the LSF vector (the last five LSF coefficients) is calculated as a weighted average of the quantized low-frequency part of the LSF vector and selected optimal grid points. First, the already quantized coefficients are flipped around a quantized mirroring frequency, which separates the low-frequency part from the high-frequency part of the LSF vector. Then the flipped coefficients are adjusted by an optimal frequency grid codebook, which is obtained in a closed-loop search procedure.

The transmitted mirroring frequency is obtained as quantized difference between the first extrapolated and last quantized coefficient, added to the last coded position

()

The quantization is performed with a two-bit SQ with reconstruction points {0.01436178, 0.02111641, 0.02735687, 0.03712105}.

The five quantized low-frequency LSF coefficients are flipped around the mirroring frequency to form a set of flipped coefficients

()

Then the flipped coefficients are further scaled according to

()

The flipped and scaled coefficients are adjusted with one of four grid vectors. The pre-stored set of four grid vectors from Table 56 is first re-scaled to fit into the interval between the last quantized LSF and the maximum grid point value of 0.5

()

Table 56: CB with LSF grid points

grid 1	grid 2	grid 3	grid 4
0.15998503	0.15614473	0.14185823	0.15416561
0.31215086	0.30697672	0.26648724	0.27238427
0.47349756	0.45619822	0.39740108	0.39376780
0.66540429	0.62493785	0.55685745	0.59287916
0.84043882	0.77798001	0.74688616	0.86613986

Then smoothed coefficients are obtained by weighted averaging of the re-scaled grid sets and the re-scaled flipped coefficients

()

where the weights are pre-defined.

Different grid vectors form different sets of smoothed coefficients. The optimal grid is selected as the one that produces smoothed coefficients closest, in the mean-squared-error sense, to the actual high-frequency coefficients

()

The CB indices for the five low-frequency coefficients, mirroring frequency index, and the index of the optimal grid are transmitted to the decoder.

The quantized LSF parameters are checked for inter-LSF spacing. First, the spacing between adjacent LSFs is determined

()

If , then the quantized LSFs are adjusted as follows

If , then

Otherwise,

5.2.6.1.4 Interpolation of LSF coefficients

The quantized LSF parameters are converted into LSPs

()

The interpolation between the LSPs of the current frame and those of the previous frame is performed to yield a set of 4 interpolated LSPs, as follows:

()

The interpolated LSPs are then converted back to LSFs. This process gives 4 sets of interpolated LSFs. Further the LSFs are converted into LP coefficients . Note for all j. The conversion from the interpolated LSFs to LP coefficients is done similar to the process employed in the core coding modules.

5.2.6.1.4.1 LSP Interpolation at 13.2 kbps and 16.4 kbps

LSP interpolation is employed to smooth the SWB TBE LSP parameters at 13.2 kbps and 16.4 kbps. At these bit rates, the similarity of the signal characteristics of the current frame and the previous frame is analysed to determine whether they meet the Preset Modification Conditions (PMCs), that is, whether the signal characteristics of the two frames are close. If the PMCs are met, a first set of correction weights is determined from the line spectral frequency (LSF) differences of the current frame and the LSF differences of the previous frame. Otherwise, a second set of correction weights is set to constant values that lie in the range 0.0 to 1.0. The LSPs are then interpolated using the appropriate set of correction weights.

The PMCs are determined by first converting the linear prediction coefficients (LPCs) to reflection coefficients (RCs) using a backward Levinson-Durbin recursion. For a given N-th order LPC vector, the N-th reflection coefficient is derived first; the lower-order LPC vectors are then calculated using the following recursion.

()

which yields the reflection coefficient vector .

A spectral tilt parameter is then calculated from the first reflection coefficient as follows:

()

The PMCs are then determined from the spectral tilts of the current frame and the previous frame along with the coding types of the current and the previous frames. In the case that the current frame is a transition frame, the PMCs are by default not met.

Determining that the current frame is not a transition frame requires the following steps:

1) Determining whether there is a change from a fricative frame to a non-fricative frame. Specifically, if the tilt of the previous frame is greater than a tilt threshold value of 5.0 and the coder type of the frame is a transient, or the tilt of the previous frame is greater than a tilt threshold value of 5.0 and the tilt of the current frame is smaller than a tilt threshold value of 1.0, then the frame is a transition frame.
2) Determining whether there is a change from a non-fricative frame to a fricative frame. Specifically, if the tilt of the previous frame is smaller than a tilt threshold value of 3.0, the coder type of the previous frame is one of the four types VOICED, GENERIC, TRANSITION or AUDIO, and the tilt of the current frame is greater than a tilt threshold value of 5.0, then again the frame is a transition frame.
3) The current frame is not a transition frame if neither 1) nor 2) is met, in which case the PMCs are met.
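The decision logic in the steps above can be sketched as a pair of predicates. Function and parameter names are hypothetical; the coder type of the current frame is reduced to a boolean flag because only its transient status matters here, and the coder types of the previous frame are represented as strings for readability.

```python
# Previous-frame coder types that qualify in the non-fricative to fricative test
QUALIFYING_PREV_TYPES = {"VOICED", "GENERIC", "TRANSITION", "AUDIO"}


def is_transition_frame(prev_tilt, cur_tilt, cur_is_transient, prev_coder_type):
    """Return True if either transition condition from the text holds."""
    # Fricative to non-fricative
    fric_to_non = prev_tilt > 5.0 and (cur_is_transient or cur_tilt < 1.0)
    # Non-fricative to fricative
    non_to_fric = (prev_tilt < 3.0
                   and prev_coder_type in QUALIFYING_PREV_TYPES
                   and cur_tilt > 5.0)
    return fric_to_non or non_to_fric


def pmcs_met(prev_tilt, cur_tilt, cur_is_transient, prev_coder_type):
    """The PMCs are met exactly when the frame is not a transition frame."""
    return not is_transition_frame(prev_tilt, cur_tilt,
                                   cur_is_transient, prev_coder_type)
```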

When the PMCs are met, the first set of correction weights are defined as follows

()

where the and are defined as

()

where the are the LSFs of the current frame and the are the LSFs of the previous frame.

When the PMCs are not met, the second set of correction weights is used and these are set to 0.5 as follows:

()

The correction weights are finally used in the interpolation between the LSPs of the current frame and the LSPs of the previous frame, as follows:

()

where are the LSPs of the previous frame, are the LSPs of the current frame, are the interpolated LSPs. The interpolated LSPs are used to encode the current frame.

The interpolated LSPs are then converted back to LSFs, and the LSFs are then converted into LP coefficients.

5.2.6.1.4.2 LSP Interpolation at 24.4 kbps and 32 kbps

The interpolation between the LSPs of the current frame and those of the previous frame is performed to yield a set of 4 interpolated LSPs, as follows:

()

Table 57: LSP interpolation factors IC1 and IC2 over four sets

LSP set, j	IC1	IC2
Set 1	0.7	0.3
Set 2	0.4	0.6
Set 3	0.1	0.9
Set 4	0	1.0

The interpolated LSPs are then converted back to LSFs. This process gives 4 sets of interpolated LSFs. Further the LSFs are converted into LP coefficients. Note for all j.
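A minimal sketch of the four-set interpolation with the factors of Table 57 follows. It assumes that IC1 weights the LSPs of the previous frame and IC2 those of the current frame, which is consistent with Set 4 (IC1 = 0, IC2 = 1.0) reproducing the current frame; the function name is hypothetical.

```python
IC1 = [0.7, 0.4, 0.1, 0.0]  # assumed weight on previous-frame LSPs
IC2 = [0.3, 0.6, 0.9, 1.0]  # assumed weight on current-frame LSPs


def interpolate_lsps(prev_lsp, cur_lsp):
    """Return the 4 interpolated LSP sets, one per row of Table 57."""
    return [[w1 * p + w2 * c for p, c in zip(prev_lsp, cur_lsp)]
            for w1, w2 in zip(IC1, IC2)]
```

Each row of factors sums to 1, so every interpolated set is a convex combination of the two frames.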

5.2.6.1.5 Target and residual energy calculation and quantization

At 24.4 kb/s and 32 kb/s, energy parameters from the high band target frame, and from an LPC residual calculated from the target signal and the interpolated LP coefficients, are calculated and transmitted to the decoder.

For calculation of the energy parameters, the signal in the high band target frame (see Figure 47) is used. The 4 energy parameters are calculated according to:

()

The sum of the energy parameters is quantized using 6 bits in SWB TBE at 24.4 kbps and 32 kbps. A residual signal is calculated from the high band target signal using the interpolated LP coefficients. For j = 1, ..., 4 and n = 0, ..., 79,

()

The residual signal is then used to calculate the residual energy parameters,

()

The residual energy parameters are then normalized. First, the maximum residual energy parameter is determined. Each residual energy parameter is then normalized as per

()

Each normalized residual energy parameter is then scalar quantized using a 3-bit uniform quantizer with the lowest quantization point at 0.125 and uniform quantization steps of 0.125.
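The 3-bit uniform quantizer has eight reconstruction points, 0.125, 0.25, ..., 1.0. The sketch below uses a nearest-point search, which is an assumption about the decision rule; the function name is hypothetical.

```python
STEP = 0.125
LEVELS = [STEP * (i + 1) for i in range(8)]  # 0.125 ... 1.0


def quantize_residual_energy(x):
    """Return (index, reconstruction value) of the nearest level."""
    idx = min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - x))
    return idx, LEVELS[idx]
```

Since the highest level is 1.0, the quantizer range matches parameters that have been normalized to at most 1.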

5.2.6.1.6 Generation of the upsampled version of the lowband excitation

An upsampled version of the low band excitation signal is derived from the ACELP core as shown in Figure 49.

Figure 49: Generating the upsampled version of the lowband excitation

For each ACELP core coding subframe, i, random noise scaled by a voice factor is first added to the fixed codebook excitation that is generated by the ACELP core encoder. The voice factor is determined using the subframe maximum normalized correlation parameter that is derived during the ACELP encoding. First, the factors are combined as follows.

()

The value calculated above is limited to a maximum of 1 and a minimum of 0.

if the ACELP core encodes a maximum of 6.4 KHz or if the ACELP core encodes a maximum bandwidth of 8 KHz.

The resampled output is scaled by the ACELP fixed codebook gain and added to a delayed version of itself.

(696)

where g_c is the subframe ACELP fixed codebook gain, g_p is the subframe ACELP adaptive codebook gain and P is the open loop pitch lag.

5.2.6.1.7 Non-Linear Excitation Generation

The excitation signal is processed through a non-linear function in order to extend the pitch harmonics in the low band signal into the high band. The non-linear processing is applied to a frame of in two stages; the first stage works on the first half subframe (160 samples) of and the second stage works on the second half subframe. The non-linear processing steps for the two stages are described below. In the first stage , and in the second stage, .

First, the maximum amplitude sample and its location relative to the first sample in the stage are determined.

()

Based on the value of , the scale factor is determined.

()

The scale factor and the previous scale factor parameter from the memory are then used to determine the parameter scale step.

()

If , then

The output of the non-linear processing is derived as per

()

If the current frame is VOICED frame and sum of voice factors over the subframes is less than a threshold, , then the sign reversal when is not performed. The threshold, = 0.70 when there are five subframes per frame and = 0.78 when there are four subframes per frame.

The previous scale factor parameter is updated recursively for all according to

for(j=n₁; j< n₂; j++)

if(j<i_max)

end

5.2.6.1.8 Spectral flip of non-linear excitation in time domain

The non-linear excitation is spectrally flipped so that the high band portion of the excitation is modulated down to the low frequency region. This spectral flip is accomplished in time domain

()
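A spectral flip in the time domain is commonly realized by negating every other sample, that is, multiplying sample n by (-1)^n, which maps a component at frequency f to Fs/2 - f. The sketch below assumes that this standard form is what the equation above expresses; the function name is hypothetical.

```python
def spectral_flip(x):
    """Multiply sample n by (-1)**n, mirroring the spectrum about Fs/4."""
    return [s if n % 2 == 0 else -s for n, s in enumerate(x)]
```

For example, a constant (0 Hz) input becomes an alternating sequence at the Nyquist frequency, and applying the flip twice restores the input.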

5.2.6.1.9 Down-sample using all-pass filters

The spectrally flipped excitation is then decimated using a pair of all-pass filters to obtain an 8 kHz bandwidth (16 kHz sampled) excitation signal. This is done by filtering the even samples with an all-pass filter whose transfer function is given by

()

and the odd samples with an all-pass filter whose transfer function is given by

()

The 16 kHz sampled excitation signal is obtained by averaging the outputs of the two filters above.

The filter coefficients are specified in Table 58.

Table 58: All-pass filter coefficients for decimation by a factor of 2

	All pass coefficients
a_0,1	0.06056541924291
a_1,1	0.42943401549235
a_2,1	0.80873048306552
a_0,2	0.22063024829630
a_1,2	0.63593943961708
a_2,2	0.94151583095682
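The two-branch all-pass decimator can be sketched as below. Two things are assumptions because the transfer functions are not reproduced here: each branch is taken to be a cascade of three first-order all-pass sections of the form (a + z^-1)/(1 + a z^-1), and the coefficients a_i,1 are assigned to the even-sample branch and a_i,2 to the odd-sample branch. Filter memories start at zero for simplicity, and the function names are hypothetical.

```python
BRANCH_EVEN = [0.06056541924291, 0.42943401549235, 0.80873048306552]
BRANCH_ODD = [0.22063024829630, 0.63593943961708, 0.94151583095682]


def allpass_cascade(x, coeffs):
    """Run x through cascaded first-order all-pass sections."""
    y = list(x)
    for a in coeffs:
        mem = 0.0
        out = []
        for s in y:
            t = mem + a * s   # section output
            mem = s - a * t   # state update for the next sample
            out.append(t)
        y = out
    return y


def decimate_by_2(x):
    """Average the two branch outputs to obtain the half-rate signal."""
    even = allpass_cascade(x[0::2], BRANCH_EVEN)
    odd = allpass_cascade(x[1::2], BRANCH_ODD)
    return [0.5 * (e + o) for e, o in zip(even, odd)]
```

With this structure a constant input passes with unit gain once the filters settle, while an input alternating at the Nyquist frequency is cancelled by the averaging.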

5.2.6.1.10 Adaptive spectral whitening

Due to the non-linear processing applied to obtain the excitation signal, the spectrum of this excitation is no longer flat. To flatten the spectrum, 4th-order linear prediction coefficients are estimated from the excitation signal. The spectrum is then flattened by inverse filtering the excitation signal with the linear prediction filter.

The first step in the adaptive whitening process is to estimate the autocorrelation of the excitation signal

()

A bandwidth expansion is applied to the autocorrelation coefficients by multiplying the coefficients by the expansion function:

()

The bandwidth expanded autocorrelation coefficients are used to obtain LP filter coefficients, by solving the following set of equations using the Levinson-Durbin algorithm as described in section.

()

It must be noted that .

The whitened excitation signal is obtained from by inverse filtering

()

Four samples from the previous frame are used as memory for the above filtering operation.
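Inverse filtering with a 4th-order predictor and a four-sample memory can be sketched as follows. The convention A(z) = 1 + a_1 z^-1 + ... + a_4 z^-4 with a zero-based coefficient list is an assumption, as is the function name; the memory holds the last four input samples of the previous frame, oldest first.

```python
def whiten(x, a, mem):
    """FIR inverse filter: e[n] = sum_k a[k] * x[n - k].

    x   : current-frame samples
    a   : [1, a_1, ..., a_p] prediction filter coefficients
    mem : last p input samples of the previous frame, oldest first
    Returns (e, new_mem).
    """
    order = len(a) - 1
    if len(mem) != order:
        raise ValueError("memory length must equal filter order")
    hist = list(mem) + list(x)
    e = [sum(a[k] * hist[order + n - k] for k in range(order + 1))
         for n in range(len(x))]
    return e, hist[-order:]
```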

For bit rates 24.4 kb/s and 32 kb/s, the whitened excitation is further modulated (in a two-stage gain shape modulation) by the normalized residual energy parameter. In other words, for bitrates 24.4 kb/s and 32 kb/s,

()

5.2.6.1.11 Envelope modulated noise mixing

To the whitened excitation, a random noise vector whose amplitude has been modulated by the envelope of the whitened excitation is mixed using a mixing ratio that is dependent on the extent of voicing in the low band.

First, is calculated and then the envelope of the envelope of the whitened excitation signal is calculated by smoothing

()

In SWB mode, the two factors are calculated using the voicing factors for the subframes, which are calculated from parameters determined by the low band ACELP encoder. The voicing factors indicate the extent of voicing of the high band signal, since the fine signal structure in the higher bands is closely related to that in the lower band. The average of the 4 voicing factors is calculated and modified, and is then confined to values between 0.6 and 0.999. The two factors are then estimated as

()

However, for bit rates 16.4 kb/s and 24.4 kb/s and if TBE was not used in the previous frame, and are set to

()

and for , is substituted by an approximated value as

()

In WB mode, the factors and are initialized to and. However, if the bitrate is 9.6kb/s, they are reset to andif the low band coder type is voiced or, or to and if the low band coder type is unvoiced or. In SWB mode, for bit rates 24.4 kbps and 32 kbps, the mix factors are estimated based on both the low band voice factor, , and the high band closed-loop estimation, where

The mix factor is estimated based on the high band residual signal, , transformed low band excitation, , and the modulated white noise excitation .

A vector of random numbers, of length 160 is then modulated by to generate as

()

The whitened excitation is then de-emphasized with which is the pre-emphasised effect since the used spectrum is flipped.

()

If the lowband coder type is unvoiced, the excitation is first rescaled to match the energy level of the whitened excitation

()

where

()

And then pre-emphasised with =0.68 to generate the final excitation which is the de-emphasised effect since the used spectrum is flipped.

()

If the lowband coder type was not un-voiced, the final excitation is calculated as

(720)

for each sample index within subframe .

For bit rates less than 24.4 kb/s, the mixing parameters and are estimated as,

()

For bit rates 24.4 kb/s and 32 kb/s, the mixing parameters , are estimated for as follows:

(723)

(723a)

where the parameter is defined in equation ().

is then de-emphasised to generate the final excitation.

5.2.6.1.12 Spectral shaping of the noise added excitation

The excitation signal is then put through the high band LPC synthesis filter that is derived from the quantized LPC coefficients (see subclause 5.2.6.1.3).

For bitrates below 24.4 kb/s, a single LPC synthesis filter is used and the shaped excitation signal is generated as

()

For bitrates at and above 24.4 kb/s the LPC synthesis filter is applied to the excitation signal in four subframes based on

()

In particular, for bit rates at and above 24.4 kbps, a memory-less synthesis is first performed (with past LP filter memories set to zero) and the energy of the synthesized high band is matched to that of the original signal. In the subsequent step, the scaled (energy-compensated) excitation signal is used to perform the synthesis.

5.2.6.1.13 Post processing of the shaped excitation

The shaped excitation is the synthesized high band signal, which is generated by passing the excitation signal through the LPC synthesis filter. The excitation signal is determined by the low band model parameters, and the coefficients of the LPC synthesis filter are determined by the high band model parameters. A short-term post-filter is applied to the synthesized high band signal to obtain a short-term post-filtered signal. Compared with the spectral envelope of the synthesized high band signal, the spectral envelope of the short-term post-filtered signal is closer in shape to that of the original high band signal. The short-term post-filter includes a pole-zero filter whose coefficients are set from the high band model parameters, as follows:

()

is derived from the quantized LPC coefficients (see subclause 5.2.6.1.3) with the factors and controlling the degree of the short-term post‑filtering. are the quantized LPC coefficients. The factors and are calculated according to:

()

where is parameter that jointly controls envelope shape and excitation noisiness. It is based on the spectral tilt, determined by the LPC coefficient:

()

where is the past value of .

()

(731)

The gain term is calculated from the truncated impulse response of the filter and is given by:

()

The shaped excitation is divided into four subframes, and each subframe is filtered through , and then filtered with the synthesis filter to produce .

After filtering the synthesized high band signal using the pole-zero filter, filter then compensates for the tilt in the pole-zero filter and is given by:

()

where is set to default constant value or adaptively calculated according to the high band coding parameters and the synthesized high band signal. The calculation process is as follows: is a tilt factor, with being the first reflection coefficient calculated from by:

()

A gain term is applied to compensate for the decreasing effect of in. It has been shown that the product filter has a gain close to unity. is set according to the sign of. If is positive,, otherwise,.

Then is passed through the tilt compensation filter resulting in the post‑filtered speech signal

Adaptive Gain Control (AGC) is applied to compensate for any gain difference between the synthesized speech signal and the post‑filtered signal. The gain scaling factor for each subframe is calculated by:

()

and the post processed shaped excitation is given by:

()

where is updated in sample‑by‑sample basis and given by:

()

and where is an AGC factor with value of 0.85.

In order to smooth the evolution of the post-processed spectrally shaped highband excitation signal across frame boundaries, the look-ahead and the overlap samples are scaled based on the ratio of the current frame’s energy in the overlap region and the previous frame’s energy in the overlap region. The scale factor computation is performed as shown in equation (1579) in subclause 6.1.5.1.12.

The tenth-order LPC synthesis performed according to subclause 5.2.6.1.12 uses a memory of ten samples; thus there is at least an energy propagation over ten samples from the previous frame into the current frame. When calculating the energy scaling to be applied to the current frame, the first 10 samples of the current frame are considered part of the previous frame energy. If the voicing factor is greater than 0.75, the numerator in equation (1579) is attenuated by 0.25. The spectrally shaped high band signal is then modified by the scale factor as shown in equation (1580) in subclause 6.1.5.1.12.

5.2.6.1.14 Estimation of temporal gain shape parameters

The initial estimation of the temporal gain shape differs between the WB mode and the SWB mode.

5.2.6.1.14.1 Initialization estimation of temporal gain shape for WB mode

The frame is divided into eight segments and the energy envelope of each segment is calculated. The 4 gain shapes are then calculated from the 8 energy envelopes.

The energy envelopes of the eight segments of the target signal and the shaped excitation signal are calculated as follows:

Asymmetric windows are applied to the first and the eighth segments,

()

And symmetric windows are applied to the segments from the second to the seventh.

()

Four high band gain shapes are then calculated by combining pairs of energy envelopes as follows

()

The asymmetric windows are determined by the number of look-ahead samples and the number of segments. The asymmetric and symmetric windows are shown in Figure 50; the number of look-ahead samples is 5.

Figure 50: Asymmetric window and symmetric window

The window values are tabulated in Table 59:

Table 59: Window for highband gain shape calculation

Index	Value
0	0.0
1	0.15643448
2	0.30901700
3	0.45399052
4	0.58778524
5	0.70710677
6	0.80901700
7	0.89100653
8	0.95105654
9	0.98768836
10	1.0
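The tabulated values agree, to the printed precision, with a quarter-period sine, sin(k*pi/20) for k = 0, ..., 10. The sketch below checks this numerically; that the table was generated from this closed form is an inference from the values, and the function name is hypothetical.

```python
import math

TABLE_59 = [
    0.0, 0.15643448, 0.30901700, 0.45399052, 0.58778524, 0.70710677,
    0.80901700, 0.89100653, 0.95105654, 0.98768836, 1.0,
]


def wb_gain_shape_window(k, half_len=10):
    """Quarter-period sine: 0 at k = 0, 1 at k = half_len."""
    return math.sin(k * math.pi / (2 * half_len))
```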

5.2.6.1.14.2 Initialization estimation of temporal gain shape for SWB mode

The high band target frame (see subclause 5.2.6.1.1), and the post-processed shaped excitation are used to calculate 4 temporal gain shape parameters, for. The gain shape parameters are calculated using an overlap of 20 samples from the previous frame to avoid transition artifacts during the reconstruction at the decoder.

(744)

where the subframe energies in the target high band signal and the shaped excitation signal are calculated as

(745)

and

. (746)

The window function is given by

(747)

The window values are tabulated in Table 60.

Table 60: Window for highband gain shape calculation

Index	Value
0	0.0
1	0.006156
2	0.024472
3	0.054497
4	0.095492
5	0.146447
6	0.206107
7	0.273005
8	0.345492
9	0.421783
10	0.5
11	0.578217
12	0.654508
13	0.726995
14	0.793893
15	0.853553
16	0.904508
17	0.945503
18	0.975528
19	0.993844
20	1.0
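The values in Table 60 agree, to six decimals, with a raised-cosine ramp, 0.5*(1 - cos(k*pi/20)) for k = 0, ..., 20, which is the rising half of a Hann window. The check below is a sketch; the closed form is inferred from the values, and the function name is hypothetical.

```python
import math

TABLE_60 = [
    0.0, 0.006156, 0.024472, 0.054497, 0.095492, 0.146447, 0.206107,
    0.273005, 0.345492, 0.421783, 0.5, 0.578217, 0.654508, 0.726995,
    0.793893, 0.853553, 0.904508, 0.945503, 0.975528, 0.993844, 1.0,
]


def swb_gain_shape_window(k, ramp_len=20):
    """Raised-cosine ramp: 0 at k = 0, 0.5 at mid-ramp, 1 at k = ramp_len."""
    return 0.5 * (1.0 - math.cos(k * math.pi / ramp_len))
```

The ramp length of 20 matches the 20-sample overlap used in the SWB gain shape calculation, and a rising ramp plus its mirrored falling ramp sum to one across the overlap.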

5.2.6.1.14.3 Additional processing of temporal gain shape

Additionally, the gain shape values are normalized such that

()

The 4th subframe gain shape value from the last subframe is used to smooth the gain shape parameter evolution. A variable is calculated as

()

If the sum of the voicing factors , then the gain shape parameters are smoothed as follows:

()

The gain shape parameters vector corresponding to a given frame are transformed into the log domain and vector quantized. The quantized gain shape parameters are denoted

5.2.6.1.15 Estimation of frame gain parameters

In addition to the gain shape parameter, an overall frame gain parameter is calculated. First the spectrally shaped excitation is scaled by the gain shape parameters.

()

Samples with negative indices are obtained from the previous frame and

()

The energies of the and for the entire duration of the frame and the overlaps are calculated using a window:

()

where the negative samples are obtained from previous frames, and the window function is given by

()

where is defined in table .

If the high band target energy as calculated in Equation (754) is saturated, then the target signal, , is scaled based on the number of subframes that are saturated (from clause 5.2.6.1.14.2). Then the high band target energy is recalculated using the scaled target signal using Equation (754). Subsequently, the gain frame GF in Equation (757) is compensated for the scaling performed on the target signal.

The overall frame gain parameter is calculated as

()

The parameter is attenuated if the quantization of the gain shape parameter is poor. First the power-off-the-peak value for unquantized and quantized gain shape parameters is calculated:

()

If , then the gain shape parameter is deemed poorly quantized and the gain frame parameter is attenuated as follows, in order to avoid perceptually annoying artifacts in the reconstructed speech signal at the decoder.

()

Further, the parameter is smoothed if the current frame is determined to be similar in spectral characteristics to the previous frames. To determine similarity in spectral characteristics, the fast and slow evolution rates, and , and the minimum LSP spacing are determined as follows:

()

where and and when k=1 and otherwise.

A smoothed version of using the values from the previous frames is also determined:

()

If the minimum LSP spacing and the smoothed LSP spacing satisfy a spacing criterion (e.g., δ_min < 0.008, δ_smoothed < 0.005) indicating an artifact-generating condition, the high band target is filtered using the high band LP to attenuate the coding artifacts before estimating the gain shape and gain frame parameters.

If the and the values are smaller than 0.001, then the gain frame parameter is smoothed between the previous and the current frame.

()

Also, if, then an additional attenuation of the is performed: . Also, if the current frame’s , then is modified as per

()

where is calculated as described in subclause 5.2.6.1.15.

For the WB mode and if the bitrate is 9.6 kb/s, the gain frame parameter is further adapted, based on the average of the 4 voicing factors, , as , if the low band coder type is voiced, and as , if the low band coder type is not voiced, but .

The gain frame parameter is scalar quantized in the log domain using 5 bits. The lowest quantization point is chosen to be -1.0 and the quantization steps are set to be 0.15.
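
The 5-bit log-domain quantizer described above can be sketched as follows. The bit count, lowest point and step size come from the text. The base-10 logarithm, the nearest-level rounding rule and the function name `quantize_gain_frame` are assumptions made for illustration.

```python
import math

NUM_BITS = 5
NUM_LEVELS = 1 << NUM_BITS   # 32 quantization points
Q_MIN = -1.0                 # lowest quantization point (log domain)
Q_STEP = 0.15                # quantization step (log domain)


def quantize_gain_frame(gain_frame):
    """Return (index, dequantized gain).

    Assumes a base-10 log domain and nearest-level rounding; the text
    specifies neither.
    """
    log_gain = math.log10(max(gain_frame, 1e-10))  # guard against log(0)
    index = round((log_gain - Q_MIN) / Q_STEP)
    index = min(max(index, 0), NUM_LEVELS - 1)     # clamp to 5-bit range
    return index, 10.0 ** (Q_MIN + Q_STEP * index)


index, gain_hat = quantize_gain_frame(2.0)
print(index, round(gain_hat, 4))
```

With these settings the highest quantization point is −1.0 + 31 × 0.15 = 3.65 in the log domain.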

5.2.6.1.16 Estimation of TEC/TFA envelope parameters

The Temporal Envelope Coding (TEC) and the Temporal Flatness Adjuster (TFA) shape the temporal envelope of the high frequency band signal generated by the TBE. They are enabled as the post processing of the TBE at 16.4 and 24.4 kbps with the output sampling frequencies higher than 8000 Hz.

The TEC improves the shape of onsets at the decoder by using the temporal envelope of the low frequency band together with transmitted information on the temporal envelope of the high frequency band. The transmitted information distinguishes two onset shapes: “steep onset” and “gentle onset”. The TEC has a corresponding mode for each shape: the “no smoothing mode” for steep onsets and the “smoothing mode” for gentle onsets. The transmitted information therefore represents both the shape of the onset and the TEC mode to be used at the decoder, where it is used to calculate the temporal envelope of the high frequency band.

The TFA flattens the temporal envelope of the high frequency band according to the transmitted information on the flatness of the temporal envelope of the input signal.

5.2.6.1.16.1 Estimation of TEC parameters

The TEC parameter to be transmitted to the decoder carries information on the temporal envelope of the high frequency band and indicates whether the TEC is activated at the decoder and which mode it uses. It is estimated as follows.

The temporal envelope of the low frequency band of the input signal is defined as the mean of the temporal envelopes of three sub-bands in the low frequency band (sub-low-frequency-bands) in the CLDFB domain. The temporal envelope of the m-th sub-low-frequency-band is calculated by

()

where is the energy in the CLDFB domain described in subclause 5.1.2.2 and, and are the lower and upper limits of the CLDFB sub-band of the m-th sub-low-frequency-band. And then, the temporal envelope of the low frequency band is calculated by

()

The temporal envelope of the high frequency band of the input signal is calculated in the CLDFB domain.

()

where is the energy in the CLDFB domain described in subclause 5.1.2.2 and, and are the lower and upper limits of the CLDFB sub-band in the frequency range of the TBE.

The shape of the onset is detected by these temporal envelopes of the high and low frequency bands. A steep onset is detected and the no smoothing mode is selected when . And, a gentle onset is detected and the smoothing mode is selected when .

()

where and are the variances of and respectively.

When , the maximum values of the temporal envelopes of the high and low frequency bands and their positions are detected:

()

And then, the local minimum and maximum values of the temporal envelope of the high frequency band at the position i, are detected and the difference between the local maximum and minimum values is calculated:

()

The maximum of the difference and its position are detected:

()

Using the parameters above, is set as

()

When , the smoothed temporal envelope of the low frequency band is calculated

()

where

()

And then, the correlation coefficient between and is calculated. Further, the ratio between the variances of and is calculated:

()

Using these parameters, is set as

()

5.2.6.1.16.2 Estimation of TFA parameters

The parameter to be transmitted to the decoder represents the flatness of the temporal envelope of the high frequency band as well as the activation of the TFA at the decoder. It is estimated as follows.

The flatness measure of the temporal envelope of the signal in the high band target frame is calculated as:

()

where and where .

In addition to the flatness measure, the mean of the pitch lags and the mean of the open-loop pitch gains of the half frames are calculated. Finally, depending on the last core mode, the parameter is estimated from the parameters above:

in the case that the last core mode is not TCX20,

()

in the case that the last core mode is TCX20,

()

5.2.6.1.16.3 Set the transmitted parameter for TEC and TFA

The TEC and TFA parameters are jointly coded by 2 bits and then transmitted to the decoder together with the other TBE information:

()

5.2.6.1.17 Estimation of full-band frame energy parameters

The full-band TBE algorithm is used for coding the full band. Since the fine signal structure in the full band is closely related to that in the high band and low band, the full band of the signal is coded using only 4 bits together with the information from the low band and the high band. A synthesized full band signal is obtained by coding the high band and predicting the spectrum from the high band; the synthesized full band signal is then de-emphasized with which is determined by the characteristic factors derived from coding the low band signal. A band-passed full band signal is obtained by band-pass filtering the input signal. The energy ratio is calculated by comparing the energy of the de-emphasized synthesized full band signal with the energy of the band-passed full band signal. Finally, the parameters including the characteristic factors, the high band coding information and the energy ratio are transmitted to the decoder.

The synthesized full band signal is obtained as follows: a vector of random numbers of length 320 is passed through the LPC synthesis filter. The spectrum of is used as the predicted spectrum from the high band, and the coefficients of the LPC synthesis filter are derived from the quantized LPC coefficients (see subclause 5.2.6.1.3).

()

Then the synthesized full band signal is de-emphasized as follows:

The spectrum moving correction is applied to the spectrum of as follows:

()

The variables are the parameters of spectrum moving correction tabulated in Table below:

Table 61: Parameters of spectrum moving correction

Index	Value	Index	Value
0	9.536743164062500e-007	16	-9.536743164062500e-007
1	9.353497034680913e-007	17	-9.353497034680913e-007
2	8.810801546133007e-007	18	-8.810801546133007e-007
3	7.929511980364623e-007	19	-7.929511411930434e-007
4	7.929511980364623e-007	20	-6.743495077898842e-007
5	5.298330165715015e-007	21	-5.298329597280826e-007
6	3.649553264040151e-007	22	-3.649552411388868e-007
7	1.860525884467279e-007	23	-1.860525173924543e-007
8	0.000000000000000e+000	24	0.000000000000000e+000
9	-1.860526737118562e-007	25	1.860527589769845e-007
10	-3.649554116691434e-007	26	3.649554969342717e-007
11	-5.298331302583392e-007	27	5.298331871017581e-007
12	-6.743496214767220e-007	28	6.743496783201408e-007
13	-7.929512548798812e-007	29	7.929513117233000e-007
14	-8.810802114567196e-007	30	8.810802683001384e-007
15	-9.353497603115102e-007	31	9.353497603115102e-007

And then flip the spectrum of to get

()

The signal is then de-emphasized with described as follows:

()

The de-emphasized factor is determined by the characteristic factors of the signal such as “voicing factors”, “spectral tilt”, “short-term average energy”, “short-term average zero crossing rate”. The calculation of de-emphasized factor using the voicing factors is described as follows:

()

where is the voicing factor of the ith subframe.

The signal is modulated by the gain shape values (see subclause 5.2.6.1.14) and the overall frame gain parameter (see subclause 5.2.6.1.15).

()

The energy of is described as follows:

()

The original input signal is passed through the band-pass filter to obtain the band-passed full band signal, and the energy of from 16 kHz to 20 kHz is then calculated. The energy ratio is calculated as follows:

()

The energy ratio is then transmitted to the decoder using 4 bits.

5.2.6.2 Multi-mode FD Bandwidth Extension Coding

The input signal of the current frame is divided into two parts: the low band signal and the super higher band (SHB) signal for SWB signal or the higher band (HB) signal for WB signal. The low band signal of the input signal is coded by LP coding modes, and the SHB or HB signal of the input signal is coded by the multi-mode FD bandwidth extension (BWE) algorithm. A classification decision process of the SHB or HB signal of the input signal is first performed. Then, the multi-mode BWE algorithm for LP-based coding modes uses a combination of adaptive spectral envelope and time envelope coding for super wideband extension, and spectral envelope coding for wideband extension, according to the result of the classification decision process. The coded bitstream of the low band signal, the adaptive coded bitstream of the SHB or HB signal as well as the result of the classification decision process of the current input signal are output. Table 62 describes the multi-mode FD BWE at the different bitrates of operation. Theoretically, the delay of the synthesized output signal can be determined adaptively in the range of according to the delay of the core coding algorithm and the delay of the multi-mode bandwidth extension algorithm. To achieve the lowest delay for bandwidth extension coding, the super higher band (SHB) signal is delayed by . Here is larger than . Then, the achieved lowest delay of the multi-mode bandwidth extension is on the encoder side.

Table 62: Multi-mode FD BWE at different bitrates

Bitrate [kbps]	Bandwidth	Multi-mode FD BWE
7.2, 8	WB	Blind, HARMONIC/NORMAL
13.2	WB	Guided, HARMONIC/NORMAL
13.2	SWB	Guided, TRANSIENT/HARMONIC/NORMAL/NOISE
32	SWB/FB	Guided, TRANSIENT/HARMONIC/NORMAL/NOISE

5.2.6.2.1 SWB/FB Multi-mode FD Bandwidth Extension

For frames declared as TRANSIENT (TS) frames or as non-TRANSIENT, a bit budget of 31 bits is allocated to the SWB Multi-mode FD Bandwidth Extension. If the super higher band (SHB) signal of the input in the previous frame or in the current frame is detected as TRANSIENT, then the current frame is also classified as TRANSIENT. Non-TRANSIENT frames can be further classified as HARMONIC (HM), NORMAL (NM) or NOISE (NS) depending upon the frequency fluctuation that is detected. Two bits of the bit budget are allocated to the signal class. In case of a TRANSIENT frame, the remaining 29 bits are allocated to encode four spectral envelopes and four time envelopes. For other cases, i.e. non-TRANSIENT frames, the remaining 29 bits are allocated to encode fourteen spectral envelopes, and no time envelope is encoded. For FD BWE encoding, the 320 MDCT coefficients of the SHB signal, are coded. In the case of the FB mode, the encoding algorithm from 8kHz to 15.5kHz is the same as the SWB mode. To encode 15.5kHz to 20kHz, the spectral energies of from 11 kHz to 15.5 kHz and from 15.5 kHz to 20 kHz are calculated, and then the ratio of the two energies is coded using 4 bits after being quantized.

5.2.6.2.1.1 Windowing and time-to-frequency transformation

The input high-pass filtered signal is delayed by samples and windowed to obtain the windowed input signal as shown in figure 51, where is equal to:

()

Figure 51: Windowing of the input high-pass filtered signal

A 640-point length MDCT on top of is used for SWB FD BWE. Refer to subclause 5.3.2.

5.2.6.2.1.2 Transient detection

The input time-domain SHB signal of current frame, sampled at 16kHz, is first high-pass filtered; the high-pass filter serves as a precaution against low frequency components adversely affecting the processing. A first order IIR filter is used, and it is given by: ()

The output of the high-pass filter is obtained according to: ()

where denotes the 20ms frame length at 16kHz. The high-pass filtered signal is divided into four sub-frames; each corresponding to 5 ms or 80 samples.

The energy of each sub-frame, , is computed according to:

()

For each sub-frame, the signal’s long term energy, , is updated according to the following equation:

()

In the above equation, the forgetting factor is set to 0.25, and the convention is that for the first sub-frame,

from the previous frame. It should be noted that when the current frame and the previous frame apply different BWE algorithms, or the core configurations are different, then the signals’ long term energy is calculated as .

The memory state of the high-pass filter, and are saved for the next frame’s processing. For each sub-frame , a comparison between the short term energy and the short term energy of previous sub-frame or the long term energy is performed to detect whether the current frame is TRANSIENT or not. A transient is detected whenever the energy ratio is above a certain threshold which is larger than 1. Formally, a transient is detected whenever:

()

where is the energy ratio threshold and is set to for INACTIVE frames and otherwise.

It should be noted that if the previous frame did not use SWB Multi-mode FD BWE then the current frame is not classified as a transient frame. In general, the time-frequency transform is applied on a 40ms frame; therefore, a transient affects two consecutive frames. To overcome this, a hangover for a detected transient is applied. A transient detected at a certain frame also triggers a transient in the next frame.

The output of the transient detector is a flag, denoted . The flag is set to the logical value TRUE if a transient is detected or FALSE otherwise.
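
The sub-frame energy comparison and the one-frame hangover described above can be sketched as follows. This is a simplified illustration, not the normative procedure: the high-pass filter is omitted, only the comparison against the long term energy is modelled, the recursive-average form of the long term energy update is an assumption (the text gives only the forgetting factor 0.25), and the threshold is a free parameter because its values are not reproduced here. The class name `TransientDetector` is hypothetical.

```python
SUBFRAME_LEN = 80     # 5 ms at 16 kHz
NUM_SUBFRAMES = 4     # 20 ms frame
FORGETTING = 0.25     # forgetting factor from the text


class TransientDetector:
    def __init__(self, threshold):
        # Placeholder value: the text only requires the threshold to exceed 1.
        self.threshold = threshold
        self.long_term_energy = 0.0
        self.hangover = 0

    def process(self, frame):
        """Return True if the 320-sample frame is classified as TRANSIENT."""
        detected = False
        for l in range(NUM_SUBFRAMES):
            seg = frame[l * SUBFRAME_LEN:(l + 1) * SUBFRAME_LEN]
            energy = sum(x * x for x in seg)
            if (self.long_term_energy > 0.0
                    and energy > self.threshold * self.long_term_energy):
                detected = True
            # Assumed first-order recursive average of sub-frame energies.
            self.long_term_energy = ((1.0 - FORGETTING) * self.long_term_energy
                                     + FORGETTING * energy)
        # Hangover: a transient also marks the following frame as transient.
        is_transient = detected or self.hangover > 0
        self.hangover = 1 if detected else 0
        return is_transient


quiet = [0.01] * 320
burst = [0.01] * 240 + [1.0] * 80          # energy jump in the last sub-frame
detector = TransientDetector(threshold=6.0)  # illustrative threshold
flags = [detector.process(f) for f in (quiet, quiet, burst, quiet, quiet)]
print(flags)
```

The frame following the burst is flagged because of the hangover, even though its own sub-frame energies are low.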

In addition, the parameters of spectral tilt and frame class for the current frame are also used for further refinement of the transient decision.

The spectral tilt of the low frequency signal is calculated by:

()

where, and are calculated by:

()

is initialized to, and if , is adjusted by adding .

The spectral tilt and class of the current frame are checked against threshold values, and the high frequency TRANSIENT signal classification of the current frame is adjusted. Formally, when is TRUE, and , the signal classification is set to FALSE, and also the hangover is set to 0.

Another flag, , is used to indicate whether a transient is present in the low frequency signal of the current frame. It is calculated as follows: when the condition is satisfied, the flag is set to logical TRUE; otherwise it is set to FALSE. Here, , and the convention is that for the first sub-frame, from the previous frame.

5.2.6.2.1.3 Frequency domain classification and coding

In frames that are classified as containing transients, the value of is set (=TRANSIENT). For frames without transients, a frequency sharpness parameter is computed to reflect the spectral fluctuation of the frequency coefficients in the super high band signal and those frames are categorized in one of three classes:

a) HARMONIC: when frequency sharpness is high.

b) NOISE: when frequency sharpness is low.

c) NORMAL: when frequency sharpness is moderate.

The 288 MDCT coefficients in the 6400-13600 Hz frequency range, are split into nine sharpness bands (32 coefficients per band). The frequency sharpness,, is then defined as the ratio of the peak magnitude to the average magnitude in a sharpness band

()

where the maximum magnitude of spectral coefficients in a sharpness band, denoted, is given by

()
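
A minimal sketch of the per-band sharpness computation is given below. It follows the prose definition (peak magnitude divided by average magnitude over the 32 coefficients of a band); the exact form of the equation, for example whether the peak itself is excluded from the average, is not reproduced here and may differ. The function name `sharpness_per_band` is hypothetical.

```python
BAND_WIDTH = 32   # coefficients per sharpness band
NUM_BANDS = 9     # 9 x 32 = 288 MDCT coefficients


def sharpness_per_band(mdct):
    """Return (peaks, sharpness): per-band peak magnitude and
    peak-to-average magnitude ratio."""
    if len(mdct) != BAND_WIDTH * NUM_BANDS:
        raise ValueError("expected 288 MDCT coefficients")
    peaks, sharpness = [], []
    for j in range(NUM_BANDS):
        mags = [abs(c) for c in mdct[j * BAND_WIDTH:(j + 1) * BAND_WIDTH]]
        peak = max(mags)
        mean = sum(mags) / BAND_WIDTH
        peaks.append(peak)
        # An all-zero band is given sharpness 0 to avoid division by zero.
        sharpness.append(peak / mean if mean > 0.0 else 0.0)
    return peaks, sharpness


flat = [1.0] * 288            # noise-like band contents: ratio 1
tonal = [0.0] * 288
tonal[5] = 4.0                # one isolated peak in band 0: ratio 32
flat_peaks, flat_sharp = sharpness_per_band(flat)
peaks, sharp = sharpness_per_band(tonal)
print(max(flat_sharp), sharp[0])
```

The maximum sharpness over all bands is then simply `max(sharp)`.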

Then, six parameters are calculated; the global gain, , the average, the , and three further sharpness parameters are determined; the maximum sharpness, , the sharpness band counter, , and the noise band counter, .

The maximum sharpness,, in all sharpness bands is computed as:

()

The counter is computed from the nine frequency sharpness parameters, , and from the nine maximum magnitudes, , as follows: initialized to zero, is incremented by one for each , , if and .

The counter is computed from the nine frequency sharpness parameters as follows: initialized to zero, is incremented by one for each , if is less than 3.

The class of non-TRANSIENT frames is determined from the three sharpness parameters, , and and four other parameters; the previous frame saved class, , , , and the ratio of the global gains in the current and previous frames.

The threshold of the number of harmonics and the threshold of the maximum sharpness are set according to the signal class of previous frame and the mode of previous extension layer. When the bandwidth is changed i.e., , the and will be decreased for harmonic mode and increased for other signal mode:

()

– If , and , the current frame is classified as a HARMONIC frame (= HARMONIC) and the counter for signal class is incremented by one when is less than twelve. Otherwise, is decremented by one when is larger than zero.

– Then, if, the current frame is also classified as HARMONIC frame (= HARMONIC).

– For other cases, depending on the noise counter,, , and the spectral tilt,the current frame is classified as NORMAL or NOISE:

if,and, the current frame is classified as NOISE frame (= NOISE), otherwise, the current frame is classified as NORMAL frame (= NORMAL).

Two bits are transmitted for SHB signal class coding. Table 63 gives the coded bits for each class.

Table 63: SHB signal class coding

Signal class	Coded bits
NOISE	00
TRANSIENT	01
NORMAL	10
HARMONIC	11

The signal class of the current frame is preserved as for the next frame.

5.2.6.2.1.4 Sub-band division

The 320 MDCT coefficients (at 13.2kbps in the 6150-14150 Hz frequency range, at 32kbps in the 8000-16000 Hz frequency range) are split either into four sub-bands for TRANSIENT frames or into fourteen sub-bands for non-TRANSIENT frames. Table 64 and table 65 define the sub-band boundaries and sizes for TRANSIENT frames and non-TRANSIENT frames respectively. The -th sub-band comprises coefficients where .

Table 64: Sub-band boundaries and number of coefficients per sub-band in TRANSIENT frames

j	Sub-band boundary	Number of coefficients
0	0	76
1	76	76
2	152	84
3	236	84
4	320	–

Table 65: Sub-band boundaries and number of coefficients per sub-band in Non-TRANSIENT frames

j	Sub-band boundary	Number of coefficients
0	0	16
1	16	24
2	40	16
3	56	24
4	80	16
5	96	24
6	120	16
7	136	24
8	160	24
9	184	24
10	208	24
11	232	24
12	256	32
13	288	32
14	320	–
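
As a consistency check, the number of coefficients in each sub-band can be derived from consecutive boundaries, and the widths should sum to 320 in both cases. The sketch below copies the boundaries from tables 64 and 65; the helper names `band_widths` and `split_into_bands` are hypothetical.

```python
# Boundaries copied from Table 64 (TRANSIENT) and Table 65 (non-TRANSIENT).
TRANSIENT_BOUNDS = [0, 76, 152, 236, 320]
NON_TRANSIENT_BOUNDS = [0, 16, 40, 56, 80, 96, 120, 136,
                        160, 184, 208, 232, 256, 288, 320]


def band_widths(bounds):
    """Number of coefficients in each sub-band."""
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def split_into_bands(coeffs, bounds):
    """Split the MDCT coefficients into sub-bands at the given boundaries."""
    if len(coeffs) != bounds[-1]:
        raise ValueError("coefficient count does not match last boundary")
    return [coeffs[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


transient_widths = band_widths(TRANSIENT_BOUNDS)
non_transient_widths = band_widths(NON_TRANSIENT_BOUNDS)
bands = split_into_bands(list(range(320)), NON_TRANSIENT_BOUNDS)
print(transient_widths, sum(non_transient_widths), len(bands))
```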

5.2.6.2.1.5 Spectral envelope calculation and quantization

The spectral envelope, or the energy of each band, is computed as follows:

()

If the current frame is a Non-TRANSIENT frame, energy control is performed to prevent too much noise being applied to the generated spectrum. The energy control adjusts the spectral energy of each band, depending on the different characteristics between the original high frequency spectrum and the base excitation spectrum.

First a spectral copy is created by mapping the frequencies, depending upon the bandwidth, the ACELP coding modes and the FD BWE mode as defined in table 66.

Table 66: Frequency mapping to generate base excitation spectrum in FD BWE

Mode, Bit-rate, ACELP coding modes	l
WB @ 7.2, 8kbps All LP-based modes except for AUDIO	0	160	239	240	319
WB @ 7.2, 8kbps AUDIO; WB @ 13.2kbps All	0	80	159	240	319
SWB @ 13.2kbps NORMAL, NOISE	0	112	239	246	373
	1	112	239	374	501
	2	176	239	502	565
SWB @ 13.2kbps HARMONIC	0	0	239	246	485
SWB @ 13.2kbps HARMONIC	1	128	207	486	565
SWB @ 32kbps NORMAL, NOISE	0	112	239	320	447
	1	112	239	448	575
	2	176	239	576	639
SWB @ 32kbps HARMONIC	0	0	239	320	559
SWB @ 32kbps HARMONIC	1	128	207	560	639
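
The frequency mapping can be illustrated with the sketch below. It assumes that the four numeric columns after the index l are, in order, the first and last source coefficient and the first and last destination coefficient (inclusive); this reading is consistent with the segment lengths in the table but is an assumption, since the column headers are not reproduced above. The function name `spectral_copy` is hypothetical.

```python
# (src_start, src_end, dst_start, dst_end), inclusive indices, taken from the
# "SWB @ 13.2kbps NORMAL, NOISE" rows of Table 66.
SWB_13K2_NORMAL = [(112, 239, 246, 373),
                   (112, 239, 374, 501),
                   (176, 239, 502, 565)]


def spectral_copy(low_band, mapping):
    """Build the high band spectral copy from low band coefficients."""
    dst_lo = mapping[0][2]
    dst_hi = mapping[-1][3]
    out = [0.0] * (dst_hi - dst_lo + 1)
    for src_start, src_end, dst_start, dst_end in mapping:
        if src_end - src_start != dst_end - dst_start:
            raise ValueError("source and destination lengths differ")
        if src_end >= len(low_band):
            raise ValueError("low band spectrum is too short")
        out[dst_start - dst_lo:dst_end - dst_lo + 1] = \
            low_band[src_start:src_end + 1]
    return out


# Use the coefficient index as a dummy value to make the mapping visible.
low_band = [float(k) for k in range(240)]
copy = spectral_copy(low_band, SWB_13K2_NORMAL)
print(len(copy), copy[0], copy[128], copy[256], copy[319])
```

Under this reading the three segments (128 + 128 + 64 coefficients) fill exactly the 320 high band coefficients.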

To generate the base excitation spectrum, the spectral copy is normalized by the sum of its absolute spectral components; the window size used depends on the signal characteristics.

()

The tonality measures used for the energy control are then calculated:

()

The ratio between the tonality of the original high frequency spectrum () and the tonality of the base excitation spectrum () is then calculated as follows:

(810)

where is 0.35.

The envelope control factor is then applied to the envelope :

()

Next the spectral envelope is adjusted by subtracting the mean vectors which are shown in table 67.

()

Table 67: Mean vectors in FD BWE

j	TRANSIENT	Non-TRANSIENT
0	27.23	28.62
1	23.81	28.96
2	23.87	28.05
3	19.51	27.97
4	–	26.91
5	–	26.82
6	–	26.35
7	–	25.98
8	–	24.94
9	–	24.03
10	–	22.94
11	–	22.14
12	–	21.23
13	–	20.40

If the current frame is a TRANSIENT frame, the following smoothing processes are applied before the envelope quantization.

– If is TRUE, the coder type is INACTIVE and the transient hangover is equal to one, the flag is set to 1, and the time envelope is adjusted as follows:

()

– Otherwise, the adjustment is as follows:

()

If the current frame is a Non-TRANSIENT frame,

()

If the current frame is a TRANSIENT frame, the mean squared error (MSE) criterion is used for the search of the VQ; in a Non-TRANSIENT frame, the weighted mean squared error (WMSE) is used. The weighting serves to emphasise the lower frequency bands and is calculated by two methods: one is a deterministic weighting based solely on the frequency, and the other is a weighting that is calculated based upon the envelope. The first frequency weighting is defined in table 68 and the second frequency weighting is calculated as follows:

()

Table 68: Frequency weighting

j	Non-TRANSIENT
0	1.0
1	0.97826087
2	0.957446809
3	0.9375
4	0.918367347
5	0.9
6	0.882352941
7	0.865384615
8	0.849056604
9	0.833333333
10	0.818181818
11	0.803571429
12	0.789473684
13	0.775862069
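
The values in table 68 agree, to the printed precision, with the ratio 45/(45 + j). The sketch below regenerates them; this closed form is inferred from the tabulated numbers and is not stated in the text, and the function name `frequency_weighting` is hypothetical.

```python
# Values copied from Table 68.
TABLE_68 = [1.0, 0.97826087, 0.957446809, 0.9375, 0.918367347, 0.9,
            0.882352941, 0.865384615, 0.849056604, 0.833333333,
            0.818181818, 0.803571429, 0.789473684, 0.775862069]


def frequency_weighting(num_bands=14):
    """Deterministic weighting that decreases with the band index
    (assumed closed form 45 / (45 + j))."""
    return [45.0 / (45.0 + j) for j in range(num_bands)]


weights = frequency_weighting()
max_error = max(abs(a - b) for a, b in zip(weights, TABLE_68))
print(f"largest deviation from Table 68: {max_error:.1e}")
```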

The SWB spectral envelope is quantized with a multi-stage split VQ using envelope interpolation as in figure 52. In the first stage, the indices of two or three candidates (, three in a TRANSIENT frame, two in a Non-TRANSIENT frame) are chosen using the error minimization criterion. The set of candidates with the least quantization error, taking into account all quantization steps, is then selected and the selected indices are transmitted.

Figure 52: Envelope VQ in a TRANSIENT frame and a Non-TRANSIENT frame

Again during the first stage, values in even positions are selected and quantized using VQ with 5 bits for Non-TRANSIENT frames and 7 bits for TRANSIENT frames.

(817)

(818)

In Non-TRANSIENT frames, the candidate indices from the first stage VQ are defined as . The quantization error is calculated and the error is split into and and quantized using 7 bits and 6 bits respectively, as follows:

()

then;

()

The candidate indices from the second stage VQ are defined as and .

The two quantized values are then combined:

(821)

At odd positions, an interpolation using boundary values is applied for intra-frame prediction and the predicted error is calculated:

()

The errors are then split into and and quantized using 5 bits and 6 bits respectively.

()

The candidate indices from the third stage VQ are defined as. In a TRANSIENT frame only 2 stages of quantization are applied. At odd positions, an interpolation using boundary values is applied for intra-frame prediction and the predicted error is calculated and quantized

(824)

The candidate indices from the second stage VQ are defined as .

The final selected set of indices for a Non-Transient frame, or for a Transient frame are then transmitted.

5.2.6.2.1.6 Time envelope calculation and encoding

In the case of TRANSIENT frames, i.e. when the super higher band (SHB) signal of the input in the previous frame is detected as TRANSIENT and that in the current frame is detected as NON-TRANSIENT, or when the SHB signal of the input in the current frame is detected as TRANSIENT, the time envelope is also calculated. The time envelope, which represents the temporal energy of the SHB signal, is computed as a set of root mean square (RMS) calculations from each 80 samples of the time-domain SHB signal. This results in four time envelope coefficients per frame.

()

The time envelope is firstly adjusted by the attenuated value which represents the energy attenuation of the WB signal, and the attenuated value is calculated by the original WB signal and local synthesized WB signal:

()

In order to highlight the characteristics of the transient signal, the time envelope of the transient signal is modified. A reference sub-frame is first selected from the sub-frames of the input transient signal: it is the sub-frame whose envelope has the maximal amplitude compared with the envelopes of the remaining sub-frames. The time envelope of the reference sub-frame is then increased while the envelopes of the sub-frames before and after the reference sub-frame are decreased. The adjusted time envelopes of the SHB signal of the current frame are then quantized and coded into the bitstream. To obtain a better transient effect, in the sub-frames before and after the reference sub-frame the difference between the decreased time envelope and the maximum time envelope is greater than a preset threshold.

– If the sub-frame index which is defined as is less than 4, the time envelope is adjusted by:

()

where is the sub-frame index with the maximum time envelope . It should be noted that when the sub-frame index , and when the condition is satisfied, the adjustment of the time envelope of sub-frame can be performed.

– For other cases, to obtain the time envelopes of the SHB signal of the current frame used to encode, the adjustment is as follows:

()

and

()

In addition, when is TRUE, the coder type is INACTIVE and the transient hangover is equal to one, the flag is then set to 1, and the time envelope is adjusted as follows

()

Finally, the values are further bounded in the range [0,…,15]: ,.

The adjusted time envelopes are rounded and quantized with four bits using uniform scalar quantization in the case of TRANSIENT frames.
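
The first and last steps of the time envelope coding can be sketched as follows: the RMS over each block of 80 samples, and the final bounding to [0, 15] followed by rounding to a 4-bit index. The intermediate adjustments, and the domain in which the envelope is expressed before quantization, are not reproduced here, so `quantize_time_envelope` simply assumes it receives already adjusted values. Both function names are hypothetical.

```python
import math


def rms_per_subframe(shb, sub_len=80):
    """RMS of each block of `sub_len` samples (four values for 320 samples)."""
    return [math.sqrt(sum(x * x for x in shb[i:i + sub_len]) / sub_len)
            for i in range(0, len(shb), sub_len)]


def quantize_time_envelope(adjusted):
    """Bound each value to [0, 15] and round to the nearest integer,
    giving one 4-bit index per sub-frame (16 bits per frame)."""
    return [int(min(max(v, 0.0), 15.0) + 0.5) for v in adjusted]


envelope = rms_per_subframe([1.0] * 320)
indices = quantize_time_envelope([-2.0, 3.4, 7.5, 20.0])
print(envelope, indices)
```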

5.2.6.2.1.7 Bit allocation for FD BWE

Table 69 illustrates the BWE bit allocation for TRANSIENT and Non-TRANSIENT frames.

Table 69: SWB FD BWE bit allocation

Signal class	Signal class bits ()	Time envelope ()	Spectral envelope ()	Total bits
TRANSIENT	2	16 (=4×4)	13 (=7+6)	31
NON-TRANSIENT	2	–	29 (=5+7+6+5+6)	31
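
The bit budget can be checked with a few lines of arithmetic. The per-field bit counts below are taken from subclause 5.2.6.2.1 and table 69; the dictionary layout and the function name `total_bits` are illustrative only.

```python
SIGNAL_CLASS_BITS = 2

# Bits per codebook index or per time envelope value.
ALLOCATION = {
    "TRANSIENT": {
        "time_envelope": [4, 4, 4, 4],        # four 4-bit scalar indices
        "spectral_envelope": [7, 6],          # two-stage VQ
    },
    "NON-TRANSIENT": {
        "time_envelope": [],                  # no time envelope is coded
        "spectral_envelope": [5, 7, 6, 5, 6],  # three-stage split VQ
    },
}


def total_bits(signal_class):
    """Total SWB FD BWE bits for a frame of the given class."""
    fields = ALLOCATION[signal_class]
    return (SIGNAL_CLASS_BITS
            + sum(fields["time_envelope"])
            + sum(fields["spectral_envelope"]))


totals = {name: total_bits(name) for name in ALLOCATION}
print(totals)
```

Both classes use the same 31-bit budget, so the frame class does not change the size of the BWE payload.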

5.2.6.2.2 WB Multi-mode FD Bandwidth Extension

At 13.2kbps, for frames declared as HARMONIC (HM) or NORMAL (NM), a bit budget of 6 bits is allocated to the WB Multi-mode FD Bandwidth Extension. One bit is allocated to the signal class and five bits are allocated to encode two spectral envelopes which are calculated from the 80 MDCT coefficients of the higher band (HB) signal, .

At 7.2kbps or 8kbps, it is blind BWE and no bit budget is allocated. In this case, a two-stage blind BWE is used. In the first stage, the high band frequency generation is the same as the BWE in AMR-WB [9], described in subclauses 6.3.1, 6.3.2.2 and 6.3.3 of [9], and it is added to the ACELP core synthesis. Then the second stage BWE is generated as described in the following sub-clauses, and it is added to the core synthesis with the first stage BWE. At 5.9 kbps VBR coding and CNG coding up to 8.0 kbps, the BWE is also blind with no bit budget allocated, but only the first stage BWE is used.

5.2.6.2.2.1 Windowing and time-to-frequency transformation

The input high-pass filtered signal is delayed by samples and windowed to obtain the windowed input signal as shown in figure , where .

A 320-point length MDCT on top of is used for WB FD BWE. Refer to subclause 5.3.2.

5.2.6.2.2.2 Frequency domain classification and coding

At 13.2kbps, frequency domain classification is performed. A frequency sharpness parameter is computed to reflect the spectral fluctuation of the frequency coefficients in the higher band signal and those frames are categorized in one of two classes:

a) HARMONIC: when frequency sharpness is high.

b) NORMAL: when frequency sharpness is moderate.

The 96 MDCT coefficients in the 5600-8000 Hz frequency range are split into three sharpness bands (32 coefficients per band). The frequency sharpness,, is then defined as the ratio of the peak magnitude to the average magnitude in a sharpness band

()

where the maximum magnitude of spectral coefficients in a sharpness band, denoted, is given by

()

Then, another two sharpness parameters are determined: the maximum sharpness, , and the sharpness band counter, .

The maximum sharpness,, in all sharpness bands is computed as:

()

The counter is computed from the three frequency sharpness parameters, , and from the three maximum magnitudes, , as follows: initialized to zero, is incremented by one for each if and .

The class of HARMONIC frame is determined from these three sharpness parameters, , and the class of the previous frame, .

The threshold of the number of harmonics and the threshold of the maximum sharpness are set according to the class of the previous frame and the mode of previous extension layer:

()

Initialize the signal class of the current frame as NORMAL frame (= NORMAL).

– If and , the current frame is classified as a HARMONIC frame (= HARMONIC) and the counter for signal class is initialized to 0, and incremented by one when is less than twelve. Otherwise, is decremented by one when is larger than zero.

– If the counter for signal class is not less than 2, the current frame is also classified as HARMONIC.

One bit is transmitted for HB signal class coding. Table 70 gives the coded bit for each class.

Table 70: HB signal class coding

Signal class F_class	Coded bit
NORMAL	0
HARMONIC	1

The signal class of the current frame is preserved for the next frame, where it serves as the class of the previous frame.

5.2.6.2.2.3 Spectral envelope calculation and quantization

Each spectral envelope is computed from 40 coefficients of the frequency-domain HB signal. This results in two spectral envelopes per frame.

()

Then envelope control is performed to prevent too much noise being applied to the reconstructed spectrum. First a spectral copy is created by mapping the frequencies, depending upon the bandwidth, the ACELP coding modes and the FD BWE mode as defined in table 66.

To generate the base excitation spectrum, the spectral copy is normalized by the sum of its absolute spectral components. The adaptive normalization length parameter is calculated from the original WB MDCT coefficients:

– The 256 WB MDCT coefficients in the 0-6400 Hz frequency range, are split into 16 sharpness bands (16 coefficients per band). In sharpness band j, if and , the counter is incremented by one.

where, and the maximum magnitude of the spectral coefficients in a sharpness band, denoted, is:

()

The counter is initialized to 0 and calculated for every frame.

– Then the normalization length is obtained:

()

where the current normalization length is calculated depending on the HB signal class:

()

and the current normalization length is preserved for the next frame.

Then the base excitation spectrum is obtained by:

()

where, are the spectral copy coefficients, and the normalized envelope is calculated by:

()

The tonality measures used for the energy control are then calculated:

(843)

The ratio between the tonality of the original high frequency spectrum () and the tonality of the base excitation spectrum () is then calculated as follows:

Void (844)

(845)

where is 0.35.

The envelope control factor is then applied to the envelope:

()

Then the spectral envelope in the log domain is obtained by:

()

Finally, the spectral envelopes are quantized with a 64-dimensional array described in table 71.

The distance between the spectral envelopes and the codebook is calculated by:

()

and the index is encoded with 5 bits.

Table 71: Codebook

0	1	2	3	4	5
1.1606680	0.6594560	-4.9874350	-5.1700310	10.230799	-0.0125740
6	7	8	9	10	11
10.605126	9.7910260	-0.3739880	-0.6027910	6.2753817	0.3307670
12	13	14	15	16	17
9.4537100	8.8558020	2.9320890	2.1643160	3.1332030	2.9710870
18	19	20	21	22	23
8.061906	-0.5905290	15.754963	5.0496380	17.227070	18.329395
24	25	26	27	28	29
-2.4710190	-3.1725330	-1.4136470	-1.9457110	15.147771	14.506490
30	31	32	33	34	35
11.358370	11.714662	9.4275510	-0.1223030	7.0970160	-1.5805260
36	37	38	39	40	41
12.498663	3.1614850	10.349261	1.5185040	5.3809850	-1.7341900
42	43	44	45	46	47
1.1224600	-2.2397020	12.362551	12.133788	4.2788690	-1.7729040
48	49	50	51	52	53
6.1577130	5.4971410	3.3243130	-2.5710470	19.097071	9.3576920
54	55	56	57	58	59
7.6509204	7.4404626	0.5055090	-3.7073090	18.584702	11.302494
60	61	62	63
18.706564	18.308905	23.010420	22.915377
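The codebook search can be illustrated with a short sketch. It rests on two assumptions that are not stated in the text: that the 64 values of table 71 form 32 two-dimensional codevectors stored as consecutive pairs (which would match the two envelopes per frame and the 5-bit index), and that the distance is the squared error. Only the first four pairs of the table are used here, and `nearest_codevector` is a hypothetical name.

```python
from typing import Sequence

# First eight values of table 71, read as four (env0, env1) pairs.
# Assumption: consecutive table entries form one two-dimensional codevector.
CODEBOOK_SUBSET = [
    1.1606680, 0.6594560,
    -4.9874350, -5.1700310,
    10.230799, -0.0125740,
    10.605126, 9.7910260,
]


def nearest_codevector(env: Sequence[float], flat_codebook: Sequence[float]) -> int:
    """Return the index of the codevector with the smallest squared error."""
    dim = len(env)
    if dim == 0 or len(flat_codebook) % dim != 0:
        raise ValueError("codebook length must be a multiple of the vector size")
    best_index, best_dist = 0, float("inf")
    for i in range(len(flat_codebook) // dim):
        codevector = flat_codebook[i * dim:(i + 1) * dim]
        dist = sum((e - c) ** 2 for e, c in zip(env, codevector))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index
```

Run over the full table, the returned index would range from 0 to 31 and fit in the 5 bits mentioned above.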

5.2.6.2.2.4 Bit allocation for FD BWE

Table 72 illustrates the WB BWE bit allocation.

Table 72: WB FD BWE bit allocation

Signal class	Signal class bit	Spectral envelope	Total bits
HARMONIC	1	5	6
NORMAL	1	5	6

5.2.6.3 Coding of upper band at 64 kb/s

The SWB, resp. FB, signal at 64 kbps is coded in two bands. The lower band, which covers 0-8 kHz, is coded using the LP-based coding at 16 kHz internal sampling rate as described earlier, and the upper band, which extends the coded bandwidth up to 16 kHz, resp. 20 kHz, is coded using a high-rate upper band coding. The same upper band coding with a fixed 16 kbps bit-budget is used in all GC, TC and IC frames. This bit-budget may additionally be increased by unused bits coming from the AVQ within the combined algebraic codebook.

The upper band coding is mostly done in the MDCT domain and has two modes: normal and transient. While normal mode is used in most generic and voiced frames, transient mode minimizes the pre-echo and post-echo in frames where the signal at the frame beginning differs significantly from the signal at the frame end, e.g. onsets. Detection of transient frames is done in the time domain using a detector described in .

First the input signal, filtered by the HP filter and sampled at 32 or 48 kHz, is transformed using MDCT and OLA function. In normal mode, the whole frame is transformed at once, while in transient mode the frame is divided into four sub-frames and thus four sets of spectral coefficients are present. In both modes only spectrum coefficients between 7.6 kHz and 14.4 kHz are encoded. In normal mode this range is divided into four bands of 1.7 kHz width each; in transient mode it is divided into two bands of 3.4 kHz each. The other frequency coefficients are zeroed. Consequently, 272 spectral coefficients (four bands of 68) are encoded in normal mode frames, and 68 spectral coefficients (two bands of 34) are encoded in each sub-frame of transient mode frames.
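The band sizes can be checked with a little arithmetic. The sketch below assumes an MDCT bin spacing of 25 Hz in normal mode (a 20 ms frame) and 100 Hz in transient mode (four sub-frames); these spacings are not stated in this subclause, but they reproduce the 68 and 34 coefficients per band given for the two modes. The function name `band_layout` is hypothetical.

```python
from typing import List, Tuple

BIN_HZ_NORMAL = 25.0      # assumption: one transform over a 20 ms frame
BIN_HZ_TRANSIENT = 100.0  # assumption: four sub-frames, four times coarser


def band_layout(lo_hz: float, hi_hz: float, num_bands: int,
                bin_hz: float) -> List[Tuple[int, int]]:
    """Return (first_bin, end_bin) per band, with end_bin exclusive."""
    start = round(lo_hz / bin_hz)
    stop = round(hi_hz / bin_hz)
    total = stop - start
    if total % num_bands != 0:
        raise ValueError("range does not split evenly into bands")
    width = total // num_bands
    return [(start + b * width, start + (b + 1) * width) for b in range(num_bands)]


normal_bands = band_layout(7600.0, 14400.0, 4, BIN_HZ_NORMAL)
transient_bands = band_layout(7600.0, 14400.0, 2, BIN_HZ_TRANSIENT)
```

Under these assumptions the normal-mode range starts at bin 304 and ends before bin 576, and each transient sub-frame covers bins 76 to 143.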

5.2.6.3.1 Coding in normal mode

In normal mode frames, the global gain is computed on the spectrum 7.6 kHz – 14.4 kHz as follows

(849)

and quantized using a 5-bit log gain quantizer over the range [3.0, 500.0].
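One plausible reading of a 5-bit log gain quantizer is a uniform scalar quantizer applied to the logarithm of the gain. The sketch below assumes 32 levels spaced uniformly in the log10 domain with both range limits included as reconstruction points, and clamps out-of-range gains; the exact step rule and the name `quantize_log_gain` are assumptions, not taken from the text.

```python
import math
from typing import Tuple


def quantize_log_gain(gain: float, g_min: float = 3.0, g_max: float = 500.0,
                      bits: int = 5) -> Tuple[int, float]:
    """Return (index, reconstructed gain) for a uniform log-domain quantizer."""
    levels = 1 << bits
    log_min = math.log10(g_min)
    step = (math.log10(g_max) - log_min) / (levels - 1)
    clamped = min(max(gain, g_min), g_max)  # assumption: clamp to the range
    index = int(math.floor((math.log10(clamped) - log_min) / step + 0.5))
    index = min(max(index, 0), levels - 1)
    return index, 10.0 ** (log_min + index * step)
```

A uniform step in the log domain gives a constant relative error across the range, which suits a gain that spans more than two orders of magnitude.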

The quantized global gain is then used to normalize the spectrum:

()

where is the start frequency bin of spectrum reconstruction and for normal frames.

Then the spectrum envelope is computed in four bands of 68 spectral coefficients each.

()

The spectrum envelope is quantized using two two-dimensional VQs, with the 6-bit codebook of table 73 and the 5-bit codebook of table 74, respectively.

Table 73: 6 bits spectral envelope VQ codebook


0	0.044983	0.0417	22	8.919388	9.762914	44	15.26931	21.53914
1	0.524276	0.469365	23	11.29932	11.7639	45	16.98352	24.69959
2	0.671757	0.605513	24	11.78222	5.879754	46	19.59173	22.68968
3	0.983501	0.855093	25	14.05046	9.665228	47	20.1462	25.88847
4	1.227874	1.1322	26	11.20153	9.001128	48	17.79742	19.45312
5	1.672212	1.432704	27	14.43475	13.23657	49	21.29062	20.18658
6	2.548211	2.361091	28	14.33726	3.904411	50	24.09732	19.08672
7	3.196961	3.306999	29	20.07105	4.335061	51	23.61309	22.54586
8	2.580753	5.217478	30	18.10581	8.223599	52	23.68201	16.32824
9	4.207751	7.243802	31	22.35229	9.603263	53	26.88655	19.40244
10	3.517157	1.738487	32	7.242756	16.56449	54	26.00977	15.63221
11	4.381567	2.753657	33	11.77753	19.16765	55	28.93993	16.24062
12	4.758266	4.696094	34	11.1218	15.45598	56	25.09448	12.36642
13	6.827988	6.106459	35	14.56358	17.35957	57	27.71338	13.26328
14	4.450459	10.13121	36	17.82122	11.89472	58	28.33095	10.32926
15	7.256045	12.48804	37	17.46603	15.29606	59	30.63283	12.85128
16	6.70872	1.953339	38	21.33696	13.45518	60	25.2738	6.138124
17	6.60403	3.69956	39	20.54434	17.12537	61	29.19534	7.222413
18	10.61273	2.537916	40	9.056358	22.33831	62	32.17132	5.019567
19	9.387467	4.241173	41	11.23842	28.83252	63	31.979	9.473855
20	7.119045	8.281485	42	13.26273	25.14338
21	9.062854	7.086526	43	16.24356	28.25685

Table 74: 5 bits spectral envelope VQ codebook


0	0.512539	0.472507	16	13.67263	5.457414
1	1.338963	1.108591	17	16.47199	3.917684
2	2.544041	1.759765	18	20.91033	6.43281
3	3.124053	3.045299	19	25.45733	8.61722
4	4.892713	3.721097	20	16.4107	7.574456
5	4.010297	5.750862	21	18.57439	10.2915
6	5.111215	2.164709	22	22.08876	12.51216
7	6.667518	3.893404	23	21.17053	17.20871
8	8.454117	2.75143	24	5.276107	9.62247
9	11.12357	3.518174	25	9.093585	11.27469
10	6.622948	5.960704	26	11.94566	15.53814
11	8.562429	5.003579	27	16.55041	15.04656
12	8.919363	7.784057	28	6.358148	17.5474
13	10.75904	5.959438	29	13.31662	21.76552
14	12.44919	8.359519	30	7.646096	26.10672
15	13.67701	11.23058	31	2.451297	31.9331

The quantized spectrum envelope is then used to further normalize the spectrum, resulting in a spectrum normalized per band:

()

Afterwards, the number of the bands to be quantized is calculated according to the total bits and the saturated threshold, and the bands are selected according to the quantized spectrum envelopes. Once the bands are selected, the first stage encoding is processed by means of AVQ. If there are at least 14 remaining bits after the first stage encoding and the first stage quantized spectrum is non-zero, a second stage encoding is employed also by means of AVQ.

Calculate the number of the bands to be quantized according to total bits and the saturated threshold, and select the bands to be quantized according to the quantized envelopes as follows:

– If , the selected bands are given by:

()

– Otherwise, the selected bands are given by:

()

where the and are set as follows:

()

where is the sub-band index with the minimum envelope , and it is calculated by:

()

The number of the sub-bands is obtained according to the number of the total bits and the saturated threshold as follows:

()

Quantize the normalized coefficients of the selected bands (sub-bands) by AVQ to obtain the quantized normalized coefficients.

The envelope of the spectrum between 14.4 kHz and 16 kHz in SWB is predicted by:

()

And the envelope of the spectrum between 14.4 kHz and 20 kHz in FB is calculated and quantized as follows:

()

where is the start frequency bin of spectrum reconstruction and for normal frames. is the width of the MDCT coefficients between 14.4 kHz and 20 kHz in FB, and is set by:.

The ratio of the envelopes is refined:

()

And the index of attenuation factor is obtained according to the ratio of the envelopes, and the envelope of the spectrum between 14.4 kHz and 20 kHz in FB is finally obtained according to the attenuation factor:

()

Then the index is encoded with 2 bits.

If the number of the remaining bits after the first stage encoding is larger than 14, then the second stage encoding is needed. Select sub-bands from the first stage selected sub-bands to perform the second stage encoding, and the number of the sub-bands is calculated according to the number of remaining bits and the saturated threshold.

The input coefficients of the second stage encoding are obtained by reordering the differences between the original normalized coefficients and the quantized normalized coefficients as follows:

First, in the sub-bands with the AVQ codebook index,

()

where the index is initialized to 0, and is incremented by 1 when.

Then, in the sub-bands , if , , and the index is incremented by 1.

The number of the sub-bands is calculated according to the number of the remaining bits and the saturated threshold:

()

The coefficients of the first sub-bands in are selected to perform the second stage encoding as follows:

The global gain of the second stage encoding is calculated and quantized as follows:

()

and the spectrum is normalized with as follows:

()

Then the normalized spectrum is quantized by AVQ.

5.2.6.3.2 Coding in transient mode

In transient mode frames, a similar procedure as in normal mode frames is employed and the following descriptions focus on the differences.

The total bit-budget is divided by four to obtain a bit-budget for every sub-frame, and the encoding is performed four times (once for every sub-frame). In each sub-frame, the global gain is computed on the spectrum 7.6 kHz – 14.4 kHz using equation () and quantized using a 5-bit log gain quantizer over the range [3.0, 500.0].

()

where is the start frequency bin of spectrum reconstruction and for transient frames. is the frame length. for SWB signal and for FB signal

The spectrum envelope is computed in two bands for each sub-frame, thus each band contains 34 spectral coefficients.

Then, the spectral envelope of normalized spectrum is calculated:

()

where,

()

The 8 spectral envelopes in the 4 sub-frames are divided into 4 two-dimensional vectors, i.e. the two spectral envelopes in each sub-frame are combined as a vector, and quantized using two-dimensional VQs. For the first vector, the spectral envelopes in the first sub-frame are quantized using one two-dimensional VQ by means of the 4-bit codebook defined in table 75.

Table 75: 4 bit spectral envelope VQ codebook


0	0.799219	0.677609
1	1.754571	1.215689
2	2.846222	2.017775
3	4.379336	1.975914
4	5.935472	2.945818
5	3.938621	4.220399
6	8.080808	2.632276
7	7.579771	4.986835
8	4.956485	10.36366
9	7.739148	8.652471
10	9.238397	7.051655
11	10.205707	5.619638
12	10.645117	4.374648
13	11.66018	3.474015
14	10.845836	2.664596
15	11.724073	1.637023

The index of this VQ is denoted . This 4-bit codebook can be divided into two 3-bit codebooks. The first 3-bit codebook is , and the second is , in table 75. One of the two 3-bit codebooks is then selected according to the index of the first quantized vector.

If , the first 3-bit codebook is selected as the new codebook;

if , the second 3-bit codebook is selected as the new codebook.

Then the following three vectors are quantized using the 3-bit codebook selected above.
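The two-step envelope quantization in transient mode can be sketched as follows, using the codebook of table 75. The sketch assumes that the two 3-bit codebooks are the entries 0 to 7 and 8 to 15 of the table, that the half containing the first index is the one selected, and that the search minimizes the squared error; none of these details is confirmed by the text. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

CODEBOOK_4BIT = [
    (0.799219, 0.677609), (1.754571, 1.215689), (2.846222, 2.017775),
    (4.379336, 1.975914), (5.935472, 2.945818), (3.938621, 4.220399),
    (8.080808, 2.632276), (7.579771, 4.986835), (4.956485, 10.36366),
    (7.739148, 8.652471), (9.238397, 7.051655), (10.205707, 5.619638),
    (10.645117, 4.374648), (11.66018, 3.474015), (10.845836, 2.664596),
    (11.724073, 1.637023),
]


def _nearest(vec: Sequence[float], candidates: range) -> int:
    """Index within `candidates` of the closest codevector (squared error)."""
    return min(candidates,
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, CODEBOOK_4BIT[i])))


def quantize_transient_envelopes(
        vectors: Sequence[Sequence[float]]) -> Tuple[int, List[int]]:
    """Return (4-bit index of the first vector, 3-bit indices of the other three)."""
    if len(vectors) != 4:
        raise ValueError("expected one envelope vector per sub-frame")
    first = _nearest(vectors[0], range(16))
    # Assumption: the half of the table that holds `first` becomes the 3-bit codebook.
    base = 0 if first < 8 else 8
    rest = [_nearest(v, range(base, base + 8)) - base for v in vectors[1:]]
    return first, rest
```

Under these assumptions the four envelope vectors of a frame cost 4 + 3 × 3 = 13 bits in total.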

The quantized spectral envelope is applied to the spectrum:

()

where , and .

Then, the normalized spectrum is quantized by AVQ.

If , the envelope of the spectrum between 14.4 kHz and 16 kHz in SWB is predicted by:

()

If , the envelope of the spectrum between 14.4 kHz and 20 kHz in FB is calculated and quantized as follows:

()

where, is the start frequency bin of spectrum reconstruction and for transient frames. is the width of the MDCT coefficients between 14.4 kHz and 20 kHz in FB, and is set by.

The ratio of the envelopes is refined as:

()

where . Then the index is encoded with 2 bits.

Finally, the unused AVQ bits from the current sub-frame are employed in the subsequent sub-frame within the same frame.
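The carry-over of unused bits between sub-frames can be sketched as a running budget. The sketch assumes an integer division of the total budget by four and models the AVQ demand of each sub-frame as a given number (`bits_needed`); both are illustrative assumptions, as are the function name and the figure of 320 bits used in the test.

```python
from typing import List, Sequence, Tuple


def allocate_subframe_bits(total_bits: int,
                           bits_needed: Sequence[int]) -> Tuple[List[int], int]:
    """Return (bits spent per sub-frame, bits left after the last sub-frame)."""
    if len(bits_needed) != 4:
        raise ValueError("expected one demand value per sub-frame")
    per_subframe = total_bits // 4  # assumption: remainder handling is not modelled
    carry = 0
    spent = []
    for need in bits_needed:
        available = per_subframe + carry
        used = min(need, available)
        spent.append(used)
        carry = available - used  # unused bits move to the next sub-frame
    return spent, carry
```

Bits only flow forward within a frame, so a sub-frame can benefit from savings in the sub-frames before it but not from those after it.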