6.8.3 High frequency band

3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description, Release 15


The high-frequency band generation for modes from 6.6 to 23.05 kbit/s is illustrated in figure 113. The high band is produced by generating an over-sampled excitation signal in the DCT domain, extended into the 6400-8000 Hz band above the 0-6400 Hz low band. In practice the extension covers a slightly wider band (6000-8000 Hz) to facilitate the addition of the low and high bands, especially in the cross-over region around 6400 Hz. Tonal and ambiance components in the extended band are extracted and combined adaptively to obtain the extended excitation signal, which is then filtered in the DCT domain. After the inverse DCT, gains are applied in the time domain (by sub-frame) and the extended excitation signal is filtered by an LP filter whose coefficients are derived from the low-band LP filter.

Figure 113: High-frequency band generation in AMR-WB IO modes

6.8.3.1 Preliminary estimation steps

The low-frequency band signal is extended to obtain the high-frequency band signal by a bandwidth extension algorithm, which comprises the estimation of gains and the prediction of the excitation of the high-frequency band signal.

The gains of the high-frequency band signal are estimated from the pitch, noise gate factor, voice factor, classification parameter and LPC.

The excitation of the high-frequency band signal is adaptively predicted from the decoded low-frequency band excitation signal (the sum of the adaptive codebook contribution and the algebraic codebook contribution) according to the LSF parameters and the bitrate.

The excitation of the high-frequency band signal is modified by the gains of the high-frequency band signal, and LP synthesis is performed by filtering the modified excitation signal through the LP synthesis filter to obtain the high-frequency band signal.

The parameters used for estimating the gains and predicting the excitation of the high-frequency band signal are either decoded from the low-band bitstream or calculated from the decoded low-band signal.

6.8.3.1.1 Estimation of tilt, figure of merit and voice factors

1) Calculate the spectrum tilt factor of each subframe according to the decoded low frequency band as follows:

()

where denotes the length of sub frame and denotes the decoded low frequency band signal. is preserved as for the following unvoiced flag calculation.

2) Calculate the sum of the differences between every two adjacent pitch values:

()

where is the pitch value of each subframe.

3) If the frame counter is greater than 100 and the FEC class of current frame is , set to 0 and set the minimum noise gate to -30. Otherwise, if is greater than 200, set to 200; if not, is increased by 1; If the noise gate is less than, set to.

4) Calculate the average voice factor as follows:

()

where denotes the voice factor of each subframe.

5) Based on the classify parameter from FEC classification, determine two parameters and.

()

and if the FEC class of current frame is , .

Then is further modified by the average voice factor and smoothed as follows:

()

Set to .

If is less than 0.5, set to 1; Otherwise, set to . Then smooth as follows:

()

Set to .

6) If the sum of the differences between every two adjacent pitch values is less than 10 and the spectrum tilt factor of current subframe is less than zero, reset to 0.2. Then, if is greater than 0.2, reset to 0.8; Otherwise, reset to . Finally, modify as follows:

()
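Steps 2 and 4 above can be sketched as follows. This is only an illustration: the equations are not reproduced here, so the sketch assumes that the "differences" in step 2 are absolute differences and that the average voice factor in step 4 is a plain arithmetic mean over the sub-frames. The function names and the sample values are hypothetical.

```python
def pitch_difference_sum(pitch):
    """Sum of absolute differences between adjacent sub-frame pitch values.

    Assumption: absolute differences are used; the specification's
    equation is not available in this text.
    """
    return sum(abs(b - a) for a, b in zip(pitch, pitch[1:]))


def average_voice_factor(voice_factors):
    """Arithmetic mean of the per-sub-frame voice factors (assumed form)."""
    return sum(voice_factors) / len(voice_factors)


# Hypothetical values for one frame of four sub-frames.
pitch = [52.0, 53.0, 53.5, 55.0]
voice_factors = [0.4, 0.6, 0.5, 0.3]

pitch_diff_sum = pitch_difference_sum(pitch)
avg_vf = average_voice_factor(voice_factors)
```

A small pitch-difference sum indicates a stable pitch contour, which is the condition tested in step 6 (sum less than 10).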

6.8.3.1.2 Estimation of sub-frame gains based on LP spectral envelopes

The signal in the low band (0-6400 Hz) is generated based on a source-filter model, where the filter is given by the synthesis filter $1/\hat{A}(z)$. Similarly, as shown in subclause 6.8.3.3, the signal in the high band (above 6400 Hz) is generated based on a source-filter model; the filter in the high band is a linear predictive (LP) filter derived from the LP filter in the low band.

Since the low and high bands are combined in the final synthesis, a preliminary equalization step is performed to match the levels of the two LP filters at a given frequency. At 6400 Hz the response of $1/\hat{A}(z)$ is already falling off too steeply; a frequency of 6000 Hz has therefore been chosen as the equalization point.

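In each sub-frame, the frequency responses of the LP filter in the low band and of the LP filter in the high band are computed at the frequency of 6000 Hz:

$R = \dfrac{1}{\left|\hat{A}\left(e^{j\theta}\right)\right|}$, $\theta = 2\pi\,\dfrac{6000}{12800}$ (2024)

and

$P = \dfrac{1}{\left|\hat{A}\left(e^{j\theta'}/\gamma\right)\right|}$, $\theta' = 2\pi\,\dfrac{6000}{16000}$ (2025)

where $\gamma$ = 0.9 at 6.6 kbit/s and 0.6 at other modes (from 8.85 to 23.85 kbit/s).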

These values are computed efficiently using the following pseudo-code:

px = py = 0
rx = ry = 0
for i = 0 to 16
    px = px + Ap[i]*exp_tab_p[i]
    py = py + Ap[i]*exp_tab_p[33-i]
    rx = rx + Aq[i]*exp_tab_q[i]
    ry = ry + Aq[i]*exp_tab_q[33-i]
end for
P = 1/sqrt(px*px+py*py)
R = 1/sqrt(rx*rx+ry*ry)

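where Aq[i] = $\hat{a}_i$ are the coefficients of $\hat{A}(z)$, Ap[i] = $\gamma^i \hat{a}_i$ are the coefficients of $\hat{A}(z/\gamma)$, sqrt() corresponds to the square root operation and the tables exp_tab_p and exp_tab_q of size 34 contain the real and imaginary parts of complex exponentials at 6000 Hz:

exp_tab_p[i] = $\cos\left(2\pi\,\dfrac{6000}{16000}\,i\right)$ for $i = 0, \ldots, 16$, and $-\sin\left(2\pi\,\dfrac{6000}{16000}\,(33-i)\right)$ for $i = 17, \ldots, 33$ (2026)

and

exp_tab_q[i] = $\cos\left(2\pi\,\dfrac{6000}{12800}\,i\right)$ for $i = 0, \ldots, 16$, and $-\sin\left(2\pi\,\dfrac{6000}{12800}\,(33-i)\right)$ for $i = 17, \ldots, 33$ (2027)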

The ratio provides an estimated gain to be used in each sub-frame to align at the given frequency point (6000 Hz) the level of LP spectral envelopes in two different bands. This value is further refined to optimize overall quality.
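The pseudo-code above can be turned into a small runnable sketch. It assumes that entries 0 to 16 of each table hold the cosine terms and that entry 33-i holds the (negated) sine term for coefficient i, which is what the px/py indexing suggests; the sign of the sine does not affect the magnitude. The helper names `make_exp_table`, `weight_lp` and `envelope_level`, and the toy first-order filter, are hypothetical and only serve the illustration.

```python
import math

M = 16  # LP order in the low band


def make_exp_table(freq_hz, fs_hz, order=M):
    """Table of size 2*(order+1): cosines first, then sines in reverse order."""
    theta = 2.0 * math.pi * freq_hz / fs_hz
    cos_part = [math.cos(theta * i) for i in range(order + 1)]
    # Entry j (j = 17..33) holds -sin(theta * (33 - j)), so that the
    # sine term for coefficient i is found at index 33 - i.
    last = 2 * order + 1
    sin_part = [-math.sin(theta * (last - j)) for j in range(order + 1, last + 1)]
    return cos_part + sin_part


def weight_lp(coeffs, gamma):
    """Coefficients of A(z/gamma): a_i is multiplied by gamma**i."""
    return [c * gamma ** i for i, c in enumerate(coeffs)]


def envelope_level(coeffs, table):
    """1/|A(e^{j*theta})| evaluated with a precomputed cos/sin table."""
    last = len(table) - 1
    x = sum(c * table[i] for i, c in enumerate(coeffs))
    y = sum(c * table[last - i] for i, c in enumerate(coeffs))
    return 1.0 / math.sqrt(x * x + y * y)


exp_tab_q = make_exp_table(6000.0, 12800.0)  # low band, 12.8 kHz sampling
exp_tab_p = make_exp_table(6000.0, 16000.0)  # high band, 16 kHz sampling

# Toy filter A(z) = 1 - 0.5 z^-1, zero-padded to order 16 (illustrative only).
Aq = [1.0, -0.5] + [0.0] * (M - 1)
Ap = weight_lp(Aq, 0.6)  # gamma = 0.6 for modes other than 6.6 kbit/s

R = envelope_level(Aq, exp_tab_q)
P = envelope_level(Ap, exp_tab_p)
```

Precomputing the tables replaces 34 trigonometric evaluations per sub-frame with table look-ups, which is presumably why the specification describes the computation this way.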

To avoid over-estimating the sub-frame gain in the high band, which could result in too much energy in the high band, an additional LP filter of lower order is also computed from the low-band LP filter. An LP filter of order 2 is derived by truncating the filter decoded in the low band to order 2 (instead of order 16). The stability of this truncated filter is ensured by the following steps:

The filter is initialized as: , i=1, 2
Reflection coefficients are computed: ,
Filter stability and control of resonance is forced by applying the following conditions:

(2028)

(2029)

The coefficients of the LP filter of order 2 are then given by: ,
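The truncation steps above can be sketched as below, using the standard relations between direct-form and reflection coefficients for an order-2 filter ($k_2 = a_2$, $k_1 = a_1/(1+a_2)$, and back $a_1 = (1+k_2)k_1$, $a_2 = k_2$). The clamping limits of equations (2028) and (2029) are not reproduced in this text, so `k1_limit` and `k2_limit` are hypothetical placeholders; any limits strictly below 1 in magnitude keep the order-2 filter stable.

```python
def truncate_to_order2(a, k1_limit=0.99, k2_limit=0.6):
    """Derive a stable order-2 LP filter from the first coefficients of `a`.

    a: LP coefficients [1, a1, a2, ...] of the order-16 filter.
    k1_limit, k2_limit: hypothetical bounds standing in for the
    conditions of equations (2028) and (2029).
    Assumes a2 != -1 so that the division below is defined.
    """
    a1, a2 = a[1], a[2]

    # Direct form -> reflection coefficients (order 2).
    k2 = a2
    k1 = a1 / (1.0 + a2)

    # Force stability and limit resonance.
    k2 = max(-k2_limit, min(k2_limit, k2))
    k1 = max(-k1_limit, min(k1_limit, k1))

    # Reflection coefficients -> direct form.
    return [1.0, (1.0 + k2) * k1, k2]


# Truncating this example to two taps would give an unstable filter (a2 > 1).
a_low = [1.0, -1.9, 1.2] + [0.0] * 14
a_order2 = truncate_to_order2(a_low)
```

Working in the reflection-coefficient domain makes the stability condition trivial to enforce, since an all-pole filter is stable exactly when every reflection coefficient has magnitude below 1.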

The frequency response of the resulting LP filter of order 2 is computed as follows:

, (2030)

which can be computed efficiently using similar pseudo-code with the tables exp_tab_p and exp_tab_q. It was found that, for some signals, using the value obtained from the order-2 filter instead of the value R accounts better for the influence of spectral tilt in the actual signal spectrum and therefore avoids the influence of spectral peaks or valleys near the reference frequency point (6000 Hz), which could bias the value R.

The optimized gain to shape the excitation in high-band is then estimated based on , , .

Before the gain is estimated, an unvoiced flag is determined so that the gain estimation can differ between unvoiced speech and voiced speech. An unvoicing parameter is defined as,

(2031)

wherein is a smoothed voicing parameter of . The unvoicing parameter is first smoothed by

(2032)

Then, it is further smoothed by,

(2033)

A relative difference parameter is now defined as

(2034)

An initial unvoiced flag is decided by the following procedure,

(2035)

A final unvoiced flag is limited to

(2036)

The gain computation depends on the voicing of the signal:

If the sub-frame is classified as unvoiced

(2037)

where the smoothed value in the current sub-frame of index is computed as

(2038)

and

(2039)

and

(2040)

Otherwise, if the sub-frame is not classified as unvoiced:

(2041)

where the smoothed value in the current sub-frame of index is computed as

(2042)

with if and , otherwise, and

(2043)

and where

(2044)

and

(2045)

6.8.3.2 Generation of high-band excitation

6.8.3.2.1 DCT

The current frame of decoded excitation from the low-band, , , sampled at 12.8 kHz, is transformed in DCT domain as described in sub-clause 5.2.3.5.3.1, to obtain the spectrum, , .

6.8.3.2.2 High band generation

6.8.3.2.2.1 Adaptive start frequency bin prediction

The start frequency bin for predicting the high-band excitation from the low-band excitation is determined adaptively from the line spectral frequency (LSF) parameters, which are decoded from the low-band bitstream. The differences between every two adjacent decoded LSF parameters are calculated and the minimum difference is searched, since the minimum difference corresponds to an energy peak of the low-band spectral envelope. The start frequency bin is determined by the position of the minimum difference; the low-band excitation itself is decoded from the low-band bitstream as described in subclause 6.8.1.1.
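The core of this search, finding the position of the smallest adjacent LSF difference within a given range, can be sketched as follows. The sketch deliberately omits the bitrate- and class-dependent adjustment of the differences and the mapping from the LSF position to a DCT bin, since those equations are not reproduced here; the function name, the search range and the LSF values are hypothetical.

```python
def min_lsf_difference_position(lsf, first, last):
    """Return (position, value) of the smallest adjacent LSF difference.

    The difference at position i is lsf[i + 1] - lsf[i]; only positions
    in [first, last] are searched. No adjustment factor is applied
    (a simplification relative to the specification).
    """
    diffs = [lsf[i + 1] - lsf[i] for i in range(len(lsf) - 1)]
    pos = min(range(first, last + 1), key=lambda i: diffs[i])
    return pos, diffs[pos]


# Hypothetical decoded LSF vector (Hz) for an order-16 LP filter.
lsf = [300.0, 600.0, 950.0, 1300.0, 1700.0, 2100.0, 2500.0, 2900.0,
       3300.0, 3700.0, 3800.0, 4300.0, 4700.0, 5100.0, 5500.0, 5900.0]

pos, min_diff = min_lsf_difference_position(lsf, first=2, last=13)
```

Two LSFs that lie close together indicate a sharp resonance of the LP envelope between them, which is why the position of the minimum difference marks an energy peak.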

In order to mitigate switching the start frequency bin frequently in or , the voicing flag will be determined according to the average voice factor and the FEC class of current frame:

()

The voicing flag is further refined to 0 if .

Initially the start frequency bin is 160. If the bitrate is not less than 23050, the start frequency bin ; Otherwise, the start frequency bin is adaptively searched as follows:

Calculate the LSF differences between every two adjacent LSF parameters:

(2047)

where is the order of the LP filter and .

Determine the range of search the minimum LSF difference in :

Initialize the range to , if voicing flag , reset :

(2048)

Search the minimum value of the adjusted LSF difference in the range , is calculated as follows:

()

and , the position of the minimum value is

()

where is adjust factor of LSF parameters based on the core bitrate and the FEC class of current frame:

()

The start frequency bin for predicting the high-band excitation from the low-band excitation is calculated as:

(2052)

In order to decrease the distortion of the spectrum of the high band, the start frequency bin of the current frame is reset to the start frequency bin of the previous frame when the conditions below are satisfied:

– If one of the conditions , , or is satisfied, of current frame is preserved for the next frame, and

(2053)

– Otherwise, the start frequency bin of bandwidth extension is set to, and the of current frame is preserved for the next frame if .

The start frequency bin of predicting the high band excitation from the low band excitation is further refined if the FEC class of current frame is :

()

If is not an even number, is decremented by one.

Then, obtain the high band excitation by choosing low band excitation with a given length of the bandwidth according to the start frequency bin.

6.8.3.2.2.2 Extension of excitation spectrum

The DCT spectrum covering the 0-6400 Hz band is extended to the 0-8000 Hz band as follows:

()

where is the adaptive start band as computed according to subclause 6.8.3.2.2.1. The 5000-6000 Hz band in is copied from in the same band, this allows keeping the original spectrum in this band to avoid introducing distortions when the high-band is added to the decoded low-band signal. The 6000-8000 Hz band in is copied from e.g. in the 4000-6000 Hz band when =160.
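The copy operation described above can be sketched as follows. The bin indices follow from the 25 Hz bin spacing implied by the text (bins 240 to 319 cover 6000-8000 Hz), so 5000 Hz is bin 200 and 4000 Hz is bin 160. Setting the bins below 5000 Hz to zero is an assumption, since the equation is not reproduced here; the function and constant names are hypothetical.

```python
L_LOW = 256     # DCT bins covering 0-6400 Hz
L_EXT = 320     # DCT bins covering 0-8000 Hz
BIN_5000 = 200  # 5000 Hz / 25 Hz per bin
BIN_6000 = 240  # 6000 Hz / 25 Hz per bin


def extend_spectrum(u_low, start_bin=160):
    """Extend a 256-bin low-band DCT spectrum to 320 bins.

    Bins 200..239 are kept from the low band; bins 240..319 are copied
    from the low band starting at `start_bin`. Bins below 200 are set to
    zero (assumption).
    """
    if len(u_low) != L_LOW:
        raise ValueError("expected a 256-bin low-band spectrum")
    if start_bin + (L_EXT - BIN_6000) > L_LOW:
        raise ValueError("start_bin too high: source band exceeds 6400 Hz")

    u_ext = [0.0] * L_EXT
    u_ext[BIN_5000:BIN_6000] = u_low[BIN_5000:BIN_6000]
    for k in range(BIN_6000, L_EXT):
        u_ext[k] = u_low[k + start_bin - BIN_6000]
    return u_ext


# Toy spectrum whose value equals its bin index, to make the copy visible.
u_low = [float(k) for k in range(L_LOW)]
u_ext = extend_spectrum(u_low, start_bin=160)
```

With a start bin of 160, the 6000-8000 Hz band is filled from bins 160 to 239, i.e. the 4000-6000 Hz band, matching the example in the text.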

6.8.3.2.3 Extraction of tonal and ambiance components

Tonal and ambiance components are extracted in the 6000-8000 Hz band. This extraction is implemented according to the following steps:

Computation of total energy in the extended low-band signal:

()

where =0.1.

Computation of the ambiance component (in absolute value) corresponding to the average (bin-by-bin) level of the spectrum and computation of the energy of dominant tonal components in high frequency:

The average level is given by the following equation:

, (2057)

where = 80. This level is an average level in absolute value and represents a kind of spectral envelope. Note that the index corresponds to indices from 240 to 319, i.e. the 6000-8000 Hz band. In general, and , however for the first and last 7 indices ( and ) the following values are used:

and for

Detection and computation of the residual signal which defines tonal components:

, (2058)

Tonal components are detected using the criterion >0.

Computation of the energy of dominant tonal components in high frequency:

The energy of tonal components is computed as follows:

, (2059)
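The extraction steps above amount to a moving average of the magnitude spectrum followed by a positive-residual test. The sketch below assumes a centred window of 15 bins (7 on each side) that is truncated at the edges of the 80-bin band, which is consistent with the special handling of the first and last 7 indices; the exact window bounds and the function name are assumptions.

```python
def split_tonal_ambiance(u_ext, first_bin=240, n_bins=80, half_width=7):
    """Split the 6000-8000 Hz band into ambiance level and tonal residual.

    Returns (levels, residual, tonal_energy) where
      levels[i]   : local average of |u_ext| around bin first_bin + i,
      residual[i] : |u_ext[first_bin + i]| - levels[i],
      tonal_energy: sum of residual**2 over bins with residual > 0.
    """
    mags = [abs(u_ext[first_bin + i]) for i in range(n_bins)]
    levels, residual = [], []
    for i in range(n_bins):
        lo = max(0, i - half_width)            # window truncated at band edges
        hi = min(n_bins - 1, i + half_width)
        lev = sum(mags[lo:hi + 1]) / (hi - lo + 1)
        levels.append(lev)
        residual.append(mags[i] - lev)
    tonal_energy = sum(y * y for y in residual if y > 0.0)
    return levels, residual, tonal_energy


# Flat spectrum with a single strong component at bin 280 (7000 Hz).
u_ext = [0.0] * 240 + [1.0] * 80
u_ext[280] = 16.0
levels, residual, tonal_energy = split_tonal_ambiance(u_ext)
```

In this toy example only the bin at 7000 Hz has a positive residual, so it is the only detected tonal component.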

6.8.3.2.4 Recombination

The extracted tonal and ambiance components are re-mixed adaptively. The combined signal is obtained using absolute values as:

, (2060)

where the factor controlling the ambiance

(2061)

and is a multiplicative factor given by:

(2062)

Tonal components, that were detected using the criterion, are reduced by a factor and the average level is amplified by .

Signs from are then applied as follows:

, (2063)

where

(2064)

The combined high-band signal is then obtained by adjusting the energy as follows:

, (2065)

where the adjustment factor is given by:

(2066)

The factor is used to avoid over-estimation of energy and is given by:

(2067)

and

(2068)

6.8.3.2.5 Filtering in DCT domain

The excitation is de-emphasized as follows:

(2069)

where is the frequency responses of the filter over a limited frequency range. Taking into account the (odd) frequencies of the DCT, is given by:

, (2070)

where

, (2071)

The de-emphasis is applied in two steps, for where the response of is applied in the 5000-6400 Hz band, and for corresponding to the 6400-8000 Hz band. This de-emphasis brings the signal into a domain consistent with the low-band signal (in the 0-6.4 kHz band), which is useful for the subsequent energy estimation and adjustment.

Then, the high-band is bandpass filtered in DCT domain, by splitting fixed high-pass filtering and adaptive low-pass filtering. The partial response of the low-pass filter in DCT domain is computed as follows:

, (2072)

where =60 at 6.6 kbit/s, 40 at 8.85 kbit/s, and 20 for modes >8.85 kbit/s. It defines a low-pass filter with variable cut-off frequency, depending on the mode in the current frame. Then, the band-pass filter is applied in the following form:

(2073)

The definition of the factor , is given in table 174.

Table 174: High-pass filter in DCT domain

Index	Factor	Index	Factor	Index	Factor	Index	Factor
0	0.001622428	14	0.114057967	28	0.403990611	42	0.776551214
1	0.004717458	15	0.128865425	29	0.430149896	43	0.800503267
2	0.008410494	16	0.144662643	30	0.456722014	44	0.823611104
3	0.012747280	17	0.161445005	31	0.483628433	45	0.845788355
4	0.017772424	18	0.179202219	32	0.510787115	46	0.866951597
5	0.023528982	19	0.197918220	33	0.538112915	47	0.887020781
6	0.030058032	20	0.217571104	34	0.565518011	48	0.905919644
7	0.037398264	21	0.238133114	35	0.592912340	49	0.923576092
8	0.045585564	22	0.259570657	36	0.620204057	50	0.939922577
9	0.054652620	23	0.281844373	37	0.647300005	51	0.954896429
10	0.064628539	24	0.304909235	38	0.674106188	52	0.968440179
11	0.075538482	25	0.328714699	39	0.700528260	53	0.980501849
12	0.087403328	26	0.353204886	40	0.726472003	54	0.991035206
13	0.100239356	27	0.378318805	41	0.751843820	55	1.000000000

6.8.3.2.6 Inverse DCT

The current frame of extended excitation in high-band, , , sampled at 16 kHz, is transformed in time domain as described in subclause 5.2.3.5.13, to obtain the signal , .

6.8.3.2.7 Gain computation and scaling of excitation

6.8.3.2.7.1 6.6, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85 or 23.05 kbit/s modes

The signal , is scaled by sub-frame of 5 ms as follows:

, (2074)

where =0,1,2,3 is the sub-frame index and

(2075)

with

(2076)

and = 0.01. The sub-frame gain can be further written as:

(2077)

which shows that this gain gives the high-band excitation the same ratio of sub-frame energy to frame energy as the low-band signal.
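The energy-ratio matching described here can be sketched as below. The exact form of equations (2075) to (2077) is not reproduced in this text, so the sketch assumes that the gain is the square root of the low-band sub-frame-to-frame energy ratio divided by the same ratio in the high-band excitation, with the constant 0.01 added to each energy. A 20 ms frame holds 256 low-band samples at 12.8 kHz and 320 high-band samples at 16 kHz; the function names are hypothetical.

```python
import math

EPS = 0.01  # small constant added to each energy


def energy(x):
    """Energy of a block of samples, offset by EPS."""
    return sum(v * v for v in x) + EPS


def subframe_gains(u_low, u_high, n_sub=4):
    """Per-sub-frame gains matching the high-band energy profile to the low band.

    Assumed form: g(m)^2 = (E_low_sub(m) / E_low_frame) * (E_high_frame / E_high_sub(m)).
    """
    len_lo = len(u_low) // n_sub
    len_hi = len(u_high) // n_sub
    e_lo_frame = energy(u_low)
    e_hi_frame = energy(u_high)
    gains = []
    for m in range(n_sub):
        e_lo = energy(u_low[m * len_lo:(m + 1) * len_lo])
        e_hi = energy(u_high[m * len_hi:(m + 1) * len_hi])
        gains.append(math.sqrt((e_lo / e_lo_frame) * (e_hi_frame / e_hi)))
    return gains


# Low band with a louder second sub-frame; flat high-band excitation.
u_low = [1.0] * 64 + [2.0] * 64 + [1.0] * 128
u_high = [1.0] * 320
gains = subframe_gains(u_low, u_high)
```

In this example the second sub-frame carries about four times the low-band energy of the others, so its gain is about twice as large, and the scaled high-band excitation inherits the same temporal energy profile.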

The scaled extended excitation signal is then computed for as follows:

(2077a)

where is given in Eqs. 2037 and 2041 and is the extended excitation signal.

6.8.3.2.7.2 23.85 kbit/s mode

In the 23.85 kbit/s mode, a high-frequency (HF) gain is transmitted at a bit rate of 0.8 kbit/s (4 bits per 5 ms sub-frame). This information is transmitted only at 23.85 kbit/s and it is used in EVS AMR-WB IO to improve quality by adjusting the excitation gain.
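As a quick check of the stated rate, with four 5 ms sub-frames per 20 ms frame:

$$\frac{4\ \text{bits}}{5\ \text{ms}} = 800\ \text{bit/s} = 0.8\ \text{kbit/s}, \qquad 4 \times 4\ \text{bits} = 16\ \text{bits per 20 ms frame}, \qquad 23.85 - 23.05 = 0.8\ \text{kbit/s}.$$

The HF gain therefore accounts for the whole bit-rate difference between the 23.05 and 23.85 kbit/s modes.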

To be able to use the HF gain information, the excitation has to be converted to a signal domain similar to AMR-WB high-band coding. To do so the energy of the excitation is adjusted in each subframe as follows:

, (2078)

where the sub-frame gain is computed as:

(2079)

The factor 5 in the denominator is used to compensate the difference in bandwidth between the signal and the signal , noting that in AMR-WB the HF excitation is a white noise in the 0-8000 Hz band.

The 4-bit index in each sub-frame, , transmitted at 23.85 kbit/s is demultiplexed from the bitstream and decoded as follows:

(2080)

where is the codebook used for HF gain quantization in AMR-WB, as defined in table 175.

Table 175: AMR-WB gain codebook for high band

Index	Gain	Index	Gain
0	0.110595703125000	8	0.342102050781250
1	0.142608642578125	9	0.372497558593750
2	0.170806884765625	10	0.408660888671875
3	0.197723388671875	11	0.453002929687500
4	0.226593017578125	12	0.511779785156250
5	0.255676269531250	13	0.599822998046875
6	0.284545898437500	14	0.741241455078125
7	0.313232421875000	15	0.998779296875000

Then, the signal is scaled according to this decoded HF gain as follows:

, (2081)

The energy of the excitation is further adjusted by sub-frame under the following conditions. A factor is computed:

(2082)

Here the term 0.6 corresponds to the average magnitude ratio between the frequency response of the de-emphasis filter in the 5000-6400 Hz band. Therefore, the term represents the energy of the high-band excitation that would be obtained at 23.05 kbit/s.

Based on the tilt information of the low-band signal, the scaled extended excitation signal is then computed for as follows:

If >1 or <0:

(2083)

Otherwise:

(2084)

6.8.3.3 LP filter for the high frequency band

The high-band LP synthesis filter is derived from the weighted low-band LP synthesis filter as follows:

$\dfrac{1}{A_{HB}(z)} = \dfrac{1}{\hat{A}(z/\gamma)}$ (2085)

where $1/\hat{A}(z)$ is the interpolated LP synthesis filter in each 5-ms sub-frame and $\gamma$ = 0.9 at 6.6 kbit/s and 0.6 at other modes (from 8.85 to 23.85 kbit/s). $1/\hat{A}(z)$ was computed by analysing a signal sampled at 12.8 kHz, but it is now applied to a 16 kHz signal.

6.8.3.4 High band synthesis

The scaled extended excitation signal in high-band is filtered by to obtain the decoded high-band signal, which is added to synthesized low band signal to produce the synthesized output signal.