6.8.3 High frequency band

3GPP TS 26.445, Release 15: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description

The high-frequency band generation for modes from 6.6 to 23.05 kbit/s is illustrated in figure 113. The high band is produced from an over-sampled excitation signal, built in the DCT domain, that extends the 0-6400 Hz band into the 6400-8000 Hz band. In practice the extension covers a slightly wider band (6000-8000 Hz), which facilitates the addition of the low and high bands, especially in the cross-over region around 6400 Hz. Tonal and ambiance components in the extended band are extracted and combined adaptively to obtain the extended excitation signal, which is then filtered in the DCT domain. After the inverse DCT, gains are applied in the time domain (by sub-frame) and the extended excitation signal is filtered by an LP filter whose coefficients are derived from the low-band LP filter.

Figure 113: High-frequency band generation in AMR-WB IO modes

6.8.3.1 Preliminary estimation steps

The low frequency band signal is extended to obtain the high frequency band signal by a bandwidth extension algorithm, which comprises the estimation of gains and the prediction of the excitation of the high frequency band signal.

The gains of the high frequency band signal are estimated from the pitch, the noise gate factor, the voice factor, the classification parameter and the LPC.

The excitation of the high frequency band signal is adaptively predicted from the decoded low frequency band excitation signal (the sum of the adaptive codebook contribution and the algebraic codebook contribution) according to the LSF and the bit rate.

The excitation of the high frequency band signal is modified by the gains of the high frequency band signal, and LP synthesis is performed by filtering the modified excitation signal through the LP synthesis filter to obtain the high frequency band signal.

The parameters used for estimating the gains and predicting the excitation of the high frequency band signal are either decoded from the low-band bitstream or calculated from the decoded low-band signal.

6.8.3.1.1 Estimation of tilt, figure of merit and voice factors

1) Calculate the spectrum tilt factor of each subframe according to the decoded low frequency band as follows:

()

where denotes the length of sub frame and denotes the decoded low frequency band signal. is preserved as for the following unvoiced flag calculation.

2) Calculate the sum of the differences between every two adjacent pitch values:

()

where is the pitch value of each subframe.

3) If the frame counter is greater than 100 and the FEC class of current frame is , set to 0 and set the minimum noise gate to -30. Otherwise, if is greater than 200, set to 200; if not, is increased by 1; If the noise gate is less than, set to.

4) Calculate the average voice factor as follows:

()

where denotes the voice factor of each subframe.

5) Based on the classify parameter from FEC classification, determine two parameters and.

()

and if the FEC class of current frame is , .

Then is further modified by the average voice factor and smoothed as follows:

()

()

Set to .

If is less than 0.5, set to 1; Otherwise, set to . Then smooth as follows:

()

Set to .

6) If the sum of the differences between every two adjacent pitch values is less than 10 and the spectrum tilt factor of current subframe is less than zero, reset to 0.2. Then, if is greater than 0.2, reset to 0.8; Otherwise, reset to . Finally, modify as follows:

()

6.8.3.1.2 Estimation of sub-frame gains based on LP spectral envelopes

The signal in the low band (0-6400 Hz) is generated based on a source-filter model, where the filter is the low-band LP synthesis filter. Similarly, as shown in subclause 6.8.3.3, the signal in the high band (above 6400 Hz) is generated based on a source-filter model; the filter in the high band is a linear predictive (LP) filter derived from the LP filter in the low band.

Since the low and high bands are combined in the final synthesis, a preliminary equalization step is performed to match the levels of the two LP filters at a given frequency. At 6400 Hz the low-band LP spectral envelope is already decreasing too steeply, therefore 6000 Hz has been chosen as the equalization frequency point.

In each sub-frame, the frequency response of the LP filter in the low-band and the LP filter in the high-band are computed at the frequency of 6000 Hz:

, (2024)

and

, (2025)

where =0.9 at 6.6 kbit/s and 0.6 at other modes (from 8.85 to 23.85 kbit/s)

These values are computed efficiently using the following pseudo-code:

    px = py = 0
    rx = ry = 0
    for i = 0 to 16
        px = px + Ap[i]*exp_tab_p[i]
        py = py + Ap[i]*exp_tab_p[33-i]
        rx = rx + Aq[i]*exp_tab_q[i]
        ry = ry + Aq[i]*exp_tab_q[33-i]
    end for
    P = 1/sqrt(px*px+py*py)
    R = 1/sqrt(rx*rx+ry*ry)

where Aq[i] are the coefficients of the LP filter in the low band, Ap[i] are the coefficients of the LP filter in the high band, sqrt() is the square root operation, and the tables exp_tab_p and exp_tab_q, each of size 34, contain the real and imaginary parts of complex exponentials at 6000 Hz:

exp_tab_p[i] = (2026)

and

exp_tab_q[i] = (2027)

The ratio of these two values provides an estimated gain, used in each sub-frame, that aligns the levels of the LP spectral envelopes of the two bands at the given frequency point (6000 Hz). This value is further refined to optimize overall quality.
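As a cross-check on the table-driven pseudo-code, the two levels can also be evaluated directly from the filter coefficients. The following is a minimal sketch, not the normative procedure. The function name `lp_level` is hypothetical, and it is assumed here that the low-band filter is evaluated on a 12.8 kHz sampling grid, the high-band filter on a 16 kHz grid, and that the weighting factor is applied as coefficient i multiplied by gamma to the power i.

```python
import cmath
import math


def lp_level(coeffs, freq_hz, fs_hz, gamma=1.0):
    """Return 1/|A(e^{jw})| for A(z) = sum_i coeffs[i] * gamma**i * z**-i."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    acc = sum(a * gamma ** i * cmath.exp(-1j * w * i)
              for i, a in enumerate(coeffs))
    return 1.0 / abs(acc)


# Toy order-16 filter with made-up coefficients (not decoded data).
a = [1.0, -0.9] + [0.0] * 15

# Low-band level at 6000 Hz (assumed 12.8 kHz sampling grid).
R = lp_level(a, 6000.0, 12800.0)
# High-band level at 6000 Hz (assumed 16 kHz grid, weighting 0.6).
P = lp_level(a, 6000.0, 16000.0, gamma=0.6)

print("R =", R, "P =", P, "ratio =", R / P)
```

Evaluating the complex sum directly is slower than the table look-up but makes the role of the weighting factor and of the two sampling rates explicit.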

To avoid over-estimating the sub-frame gain in the high band, which could result in too much energy in the high band, an additional LP filter of lower order is also computed from the low-band LP filter. An LP filter of order 2 is derived by truncating the filter decoded in the low band to order 2 (instead of order 16). The stability of this truncated filter is ensured by the following steps:

  • The filter is initialized as: , i=1, 2
  • Reflection coefficients are computed: ,
  • Filter stability and control of resonance is forced by applying the following conditions:

(2028)

(2029)

  • The coefficients of the LP filter of order 2 are then given by: ,
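The steps above can be sketched as follows for a filter A(z) = 1 + a1 z^-1 + a2 z^-2. This is an illustration under stated assumptions: the conversion between direct-form and reflection coefficients is the standard order-2 relation, while the clamping bounds `k1_max` and `k2_max` are hypothetical parameters standing in for the two conditions above, whose actual values are not reproduced here.

```python
def stabilize_order2(a1, a2, k1_max, k2_max):
    """Clamp the reflection coefficients of A(z) = 1 + a1*z^-1 + a2*z^-2.

    k1_max and k2_max are hypothetical bounds (both should be below 1).
    Returns the direct-form coefficients (a1, a2) of the clamped filter.
    """
    if 1.0 + a2 == 0.0:
        raise ValueError("a2 = -1 has no finite first reflection coefficient")
    # Direct form -> reflection coefficients (order-2 step-down).
    k2 = a2
    k1 = a1 / (1.0 + a2)
    # Symmetric clamping; keeping |k1| < 1 and |k2| < 1 guarantees stability.
    k1 = max(-k1_max, min(k1_max, k1))
    k2 = max(-k2_max, min(k2_max, k2))
    # Reflection coefficients -> direct form (order-2 step-up).
    return (1.0 + k2) * k1, k2


# Example with made-up coefficients and bounds.
print(stabilize_order2(-1.9, 0.95, k1_max=0.99, k2_max=0.6))
```

Clamping in the reflection-coefficient domain is convenient because each coefficient can be bounded independently, which is not the case for the direct-form coefficients.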

The frequency response of the resulting LP filter of order 2 is computed as follows:

, (2030)

which can be computed efficiently using a similar pseudo-code with the tables exp_tab_p and exp_tab_q. It was found that, for some signals, using the value instead of the value better takes into account the influence of spectral tilt in the actual signal spectrum and therefore avoids the influence of spectral peaks or valleys near the reference frequency point (6000 Hz) which could bias the value .

The optimized gain to shape the excitation in high-band is then estimated based on , , .

Before the gain is estimated, an unvoiced flag is first determined so that the gain estimation differs between unvoiced speech and voiced speech. An unvoicing parameter is defined as,

(2031)

wherein is a smoothed voicing parameter of . The unvoicing parameter is first smoothed by

(2032)

Then, it is further smoothed by,

(2033)

A relative difference parameter is now defined as

(2034)

An initial unvoiced flag is decided by the following procedure,

(2035)

A final unvoiced flag is limited to

(2036)

The gain computation depends on the voicing of the signal:

If the sub-frame is classified as unvoiced

(2037)

where the smoothed value in the current sub-frame of index is computed as

(2038)

and

(2039)

and

(2040)

Otherwise, if the sub-frame is not classified as unvoiced:

(2041)

where the smoothed value in the current sub-frame of index is computed as

(2042)

with if and , otherwise, and

(2043)

and where

(2044)

and

(2045)

6.8.3.2 Generation of high-band excitation

6.8.3.2.1 DCT

The current frame of decoded excitation from the low-band, , , sampled at 12.8 kHz, is transformed in DCT domain as described in sub-clause 5.2.3.5.3.1, to obtain the spectrum, , .

6.8.3.2.2 High band generation

6.8.3.2.2.1 Adaptive start frequency bin prediction

The start frequency bin used to predict the high band excitation from the low band excitation is determined adaptively from the line spectrum frequency (LSF) parameters, which are decoded from the bitstream of the low frequency band. Based on the decoded LSF parameters of the low band signal, the differences between every two adjacent LSF parameters are calculated and the minimum difference is searched, since the minimum difference corresponds to an energy peak of the low band spectral envelope. The start frequency bin is determined by the position of the minimum difference; the low band excitation itself is decoded from the bitstream of the low band as described in subclause 6.8.1.1.
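The core of this search, locating the closest pair of adjacent LSF parameters, can be sketched as follows. This is a simplified illustration: the function name `min_lsf_gap`, the index convention (gap i is taken between entries i and i+1) and the search bounds are assumptions, and the bitrate-dependent adjustment and the mapping from the position to a frequency bin are not covered.

```python
def min_lsf_gap(lsf, first, last):
    """Return (index, gap) of the smallest lsf[i+1] - lsf[i], first <= i <= last."""
    best_i = first
    best_gap = lsf[first + 1] - lsf[first]
    for i in range(first + 1, last + 1):
        gap = lsf[i + 1] - lsf[i]
        if gap < best_gap:
            best_i, best_gap = i, gap
    return best_i, best_gap


# Made-up LSF values in Hz; the closest pair (1200, 1300) marks a spectral peak.
lsf = [300.0, 700.0, 1200.0, 1300.0, 2500.0, 3100.0]
print(min_lsf_gap(lsf, 0, 4))
```

Restricting `first` and `last` narrows the search to the part of the spectrum from which the high band may be copied.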

In order to mitigate switching the start frequency bin frequently in or , the voicing flag will be determined according to the average voice factor and the FEC class of current frame:

(2046)

The voicing flag is further refined to 0 if .

Initially the start frequency bin is 160. If the bitrate is not less than 23050, the start frequency bin ; Otherwise, the start frequency bin is adaptively searched as follows:

  1. Calculate the LSF differences between every two adjacent LSF parameters:

(2047)

where is the order of the LP filter and .

  2. Determine the range for searching the minimum LSF difference in :

Initialize the range to , if voicing flag , reset :

(2048)

  3. Search the minimum value of the adjusted LSF difference in the range , is calculated as follows:

(2049)

and , the position of the minimum value is

(2050)

where is the adjustment factor of the LSF parameters based on the core bitrate and the FEC class of the current frame:

(2051)

  4. The start frequency bin for predicting the high band excitation from the low band excitation is calculated:

(2052)

  5. In order to decrease the distortion of the spectrum of the high band, the start frequency bin of the current frame is reset with the start frequency bin of the previous frame when the conditions below are satisfied:

– If one of the conditions , , or is satisfied, of current frame is preserved for the next frame, and

(2053)

– Otherwise, the start frequency bin of bandwidth extension is set to, and the of current frame is preserved for the next frame if .

The start frequency bin of predicting the high band excitation from the low band excitation is further refined if the FEC class of current frame is :

(2054)

If is not an even number, is decremented by one.

The high band excitation is then obtained by selecting a segment of the low band excitation, of a given bandwidth, beginning at the start frequency bin.

6.8.3.2.2.2 Extension of excitation spectrum

The DCT spectrum covering the 0-6400 Hz band is extended to the 0-8000 Hz band as follows:

(2055)

where is the adaptive start band as computed according to subclause 6.8.3.2.2.1. The 5000-6000 Hz band in is copied from in the same band, this allows keeping the original spectrum in this band to avoid introducing distortions when the high-band is added to the decoded low-band signal. The 6000-8000 Hz band in is copied from e.g. in the 4000-6000 Hz band when =160.
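The copy operations described above can be sketched as follows, using the 25 Hz bin spacing implied by the text (bin 200 is 5000 Hz, bin 240 is 6000 Hz, bin 320 is 8000 Hz). This is an illustration under assumptions: the function name `extend_spectrum` is hypothetical, and setting the bins below 5000 Hz to zero is an assumption rather than something stated in this subclause.

```python
def extend_spectrum(u_low, start_bin):
    """Extend a 256-bin (0-6400 Hz) DCT spectrum to 320 bins (0-8000 Hz)."""
    if len(u_low) != 256:
        raise ValueError("expected 256 low-band bins")
    # The highest source bin read below is 319 + start_bin - 240.
    if not 0 <= start_bin <= 176:
        raise ValueError("start_bin would read outside the low-band spectrum")
    out = [0.0] * 320                  # bins below 5000 Hz left at zero (assumed)
    out[200:240] = u_low[200:240]      # 5000-6000 Hz: keep the original spectrum
    for k in range(240, 320):          # 6000-8000 Hz: copy from the start bin on
        out[k] = u_low[k + start_bin - 240]
    return out


# With start_bin = 160 the 6000-8000 Hz band is copied from 4000-6000 Hz.
u = [float(k) for k in range(256)]
x = extend_spectrum(u, 160)
print(x[240], x[319])
```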

6.8.3.2.3 Extraction of tonal and ambiance components

Tonal and ambiance components are extracted in the 6000-8000 Hz band. This extraction is implemented according to the following steps:

  • Computation of total energy in the extended low-band signal:

(2056)

where =0.1.

  • Computation of the ambiance component (in absolute value) corresponding to the average (bin-by-bin) level of the spectrum and computation of the energy of dominant tonal components in high frequency:

The average level is given by the following equation:

, (2057)

where = 80. This level gives an average level in absolute value and represents a sort of spectral envelope. Note that the index corresponds to indices from 240 to 319, i.e. the 6000-8000 Hz band. In general, and , however for the first and last 7 indices ( and ) the following values are used:

and for

and for

  • Detection and computation of the residual signal which defines tonal components:

, (2058)

Tonal components are detected using the criterion >0.

  • Computation of the energy of dominant tonal components in high frequency:

The energy of tonal components is computed as follows:

, (2059)
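The extraction steps above can be sketched as follows. This is a minimal illustration, not the normative computation: the function name `split_tonal_ambiance` is hypothetical, and the averaging window (7 bins on each side, truncated at the band edges) is an assumption inferred from the special handling of the first and last 7 indices.

```python
def split_tonal_ambiance(u_hb, half_width=7):
    """Split the 80 high-band bins into an average level and a tonal residual.

    Returns (level, residual, tonal_energy). A bin counts as tonal when its
    residual (magnitude minus local average level) is positive.
    """
    n = len(u_hb)
    mag = [abs(x) for x in u_hb]
    level, residual = [], []
    for i in range(n):
        lo = max(0, i - half_width)        # window truncated at the lower edge
        hi = min(n - 1, i + half_width)    # window truncated at the upper edge
        avg = sum(mag[lo:hi + 1]) / (hi - lo + 1)
        level.append(avg)
        residual.append(mag[i] - avg)
    tonal_energy = sum(r * r for r in residual if r > 0.0)
    return level, residual, tonal_energy


# A single strong bin in an otherwise empty band is detected as tonal.
spike = [0.0] * 80
spike[40] = 15.0
level, residual, energy = split_tonal_ambiance(spike)
print(level[40], residual[40], energy)
```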

6.8.3.2.4 Recombination

The extracted tonal and ambiance components are re-mixed adaptively. The combined signal is obtained using absolute values as:

, (2060)

where the factor controlling the ambiance

(2061)

and is a multiplicative factor given by:

(2062)

Tonal components, that were detected using the criterion, are reduced by a factor and the average level is amplified by .

Signs from are then applied as follows:

, (2063)

where

(2064)

The combined high-band signal is then obtained by adjusting the energy as follows:

, (2065)

where the adjustment factor is given by:

(2066)

The factor is used to avoid over-estimation of energy and is given by:

(2067)

and

(2068)

6.8.3.2.5 Filtering in DCT domain

The excitation is de-emphasized as follows:

(2069)

where is the frequency responses of the filter over a limited frequency range. Taking into account the (odd) frequencies of the DCT, is given by:

, (2070)

where

, (2071)

The de-emphasis is applied in two steps, for where the response of is applied in the 5000-6400 Hz band, and for corresponding to the 6400-8000 Hz band. This de-emphasis brings the signal into a domain consistent with the low-band signal (in the 0-6.4 kHz band), which is useful for the subsequent energy estimation and adjustment.

Then, the high-band is bandpass filtered in DCT domain, by splitting fixed high-pass filtering and adaptive low-pass filtering. The partial response of the low-pass filter in DCT domain is computed as follows:

, (2072)

where =60 at 6.6 kbit/s, 40 at 8.85 kbit/s, and 20 for modes >8.85 kbit/s. It defines a low-pass filter with variable cut-off frequency, depending on the mode in the current frame. Then, the band-pass filter is applied in the following form:

(2073)

The definition of the factor , is given in table 174.

Table 174: High-pass filter in DCT domain

| Index | Factor | Index | Factor | Index | Factor | Index | Factor |
|---|---|---|---|---|---|---|---|
| 0 | 0.001622428 | 14 | 0.114057967 | 28 | 0.403990611 | 42 | 0.776551214 |
| 1 | 0.004717458 | 15 | 0.128865425 | 29 | 0.430149896 | 43 | 0.800503267 |
| 2 | 0.008410494 | 16 | 0.144662643 | 30 | 0.456722014 | 44 | 0.823611104 |
| 3 | 0.012747280 | 17 | 0.161445005 | 31 | 0.483628433 | 45 | 0.845788355 |
| 4 | 0.017772424 | 18 | 0.179202219 | 32 | 0.510787115 | 46 | 0.866951597 |
| 5 | 0.023528982 | 19 | 0.197918220 | 33 | 0.538112915 | 47 | 0.887020781 |
| 6 | 0.030058032 | 20 | 0.217571104 | 34 | 0.565518011 | 48 | 0.905919644 |
| 7 | 0.037398264 | 21 | 0.238133114 | 35 | 0.592912340 | 49 | 0.923576092 |
| 8 | 0.045585564 | 22 | 0.259570657 | 36 | 0.620204057 | 50 | 0.939922577 |
| 9 | 0.054652620 | 23 | 0.281844373 | 37 | 0.647300005 | 51 | 0.954896429 |
| 10 | 0.064628539 | 24 | 0.304909235 | 38 | 0.674106188 | 52 | 0.968440179 |
| 11 | 0.075538482 | 25 | 0.328714699 | 39 | 0.700528260 | 53 | 0.980501849 |
| 12 | 0.087403328 | 26 | 0.353204886 | 40 | 0.726472003 | 54 | 0.991035206 |
| 13 | 0.100239356 | 27 | 0.378318805 | 41 | 0.751843820 | 55 | 1.000000000 |

6.8.3.2.6 Inverse DCT

The current frame of extended excitation in high-band, , , sampled at 16 kHz, is transformed in time domain as described in subclause 5.2.3.5.13, to obtain the signal , .

6.8.3.2.7 Gain computation and scaling of excitation

6.8.3.2.7.1 6.6, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85 or 23.05 kbit/s modes

The signal , is scaled by sub-frame of 5 ms as follows:

, (2074)

where =0,1,2,3 is the sub-frame index and

(2075)

with

(2076)

and = 0.01. The sub-frame gain can be further written as:

(2077)

which shows that this gain gives the scaled high-band signal the same ratio of sub-frame energy to frame energy as the low-band signal.
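The energy-ratio property can be illustrated with the following sketch. It is a simplified reading of the text, not the normative equations: the function name `subframe_gains` is hypothetical, and the places where the constant 0.01 is added (to every energy term) are an assumption made to avoid division by zero.

```python
import math


def subframe_gains(u_low, u_high, n_sub=4, eps=0.01):
    """Per-sub-frame gains giving u_high the sub-frame/frame energy ratio of u_low."""
    len_lo = len(u_low) // n_sub      # e.g. 64 samples at 12.8 kHz (5 ms)
    len_hi = len(u_high) // n_sub     # e.g. 80 samples at 16 kHz (5 ms)
    e_lo_frame = sum(x * x for x in u_low) + eps
    e_hi_frame = sum(x * x for x in u_high) + eps
    gains = []
    for m in range(n_sub):
        e_lo = sum(x * x for x in u_low[m * len_lo:(m + 1) * len_lo]) + eps
        e_hi = sum(x * x for x in u_high[m * len_hi:(m + 1) * len_hi]) + eps
        # Target sub-frame energy in the high band: (e_lo / e_lo_frame) * e_hi_frame.
        gains.append(math.sqrt((e_lo / e_lo_frame) * (e_hi_frame / e_hi)))
    return gains


# Low band with a louder second sub-frame, flat high band (made-up signals).
u_low = [1.0] * 64 + [2.0] * 64 + [1.0] * 128
u_high = [1.0] * 320
g = subframe_gains(u_low, u_high)
print(g)
```

In this example the second sub-frame of the low band has twice the amplitude of the others, so its gain comes out at about twice the gain of the other sub-frames.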

The scaled extended excitation signal is then computed for as follows:

(2077a)

where is given in Eqs. 2037 and 2041 and is the extended excitation signal.

6.8.3.2.7.2 23.85 kbit/s mode

In the 23.85 kbit/s mode, a high-frequency (HF) gain is transmitted at a bit rate of 0.8 kbit/s (4 bits per 5 ms sub-frame). This information is transmitted only at 23.85 kbit/s and is used in EVS AMR-WB IO to improve quality by adjusting the excitation gain.

To be able to use the HF gain information, the excitation has to be converted to a signal domain similar to AMR-WB high-band coding. To do so the energy of the excitation is adjusted in each subframe as follows:

, (2078)

where the sub-frame gain is computed as:

(2079)

The factor 5 in the denominator compensates for the difference in bandwidth between the signal and the signal , noting that in AMR-WB the HF excitation is a white noise in the 0-8000 Hz band.

The 4-bit index in each sub-frame, , transmitted at 23.85 kbit/s is demultiplexed from the bitstream and decoded as follows:

(2080)

where is the codebook used for HF gain quantization in AMR-WB, as defined in table 175.

Table 175: AMR-WB gain codebook for high band

| Index | Gain | Index | Gain |
|---|---|---|---|
| 0 | 0.110595703125000 | 8 | 0.342102050781250 |
| 1 | 0.142608642578125 | 9 | 0.372497558593750 |
| 2 | 0.170806884765625 | 10 | 0.408660888671875 |
| 3 | 0.197723388671875 | 11 | 0.453002929687500 |
| 4 | 0.226593017578125 | 12 | 0.511779785156250 |
| 5 | 0.255676269531250 | 13 | 0.599822998046875 |
| 6 | 0.284545898437500 | 14 | 0.741241455078125 |
| 7 | 0.313232421875000 | 15 | 0.998779296875000 |
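The codebook look-up for the 4-bit index can be sketched as follows. The names `HP_GAIN` and `lookup_hf_gain` are hypothetical, and the sketch only returns the codebook entry; any further scaling applied by the decoding equation above is not covered.

```python
# Codebook entries copied from table 175 (index 0 to 15).
HP_GAIN = [
    0.110595703125000, 0.142608642578125, 0.170806884765625, 0.197723388671875,
    0.226593017578125, 0.255676269531250, 0.284545898437500, 0.313232421875000,
    0.342102050781250, 0.372497558593750, 0.408660888671875, 0.453002929687500,
    0.511779785156250, 0.599822998046875, 0.741241455078125, 0.998779296875000,
]


def lookup_hf_gain(index):
    """Return the codebook entry for a 4-bit HF gain index (0..15)."""
    if not 0 <= index <= 15:
        raise ValueError("HF gain index must fit in 4 bits")
    return HP_GAIN[index]


# 4 bits every 5 ms correspond to the 0.8 kbit/s stated for this mode.
bits_per_second = 4 / 0.005
print(lookup_hf_gain(13), bits_per_second)
```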

Then, the signal is scaled according to this decoded HF gain as follows:

, (2081)

The energy of the excitation is further adjusted by sub-frame under the following conditions. A factor is computed:

(2082)

Here the term 0.6 corresponds to the average magnitude ratio between the frequency response of the de-emphasis filter in the 5000-6400 Hz band. Therefore, the term represents the energy of the high-band excitation that would be obtained at 23.05 kbit/s.

Based on the tilt information of the low-band signal, the scaled extended excitation signal is then computed for as follows:

If >1 or <0:

(2083)

Otherwise:

(2084)

6.8.3.3 LP filter for the high frequency band

The high-band LP synthesis filter is derived from the weighted low-band LP synthesis filter as follows:

(2085)

where is the interpolated LP synthesis filter in each 5-ms sub-frame and =0.9 at 6.6 kbit/s and 0.6 at other modes (from 8.85 to 23.85 kbit/s). This filter was computed by analysing a signal sampled at 12.8 kHz, but it is now used for a 16 kHz signal.

6.8.3.4 High band synthesis

The scaled extended excitation signal in the high band is filtered by the high-band LP synthesis filter to obtain the decoded high-band signal, which is added to the synthesized low band signal to produce the synthesized output signal.
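The weighting of subclause 6.8.3.3 and the synthesis filtering described here can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the weighting is taken to be the usual bandwidth expansion (coefficient i multiplied by gamma to the power i), and the filter starts from a zero state, whereas a decoder would carry the filter memory from one sub-frame to the next. The addition to the low band signal is not shown.

```python
def weight_lp(coeffs, gamma):
    """Coefficients of A(z/gamma) given those of A(z) (coeffs[0] is 1)."""
    return [a * gamma ** i for i, a in enumerate(coeffs)]


def lp_synthesis(excitation, coeffs):
    """Filter excitation through 1/A(z), with A(z) = sum_i coeffs[i] * z**-i.

    Assumes coeffs[0] == 1 and a zero initial filter state.
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i in range(1, len(coeffs)):
            if n - i >= 0:
                acc -= coeffs[i] * out[n - i]
        out.append(acc)
    return out


# Made-up order-2 filter, weighted with the factor used above 6.6 kbit/s.
a_hb = weight_lp([1.0, -1.0, 0.5], 0.6)
impulse_response = lp_synthesis([1.0, 0.0, 0.0, 0.0], [1.0, -0.5])
print(a_hb, impulse_response)
```

A weighting factor below 1 moves the poles of the synthesis filter toward the origin, which flattens the spectral envelope applied to the high-band excitation.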