6.2 Mono Signal High-Band synthesis

3GPP TS 26.290, Release 17: Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions

The synthesis of the HF signal implements a bandwidth extension (BWE) mechanism that uses some data from the LF decoder. It is an evolution of the BWE mechanism used in the AMR-WB speech decoder. The HF decoder is detailed in Figure 16. The HF signal is synthesized in two steps: calculation of the HF excitation signal, and computation of the HF signal from the HF excitation. The HF excitation is obtained by shaping the LF excitation signal in the time domain with scalar factors (gains), one per 64-sample subframe. This HF excitation is post-processed to reduce the "buzziness" of the output, and then filtered by an HF linear-predictive synthesis filter 1/AHF(z). Recall that the LP order used to encode and decode the HF signal is 8. The result is also post-processed to smooth energy variations.

Figure 16: Block diagram of high frequency decoder

The HF decoder synthesizes a 1024-sample HF superframe. This superframe is segmented according to MODE = (m0, m1, m2, m3). More specifically, the decoded frames used in the HF decoder are synchronous with the frames used in the LF decoder. Hence, mk ≤ 1, mk = 2 and mk = 3 indicate a 256-, 512- and 1024-sample frame, respectively. These frames are referred to as HF-256, HF-512 and HF-1024, respectively.

From the synthesis chain described above, it is clear that the only parameters needed for HF decoding are ISF and gain parameters. The ISF parameters represent the filter 1/AHF(z), while the gain parameters are used to shape the LF excitation signal. These parameters are demultiplexed based on MODE and knowing the format of the bitstream.

Control data internal to the HF decoder are generated from the bad frame indicator vector BFI = (bfi0, bfi1, bfi2, bfi3). These data are bfi_isf_hf, BFI_GAIN, and the number of subframes for ISF interpolation. They are defined in more detail below:

bfi_isf_hf is a binary flag indicating loss of the ISF parameters. It is derived from BFI as follows:

For HF-256 in packet k, bfi_isf_hf = bfik,

For HF-512 in packets k and k+1, bfi_isf_hf = bfik,

For HF-1024 (in packets k = 0 to 3), bfi_isf_hf = bfi0.

This definition can be readily understood from the bitstream format. Recall that the ISF parameters for the HF signal are always in the first packet describing HF-256, -512 or –1024 frames.

BFI_GAIN is a binary vector used to signal packet losses to the HF gain decoder: BFI_GAIN = (bfik) for HF-256 in packet k, BFI_GAIN = (bfik, bfik+1) for HF-512 in packets k and k+1, and BFI_GAIN = BFI for HF-1024.

The number of subframes for ISF interpolation refers to the number of 64-sample subframes in the decoded frame. This number is 4 for HF-256, 8 for HF-512 and 16 for HF-1024.
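The derivation of these control data can be sketched as follows. This is only an illustration under stated assumptions: the function name hf_control_data and the dictionary keys are hypothetical, and mode values 0 and 1 are both assumed to denote 256-sample frames.

```python
def hf_control_data(mode, bfi):
    """Split one superframe into HF frames and derive their control data.

    mode: (m0, m1, m2, m3), each in 0..3.
    bfi:  (bfi0, bfi1, bfi2, bfi3), each 0 or 1.
    Returns one dict per decoded HF frame (keys are illustrative names).
    """
    # Assumption: m <= 1 -> 1 packet (256 samples), 2 -> 2 packets, 3 -> 4 packets.
    packets_per_mode = {0: 1, 1: 1, 2: 2, 3: 4}
    frames = []
    k = 0
    while k < 4:
        n_packets = packets_per_mode[mode[k]]
        frames.append({
            "first_packet": k,
            "length": 256 * n_packets,
            # ISF parameters always travel in the first packet of the frame.
            "bfi_isf_hf": bfi[k],
            # One loss flag per packet covered by the frame.
            "bfi_gain": tuple(bfi[k:k + n_packets]),
            # Number of 64-sample subframes: 4, 8 or 16.
            "nb_subframes": 4 * n_packets,
        })
        k += n_packets
    return frames
```

For example, MODE = (2, 2, 1, 0) yields one HF-512 frame followed by two HF-256 frames.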

The ISF vector isf_hf_q is decoded using AR(1) predictive VQ. If bfi_isf_hf = 0, the 2-bit index i1 of the 1st stage and the 7-bit index i2 of the 2nd stage are available and isf_hf_q is given by

isf_hf_q = cb1(i1) + cb2(i2) + mean_isf_hf + μisf_hf * mem_isf_hf

where cb1(i1) is the i1-th codevector of the 1st stage, cb2(i2) is the i2-th codevector of the 2nd stage, mean_isf_hf is the mean ISF vector, μisf_hf = 0.5 is the AR(1) prediction coefficient and mem_isf_hf is the memory of the ISF predictive decoder.

If bfi_isf_hf = 1, the decoded ISF vector corresponds to the previous ISF vector shifted towards the mean ISF vector:

isf_hf_q = αisf_hf * mem_isf_hf + mean_isf_hf

with αisf_hf = 0.9. After calculating isf_hf_q, the ISF reordering defined in AMR-WB speech coding is applied to isf_hf_q with an ISF gap of 9 Fs/1280 Hz. Finally, the memory mem_isf_hf is updated for the next HF frame as:

mem_isf_hf = isf_hf_q - mean_isf_hf

Note that the initial value of mem_isf_hf (at the reset of the decoder) is zero.
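The ISF decoding and its concealment branch can be sketched as below. This is a minimal illustration, not the reference implementation: the function name, the constant names and the argument layout are hypothetical, the codebooks are passed in as plain lists, and the ISF reordering step is left out.

```python
PRED_COEF_ISF_HF = 0.5     # AR(1) prediction coefficient for a received frame
CONCEAL_COEF_ISF_HF = 0.9  # factor applied to the memory when the ISFs are lost


def decode_isf_hf(i1, i2, bfi_isf_hf, mem_isf_hf, cb1, cb2, mean_isf_hf):
    """Return (isf_hf_q, new_mem_isf_hf); all vectors are lists of equal length."""
    if bfi_isf_hf == 0:
        # Two-stage VQ contribution plus mean plus AR(1) prediction.
        isf_hf_q = [c1 + c2 + m + PRED_COEF_ISF_HF * p
                    for c1, c2, m, p in zip(cb1[i1], cb2[i2], mean_isf_hf, mem_isf_hf)]
    else:
        # Concealment: previous ISF vector shifted towards the mean.
        isf_hf_q = [CONCEAL_COEF_ISF_HF * p + m
                    for m, p in zip(mean_isf_hf, mem_isf_hf)]
    # The ISF reordering (minimum gap) would be applied here; omitted in this sketch.
    new_mem = [q - m for q, m in zip(isf_hf_q, mean_isf_hf)]
    return isf_hf_q, new_mem
```

Because the memory stores the mean-removed vector, repeated losses make the decoded ISFs converge geometrically to mean_isf_hf.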

A simple linear interpolation between the ISP parameters of the previous decoded HF frame (HF-256, HF-512 or HF-1024) and the new decoded ISP parameters is performed. The interpolation is conducted in the ISP domain and results in ISP parameters for each 64-sample subframe, according to the formula:

ispsubframe-i = i/nb * ispnew + (1-i/nb) * ispold,

where nb is the number of subframes in the current decoded frame (nb = 4 for HF-256, 8 for HF-512, 16 for HF-1024), i = 0, …, nb-1 is the subframe index, ispold is the set of ISP parameters obtained from the ISF parameters of the previously decoded HF frame, and ispnew is the set of ISP parameters obtained from the newly decoded ISF parameters. The interpolated ISP parameters are then converted into linear-predictive coefficients for each subframe.
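A minimal sketch of this interpolation is given below. It follows the formula exactly as printed, with weights i/nb for i = 0, …, nb-1. The function name interpolate_isp is hypothetical, and the conversion from ISP to LP coefficients is not shown.

```python
def interpolate_isp(isp_old, isp_new, nb):
    """Return nb interpolated ISP vectors, one per 64-sample subframe."""
    subframes = []
    for i in range(nb):
        w = i / nb  # weight of the newly decoded ISP set
        subframes.append([w * new + (1.0 - w) * old
                          for old, new in zip(isp_old, isp_new)])
    return subframes
```

With this indexing, the first subframe reuses the previous ISP set unchanged and the last subframe stops one step short of the new set.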

The computation of the gain gmatch in dB is detailed in the next paragraphs. This gain is interpolated for each 64-sample subframe based on its previous value old_gmatch as:

gesti = i/nb * gmatch + (1-i/nb) * old_gmatch,

where nb is the number of subframes in the current decoded frame (nb = 4 for HF-256, 8 for HF-512, 16 for HF-1024), and i = 0, …, nb-1 is the subframe index. This results in a vector (gest0, gest1, …, gestnb-1) of estimated gains in dB.

Gain estimation computation to match magnitude at Fs/4 kHz

Same as section 5.6 (Figure 9)

Decoding of correction gains and gain computation

Recall that after gain interpolation the HF decoder has the estimated gains (gest0, gest1, …, gestnb-1) in dB for each of the nb subframes of the current decoded frame, where nb = 4, 8 and 16 in HF-256, HF-512 and HF-1024, respectively. The correction gains in dB are then decoded and added to the estimated gains per subframe to form the decoded gains ĝ0, ĝ1, …, ĝnb-1:

(ĝ0 (dB), ĝ1 (dB), …, ĝnb-1 (dB)) = (gest0, gest1, …, gestnb-1) + (ḡ0, ḡ1, …, ḡnb-1)

where the correction gains are

(ḡ0, ḡ1, …, ḡnb-1) = (gc10, gc11, …, gc1nb-1) + (gc20, gc21, …, gc2nb-1).

Therefore, the gain decoding corresponds to the decoding of predictive two-stage VQ-scalar quantization, where the prediction is given by the interpolated Fs/4 kHz junction matching gain. The quantization dimension is variable and is equal to nb.

Decoding of the 1st stage:

The 7-bit index 0 ≤ idx ≤ 127 of the 1st stage 4-dimensional HF gain codebook is decoded into 4 gains (G0, G1, G2, G3). A bad frame indicator bfi = BFI_GAIN0 in HF-256, HF-512 and HF-1024 is used to handle packet losses. If bfi = 0, these gains are decoded as

(G0, G1, G2, G3) = cb_gain_hf(idx) + mean_gain_hf

where cb_gain_hf(idx) is the idx-th codevector of the codebook cb_gain_hf. If bfi = 1, a memory past_gain_hf_q is shifted towards –20 dB:

past_gain_hf_q := αgain_hf * (past_gain_hf_q + 20) – 20,

where αgain_hf = 0.9, and the 4 gains (G0, G1, G2, G3) are set to the same value:

Gk = past_gain_hf_q + mean_gain_hf, for k = 0,1,2 and 3

Then the memory past_gain_hf_q is updated as:

past_gain_hf_q:= (G0 + G1 + G2 + G3)/4 – mean_gain_hf.

The computation of the 1st stage reconstruction is then given as:

HF-256: (gc10, gc11, gc12 , gc13) = (G0, G1, G2, G3).

HF-512: (gc10, gc11, …, gc17) = (G0, G0, G1, G1, G2, G2, G3, G3).

HF-1024: (gc10, gc11, …, gc115) = (G0, G0, G0, G0, G1, G1, G1, G1, G2, G2, G2, G2, G3, G3, G3, G3).
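The 1st stage decoding, its concealment and the expansion to nb subframes can be sketched as follows. The function and constant names are hypothetical, the codebook is passed in as a list of 4-element lists, and mean_gain_hf is assumed to be a scalar in dB.

```python
CONCEAL_COEF_GAIN_HF = 0.9  # factor pulling the gain memory towards -20 dB


def decode_gain_stage1(idx, bfi, past_gain_hf_q, cb_gain_hf, mean_gain_hf, nb):
    """Return (gc1, new_past_gain_hf_q); gc1 holds nb gains in dB."""
    if bfi == 0:
        gains = [c + mean_gain_hf for c in cb_gain_hf[idx]]
    else:
        # Concealment: shift the memory towards -20 dB, reuse it for all 4 gains.
        past_gain_hf_q = CONCEAL_COEF_GAIN_HF * (past_gain_hf_q + 20.0) - 20.0
        gains = [past_gain_hf_q + mean_gain_hf] * 4
    # Memory update: mean-removed average of the 4 decoded gains.
    new_past = sum(gains) / 4.0 - mean_gain_hf
    # Each decoded gain covers nb/4 consecutive subframes (1, 2 or 4).
    repeat = nb // 4
    gc1 = [g for g in gains for _ in range(repeat)]
    return gc1, new_past
```

In the concealment branch the memory update leaves past_gain_hf_q at its shifted value, since all four gains are identical.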

Decoding of 2nd stage:

In HF-256, (gc20, gc21, gc22, gc23) is simply set to (0, 0, 0, 0) and there is no real 2nd stage decoding. In HF-512, the 2-bit index 0 ≤ idxi ≤ 3 of the i-th subframe, where i = 0, …, 7, is decoded as:

If bfi = 0, gc2i = 3 * idxi – 4.5, else gc2i = 0.

In HF-1024, the 3-bit index 0 ≤ idxi ≤ 7 of the i-th subframe, where i = 0, …, 15, is decoded as:

If bfi = 0, gc2i = 3 * idxi – 10.5, else gc2i = 0.

In HF-512 the magnitude of the second scalar refinement is up to ±4.5 dB and in HF-1024 up to ±10.5 dB. In both cases, the quantization step is 3 dB.
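The 2nd stage scalar decoding can be sketched as below. The function name is hypothetical, and a single bfi flag is assumed for the whole frame, since the text does not spell out how the flag is chosen for each subframe.

```python
def decode_gain_stage2(indices, bfi, nb):
    """Return the nb second-stage refinements gc2 in dB.

    indices: per-subframe scalar indices (ignored when nb == 4 or bfi != 0).
    """
    if nb == 4:
        # 256-sample frames carry no second-stage refinement.
        return [0.0] * 4
    if bfi != 0:
        # Lost data: no refinement is applied.
        return [0.0] * nb
    # 2-bit indices (0..3) for nb == 8, 3-bit indices (0..7) for nb == 16.
    offset = 4.5 if nb == 8 else 10.5
    return [3.0 * idx - offset for idx in indices]
```

The offsets centre the uniform 3 dB grid on zero, which gives the ranges of ±4.5 dB and ±10.5 dB.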

HF gain reconstruction:

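The gain for each subframe is then computed by converting the decoded gain from dB to the linear domain:

gi = 10^(ĝi (dB) / 20), i = 0, …, nb-1,

where ĝi (dB) is the decoded gain of subframe i in dB.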
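The complete gain chain, from the interpolated matching gain to per-subframe gains, can be sketched as follows. Both function names are hypothetical. The dB-to-linear conversion 10^(g/20) is an assumption of this sketch: it treats the gains as amplitude gains applied to the excitation samples.

```python
def interpolate_gain(gmatch, old_gmatch, nb):
    """Estimated gain in dB for each of the nb subframes."""
    return [(i / nb) * gmatch + (1.0 - i / nb) * old_gmatch for i in range(nb)]


def reconstruct_hf_gains(g_est_db, gc1_db, gc2_db):
    """Return (gains_db, gains_linear), one entry per subframe."""
    # Decoded gain in dB = estimated gain + 1st stage + 2nd stage correction.
    gains_db = [e + a + b for e, a, b in zip(g_est_db, gc1_db, gc2_db)]
    # Assumption: amplitude gain, so the linear value is 10**(dB / 20).
    gains_linear = [10.0 ** (g / 20.0) for g in gains_db]
    return gains_db, gains_linear
```

Working in dB until the last step keeps the prediction and the two correction stages additive.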

Buzziness reduction and energy smoothing

The role of buzziness reduction is to attenuate pulses in the time-domain HF excitation signal rHF(n), which often cause the audio output to sound "buzzy". Pulses are detected by checking whether the absolute value |rHF(n)| > 2 * thres(n), where thres(n) is an adaptive threshold corresponding to the time-domain envelope of rHF(n). The samples rHF(n) that are detected as pulses are limited to ±2 * thres(n), where ± is the sign of rHF(n).

Each sample rHF(n) of the HF excitation is filtered by a 1st order low-pass filter 0.02/(1 – 0.98 z^-1) to update thres(n). Note that the initial value of thres(n) (at the reset of the decoder) is 0. The amplitude of the pulse attenuation is given by:

Δ = max(|rHF(n)| – 2 * thres(n), 0.0).

Thus, Δ is set to 0 if the current sample is not detected as a pulse, which leaves rHF(n) unchanged. Then, the current value thres(n) of the adaptive threshold is changed as:

thres(n) := thres(n) + 0.5 * Δ.

Finally, each sample rHF(n) is modified to: r'HF(n) = rHF(n) – Δ if rHF(n) ≥ 0, and r'HF(n) = rHF(n) + Δ otherwise.
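The per-sample procedure can be sketched as below. The function name is hypothetical, and two points are assumptions of this sketch: the low-pass filter is fed with the absolute value of the sample (since the threshold tracks an envelope), and the steps run in the order filter, detect, raise threshold, attenuate.

```python
def reduce_buzziness(r_hf, thres=0.0):
    """Attenuate pulses in the HF excitation; return (samples, final threshold).

    thres is the filter state carried across calls (0 at decoder reset).
    """
    out = []
    for x in r_hf:
        mag = abs(x)
        # 1st order low-pass 0.02 / (1 - 0.98 z^-1), assumed to act on |x|.
        thres = 0.98 * thres + 0.02 * mag
        # Amount by which the sample exceeds twice the envelope (0 if no pulse).
        attenuation = max(mag - 2.0 * thres, 0.0)
        # A detected pulse also raises the threshold.
        thres += 0.5 * attenuation
        # Pull the sample towards zero, keeping its sign.
        out.append(x - attenuation if x >= 0.0 else x + attenuation)
    return out, thres
```

A detected pulse ends up with magnitude equal to twice the threshold value computed before the threshold is raised.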

The short-term energy variations of the HF synthesis sHF(n) are then smoothed. The energy is measured per subframe. The energy of each subframe is modified by up to ±1.5 dB based on an adaptive threshold.

For a given subframe sHF(n), n = 0, …, 63, the subframe energy is calculated as

ε = sHF(0)^2 + sHF(1)^2 + … + sHF(63)^2.

The value t of the threshold is updated as:

t := min(ε * 1.414, t), if ε < t,

t := max(ε / 1.414, t), otherwise.

The current subframe is then scaled by sqrt(t / ε):

s'HF(n) = sqrt(t / ε) * sHF(n), n = 0, …, 63.
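The energy smoothing of one subframe can be sketched as follows. The function name is hypothetical. The energy is assumed to be the plain sum of squares, the scale factor is assumed to be the square root of the threshold-to-energy ratio so that the scaled subframe has energy equal to the updated threshold, and the guard for a silent subframe is an addition of this sketch.

```python
import math


def smooth_energy(s_hf, t):
    """Smooth the energy of one 64-sample subframe; return (samples, new t)."""
    energy = sum(x * x for x in s_hf)
    # Threshold update: t moves towards the energy, and ends up within a
    # factor 1.414 (about 1.5 dB) of it.
    if energy < t:
        t = min(energy * 1.414, t)
    else:
        t = max(energy / 1.414, t)
    if energy <= 0.0:
        # Silent subframe: nothing to scale (guard not taken from the text).
        return list(s_hf), t
    # Assumed scale factor: brings the subframe energy to the threshold t.
    scale = math.sqrt(t / energy)
    return [scale * x for x in s_hf], t
```

Since 10 * log10(1.414) is about 1.5, clamping t to within a factor 1.414 of the measured energy limits the change to ±1.5 dB.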