5.2 Principle of the hybrid ACELP/TCX core encoding

3GPP TS 26.290, Release 17: Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions

The encoding algorithm at the core of the AMR-WB+ codec is based on a hybrid ACELP/TCX model. For every block of input signal, the encoder decides (either in open-loop or closed-loop) which encoding model (ACELP or TCX) is best. The ACELP model is a time-domain, predictive encoder, best suited for speech and transient signals. The AMR-WB encoder is used in ACELP modes. Alternatively, the TCX model is a transform-based encoder, and is more appropriate for typical music samples. Frame lengths of variable sizes are possible in TCX mode, as will be explained in Section 5.2.1.

In Sections 5.2.1 to 5.2.4, the general principles of the hybrid ACELP/TCX core encoder will be presented. Then Section 5.3 and its subsections will give the details of the ACELP and TCX encoding modes.

5.2.1 Timing chart of the ACELP and TCX modes

The ACELP/TCX core encoder takes a mono signal as input, at a sampling frequency of Fs/2 kHz. This signal is processed in super-frames of 1024 samples in duration. Within each 1024-sample super-frame, several encoding modes are possible, depending on the signal structure. These modes are: 256-sample ACELP, 256-sample TCX, 512-sample TCX and 1024-sample TCX. These encoding modes will be described further, but first we look at the different possible mode combinations, described by a timing chart.

Figure 4 shows the timing chart of all possible modes within a 1024-sample super-frame. As the figure shows, each 256-sample frame within a super-frame can be in one of four possible modes, which we call ACELP, TCX256, TCX512 and TCX1024. In ACELP mode, the corresponding 256-sample frame is encoded with AMR-WB. In TCX256 mode, the frame is encoded using TCX with a 256-sample support, plus 32 samples of look-ahead used for overlap-and-add, since TCX is a transform coding approach. The TCX512 mode means that two consecutive 256-sample frames are grouped and encoded as a single 512-sample block, using TCX with a 512-sample support plus 64 samples of look-ahead. Note that the TCX512 mode is only allowed when grouping either the first two 256-sample frames of the super-frame or the last two. Finally, the TCX1024 mode indicates that all four 256-sample frames within the super-frame are grouped and encoded as a single block, using TCX with a 1024-sample support plus 128 samples of look-ahead.

Figure 4: Timing chart of the frame types

5.2.2 ACELP/TCX mode combinations and mode encoding

From Figure 4, there are exactly 26 different ACELP/TCX mode combinations within a 1024-sample super-frame. These are shown in Table 10.

Table 10: Possible mode combinations in a 1024-sample super-frame

(0, 0, 0, 0)	(0, 0, 0, 1)	(2, 2, 0, 0)
(1, 0, 0, 0)	(1, 0, 0, 1)	(2, 2, 1, 0)
(0, 1, 0, 0)	(0, 1, 0, 1)	(2, 2, 0, 1)
(1, 1, 0, 0)	(1, 1, 0, 1)	(2, 2, 1, 1)
(0, 0, 1, 0)	(0, 0, 1, 1)	(0, 0, 2, 2)
(1, 0, 1, 0)	(1, 0, 1, 1)	(1, 0, 2, 2)
(0, 1, 1, 0)	(0, 1, 1, 1)	(0, 1, 2, 2)	(2, 2, 2, 2)
(1, 1, 1, 0)	(1, 1, 1, 1)	(1, 1, 2, 2)	(3, 3, 3, 3)

We interpret each quadruplet of numbers (m₀, m₁, m₂, m₃) in Table 10 as follows: m_k is the mode indication for the k^th 256-sample frame in the 1024-sample super-frame, where m_k can take the following values:

– m_k = 0 means the mode for frame k is 256-sample ACELP

– m_k = 1 means the mode for frame k is 256-sample TCX

– m_k = 2 means the mode for frame k is 512-sample TCX

– m_k = 3 means the mode for frame k is 1024-sample TCX

Obviously, when the first 256-sample frame is in mode "2" (512-sample TCX), the second 256-sample frame must also be in mode 2. Similarly, when the third 256-sample frame is in mode "2" (512-sample TCX), the fourth 256-sample frame must also be in mode 2. And there is only one possible mode configuration including the value "3" (1024-sample TCX), namely all four 256-sample frames are in the same mode (m_k = 3 for k = 0, 1, 2 and 3). This rigid frame structure can be exploited to aid in frame erasure concealment.
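As a quick check of the count, the grouping rules above can be enumerated directly. This is only an illustrative sketch: the helper name `is_valid` and the constant `VALID_COMBOS` are hypothetical and do not appear in the specification.

```python
from itertools import product

def is_valid(combo):
    """Check a (m0, m1, m2, m3) quadruplet against the grouping rules."""
    # Mode 3 (TCX1024) is only allowed when it covers all four frames.
    if 3 in combo:
        return combo == (3, 3, 3, 3)
    # Mode 2 (TCX512) must cover both frames of a half super-frame.
    for a, b in (combo[:2], combo[2:]):
        if (a == 2) != (b == 2):
            return False
    return True

VALID_COMBOS = [c for c in product(range(4), repeat=4) if is_valid(c)]
print(len(VALID_COMBOS))  # 26
```

Each half of the super-frame admits five patterns, (0, 0), (0, 1), (1, 0), (1, 1) and (2, 2), giving 25 combinations, and (3, 3, 3, 3) is the 26th.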

As discussed above, the parameters for each 1024-sample super-frame are actually decomposed into four frames of identical size. To increase robustness, the mode bits are actually sent as two bits (the values of m_k) in each transmitted frame. For example, if the superframe is encoded in a full 1024-sample TCX frame, which is then decomposed into four packets of equal size, then each of these four packets will contain the binary value "11" (mode m_k = 3) as mode indicator.
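The per-packet mode indicator can be illustrated with a short sketch. The function name `mode_indicators` is hypothetical, and writing the two bits most significant bit first is an assumption made here for display only; the text above does not state the bit order within the packet.

```python
def mode_indicators(combo):
    """Return the 2-bit mode indicator carried by each of the four packets."""
    if len(combo) != 4 or any(m not in (0, 1, 2, 3) for m in combo):
        raise ValueError("expected four mode values in 0..3")
    # One indicator per 256-sample frame, formatted as two binary digits.
    return [format(m, "02b") for m in combo]

print(mode_indicators((3, 3, 3, 3)))  # ['11', '11', '11', '11']
```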

5.2.3 ACELP/TCX closed-loop mode selection

The best mode combination out of the 26 possible combinations of Table 10 is determined in closed-loop. This means that the signal in each 256-sample frame within a 1024-sample super-frame has to be encoded in several modes before the best combination is selected. This closed-loop approach is explained in Figure 5.

The left portion of Figure 5 (Trials) shows what encoding mode is applied to each 256-sample frame in 11 successive trials. Fr0 to Fr3 refer to Frame 0 to Frame 3 in the super-frame. The trial number (1 to 11) indicates a step in the closed-loop mode-selection process. Note that each 256-sample frame is involved in only four of the 11 encoding trials. When more than 1 frame is involved in a trial (lines 5, 10 and 11 of Figure 5), then TCX of the corresponding length is applied (TCX512 or TCX1024). The right portion of Figure 5 gives an example of mode selection, where the final decision (after Trial 11) is 1024-sample TCX. This would result in sending a value of 3 for the mode in all four packets for this super-frame. Bold numbers in the example at the right of Figure 5 show at what point a mode decision is taken in the intermediate steps of the mode selection process. The final mode decision is only known after Trial 11.

The mode selection process shown in Figure 5 proceeds as follows. First, in trials 1 and 2, ACELP (AMR-WB) and then 256-sample TCX encoding are tried in the first 256-sample frame (Fr0). Then, a mode selection is made for Fr0 between these two modes. The selection criterion is the average segmental SNR between the weighted speech x_w(n) and the synthesized weighted speech \hat{x}_w(n). The segmental SNR in subframe i is defined as

$$
\mathrm{segSNR}_i = 20 \log_{10} \left( \frac{\sum_{n=0}^{N-1} x_w^2(n)}{\sum_{n=0}^{N-1} \left( x_w(n) - \hat{x}_w(n) \right)^2} \right)
$$

where N is the length of the subframe (equivalent to a 64-sample sub-frame in the encoder). Then, the average segmental SNR is defined as

$$
\overline{\mathrm{segSNR}} = \frac{1}{N_{SF}} \sum_{i=0}^{N_{SF}-1} \mathrm{segSNR}_i
$$

where N_SF is the number of subframes in the frame. Since a frame can be either 256, 512 or 1024 samples in length, N_SF can be either 4, 8 or 16. In the example of Figure 5, we assume that, according to the decision criterion, ACELP mode was retained over TCX. Then, in trials 3 and 4, the same mode comparison is made for Fr1 between ACELP and 256-sample TCX. Here, we assume that 256-sample TCX was better than ACELP, based again on the segmental SNR measure described above. This choice is indicated in bold on line 4 of the example at the right of Figure 5. Then, in trial 5, Fr0 and Fr1 are grouped together to form a 512-sample frame, which is encoded using 512-sample TCX. The algorithm now has to choose between 512-sample TCX for the first two frames and the previously selected sequence, ACELP in the first frame and TCX256 in the second frame. In this example, on line 5 in bold, the sequence ACELP-TCX256 was selected over TCX512, according to the segmental SNR criterion.
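The average segmental SNR criterion can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name `average_seg_snr`, the logarithm scale factor `scale` and the small guard `eps` against empty or perfectly matched subframes are assumptions of this sketch. A constant scale factor does not change which of two candidate encodings scores higher.

```python
import math

SUBFRAME_LEN = 64  # subframe length N in samples

def average_seg_snr(x_w, x_hat_w, scale=20.0, eps=1e-12):
    """Average segmental SNR (dB) between x_w and its synthesis x_hat_w.

    scale and eps are assumptions of this sketch, not values from the text.
    """
    if len(x_w) != len(x_hat_w) or len(x_w) == 0 or len(x_w) % SUBFRAME_LEN:
        raise ValueError("inputs must be equal-length multiples of 64 samples")
    n_sf = len(x_w) // SUBFRAME_LEN  # 4, 8 or 16 for 256/512/1024-sample frames
    total = 0.0
    for i in range(n_sf):
        lo, hi = i * SUBFRAME_LEN, (i + 1) * SUBFRAME_LEN
        signal = sum(v * v for v in x_w[lo:hi])
        noise = sum((a - b) ** 2 for a, b in zip(x_w[lo:hi], x_hat_w[lo:hi]))
        # Log-ratio of signal energy to error energy in subframe i.
        total += scale * math.log10((signal + eps) / (noise + eps))
    return total / n_sf
```

Averaging the per-subframe values in the log domain, rather than computing one SNR over the whole frame, keeps a single loud subframe from dominating the decision.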

	TRIALS (11)				Example of selection (in bold = comparison is made)
	Fr 0	Fr 1	Fr 2	Fr 3	Fr 0	Fr 1	Fr 2	Fr 3
1	ACELP				ACELP
2	TCX256				ACELP
3		ACELP			ACELP	ACELP
4		TCX256			ACELP	TCX256
5	TCX512	TCX512			ACELP	TCX256
6			ACELP		ACELP	TCX256	ACELP
7			TCX256		ACELP	TCX256	TCX256
8				ACELP	ACELP	TCX256	TCX256	ACELP
9				TCX256	ACELP	TCX256	TCX256	TCX256
10			TCX512	TCX512	ACELP	TCX256	TCX512	TCX512
11	TCX1024	TCX1024	TCX1024	TCX1024	TCX1024	TCX1024	TCX1024	TCX1024

Figure 5: Closed-loop selection of ACELP/TCX mode combination

The same procedure as trials 1 to 5 is then applied to the third and fourth frames (Fr2 and Fr3), in trials 6 to 10. After trial 10, in the example of Figure 5, the four 256-sample frames are classified as: ACELP for Fr0, then TCX256 for Fr1, then TCX512 for Fr2 and Fr3 grouped together. A last trial (line 11) is then performed where all four 256-sample frames (the whole super-frame) are encoded with 1024-sample TCX. Using the segmental SNR criterion, again with 64-sample segments, this is compared with the signal encoded using the mode selection retained after trial 10. In this example, the final mode decision is 1024-sample TCX for the whole super-frame. The mode bits for the four 256-sample frames would then be (3, 3, 3, 3), as listed in Table 10.
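The 11-trial procedure can be summarised in a short sketch. It assumes a caller-supplied function `trial_snr(mode, first_frame, n_frames)`, a hypothetical name, that encodes the given frames in the given mode and returns the average segmental SNR. Breaking ties in favour of the shorter modes is also an assumption, since the text does not say how equal scores are resolved.

```python
def closed_loop_select(trial_snr):
    """Run the 11 trials and return the mode quadruplet (m0, m1, m2, m3)."""
    modes = [0, 0, 0, 0]
    half_snr = []
    for start in (0, 2):                        # first half, then second half
        frame_snr = []
        for k in (start, start + 1):            # trials 1-4 and 6-9
            acelp = trial_snr("ACELP", k, 1)
            tcx = trial_snr("TCX256", k, 1)
            modes[k] = 1 if tcx > acelp else 0
            frame_snr.append(max(acelp, tcx))
        # Both frames hold the same number of subframes, so the average
        # segmental SNR of the pair is the mean of the two frame values.
        pair = sum(frame_snr) / 2
        tcx512 = trial_snr("TCX512", start, 2)  # trials 5 and 10
        if tcx512 > pair:
            modes[start] = modes[start + 1] = 2
            pair = tcx512
        half_snr.append(pair)
    tcx1024 = trial_snr("TCX1024", 0, 4)        # trial 11
    if tcx1024 > sum(half_snr) / 2:
        modes = [3, 3, 3, 3]
    return tuple(modes)
```

Only 11 encodings are needed because each comparison keeps a single winner for its span, so the 26 combinations of Table 10 are never encoded exhaustively.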

5.2.4 ACELP/TCX open-loop mode selection

The alternative method for ACELP/TCX mode selection is the low-complexity open-loop method. The open-loop mode selection is divided into three stages: excitation classification (EC), excitation classification refinement (ECR) and TCX selection (TCXS). In EC and ECR, the mode selection is done in a purely open-loop manner. Whether the TCXS algorithm is used depends on the outcome of EC and ECR; TCXS itself is a closed-loop TCX mode selection.

1. stage

The first stage excitation classification is done before LP analysis. The EC algorithm is based on the frequency content of the input signal using the VAD algorithm filter bank.

The AMR-WB VAD produces the signal energy E(n) in 12 non-uniform bands over the frequency range from 0 to Fs/4 kHz for every 256-sample frame. The energy level E(n) of each band is then normalised by dividing it by the width of that band in Hz, producing the normalised energy level E_N(n), where n is the band number from 0 to 11. Index 0 refers to the lowest sub-band.

For each of the 12 bands, the standard deviation of the energy levels is calculated using two windows: a short window, giving std_short(n), and a long window, giving std_long(n). The lengths of the short and long windows are 4 and 16 frames, respectively. In these calculations, the 12 energy levels from the current frame together with those of the past 3 or 15 frames are used to derive the two standard deviation values stda_short and stda_long. The standard deviation calculation is performed only when the VAD indicates an active signal.
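A minimal sketch of the windowed standard deviation follows. Several points are assumptions of this sketch rather than statements from the text: stda_short and stda_long are taken to be the mean of the 12 per-band standard deviations, the population form of the standard deviation is used, and inactive frames are simply not stored. The class name `BandStdTracker` is hypothetical.

```python
from collections import deque
from statistics import pstdev

N_BANDS = 12
SHORT_WIN, LONG_WIN = 4, 16  # window lengths in frames

class BandStdTracker:
    """Track per-band energy levels and derive (stda_short, stda_long)."""

    def __init__(self):
        # Keeps the current frame plus up to 15 past active frames.
        self.history = deque(maxlen=LONG_WIN)

    def update(self, band_levels, vad_active):
        if not vad_active:
            return None  # no calculation for inactive frames
        if len(band_levels) != N_BANDS:
            raise ValueError("expected 12 normalised band levels")
        self.history.append(tuple(band_levels))
        frames = list(self.history)

        def mean_band_std(window):
            recent = frames[-window:]
            stds = [pstdev([f[n] for f in recent]) for n in range(N_BANDS)]
            return sum(stds) / N_BANDS  # assumed averaging over the bands

        return mean_band_std(SHORT_WIN), mean_band_std(LONG_WIN)
```

Until 16 active frames have been seen, the long window in this sketch simply uses the frames available so far.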

The relation between the lower and higher frequency bands is calculated in each frame. The energy of the lower frequency bands 1 to 7 is normalised by dividing it by the length of these bands in Hz, giving LevL. The higher frequency bands 8 to 11 are normalised in the same way to give LevH. Note that the lowest band 0 is not used in these calculations, because it usually contains so much energy that it would distort the calculations and make the contributions from the other bands too small. From these measurements the relation LPH = LevL / LevH is defined. In addition, a moving average LPHa is calculated for each frame using the current and 3 past LPH values. The final measurement of the low and high frequency relation for the current frame, LPHaF, is calculated as a weighted sum of the current and 7 past LPHa values, with slightly more weight given to the latest values.
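The chain LPH, LPHa, LPHaF can be sketched as below. The band widths are supplied by the caller, and the weights used for LPHaF are purely hypothetical placeholders, because the text only says that the latest values receive slightly more weight. Normalising the summed band energies by the summed band widths is one reading of the description, and the class name `LowHighRelation` is illustrative. The sketch does not guard against LevH being zero.

```python
from collections import deque

class LowHighRelation:
    """Compute LPHaF from the 12 band energies of successive frames."""

    def __init__(self, band_widths_hz,
                 weights=(1.0, 1.0, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2)):
        if len(band_widths_hz) != 12:
            raise ValueError("expected 12 band widths")
        self.widths = list(band_widths_hz)
        self.weights = list(weights)      # oldest to newest, hypothetical values
        self.lph = deque(maxlen=4)        # current and 3 past LPH values
        self.lpha = deque(maxlen=len(self.weights))  # current and 7 past LPHa

    def update(self, band_energy):
        # Band 0 is excluded; bands 1-7 are "low", bands 8-11 are "high".
        lev_l = sum(band_energy[1:8]) / sum(self.widths[1:8])
        lev_h = sum(band_energy[8:12]) / sum(self.widths[8:12])
        self.lph.append(lev_l / lev_h)
        self.lpha.append(sum(self.lph) / len(self.lph))
        # Use the newest weights while fewer than 8 LPHa values exist.
        w = self.weights[-len(self.lpha):]
        return sum(wi * v for wi, v in zip(w, self.lpha)) / sum(w)
```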

The average level (AVL) in the current frame is calculated by subtracting the estimated background noise level from each filter bank level, after which the filter bank levels are normalised to compensate for the high frequency bands containing relatively less energy than the lower bands. In addition, the total energy of the current frame, TotE_0, is derived from all the filter bank levels after subtracting the background noise estimate of each filter bank. The total energy of the previous frame is correspondingly TotE_-1.

After calculating these measurements, a choice between ACELP and TCX excitation is made by using the following pseudo-code:

    if (stda_long < 0.4)
        SET TCX_MODE
    else if (LPHaF > 280)
        SET TCX_MODE
    else if (stda_long >= 0.4)
        if ((5+(1/(stda_long-0.4))) > LPHaF)
            SET TCX_MODE
        else if ((-90*stda_long+120) < LPHaF)
            SET ACELP_MODE
        else
            SET UNCERTAIN_MODE

    if (ACELP_MODE or UNCERTAIN_MODE) and (AVL > 2000)
        SET TCX_MODE

    if (UNCERTAIN_MODE)
        if (stda_short < 0.2)
            SET TCX_MODE
        else if (stda_short >= 0.2)
            if ((2.5+(1/(stda_short-0.2))) > LPHaF)
                SET TCX_MODE
            else if ((-90*stda_short+140) < LPHaF)
                SET ACELP_MODE
            else
                SET UNCERTAIN_MODE

    if (UNCERTAIN_MODE)
        if ((TotE_0 / TotE_-1) > 25)
            SET ACELP_MODE

    if (TCX_MODE or UNCERTAIN_MODE)
        if (AVL > 2000 and TotE_0 < 60)
            SET ACELP_MODE
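The first-stage decision can be written as a small runnable function. This is a sketch under stated assumptions, not the reference code: the nesting of the conditions is inferred from the pseudo-code, the term 1/(stda - threshold) is treated as +infinity when the standard deviation equals the threshold exactly, and the energy ratio test is rewritten as a product to avoid dividing by a zero TotE_-1 (energies are assumed non-negative). All function and parameter names are illustrative.

```python
import math

TCX, ACELP, UNCERTAIN = "TCX_MODE", "ACELP_MODE", "UNCERTAIN_MODE"

def _bound(offset, std, knee):
    """offset + 1/(std - knee), taken as +infinity when std equals knee."""
    diff = std - knee
    return math.inf if diff == 0 else offset + 1.0 / diff

def classify_excitation(stda_long, stda_short, lphaf, avl, tot_e0, tot_e_prev):
    """First-stage (EC) decision; returns TCX, ACELP or UNCERTAIN."""
    # Long-window test.
    if stda_long < 0.4 or lphaf > 280:
        mode = TCX
    elif _bound(5.0, stda_long, 0.4) > lphaf:
        mode = TCX
    elif -90.0 * stda_long + 120.0 < lphaf:
        mode = ACELP
    else:
        mode = UNCERTAIN

    if mode in (ACELP, UNCERTAIN) and avl > 2000:
        mode = TCX

    # Short-window test, only for frames that are still uncertain.
    if mode == UNCERTAIN:
        if stda_short < 0.2 or _bound(2.5, stda_short, 0.2) > lphaf:
            mode = TCX
        elif -90.0 * stda_short + 140.0 < lphaf:
            mode = ACELP

    # TotE_0 / TotE_-1 > 25, written as a product (assumption, see lead-in).
    if mode == UNCERTAIN and tot_e0 > 25 * tot_e_prev:
        mode = ACELP

    # High average level but low total energy falls back to ACELP.
    if mode in (TCX, UNCERTAIN) and avl > 2000 and tot_e0 < 60:
        mode = ACELP
    return mode
```

For example, with stda_long = 1.0 and LPHaF = 100 the long-window test gives ACELP, since -90 * 1.0 + 120 = 30 is below 100, and the result stays ACELP as long as AVL does not exceed 2000.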

2. stage

ECR is done after the open-loop LTP analysis.

If the VAD flag is set and the mode has been classified by the EC algorithm as uncertain (defined as TCX_OR_ACELP), the mode is selected as follows:

    if (SD_n > 0.2)
        Mode = ACELP_MODE
    else
        if (LagDif_buf < 2)
            if (Lag_n == HIGH LIMIT or Lag_n == LOW LIMIT)
                if (Gain_n - NormCorr_n < 0.1 and NormCorr_n > 0.9)
                    Mode = ACELP_MODE
                else
                    Mode = TCX_MODE
            else if (Gain_n - NormCorr_n < 0.1 and NormCorr_n > 0.88)
                Mode = ACELP_MODE
            else if (Gain_n - NormCorr_n > 0.2)
                Mode = TCX_MODE
            else
                NoMtcx = NoMtcx + 1
        if (MaxEnergy_buf < 60)
            if (SD_n > 0.15)
                Mode = ACELP_MODE
            else
                NoMtcx = NoMtcx + 1

Where spectral distance, SD_n, of the frame n is calculated from ISP parameters as follows:

where ISP_n is the ISP coefficient vector of frame n and ISP_n(i) is its i-th element.

LagDif_buf is the buffer containing the open-loop lag values of the previous ten frames (256 samples).

Lag_n contains two open loop lag values of the current frame n.

Gain_n contains two LTP gain values of the current frame n.

NormCorr_n contains two normalised correlation values of the current frame n.

MaxEnergy_buf is the maximum value of the buffer containing energy values. The energy buffer contains last six values of current and previous frames (256 samples).

lph_n indicates the spectral tilt.
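The refinement of an uncertain frame can be sketched as a function. This sketch rests on several assumptions: the nesting of the conditions is inferred from the pseudo-code, the quantities Lag_n, Gain_n and NormCorr_n are passed as single scalars although the text says each holds two values per frame, LagDif_buf and MaxEnergy_buf are passed as already reduced scalars, and the lag limits are parameters because their values are not given here. All names are illustrative.

```python
TCX, ACELP, UNCERTAIN = "TCX_MODE", "ACELP_MODE", "TCX_OR_ACELP"

def refine_uncertain(sd, lag_dif, lag, gain, norm_corr, max_energy,
                     lag_low, lag_high, no_mtcx=0):
    """Second-stage (ECR) refinement; returns (mode, no_mtcx)."""
    mode = UNCERTAIN
    if sd > 0.2:
        # Large spectral distance between frames: choose ACELP.
        return ACELP, no_mtcx
    if lag_dif < 2:
        diff = gain - norm_corr
        if lag in (lag_low, lag_high):
            # Lag sits on a search limit: apply the stricter test.
            mode = ACELP if (diff < 0.1 and norm_corr > 0.9) else TCX
        elif diff < 0.1 and norm_corr > 0.88:
            mode = ACELP
        elif diff > 0.2:
            mode = TCX
        else:
            no_mtcx += 1
    if max_energy < 60:
        if sd > 0.15:
            mode = ACELP
        else:
            no_mtcx += 1
    return mode, no_mtcx
```

A frame can leave this step still marked uncertain, for instance when the lag difference is 2 or more and the maximum energy is 60 or more.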

If the VAD flag is set and the mode has been classified by the EC algorithm as ACELP mode, the mode decision is verified according to the following algorithm, in which the mode can be switched to TCX mode.

    if (LagDif_buf < 2)
        if (NormCorr_n < 0.80 and SD_n < 0.1)
            Mode = TCX_MODE

    if (lph_n > 200 and SD_n < 0.1)
        Mode = TCX_MODE

If the VAD flag is set in the current frame, the VAD flag was zero in at least one of the frames of the previous super-frame, and the mode has been selected as TCX mode, the usage of TCX1024 is disabled (the flag NoMtcx is set).

    if (vadFlag_old == 0 and vadFlag == 1 and Mode == TCX_MODE)
        NoMtcx = NoMtcx + 1

If the VAD flag is set and the mode has been classified as uncertain mode (TCX_OR_ACELP) or TCX mode, the mode decision is verified according to the following algorithm.

    if (Gain_n - NormCorr_n < 0.006 and NormCorr_n > 0.92 and Lag_n > 21)
        DFTSum = 0
        for (i=1; i<40; i++)
            DFTSum = DFTSum + mag[i]
        if (DFTSum > 95 and mag[0] < 5)
            Mode = TCX_MODE
        else
            Mode = ACELP_MODE
            NoMtcx = NoMtcx + 1

vadFlag_old is the VAD flag of the previous frame and vadFlag is the VAD flag of the current frame.

NoMtcx is the flag indicating that the TCX transformation with the long frame length (1024 samples) is to be avoided if the TCX coding model is selected.

mag is a discrete Fourier transformed (DFT) spectral envelope created from the LP filter coefficients, Ap, of the current frame. DFTSum is the sum of the first 40 elements of the vector mag, excluding its first element, mag(0).
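The spectral-envelope check can be sketched as follows, taking the envelope `mag` as a ready-made sequence, since its computation from the LP coefficients is not detailed here. Placing the NoMtcx increment inside the else branch is an inference from the pseudo-code, Gain_n, NormCorr_n and Lag_n are passed as scalars, and the function name is illustrative.

```python
TCX, ACELP = "TCX_MODE", "ACELP_MODE"

def verify_with_envelope(mode, gain, norm_corr, lag, mag, no_mtcx=0):
    """Re-check an uncertain or TCX decision; returns (mode, no_mtcx)."""
    if gain - norm_corr < 0.006 and norm_corr > 0.92 and lag > 21:
        # Sum of mag[1] .. mag[39]: the first 40 elements without mag[0].
        dft_sum = sum(mag[1:40])
        if dft_sum > 95 and mag[0] < 5:
            mode = TCX
        else:
            mode = ACELP
            no_mtcx += 1
    return mode, no_mtcx
```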

If the VAD flag is set and the mode, Mode(Index), of the Index-th frame of the current super-frame is still classified as uncertain mode (TCX_OR_ACELP), the mode is decided based on the modes selected in the previous and current super-frames. The counter TCXCount gives the number of long TCX frames (TCX512 and TCX1024) selected in the previous super-frame (1024 samples). The counter ACELPCount gives the number of ACELP frames (256 samples) in the previous and current super-frames.

    if ((prevMode(i) == TCX1024 or prevMode(i) == TCX512) and vadFlag_old(i) == 1 and TotE_i > 60)
        TCXCount = TCXCount + 1

    if (prevMode(i) == ACELP_MODE)
        ACELPCount = ACELPCount + 1

    if (Index != i)
        if (Mode(i) == ACELP_MODE)
            ACELPCount = ACELPCount + 1

Here prevMode(i) is the mode of the i-th frame (256 samples) in the previous super-frame, Mode(i) is the mode of the i-th frame in the current super-frame, and i is the frame number within the super-frame (1, 2, 3, 4). The mode, Mode(Index), is selected based on the counters TCXCount and ACELPCount as follows:

    if (TCXCount > 3)
        Mode(Index) = TCX_MODE
    else if (ACELPCount > 1)
        Mode(Index) = ACELP_MODE
    else
        Mode(Index) = TCX_MODE
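The counter-based fallback can be sketched as one function that loops over the four frames. The loop over i is implied by the description rather than written out in the pseudo-code, frames are indexed from 0 here instead of 1, and TotE_i is assumed to be supplied per frame. The function and parameter names are illustrative.

```python
TCX, ACELP = "TCX_MODE", "ACELP_MODE"

def resolve_uncertain(index, prev_modes, cur_modes, prev_vad, tot_e):
    """Decide the mode of frame `index` from neighbouring mode decisions."""
    tcx_count = 0
    acelp_count = 0
    for i in range(4):
        # Long TCX frames of the previous super-frame with active, strong signal.
        if (prev_modes[i] in ("TCX1024", "TCX512")
                and prev_vad[i] == 1 and tot_e[i] > 60):
            tcx_count += 1
        # ACELP frames in the previous super-frame.
        if prev_modes[i] == ACELP:
            acelp_count += 1
        # ACELP frames in the current super-frame, excluding the frame itself.
        if i != index and cur_modes[i] == ACELP:
            acelp_count += 1
    if tcx_count > 3:
        return TCX
    if acelp_count > 1:
        return ACELP
    return TCX
```

Since TCXCount can reach at most 4, the first branch fires only when all four frames of the previous super-frame satisfy the long-TCX condition.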

3. stage: TCXS is done only if the number of ACELP modes selected in EC and ECR is less than three (ACELP < 3) within a 1024-sample super-frame. Table 11 shows the possible mode combinations that can be selected in TCXS. The TCX mode is selected according to the segmental SNR described in Section 5.2.3 (ACELP/TCX closed-loop mode selection).

Table 11: Possible mode combination selected in TCXS

Selected mode combination after open-loop mode selection (TCX = 1, ACELP = 0)	Possible mode combinations after TCXS (ACELP = 0, TCX256 = 1, TCX512 = 2, TCX1024 = 3)		NoMtcx
(0, 1, 1, 1)	(0, 1, 1, 1)	(0, 1, 2, 2)
(1, 0, 1, 1)	(1, 0, 1, 1)	(1, 0, 2, 2)
(1, 1, 0, 1)	(1, 1, 0, 1)	(2, 2, 0, 1)
(1, 1, 1, 0)	(1, 1, 1, 0)	(2, 2, 1, 0)
(1, 1, 0, 0)	(1, 1, 0, 0)	(2, 2, 0, 0)
(0, 0, 1, 1)	(0, 0, 1, 1)	(0, 0, 2, 2)
(1, 1, 1, 1)	(1, 1, 1, 1)	(2, 2, 2, 2)	1
(1, 1, 1, 1)	(2, 2, 2, 2)	(3, 3, 3, 3)	0
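Table 11 can be held as a simple lookup, as in the sketch below. The names `TCXS_CANDIDATES` and `tcxs_candidates` are hypothetical. Returning the open-loop combination unchanged for inputs that the table does not list is an assumption of this sketch, not something the table states.

```python
# Open-loop combination (TCX = 1, ACELP = 0) -> candidates compared in TCXS.
TCXS_CANDIDATES = {
    (0, 1, 1, 1): [(0, 1, 1, 1), (0, 1, 2, 2)],
    (1, 0, 1, 1): [(1, 0, 1, 1), (1, 0, 2, 2)],
    (1, 1, 0, 1): [(1, 1, 0, 1), (2, 2, 0, 1)],
    (1, 1, 1, 0): [(1, 1, 1, 0), (2, 2, 1, 0)],
    (1, 1, 0, 0): [(1, 1, 0, 0), (2, 2, 0, 0)],
    (0, 0, 1, 1): [(0, 0, 1, 1), (0, 0, 2, 2)],
}

def tcxs_candidates(open_loop_combo, no_mtcx):
    """Return the mode combinations that TCXS chooses between."""
    combo = tuple(open_loop_combo)
    if combo == (1, 1, 1, 1):
        # NoMtcx set: TCX1024 is excluded; otherwise the long modes compete.
        if no_mtcx:
            return [(1, 1, 1, 1), (2, 2, 2, 2)]
        return [(2, 2, 2, 2), (3, 3, 3, 3)]
    # Combinations absent from Table 11 are passed through (assumption).
    return TCXS_CANDIDATES.get(combo, [combo])

print(tcxs_candidates((1, 1, 1, 1), no_mtcx=0))  # [(2, 2, 2, 2), (3, 3, 3, 3)]
```

The final choice among the returned candidates would then be made with the segmental SNR criterion of Section 5.2.3.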