4 Outline description
3GPP TS 26.190: Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Speech codec speech processing functions; Transcoding functions
This TS is structured as follows:
Section 4.1 contains a functional description of the audio parts including the A/D and D/A functions. Section 4.2 describes the input format for the AMR-WB encoder and the output format for the AMR-WB decoder. Sections 4.3 and 4.4 present a simplified description of the principles of the AMR-WB encoding and decoding processes, respectively. Section 4.5 gives the sequence and subjective importance of the encoded parameters.
Section 5 presents the functional description of the AMR-WB codec encoding, whereas Section 6 describes the decoding procedures. Section 7 tabulates the detailed bit allocation of the AMR-WB codec. Section 8 describes the homing operation.
4.1 Functional description of audio parts
The analogue‑to‑digital and digital‑to‑analogue conversion will in principle comprise the following elements:
1) Analogue to uniform digital PCM
– microphone;
– input level adjustment device;
– input anti‑aliasing filter;
– sample‑hold device sampling at 16 kHz;
– analogue‑to‑uniform digital conversion to 14‑bit representation.
The uniform format shall be represented in two’s complement.
2) Uniform digital PCM to analogue
– conversion from 14‑bit/16 kHz uniform PCM to analogue;
– a hold device;
– reconstruction filter including x/sin( x ) correction;
– output level adjustment device;
– earphone or loudspeaker.
In the terminal equipment, the A/D function may be achieved by direct conversion to 14‑bit uniform PCM format.
For the D/A operation, the inverse operation takes place.
4.2 Preparation of speech samples
The encoder is fed with data comprising samples with a resolution of 14 bits, left justified in a 16‑bit word. The decoder outputs data in the same format. Outside the speech codec, further processing must be applied if the traffic data occurs in a different representation.
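The left-justified sample format can be illustrated with a small sketch. The function names are hypothetical, and the sketch assumes that the two unused least significant bits of the 16‑bit word are set to zero, which this section does not state explicitly.

```python
def to_codec_word(sample_14bit: int) -> int:
    """Left-justify a signed 14-bit sample in a 16-bit two's-complement word.

    Assumption: the two padding bits at the bottom of the word are zero.
    """
    if not -8192 <= sample_14bit <= 8191:
        raise ValueError("sample does not fit in 14 bits")
    return (sample_14bit << 2) & 0xFFFF


def from_codec_word(word: int) -> int:
    """Recover the signed 14-bit sample from a left-justified 16-bit word."""
    word &= 0xFFFF
    # Reinterpret the 16-bit pattern as a signed value.
    signed16 = word - 0x10000 if word & 0x8000 else word
    # The arithmetic right shift drops the two padding bits and keeps the sign.
    return signed16 >> 2
```

Traffic data in another representation (for example 16‑bit linear PCM) would need to be converted to this layout before it is passed to the encoder.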
4.3 Principles of the adaptive multi-rate wideband speech encoder
The AMR-WB codec consists of nine source codecs with bit-rates of 23.85, 23.05, 19.85, 18.25, 15.85, 14.25, 12.65, 8.85 and 6.60 kbit/s.
The codec is based on the code‑excited linear predictive (CELP) coding model. The input signal is pre-emphasized using the filter $H_{\text{pre-emph}}(z) = 1 - \mu z^{-1}$. The CELP model is then applied to the pre-emphasized signal. A 16th order linear prediction (LP), or short‑term, synthesis filter is used, which is given by:
$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{m} \hat{a}_i z^{-i}}, \qquad (1)$$
where $\hat{a}_i$, $i = 1, \ldots, m$, are the (quantized) linear prediction (LP) parameters, and $m = 16$ is the predictor order. The long‑term, or pitch, synthesis filter is usually given by:
$$\frac{1}{B(z)} = \frac{1}{1 - g_p z^{-T}}, \qquad (2)$$
where $T$ is the pitch delay and $g_p$ is the pitch gain. The pitch synthesis filter is implemented using the so-called adaptive codebook approach.
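The three filters described above can be written as difference equations. The following is a minimal floating-point sketch, not the codec's fixed-point implementation: the function names are hypothetical, all filter memories are assumed to start at zero, and the pre-emphasis factor of 0.68 is an assumed default that this section does not state.

```python
def pre_emphasis(x, mu=0.68):
    """y(n) = x(n) - mu * x(n-1), i.e. H(z) = 1 - mu z^-1 (mu is an assumed value)."""
    y, prev = [], 0.0
    for sample in x:
        y.append(sample - mu * prev)
        prev = sample
    return y


def lp_synthesis(excitation, a_hat):
    """s(n) = u(n) - sum_{i=1..m} a_hat[i-1] * s(n-i), i.e. the filter 1/A_hat(z)."""
    out = []
    for n, u in enumerate(excitation):
        acc = u
        for i, coeff in enumerate(a_hat, start=1):
            if n - i >= 0:
                acc -= coeff * out[n - i]
        out.append(acc)
    return out


def pitch_synthesis(x, gp, T):
    """y(n) = x(n) + gp * y(n-T), i.e. the filter 1/(1 - gp z^-T)."""
    out = []
    for n, sample in enumerate(x):
        out.append(sample + (gp * out[n - T] if n >= T else 0.0))
    return out
```

For example, `pitch_synthesis([1, 0, 0, 0, 0], 0.5, 2)` repeats the input pulse every `T = 2` samples with the amplitude scaled by `gp` each time, which is the periodicity the long-term filter is meant to model.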
The CELP speech synthesis model is shown in Figure 1. In this model, the excitation signal at the input of the short‑term LP synthesis filter is constructed by adding two excitation vectors from adaptive and fixed (innovative) codebooks. The speech is synthesized by feeding the two properly chosen vectors from these codebooks through the short‑term synthesis filter. The optimum excitation sequence in a codebook is chosen using an analysis‑by‑synthesis search procedure in which the error between the original and synthesized speech is minimized according to a perceptually weighted distortion measure.
The perceptual weighting filter used in the analysis‑by‑synthesis search technique is given by:
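$$W(z) = A(z/\gamma_1)\, H_{\text{de-emph}}(z), \qquad (3)$$
where $A(z)$ is the unquantized LP filter, $H_{\text{de-emph}}(z) = \frac{1}{1 - 0.68 z^{-1}}$, and $\gamma_1 = 0.92$ is the perceptual weighting factor. The weighting filter uses the unquantized LP parameters.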
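As an illustration of how the weighting factor acts on the LP filter, the sketch below computes the coefficients of $A(z/\gamma_1)$. It assumes that the weighting filter contains this term, as is usual in CELP codecs, and that the coefficient list starts with the leading 1 of $A(z)$. The function name is hypothetical.

```python
def weight_lp_coefficients(a, gamma=0.92):
    """Return the coefficients of A(z/gamma).

    `a` is [1, a_1, ..., a_m]; substituting z/gamma for z scales a_i by gamma**i.
    """
    return [coeff * gamma ** i for i, coeff in enumerate(a)]
```

Because $\gamma_1 < 1$, higher-order coefficients are attenuated more strongly, which broadens the formant peaks of the filter response.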
The encoder performs the analysis of the LPC, LTP and fixed codebook parameters at a 12.8 kHz sampling rate. The coder operates on speech frames of 20 ms. For each frame, the speech signal is analysed to extract the parameters of the CELP model (LP filter coefficients, adaptive and fixed codebooks’ indices and gains). In addition to these parameters, high-band gain indices are computed in the 23.85 kbit/s mode. These parameters are encoded and transmitted. At the decoder, these parameters are decoded and speech is synthesized by filtering the reconstructed excitation signal through the LP synthesis filter.
The signal flow at the encoder is shown in Figure 2. After decimation, high-pass and pre-emphasis filtering is performed. LP analysis is performed once per frame. The set of LP parameters is converted to immittance spectrum pairs (ISP) and vector quantized using split-multistage vector quantization (S-MSVQ). The speech frame is divided into 4 subframes of 5 ms each (64 samples at 12.8 kHz sampling rate). The adaptive and fixed codebook parameters are transmitted every subframe. The quantized and unquantized LP parameters or their interpolated versions are used depending on the subframe. An open‑loop pitch lag is estimated in every other subframe or once per frame based on the perceptually weighted speech signal.
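The frame and subframe sizes follow directly from the sampling rates and the 20 ms frame duration. The short sketch below reproduces the arithmetic; the constant and function names are chosen here for illustration only.

```python
INPUT_RATE_HZ = 16_000      # sampling rate at the codec input
INTERNAL_RATE_HZ = 12_800   # rate used for the CELP analysis after decimation
FRAME_MS = 20
SUBFRAMES_PER_FRAME = 4


def samples_per_frame(rate_hz, frame_ms=FRAME_MS):
    """Number of samples in one frame at the given sampling rate."""
    return rate_hz * frame_ms // 1000


def split_into_subframes(frame, count=SUBFRAMES_PER_FRAME):
    """Split one frame into `count` equally sized subframes."""
    if len(frame) % count:
        raise ValueError("frame length is not a multiple of the subframe count")
    size = len(frame) // count
    return [frame[i * size:(i + 1) * size] for i in range(count)]
```

A 20 ms frame therefore holds 320 samples at 16 kHz and 256 samples at 12.8 kHz, and each of the four 5 ms subframes holds 64 samples at 12.8 kHz.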
Then the following operations are repeated for each subframe:
– The target signal x(n) is computed by filtering the LP residual through the weighted synthesis filter with the initial states of the filters having been updated by filtering the error between LP residual and excitation (this is equivalent to the common approach of subtracting the zero input response of the weighted synthesis filter from the weighted speech signal).
– The impulse response h(n) of the weighted synthesis filter is computed.
– Closed‑loop pitch analysis is then performed (to find the pitch lag and gain), using the target x(n) and impulse response h(n), by searching around the open‑loop pitch lag. Fractional pitch with 1/4 or 1/2 sample resolution (depending on the mode and the pitch lag value) is used. The interpolating filter in the fractional pitch search has a low-pass frequency response. Further, there are two potential low-pass characteristics in the adaptive codebook, and the selected one is encoded with 1 bit.
– The target signal x(n) is updated by removing the adaptive codebook contribution (filtered adaptive codevector), and this new target, x2(n), is used in the fixed algebraic codebook search (to find the optimum innovation).
– The gains of the adaptive and fixed codebooks are vector quantized with 6 or 7 bits (with moving average (MA) prediction applied to the fixed codebook gain).
– Finally, the filter memories are updated (using the determined excitation signal) for finding the target signal in the next subframe.
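The target update in the subframe loop above can be sketched as follows. This is an illustrative floating-point outline, not the procedure defined in Section 5: the function names are hypothetical, and the least-squares gain and its upper bound of 1.2 are assumptions that this section does not state.

```python
def filter_codevector(v, h):
    """y(n) = sum_{k=0..n} v(k) * h(n-k), truncated to the subframe length."""
    return [
        sum(v[k] * h[n - k] for k in range(n + 1) if n - k < len(h))
        for n in range(len(v))
    ]


def pitch_gain(x, y, g_max=1.2):
    """Least-squares gain <x, y> / <y, y>, clipped to [0, g_max] (assumed bound)."""
    energy = sum(b * b for b in y)
    if energy == 0.0:
        return 0.0
    gain = sum(a * b for a, b in zip(x, y)) / energy
    return min(max(gain, 0.0), g_max)


def update_target(x, y, gp):
    """x2(n) = x(n) - gp * y(n): remove the adaptive codebook contribution."""
    return [a - gp * b for a, b in zip(x, y)]
```

With `v` the adaptive codevector and `h` the impulse response of the weighted synthesis filter, `update_target(x, filter_codevector(v, h), gp)` yields the new target x2(n) that the fixed codebook search operates on.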
The bit allocation of the AMR-WB codec modes is shown in Table 1. In each 20 ms speech frame, 132, 177, 253, 285, 317, 365, 397, 461 or 477 bits are produced, corresponding to a bit-rate of 6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05 or 23.85 kbit/s, respectively. More detailed bit allocation among the codec parameters is given in tables 12a-12i. Note that the most significant bits (MSB) are always sent first.
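The relation between frame size and bit-rate is a simple division by the frame duration, as the following sketch shows. The names are illustrative only.

```python
FRAME_DURATION_MS = 20
BITS_PER_FRAME = [132, 177, 253, 285, 317, 365, 397, 461, 477]


def bit_rate_kbps(bits_per_frame, frame_ms=FRAME_DURATION_MS):
    """Bits per millisecond is numerically equal to kbit/s."""
    return bits_per_frame / frame_ms


rates = [bit_rate_kbps(bits) for bits in BITS_PER_FRAME]
```

For instance, 132 bits every 20 ms gives 6.60 kbit/s and 477 bits every 20 ms gives 23.85 kbit/s.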
Table 1: Bit allocation of the AMR-WB coding algorithm for 20 ms frame
| Mode | Parameter | 1st subframe | 2nd subframe | 3rd subframe | 4th subframe | Total per frame |
|---|---|---|---|---|---|---|
| 23.85 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 46 |
| | LTP-filtering | 1 | 1 | 1 | 1 | 4 |
| | Pitch delay | 9 | 6 | 9 | 6 | 30 |
| | Algebraic code | 88 | 88 | 88 | 88 | 352 |
| | Codebook gain | 7 | 7 | 7 | 7 | 28 |
| | HB-energy | 4 | 4 | 4 | 4 | 16 |
| | Total | | | | | 477 |
| 23.05 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 46 |
| | LTP-filtering | 1 | 1 | 1 | 1 | 4 |
| | Pitch delay | 9 | 6 | 9 | 6 | 30 |
| | Algebraic code | 88 | 88 | 88 | 88 | 352 |
| | Gains | 7 | 7 | 7 | 7 | 28 |
| | Total | | | | | 461 |
| 19.85 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 46 |
| | LTP-filtering | 1 | 1 | 1 | 1 | 4 |
| | Pitch delay | 9 | 6 | 9 | 6 | 30 |
| | Algebraic code | 72 | 72 | 72 | 72 | 288 |
| | Codebook gain | 7 | 7 | 7 | 7 | 28 |
| | Total | | | | | 397 |
| 18.25 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 46 |
| | LTP-filtering | 1 | 1 | 1 | 1 | 4 |
| | Pitch delay | 9 | 6 | 9 | 6 | 30 |
| | Algebraic code | 64 | 64 | 64 | 64 | 256 |
| | Gains | 7 | 7 | 7 | 7 | 28 |
| | Total | | | | | 365 |
| 15.85 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 46 |
| | LTP-filtering | 1 | 1 | 1 | 1 | 4 |
| | Pitch delay | 9 | 6 | 9 | 6 | 30 |
| | Algebraic code | 52 | 52 | 52 | 52 | 208 |
| | Gains | 7 | 7 | 7 | 7 | 28 |
| | Total | | | | | 317 |
| 14.25 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 46 |
| | LTP-filtering | 1 | 1 | 1 | 1 | 4 |
| | Pitch delay | 9 | 6 | 9 | 6 | 30 |
| | Algebraic code | 44 | 44 | 44 | 44 | 176 |
| | Gains | 7 | 7 | 7 | 7 | 28 |
| | Total | | | | | 285 |
| 12.65 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 46 |
| | LTP-filtering | 1 | 1 | 1 | 1 | 4 |
| | Pitch delay | 9 | 6 | 9 | 6 | 30 |
| | Algebraic code | 36 | 36 | 36 | 36 | 144 |
| | Gains | 7 | 7 | 7 | 7 | 28 |
| | Total | | | | | 253 |
| 8.85 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 46 |
| | Pitch delay | 8 | 5 | 8 | 5 | 26 |
| | Algebraic code | 20 | 20 | 20 | 20 | 80 |
| | Gains | 6 | 6 | 6 | 6 | 24 |
| | Total | | | | | 177 |
| 6.60 kbit/s | VAD-flag | | | | | 1 |
| | ISP | | | | | 36 |
| | Pitch delay | 8 | 5 | 5 | 5 | 23 |
| | Algebraic code | 12 | 12 | 12 | 12 | 48 |
| | Gains | 6 | 6 | 6 | 6 | 24 |
| | Total | | | | | 132 |
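The per-parameter figures in Table 1 can be cross-checked against the stated frame totals. The sketch below transcribes the table into a small data structure and sums each mode. The helper names are hypothetical, and the label "Gains" is used for the codebook gain row of every mode.

```python
def mode(isp, pitch, code, gain, ltp=True, hb=False):
    """Build the parameter set of one mode.

    Tuples hold per-subframe bit counts; plain integers are sent once per frame.
    """
    params = {
        "VAD-flag": 1,
        "ISP": isp,
        "Pitch delay": pitch,
        "Algebraic code": (code,) * 4,
        "Gains": (gain,) * 4,
    }
    if ltp:
        params["LTP-filtering"] = (1,) * 4
    if hb:
        params["HB-energy"] = (4,) * 4
    return params


# Mode (kbit/s) -> (parameter set, total per frame as stated in Table 1).
TABLE_1 = {
    "23.85": (mode(46, (9, 6, 9, 6), 88, 7, hb=True), 477),
    "23.05": (mode(46, (9, 6, 9, 6), 88, 7), 461),
    "19.85": (mode(46, (9, 6, 9, 6), 72, 7), 397),
    "18.25": (mode(46, (9, 6, 9, 6), 64, 7), 365),
    "15.85": (mode(46, (9, 6, 9, 6), 52, 7), 317),
    "14.25": (mode(46, (9, 6, 9, 6), 44, 7), 285),
    "12.65": (mode(46, (9, 6, 9, 6), 36, 7), 253),
    "8.85": (mode(46, (8, 5, 8, 5), 20, 6, ltp=False), 177),
    "6.60": (mode(36, (8, 5, 5, 5), 12, 6, ltp=False), 132),
}


def bits_per_frame(params):
    """Sum all parameter bits of one frame."""
    return sum(sum(b) if isinstance(b, tuple) else b for b in params.values())
```

Calling `bits_per_frame` on each parameter set reproduces the stated total for all nine modes.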
4.4 Principles of the adaptive multi-rate wideband speech decoder
The signal flow at the decoder is shown in Figure 3. At the decoder, the transmitted indices are extracted from the received bitstream. The indices are decoded to obtain the coder parameters at each transmission frame. These parameters are the ISP vector, the 4 fractional pitch lags, the 4 LTP filtering parameters, the 4 innovative codevectors, and the 4 sets of vector quantized pitch and innovative gains. In the 23.85 kbit/s mode, the high-band gain index is also decoded. The ISP vector is converted to the LP filter coefficients and interpolated to obtain LP filters at each subframe. Then, at each 64-sample subframe:
– The excitation is constructed by adding the adaptive and innovative codevectors scaled by their respective gains.
– The 12.8 kHz speech is reconstructed by filtering the excitation through the LP synthesis filter.
– The reconstructed speech is de-emphasized.
Finally, the reconstructed speech is upsampled to 16 kHz and a high-band speech signal is added in the frequency band from 6 kHz to 7 kHz.
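The three per-subframe decoder steps can be outlined as below. This is a floating-point sketch under stated assumptions, not the normative procedure of Section 6: the function names are hypothetical, de-emphasis is assumed to be the inverse of a first-order pre-emphasis filter with factor 0.68, and the upsampling and high-band steps are omitted. Filter histories are passed in explicitly so that consecutive subframes join without discontinuities.

```python
def all_pole_filter(x, a, history):
    """Filter x through 1/(1 + sum_i a[i-1] z^-i).

    `history` holds the last len(a) output samples, most recent last, and is
    updated in place so that the next subframe continues from this state.
    """
    out = []
    for sample in x:
        acc = sample
        for i, coeff in enumerate(a, start=1):
            acc -= coeff * history[-i]
        history.append(acc)
        del history[0]
        out.append(acc)
    return out


def decode_subframe(v, c, gp, gc, a_hat, synth_hist, deemph_hist, mu=0.68):
    """Reconstruct one subframe of 12.8 kHz speech.

    v, c: adaptive and innovative codevectors; gp, gc: their gains.
    mu is an assumed de-emphasis factor.
    """
    # Excitation: sum of the two codevectors scaled by their gains.
    excitation = [gp * vi + gc * ci for vi, ci in zip(v, c)]
    # LP synthesis filter 1/A_hat(z).
    speech = all_pole_filter(excitation, a_hat, synth_hist)
    # De-emphasis 1/(1 - mu z^-1).
    return all_pole_filter(speech, [-mu], deemph_hist)
```

A caller would keep `synth_hist` (length 16 for the 16th order LP filter) and `deemph_hist` (length 1) across subframes and frames, initialised to zeros.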
4.5 Sequence and subjective importance of encoded parameters
The encoder will produce the output information in a unique sequence and format, and the decoder must receive the same information in the same way. Tables 12a-12i show the sequence of output bits and the bit allocation for each parameter.
The different parameters of the encoded speech and their individual bits have unequal importance with respect to subjective quality. The output and input frame formats for the AMR wideband speech codec are given in [2], where a reordering of bits takes place.