6.1.1 General LP-based decoding
3GPP TS 26.445: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description (Release 15)
The LSF parameters are decoded from the received bitstream and converted to LSP coefficients and subsequently to LP coefficients. The interpolation principle, described in subclause 5.1.9.6, is used to obtain interpolated LSP vectors for all subframes, i.e. 4 subframes in case of the 12.8 kHz internal sampling rate and 5 subframes in case of the 16 kHz sampling rate. Then, the excitation signal is reconstructed and post-processed before performing LP synthesis (filtering with the LP synthesis filter) to obtain the reconstructed signal. The reconstructed signal is then de-emphasized (an inverse of the pre-emphasis applied at the encoder). Finally, post-processing is applied to enhance the formant and harmonic structure of the signal as well as the periodicity in the low-frequency region of the signal. The signal is then up-sampled to the output sample rate. Finally, the high-band signal is generated and added to the up-sampled synthesized signal to obtain a full-band reconstructed signal (output signal).
6.1.1.1 LSF decoding
6.1.1.1.1 General LSF decoding
Depending on the predictor allocation per mode, as specified at the encoder side in subclause 5.2.2.1.3, a first bit is read to select between safety-net and predictive mode for the switched safety-net/predictive cases. A bit value of one corresponds to safety-net mode and a value of zero to predictive mode. The following bits are read in groups whose sizes equal the stage sizes corresponding to each coding mode, as specified in subclause 5.2.2.1.4, and the codevectors are retrieved from the corresponding codebooks. The last bits correspond to the index of the lattice codevector. The LSF residual after the first non-structured, optimized VQ was quantized by splitting the vector into two subvectors, and the index was obtained as a combined index of the two indexes corresponding to the first and the second subvector. The two indexes are retrieved as follows:
(1409)
()
Each of these indexes encodes a scale index, a leader class index and a leader vector permutation index. For each of the two indexes corresponding to the 8-dimensional subvectors, the following operations are applied. The scale offset is determined as the largest scale offset that is smaller than the index, and this scale offset is removed from the index. Similarly, the leader offset is calculated and removed from each of the two indexes. The position of the scale offset gives the index of the scale, and the position of the leader offset gives the index of the leader class. From the remaining index value, the sign index and the leader permutation index are obtained:

()
()

where the divisor is the cardinality of unsigned permutations for the leader class, given in subclause 5.2.2.1.4. The two indexes are decoded using the position decoding based on counting the binomial coefficients and the sign decoding described in [26].
Decoding of the index corresponding to the unsigned permutation of the leader vector proceeds as follows. Knowing the leader class index, the number of distinct non-zero values and the number of occurrences of each of these values, which are tabulated (see subclause 5.2.2.1.4), can be determined. The leader classes defined in subclause 5.2.2.1.4 have at most 4 distinct values. If there is a single value in the leader class corresponding to the decoded leader class index, all decoded vector components take that same value:

()

where the vector length equals the subvector dimension.
If there are two distinct values, v0 and v1, in the decoded leader vector, appearing k0 and k1 times respectively, the decoded vector is initialized with the value v1 in all positions:

()

The leader vector permutation index is interpreted using binomial coefficient decoding. The positions of the k0 values v0 are determined within a vector of length S. The position of the first v0 is determined such that

()

If the index is zero, the position is zero. The position of the second v0 value is determined similarly for an updated index

()

an updated vector length, and an updated number of values, k0 − 1 instead of k0. The procedure continues until the positions of all v0 values are determined. Once these positions are known, the values v0 are inserted in the decoded vector at the corresponding positions.
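The binomial-coefficient position decoding described above can be sketched as follows. This is a minimal illustration assuming a lexicographic enumeration convention; the codec's exact convention and offset tables may differ. The encoder counterpart is included only so that the round trip can be checked.

```python
from math import comb

def encode_positions(positions, S):
    # Rank a sorted set of k positions within a length-S vector in
    # lexicographic order (illustrative convention).
    idx, prev, k = 0, 0, len(positions)
    for i, p in enumerate(positions):
        for j in range(prev, p):
            idx += comb(S - 1 - j, k - 1 - i)
        prev = p + 1
    return idx

def decode_positions(idx, k, S):
    # Invert encode_positions by subtracting binomial coefficients one by
    # one until the remainder would become negative, as in the text above.
    positions, j = [], 0
    for i in range(k):
        while idx >= comb(S - 1 - j, k - 1 - i):
            idx -= comb(S - 1 - j, k - 1 - i)
            j += 1
        positions.append(j)
        j += 1
    return positions
```

For example, with S = 4 and k = 2 the position pair (1, 3) maps to index 4 and back.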
If there are 3 distinct values, having k0, k1 and k2 occurrences respectively, the decoded vector is initialized with the third value in all positions:

()

Out of the permutation index, two subindexes are obtained:

()
()

The positions of the values v1 are determined by binomial decoding of the index Li2, considering k1 positions out of S − k0, and the values v1 are inserted in the vector at the decoded positions. The decoding is performed according to equations (1208) and (1209). The positions of the values v0 are obtained by binomial decoding of the index Li1, considering k0 positions out of S.
If there are 4 distinct values, the vector is initialized with the fourth value in all positions:

()

The permutation index is divided into subindexes:

()
()
()
()

The positions of the values v2 are obtained by binomial decoding of the corresponding subindex, for k2 positions out of the positions remaining after placing v0 and v1. The positions of the values v1 are obtained by binomial decoding of the corresponding subindex, for k1 positions out of S − k0. The positions of the values v0 are obtained by binomial decoding of the corresponding subindex, for k0 positions out of S.
The obtained subvectors are multiplied by the corresponding scales and component-wise multiplied by the off-line computed standard deviations. The standard deviations are individually estimated for each coding mode and bandwidth. The result corresponds to the codevector from the last stage of the LSF quantizer. The codevectors from all stages are added together.
If the coding mode corresponds to a safety-net only mode, or if it corresponds to a switched safety-net/AR predictive mode and the safety-net mode has been selected at the encoding stage, a vector representing the component-wise mean for the current coding mode is added to the sum of codevectors and the result represents the decoded LSF vector. The decoded LSF vector is thus given by:

q̂(i) = m(i) + Σk lk(i), for i = 0, …, M−1 (1425)

where q̂(i) is the LSF vector for the current frame, lk(i), i = 0, …, M−1 is the codevector obtained at stage k out of the quantization stages, and m(i) is the mean LSF vector for the current coding mode.
If AR predictive mode was selected at the encoding stage, the decoded LSF vector is given by:

, for i = 0, …, M−1. (1426)

If MA predictive mode was selected at the encoding stage, based on the coding mode, the decoded LSF vector is given by:

, for i = 0, …, M−1. (1427)

where the MA prediction term involves the quantization error of the previous frame.
6.1.1.1.2 LSF decoding for voiced coding mode at 16 kHz internal sampling frequency
The VC mode at the 16 kHz internal sampling frequency has two decoding rates: 31 bits per frame and 40 bits per frame. The VC mode is decoded by a 16-state and 8-stage BC-TCVQ. Figure 88 shows the decoder of the predictive BC-TCVQ with safety-net using an encoding rate of 31 bits. The 31-bit LSF decoding performed by the predictive BC-TCVQ with safety-net proceeds as follows. First, one bit is decoded by the scheme selection block. This bit defines whether the predictive scheme or the safety-net scheme is used.
For the safety-net scheme, the quantized LSF subvectors are decoded by equation (1428),

, for i = 2, …, M/2 (1428)

where the prediction residual is decoded by the 1st BC-TCVQ.
If the predictive scheme is used, the prediction vector is obtained using equation (1429):

, for i = 0, …, M−1 (1429)

where the AR prediction coefficients are those selected for the VC mode at the 16 kHz internal sampling frequency and M is the LPC order.
The decoding of the quantized subvectors is performed as given by equation (1430),

, for i = 2, …, M/2 (1430)

where the prediction residual is decoded by the 2nd BC-TCVQ.
The quantized LSF vector for the predictive scheme is calculated by equation (1431),

, for i = 0, …, M−1 (1431)

where m(i) is the mean vector for the VC mode. The quantized LSF vector for the safety-net scheme is calculated by equation (1432),

, for i = 0, …, M−1 (1432)
Figure 88: Block diagram of the decoder for the predictive BC-TCVQ with safety-net for an encoding rate of 31 bits per frame
Figure 89 shows the decoder of the predictive BC-TCVQ with safety-net for an encoding rate of 40 bits per frame. The 40-bit LSF decoding using the predictive BC-TCVQ with safety-net is performed as follows. The scheme selection and the BC-TCVQ decoding method for both the predictive and safety-net schemes are the same as those of the 31-bit LSF decoding. The residual subvectors are decoded by the 3rd and 4th SVQ decoding, respectively. The quantized LSF vector for the predictive scheme is calculated according to equation (1433),

, for i = 0, …, M−1 (1433)

where the first term is the output of the 2nd BC-TCVQ and the 2nd intra-frame prediction. The quantized LSF vector for the safety-net scheme is calculated by equation (1434),

, for i = 0, …, M−1 (1434)

where the first term is the output of the 1st BC-TCVQ and the 1st intra-frame prediction.
Figure 89: Block diagram of the decoder for the predictive BC-TCVQ/SVQ with safety-net for an encoding rate of 40 bits per frame
6.1.1.2 Reconstruction of the excitation
6.1.1.2.1 Reconstruction of the excitation in GC and VC modes and high rate IC/UC modes
6.1.1.2.1.1 Decoding the adaptive codebook vector
The received adaptive codebook parameters (or pitch parameters), i.e. the closed-loop pitch and the pitch gain (adaptive codebook gain) transmitted for each subframe, serve to compute the adaptive codevector.
6.1.1.2.1.2 Pulse index decoding of the 43-bit algebraic codebook
The joint indexing decoding procedure for three pulses on two tracks is described as follows.

On the decoder side, the de-indexing procedure for the pulses and positions on the tracks is as follows:

- 24 bits are extracted from the received bit-stream and decoded as a temporary index. If the temporary index is smaller than the threshold THR, which is the same as on the encoder side, the joint index equals the temporary index. If the temporary index is greater than or equal to THR, one more bit is extracted from the bit-stream as Bit and the global index is adjusted accordingly. The joint index is then computed by subtracting THR from the global index:

(1435)

- The joint index is decompressed into the two indexes, one for each track:

(1436)
(1437)
- The index of each track is decoded as follows:

- The quantity of pulse positions is determined according to the first index. The offset indexes are saved in a table (available in both the encoder and the decoder), and each offset index in the table indicates a unique number of pulse positions in the track, so the quantity of pulse positions can be decoded from the index easily. The number of pulse positions, the sign index and the remaining index are thus obtained.

- Knowing the number of pulse positions and the remaining index, the second and third indexes can be decoded based on the permutation method, and each pulse position is then decoded from them. The second index and the third index are separated in the following way:

(1438)
(1439)

where the second index determines the distribution of the positions with a pulse on them, the third index determines the quantity of pulses in each such position, one index being obtained as the remainder of the division and the other by taking the integer part ("Int").
- The distribution of the positions with a pulse on the track is determined according to the second index. Once the second index is obtained, the following calculation process is applied at the decoder:

(1) Binomial coefficients are subtracted from the second index one by one,

(1440)

until the remainder changes from a positive number to a negative number; the coefficients are values of the combination function C computed from the total quantity of positions on the track and the quantity of positions with pulses. The position at which the sign change occurs, namely the serial number of the first position with a pulse (or pulses) on it, is recorded.

(2) If further positions remain, the subsequent binomial coefficients are subtracted from the remainder one by one until the remainder again changes from a positive number to a negative number. The serial number of the second position with a pulse (or pulses) on it is recorded.

(3) And so on: the subtraction continues in the same manner until the remainder changes from a positive number to a negative number for each remaining position, and the serial number of the (n+1)-th position with a pulse (or pulses) on it is recorded.

(4) The decoding of the second index is completed, and the distribution of the pulse positions is obtained.
- The quantity of pulses in each position with pulses is determined according to the third index.

For each track, the number of pulses on each position that has a pulse is determined according to the third index. Once the third index is obtained, the following calculation process is applied at the decoder:

(1) Values of the combination function C are calculated from a smaller argument value to a greater argument value. The last value that keeps the remainder greater than zero is recorded as the position of the first pulse on the track.

(2) If further pulses remain, the calculation is continued from a smaller argument value to a greater argument value, and the last value that keeps the remainder greater than zero is recorded as the position of the second pulse on the track.

(3) By analogy, the calculation is repeated, and the last value that keeps the remainder greater than zero is recorded as the position of the (h+1)-th pulse (h+1 being an ordinal number) on the track.

(4) The decoding of the third index is completed, and the number of pulses associated with each decoded position is obtained.

- The decoded numbers indicate whether each decoded position carries one pulse or several pulses. Since the encoded value is the number of pulses in each pulse position minus one, the value 1 is added back in each pulse position and the pulse counts are rebuilt accordingly.

- By now all the pulse positions, the quantity of pulses in each pulse position and the associated signs are decoded, so the pulses on each track are reconstructed.
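The index separation of equations (1438)-(1439) and the rebuilding of the pulse counts can be sketched as below; the radix (the count of possible position-distribution indexes) is a hypothetical parameter standing in for the tabulated value.

```python
def split_track_index(index, radix):
    # Second index: remainder of the division; third index: integer part.
    second = index % radix
    third = index // radix
    return second, third

def rebuild_pulse_counts(encoded_counts):
    # The transmitted value is the pulse count per position minus 1,
    # so 1 is added back to every decoded pulse position.
    return [c + 1 for c in encoded_counts]
```

For example, with radix 7 the combined index 23 splits into second index 2 and third index 3, and the two parts recombine as 3·7 + 2 = 23.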
6.1.1.2.1.3 Multi-track joint decoding of pulse indexing
The multi-track joint decoding procedure is described as follows:
- The track indexes are extracted from the bit-stream.

- The parameters are obtained from table 35 according to the pulse number of each track, including the index bits, the Hi_Bit_bits, the Hi_Bit_range and the re-back_bits of each track.

- The high-bit field and the remaining index bits are extracted from each track index.

- From the extracted high-bit fields, the original high bits of each track are decoded.

- The decoded high bits of the first two tracks are combined with the corresponding remaining index bits, and the track indexes are obtained as follows:

(1441)
(1442)

- The decoded high bits of the third track are combined with the corresponding remaining index bits, and the track index is obtained as follows:

(1443)
(1444)

- The decoded high bits of the fourth track are combined with the corresponding remaining index bits, and the track index is obtained as follows:

(1445)
(1446)

- The rebuilt parts are combined to obtain the index of each track.
6.1.1.2.1.4 Decoding the algebraic codebook vector
The received algebraic codebook index is used to extract the positions and amplitudes (signs) of the excitation pulses and to find the algebraic codevector. If the integer part of the pitch lag is less than the subframe size of 64, the pitch sharpening procedure is applied, which translates into modifying the codevector by filtering it through the adaptive pre-filter. The pre-filter consists of two parts: a periodicity enhancement part, parameterized by the integer part T of the pitch lag and representing the fine spectral structure of the speech signal, and a tilt part, whose coefficient is related to the voicing of the previous subframe and is bounded by [0.28, 0.56] at 16.4 and 24.4 kbps, and by [0.0, 0.5] otherwise.

The periodicity enhancement part of the filter colours the spectrum by damping inter-harmonic frequencies, which are annoying to the human ear in case of voiced signals.

Depending on the bitrate, the coding mode and the estimated level of background noise, the adaptive pre-filter also includes a filter based on the spectral envelope, which colours the spectrum by damping frequencies between the formant regions. The final form of the adaptive pre-filter is given by

(1447)

where the coefficient values depend on whether the internal sampling frequency is 12.8 kHz or 16 kHz.
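The two-part adaptive pre-filter can be sketched as follows; the damping coefficient 0.85 and the tilt coefficient 0.3 are assumed example values, not the bounds quoted above.

```python
def prefilter_codevector(c, T, b=0.85, beta1=0.3):
    # Periodicity enhancement 1/(1 - b*z^-T): feed back the codevector
    # delayed by the integer pitch lag T within the subframe.
    y = list(c)
    for n in range(T, len(y)):
        y[n] += b * y[n - T]
    # Tilt part (1 - beta1*z^-1).
    return [y[0]] + [y[n] - beta1 * y[n - 1] for n in range(1, len(y))]
```

Applying the sketch to a unit impulse with T = 2 shows the delayed feedback (sample 2 becomes 0.85) and the tilt (sample 1 becomes −0.3).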
6.1.1.2.1.5 Decoding of the combined algebraic codebook
At the 32 kbps and 64 kbps bit-rates, the pre-quantizer excitation contribution is obtained from the received pre-quantizer parameters as follows. The contribution from the pre-quantizer is obtained by first de-quantizing the decoded (quantized) spectral coefficients using an AVQ decoder and applying the iDCT to these de-quantized spectral coefficients. The pre-emphasis filter is then applied after the iDCT to form the pre-quantizer contribution, which is scaled by the quantized pre-quantizer gain to form the pre-quantizer excitation contribution.

The same procedure applies for decoding the GC, TC and IC modes at 32 kbps and 64 kbps, with the exception of non-harmonic signals in the 32 kbps GC mode, where the iDCT stage is omitted. It is noted that at the decoder the order of the codebooks and corresponding codebook stages during the decoding process is not important, as a particular codebook contribution does not depend on or affect the other codebook contributions. Thus the codebook arrangement in the IC mode is identical to the GC mode codebook arrangement. The pre-quantizer gain in GC and TC mode is obtained by

()

where the decoded normalized pre-quantizer gain is scaled using the predicted algebraic codevector energy.

In IC mode, the de-quantizer gain is obtained by

()

where the quantized algebraic codebook gain is used.
6.1.1.2.1.6 AVQ decoding
The reading of the AVQ parameters from the bitstream is complementary to the insertion described in subclause 5.2.3.1.6.9.3. The codebook numbers are used to estimate the actual bit-budget needed to encode AVQ parameters at the decoder and the number of unused AVQ bits is computed as a difference between the allocated and actual bit budgets.
6.1.1.2.1.6.1 Decoding of AVQ parameters
The parameter decoding involves decoding the AVQ parameters describing each 8-dimensional quantized sub-band of the quantized spectrum. The quantized spectrum comprises several sub-bands (8 in the case of the combined algebraic codebook), each of 8 samples. The decoded AVQ parameters for each sub-band comprise:

- the codebook number,
- the vector index,
- and, if the codevector (i.e. lattice point) is not in a base codebook, the Voronoi index.
The unary code for the codebook number is first read from the bitstream and the codebook number is determined. From the codebook number, the base codebook and the Voronoi extension order are then obtained. If the codebook number indicates a base codebook, there is no Voronoi extension (the extension order is 0) and the base codebook is used directly. Otherwise, the base codebook is either Q3 (codebook number even) or Q4 (codebook number odd), and the Voronoi order (1 or 2) is also determined from the codebook number.
Then, if a codevector is present, the vector index, coded on a number of bits determined by the codebook number, is read from the bitstream and the base codevector is decoded.

After the decoding of the base codevector, if the Voronoi order is greater than 0, the Voronoi extension index is decoded to obtain the Voronoi extension vector. The number of bits in each component of the index vector is given by the Voronoi extension order r, and the scaling factor m of the Voronoi extension is given by m = 2^r.
Finally, from the scaling factor m, the Voronoi extension vector and the base codebook vector, each 8-dimensional AVQ sub-band is computed as the base codevector scaled by m plus the Voronoi extension vector:

()

In the case of decoding the pre-quantizer, resp. de-quantizer, contribution from subclause 6.1.1.2.1.3, the decoded sub-band blocks correspond to the decoded spectrum coefficients of the pre-quantizer, resp. of the de-quantizer.
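The sub-band reconstruction from the base codevector, the Voronoi extension vector and the scaling factor m = 2^r can be sketched as:

```python
def avq_subband(z, v, r):
    # z: 8-dimensional base codevector, v: Voronoi extension vector,
    # r: Voronoi extension order, so the scaling factor is m = 2^r.
    m = 1 << r
    return [m * zi + vi for zi, vi in zip(z, v)]
```

With r = 0 (no extension) the scaling factor is 1 and the sub-band reduces to the base codevector itself.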
6.1.1.2.1.6.2 De-indexing of codevector in base codebook
The index decoding of the codevector is done in several steps. First, the absolute leader and its offset are identified by comparing the index with the offset in the look‑up table. The offset is subtracted from the index to produce a new index. From this index, the sign index and the absolute vector index are extracted. The sign index is decoded and the sign vector is obtained. The absolute vector index is decoded by using a multi-level permutation-based index decoding method and the absolute vector is obtained. Finally, the decoded vector is reconstructed by combining the sign vector with the absolute vector.
6.1.1.2.1.6.2.1 Sign decoding
The sign vector is obtained by extracting from left to right all the sign bits for the non-zero elements in the absolute vector, the number of sign bits being read along with the sign code. If the bit number of the sign index is not equal to the number of non-zero elements in the decoded absolute vector, the sign of the last non-zero element is recovered.
6.1.1.2.1.6.2.2 Decoding of the absolute vector and of its position vector
The decoding method of the absolute vector index is described as follows:

- The absolute vector index is decomposed into several mid-indices, one per level, from the lowest level to the highest level. The absolute vector index is the starting value for the lowest level. The mid-index of each lower level is obtained by dividing the absolute vector index by the possible index value count for that level; the quotient is the absolute vector index for the next level and the remainder is the mid-index for the current level.

- The mid-index of each lower level is decoded based on a permutation and combination function, and the position vector of each lower level vector relative to its upper level vector is obtained.

Finally, one by one from the lowest level to the highest level, each lower level absolute vector is used to partly replace the upper level absolute vector elements according to the position parameter. The highest level vector is the decoded output absolute vector. An example of a lower level absolute vector partly replacing the upper level absolute vector elements is given in figure 90.

Figure 90: Example of the replacement between a lower level and an upper level absolute vector
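The div/mod decomposition of the absolute vector index into per-level mid-indices can be sketched as follows; the per-level counts stand in for the tabulated possible index value counts, and the recomposition is included only to make the round trip checkable.

```python
def decompose_index(index, counts):
    # counts[l]: number of possible mid-index values at level l,
    # ordered from the lowest to the highest level.
    mids = []
    for c in counts:
        mids.append(index % c)   # mid-index for the current level
        index //= c              # index passed on to the next level
    return mids

def recompose_index(mids, counts):
    # Encoder-side counterpart of the decomposition above.
    index = 0
    for m, c in zip(reversed(mids), reversed(counts)):
        index = index * c + m
    return index
```

For example, index 45 with per-level counts [4, 3, 5] decomposes into mid-indices [1, 2, 3].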
6.1.1.2.1.6.2.3 Position vector decoding
To obtain the position vector from the middle index in each lower level, the algorithm uses a permutation and combination procedure to estimate the position sequence. The procedure is as follows:

1) Increment a candidate position value, beginning from zero, until the associated combination function value is not more than the remaining index.

2) Let this value be the first position, and subtract the corresponding combination function value from the remaining index.

3) Increase the candidate position value, beginning from the position decoded at the previous step, until the associated combination function value is not more than the remaining index.

4) Let this value be the next position, and subtract the corresponding combination function value from the remaining index.

5) Repeat steps 3 and 4 until all positions are decoded for the current level position sequence.
6.1.1.2.1.6.2.4 Absolute vector decoding
For the lowest level, the absolute vector only includes one type of element whose value can be obtained from the decomposition order column in the table of subclause 5.2.3.1.6.9.3.2. The lowest level absolute vector is passed to the next level and at the next step another type of element is added. This new element is obtained from the decomposition order column in the table of subclause 5.2.3.1.6.9.3.2. This procedure is repeated until the highest level is reached.
6.1.1.2.1.6.2.5 Construction of the output codevector in base codebook
Constructing the 8-dimensional output codevector in the base codebook is the final step of the decoding procedure. The codevector is obtained by combining the sign vector with the absolute vector. If the bit number of the sign index is not equal to the number of the non-zero elements in the decoded absolute vector, the sign of the last non-zero element is recovered. The recovery rule, based on the RE8 lattice property, is as follows: if the sum of all output vector elements is not an integer multiple of 4, the sign of the last element is set to negative.
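The RE8-based recovery of a missing last sign can be sketched as below; the convention that a sign bit of 1 means a negative component is an assumption.

```python
def apply_signs(abs_vec, sign_bits):
    # sign_bits carries one bit per non-zero element, except possibly the
    # last one. A missing last sign is recovered from the RE8 property that
    # the sum of all components must be a multiple of 4.
    out = list(abs_vec)
    nz = [i for i, x in enumerate(abs_vec) if x != 0]
    for i, bit in zip(nz, sign_bits):
        if bit:
            out[i] = -out[i]
    if len(sign_bits) == len(nz) - 1 and sum(out) % 4 != 0:
        out[nz[-1]] = -out[nz[-1]]
    return out
```

In the second test case below, the partial signs leave a component sum of 2, so the last non-zero element is flipped to negative to restore a multiple of 4.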
6.1.1.2.1.7 Decoding the gains
6.1.1.2.1.7.1 Decoding memory-less coded gains
Before calculating the adaptive and algebraic codebook gains in each subframe, the predicted algebraic codevector energy is decoded for the whole frame.

The algebraic codebook excitation energy in dB in a given subframe is given by

()

where the pre-filtered algebraic codevector is used. A predicted algebraic codebook gain is then calculated as

()

An index is then retrieved from the bitstream, representing a jointly quantized adaptive codebook gain along with a correction factor. The quantized adaptive codebook gain is retrieved directly from the codebook, and the quantized algebraic codebook gain is given by

(1453)

where the decoded correction factor multiplies the predicted algebraic codebook gain.

Note that no prediction based on parameters from past frames is used. This increases the robustness of the codec to frame erasures.
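The memory-less gain decoding can be sketched as follows; the 20·log10 relation between energy and gain is the standard ACELP form, and the function signature is illustrative.

```python
import math

def decode_gains(code, E_pred, g_pitch_q, gamma_q):
    # Innovation energy in dB of the pre-filtered algebraic codevector.
    Ei = 10.0 * math.log10(sum(x * x for x in code) / len(code))
    # Predicted algebraic codebook gain.
    g_pred = 10.0 ** ((E_pred - Ei) / 20.0)
    # Eq. (1453): quantized gain = correction factor * predicted gain.
    return g_pitch_q, gamma_q * g_pred
```

With a unit-energy codevector (Ei = 0 dB) and a predicted energy of 0 dB, the decoded algebraic gain equals the correction factor itself.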
6.1.1.2.1.7.2 Decoding memory-less joint coded gains at lowest bit-rates
For the lowest bitrates of 7.2 and 8.0 kbps, a slightly different memory-less joint gain coding scheme is used.

Similarly as in the encoder, the estimated (predicted) gain of the algebraic codebook in the first subframe is given by

()

where CT is the coding mode, selected for the current frame in the pre-processing part, and the inner term inside the logarithm corresponds to the gain of the innovation vector, computed from the energy of the filtered algebraic codevector. The only parameter in the equation above is the coding mode CT, which is constant for all subframes of the current frame. The superscript [0] denotes the first subframe of the current frame.

In all subframes following the first subframe, the estimated value of the algebraic codebook gain is given by

()

where k = 1, 2, 3. Note that the terms in the first and in the second sum of the exponent are the quantized gains of the algebraic and adaptive excitations of the previous subframes, respectively. Note also that the term including the gain of the innovation vector is not subtracted. The reason is the use of the quantized values of past algebraic codebook gains, which are already close enough to the optimal gain, so it is not necessary to subtract this gain again.

The gain de-quantization in the decoder is done by retrieving the codevector according to the index received in the bitstream. The quantized value of the fixed codebook gain is then calculated as

()
6.1.1.2.1.7.3 Decoding scalar coded gains at highest bit-rates
As described in subclause 6.1.1.2.1.3.1, before calculating the adaptive and algebraic codebook gains in each subframe, the predicted algebraic codevector energy is decoded for the whole frame. Then two indexes are retrieved from the bitstream and used to decode the adaptive codebook gain and a correction factor. The decoded algebraic codebook gain is further obtained using equation (1453).
6.1.1.2.1.8 Reconstructed excitation
The total excitation in each subframe is constructed by

u(n) = ĝp v(n) + ĝc c(n) ()

where c(n) is the pre-filtered algebraic codevector, v(n) is the adaptive codevector, and ĝp and ĝc are the decoded adaptive and algebraic codebook gains.

In case the combined algebraic codebook is used, the total excitation in each subframe is constructed by

()

The excitation signal, u(n), is used to update the contents of the adaptive codebook for the next frame. The excitation signal is then post-processed as described in subclause 7.1.2.4 to obtain the post-processed excitation signal, which is finally used as an input to the synthesis filter.
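The excitation reconstruction can be sketched as the standard ACELP sum of the scaled adaptive and algebraic contributions:

```python
def total_excitation(g_p, v, g_c, c):
    # u(n) = g_p * v(n) + g_c * c(n), with v(n) the adaptive codevector and
    # c(n) the pre-filtered algebraic codevector.
    return [g_p * vn + g_c * cn for vn, cn in zip(v, c)]
```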
6.1.1.2.2 Reconstruction of the excitation in TC mode
In TC mode, the TC frame configuration (subclause 6.8.4.2.2) is decoded first. Then, the adaptive excitation signal is either a zero vector, a glottal-shape codevector or an adaptive codebook vector. In a subframe where the glottal-shape codebook is used, the reconstruction of the glottal-shape codevector is done using the received TC parameters as described in subclause 6.8.4.2.1. In a subframe where the adaptive codebook is used, the adaptive codevector is found as described in subclause 7.1.2.1.1. In all subframes after the one where the glottal-shape codebook is used, a low-pass filtering is applied to the adaptive excitation to obtain the filtered adaptive excitation.
If a subframe contains a zero adaptive excitation vector, only the algebraic codebook gain is decoded using a 2-bit or 3-bit scalar quantizer (described in subclause 6.8.4.2.4). Otherwise, the adaptive and algebraic codebook gains are decoded as in GC and VC modes (described in subclause 7.1.2.1.3).
Finally, the reconstructed excitation is computed as described in subclause 7.1.2.1.4.
6.1.1.2.3 Reconstruction of the excitation in UC mode at low rates
6.1.1.2.3.1 Decoding the innovative vector
In UC mode, the signs and indices of the two random vectors are decoded and the excitation is reconstructed as in subclause 5.2.3.3.1. The correction of the random codebook tilt is used as described in subclause 5.2.3.3.2.
6.1.1.2.3.2 Decoding the random codebook gain
In UC mode, only the random codebook gain is transmitted. The received index gives the gain in dB, using the relations and quantization step defined in subclause 5.2.3.3.4, together with the minimum and maximum values given in the same subclause. The quantized gain is then obtained according to subclause 5.2.3.3.4.
6.1.1.2.3.3 Enhancement of background noise
The anti-swirling technique is applied in inactive periods, at 9.6 kbps for NB signals, and at 9.6 kbps and below for WB and SWB signals. This technique is based on the decoded SAD and noisiness parameters. Basically, the anti-swirling effect is achieved by means of LP parameter smoothing in combination with reducing the power variations and spectral fluctuations of the excitation signal during detected periods of signal inactivity.
6.1.1.2.3.3.1 LP parameter smoothing
The LP parameter smoothing is done in two steps. First, a low-pass filtered set of LSP parameters is calculated by first-order autoregressive filtering according to

()

where the result is the low-pass filtered frame-end LSP parameter vector obtained for the current frame, computed from the decoded frame-end LSP parameter vector for the current frame and a weighting factor controlling the degree of smoothing.

In a second step, a weighted combination between the low-pass filtered LSP parameter vector and the decoded LSP parameter vectors (the previous frame-end, the current mid-frame and the current frame-end vectors) is calculated using a further weighting factor. That is

(1460)

As mentioned in subclause 7.1.1, LSP interpolation is performed to obtain four LSP vectors, one for each individual subframe. This interpolation is based on the decoded frame-end LSP parameter vector of the previous frame, the decoded mid-frame LSP parameter vector of the current frame and the decoded frame-end LSP parameter vector of the current frame. Subsequently, instead of using these parameters, their smoothed versions given above are employed.

It is noteworthy that the degree of smoothing is controlled by means of a control factor, which is described in subclause 6.1.1.2.3.3.3.
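The first smoothing step, a first-order autoregressive low-pass filter over the LSP vector, can be sketched as below; the weighting factor value used in the example is an assumption.

```python
def smooth_lsp(prev_smoothed, decoded, beta):
    # First-order AR low-pass of the LSP vector; beta close to 1 gives
    # stronger smoothing of the decoded LSP parameters.
    return [beta * p + (1.0 - beta) * q for p, q in zip(prev_smoothed, decoded)]
```

Called once per frame with the previous output fed back, the smoothed vector tracks the decoded one with a lag determined by beta.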
6.1.1.2.3.3.2 Modification of the excitation signal
One essential element of the anti-swirling technique is the reduction of power and spectrum fluctuations of the signal during periods of signal inactivity.
In the first step, tilt compensation of the excitation signal is performed with a first-order tilt compensation filter given as

()

whose coefficient is calculated as

()

from the zero-th and the first autocorrelation coefficients of the original excitation signal. The tilt compensation is carried out on a subframe basis.
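The tilt compensation coefficient derived from the excitation autocorrelations can be sketched as below; the exact first-order filter form (here 1 − μ·z⁻¹) is an assumption.

```python
def tilt_compensate(x):
    # r0, r1: zero-th and first autocorrelation coefficients.
    r0 = sum(v * v for v in x)
    r1 = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    mu = r1 / r0 if r0 > 0.0 else 0.0
    # First-order filter 1 - mu*z^-1 applied to the excitation.
    y = [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]
    return y, mu
```

For a constant (strongly tilted) excitation the coefficient approaches 1, so most of the low-frequency tilt is removed.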
In the second step, the spectral fluctuations of the excitation signal are further reduced by replacing a part of it with a white noise signal. To this end, first a random sequence of unit variance is generated. This signal is then scaled by means of a gain factor in such a way that its power equals the smoothed power of the excitation signal. The gain factor is obtained by filtering the RMS value of the excitation signal on a frame-by-frame basis. That is

()

The noise is scaled by multiplying all its samples by the gain factor. Then, with some weighting factor, the excitation signal is combined with the scaled noise signal. This is done according to the following equation, leading to the smoothed excitation signal:

()
It is noteworthy that the degree of excitation signal smoothing is controlled by means of the control factor, which is described in subclause 6.1.1.2.3.3.3.
6.1.1.2.3.3.3 Controlling the background noise smoothing
The anti-swirling method described in the clauses above is controlled by means of two control parameters, in response to the received SAD and noisiness parameters.

First, the received and decoded noisiness parameter steers an intermediate smoothing control parameter, such that the degree of smoothing is only increased gradually, up to the maximum degree indicated by the received parameter. Given the received noisiness parameter, the intermediate parameter is set according to the following relation:

()

where the stored intermediate control parameter from the previous frame is updated with a step size, with which the smoothing control parameters are steered towards the target value as long as they are greater than it. In case the current frame is erased, the intermediate parameter is set to the intermediate control parameter of the previous frame.
The SAD parameter activates the smoothing operation only when the received SAD flag, , indicates inactivity. However, in order to decrease the risk that smoothing is enabled during active signal periods erroneously declared as inactive, the background noise smoothing is only enabled after a hangover period of 5 frames. Further, whenever the SAD declares a frame as active, the smoothing operation is disabled. In order to avoid adding a new hangover period after spurious SAD activation, no hangover is added if the detected activity period is less than or equal to 3 frames.
In addition to this SAD-driven activation, for quality reasons, it is important to avoid the anti-swirling operation being turned on too abruptly. To this end, after each hangover period, a phase-in period of frames is applied, during which the smoothing operation is gradually steered from inactive to fully enabled. Accordingly, for the n-th frame of the phase-in period, the smoothing control parameters
and
are calculated as follows:
()
For all other frames (during which the smoothing is activated) and
.
It is noteworthy that phase-in periods are only inserted after hangover periods, i.e., not after spurious SAD activations of less than 3 frames.
6.1.1.2.4 Reconstruction of the excitation in IC/UC mode at 9.6 kbps
6.1.1.2.4.1 Decoding of the innovative excitation
In IC and UC modes at 9.6 kbps, the decoding of the algebraic codebook excitation is the same as described in subclause 6.1.1.2.1.2.
At WB, an additional Gaussian noise excitation is generated as described in subclause 5.2.3.4.2.
6.1.1.2.4.2 Gains decoding
In NB, only the algebraic codeword gain is calculated as
()
The algebraic codebook excitation energy in dB, , is computed as in equation (). The quantized gain in dB is given by
()
The quantization index (6 bits) is retrieved directly from the bitstream (subclause 5.2.3.4.3.2).
For WB the quantized algebraic codeword gain and Gaussian noise excitation gain
are decoded. They are calculated respectively as
()
()
The quantized gain in dB is given by
()
The quantization index (5 bits) and
(2 bits) are retrieved from the bitstream. The predicted algebraic codevector energy,
, is decoded for the whole frame prior to calculating the algebraic codebook gain in each subframe (subclause 5.2.3.4.3.2).
6.1.1.2.4.4 Total excitation
The total excitation in each subframe is constructed by
()
where and
are the pre-filtered algebraic codevector and the pre-filtered Gaussian noise excitation respectively.
Only the algebraic codevector is used to update the contents of the adaptive codebook for the next frame.
The excitation signal, , is then post processed as described in subclause 6.1.1.3 to obtain the post-processed excitation signal
, which is finally used as an input to the synthesis filter
.
6.1.1.2.5 Reconstruction of the excitation in GSC
In GSC mode, the attack flag is first decoded (subclause 5.1.13.5.3). Then, the number of subframes is decoded. To do so, if the bit rate is 13.2 kbit/s and the coding mode is INACTIVE, the first step is to decode 1 bit to verify whether the coded frame is an SWB unvoiced frame, which would imply 4 subframes. Otherwise, when the number of subframes is less than 4, the noise level as defined in subclause 5.2.3.5.4 is decoded. If the bit rate is 13.2 kbit/s, a supplementary bit is then decoded to determine whether the number of subframes is 1 or 2; for lower bit rates the number of subframes is 1.
Then the cut-off frequency (as defined in subclause 5.2.3.5.6) is decoded and, if it is different from 0, the time-domain contribution is decoded (subclause 5.2.3.5.2). When a time-domain contribution exists, it is converted to the frequency domain and low-pass filtered using the decoded cut-off frequency as described in subclause 5.2.3.5.6; otherwise the time-domain contribution is set to 0.
Then the frequency-domain component is decoded, starting with the gains of the sub-bands as defined in subclause 5.2.3.5.7. The gain information is then used to determine the bit allocation, the number of bands and the order of the bands to be decoded by the PVQ as described in subclause 5.2.3.5.8. The PVQ then decodes the spectral difference, and spectral dynamic control and noise filling are applied on the decoded vector as described in subclause 5.2.3.5.10. When the spectral difference vector is complete, the gain is applied and the vector is combined, in the frequency domain, with the temporal contribution as described in subclause 5.2.3.5.11. If the decoded frequency excitation meets the given condition, the un-decoded frequency excitation is predicted from the decoded frequency excitation as described in subclause 5.2.3.5.12. The complete excitation in the frequency domain is converted back to the time domain using the inverse DCT as described in subclause 5.2.3.5.12, and then pre-echo removal is applied as in subclause 5.2.3.5.13 to obtain the total excitation.
6.1.1.3 Excitation post-processing
Before the synthesis, a post-processing of the excitation signal, , is performed to form the updated excitation signal,
, as follows.
6.1.1.3.1 Anti-sparseness processing
An adaptive anti-sparseness post-processing procedure is applied to the pre-filtered algebraic codevector, . This is to reduce the perceptual artefacts arising from the sparseness of algebraic codebook vectors having only a few non-zero samples per subframe. The anti-sparseness processing consists of circular convolution of the algebraic codevector with an impulse response by means of an FFT. Three pre-stored impulse responses are used and a selection number
0, 1 or 2 is set to select one of them. A value of 2 or greater corresponds to no modification; a value of 1 corresponds to medium modification and a value of 0 corresponds to strong modification. The selection of the impulse response is performed adaptively based on the decoded adaptive codebook gain,
, coding mode and bit rate.
The following selection procedure is employed where is the algebraic codebook gain in the previous subframe,
are current and 5 previous subframes’ adaptive codebook gains and
is the previous selection number.
()
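The anti-sparseness modification itself (circular convolution of the codevector with a selected impulse response via the FFT) can be sketched as follows; the three pre-stored impulse responses are codec tables and are not reproduced here.

```python
import numpy as np

def anti_sparseness(code, impulse_response):
    """Circularly convolve an algebraic codevector with an impulse
    response by multiplying their spectra (FFT-based circular convolution)."""
    n = len(code)
    h = np.zeros(n)
    h[:len(impulse_response)] = impulse_response   # zero-pad to subframe length
    return np.real(np.fft.ifft(np.fft.fft(code) * np.fft.fft(h)))
```

A unit impulse response leaves the codevector unmodified, which corresponds to the "no modification" selection.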
6.1.1.3.2 Gain smoothing for noise enhancement
A nonlinear gain smoothing technique is applied to the algebraic codebook gain, , in order to enhance the excitation in noise. Based on the stability and voicing of the signal segment, the gain of the algebraic codebook vector is smoothed in order to reduce fluctuation in the energy of the excitation in case of stationary signals. This improves the performance in case of stationary background noise. The voicing factor
is given by
()
with giving a measure of signal periodicity
(1475)
where and
are the energies of the scaled pitch codevector and scaled algebraic codevector, respectively. Note that since the value of
is between –1 and 1, the value of
is between 0 and 1. Note that the factor
is related to the amount of "unvoicing" with a value of 0 for purely voiced segments and a value of 1 for purely unvoiced segments.
A stability factor is computed based on a distance measure between the adjacent LP filters. Here, the factor
is related to the LSF distance measure. The LSF distance is given by
()
where are the LSFs in the present frame, calculated in subclause 7.1.1, and
are the LSFs in the previous frame. The stability factor
is given by
()
The LSF distance measure is smaller in case of stable signals. As the value of is inversely related to the LSF distance measure, then larger values of
correspond to more stable signals. The gain smoothing factor,
, is given by
()
The value of approaches 1 for unvoiced and stable signals, which is the case of stationary background noise signals. For purely voiced signals, or for unstable signals, the value of
approaches 0. An initial modified gain,
, is computed by comparing the algebraic codebook gain,
, to a threshold given by the initial modified gain from the previous subframe,
. If
is larger than or equal to
, then
is computed by decrementing
by 1.5 dB, constrained by
. If
is smaller than
, then
is computed by incrementing
by 1.5 dB, constrained by
. Finally, the algebraic codebook gain is modified using the value of the smoothed gain as follows
(1479)
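A minimal sketch of the gain smoothing logic described above, assuming the smoothing factor is the product of the voicing and stability factors and that the ±1.5 dB stepping is applied as a bounded linear-domain ratio (both assumptions follow the text but this is not the normative implementation):

```python
def smooth_codebook_gain(g_c, g0_prev, voicing, stability):
    """Nonlinear smoothing of the algebraic codebook gain.

    g_c       : decoded algebraic codebook gain
    g0_prev   : initial modified gain from the previous subframe
    voicing   : factor in [0, 1] (1 = purely unvoiced)
    stability : factor in [0, 1] (1 = stable LP filters)
    Returns the smoothed gain and the new initial modified gain g0.
    """
    step = 10 ** (1.5 / 20.0)              # 1.5 dB as a linear ratio
    if g_c >= g0_prev:
        g0 = max(g_c / step, g0_prev)      # decrement by 1.5 dB, bounded below
    else:
        g0 = min(g_c * step, g0_prev)      # increment by 1.5 dB, bounded above
    s_m = voicing * stability              # gain smoothing factor
    return s_m * g0 + (1.0 - s_m) * g_c, g0
```

For purely voiced or unstable segments the smoothing factor approaches 0 and the decoded gain passes through unchanged.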
6.1.1.3.3 Pitch enhancer
A pitch enhancer scheme modifies the total excitation of voiced signals by filtering the algebraic codebook excitation through an innovation filter. The filter frequency response emphasizes the higher frequencies and reduces the energy of the low-frequency portion of the innovative codevector. The filter coefficients are related to the periodicity of the signal. Therefore, the pitch enhancer is not applied to the excitation in UC at low bit rates, i.e. bit rates below 9.6 kbit/s.
A filter of the form
()
is used where if
Hz and
if
Hz, with
being a periodicity factor given in equation (). The filtered algebraic codebook vector in the current subframe is given by
()
where the out-of-subframe samples and
are set to zero. The updated post-processed excitation is given by
()
The above procedure can be done in one step by updating the excitation as follows
()
where is the modified algebraic codebook gain from equation ().
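Assuming a symmetric three-tap innovation filter of the form F(z) = -c z + 1 - c z^{-1} (consistent with the out-of-subframe samples c(n-1) and c(n+1) being set to zero), the filtering of the codevector can be sketched as:

```python
def pitch_enhance(code, c_pe):
    """Apply a high-emphasis innovation filter F(z) = -c z + 1 - c z^{-1}
    to the algebraic codevector; out-of-subframe samples are zero."""
    n = len(code)
    out = []
    for i in range(n):
        prev = code[i - 1] if i > 0 else 0.0      # c(n-1), zero outside subframe
        nxt = code[i + 1] if i < n - 1 else 0.0   # c(n+1), zero outside subframe
        out.append(code[i] - c_pe * (prev + nxt))
    return out
```

With the periodicity-dependent coefficient `c_pe` set to 0 the codevector is unchanged, matching the case where the enhancer is inactive.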
6.1.1.3.4 Music post processing
In case of sound signals coded with the GSC, a music enhancer scheme modifies the total excitation corresponding to the sound signal in such a way that the quantization noise inserted between spectral tones during the encoding/decoding process can be reduced. The music enhancer consists of converting the decoded excitation into the frequency domain, computing a weighting mask for retrieving spectral information lost in the quantization noise, modifying the frequency-domain excitation by applying the weighting mask to increase the spectral dynamics, and converting the modified frequency-domain excitation back to the time domain.
The current frequency domain post processing achieves higher frequency resolution, without adding delay to the synthesis. A weighting mask is created based on the past spectrum energy and used to improve the efficiency of the inter-tone noise removal. To achieve this post processing without adding delay to the codec, a symmetric trapezoidal window is used. It is centred on the current frame where the window is flat, and extrapolation is used to create the future signal. The advantage of working on the excitation signal rather than on the synthesis signal is that any potential discontinuities introduced by the post processing are smoothed out by the subsequent application of the LP synthesis filter. The following text describes the implementation of the music post processing.
6.1.1.3.4.1 Excitation buffering and extrapolation
To increase the frequency resolution, a frequency transform longer than the frame length is used. To do so, a concatenated excitation vector is created by concatenating the last 192 samples of the previous frame excitation, the decoded excitation of the current frame
, and an extrapolation of 192 excitation samples of the future frame
. This is described below where
is the length of the past excitation as well as the length of the extrapolated excitation, and
is the frame length. These correspond to 192 and 256 samples respectively, giving the total length
samples:
()
The extrapolation of the future excitation samples is computed by periodically extending the current frame excitation signal
using the decoded factional pitch of the last subframe of the current frame. Given the fractional resolution of the pitch lag, an upsampling of the current frame excitation is performed using a 35 samples long Hamming windowed sinc function.
6.1.1.3.4.2 Windowing and frequency transform
Prior to the time-to-frequency transform a windowing is performed on the concatenated excitation. The selected window has a flat top corresponding to the current frame, and it decreases with the Hanning function to 0 at each end. The following equation represents the window used:
()
When applied to the concatenated excitation, an input to the frequency transform having a total length samples (
) is obtained in the prototype. The windowed concatenated excitation
is centered on the current frame and is represented with the following equation:
()
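The window (flat over the 256-sample current frame, Hanning-shaped over the 192-sample flanks) can be generated as:

```python
import numpy as np

def trapezoidal_window(l_w=192, l_frame=256):
    """Flat-top window over the current frame with Hanning-shaped
    rise/fall of length l_w at each end (total 2*l_w + l_frame samples)."""
    ramp = np.hanning(2 * l_w)            # full Hanning window of length 2*l_w
    rise, fall = ramp[:l_w], ramp[l_w:]   # first half rises, second half falls
    return np.concatenate([rise, np.ones(l_frame), fall])
```

The resulting 640-sample window is symmetric and multiplies the concatenated excitation before the transform.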
During the frequency-domain post-processing phase, the concatenated excitation is represented in a transform domain using a type II DCT giving a resolution of 10 Hz. The frequency representation of the concatenated and windowed time-domain CELP excitation fu is given below:
()
where is the concatenated and windowed time-domain excitation and
is the length of the frequency transform. The frame length
is 256 samples, but the length of the frequency transform
is 640 samples for a corresponding inner sampling frequency of 12.8 kHz.
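An orthonormal type II DCT can be written directly from its definition; the normative transform may use a different scaling convention, so this is only a sketch:

```python
import numpy as np

def dct_ii(x):
    """Type II DCT with orthonormal scaling (10 Hz resolution when the
    length is 640 at 12.8 kHz internal sampling)."""
    n = len(x)
    idx = np.arange(n)
    # basis[k, m] = cos(pi * (2m + 1) * k / (2n))
    basis = np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * n))
    y = 2.0 * (basis @ x)
    y *= np.sqrt(1.0 / (2 * n))   # orthonormal scaling
    y[0] *= np.sqrt(0.5)          # extra factor for the DC coefficient
    return y
```

The orthonormal scaling makes the transform energy-preserving, which is convenient for the per-band energy analysis that follows.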
6.1.1.3.4.3 Energy per band and per bin analysis
After the DCT, the resulting spectrum is divided into critical frequency bands. The critical frequency bands used in the prototype are as close as possible to what is specified in [17], and their upper limits are defined as follows:
()
The 640-point DCT results in a frequency resolution of 10 Hz (6400 Hz / 640 points). The number of frequency bins per critical frequency band is
()
The average spectral energy per critical frequency band is computed as follows:
()
where represents the hth frequency bin of a critical band and
is the index of the first bin in the ith critical band given by
()
The spectral analysis also computes the energy of the spectrum per frequency bin, using the following relation:
()
Finally, the spectral analysis computes a total spectral energy of the concatenated excitation as the sum of the spectral energies of the first 17 critical frequency bands using the following relation:
()
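The per-band and per-bin energy analysis can be sketched as follows, assuming the bin energy is the squared DCT coefficient and `band_edges` holds the first-bin indices of the critical bands plus a final end index:

```python
import numpy as np

def band_bin_energies(spectrum, band_edges):
    """Average spectral energy per critical band and energy per bin.

    spectrum   : DCT coefficients of the concatenated excitation
    band_edges : first-bin index of each band, plus the end index
    """
    e_bin = spectrum ** 2                         # energy per frequency bin
    e_band = np.array([e_bin[band_edges[i]:band_edges[i + 1]].mean()
                       for i in range(len(band_edges) - 1)])
    return e_band, e_bin
```

The total spectral energy is then the sum of `e_band` over the first 17 critical bands.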
6.1.1.3.4.4 Excitation type classification
The method for enhancing the decoded generic sound signal includes an additional analysis of the excitation signal designed to maximize the efficiency of the inter-harmonic noise reducer by identifying which frames are well suited for the inter-tone noise reduction.
This classifier not only separates the decoded concatenated excitation into sound signal categories, but it also gives instructions to the inter-harmonic noise reducer regarding the maximum level of attenuation and the minimum frequency where the reduction can start.
The first operation consists in performing an energy stability analysis based on the total spectral energy of the concatenated excitation:
()
where represents the average difference of the energies of the concatenated excitation vectors of two adjacent frames,
represents the energy of the concatenated excitation of the current frame
, and
represents the energy of the concatenated excitation of the previous frame
. The average is computed over the last 40 frames.
Then, a statistical deviation of the energy variation over the last fifteen (15) frames is calculated using the following relation:
()
where, in the prototype, the scaling factor is found experimentally and set to about 0.77. The resulting deviation
is compared to four (4) floating thresholds to determine to what extent the noise between harmonics can be reduced. The output of this second stage classifier is split into five (5) sound signal categories
, named sound signal categories 0 to 4. Each sound signal category has its own inter-tone noise reduction tuning.
The five (5) sound signal categories 0-4 can be determined as indicated in the following table.
Table 156: Output characteristic of the excitation classifier
Category (eCAT)    Enhanced band (Hz)    Allowed reduction (dB)
0                  NA                    0
1                  [510, 6400]           6
2                  [510, 6400]           9
3                  [400, 6400]           12
4                  [300, 6400]           12
The sound signal category 0 is a non-tonal, non-stable sound signal category which is not modified by the inter-tone noise reduction technique. This category of the decoded sound signal has the largest statistical deviation of the spectral energy variation and in general comprises speech signals.
Sound signal category 1 (largest statistical deviation of the spectral energy variation after category 0) is detected when the statistical deviation of the spectral energy variation history is lower than Threshold 1 and the last detected sound signal category is ≥ 0. Then the maximum reduction of quantization noise of the decoded tonal excitation within the frequency band 510 to 6400 Hz is limited to a maximum noise reduction of 6 dB.
Sound signal category 2 is detected when the statistical deviation of spectral energy variation is lower than Threshold 2 and the last detected sound signal category is ≥ 1. Then the maximum reduction of quantization noise of the decoded tonal excitation within the frequency band 510 to 6400 Hz is limited to a maximum of 9 dB.
Sound signal category 3 is detected when the statistical deviation of spectral energy variation is lower than Threshold 3 and the last detected sound signal category is ≥ 2. Then the maximum reduction of quantization noise of the decoded tonal excitation within the frequency band 400 to 6400 Hz is limited to a maximum of 12 dB.
Sound signal category 4 is detected when the statistical deviation of spectral energy variation is lower than Threshold 4 and when the last detected signal type category is ≥ 3. Then the maximum reduction of quantization noise of the decoded tonal excitation within the frequency band 300 to 6400 Hz is limited to a maximum of 12 dB.
The floating thresholds 1-4 help prevent wrong signal type classification. Typically, a decoded tonal sound signal representing music exhibits a much lower statistical deviation of its spectral energy variation than speech. However, even a music signal can contain segments with higher statistical deviation, and similarly a speech signal can contain segments with lower statistical deviation. It is nevertheless unlikely that speech and music contents change regularly from one to another on a frame basis. The floating thresholds add decision hysteresis and act as reinforcement of the previous state to substantially prevent any misclassification that could result in a suboptimal performance of the inter-harmonic noise reducer.
Counters of consecutive frames of sound signal category 0, and counters of consecutive frames of sound signal category 3 or 4, are used to respectively decrease or increase the thresholds.
For example, if a counter counts a series of more than 30 frames of sound signal category 3 or 4, all the floating thresholds (1 to 4) are increased by a predefined value for the purpose of allowing more frames to be considered as sound signal category 4.
The inverse is also true with sound signal category 0. For example, if a series of more than 30 frames of sound signal category 0 is counted, all the floating thresholds (1 to 4) are decreased for the purpose of allowing more frames to be considered as sound signal category 0. All the floating thresholds 1-4 are limited to absolute maximum and minimum values to ensure that the signal classifier is not locked to a fixed category.
In the case of frame erasure, all the thresholds 1-4 are reset to their minimum values and the output of the signal classifier is considered as non-tonal (sound signal category 0) for three (3) consecutive frames (including the lost frame).
If information from a Voice Activity Detector (VAD) is available and it indicates no voice activity (presence of silence), or if the last frame did not contain generic audio, the decision of the signal type classifier is forced to sound signal category 0.
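The category decision with its "last category" reinforcement can be sketched as below; the four floating thresholds are assumed to be given in decreasing order (Threshold 1 largest), and their adaptation counters and reset logic are omitted:

```python
def classify_excitation(deviation, thresholds, last_cat):
    """Map the energy-variation deviation to a sound signal category 0-4.

    A category i >= 1 requires deviation < thresholds[i-1] and the last
    detected category to be >= i-1 (hysteresis / state reinforcement)."""
    cat = 0
    for i, thr in enumerate(thresholds, start=1):
        if deviation < thr and last_cat >= i - 1:
            cat = i          # promote to the next category
        else:
            break            # conditions fail: keep the last satisfied level
    return cat
```

With a low deviation but a previous category of 0, the classifier can only advance to category 1, mirroring the gradual promotion described above.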
6.1.1.3.4.5 Inter-tone noise reduction in the excitation domain
Inter-tone noise reduction is performed on the frequency representation of the concatenated excitation as a first operation of the enhancement. The reduction of the inter-tone quantization noise is performed by scaling the spectrum in each critical band with a scaling gain limited between a minimum and a maximum gain
and
. The scaling gain is derived from an estimated signal-to-noise ratio (SNR) in that critical band. The processing is performed on frequency bin basis and not on critical band basis. Thus, the scaling gain is applied on all frequency bins, and it is derived from the SNR computed using the bin energy divided by an estimation of the noise energy of the critical band including that bin. This feature allows for preserving the energy at frequencies near harmonics or tones, thus substantially preventing distortion, while strongly reducing the noise between the harmonics. The inter-tone noise reduction is performed in a per bin manner over all 640 bins.
The minimum scaling gain is derived from the maximum allowed inter-tone noise reduction in dB,
. As described above, the second stage of classification makes the maximum allowed reduction vary between 6 and 12 dB. Thus, the minimum scaling gain is given by
()
The scaling gain is computed in relation to the SNR per bin. Then, per-bin noise reduction is performed on the entire spectrum up to the maximum frequency of 6400 Hz. The noise reduction can start at the 2nd critical band (i.e. no reduction is performed below 300 Hz). The excitation type classifier module can push the starting critical band up to the 4th band (510 Hz) to reduce any potential degradation. This means that the first critical band on which the noise reduction is performed lies between 300 Hz and 920 Hz, and it can vary on a frame basis. In a more conservative implementation, the minimum band where the noise reduction starts can be set higher.
The scaling for a certain frequency bin is computed as a function of
, given by
, bounded by
()
Usually is equal to 1 (i.e. no amplification is allowed), then the values of
and
are determined such that
for
dB, and
for
dB. That is, for SNRs of 1 dB and lower, the scaling is limited to
and for SNRs of 45 dB and higher, no noise reduction is performed (
). Thus, given these two end points, the values of
and
in equation are given by
and
()
If is set to a value higher than 1, the process is allowed to slightly amplify the tones having the highest energy. This can be used to compensate for the fact that the CELP codec, used in the prototype, does not perfectly match the energy in the frequency domain. This is generally the case for signals different from voiced speech.
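Assuming a linear mapping g = k·SNR + c through the two stated end points (the minimum gain at 1 dB, 1.0 at 45 dB) — the exact functional form is not reproduced in this extraction — the per-bin scaling gain can be sketched as:

```python
def scaling_gain(snr_db, g_min, g_max=1.0):
    """Per-bin scaling gain as a function of SNR in dB, pinned to g_min
    at 1 dB and to 1.0 at 45 dB, bounded by [g_min, g_max]."""
    k = (1.0 - g_min) / 44.0          # slope from the two end points
    c = (45.0 * g_min - 1.0) / 44.0   # offset from the two end points
    return min(max(k * snr_db + c, g_min), g_max)
```

For SNRs of 1 dB and lower the gain saturates at `g_min`; for 45 dB and higher no reduction is applied, matching the end-point conditions in the text.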
The SNR per bin in a certain critical band is computed as
()
where and
denote the energy per frequency bin for the past and the current frame spectral analysis, respectively, as computed in subclause 5.1.5.2,
denotes the noise energy estimate of the critical band
,
is the index of the first bin in the ith critical band, and
is the number of bins in the critical band
as defined above.
The smoothing factor is adaptive and it is made inversely related to the gain itself. The smoothing factor is given by . That is, the smoothing is stronger for smaller gains
. This approach substantially prevents distortion in high SNR segments preceded by low SNR frames, as it is the case for voiced onsets. The smoothing procedure is able to quickly adapt and to use lower scaling gains on onsets.
In case of per bin processing in a critical band with index , after determining the scaling gain and using
the actual scaling is performed using a smoothed scaling gain
, updated in every frequency analysis as follows
()
Temporal smoothing of the gains substantially prevents audible energy oscillations while controlling the smoothing using substantially prevents distortion in high SNR segments preceded by low SNR frames, as it is the case for voiced onsets or attacks.
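With the adaptive smoothing factor made inversely related to the gain as described above (assumed here to be alpha = 1 - g_s), the smoothed gain update can be sketched as:

```python
def update_smoothed_gain(g_s, g_prev):
    """First-order smoothing of the per-bin scaling gain; the smoothing
    factor alpha = 1 - g_s is stronger for smaller gains."""
    alpha = 1.0 - g_s
    return alpha * g_prev + (1.0 - alpha) * g_s
```

A gain of 1 (no reduction) passes through immediately, so onsets are not over-attenuated, while small gains adapt slowly.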
The scaling in the critical band is performed as
()
where is the index of the first bin in the critical band
and
is the number of bins in that critical band.
The smoothed scaling gains are initially set to 1. Each time a non-tonal sound frame is processed
, the smoothed gain values are reset to 1 to reduce any possible reduction in the next frame.
Note that in every spectral analysis, the smoothed scaling gains are updated for all frequency bins in the entire spectrum. Note that in case of low-energy signal, inter-tone noise reduction is limited to -1.25 dB. This happens when the maximum noise energy in all critical bands,
is less than or equal to 10.
6.1.1.3.4.6 Inter-tone quantization noise estimation
The inter-tone quantization noise energy per critical frequency band is estimated as being the average energy of that critical frequency band excluding the maximum bin energy of the same band. The following formula summarizes the estimation of the quantization noise energy for a specific band :
()
where is the index of the first bin in the critical band
,
is the number of bins in that critical band,
is the average energy of a band
,
is the energy of a particular bin and
NB(i) is the resulting estimated noise energy of a particular band
. The variable
represents a noise scaling factor per band that is found experimentally and is set such that more noise can be removed in low frequencies and less noise in high frequencies as it is shown below:
()
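A sketch of the per-band noise estimate (mean bin energy with the maximum-energy bin excluded, scaled by the experimentally found per-band factor):

```python
import numpy as np

def band_noise_estimate(e_bin, first_bin, n_bins, scale):
    """Estimate the inter-tone quantization noise energy of one critical
    band as the average bin energy excluding the maximum bin, scaled by
    the per-band noise scaling factor."""
    band = e_bin[first_bin:first_bin + n_bins]
    return scale * (np.sum(band) - np.max(band)) / (n_bins - 1)
```

Excluding the maximum bin keeps a dominant tone from inflating the noise floor estimate of its own band.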
6.1.1.3.4.7 Increasing spectral dynamic of the excitation
The second operation of the frequency post processing provides the ability to retrieve frequency information that is lost in the coding noise. CELP codecs, especially when used at low bit rates, are not very efficient at properly coding frequency content above 3.5-4 kHz. The following steps take advantage of the fact that a music spectrum often does not change substantially from frame to frame. Therefore, long-term averaging can be performed and some of the coding noise can be eliminated. The following operations are performed to define a frequency-dependent gain function. This function is then used to further enhance the excitation before converting it back to the time domain.
6.1.1.3.4.8 Per bin normalization of the spectrum energy
The first operation consists in creating a weighting mask based on the normalized energy of the spectrum of the concatenated excitation. The normalization is done such that the tones have a value above 1.0 and the valleys a value under 1.0. To do so, the energy spectrum is normalized between 0.925 and 1.925 to get the normalized energy spectrum
using the following equation:
()
where represents the bin energy as calculated in subclause 5.1.5.2. Since the normalization is performed in the energy domain, many bins have very low values. The offset 0.925 has been chosen such that only a small part of the normalized energy bins would have a value below 1.0. Once the normalization is done, the resulting normalized energy spectrum is passed through a power function of 8 to obtain a scaled energy spectrum as shown in the following formula:
()
where is the normalized energy spectrum and
is the scaled energy spectrum. A maximum limit of the scaled energy spectrum is fixed to 5, creating a ratio of approximately 10 between the maximum and minimum normalized energy values. The following equation shows how the function is applied:
()
where represents the limited scaled energy spectrum and
is the scaled energy spectrum.
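The normalization, 8th-power scaling and limiting can be sketched as (a non-constant spectrum is assumed so the normalization denominator is non-zero):

```python
import numpy as np

def spectral_mask_base(e_bin):
    """Normalize bin energies to [0.925, 1.925], raise to the 8th power,
    and clip at 5 to obtain the limited scaled energy spectrum."""
    e_min, e_max = e_bin.min(), e_bin.max()
    normalized = 0.925 + (e_bin - e_min) / (e_max - e_min)  # in [0.925, 1.925]
    return np.minimum(normalized ** 8, 5.0)                 # scale and limit
```

Bins below the 0.925 offset fall under 1.0 after the power function (valleys), while the strongest tones saturate at the limit of 5, giving roughly a factor 10 between the extremes.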
6.1.1.3.4.9 Smoothing of the scaled energy spectrum along the frequency axis and the time axis
With the last two operations, the positions of the most energetic pulses begin to take shape. Applying a power of 8 to the bins of the normalized energy spectrum is the first operation to create the mask that increases the spectral dynamics. The next two operations enhance this spectrum mask. First, the scaled energy spectrum is smoothed along the frequency axis from low to high frequency with an averaging filter. Then, the resulting mask is processed along the time axis to smooth the bin values from frame to frame.
The smoothing of the scaled energy spectrum along the frequency axis can be described with following function:
()
Finally, the smoothing along time axis results in a time-averaged amplification/attenuation weighting mask to be applied to the spectrum
. The weighting mask, also called gain mask, is described with the following equation:
()
where is the scaled energy spectrum smoothed along the frequency axis,
is the frame index, and
is the time-averaged weighting mask.
A slower adaptation rate has been chosen for the lower frequencies to substantially prevent gain oscillation. A faster adaptation rate is allowed for higher frequencies since the positions of the tones are more likely to change rapidly in the higher part of the spectrum. With the averaging performed on the frequency axis and the long term smoothing performed along the time axis, the final vector is used as a weighting mask to be applied directly on the enhanced spectrum
of the concatenated excitation.
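A sketch of the two smoothing stages; the 3-tap frequency average and the two time-smoothing rates with a single low/high split are illustrative stand-ins for the codec's actual filter lengths and frequency-dependent adaptation rates:

```python
import numpy as np

def smooth_mask(scaled, prev_mask, alpha_low=0.9, alpha_high=0.5, split=100):
    """Smooth the scaled energy spectrum with a 3-tap average along the
    frequency axis, then apply first-order IIR smoothing along time with
    a slower rate (alpha_low) below bin `split` than above it."""
    freq_sm = np.convolve(scaled, np.ones(3) / 3.0, mode='same')
    alpha = np.where(np.arange(len(scaled)) < split, alpha_low, alpha_high)
    return alpha * prev_mask + (1.0 - alpha) * freq_sm
```

The slower low-frequency rate mirrors the text: tone positions change more rapidly in the upper spectrum, so a faster adaptation is allowed there.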
6.1.1.3.4.10 Application of the weighting mask to the enhanced concatenated excitation spectrum
The weighting mask defined above is applied differently depending on the output of the excitation type classifier (value of ). The weighting mask is not applied if the excitation is classified as category 0 (
; i.e. high probability of speech content).
For the first 1 kHz, the mask is applied if the excitation is not classified as category 0 (). Attenuation is possible but no amplification is performed in this frequency range (maximum value of the mask is limited to 1).
If more than 25 consecutive frames are classified as category 4 (; i.e. high probability of music content), but not more than 40 frames, then the weighting mask is applied without amplification for all the remaining bins (the maximum gain
is limited to 1, and there is no limitation on the minimum gain).
When more than 40 frames are classified as category 4, for the frequencies between 1 and 2 kHz the maximum gain is set to 1.5 for bitrates below 12650 bits per second (b/s). Otherwise the maximum gain
is set to 1. In this frequency band, the prototype fixes the minimum gain
to 0.75 only if the bitrate is higher than 15850 b/s, otherwise there is no limitation on the minimum gain.
For the band 2 to 4 kHz, the maximum gain is limited to 2 for bitrates below 12650 b/s, and it is limited to 1.25 for bitrates equal to or higher than 12650 b/s and lower than 15850 b/s. Otherwise, the maximum gain
is limited to 1. Still in this frequency band, the minimum gain
is 0.5 only if the bitrate is higher than 15850 b/s, otherwise there is no limitation on the minimum gain.
For the band 4 to 6.4 kHz, the maximum gain is limited to 2 for bitrates below 15850 b/s and to 1.25 otherwise. In this frequency band, the prototype fixes the minimum gain
to 0.5 only if the bitrate is higher than 15850 b/s, otherwise there is no limitation on the minimum gain.
6.1.1.3.4.11 Inverse frequency transform and overwriting of the current excitation
After the frequency domain enhancement is completed, an inverse frequency-to-time transform is performed in order to get the enhanced temporal excitation back. The frequency-to-time conversion is achieved with the same type II DCT as used for the time-to-frequency conversion. The modified time-domain excitation is obtained as
()
where is the frequency representation of the modified excitation,
is the enhanced concatenated excitation, and
is the length of the concatenated excitation vector.
To avoid adding delay to the synthesis, an overlap-and-add algorithm in the LP-domain path is avoided. Thus, the exact length of the final excitation is used to generate the synthesis directly from the enhanced concatenated excitation, without overlap, as shown in the equation below:
()
Here represents the length of the section of the window that was applied on the past segment of the excitation, prior to the frequency transformation.
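The inverse transform and the frame extraction can be sketched as follows; `idct_ii` is the inverse of an orthonormal type II DCT (i.e. a type III DCT, possibly scaled differently from the normative transform), and the current frame is taken from offset l_w = 192 of the enhanced concatenated excitation:

```python
import numpy as np

def idct_ii(y):
    """Inverse of the orthonormal type II DCT (a type III DCT)."""
    n = len(y)
    k = np.arange(n)
    # basis[m, k] = cos(pi * (2m + 1) * k / (2n))
    basis = np.cos(np.pi * (2 * k[:, None] + 1) * k[None, :] / (2 * n))
    w = np.full(n, np.sqrt(2.0 / n))   # orthonormal weights
    w[0] = np.sqrt(1.0 / n)            # DC coefficient weight
    return basis @ (w * y)

def extract_current_frame(enhanced_concat, l_w=192, l_frame=256):
    """Overwrite the current excitation with the samples of the enhanced
    concatenated excitation corresponding to the current frame (no
    overlap-and-add, so no extra delay is introduced)."""
    return enhanced_concat[l_w:l_w + l_frame]
```

Only the 256 central samples are kept; the flanks served solely to increase the frequency resolution of the analysis.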