5.4.3 Signal-based adaptation
3GPP TS 26.448, Release 17: Codec for Enhanced Voice Services (EVS); Jitter Buffer Management
5.4.3.1 General
To alter the playout delay, the output signal of the decoder is time-warped. The time-warping is performed by the Time Scale Modification module, which either generates additional samples to increase the playout delay or removes samples from the output signal to reduce the playout delay.
Figure 4: Signal-based adaptation based on signal characteristics
A SOLA (synchronized overlap-add) algorithm with built-in quality control is used for adaptation by performing time-scale modification of the signal without altering the pitch. The level of adaptation is signal-dependent, as also outlined in Figure 4:
– signals that are classified by the quality-control as "introducing severe distortions when scaled" are never scaled;
– low-level signals, which are close to silent, are time-scaled without synchronization to the maximum possible extent;
– signals that are classified as "time-scalable", e.g. periodic signals, are scaled by a synchronized overlap-add of a shifted representation of the input signal with the input signal to the Time Scale Modification module. The shift itself is derived from a similarity measure. The shift differs depending on whether the signal is to be shortened or lengthened.
The time-scale modification algorithm shifts a portion of samples to a position that yields synchronization. The original signal is cross-faded with the shifted portion using overlap-add, followed by the remaining samples that follow the shifted portion. The level of time-scaling is signal-dependent.
To yield synchronization, the shifted portion and the original frame are cross-correlated using a complexity-reduced normalized cross-correlation function. The original frame size as fixed by the EVS decoder is 20 ms, i.e. 640 samples for SWB. The size of the template is 10 ms, i.e. 320 samples for SWB. The general parameters of the time-scale modification are listed in Table 1.
Table 1: Time-scale Modification Parameters
| | 8 kHz | 16 kHz | 32 kHz | 48 kHz |
|---|---|---|---|---|
| frame size [samples] | 160 | 320 | 640 | 960 |
| segment size [samples] | 80 | 160 | 320 | 480 |
| minimum pitch | 20 | 40 | 80 | 120 |
| search length | 100 | 200 | 400 | 600 |
| maximum pitch | 120 | 240 | 480 | 720 |
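The entries of Table 1 scale linearly with the sampling rate, so each row corresponds to a fixed duration (20 ms, 10 ms, 2,5 ms, 12,5 ms and 15 ms). The following minimal sketch reproduces the table from the sampling rate; `TsmParams` and `tsm_params` are hypothetical names chosen for illustration and are not defined by the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TsmParams:
    """Time-scale modification parameters of Table 1, all in samples."""
    frame_size: int
    segment_size: int
    min_pitch: int
    search_length: int
    max_pitch: int


def tsm_params(sample_rate_hz: int) -> TsmParams:
    """Derive the Table 1 row values for one of the four listed sampling rates."""
    if sample_rate_hz not in (8000, 16000, 32000, 48000):
        raise ValueError("Table 1 only lists 8, 16, 32 and 48 kHz")
    unit = sample_rate_hz // 400  # number of samples in 2.5 ms
    return TsmParams(
        frame_size=8 * unit,      # 20 ms
        segment_size=4 * unit,    # 10 ms
        min_pitch=1 * unit,       # 2.5 ms
        search_length=5 * unit,   # 12.5 ms
        max_pitch=6 * unit,       # 15 ms
    )
```

For example, `tsm_params(32000)` yields the SWB column of Table 1 (640, 320, 80, 400, 480).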
5.4.3.2 Time-shrinking
Time-shrinking reduces the length of an input frame. The output frame length will differ from the input frame length by 2,5-10 ms, i.e. the resulting frame duration will be in the range 10-17,5 ms. The amount of time-scaling depends on the position of the best matching candidate segment, i.e. the segment that yields the highest similarity to the first segment of the input signal. The start position and end position used to search for the candidate segment inside the input frame for time-shrinking are listed in Table 2.
Table 2: Time-shrinking Parameters
| | 8 kHz | 16 kHz | 32 kHz | 48 kHz |
|---|---|---|---|---|
| start search position [samples] | 20 | 40 | 80 | 120 |
| end search position [samples] | 80 | 160 | 320 | 480 |
The output frame is the result of a cross-fade of the first segment of the input frame and the best matching candidate segment, which is shifted by the position found in the search. Samples following the candidate segment are appended to the merged signal to yield continuity with following frames.
Figure 5: Shortening of an input frame
Note that the time-shrinking algorithm does not need any look-ahead samples and therefore does not introduce extra delay. However, an extra buffer is used to generate output frames with a fixed frame length from the variable-length output frames (clause 5.5).
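The frame assembly for time-shrinking can be sketched as follows. This is a minimal illustration under stated assumptions, not the normative procedure: `cross_fade` and `shrink_frame` are hypothetical names, the shift is assumed to have been determined already by the similarity search, and the raised-cosine weights are one assumed form of the Hann-shaped fade described in clause 5.4.3.7.

```python
import math


def cross_fade(fade_out, fade_in):
    """Merge two equally long segments using complementary raised-cosine weights."""
    n_seg = len(fade_out)
    merged = []
    for n in range(n_seg):
        # Rising half of a Hann window (assumed form); the complement fades out.
        w_in = 0.5 * (1.0 - math.cos(math.pi * n / n_seg))
        merged.append((1.0 - w_in) * fade_out[n] + w_in * fade_in[n])
    return merged


def shrink_frame(frame, shift, segment_size):
    """Shorten `frame` by `shift` samples (shift > 0, taken from the Table 2 range)."""
    first = frame[:segment_size]
    candidate = frame[shift:shift + segment_size]
    if shift <= 0 or len(candidate) != segment_size:
        raise ValueError("candidate segment must lie inside the input frame")
    # Cross-faded segment, followed by the samples after the candidate segment.
    return cross_fade(first, candidate) + list(frame[shift + segment_size:])
```

With a shift of 40 samples at 8 kHz, a 160-sample frame becomes 120 samples (15 ms), which lies inside the 10-17,5 ms range stated above.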
5.4.3.3 Time-stretching
Time-stretching increases the length of an input frame. The output frame length will differ from the input frame length by 2,5-15 ms, i.e. the resulting frame duration will be in the range 22,5-35 ms. The amount of time-scaling depends on the position of the best matching candidate segment, i.e. the segment that yields the highest similarity to the first segment of the input signal. The start position and end position used to search for the candidate segment for time-stretching are listed in Table 3.
The preceding input frame is taken to search for positions with similarity to the first segment of the current input frame.
Table 3: Time-stretching Parameters
| | 8 kHz | 16 kHz | 32 kHz | 48 kHz |
|---|---|---|---|---|
| start search position [samples] | -120 | -240 | -480 | -720 |
| end search position [samples] | -20 | -40 | -80 | -120 |
The output frame is the result of a cross-fade of the first segment of the input frame and the best matching candidate segment, which is shifted by the position found in the search. Note that this shift is bounded by the start and end search positions of Table 3 and is therefore a negative number. Samples following the candidate segment are appended to the merged signal to yield continuity with following frames.
Figure 6: Lengthening of an input frame
Note that the time-stretching algorithm does not need any look-ahead samples but requires the previous frame to be available. Therefore no extra delay is introduced. However, an extra buffer is used to generate output frames with a fixed frame length from the variable-length output frames (clause 5.5).
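The role of the previous frame in time-stretching can be illustrated with a short sketch. It assumes that the previous and current frames are concatenated into one buffer and that the (negative) shift has already been found; `stretch_frame` is a hypothetical name, and the raised-cosine weights are an assumed form of the fade described in clause 5.4.3.7.

```python
import math


def stretch_frame(prev_frame, frame, shift, segment_size):
    """Lengthen `frame` by -shift samples (shift < 0, taken from the Table 3 range)."""
    if shift >= 0 or -shift > len(prev_frame):
        raise ValueError("shift must be negative and reach no further than the previous frame")
    buf = list(prev_frame) + list(frame)
    start = len(prev_frame)   # first sample of the current frame
    cand = start + shift      # first sample of the candidate segment
    out = []
    for n in range(segment_size):
        # Fade out the first segment, fade in the candidate segment.
        w_in = 0.5 * (1.0 - math.cos(math.pi * n / segment_size))
        out.append((1.0 - w_in) * buf[start + n] + w_in * buf[cand + n])
    # Append everything after the candidate segment up to the most recent sample.
    out.extend(buf[cand + segment_size:])
    return out
```

With a shift of -40 samples at 8 kHz, a 160-sample frame becomes 200 samples (25 ms), which lies inside the 22,5-35 ms range stated above.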
5.4.3.4 Energy Estimation
Low-level signals are detected by analysing 1 ms subsegments of the input signal, including the previous frame. If the energies of all subsegments to be merged are below a threshold, then the frame is scaled to the maximum extent:
– a 20 ms frame is shortened to 10 ms by setting the shift to the end search position for time-shrinking according to Table 2;
– a 20 ms frame is extended to 35 ms by setting the shift to the start search position for time-stretching according to Table 3.
The generation of the output frame is performed by a cross-fade of the first segment of the input frame and the segment shifted by this amount, followed by the remaining samples. Note that for low-level signals the output frame is generated without similarity estimation or quality estimation.
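The low-level test can be sketched as below. The energy definition (mean square per 1 ms subsegment) and the threshold value are assumptions made for illustration only; the threshold is passed in as a parameter because its value is not reproduced here. `subsegment_energies` and `is_low_level` are hypothetical names.

```python
def subsegment_energies(samples, sample_rate_hz):
    """Mean-square energy of each complete 1 ms subsegment (assumed energy definition)."""
    n_sub = sample_rate_hz // 1000  # samples per 1 ms
    return [
        sum(s * s for s in samples[i:i + n_sub]) / n_sub
        for i in range(0, len(samples) - n_sub + 1, n_sub)
    ]


def is_low_level(samples, sample_rate_hz, threshold):
    """True if every 1 ms subsegment energy is below `threshold` (caller-supplied value)."""
    energies = subsegment_energies(samples, sample_rate_hz)
    return bool(energies) and all(e < threshold for e in energies)
```

When `is_low_level` returns true for the samples to be merged, the shift would be set directly to the extreme search position, skipping the similarity and quality estimation.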
5.4.3.5 Similarity Measurement
To estimate similarity, a non-normalized cross-correlation of the template segment with other positions of the template in the signal is maximized. In order to limit the computational complexity, a subsampled signal is used to estimate the correlation value, i.e. only a subset of the samples, taken at a fixed subsampling interval, is used. The subsampling parameters are set as listed in Table 4.
Table 4: Similarity Measurement Complexity Parameters
| | 8 kHz | 16 kHz | 32 kHz | 48 kHz |
|---|---|---|---|---|
| signal subsampling | 1 | 2 | 4 | 6 |
| correlation subsampling | 1 | 1 | 2 | 3 |
(13)
A hierarchical search for the best match is performed that initially evaluates only every m-th offset, i.e. the search for the maximum correlation is performed on a subsampled set of correlation values, with m as listed in Table 4 (correlation subsampling). The offset with the highest correlation is then used as the starting point for a hierarchical search in which the offset step and the search range are reduced after each step, narrowing the search around the current best offset. This hierarchical search is repeated until the offset resolution cannot be refined further.
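The two complexity reductions described in this clause, signal subsampling and a coarse-to-fine offset search, can be sketched as follows. The refinement rule used here (halving the offset step and testing the neighbours of the current best offset until the step is one) is an assumption for illustration and may differ from the normative rule; `subsampled_correlation` and `hierarchical_search` are hypothetical names.

```python
import math


def subsampled_correlation(signal, template_start, offset, segment_size, signal_step):
    """Non-normalized cross-correlation using only every `signal_step`-th sample."""
    return sum(
        signal[template_start + n] * signal[template_start + offset + n]
        for n in range(0, segment_size, signal_step)
    )


def hierarchical_search(corr, start, end, step):
    """Find the offset in [start, end] with the highest corr(offset), coarse to fine."""
    # Coarse pass: evaluate only every `step`-th offset.
    best = max(range(start, end + 1, step), key=corr)
    # Refinement (assumed rule): halve the step and test the neighbours of the best offset.
    while step > 1:
        step //= 2
        candidates = [o for o in (best - step, best, best + step) if start <= o <= end]
        best = max(candidates, key=corr)
    return best


# Usage: a sine with a 40-sample period, searched over offsets 20..60.
signal = [math.sin(2 * math.pi * n / 40) for n in range(160)]
best = hierarchical_search(
    lambda offset: subsampled_correlation(signal, 0, offset, 80, 1), 20, 60, 2
)
```

In this usage example the search settles on the 40-sample period of the test signal. For time-stretching the same functions would be called with negative offsets and a template start inside a buffer that also holds the previous frame.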
5.4.3.6 Quality Control
After similarity estimation, a quality measure is calculated from the normalized cross-correlation in (14).
(14)
The correlation in (14) is evaluated for up to four different argument values. The quality measure is then the sum of the products of the correlations, as given in (15):
(15)
In the event that one of the calculations needs to access samples that are not available in the current or the previous frame, the affected correlation values are replaced using the following scheme:
(16)
(17)
(18)
The quality measure is then used in the main scaling operations (clauses 5.4.3.2 and 5.4.3.3) to decide whether the scaling operation is performed or whether the scaling is deferred to a subsequent frame.
Positive values indicate periodicity, whereas negative values indicate that the scaling could produce a perceptually distorted signal. The threshold for frame scaling starts from a fixed initial value, but it is dynamically adapted depending on the number of successive scaled or non-scaled frames. For each scaled frame the threshold is increased by 0,2 to avoid too many subsequent scaling operations. For each non-scaled frame the threshold is lowered by 0,1 to enable time-scaling also for less periodic signals.
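The threshold adaptation can be sketched as a small state machine. The step sizes of 0,2 and 0,1 are taken from the text above; the initial threshold is left as a constructor argument because its value is not reproduced here, and comparing with "greater than or equal" is an assumption. `QualityThreshold` is a hypothetical name.

```python
class QualityThreshold:
    """Adaptive threshold deciding whether a frame is scaled or the scaling is deferred."""

    def __init__(self, initial):
        # `initial` is supplied by the caller; its normative value is not assumed here.
        self.value = initial

    def decide(self, quality):
        """Return True if the frame is scaled, and adapt the threshold afterwards."""
        scaled = quality >= self.value  # assumed comparison direction
        if scaled:
            self.value += 0.2  # make further consecutive scaling less likely
        else:
            self.value -= 0.1  # allow scaling of less periodic signals later
        return scaled
```

A signal of constant moderate quality is therefore scaled only in some frames: each refusal lowers the threshold until one frame is accepted, after which the threshold jumps up again.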
For multi-channel operation, the channel with the highest sum of energies is used to determine the quality of the scaling operation for all channels. To identify this channel, the denominator from Formula (14) is summed up per channel during quality estimation.
5.4.3.7 Overlap-add
Finally, if the time-scaling is performed, the output signal is constructed by an overlap-add of the first segment of the input signal with the shifted input signal. Note that the shift is negative for expansion of the signal.
,
(19)
The overlap-add is performed using a cos-shaped Hann window. Note that the second half of the window is not used.
(20)
The remaining samples of the shifted signal up to the most recent sample are then appended. The total length of the output signal is the input frame length minus the shift, the shift being negative for expansion.
,
(21)
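The remark that the second half of the Hann window is not used can be read as follows: only the rising half of the window serves as fade-in weight, and its complement serves as fade-out weight, so the two weights always sum to one. The sketch below illustrates this reading; the exact window definition (length and sample offset) is an assumption, and `hann_fade_weights` is a hypothetical name.

```python
import math


def hann_fade_weights(segment_size):
    """Return (fade_in, fade_out) weights from the rising half of a Hann window."""
    # Rising half of a Hann window of length 2 * segment_size (assumed form).
    fade_in = [
        0.5 * (1.0 - math.cos(math.pi * n / segment_size))
        for n in range(segment_size)
    ]
    # The fade-out weights are the complement, so each pair sums to one.
    fade_out = [1.0 - w for w in fade_in]
    return fade_in, fade_out
```

Because the weights are complementary, cross-fading two identical segments returns the segment unchanged, which is why well-synchronized periodic signals pass through the overlap-add without audible discontinuities.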