5.4.3 Signal-based adaptation

3GPP TS 26.448, Codec for Enhanced Voice Services (EVS); Jitter Buffer Management (Release 17)

5.4.3.1 General

To alter the playout delay, the output signal of the decoder is time-warped. The time-warping is performed by the Time Scale Modification module, which either generates additional samples to increase the playout delay or removes samples from the output signal to reduce the playout delay.

Figure 4: Signal-based adaptation based on signal characteristics

A SOLA (synchronized overlap-add) algorithm with built-in quality control is used for adaptation by performing time-scale modification of the signal without altering the pitch. The level of adaptation is signal-dependent, as also outlined in Figure 4:

– signals that are classified by the quality-control as "introducing severe distortions when scaled" are never scaled;

– low-level signals, which are close to silent, are time-scaled without synchronization to the maximum possible extent;

– signals that are classified as "time-scalable", e.g. periodic signals, are scaled by a synchronized overlap-add of a shifted representation of the input signal with the input signal to the Time Scale Modification module. The shift itself is derived from a similarity measure. The shift differs depending on whether the signal is to be shortened or lengthened.

The time-scale modification algorithm shifts a portion of samples to a position that yields synchronization. The original signal is cross-faded with the shifted portion using overlap-add, followed by the remaining samples that follow the shifted portion. The level of time-scaling is signal-dependent.

To yield synchronization, the shifted portion and the original frame are cross-correlated using a complexity-reduced normalized cross-correlation function. The original frame size as fixed by the EVS decoder is 20 ms, i.e. 640 samples for SWB. The size of the template is 10 ms, i.e. 320 samples for SWB (the segment size in Table 1). The general parameters of the time-scale modification are listed in Table 1.

Table 1: Time-scale Modification Parameters

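| Parameter | 8 kHz | 16 kHz | 32 kHz | 48 kHz |
|---|---|---|---|---|
| frame size [samples] | 160 | 320 | 640 | 960 |
| segment size [samples] | 80 | 160 | 320 | 480 |
| minimum pitch [samples] | 20 | 40 | 80 | 120 |
| search length [samples] | 100 | 200 | 400 | 600 |
| maximum pitch [samples] | 120 | 240 | 480 | 720 |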
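The entries of Table 1 scale linearly with the sampling rate. The following Python sketch reproduces them under the assumption, inferred from the tabulated values rather than stated in the text, that the minimum and maximum pitch correspond to 2,5 ms and 15 ms and that the search length is their difference. The function name `tsm_parameters` and the dictionary keys are illustrative only.

```python
def tsm_parameters(sample_rate_hz: int) -> dict:
    """Derive the Table 1 values (all in samples) from the sampling rate."""
    per_ms = sample_rate_hz // 1000      # samples per millisecond
    min_pitch = 5 * per_ms // 2          # assumed 2.5 ms, inferred from Table 1
    max_pitch = 15 * per_ms              # assumed 15 ms, inferred from Table 1
    return {
        "frame_size": 20 * per_ms,       # 20 ms frame fixed by the decoder
        "segment_size": 10 * per_ms,     # 10 ms template segment
        "min_pitch": min_pitch,
        "search_length": max_pitch - min_pitch,
        "max_pitch": max_pitch,
    }
```

For example, `tsm_parameters(32000)` gives a frame size of 640 and a segment size of 320 samples, matching the SWB figures quoted above.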

5.4.3.2 Time-shrinking

Time-shrinking reduces the size $L$ of a frame to $L_{out}$. $L_{out}$ differs from $L$ by 2,5-10 ms, i.e. the resulting frame duration is in the range 10-17,5 ms. The amount of time-scaling depends on the position $s$ of the best matching candidate segment, i.e. the segment that yields the highest similarity to the first segment of the input signal. The start position $s_{start}$ and end position $s_{end}$ used to search for the candidate segment inside the input frame for time-shrinking are listed in Table 2.

Table 2: Time-shrinking Parameters

| Parameter | 8 kHz | 16 kHz | 32 kHz | 48 kHz |
|---|---|---|---|---|
| start search position [samples] | 20 | 40 | 80 | 120 |
| end search position [samples] | 80 | 160 | 320 | 480 |

The output frame is the result of a cross-fade of the first segment of the input frame and the best matching candidate segment, which is shifted by $s$ samples. Samples following the candidate segment are appended to the merged signal to yield continuity with following frames.

Figure 5: Shortening of an input frame

Note that the time-shrinking algorithm does not need any look-ahead samples and therefore does not introduce extra delay. However, an extra buffer is used to generate output frames with a fixed frame length from the variable-length output frames (clause 5.5).
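The duration range quoted for time-shrinking can be checked against Table 2. The sketch below assumes that the output frame length equals the input frame length minus the shift of the selected candidate segment; this relation is not spelled out in this clause but reproduces the 10-17,5 ms range. The names `SHRINK_SEARCH` and `shrink_duration_range_ms` are hypothetical.

```python
# Start and end search positions from Table 2, in samples, keyed by sampling rate.
SHRINK_SEARCH = {8000: (20, 80), 16000: (40, 160), 32000: (80, 320), 48000: (120, 480)}


def shrink_duration_range_ms(sample_rate_hz: int) -> tuple:
    """Return (shortest, longest) output frame duration in ms after shrinking."""
    frame = sample_rate_hz * 20 // 1000            # 20 ms input frame in samples
    start, end = SHRINK_SEARCH[sample_rate_hz]

    def to_ms(samples: int) -> float:
        return samples * 1000 / sample_rate_hz

    # Assumption: output length = frame length - shift of the candidate segment.
    return to_ms(frame - end), to_ms(frame - start)
```

Under that assumption every sampling rate in Table 2 yields the same pair, 10 ms for the largest shift and 17,5 ms for the smallest.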

5.4.3.3 Time-stretching

Time-stretching increases the size $L$ of an input frame $x$ to $L_{out}$ for the output frame $y$. $L_{out}$ differs from $L$ by 2,5-15 ms, i.e. the resulting frame duration is in the range 22,5-35 ms. The amount of time-scaling depends on the position $s$ of the best matching candidate segment, i.e. the segment that yields the highest similarity to the first segment of the input signal. The start position $s_{start}$ and end position $s_{end}$ used to search for the candidate segment for time-stretching are listed in Table 3.

The preceding input frame is taken to search for positions with similarity to the first segment of the current input frame.

Table 3: Time-stretching Parameters

| Parameter | 8 kHz | 16 kHz | 32 kHz | 48 kHz |
|---|---|---|---|---|
| start search position [samples] | -120 | -240 | -480 | -720 |
| end search position [samples] | -20 | -40 | -80 | -120 |

The output frame is the result of a cross-fade of the first segment of the input frame and the best matching candidate segment, which is shifted by $s$ samples. Note that $s$ is bounded by the start and end search positions $s_{start}$ and $s_{end}$ of Table 3 and is therefore a negative number. Samples following the candidate segment are appended to the merged signal to yield continuity with following frames.

Figure 6: Lengthening of an input frame

Note that the time-stretching algorithm does not need any look-ahead samples but requires the previous frame to be available. Therefore no extra delay is introduced. However, an extra buffer is used to generate output frames with a fixed frame length from the variable-length output frames (clause 5.5).
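Because the search positions for time-stretching are negative, candidate segments are addressed relative to the start of the current frame and reach back into the preceding frame. A minimal sketch of that indexing follows, assuming both frames are held as Python lists; the helper `candidate_segment` is hypothetical and not part of the specification.

```python
def candidate_segment(previous, current, shift, segment_size):
    """Return segment_size samples starting `shift` samples from the start of `current`.

    A negative shift reaches back into `previous`.
    """
    history = list(previous) + list(current)
    origin = len(previous)               # index of current[0] inside history
    begin = origin + shift
    if begin < 0 or begin + segment_size > len(history):
        raise ValueError("candidate segment lies outside the available samples")
    return history[begin:begin + segment_size]
```

With the 8 kHz values (160-sample frames, 80-sample segments), a shift of -120 selects a segment that lies entirely in the preceding frame, whereas a shift of -20 selects one that straddles the frame boundary.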

5.4.3.4 Energy Estimation

Low-level signals are detected by analysing 1 ms subsegments of the input signal $x$, including the previous frame. If the energies of all subsegments to be merged are below a threshold, the frame is scaled to the maximum extent:

– a 20 ms frame is shortened to 10 ms by setting the shift $s$ to the end search position $s_{end}$ for time-shrinking according to Table 2;

– a 20 ms frame is extended to 35 ms by setting the shift $s$ to the start search position $s_{start}$ for time-stretching according to Table 3.

The output frame is generated by a cross-fade of the first segment of the input frame and the segment shifted by $s$, followed by the remaining samples. Note that for low-level signals the output frame is generated without similarity estimation or quality estimation.
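The low-level test described above can be sketched as follows. The energy definition (mean square per subsegment, expressed in dB relative to a full scale of 1.0) and the threshold are assumptions made for illustration, since this clause does not reproduce them; `is_low_level` and `threshold_db` are hypothetical names.

```python
import math


def is_low_level(samples, sample_rate_hz, threshold_db):
    """True if every complete 1 ms subsegment stays below threshold_db.

    `samples` are floats in [-1.0, 1.0]; the level is the mean-square energy
    in dB (an assumed definition, not taken from the specification).
    """
    sub = sample_rate_hz // 1000                   # samples per 1 ms subsegment
    if len(samples) < sub:
        raise ValueError("need at least one complete 1 ms subsegment")
    for begin in range(0, len(samples) - sub + 1, sub):
        block = samples[begin:begin + sub]
        energy = sum(v * v for v in block) / sub
        level_db = 10.0 * math.log10(energy) if energy > 0.0 else float("-inf")
        if level_db >= threshold_db:
            return False                           # one loud subsegment is enough
    return True
```

A frame for which this returns true would be scaled to the maximum extent without running the similarity or quality estimation.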

5.4.3.5 Similarity Measurement

To estimate similarity, a non-normalized cross-correlation of the template segment with other positions of the template in the signal is maximized. In order to limit the computational complexity, a subsampled signal is used to estimate the correlation value, i.e. only every $n$-th sample is used, where $n$ is the signal subsampling factor. The subsampling parameters are set as listed in Table 4.

Table 4: Similarity Measurement complexity Parameters

| Parameter | 8 kHz | 16 kHz | 32 kHz | 48 kHz |
|---|---|---|---|---|
| signal subsampling | 1 | 2 | 4 | 6 |
| correlation subsampling | 1 | 1 | 2 | 3 |

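$$r(\tau) = \sum_{i=0}^{L_{seg}/n - 1} x(n\,i)\,x(n\,i + \tau) \qquad (13)$$

where $x$ is the input signal with the template segment starting at index 0, $\tau$ is the candidate offset, $L_{seg}$ is the segment size from Table 1 and $n$ is the signal subsampling factor from Table 4.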

A hierarchical search for the best match is performed that initially evaluates only every $m$-th offset, i.e. the search for the maximum correlation is performed on a subsampled set of correlation values, with $m$ being the correlation subsampling factor listed in Table 4. The offset with the highest correlation is then used as the starting point for a hierarchical search in which the subsampling is reduced and the search range is narrowed around that offset after each step. This hierarchical search is performed until no subsampling remains.
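The coarse-to-fine search described above can be illustrated with the following sketch. The refinement rule used here (halving both subsampling factors and restricting the range to one coarse step either side of the current best offset) is an assumption chosen for illustration, as are the function and parameter names.

```python
def cross_correlation(signal, a, b, length, step):
    """Non-normalized cross-correlation of signal[a:a+length] with
    signal[b:b+length], using only every step-th sample."""
    return sum(signal[a + i] * signal[b + i] for i in range(0, length, step))


def find_best_offset(signal, template_pos, start, end, length,
                     sample_step, offset_step):
    """Coarse-to-fine search for the offset in [start, end] that maximizes
    the correlation with the template at template_pos."""
    lo, hi = start, end
    while True:
        best = max(
            range(lo, hi + 1, offset_step),
            key=lambda off: cross_correlation(
                signal, template_pos, template_pos + off, length, sample_step),
        )
        if offset_step == 1 and sample_step == 1:
            return best                            # full resolution reached
        # Assumed refinement: narrow the range around the best coarse offset
        # and halve both subsampling factors.
        lo = max(start, best - offset_step)
        hi = min(end, best + offset_step)
        offset_step = max(1, offset_step // 2)
        sample_step = max(1, sample_step // 2)
```

For a sinusoid with a period of 40 samples and an 80-sample template, searching offsets 20 to 60 returns 40 both at full resolution and when starting from a subsampled search.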

5.4.3.6 Quality Control

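After similarity estimation a quality measure $q$ is calculated. To calculate $q$, the normalized cross-correlation in (14) is used.

$$c(\tau) = \frac{\sum_{i=0}^{L_{seg}-1} x(i)\,x(i+\tau)}{\sqrt{\sum_{i=0}^{L_{seg}-1} x^2(i)\,\sum_{i=0}^{L_{seg}-1} x^2(i+\tau)}} \qquad (14)$$

where $x$ is the input signal, $\tau$ is the offset and $L_{seg}$ is the segment size from Table 1.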

The normalized cross-correlation $c(\tau)$ of (14) is evaluated with up to four different values of the offset $\tau$. The quality measure $q$ is then the sum of the products of the correlations in (15)

(15)

In the event that one of the calculations needs to access samples that are not available in the current or the previous frame, the affected correlation values are replaced using the following scheme:

(16)

(17)

(18)
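The normalized cross-correlation on which the quality measure is built can be sketched as below. This is the textbook definition evaluated over one segment; returning 0.0 for a segment without energy is an assumption, and the function name is hypothetical.

```python
import math


def normalized_cross_correlation(signal, a, b, length):
    """Cross-correlation of signal[a:a+length] and signal[b:b+length],
    normalized by the square root of the product of both segment energies."""
    cross = energy_a = energy_b = 0.0
    for i in range(length):
        u, v = signal[a + i], signal[b + i]
        cross += u * v
        energy_a += u * u
        energy_b += v * v
    denominator = math.sqrt(energy_a * energy_b)
    # Assumption: a segment without energy is treated as uncorrelated.
    return cross / denominator if denominator > 0.0 else 0.0
```

The result lies between -1 and 1: a periodic signal compared with itself one period later gives a value close to 1, and half a period later a value close to -1.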

The quality measure $q$ is then used in the main scaling operations (clauses 5.4.3.2 and 5.4.3.3) to decide whether the scaling operation is performed or whether the scaling is deferred to a subsequent frame.

Positive values of $q$ indicate periodicity, whereas negative values indicate that the scaling could produce a perceptually distorted signal. The threshold for frame scaling starts from a fixed initial value and is dynamically adapted depending on the number of successive scaled or non-scaled frames. For each scaled frame the threshold is increased by 0,2 to avoid too many subsequent scaling operations. For each non-scaled frame the threshold is lowered by 0,1 to enable time-scaling also for less periodic signals.
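The threshold adaptation can be illustrated with a small state holder. The initial threshold is a constructor argument because its value is not reproduced here, and the comparison "quality at or above threshold means scale" is an assumption; any limits on how far the threshold may drift are likewise not modelled. The class name `ScalingGate` is hypothetical.

```python
class ScalingGate:
    """Decides per frame whether time-scaling is applied and adapts the threshold."""

    def __init__(self, initial_threshold: float):
        self.threshold = initial_threshold

    def decide(self, quality: float) -> bool:
        scale = quality >= self.threshold      # assumed comparison rule
        if scale:
            self.threshold += 0.2              # discourage consecutive scaling
        else:
            self.threshold -= 0.1              # relax for less periodic signals
        return scale
```

Starting from an illustrative threshold of 1.0, a frame with quality 1.5 is scaled and raises the threshold to 1.2; a following frame with quality 1.0 is then deferred and lowers it to 1.1.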

For multi-channel operation the channel with the highest sum of energies is used for determining the quality of the scaling operation for all channels. Therefore the denominator from Formula (14) is summed up to determine the channel of highest sum of energies during quality estimation.
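A sketch of the channel selection follows, assuming each channel is a separate list of samples and that the "sum of energies" is the energy of the template segment plus the energy of the compared segment. That reading of the text, and the name `reference_channel`, are assumptions.

```python
def reference_channel(channels, a, b, length):
    """Index of the channel whose two compared segments carry the most energy."""

    def energy(channel):
        first = sum(v * v for v in channel[a:a + length])
        second = sum(v * v for v in channel[b:b + length])
        return first + second

    return max(range(len(channels)), key=lambda idx: energy(channels[idx]))
```

The quality computed on the selected channel would then govern the scaling decision for all channels, so that they stay time-aligned.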

5.4.3.7 Overlap-add

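Finally, if the time-scaling is performed, the output signal $y(n)$ is constructed by an overlap-add of the first segment of the input signal $x(n)$ with the shifted input signal $x(n+s)$. Note that $s$ is negative for expansion of the signal.

$$y(n) = \bigl(1 - w(n)\bigr)\,x(n) + w(n)\,x(n+s), \quad n = 0, \dots, L_{seg}-1 \qquad (19)$$

The overlap-add is performed using a cos-shaped Hann window of length $2L_{seg}$, where $L_{seg}$ is the segment size from Table 1. Note that the second half of the window is not used.

$$w(n) = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{2L_{seg}}\right) \qquad (20)$$

The remaining samples of the shifted signal up to the most recent sample are then appended. The total length of the output signal is $L - s$, where $L$ is the input frame size.

$$y(n) = x(n+s), \quad n = L_{seg}, \dots, L-s-1 \qquad (21)$$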
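The overlap-add and the appended tail can be put together in a short sketch that handles both shrinking (positive shift) and stretching (negative shift). It assumes the rising half of a Hann window for the fade-in and its complement for the fade-out, and an output length equal to the frame length minus the shift. The function name `time_scale` and its argument layout are hypothetical.

```python
import math


def time_scale(previous, current, shift, segment_size):
    """Cross-fade the first segment of `current` into the segment `shift`
    samples away, then append the samples that follow the shifted segment."""
    history = list(previous) + list(current)
    origin = len(previous)                   # index of current[0] in history
    frame_size = len(current)
    if origin + shift < 0 or shift + segment_size > frame_size:
        raise ValueError("shift addresses samples that are not available")
    out = []
    for n in range(segment_size):
        # Rising half of a Hann window of length 2 * segment_size (assumed).
        fade_in = 0.5 * (1.0 - math.cos(math.pi * n / segment_size))
        out.append((1.0 - fade_in) * history[origin + n]
                   + fade_in * history[origin + n + shift])
    # Remaining samples of the shifted signal up to the most recent sample.
    out.extend(history[origin + segment_size + shift:])
    return out
```

With 160-sample frames and 80-sample segments (the 8 kHz values), a shift of 80 produces 80 output samples and a shift of -120 produces 280, i.e. 10 ms and 35 ms.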