8 Jitter buffer management in MTSI clients in terminals

3GPP TS 26.114, Release 18: IP Multimedia Subsystem (IMS); Multimedia telephony; Media handling and interaction


8.1 General

This clause specifies mechanisms to handle delay jitter in MTSI clients in terminals.

8.2 Speech

8.2.1 Terminology

In the following paragraph(s), Jitter Buffer Management (JBM) denotes the actual buffer as well as any control, adaptation and media processing algorithm (excluding speech decoder) used in the management of the jitter induced in the transport channel. An illustration of an exemplary structure of an MTSI speech receiver with adaptive jitter buffer is shown in figure 8.1 to clarify the terminology and the relation between different functional components.

Figure 8.1: Example structure of an MTSI speech receiver

The blocks "network analyzer" and "adaptation control logic" together with the information on buffer status form the actual buffer control functionality, whereas "speech decoder" and "adaptation unit" provide the media processing functionality. Note that the external playback device control driving the media processing is not shown in figure 8.1.

The grey dashed lines indicate the measurement points for the jitter buffer delay, i.e. the difference between the decoder consumption time and the arrival time of the speech frame to the JBM.

The functional processing blocks are as follows:

– Buffer: The jitter buffer unpacks the incoming RTP payloads and stores the received speech frames. The buffer status may be used as input to the adaptation decision logic. Furthermore, the buffer is also linked to the speech decoder to provide frames when the decoder requests them.

– Network analyser: The network analysis functionality is used to monitor the incoming packet stream and to collect reception statistics (e.g. jitter, packet loss) that are needed for jitter buffer adaptation. Note that this block can also include e.g. the functionality needed to maintain statistics required by the RTCP if it is being used.

– Adaptation control logic: The control logic, which adjusts the playback delay and operates the adaptation functionality, decides on buffering delay adjustments and required media adaptation actions based on the buffer status (e.g. average buffering delay, buffer occupancy, etc.) and input from the network analyser. Furthermore, external control input, including RTCP from the sender, can be used e.g. to enable inter-media synchronisation, to adapt the jitter buffer, or to serve other external scaling requests. The control logic may utilize different adaptation strategies such as a fixed jitter buffer (without adaptation and time scaling), simple adaptation during comfort noise periods, or buffer adaptation also during active speech. The general operation is controlled by the desired proportion of frames arriving late, the adaptation strategy and the adaptation rate.

– Speech decoder: The standard AMR, AMR-WB or EVS speech decoder. Note that the speech decoder is also assumed to include error concealment / bad frame handling functionality. The speech decoder may be used with or without the adaptation unit.

– Adaptation unit: The adaptation unit shortens or extends the output signal length according to requests given by the adaptation control logic to enable buffer delay adjustment in a transparent manner. The adaptation is performed using frame-based or sample-based time scaling on the decoder output signal, either during comfort noise periods only or during both active speech and comfort noise. The buffer control logic should have a mechanism to limit the maximum scaling ratio. Providing a scaling window in which the targeted time scale modifications are performed improves the situation in certain scenarios – e.g. when reacting to clock drift or to a request for inter-media (re)synchronization – by allowing flexibility in allocating the scaling request over several frames and performing the scaling in a content-aware manner. The adaptation unit may be implemented either in a separate entity from the speech decoder or embedded within the decoder.
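The interplay between a scaling window and a maximum scaling ratio can be illustrated with a minimal sketch. It is not taken from the present document: the function name, the 20 ms frame length default and the 0.25 ratio limit are assumptions chosen for illustration, and the request is spread evenly rather than in a content-aware manner.

```python
def allocate_scaling(request_ms, window_frames, frame_ms=20.0, max_ratio=0.25):
    """Spread a signed time-scaling request (ms) evenly over a window of frames.

    Each frame is stretched (positive) or compressed (negative) by at most
    max_ratio * frame_ms. Returns the per-frame plan and the part of the
    request that could not be served inside the window.
    All parameter names and defaults are hypothetical.
    """
    if window_frames <= 0:
        raise ValueError("window_frames must be positive")
    cap = max_ratio * frame_ms                 # largest change allowed per frame
    share = request_ms / window_frames         # even split of the request
    per_frame = max(-cap, min(cap, share))     # clamp to the ratio limit
    leftover = request_ms - per_frame * window_frames
    return [per_frame] * window_frames, leftover


# A 40 ms extension fits into a 10-frame window (4 ms per frame).
plan, leftover = allocate_scaling(40.0, 10)

# A 40 ms reduction does not fit into 4 frames; half of it is left over.
short_plan, short_leftover = allocate_scaling(-40.0, 4)
```

A wider window lowers the per-frame change, which is the flexibility referred to above; a real adaptation unit would additionally weight the allocation by signal content.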

8.2.2 Functional requirements for jitter-buffer management

The functional requirements for the speech JBM guarantee appropriate management of jitter and shall be the same for all speech JBM implementations used in MTSI clients in terminals. A JBM implementation used in MTSI shall support the following requirements, but is not limited in functionality to these requirements. They are to be seen as a minimum set of functional requirements supported by every speech JBM used in MTSI.

Speech JBM used in MTSI shall:

– support source-controlled rate operation as well as non-source-controlled rate operation;

– be able to receive the de-packetized frames out of order and present them in order for decoder consumption;

– be able to receive duplicate speech frames and only present unique speech frames for decoder consumption;

– be able to handle clock drift between the encoding and decoding end-points.
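The reordering and duplicate-removal requirements above can be sketched as follows. This is a toy illustration, not a conformant JBM: the class name and methods are hypothetical, frames are keyed by RTP timestamp, and timestamp wrap-around, late-frame handling and clock drift are deliberately ignored.

```python
import heapq


class FrameBuffer:
    """Toy buffer that reorders frames and drops duplicates (hypothetical API)."""

    def __init__(self):
        self._heap = []      # min-heap of (timestamp, payload)
        self._seen = set()   # timestamps accepted so far (unbounded in this toy)

    def push(self, timestamp, payload):
        """Store a frame; return False if a frame with this timestamp was already seen."""
        if timestamp in self._seen:
            return False
        self._seen.add(timestamp)
        heapq.heappush(self._heap, (timestamp, payload))
        return True

    def pop(self):
        """Return the oldest buffered (timestamp, payload), or None when empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)


buf = FrameBuffer()
# Frames arrive out of order, and the frame with timestamp 160 arrives twice.
for ts, data in [(320, b"c"), (0, b"a"), (160, b"b"), (160, b"b"), (480, b"d")]:
    buf.push(ts, data)

ordered = []
while (frame := buf.pop()) is not None:
    ordered.append(frame[0])
```

After the loop, `ordered` holds each timestamp once, in increasing order, which is the order in which the decoder would consume the frames.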

8.2.3 Minimum performance requirements for jitter-buffer management

8.2.3.1 General

An MTSI client in terminal supporting speech shall use a JBM fulfilling the minimum performance requirements defined in this clause. The JBM specified in [128] fulfils these minimum performance requirements and should be used for EVS. The EVS JBM may also be used for other codecs.

The jitter buffering time is the time spent by a speech frame in the JBM. It is measured as the difference between the decoding start time and the arrival time of the speech frame to the JBM. The frames that are discarded by the JBM are not counted in the measure.
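As a minimal sketch of this measure, assume each frame is logged as a pair of arrival time and decoding start time in milliseconds, with the decoding start time set to `None` for frames that the JBM discarded. The record layout and function name are assumptions for illustration.

```python
def buffering_times_ms(frames):
    """Return the jitter buffering time of every frame that reached the decoder.

    frames: iterable of (arrival_ms, decode_start_ms); decode_start_ms is None
    for frames discarded by the JBM, which are excluded from the measure.
    """
    return [decode - arrival for arrival, decode in frames if decode is not None]


# Third frame was discarded by the JBM and is therefore not counted.
frames = [(0.0, 40.0), (21.0, 60.0), (45.0, None), (62.0, 100.0)]
times = buffering_times_ms(frames)
```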

The minimum performance requirements consist of objective criteria for delay and jitter-induced concealment operations. In order for a JBM implementation to pass the minimum performance requirements all objective criteria shall be met.

A JBM implementation used in MTSI shall comply with the following design guidelines:

1. The overall design of the JBM shall be to minimize the buffering time at all times while still conforming to the minimum performance requirements of jitter induced concealment operations and the design guidelines for sample-based timescaling (as set in bullet point 3);

2. If the limit on jitter induced concealment operations cannot be met, increasing the buffering time is always preferred, so that the jitter induced concealment operations do not exceed the stated limit. This guideline applies even if it means that the end-to-end delay requirement given in TS 22.105 [34] can no longer be met;

3. If sample-based time scaling is used (after speech decoder), then artefacts caused by time scaling operation shall be kept to a minimum. Time scaling means the modification of the signal by stretching and/or compressing it over the time axis. The following guidelines on time scaling apply:

– Use of a high-quality time scaling algorithm is recommended;

– The amount of scaling should be as low as possible;

– Scaling should be applied as infrequently as possible;

– Oscillating behaviour is not allowed.

NOTE: If the end-to-end delay for the ongoing session is known to the MTSI client in terminal and measured to be less than 150 ms (as defined in TS 22.105 [34]), the JBM may relax its buffering time minimization criteria in favour of reduced JBM adaptation artefacts if such a relaxation will improve the media quality. Note that a relaxation is not allowed when testing for compliance with the minimum performance requirements specified in clauses 8.2.3.2.2 and 8.2.3.2.3.

8.2.3.2 Objective performance requirements

8.2.3.2.1 General

The objective performance requirements consist of criteria for delay, time scaling and jitter-induced concealment operations.

The objective minimum performance requirements are divided into three parts:

1. Limiting the jitter buffering time to provide as low end-to-end delay as possible.

2. Limiting the jitter induced concealment operations, i.e. setting limits on the allowed induced losses in the jitter buffer due to late losses, re-bufferings, and buffer overflows.

3. Limiting the use of time scaling to adapt the buffering depth in order to avoid introducing time scaling artefacts on the speech media.

In order to fulfil the objective performance requirements, the JBM under test needs to pass the respective criteria using the six channels as defined in clause 8.2.3.3. Note that in order to pass the criteria for a specific channel, all three requirements must be fulfilled.

8.2.3.2.2 Jitter buffer delay criteria

The reference delay computation algorithm in Annex D defines the performance requirements for the set of delay and error profiles described in clause 8.2.3.3. The JBM algorithm under test shall meet these performance requirements. The performance requirements shall be a threshold for the Cumulative Distribution Function (CDF) of the speech-frame delay introduced by the reference delay computation algorithm. A CDF threshold is set by shifting the reference delay computation algorithm CDF by 60 ms. The speech-frame delay CDF is defined as:

P(x) = Probability (delay_compensation_by_JBM ≤ x)

The relation between the reference delay computation algorithm and the CDF threshold is outlined in figure 8.2.

Figure 8.2: Example showing the relation between the reference delay algorithm and the CDF threshold – the delay and error profile 4 in table 8.1 has been used

The JBM algorithm under test shall achieve a delay lower than or equal to that set by the CDF threshold for at least 90 % of the speech frames. The values for the CDF shall be collected for the full length of each delay and error profile. The delay measure in the criteria is measured as the time each speech frame spends in the JBM; i.e. the difference between the decoder consumption time and the arrival time of the speech frame to the JBM.
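One possible reading of this criterion can be sketched as a quantile comparison: for probabilities up to 0.90, the delay of the JBM under test at a given CDF value must not exceed the reference delay at the same CDF value plus 60 ms. This interpretation, the function names and the sampling of the CDF in 90 steps are assumptions; the reference delays themselves would come from the algorithm in Annex D, which is not reproduced here.

```python
import math


def quantile(sorted_values, p):
    """Smallest x in sorted_values whose empirical CDF P(x) is at least p."""
    idx = max(0, math.ceil(p * len(sorted_values)) - 1)
    return sorted_values[idx]


def meets_delay_criterion(test_delays_ms, ref_delays_ms,
                          shift_ms=60.0, coverage=0.90, steps=90):
    """Check the test CDF against the reference CDF shifted by shift_ms.

    Only probabilities up to `coverage` are checked, so the slowest 10 % of
    frames may exceed the threshold (assumed reading of the 90 % rule).
    """
    test = sorted(test_delays_ms)
    ref = sorted(ref_delays_ms)
    for k in range(1, steps + 1):
        p = coverage * k / steps
        if quantile(test, p) > quantile(ref, p) + shift_ms:
            return False
    return True


ref = [float(d) for d in range(20, 120)]            # made-up reference delays
ok = meets_delay_criterion([d + 50 for d in ref], ref)       # within 60 ms
bad = meets_delay_criterion([d + 70 for d in ref], ref)      # 70 ms too slow
tail = meets_delay_criterion(
    [d + 50 for d in ref[:95]] + [1000.0] * 5, ref)          # 5 % outliers
```

Under this reading `ok` and `tail` pass while `bad` fails, because a few very late frames fall into the 10 % that the criterion does not constrain.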

The parameter settings for the reference delay computation algorithm are:

– adaptation_lookback = 200;

– delay_delta_max = 20;

– target_loss = 0.5.

8.2.3.2.3 Jitter induced concealment operations

The jitter induced concealment operations include:

– JBM induced removal of a speech frame, i.e. buffer overflow or intentional frame dropping when reducing the buffer depth during adaptation.

– Deletion of a speech frame because it arrived at the JBM too late.

– Modification of the output timeline due to link loss.

– Jitter-induced insertion of a speech frame controlled by the JBM (e.g. buffer underflow).

Link losses handled as error concealment and not changing the output timeline shall not be counted in the jitter induced concealment operations.

Jitter loss rate = JBM triggered concealed frames / Number of transmitted frames

The jitter loss rate shall be calculated for active speech frames only.

NOTE: SID_FIRST and SID_UPDATE frames belong to the non-active speech period, hence concealment for losses of such frames should not be included in the statistics.

The jitter loss rate shall be below 1 % for every channel measured over the full length of the respective channel. The value of 1 % was chosen because such a loss rate will usually not significantly reduce the speech quality.
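The computation can be sketched as below, assuming the JBM under test logs one event per concealment operation together with a flag telling whether the affected frame belongs to active speech. The event labels and the log format are hypothetical; link losses that do not change the output timeline are logged here as "link_loss" and are not counted.

```python
# Hypothetical event labels for the jitter induced concealment operations.
COUNTED = {"jbm_drop", "late_drop", "timeline_change", "jbm_insert"}


def jitter_loss_rate(events, transmitted_active_frames):
    """Ratio of JBM triggered concealed frames to transmitted frames.

    events: iterable of (kind, is_active_speech). Only active speech frames
    and only the kinds listed in COUNTED contribute to the numerator.
    """
    if transmitted_active_frames <= 0:
        raise ValueError("transmitted_active_frames must be positive")
    concealed = sum(1 for kind, active in events if active and kind in COUNTED)
    return concealed / transmitted_active_frames


events = (
    [("late_drop", True)] * 30      # arrived too late, active speech
    + [("jbm_insert", True)] * 20   # buffer underflow, active speech
    + [("link_loss", True)] * 100   # plain link loss: not counted
    + [("late_drop", False)] * 10   # non-active period: not counted
)
rate = jitter_loss_rate(events, 7500)
passes = rate < 0.01                # requirement: below 1 %
```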

8.2.3.3 Delay and error profiles

Six different delay and error profiles are used to check the tested JBM for compliance with the minimum performance requirements. The profiles span a large range of operating conditions in which the JBM shall provide sufficient performance for the MTSI service. All profiles are 7500 IP packets long.

Table 8.1: Delay and error profile overview – The channels are attached electronically

Profile	Characteristics	Packet loss rate (%)	Filename
1	Low-amplitude, static jitter characteristics, 1 frame/packet	0	dly_error_profile_1.dat
2	High-amplitude, semi-static jitter characteristics, 1 frame/packet	0.24	dly_error_profile_2.dat
3	Low/high/low amplitude, changing jitter, 1 frame/packet	0.51	dly_error_profile_3.dat
4	Low/high/low/high, changing jitter, 1 frame/packet	2.4	dly_error_profile_4.dat
5	Moderate jitter with occasional delay spikes, 2 frames/packet (7 500 IP packets, 15 000 speech frames)	5.9	dly_error_profile_5.dat
6	Moderate jitter with severe delay spikes, 1 frame/packet	0.1	dly_error_profile_6.dat

The attached profiles in the zip-archive "delay_and_error_profiles.zip" are formatted as raw text files with one delay entry per line. The delay entries are written in milliseconds and packet losses are entered as "-1". Note that when testing for compliance, the starting point in the delay and error profile shall be randomized.
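Reading such a profile can be sketched as follows. The sketch assumes that the entries parse as decimal numbers and that randomizing the starting point means starting at a random entry and wrapping around to the beginning of the profile; neither detail is stated above, and the function names are hypothetical.

```python
import random


def parse_profile(text):
    """Parse one delay entry (ms) per line; None marks a lost packet ("-1")."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        value = float(line)
        entries.append(None if value == -1 else value)
    return entries


def randomized_start(entries, rng):
    """Rotate the profile to a random starting point (wrap-around is assumed)."""
    start = rng.randrange(len(entries))
    return entries[start:] + entries[:start]


def loss_rate_percent(entries):
    """Packet loss rate of the profile in percent."""
    return 100.0 * sum(e is None for e in entries) / len(entries)


sample = "20\n35\n-1\n22\n"          # made-up excerpt, not real profile data
profile = parse_profile(sample)
rotated = randomized_start(profile, random.Random(0))
loss = loss_rate_percent(profile)
```

With a real file, `parse_profile(open("dly_error_profile_1.dat").read())` would be the equivalent call, and the computed loss rate could be compared against the value listed in table 8.1.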

8.2.3.4 Speech material for JBM minimum performance evaluation

The files described in table 8.2 and attached to the present document in the zip-archive "JBM_evaluation_files.zip" shall be used for evaluation of a JBM against the minimum performance requirements. The data is stored as RTP packets, formatted according to "RTP dump" format [41]. The input to these files is AMR or AMR-WB encoded frames, encapsulated into RTP packets using the octet-aligned mode of the AMR RTP payload format [28].

Table 8.2: Input files for JBM performance evaluation – The files are attached electronically

Codec	Frames per RTP packet	Filename
AMR (12.2 kbps)	1	test_amr122_fpp1.rtp
AMR (12.2 kbps)	2	test_amr122_fpp2.rtp
AMR-WB (12.65 kbps)	1	test_amrwb1265_fpp1.rtp
AMR-WB (12.65 kbps)	2	test_amrwb1265_fpp2.rtp

8.3 Video

Video receivers should implement an adaptive video de-jitter buffer. The overall design of the buffer should aim to minimize delay, maintain synchronization with speech, and minimize dropping of late packets. The exact implementation is left to the implementer.

8.4 Text

Conversational quality of real-time text is experienced as being good, even with up to one second end-to-end text delay. Strict jitter buffer management is therefore not needed for text. Basic jitter buffer management for text is described in section 5 of RFC 4103 [31], which describes how to calculate the time allowed before an additionally delayed text packet may be regarded as lost.