8.13 Jitter buffer management behaviour (handset, headset and electrical interface UE)

3GPP TS 26.132, Release 18: Speech and video telephony terminal acoustic test specification


8.13.0 General

For speech-only with LTE, NR or WLAN access, a jitter buffer is used in the receiving direction to handle the variation in packet arrival timing. The jitter buffer adapts to keep the additional latency it introduces as low as possible while preventing packet losses due to packet delivery timing variations. See clause 8 of TS 26.114 [39] for the definition of the jitter buffer and the minimum performance requirements on jitter buffer management (JBM).

The test method is used to characterize different possible strategies and trade-offs in the design of JBM implementations used in MTSI terminals.

8.13.1 Delay histogram

For this test it shall be ensured that the call is originated from the UE.

NOTE 1: Differences have been observed between UE-originated calls and UE-terminated calls. For better consistency, calls from the UE are used.

The test signal consists of 3 repeats of the Composite Source Signal (CSS) according to ITU-T Recommendation P.501 [22] followed by a speech signal of 160 s. During the first two CSS signals the terminal can adapt its jitter buffer. The third CSS is used for measuring the delay in constant-delay condition, and the speech signal is used for delay and quality measurement in the packet impairment condition.

A constant delay T_c corresponding to the minimum delay of the profile (i.e. the compensation value for the profile) shall be added at the beginning of the different delay/loss profiles, to avoid unnecessary delay jumps between the two measurement phases and to ensure realistic conditions for the second measurement phase.

In receiving direction, the delay between the electrical access point of the test equipment and the reference point (RP), T_TEAP-RP(t) = T_R-jitter(t)+ T_TER, is measured in two successive phases:

1) First the delay in constant-delay condition T_TEAP-DRP-constant is measured as described in steps 1 to 4, clause 8.10.2, 8.10.2a/8.10.2b using the third CSS signal. The constant delay T_c is subtracted from T_TEAP-RP to obtain T_R-constant.

2) Then the delay with packet impairment T_R-jitter(t) is measured continuously for a speech signal during the inclusion of packet delay and loss profiles in the receiving direction RTP voice stream.

The reference point is defined as follows:

1) for handset and headset UE, the reference point is the DRP.

2) for electrical interface UE, the reference point is the input of the electrical reference interface.

Packet impairments shall be applied between the reference client and system simulator eNodeB. Separate calls shall be established for each packet impairment condition.

The start of the delay profiles must be synchronized with the start of the downlink speech material reproduction (compensated by the delay between reproduction and the point of impairment insertion, i.e. the delay of the reference client) in order to ensure a repeatable application of impairments to the test speech signal. Tests shall be performed with DTX enabled in the reference client.

NOTE 2: RTP packet impairments representing packet delay variations and loss are specified in Annex F. Care must be taken that the system simulator uses a dedicated bearer with no buffering/scheduling of packets for transmission.

For the CSS signal repeated 3 times, the pseudo random noise (pn)-part of the CSS has to be longer than the maximum expected delay. It is recommended to use a pn sequence of 32 k samples (with 48 kHz sampling rate). The test signal level is -16 dBm0 measured at the digital reference point or the equivalent analogue point.

For the speech signal, 8 English test sentences according to ITU-T P.501 Annex C.2.3, normalized to an active speech level of -16 dBm0, are used (2 male, 2 female speakers). The sequences are concatenated in such a way that all sentences are centered within a 4.0 s time window, which results in an overall duration of 32.0 s. The sequences are repeated 5 times, resulting in a test file 160.0 s long. The first 2 sentences are used for convergence of the UE jitter buffer manager and are discarded from the analysis. Equivalent implementations of the concatenation by repeating the test sentences in sequence may be used.
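The concatenation described above can be sketched as follows. This is only an illustrative sketch: the function name `build_test_signal` and its parameters are hypothetical, each sentence is assumed to be a list of samples already trimmed and level-normalized, and "centered" is interpreted here as centring the sample array within its window.

```python
def build_test_signal(sentences, fs, window_s=4.0, repeats=5):
    """Centre each sentence in its own window, then repeat the whole block.

    sentences: list of sample lists (assumed already level-normalized).
    fs:        sampling rate in Hz.
    """
    win = int(round(window_s * fs))          # samples per 4.0 s window
    block = []
    for s in sentences:
        if len(s) > win:
            raise ValueError("sentence longer than its time window")
        pad = win - len(s)
        lead = pad // 2                      # silence before the sentence
        block.extend([0.0] * lead)
        block.extend(s)
        block.extend([0.0] * (pad - lead))   # silence after the sentence
    return block * repeats                   # 8 windows x 5 repeats


# Toy usage at a deliberately low sampling rate to keep the lists small.
toy_sentences = [[1.0] * (10 + i) for i in range(8)]
toy_signal = build_test_signal(toy_sentences, fs=10)
```

With 8 sentences, 4.0 s windows and 5 repeats, the result spans 160 s at any sampling rate, which matches the test file duration given above.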

For the delay calculation with the speech signal, a cross-correlation with a rectangular window length of 4 s, centered at each sentence of the stimulus file, is used. The process is repeated for each sample. For each cross-correlation, the maximum of the envelope is obtained producing one delay value per sentence.
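A simplified sketch of the per-sentence delay search is given below. It is not the measurement procedure itself: the function name `sentence_delay` and its parameters are hypothetical, the search is a brute-force loop over candidate lags, and the peak of the absolute cross-correlation is used as a stand-in for the maximum of the envelope.

```python
import math


def sentence_delay(stimulus, recording, centre, window, max_lag):
    """Return the lag (in samples) that maximizes |cross-correlation|.

    A rectangular window of `window` samples centred at index `centre`
    of the stimulus is compared with the recording at lags 0..max_lag.
    """
    start = centre - window // 2
    if start < 0:
        raise ValueError("window starts before the stimulus")
    ref = stimulus[start:start + window]
    best_lag, best_val = 0, -1.0
    for lag in range(max_lag + 1):
        seg = recording[start + lag:start + lag + window]
        if len(seg) < window:
            break                            # ran past the end of the recording
        val = abs(sum(a * b for a, b in zip(ref, seg)))
        if val > best_val:
            best_val, best_lag = val, lag
    return best_lag


# Toy usage: a short burst surrounded by silence, delayed by 7 samples.
burst = [math.sin(0.37 * k * k) for k in range(20)]
stimulus = [0.0] * 90 + burst + [0.0] * 90
recording = [0.0] * 7 + stimulus
delay = sentence_delay(stimulus, recording, centre=100, window=40, max_lag=20)
```

Dividing the returned lag by the sampling rate gives the delay in seconds. A real implementation would typically use an FFT-based correlation and a proper envelope, which are omitted here for brevity.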

The UE delay in the receive direction, T_R-jitter(t), is obtained by subtracting T_TER from the measured T_TEAP-DRP(t). T_TER is the delay from the first electrical event at the electrical access point of the test equipment to the first bit of the corresponding speech frame at the system simulator antenna. It comprises the delay introduced by the test equipment and the simulated transport network packet delay introduced by the delay and loss profile (as specified for the respective profile in Annex F).

The difference D_T between maximum receiving delay obtained with at least 5 individual calls (see clause 7.10.2) and the delay T_R-constant measured for the CSS signal in constant delay condition is calculated. The quantity "Call-to-Call Variability Adjustment" (CCVA) = max(0,D_T) shall be added to the obtained delay for the speech signal T_R-jitter(t).
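The CCVA calculation above can be illustrated with a short sketch. The function names `ccva_ms` and `apply_ccva` are hypothetical, all delays are assumed to be in milliseconds, and the per-call receiving delays are assumed to be already available from the individual calls.

```python
def ccva_ms(call_delays_ms, t_r_constant_ms, min_calls=5):
    """Call-to-Call Variability Adjustment: CCVA = max(0, D_T).

    D_T is the maximum receiving delay over the individual calls minus
    the delay T_R-constant measured in constant-delay condition.
    """
    if len(call_delays_ms) < min_calls:
        raise ValueError("at least 5 individual calls are required")
    d_t = max(call_delays_ms) - t_r_constant_ms
    return max(0.0, d_t)


def apply_ccva(t_r_jitter_ms, ccva):
    """Return T_R-CCVA(t) = T_R-jitter(t) + CCVA for every sentence."""
    return [t + ccva for t in t_r_jitter_ms]


# Toy usage with made-up numbers (not measurement results).
example_ccva = ccva_ms([62, 70, 65, 68, 74], t_r_constant_ms=66)
adjusted = apply_ccva([100.0, 120.0], example_ccva)
```

Because of the max(0, D_T) clamp, the adjustment never reduces the reported delay: if T_R-constant already equals or exceeds the largest delay across the calls, CCVA is zero.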

The UE delay in the receiving direction shall be reported in the form of a histogram covering the range of measured CCVA-adjusted values (T_R-CCVA(t) = T_R-jitter(t) + CCVA) with a step of 20 ms. The following pseudo code provides an example implementation for the histogram:

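% T_R_CCVA: vector holding the 40 values T_R-CCVA(t), t=1…40, in ms
lo = min(floor(T_R_CCVA(1:40)/20)*20);    % lowest bin, multiple of 20 ms
hi = max(ceil(T_R_CCVA(1:40)/20)*20);     % highest bin, multiple of 20 ms
[n,x] = hist(T_R_CCVA(1:40), lo:20:hi);   % counts n for the bins x
bar(x,n)                                  % plot the histogram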

The T_R-CCVA values for all 40 sentences shall also be reported in the test report.
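For readers without a MATLAB/Octave-style environment, the binning in the pseudo code above can be sketched in Python. The sketch assumes that `hist` treats the vector lo:20:hi as bin centres, so each value is counted in the bin whose centre is nearest. The function name `centre_histogram` is hypothetical, and the handling of values lying exactly halfway between two centres (counted in the upper bin here) is an assumption.

```python
import math


def centre_histogram(values, step):
    """Count values into bins centred at lo, lo+step, ..., hi."""
    lo = min(math.floor(v / step) * step for v in values)
    hi = max(math.ceil(v / step) * step for v in values)
    n_bins = int(round((hi - lo) / step)) + 1
    centres = [lo + i * step for i in range(n_bins)]
    counts = [0] * n_bins
    for v in values:
        # Nearest centre; an exact midpoint goes to the upper bin (assumption).
        idx = int(math.floor((v - lo) / step + 0.5))
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    return centres, counts


# Toy usage with made-up delays in ms (not measurement results).
centres, counts = centre_histogram([101, 119, 125, 139, 140], step=20)
```

In this toy example the bin centres are 100, 120 and 140 ms. The sketch is intended for the 20 ms delay step; a step of 0.1 would need extra care with floating-point rounding.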

NOTE 3: The synchronization of the speech frame processing in the UE to the bits of the speech frames at the UE antenna may lead to a variability of up to 20 ms of the measured UE receive delay between different calls. This synchronization is attributed to the UE receiving delay according to the definition of the UE delay reference points. The effect of this possible call-to-call variation is taken into account with the CCVA = max(0,D_T) value.

8.13.2 Speech quality loss histogram

For the evaluation of speech quality loss in conditions with packet arrival time variations and packet loss, the speech test signal described in clause 8.13.1 shall be used. Two 48 kHz recordings are used to produce the speech quality loss metric:

– A recording obtained in jitter and error free conditions with the test signal described in clause 8.13.1 (reference condition)

– A recording obtained during the application of packet arrival time variations and packet loss as described in clause 8.13.1 (test condition)

The speech quality of the signal is estimated using the measurement algorithm described in ITU-T Recommendation P.863 [44] in super-wideband mode. Level pre-alignment to -26 dBov of recordings shall be used – see P.863.1 clause 10.2 [45].

NOTE: For the analysis of acoustical measurements, ITU-T P.863 [44] assumes diffuse-field equalized recordings. For this reason, signals at DRP are diffuse-field corrected for testing handset and headset UE. For electrical interface UE, only the level pre-alignment is applied.

A score shall be computed for each 8 s speech sentence pair. The MOS-LQO values for the reference and test conditions shall be reported in the form of a histogram covering the range of measured values with a step of 0.1, and the values for all 20 sentence pairs shall also be reported in the test report. The following pseudo code provides an example implementation for the histogram:

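% MOS_LQO_test: vector holding the 20 values MOS-LQO_{test condition}(i), i=1…20
lo = min(floor(MOS_LQO_test(1:20)/0.1)*0.1);    % lowest bin, multiple of 0.1
hi = max(ceil(MOS_LQO_test(1:20)/0.1)*0.1);     % highest bin, multiple of 0.1
[n,x] = hist(MOS_LQO_test(1:20), lo:0.1:hi);    % counts n for the bins x
bar(x,n)                                        % plot the histogram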

The synchronization between stimuli and degraded condition shall be done by the test system before applying the P.863 algorithm on each sentence pair.
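The segmentation into 8 s sentence pairs can be sketched as below. This is an assumption-laden illustration: the function name `sentence_pair_bounds` is hypothetical, both recordings are assumed to be already time-aligned to the 160 s stimulus, and the P.863 scoring itself is not shown.

```python
def sentence_pair_bounds(fs=48000, pair_s=8.0, total_s=160.0):
    """Return (start, end) sample indices for each sentence pair."""
    pair_len = int(round(pair_s * fs))       # samples per 8 s sentence pair
    n_pairs = int(round(total_s / pair_s))   # 20 pairs in a 160 s recording
    return [(i * pair_len, (i + 1) * pair_len) for i in range(n_pairs)]


bounds = sentence_pair_bounds()
# Each (start, end) slice of the reference and test recordings would then
# be passed to the quality model to obtain one MOS-LQO value per pair.
```

Applying the same bounds to the reference and test recordings yields the 20 values per condition that are reported and binned in the histogram.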