7.11 Echo control characteristics

3GPP TS 26.132, Release 18: Speech and video telephony terminal acoustic test specification


7.11.1 Test set-up and test signals

The device is set up according to clause 5. The ambient noise level shall be ≤ ‑64 dBPa(A).

The test shall be performed with the British-English "long" double-talk and conditioning speech sequences from ITU-T Recommendation P.501 [22], with the signals in the receiving direction band limited according to clause 5.4.

A description of the test stimuli is presented in Table 2a and Table 2b. The test sequence is composed of an initial conditioning sequence of 23,5 s and a double talk sequence of 35 s. For the analysis, the double talk sequence is divided into two segments, a first double-talk sequence with single short near-end words (0 – 20 s), and a second double-talk sequence with continuous double talk (20 – 35 s).

The sending speech during double-talk and the "near-end speech only" are recorded individually, with the "near-end speech only" sequence recorded with silence in the receiving direction. The time-alignment of the two recorded sequences is performed off-line during the analysis.

Table 2a: Test stimuli for recording of Echo Canceller operation

	Conditioning	Single words (segment 1) and full sentence (segment 2) double talk
Far-end signal	FB_female_conditioning_seq_long.wav	FB_male_female_single-talk_seq.wav
Artificial mouth signal	FB_male_conditioning_seq_long.wav	FB_male_female_double-talk_seq.wav

Table 2b: Test stimuli for reference "near-end speech only" recording.

	Conditioning	Single words (segment 1) and full sentence (segment 2) double talk
Far-end signal	FB_female_conditioning_seq_long.wav	silence
Artificial mouth signal	FB_male_conditioning_seq_long.wav	FB_male_female_double-talk_seq.wav

The level of the signal of the artificial mouth shall be -4,7 dBPa measured at the MRP. For electrical interface UE, the level of the signal shall be calibrated to -60 dBV for analogue and to -16 dBm0 for digital connections. In order to obtain a reproducible time alignment as seen by the UE, the send signal (artificial mouth, electrical reference interface output) shall be delayed by the amount of the receiving direction delay. For the purpose of this alignment, the receiving direction delay for handset and headset modes is defined from the system simulator input to the artificial ear or the electrical reference interface, respectively. For hands-free modes, the downlink delay is defined from the system simulator input to the acoustic output from the UE loudspeaker.
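The delay compensation described above can be sketched as prepending silence to the send signal. This is a minimal illustration only: the function name `delay_send_signal`, the sample-list representation and the way the receiving direction delay is obtained are assumptions, not part of the specification.

```python
def delay_send_signal(send, fs, rx_delay_s):
    """Delay the send signal by the receiving direction delay.

    send        -- send-side samples (artificial mouth or electrical interface)
    fs          -- sampling rate in Hz
    rx_delay_s  -- measured receiving direction delay in seconds (assumed known)
    """
    n = round(rx_delay_s * fs)        # delay expressed in whole samples
    return [0.0] * n + list(send)     # leading silence shifts the signal later
```

For example, a 1 ms receiving direction delay at 8 kHz shifts the send signal by 8 samples.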

The level of the downlink signal shall be -16 dBm0 measured at the digital reference point or the equivalent analogue point.

For electrical interface UE, an echo loss of 30 dB as described in clause 5.1.6 shall be simulated in the electrical reference interface.

7.11.2 Test method

The test method measures the duration of any level difference between the sending signal of a double-talk sequence (where the echo canceller has been exposed to simultaneous echo and near-end speech) and the sending signal of the same near-end speech only. The level difference is classified into eight categories according to Figure 17b5 and Table 2c, representing various degrees of "Full duplex operation", "Near-end clipping", and "Residual echo".

NOTE 1: The limits for specifying the categories in Figure 17b5 and Table 2c are provisional pending further analysis and validation.

NOTE 2: The categories in Figure 17b5 and Table 2c are labelled in a functional order and the subjective impression of the respective categories is for further study.

NOTE 3: To reduce potential issues associated with low-frequency test room noise, a [4th]-order high-pass filter with a cut-off frequency of [100] Hz can be applied before the level computation.
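One possible realisation of the optional high-pass filter in NOTE 3 is sketched below. The specification does not state the filter type; a Butterworth response built from two cascaded second-order sections (bilinear transform) is an assumption made here for illustration, as is the function name `butter4_highpass`.

```python
import math

def butter4_highpass(samples, fs, fc=100.0):
    """4th-order Butterworth high-pass as two cascaded biquads (assumed design)."""
    w0 = 2.0 * math.pi * fc / fs
    cos_w0, sin_w0 = math.cos(w0), math.sin(w0)
    out = list(samples)
    for k in (1, 3):  # the two pole-pair angles of a 4th-order Butterworth
        q = 1.0 / (2.0 * math.cos(k * math.pi / 8.0))
        alpha = sin_w0 / (2.0 * q)
        a0 = 1.0 + alpha
        b0 = (1.0 + cos_w0) / 2.0 / a0
        b1 = -(1.0 + cos_w0) / a0
        b2 = b0
        a1 = -2.0 * cos_w0 / a0
        a2 = (1.0 - alpha) / a0
        x1 = x2 = y1 = y2 = 0.0
        for n, x in enumerate(out):   # direct form I, one section at a time
            y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
            x2, x1 = x1, x
            y2, y1 = y1, y
            out[n] = y
    return out
```

With this design a constant (DC) input decays to zero, while a 1 kHz tone passes essentially unattenuated.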

Figure 17b5: Classification of echo canceller performance

Table 2c: Categories for echo canceller performance classification

Category	Level difference (ΔL)	Duration (D)	Description
A1	-4 dB ≤ ΔL < 4 dB		Full-duplex and full transparency
A2	-15 dB ≤ ΔL < -4 dB		Full-duplex with level loss in Tx
B	ΔL < -15 dB	D < 25 ms	Very short clipping
C	ΔL < -15 dB	25 ms ≤ D < 150 ms	Short clipping resulting in loss of syllables
D	ΔL < -15 dB	D ≥ 150 ms	Clipping resulting in loss of words
E	ΔL ≥ 4 dB	D < 25 ms	Very short residual echo
F	ΔL ≥ 4 dB	25 ms ≤ D < 150 ms	Echo bursts
G	ΔL ≥ 4 dB	D ≥ 150 ms	Continuous echo
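The limits in Table 2c can be expressed as a small lookup, sketched below. It assumes that ΔL is the level of the double-talk recording minus the level of the "near-end speech only" recording (so residual echo gives positive values), and that D is the duration in ms of the run of adjacent frames sharing the same level range. The function names are illustrative and are not taken from clause C.3.

```python
def level_class(delta_l_db):
    """Map a level difference in dB to 'A1', 'A2', 'clip' or 'echo'."""
    if delta_l_db >= 4:
        return "echo"      # dL >= 4 dB: residual echo (E, F, G)
    if delta_l_db >= -4:
        return "A1"        # -4 dB <= dL < 4 dB
    if delta_l_db >= -15:
        return "A2"        # -15 dB <= dL < -4 dB
    return "clip"          # dL < -15 dB: near-end clipping (B, C, D)

def category(delta_l_db, duration_ms):
    """Return the Table 2c category for one frame."""
    cls = level_class(delta_l_db)
    if cls in ("A1", "A2"):
        return cls         # no duration criterion for A1 and A2
    if duration_ms < 25:
        idx = 0
    elif duration_ms < 150:
        idx = 1
    else:
        idx = 2
    return ("BCD" if cls == "clip" else "EFG")[idx]
```

For example, `category(-20, 150)` yields `"D"` and `category(6, 10)` yields `"E"`.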

A pseudo-code reference of the test method including test scripts and test-vectors is presented in clause C.3 and outlined in the following sub clauses.

7.11.2.1 Signal alignment

For the analysis of the signal level difference, the send signal during double-talk and the near-end only signal are aligned using a correlation analysis as described in clause C.3.2.
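The normative alignment procedure is the one in clause C.3.2. As a rough illustration of the idea only, the sketch below searches for the lag that maximises an unnormalised cross-correlation between the two recordings. The function name `best_lag` and the bounded search range `max_lag` are assumptions.

```python
def best_lag(ref, sig, max_lag):
    """Return the lag in samples, within [-max_lag, max_lag], for which
    sig[n + lag] best matches ref[n] (maximum cross-correlation)."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):      # ignore samples outside the overlap
                score += r * sig[m]
        if score > best_score:
            best, best_score = lag, score
    return best
```

A positive result means that `sig` lags `ref` by that many samples. This brute-force search is quadratic in signal length, so a practical analysis would typically correlate level envelopes or use an FFT-based method.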

7.11.2.2 Signal level computation and frame classification

The analysis is based on the digital level measured with a meter according to IEC 61672 [38] with a time constant of 12,5 ms, sampled at 5 ms intervals corresponding to the evaluated frames.
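The level measurement above can be approximated by exponential averaging of the squared signal, read out every 5 ms. The sketch below is not the reference implementation of clause C.3.3; it assumes a first-order exponential time weighting, levels expressed in dB relative to a full-scale value of 1.0, and an arbitrary floor of -120 dB to avoid taking the logarithm of zero.

```python
import math

def frame_levels_db(samples, fs, tau_s=0.0125, hop_s=0.005, floor_db=-120.0):
    """Exponentially time-weighted level, one value per 5 ms frame."""
    alpha = 1.0 - math.exp(-1.0 / (fs * tau_s))   # smoothing for time constant tau
    hop = round(fs * hop_s)                       # samples per evaluated frame
    floor_power = 10.0 ** (floor_db / 10.0)
    power, levels = 0.0, []
    for n, x in enumerate(samples, start=1):
        power += alpha * (x * x - power)          # one-pole average of x squared
        if n % hop == 0:                          # read out every hop samples
            levels.append(10.0 * math.log10(max(power, floor_power)))
    return levels
```

At a sampling rate of 8 kHz this gives one level value per 40 samples.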

The "double-talk" frames are defined as the frames where the far-end (receiving direction) signal includes active speech (extended with a hang-over period of 200 ms) and the near-end signal also contains active speech. Active speech is detected using a speech level meter according to ITU-T P.56; frames with a level no more than 15,9 dB below the active speech level are classified as active speech frames.

The "far-end single-talk adjacent to double-talk" frames are similarly defined using a speech level meter according to ITU-T P.56 as the frames with active far-end speech (extended with a hang-over period of 200 ms) and no active near-end speech (extended with a hang-over period of 200 ms).
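The two frame types can be derived from per-frame activity flags as sketched below. This is an illustration under stated assumptions: the active speech level is taken as already measured according to ITU-T P.56, a frame counts as active when its level is no more than 15,9 dB below that level, and the 200 ms hang-over corresponds to 40 frames of 5 ms. All function names and label strings are hypothetical.

```python
def active(levels_db, active_level_db, margin_db=15.9):
    """Per-frame activity flags relative to the active speech level."""
    return [lv >= active_level_db - margin_db for lv in levels_db]

def extend(flags, hang_frames):
    """Keep a flag raised for hang_frames frames after activity ends."""
    out, remaining = [], 0
    for f in flags:
        if f:
            remaining = hang_frames
            out.append(True)
        else:
            out.append(remaining > 0)
            remaining = max(0, remaining - 1)
    return out

def classify_frames(far_active, near_active, hang_frames=40):
    """Label each frame 'double-talk', 'far-end single-talk' or None."""
    far_ext = extend(far_active, hang_frames)
    near_ext = extend(near_active, hang_frames)
    labels = []
    for fe, na, ne in zip(far_ext, near_active, near_ext):
        if fe and na:
            labels.append("double-talk")
        elif fe and not ne:
            labels.append("far-end single-talk")
        else:
            labels.append(None)   # frame is not evaluated
    return labels
```

Note that the near-end hang-over is applied only when excluding frames from far-end single-talk, following the wording of the two definitions.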

A reference implementation of the signal level computation and frame classification is presented in clause C.3.3.

7.11.2.3 Classification into categories

The analysis and classification into the categories according to Figure 17b5 and Table 2c is performed according to the reference implementation described in clause C.3.4.

The frames are first categorized according to the level categories defined in Table 2c. To determine the durations, the number of adjacent frames falling into the same level category is determined.
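Counting adjacent frames amounts to a run-length computation. The sketch below assigns to every frame the duration of the run it belongs to, assuming 5 ms frames and that runs are determined on the full frame sequence before it is split into situations. The function name and the level-range labels are illustrative.

```python
from itertools import groupby

def run_durations_ms(level_classes, frame_ms=5):
    """For each frame, return the duration in ms of its run of
    adjacent frames with the same level class."""
    durations = []
    for _, run in groupby(level_classes):
        n = len(list(run))                    # number of adjacent frames
        durations.extend([n * frame_ms] * n)  # same duration for the whole run
    return durations
```

For example, two adjacent clipping frames followed by a single transparent frame give durations of 10 ms, 10 ms and 5 ms.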

The classification is then performed individually for the following situations:

– frames classified as "double-talk" from segment 1 of the double-talk sequence (see clause 7.11.1)

– frames classified as "far-end single-talk adjacent to double-talk" from segment 1 of the double-talk sequence

– frames classified as "double-talk" from segment 2 of the double-talk sequence

– frames classified as "far-end single-talk adjacent to double-talk" from segment 2 of the double-talk sequence
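The four situations above can be formed by combining each frame's label with the segment it falls into. The sketch below assumes that frame index 0 corresponds to the start of the double-talk sequence (after the conditioning sequence), that frames are spaced 5 ms apart, and that segment 2 starts at 20 s. The function names and label strings are hypothetical.

```python
def situation(frame_index, label, frame_ms=5, split_s=20.0):
    """Return (segment, label) for an evaluated frame, else None."""
    if label is None:
        return None
    t = frame_index * frame_ms / 1000.0   # frame time within the sequence
    segment = 1 if t < split_s else 2
    return (segment, label)

def group_by_situation(labels, frame_ms=5, split_s=20.0):
    """Collect frame indices per situation."""
    groups = {}
    for i, lab in enumerate(labels):
        key = situation(i, lab, frame_ms, split_s)
        if key is not None:
            groups.setdefault(key, []).append(i)
    return groups
```

The percentage and averaged level difference statistics are then computed separately over the frames of each group.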

To determine the percentage values for each category (A1, A2, B, C, D, E, F, and G) within each situation, the number of frames falling into the respective category is divided by the total number of frames within the situation in question.
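The percentage computation can be sketched as follows, given the category of every frame within one situation. The function name is illustrative, and returning 0 for an empty situation is an assumption.

```python
from collections import Counter

CATEGORIES = ("A1", "A2", "B", "C", "D", "E", "F", "G")

def category_percentages(frame_categories):
    """Percentage of frames per category within one situation."""
    total = len(frame_categories)
    counts = Counter(frame_categories)
    return {c: (100.0 * counts[c] / total if total else 0.0)
            for c in CATEGORIES}
```

By construction the eight percentages of a non-empty situation sum to 100.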

To determine the averaged level difference of the frames for each category (A1, A2, B, C, D, E, F, and G) within each situation, the sum of the level difference (in dB) of the frames falling into the respective category is divided by the total number of frames within the situation in question.
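A literal reading of the paragraph above is sketched below: the level differences of the frames in a category are summed and divided by the total number of frames in the situation, not by the number of frames in that category. The function name is illustrative, and returning 0 for an empty situation is an assumption.

```python
CATEGORIES = ("A1", "A2", "B", "C", "D", "E", "F", "G")

def averaged_level_difference(frame_categories, delta_l_db):
    """Per-category sum of level differences (dB) divided by the
    total number of frames in the situation."""
    total = len(frame_categories)
    sums = {c: 0.0 for c in CATEGORIES}
    for c, d in zip(frame_categories, delta_l_db):
        sums[c] += d
    return {c: (s / total if total else 0.0) for c, s in sums.items()}
```

For example, a situation with two A1 frames at 1 dB and 3 dB and two D frames at -20 dB and -30 dB gives 1 dB for A1 and -12,5 dB for D.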