4.2.3 Motion to Sound Latency in Dynamic Binaural Rendering Systems

3GPP TS 26.260 Release 18: Objective test methodologies for the evaluation of immersive audio systems

4.2.3.1 Introduction

Motion-to-sound latency is the time difference between a change in head rotation and the moment at which the immersive audio signal is compensated for that head motion. The method in this specification is intended to verify that the overall motion-to-sound latency a user experiences upon rotating their head is within acceptable limits.

The method allows full measurement of the motion-to-sound latency, i.e. including both the latency of the head-tracking sensor and that of the audio playback. It covers all components of a real setup and therefore captures all possible causes of additional latency that a user may experience.

The method also provides a latency value for the isolated audio processing of the binaural renderer without the aforementioned external hardware, assuming that the binaural renderer can process audio data as an audio processing plugin that can be evaluated in isolation.

NOTE: This method requires synchronized playback of two renderer instances and may not be suitable for the measurement of UEs where such synchronization is not possible.

4.2.3.2 Requirements

The following will be required:

Software:

– Audio processing software to run and record output of two renderers simultaneously

– Head tracker software

Hardware:

– Host machine for audio processing

– Head tracker hardware

– Stereo audio recording interface

– Stereo audio playback interface

– Mechanical setup to rotate the head tracking sensor in a precise and reproducible way

An exemplary hardware setup is shown in Figure 2; the method can, however, also be implemented using different systems under test and accompanying equipment:

Figure 2: Hardware Overview (Setup in Position 1 on the left, Position 2 on the right)

The audio processing environment uses two parallel signal chains, each containing its own instance of the same binaural renderer being tested. The test is concerned only with yaw angles, so values of pitch and roll should be set to zero at the beginning of the test and can be ignored thereafter.

Figure 3: Generic Audio Processing Environment

The initial conditions are that Rendering Chain 1 (RC1) has a static yaw head rotation angle of 0 degrees and RC2 uses the physical rotation of the head tracker to get its yaw value. A white noise signal is virtually placed directly in front of the listener (0 degrees azimuth, 0 degrees elevation), meaning that rotation of the arm directly affects how the white noise source is rendered.

4.2.3.3 Calibration

The first step is to calibrate the final position of the rotating arm (Position 2 / P2). The rotating arm is moved manually and requires only a limited range of motion: from a small rotation away from the table (Position 1 / P1), for which 20 to 30 degrees is ample, down to contact with the table (P2). The arm should be placed at P2 and set up so that this position also corresponds to 0 degrees yaw.

4.2.3.4 Evaluation Environment

An object within the evaluation environment, e.g. using Max/MSP, should be created to set the value of yaw to exactly 0 degrees once the absolute value of yaw received from the head tracker is below 0.2 degrees.

NOTE: This tolerance value was chosen to be as small as possible while ensuring that the latched value does not bounce (dependent on the accuracy of the tracking system).

This object should be designed to latch to zero once the actual value is under the tolerance threshold, so that any small accidental rebound of the rotating arm does not affect the yaw angle fed to the renderer: the value artificially remains at exactly 0 degrees. This is important to ensure that both rendering chains have exactly the same head rotation when the arm is in its final position (P2). The output from the evaluation environment is captured by the recording audio interface and therefore includes any latency introduced by playback.
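The latching behaviour described above can be sketched as follows; this is a minimal illustration of the logic, not part of any Max/MSP API, and the class and method names are hypothetical (the 0.2 degree tolerance is taken from this clause):

```python
class YawLatch:
    """Latches the yaw angle to exactly 0 degrees once the tracked
    value falls inside the tolerance, so that later rebounds of the
    rotating arm are ignored."""

    def __init__(self, tolerance_deg=0.2):
        self.tolerance_deg = tolerance_deg
        self.latched = False

    def process(self, tracked_yaw_deg):
        # Once latched, always output exactly 0 degrees.
        if self.latched:
            return 0.0
        if abs(tracked_yaw_deg) < self.tolerance_deg:
            self.latched = True
            return 0.0
        # Outside the tolerance: pass the tracker value through.
        return tracked_yaw_deg
```

With this behaviour, a small rebound of the arm after contact (e.g. a reading of 0.5 degrees after latching) still yields exactly 0 degrees at the renderer input.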

4.2.3.5 Data acquisition

A test run begins by starting the recording on the recording audio interface. The rotating arm is set to Position 1; the audio processing is then started and begins feeding the input source to both renderer instances. A microphone is positioned near the contact point on the table. This mono room microphone is recorded synchronously with the output from the evaluation environment; its purpose is to log the point of contact of the arm with the table. The arm should be released with a good amount of speed and vigour so that the microphone picks up a loud knock on the table. Shortly after this (one or two seconds, for example), with the test run now complete, playback and recording can be stopped.

A few milliseconds after the collision, the latest yaw value detected by the tracking system will have been passed into the evaluation environment (tTracking). With the target yaw value now reached (latched to zero in Max/MSP), both rendering chains have identical values of head rotation and therefore, after a further short delay, the outputs of both renderer instances will be identical.

4.2.3.6 Data Analysis

The overall motion-to-sound latency (tM2S) is taken as the time from the moment of collision until the point at which the two output signals are identical.

To allow easy visual inspection of when this point occurs, one output channel of one signal chain (e.g. RC1 left) is subtracted from the same output channel of the complementary signal chain (RC2 left).

NOTE 1: This could be done manually in audio editor software after processing, but this would require recording at least three channels synchronously (one from each renderer chain, and one of the room microphone). Instead, the subtraction of signals can be done within Max/MSP, meaning only the output of this operation (one channel) and the room microphone can conveniently be recorded with a stereo audio interface. In addition to the stereo WAVE file recorded by a separate audio application, the Max/MSP application also writes to a separate mono WAVE file once it detects that it is in the final tolerated yaw position (latched on). This mono WAVE contains only the subtracted signals as described above, from which the tMspProc time can also be measured.

Evaluation is performed offline in audio editor software. The tMspProc time is measured from the start of the file until the point at which consecutive zero samples begin. This value encompasses any motion-to-sound latency caused by the tested renderer chain as well as any other latency caused by Virtual Studio Technology (VST) plugin framework buffering. The tMspProc time shall be measured from the audio frame boundary at which the latched-on yaw value is activated and applied within that audio frame.

NOTE 2: Since the yaw rotation update rate of the tracker is typically in the range of a few milliseconds, there is a framing mismatch when compared to the audio framing, but this mismatch will not be incorporated in the tMspProc value but rather only in the tM2S measurement.
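The offline detection of the point at which consecutive zero samples begin can be sketched as follows; the function name and the min_zero_run guard are illustrative assumptions, not part of this specification:

```python
import numpy as np

def measure_tmspproc(diff_signal, sample_rate, min_zero_run=1024):
    """Return the tMspProc latency in seconds: the time from the start
    of the mono difference file to the first sustained run of zero
    samples, i.e. where the two renderer outputs become identical.

    min_zero_run guards against isolated zero crossings being
    mistaken for true cancellation."""
    is_zero = np.asarray(diff_signal) == 0
    run = 0
    for i, z in enumerate(is_zero):
        run = run + 1 if z else 0
        if run >= min_zero_run:
            start = i - min_zero_run + 1  # first sample of the zero run
            return start / sample_rate
    return None  # no sustained cancellation found
```

For example, a difference file with 4800 non-zero samples followed by silence at 48 kHz yields a tMspProc of 100 ms.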

An example measurement of tMspProc is displayed in Figure 4. tM2S is measured by selecting the duration between the visible collision peak in the microphone channel and the point at which the other channel falls to silence. Figure 5 shows an example measurement of the motion-to-sound latency.

Figure 4: tMspProc latency measurement

Figure 5: Motion-to-sound (tM2S) latency measurement

In Figure 5 the room recording is at the top and the subtracted renderer output at the bottom. The marked region is the time elapsed between the arm hitting the table (recorded knock) and the subtracted binaural renderer output reaching silence.

NOTE 3: Unlike the tMspProc measurement, the tM2S measurement is taken from signals recorded from hardware audio interfaces, hence it is not possible to look for continuous silence since the resulting file will always contain some noise added by the digital-to-analog and analog-to-digital converters. For this reason, it is important to ensure a high signal-to-noise ratio in the signal provided to audio interfaces, to make it easier to inspect where the cancellation occurs.
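Because of this converter noise, a practical tM2S evaluation looks for the first point after the knock where the difference channel stays below a noise-floor threshold for some minimum duration, rather than for exact zeros. A minimal sketch (function name, threshold and window values are illustrative and depend on the measured noise floor):

```python
import numpy as np

def measure_tm2s(mic, diff, sample_rate,
                 silence_threshold=0.01, min_quiet_s=0.05):
    """Estimate tM2S in seconds from a stereo hardware recording:
    mic  - room microphone channel (contains the knock),
    diff - subtracted renderer output channel.

    'Silence' is the first point after the knock where the difference
    signal stays below silence_threshold for at least min_quiet_s."""
    mic = np.abs(np.asarray(mic, dtype=float))
    diff = np.abs(np.asarray(diff, dtype=float))
    knock = int(np.argmax(mic))            # sample index of the collision peak
    win = int(min_quiet_s * sample_rate)   # required quiet window length
    quiet = diff < silence_threshold
    for i in range(knock, len(diff) - win):
        if quiet[i:i + win].all():
            return (i - knock) / sample_rate
    return None  # no sustained silence found after the knock
```

A high signal-to-noise ratio, as recommended above, widens the gap between the active difference signal and the noise floor and so makes the threshold choice uncritical.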

Annex A (normative):
Order dependent directions

The following tables list the order-dependent directions, where the elevations and azimuths are given in radians.

Order 1 (4 directions):

Index   Elevation   Azimuth
1       1.570796    0
2       -0.339837   0
3       -0.339837   2.094395
4       -0.339837   -2.0944

Order 2 (9 directions):

Index   Elevation   Azimuth
1       1.570796    0
2       -0.790277   0
3       0.363207    -1.95668
4       0.363207    1.956682
5       -0.844382   -1.95668
6       0.009757    -3.14159
7       -0.844382   1.956681
8       0.245128    0.687124
9       0.245129    -0.68712

Order 3 (16 directions):

Index   Elevation   Azimuth
1       1.570796    0
2       0.716698    0
3       -0.461173   1.119907
4       -1.034310   -0.25283
5       0.492174    1.155586
6       -0.165812   2.040481
7       -0.461172   -1.38118
8       -0.165813   0.270692
9       0.001916    -2.20417
10      0.653709    2.297267
11      0.653709    -2.80293
12      -0.192680   3.010956
13      -1.079056   2.154919
14      0.001915    -0.63529
15      0.616834    -1.41973
16      -0.887326   -2.46809

Order 4 (25 directions):

Index   Elevation   Azimuth
1       1.570796    0
2       0.747578    0
3       -0.168324   -2.00759
4       0.846499    1.927637
5       0.234515    -1.41208
6       0.699165    -2.10001
7       0.307091    2.512927
8       0.130649    1.667633
9       -0.677517   1.442383
10      0.136843    -0.60062
11      -1.317269   0.329968
12      -0.433118   -1.18621
13      -0.231864   2.983332
14      0.174242    -2.69222
15      -0.599985   0.507602
16      -0.382009   2.208977
17      -0.009394   0.952319
18      -1.013813   -1.71565
19      0.696199    0.934402
20      -0.602139   -0.38654
21      -1.041921   2.675958
22      -0.623111   -2.62842
23      0.054056    0.165012
24      0.855489    -1.02504
25      0.808243    -3.13121

Order 5 (36 directions):

Index   Elevation   Azimuth
1       1.570796    0
2       -0.454100   0
3       0.323739    -1.19666
4       -1.175381   0.184066
5       0.947221    0.124282
6       -0.193698   -2.84022
7       0.500281    -1.84701
8       -0.663529   0.698758
9       -0.613332   2.280239
10      -0.588043   -2.28482
11      0.946645    -2.37569
12      0.333311    2.883411
13      0.967374    -1.18504
14      0.436854    -2.76846
15      0.510141    0.763488
16      -0.063811   -0.46491
17      0.048266    -2.27504
18      -0.148392   1.762138
19      0.945735    2.804486
20      -0.125777   -1.69175
21      -0.241518   -1.0321
22      -0.063824   0.509415
23      -1.240392   -1.95737
24      0.542172    -0.567
25      0.043647    2.319619
26      -0.291045   2.853233
27      -0.841101   -3.07101
28      -1.213891   2.113132
29      -0.706626   -1.50877
30      -0.774625   -0.65404
31      -0.707445   1.464227
32      0.990842    1.373127
33      -0.122664   1.112751
34      0.598614    2.113949
35      0.306690    0.057137
36      0.381934    1.457925

Order 6 (49 directions):

Index   Elevation   Azimuth
1       1.570796    0
2       0.720144    0
3       -0.308365   3.024454
4       0.068431    2.080642
5       -0.495677   -2.21373
6       -0.018779   -2.03598
7       0.426043    1.678014
8       -0.259742   0.964363
9       0.179320    -3.03552
10      -0.249618   -2.70206
11      1.074183    0.581055
12      -0.781172   -2.80103
13      0.457849    0.550136
14      0.523951    -1.98436
15      -0.006246   -0.51212
16      -0.788507   -1.1411
17      0.228181    -2.48765
18      -0.418110   -1.62282
19      -0.512688   -0.57506
20      0.572140    2.286204
21      -0.867576   -0.08741
22      -0.624799   0.547028
23      -0.446687   1.878965
24      -0.789667   2.746717
25      1.047763    -0.76025
26      0.247192    -1.01978
27      0.720143    1.162107
28      -0.081819   1.507148
29      0.226040    1.062706
30      0.709088    -2.68135
31      -0.249096   -1.08377
32      0.573959    2.91352
33      1.069121    2.939099
34      0.135381    -1.53966
35      -0.057504   0.473238
36      -0.975369   -1.95522
37      -0.666036   1.294994
38      -1.146922   0.887936
39      -0.357070   2.427548
40      0.200642    -0.01608
41      -0.965084   1.97199
42      0.681666    -1.35341
43      0.112434    2.651183
44      0.528475    -0.57647
45      1.003627    1.857517
46      -1.275974   -0.77916
47      1.051102    -2.01121
48      -1.315079   3.087768
49      -0.326694   -0.00446
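As a cross-check when using the direction tables above, each (elevation, azimuth) pair can be converted to a Cartesian unit vector. A minimal sketch; the axis convention (x forward, y left, z up) is a common choice and an assumption here, not mandated by this specification:

```python
import math

def direction_to_cartesian(elevation, azimuth):
    """Convert an (elevation, azimuth) pair in radians to a unit
    vector (x, y, z), assuming x forward, y left, z up."""
    x = math.cos(elevation) * math.cos(azimuth)
    y = math.cos(elevation) * math.sin(azimuth)
    z = math.sin(elevation)
    return (x, y, z)
```

For example, the first row of every table, (1.570796, 0), maps to approximately (0, 0, 1), i.e. straight up; every converted direction has unit length.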

Annex B (normative):
Directions in Gaussian spherical grid