B.5 Headphone Output Signal Computation
B.5.1 General
The overall Scene Model is represented by the collection of all point sources with updated positions obtained from the rotated channels, objects, and the ESD components, as well as the non-diegetic channels and objects for which ‘gca_fixedChannelsPosition == 1’ or ‘goa_fixedPosition == 1’. The overall number of point sources in the Scene Model is denoted with $S$.
B.5.2 HRIR Selection
The position of each point source in the listener-relative coordinate system is used to query a best-match HRIR pair from the set of available HRIRs. For the lookup, the polar coordinates of the HRIR locations are transformed into the internally used Cartesian coordinates, and the closest available HRIR pair for a given point source position is selected. As no interpolation between different HRIRs is performed, HRIR datasets with sufficient spatial resolution should be provided.
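The following sketch illustrates one possible nearest-neighbour lookup following the rules above; the function and variable names are illustrative and not part of this specification. It assumes the HRIR positions are given as (azimuth, elevation) pairs in degrees on the unit sphere.

```python
import numpy as np

def sph_to_cart(azi_deg, ele_deg):
    """Convert azimuth/elevation in degrees to unit-sphere Cartesian coordinates."""
    azi, ele = np.radians(azi_deg), np.radians(ele_deg)
    return np.stack([np.cos(ele) * np.cos(azi),
                     np.cos(ele) * np.sin(azi),
                     np.sin(ele)], axis=-1)

def select_hrir(source_pos_cart, hrir_azi_deg, hrir_ele_deg):
    """Return the index of the closest-match HRIR pair for one point source.

    source_pos_cart: listener-relative Cartesian position of the point source.
    hrir_*_deg:      arrays with the polar coordinates of the available HRIRs.
    """
    hrir_cart = sph_to_cart(np.asarray(hrir_azi_deg), np.asarray(hrir_ele_deg))
    # Only the direction is used for the lookup.
    direction = source_pos_cart / np.linalg.norm(source_pos_cart)
    # No interpolation is performed, so simply pick the nearest HRIR position.
    return int(np.argmin(np.linalg.norm(hrir_cart - direction, axis=-1)))
```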
B.5.3 Initialization
The HRIR filters used for binauralization are asynchronously partitioned and transformed into the frequency domain using a Fast Fourier Transform (FFT). The necessary steps for each of the HRIR filter pairs are as follows (see the sketch after this list):
1) Uniformly partition the length $N$ HRIR filter pairs into $K = \lceil N/B \rceil$ filter partitions of length $B$, where $B$ denotes the processing block length.
2) Zero-pad the filter partitions to length $2B$.
3) Transform all filter partitions into the frequency domain using a real-to-complex FFT to obtain the frequency domain filter pairs $\tilde{H}_k^{l/r}[f]$, $k = 0 \ldots K-1$, where $f$ denotes the frequency index.
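A minimal sketch of this initialization, assuming numpy and an HRIR array of shape (N, 2); names are illustrative only:

```python
import numpy as np

def partition_hrir(hrir, B):
    """Uniformly partition an HRIR filter pair and transform it to the frequency domain.

    hrir: array of shape (N, 2) holding the left/right ear impulse responses.
    B:    processing block length; the FFT length is 2*B.
    Returns an array of shape (K, B + 1, 2) of complex filter partitions.
    """
    N = hrir.shape[0]
    K = int(np.ceil(N / B))                       # number of partitions
    padded = np.zeros((K * B, 2))
    padded[:N] = hrir                             # step 1: uniform partitioning
    parts = padded.reshape(K, B, 2)
    parts = np.concatenate([parts, np.zeros((K, B, 2))], axis=1)  # step 2: zero-pad to 2B
    return np.fft.rfft(parts, axis=1)             # step 3: real-to-complex FFT
```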
B.5.4 Convolution and Crossfade
Each audio block of a point source $s$ of the Scene Model is convolved with its selected HRIR filter pair for the left and right ear, respectively. To reduce the computational complexity, the fast frequency domain convolution technique of uniformly partitioned overlap-save processing is used, which is efficient for typical FIR filter lengths of HRIRs/BRIRs. The required processing steps are described in the following.
The following block processing steps are performed for each of the $S$ point sources of the Scene Model (a combined sketch follows after the list):
a) Obtain a block of $B$ new input samples of the point source $s$.
b) Perform a real-to-complex FFT of length $2B$ over the current and the previous input block to obtain the frequency domain representation $\tilde{X}_s[f]$ of the input.
c) Compute the frequency domain headphone output signal pair $\tilde{Y}_s^{l/r}[f]$ for the point source $s$ by multiplying each HRIR frequency domain filter partition $\tilde{H}_k^{l/r}[f]$ with the associated (i.e., $k$ blocks delayed) frequency domain input block $\tilde{X}_{s,k}[f]$ and adding the product results over all $K$ partitions: $\tilde{Y}_s^{l/r}[f] = \sum_{k=0}^{K-1} \tilde{H}_k^{l/r}[f] \, \tilde{X}_{s,k}[f]$.
d) $2B$ samples of the time domain output signal pair $y_s^{l/r}[n]$ are obtained from $\tilde{Y}_s^{l/r}[f]$ by performing a complex-to-real IFFT.
e) Only the last $B$ output samples represent valid output samples; the $B$ samples before are time-aliased and are discarded.
f) In case an HRIR filter exchange happens due to changes in the scene displacement, steps c) to e) are computed for both the current HRIR filter pair and the one used in the previous block. A time-domain crossfade is performed over the $B$ valid output samples obtained in step e):
g) $y_s^{l/r}[n] = w_{\mathrm{out}}[n] \, y_{s,\mathrm{prev}}^{l/r}[n] + w_{\mathrm{in}}[n] \, y_{s,\mathrm{new}}^{l/r}[n]$, $n = 0 \ldots B-1$, where $y_{s,\mathrm{prev}}^{l/r}[n]$ and $y_{s,\mathrm{new}}^{l/r}[n]$ denote the output samples obtained with the previous and the current HRIR filter pair, respectively.
h) The crossfade envelopes are defined as
$w_{\mathrm{in}}[n] = \sin\left(\frac{\pi}{2} \cdot \frac{n+1}{B+1}\right)$, $w_{\mathrm{out}}[n] = \cos\left(\frac{\pi}{2} \cdot \frac{n+1}{B+1}\right)$, $n = 0 \ldots B-1$,
so that $w_{\mathrm{in}}^2[n] + w_{\mathrm{out}}^2[n] = 1$ holds, to preserve a constant power of the resulting output signal.
The crossfade operation defined in steps f) to h) is only applied to point sources of the Scene Model that have been generated from channel or object content. For HOA content, the crossfade is applied between the current and the previous rotation matrices (see B.4.2).
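The complete per-block processing for one point source can be sketched as follows. This is a sketch under stated assumptions, not a normative implementation: it assumes numpy, the illustrative partition_hrir() helper above, and a frequency domain delay line holding the most recent input spectra.

```python
import numpy as np

def process_block(x_new, x_prev, fd_filters, fdl):
    """Uniformly partitioned overlap-save convolution for one point source.

    x_new:      B new input samples (step a).
    x_prev:     the previous B input samples (overlap-save).
    fd_filters: (K, B + 1, 2) frequency domain filter partitions from partition_hrir().
    fdl:        frequency domain delay line, list of the K most recent input spectra.
    Returns B valid output samples per ear and the updated delay line.
    """
    B = x_new.shape[0]
    X = np.fft.rfft(np.concatenate([x_prev, x_new]))   # step b: FFT of length 2B
    fdl = [X] + fdl[:fd_filters.shape[0] - 1]          # shift the delay line
    # Step c: multiply each partition with the matching delayed spectrum, sum over k.
    Y = np.zeros((B + 1, 2), dtype=complex)
    for k, X_k in enumerate(fdl):
        Y += fd_filters[k] * X_k[:, np.newaxis]
    y = np.fft.irfft(Y, axis=0)                        # step d: IFFT yields 2B samples
    return y[B:], fdl                                  # step e: keep only the last B

def crossfade(y_prev, y_new):
    """Steps f) to h): constant-power crossfade after an HRIR filter exchange."""
    B = y_new.shape[0]
    n = np.arange(B)
    w_in = np.sin(0.5 * np.pi * (n + 1) / (B + 1))     # w_in^2 + w_out^2 == 1
    w_out = np.cos(0.5 * np.pi * (n + 1) / (B + 1))
    return w_out[:, np.newaxis] * y_prev + w_in[:, np.newaxis] * y_new
```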
B.5.5 Binaural Downmix
The rendered headphone output signal pair $\hat{y}^{l/r}[n]$ is computed as the sum over all $S$ binauralized point source signal pairs $y_s^{l/r}[n]$. In case the metadata provided together with the audio data at the input interface (see X.3.1) includes gain values applicable to a specific channel group (gca_channelGain in mpegh3da_getChannelMetadata()) or objects (goa_objectGainFactor in mpegh3da_getObjectAudioAndMetadata()), these gain values are applied to the corresponding binauralized point source signals before the summation:
$\hat{y}^{l/r}[n] = \sum_{s=1}^{S} g_s \, y_s^{l/r}[n]$,
where $g_s$ denotes the gain value associated with point source $s$ ($g_s = 1$ if no gain value is provided).
Finally, any additional non-binauralized non-diegetic audio input (‘gca_directHeadphone == 1’, see B.3.4) is added time-aligned to the two downmix channels.
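A minimal sketch of this downmix stage, assuming numpy and illustrative names; the gain handling follows the formula above:

```python
import numpy as np

def binaural_downmix(binaural_signals, gains, direct_headphone=None):
    """Sum all binauralized point sources into the two headphone channels.

    binaural_signals: array of shape (S, B, 2), one (B, 2) block per point source.
    gains:            per-source gain values g_s (1.0 where no gain is signalled).
    direct_headphone: optional (B, 2) block of non-diegetic 'gca_directHeadphone'
                      content, added time-aligned without binauralization.
    """
    out = np.einsum('s,sbc->bc', gains, binaural_signals)  # apply g_s, sum over s
    if direct_headphone is not None:
        out = out + direct_headphone
    return out
```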
B.5.6 Complexity
The algorithmic complexity of the external binaural renderer using a fast convolution approach can be evaluated for the following computations:
Computation | Complexity estimate (per processing block of $B$ output samples)
Convolution (B.5.4) | 1) RFFT: $S \cdot C_{\mathrm{FFT}} \cdot 2B \log_2(2B)$ (with $C_{\mathrm{FFT}}$ as an estimated additional complexity factor for the FFT)
2) complex multiplications: $2 \cdot S \cdot K \cdot (B+1)$
3) complex additions: $2 \cdot S \cdot (K-1) \cdot (B+1)$
4) IRFFT: $2 \cdot S \cdot C_{\mathrm{FFT}} \cdot 2B \log_2(2B)$
Downmix (B.5.5) | 1) real multiplications: $2 \cdot S \cdot B$
2) real additions: $2 \cdot (S-1) \cdot B$
Filter Exchange and Crossfade (B.5.4) | 1) RFFT: $2 \cdot K \cdot C_{\mathrm{FFT}} \cdot 2B \log_2(2B)$ per exchanged HRIR filter pair
2) Time-domain crossfade (real multiplications): $4 \cdot B$ per exchanged point source
3) Time-domain crossfade (real additions): $2 \cdot B$ per exchanged point source
Additional computations are required for scene displacement processing (see B.4).
The total complexity per output sample can be determined by adding the complexity estimates for convolution and downmix and dividing by the block length $B$. In blocks where a filter exchange is performed, items 2) to 4) of the convolution contribute twice to the overall complexity, in addition to the time-domain crossfade multiplications and additions (filter exchange items 2) and 3)). The partitioning and FFT for the filter exchange, as well as the scene displacement processing, can be performed independently of the input block processing.
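As an illustration, the estimate can be evaluated numerically. The sketch below restates the table above with illustrative names; the value of $C_{\mathrm{FFT}}$ is an assumed constant of the FFT cost model, not a normative figure.

```python
import math

def complexity_per_output_sample(S, N, B, C_fft=2.5):
    """Evaluate the complexity estimate of the table above, per output sample.

    S: number of point sources, N: HRIR filter length, B: block length.
    C_fft is an assumed additional complexity factor for the (I)RFFT of length 2*B.
    """
    K = math.ceil(N / B)                      # number of filter partitions
    fft = C_fft * 2 * B * math.log2(2 * B)    # cost model of one (I)RFFT
    return {
        "rfft_irfft":   (S * fft + 2 * S * fft) / B,       # convolution items 1) and 4)
        "complex_mul":  2 * S * K * (B + 1) / B,           # convolution item 2)
        "complex_add":  2 * S * (K - 1) * (B + 1) / B,     # convolution item 3)
        "downmix_real": (2 * S * B + 2 * (S - 1) * B) / B, # downmix items 1) and 2)
    }

# Example: 32 point sources, 512-tap HRIRs, 256-sample blocks at 48 kHz.
print(complexity_per_output_sample(S=32, N=512, B=256))
```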
B.5.7 Motion Latency
The Scene Model can be updated with arbitrary temporal precision, but the resulting HRIR exchange is only performed at processing block boundaries of the convolution. With a standard block size of $B = 256$ samples at 48 kHz sampling rate, this leads to a maximum onset latency of 5.3 ms until a motion of sources or the listener has an audible effect. In the following block, a time-domain crossfade between the new and the previous filtered signal is performed (see B.5.3 and B.5.4), so that a discrete, instantaneous motion is completed after a maximum of two convolution processing blocks (10.6 ms for 512 samples at 48 kHz sampling rate). Additional latency from head trackers, audio buffering, etc. is not considered.
The rotation of the HOA content is performed at a block boundary, resulting in a maximum latency of one processing block until a motion is completed.
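The quoted latency figures follow directly from the block size; a trivial check, using the 256-sample block size assumed above:

```python
def motion_latency_ms(block_size=256, sample_rate=48000):
    """Onset latency (one block) and completion latency (two blocks, incl. crossfade)."""
    onset = 1000.0 * block_size / sample_rate
    return onset, 2 * onset

print(motion_latency_ms())  # -> (5.33..., 10.66...) ms, quoted as 5.3 ms / 10.6 ms
```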
Annex C (informative):
Registration Information