6 Audio


6.1 Audio Operation Points

6.1.1 Definition of Operation Point

For the purpose of defining interfaces to a conforming audio decoder, audio operation points are defined. In this context, the following definitions hold:

–	Operation Point: A collection of discrete combinations of content formats, VR-specific rendering metadata, and the encoding format.

–	Receiver: A receiver that can decode and render any Bitstream conforming to a certain Operation Point.

–	Bitstream: An audio bitstream that conforms to an audio format.

Figure 6.1-1: Audio Operation Points

This clause focuses on the interoperability point to a media decoder as indicated in Figure 6.1-1. This clause does not deal with the access engine and file parser, which address how the audio bitstream is delivered.

In all audio operation points, the VR Presentation can be rendered using a single media decoder which provides decoded PCM signals and rendering metadata to the audio renderer.

6.1.2 Parameters of Audio Operation Point

This clause defines the potential parameters of Audio Operation Points. This includes the detailed audio decoder requirements and audio rendering metadata. The requirements are defined from the perspective of the audio decoder and renderer.

Parameters for an Audio Operation Point include the following (an informative sketch of such a parameter set follows the list):

– the audio decoder that the bitstream needs to conform to,

– the mandated or permitted rendering data that is included in the audio bitstream.
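As an informative illustration only, a receiver implementation might represent these two parameters internally as sketched below. The class and field names are hypothetical and not part of this specification.

    from dataclasses import dataclass

    # Hypothetical internal representation of an Audio Operation Point.
    # Nothing here is normative; it merely mirrors the two parameters above.
    @dataclass(frozen=True)
    class AudioOperationPoint:
        name: str                            # e.g. "3GPP MPEG-H Audio"
        codec: str                           # decoder the bitstream needs to conform to
        configuration: str                   # profile/level constraints
        max_sampling_rate_hz: int
        rendering_metadata: tuple[str, ...]  # mandated/permitted rendering data

    MPEGH_OP = AudioOperationPoint(
        name="3GPP MPEG-H Audio",
        codec="MPEG-H 3D Audio",
        configuration="Low Complexity Profile, Level 1, 2 or 3",
        max_sampling_rate_hz=48_000,
        rendering_metadata=("scene displacement data", "loudness/DRC metadata"),
    )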

6.1.3 Summary of Audio Operation Points

Table 6.1-1 provides an informative overview of the Audio Operation Points. The detailed, normative specification of each Audio Operation Point is provided in the referenced clause.

Table 6.1-1: Overview of OMAF operation points for audio (informative)

Operation Point   | Codec        | Configuration                           | Max Sampling Rate | Clause
------------------|--------------|-----------------------------------------|-------------------|-------
3GPP MPEG-H Audio | MPEG-H Audio | Low Complexity Profile, Level 1, 2 or 3 | 48 kHz            | 6.1.4

6.1.4 3GPP MPEG-H Audio Operation Point

6.1.4.1 Overview

The 3GPP MPEG-H Audio Operation Point fulfils the requirements to support 3D audio and is specified in ISO/IEC 23090-2 [12], clause 10.2.2. Channels, Objects and First/Higher-Order Ambisonics (FOA/HOA) are supported, as well as combinations of those. The Operation Point is based on MPEG-H 3D Audio [19].

A bitstream conforming to the 3GPP MPEG-H Audio Operation Point shall conform to the requirements in clause 6.1.4.2.

A receiver conforming to the 3GPP MPEG-H Audio Operation Point shall support decoding and rendering a Bitstream conforming to the 3GPP MPEG-H Audio Operation Point. Detailed receiver requirements are provided in clause 6.1.4.3.

6.1.4.2 Bitstream requirements

The audio stream shall comply with the MPEG-H 3D Audio Low Complexity (LC) Profile, Level 1, 2 or 3, as defined in ISO/IEC 23008-3 [19], clause 4.8. The values of mpegh3daProfileLevelIndication for LC Profile Levels 1, 2 and 3 are "0x0B", "0x0C" and "0x0D", respectively, as specified in ISO/IEC 23008-3 [19], clause 5.3.2.
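As an informative illustration, a receiver could map the signalled indication onto the permitted levels as sketched below. The indication values are those quoted above; the function itself is hypothetical.

    # mpegh3daProfileLevelIndication values permitted by this operation point
    # (ISO/IEC 23008-3 [19], clause 5.3.2).
    LC_PROFILE_LEVELS = {
        0x0B: "LC Profile Level 1",
        0x0C: "LC Profile Level 2",
        0x0D: "LC Profile Level 3",
    }

    def check_profile_level(indication: int) -> str:
        """Return the LC Profile Level name, or raise if the bitstream is
        outside the 3GPP MPEG-H Audio Operation Point."""
        if indication not in LC_PROFILE_LEVELS:
            raise ValueError(f"indication 0x{indication:02X} not permitted")
        return LC_PROFILE_LEVELS[indication]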

Audio encapsulation shall be done according to ISO/IEC 23090-2 [12], clause 10.2.2.2.

All Low Complexity Profile and Levels restrictions specified in ISO/IEC 23008-3 [19], clause 4.8.2 shall apply. The constraints on input and output configurations are provided in Table 3 — "Levels and their corresponding restrictions for the Low Complexity Profile" of ISO/IEC 23008-3 [19]. This includes the following for Low Complexity Profile Level 3 (an informative validation sketch follows the list):

–	Maximum number of core coded channels (in compressed data stream): 32,

–	Maximum number of decoder processed core channels: 16,

–	Maximum number of loudspeaker output channels: 12,

–	Maximum number of decoded objects: 16,

–	Maximum HOA order: 6.
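A minimal sketch of how a receiver might validate a signalled configuration against these Level 3 limits. The data structure and field names are illustrative; the numeric limits are those listed above.

    from dataclasses import dataclass

    # LC Profile Level 3 limits as listed above (ISO/IEC 23008-3 [19], Table 3).
    LEVEL3_LIMITS = {
        "core_coded_channels": 32,
        "decoder_processed_core_channels": 16,
        "loudspeaker_output_channels": 12,
        "decoded_objects": 16,
        "hoa_order": 6,
    }

    @dataclass
    class StreamConfig:
        core_coded_channels: int
        decoder_processed_core_channels: int
        loudspeaker_output_channels: int
        decoded_objects: int
        hoa_order: int

    def conforms_to_lc_level3(cfg: StreamConfig) -> bool:
        """True if every signalled value is within the Level 3 limit."""
        return all(getattr(cfg, name) <= limit
                   for name, limit in LEVEL3_LIMITS.items())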

MPEG-H Audio sync samples contain Immediate Playout Frames (IPFs), as specified in ISO/IEC 23008-3 [19], clause 20.2, and shall follow the requirements specified in ISO/IEC 23090-2 [12], clause 10.2.2.3.1.

6.1.4.3 Receiver requirements

6.1.4.3.1 General

A receiver supporting the 3GPP MPEG-H Audio Operation Point shall fulfil all requirements specified in the remainder of clause 6.1.4.3.

6.1.4.3.2 Decoding process

The receiver shall be capable of decoding MPEG-H Audio LC Profile Level 1, Level 2 and Level 3 bitstreams as specified in ISO/IEC 23008-3 [19], clause 4.8, with the following relaxations:

– The Immersive Renderer defined in ISO/IEC 23008-3 [19], clause 11 is optional.

– The carriage of generic data defined in ISO/IEC 23008-3 [19], clause 14.7 is optional and thus MHAS packets of the type PACTYP_GENDATA are optional and the decoder may ignore packets of this type.

The decoder shall read and process MHAS packets of the following types in accordance with ISO/IEC 23008-3 [19], clause 14:

PACTYP_SYNC,

PACTYP_MPEGH3DACFG,

PACTYP_AUDIOSCENEINFO,

PACTYP_AUDIOTRUNCATION,

PACTYP_MPEGH3DAFRAME,

PACTYP_USERINTERACTION,

PACTYP_LOUDNESS_DRC,

PACTYP_EARCON,

PACTYP_PCMCONFIG, and

PACTYP_PCMDATA.

The decoder may read and process MHAS packets of the following types:

PACTYP_SYNCGAP,

PACTYP_BUFFERINFO,

PACTYP_MARKER and

PACTYP_DESCRIPTOR.

Other MHAS packets may be present in an MHAS elementary stream, but may be ignored.
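The resulting packet-handling policy can be sketched as follows. The parsing of the MHAS stream and the handler callables are hypothetical; only the packet-type names and the shall/may/ignore split are taken from this clause.

    # Packet types the decoder shall read and process (this clause).
    REQUIRED_PACKETS = {
        "PACTYP_SYNC", "PACTYP_MPEGH3DACFG", "PACTYP_AUDIOSCENEINFO",
        "PACTYP_AUDIOTRUNCATION", "PACTYP_MPEGH3DAFRAME",
        "PACTYP_USERINTERACTION", "PACTYP_LOUDNESS_DRC", "PACTYP_EARCON",
        "PACTYP_PCMCONFIG", "PACTYP_PCMDATA",
    }
    # Packet types the decoder may read and process.
    OPTIONAL_PACKETS = {
        "PACTYP_SYNCGAP", "PACTYP_BUFFERINFO",
        "PACTYP_MARKER", "PACTYP_DESCRIPTOR",
    }

    def dispatch_mhas_packet(packet_type: str, handlers: dict) -> None:
        """Route one MHAS packet according to the policy above."""
        if packet_type in REQUIRED_PACKETS:
            handlers[packet_type]()                # shall be processed
        elif packet_type in OPTIONAL_PACKETS and packet_type in handlers:
            handlers[packet_type]()                # may be processed
        # Any other packet type may be ignored.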

The Earcon metadata shall be processed and applied as described in ISO/IEC 23008-3 [19], clause 28.

6.1.4.3.3 Random Access

The audio decoder is able to start decoding a new audio stream at every random access point (RAP). As defined in clause 6.1.4.2, the sync sample (RAP) contains the configuration information (PACTYP_MPEGH3DACFG and PACTYP_AUDIOSCENEINFO) that is used to initialize the audio decoder. After initialization, the audio decoder reads encoded audio frames (PACTYP_MPEGH3DAFRAME) and decodes them.

To optimize startup delay at random access, the information from the MHAS PACTYP_BUFFERINFO packet should be taken into account. The input buffer should be filled at least to the state indicated in the MHAS PACTYP_BUFFERINFO packet before starting to decode audio frames.

NOTE 1: It may be necessary to feed several audio frames into the decoder before the first decoded PCM output buffer is available, as described in ISO/IEC 23008-3 [19], clause 5.5.6.3 and clause 22.

It is recommended that, at random access into an audio stream, the receiving device performs a 100 ms fade-in on the first PCM output buffer that it receives from the audio decoder.

NOTE 2: The MPEG-H 3D Audio Codec can output the original input samples without any inherent fade-in behavior. Thus, the receiving device needs to appropriately handle potential signal discontinuities, resulting from the original input signal, by fading in at random access into an audio stream.
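The recommended fade-in can be sketched as a gain ramp over the first 100 ms of the first PCM output buffer. The linear ramp shape and the function below are illustrative; only the 100 ms duration comes from the recommendation above.

    def fade_in_first_buffer(pcm, sample_rate_hz=48_000, fade_ms=100.0):
        """Apply a linear fade-in over the first 100 ms of one channel of
        the first PCM buffer delivered after random access (sketch)."""
        n_fade = min(len(pcm), int(sample_rate_hz * fade_ms / 1000.0))
        return [s * (i / n_fade) if i < n_fade else s
                for i, s in enumerate(pcm)]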

6.1.4.3.4 Configuration change

If the decoder receives an MHAS stream that contains a configuration change, the decoder shall perform a configuration change according to ISO/IEC 23008-3 [19], clause 5.5.6. The configuration change can, for instance, be detected through the change of the MHASPacketLabel of the packet PACTYP_MPEGH3DACFG compared to the value of the MHASPacketLabel of previous MHAS packets.
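The label-based detection mentioned above can be sketched as follows. The packet object is hypothetical; the field names (MHASPacketLabel, PACTYP_MPEGH3DACFG) are those of ISO/IEC 23008-3 [19].

    def is_configuration_change(packet, previous_label):
        """Detect a configuration change: a PACTYP_MPEGH3DACFG packet
        whose MHASPacketLabel differs from that of previous packets."""
        return (packet.packet_type == "PACTYP_MPEGH3DACFG"
                and previous_label is not None
                and packet.packet_label != previous_label)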

If MHAS packets of type PACTYP_AUDIOTRUNCATION are present, they shall be used as described in ISO/IEC 23008‑3 [19], clause 14.

The Access Unit that contains the configuration change and the last Access Unit before the configuration change may contain a truncation message (PACTYP_AUDIOTRUNCATION) as defined in ISO/IEC 23008-3 [19], clause 14. The MHAS packet of type PACTYP_AUDIOTRUNCATION enables synchronization between video and audio elementary streams at program boundaries. When used, sample-accurate splicing and reconfiguration of the audio stream are possible.

6.1.4.3.5 MPEG-H Multi-stream Audio

The receiver shall be capable of simultaneously receiving at least 3 MHAS streams. The MHAS streams can be simultaneously decoded or combined into a single stream prior to the decoder, by utilizing the field mae_bsMetaDataElementIDoffset in the Audio Scene Information as described in ISO/IEC 23008-3 [19], clause 14.6.
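As an illustration of why the ID offset is needed, a sketch of combining several streams into one ID space before the decoder follows. The data layout is hypothetical; the normative merging behaviour is defined in ISO/IEC 23008-3 [19], clause 14.6.

    def merge_audio_scenes(streams):
        """Combine the metadata element IDs of several MHAS streams into
        one ID space by shifting each stream's IDs, which is the role of
        mae_bsMetaDataElementIDoffset (sketch only)."""
        merged, offset = [], 0
        for element_ids in streams:           # one ID list per MHAS stream
            merged.extend(eid + offset for eid in element_ids)
            offset += len(element_ids)
        return merged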

6.1.4.3.6 Rendering requirements

6.1.4.3.6.1 General

The 3GPP MPEG-H Audio Operation Point builds on the MPEG-H 3D Audio codec, which includes rendering to loudspeakers and binaural rendering, and also provides an interface for external rendering. Legacy binaural rendering using fixed loudspeaker setups can be supported by using loudspeaker feeds as output of the decoder.

6.1.4.3.6.2 Rendering to Loudspeakers

Rendering to loudspeakers shall be done according to ISO/IEC 23008-3 [19] using the interface for local loudspeaker setup and rendering as defined in ISO/IEC 23008-3 [19], clause 17.3.

NOTE: ISO/IEC 23008-3 [19] specifies rendering to predefined loudspeaker setups as well as rendering to arbitrary setups.

6.1.4.3.6.3 Binaural Rendering of MPEG-H 3D Audio

6.1.4.3.6.3.1 General

MPEG-H 3D Audio specifies methods for binauralizing the presentation of immersive content for playback via headphones, as is needed for omnidirectional media presentations. MPEG-H 3D Audio specifies a normative interface for the user’s viewing orientation and permits low-complexity, low-latency rendering of the audio scene to any user orientation.

The binaural rendering of MPEG-H 3D Audio shall be applied as described in ISO/IEC 23008-3 [19], clause 13 according to the Low Complexity Profile and Levels restrictions for binaural rendering specified in ISO/IEC 23008-3 [19], clause 4.8.2.2.

6.1.4.3.6.3.2 Head Tracking Interface

For binaural rendering using head tracking, the useTrackingMode flag in the BinauralRendering() syntax element shall be set to 1, as described in ISO/IEC 23008-3 [19], clause 17.4. This flag indicates that a tracker device is connected and that the binaural rendering shall be processed in a special head-tracking mode, using the scene displacement values (yaw, pitch and roll).

The values for the scene displacement data shall be sent using the interface for scene displacement data specified in ISO/IEC 23008-3 [19], clause 17.9. The syntax of the mpegh3daSceneDisplacementData() interface provided in ISO/IEC 23008-3 [19], clause 17.9.3 shall be used.
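A sketch of a helper that packages head-tracker angles for this interface. The dictionary keys and the wrapping/clipping convention shown are assumptions for illustration; the normative coding is defined by mpegh3daSceneDisplacementData() itself.

    def scene_displacement_payload(yaw_deg, pitch_deg, roll_deg):
        """Package tracker angles (degrees) for the scene displacement
        interface (illustrative only)."""
        def wrap(angle):
            return ((angle + 180.0) % 360.0) - 180.0    # map into [-180, 180)
        return {
            "yaw": wrap(yaw_deg),
            "pitch": max(-90.0, min(90.0, pitch_deg)),  # clip to [-90, 90]
            "roll": wrap(roll_deg),
        }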

6.1.4.3.6.3.3 Signaling and processing of diegetic and non-diegetic audio

The metadata flag fixedPosition in SignalGroupInformation() indicates whether the corresponding audio signals are updated during the processing of scene-displacement angles. If the flag is equal to 1, the positions of the corresponding audio signals are not updated during the processing of scene-displacement angles.

Channel groups for which the flag gca_directHeadphone is set to "1" in the mpegh3da_getChannelMetadata() syntax element are routed directly to the left and right output channels and are excluded from binaural rendering using scene displacement data (non-diegetic content). Non-diegetic content may have stereo or mono format. For mono, the signal is mixed to the left and right headphone channels with a gain factor of 0.707.
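A minimal sketch of the mono routing rule. The gain factor 0.707 is approximately 1/√2 (about -3 dB), so the overall power of the phantom-centre mix is preserved; the function name is illustrative.

    def route_non_diegetic_mono(mono):
        """Mix one mono non-diegetic signal to the left and right
        headphone channels with a gain of 0.707 (~1/sqrt(2))."""
        gain = 0.707
        left = [gain * s for s in mono]
        right = [gain * s for s in mono]
        return left, right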

6.1.4.3.6.3.4 HRIR/BRIR Interface processing

The interface for binaural room impulse responses (BRIRs) specified in ISO/IEC 23008-3 [19], clause 17.4 shall be used for external BRIRs and HRIRs. The HRIR/BRIR data for the binaural rendering can be fed to the decoder by using the syntax element BinauralRendering(). The number of BRIR/HRIR pairs in each BRIR/HRIR set shall correspond to the number indicated in the relevant level-dependent row in Table 9 – "The binaural restrictions for the LC profile" of ISO/IEC 23008-3 [19] according to the Low Complexity Profile and Levels restrictions in ISO/IEC 23008‑3 [19], clause 4.8.2.2.

The measured BRIR positions are passed to mpegh3daLocalSetupInformation(), as specified in ISO/IEC 23008-3 [19], clause 4.8.2.2. Thus, all renderer stages are set to the target layout that is equal to the transmitted channel configuration. As one BRIR is available per regular input channel, the Format Converter can operate as a pass-through when regular input channel positions are used. Preferably, the BRIR measurement positions for the standard target layouts 2.0, 5.1, 10.2 and 7.1.4 should be provided.

6.1.4.3.6.4 Rendering with External Binaural Renderer

MPEG-H 3DA provides output interfaces for the delivery of un-rendered channels, objects and HOA content and associated metadata, as specified in clause 6.1.4.3.6.5. External binaural renderers can connect to this interface, e.g. for playback of head-tracked audio via headphones. An example of such an external binaural renderer that connects to the external rendering interface of MPEG-H 3DA is specified in Annex B.

6.1.4.3.6.5 External Renderer Interface

ISO/IEC 23008-3 [19], clause 17.10 specifies the output interfaces for the delivery of un-rendered channels, objects, and HOA content and associated metadata. For connecting to external renderers, a receiver shall implement the interfaces for object output, channel output and HOA output as specified in ISO/IEC 23008-3 [19], clause 17.10, including the additional specification of production metadata defined in ISO/IEC 23008-3 [19], clause 27. Any external renderer should apply the metadata provided in this interface and related audio data in the same manner as if MPEG-H internal rendering is applied:

– Correct handling of loudness-related metadata in particular with the aim of preserving intended target loudness

– Preserving artistic intent, such as applying transmitted Downmix and HOA Rendering matrices correctly

– Rendering spatial attributes of objects appropriately (position, spatial extent, etc.)

NOTE: The external example binaural renderer in Annex B only handles a subset of the parameters to illustrate the use of the output interface. Alternative external binaural renderers are expected to apply and handle the metadata provided in this interface and related audio data in the same manner as if internal rendering is applied.

In this interface, the PCM data of the channels and objects interfaces is provided through the decoder PCM buffer, which first contains the regular rendered PCM signals (e.g. 12 signals for a 7.1+4 setup). Subsequently, additional signals carry the PCM data of the originally transmitted channel representation. These are followed by signals carrying the PCM data of the un-rendered output objects. Then, additional signals carry the HOA audio PCM data, the number of which is indicated in the HOA metadata interface via the HOA order (e.g. 16 signals for HOA order 3). The HOA audio PCM data in the HOA output interface is provided in the so-called Equivalent Spatial Domain (ESD) representation. The conversion from the HOA domain into the ESD representation and vice versa is described in ISO/IEC 23008-3 [19], Annex C.5.1.
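The ordering described above can be sketched as an index computation. The function is illustrative, but the ordering (rendered feeds, transmitted channels, un-rendered objects, HOA in ESD) and the (N+1)² signal count for HOA order N follow from this clause.

    def pcm_buffer_layout(rendered, channels, objects, hoa_order):
        """Return [start, end) signal index ranges in the decoder PCM
        buffer for the external renderer interface (sketch)."""
        hoa_signals = (hoa_order + 1) ** 2     # e.g. 16 signals for order 3
        layout, pos = {}, 0
        for name, count in (("rendered", rendered),
                            ("transmitted_channels", channels),
                            ("unrendered_objects", objects),
                            ("hoa_esd", hoa_signals)):
            layout[name] = (pos, pos + count)
            pos += count
        return layout

    # Example: 7.1+4 rendered output (12 signals), 2 transmitted channels,
    # 4 objects and HOA order 3 -> HOA ESD occupies signals 18..33.
    example = pcm_buffer_layout(12, 2, 4, 3)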

The metadata for channels, objects and HOA is available once per frame; its syntax is specified in mpegh3da_getChannelMetadata(), mpegh3da_getObjectAudioAndMetadata() and mpegh3da_getHoaMetadata(), respectively. The metadata and PCM data shall be aligned such that an external renderer can match each metadata element with the respective PCM frame.

6.2 Audio Media Profiles

6.2.1 Introduction and Overview

This clause defines the media profiles for audio. Media profiles include specifications of the following:

– Elementary stream constraints based on the audio operation points defined in clause 6.1.

–	File format encapsulation constraints and signalling, including capability signalling. This defines a 3GPP VR Track as defined above.

–	DASH Adaptation Set constraints and signalling, including capability signalling. This defines a DASH content format profile.

Table 6.2-1 provides an overview of the Media Profiles defined in the remainder of clause 6.2.

Table 6.2-1 Audio Media Profiles

Media Profile                        | Operation Point                   | Sample Entry | DASH Integration
-------------------------------------|-----------------------------------|--------------|------------------
OMAF 3D Audio Baseline Media Profile | 3GPP MPEG-H Audio Operation Point | mhm1, mhm2   |

6.2.2 OMAF 3D Audio Baseline Media Profile

6.2.2.1 Overview

MPEG-H 3D Audio [19] specifies the coding of immersive audio material and the storage of the coded representation in an ISO BMFF track. The MPEG-H 3D Audio decoder has a constant latency, see Table 1 — "MPEG-H 3DA functional blocks and internal processing domain" of ISO/IEC 23008-3 [19]. With this information, content authors can synchronize the audio and video portions of a media presentation, e.g. ensuring lip-sync.
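For illustration, a constant decoder latency translates into a fixed presentation-time offset. The helper and the 3072-sample figure below are hypothetical examples; the actual latency values are given by Table 1 of ISO/IEC 23008-3 [19].

    def audio_presentation_offset_ms(latency_samples, sample_rate_hz=48_000):
        """Convert a constant decoder latency in samples into the fixed
        presentation-time offset (ms) used to keep audio and video in sync."""
        return 1000.0 * latency_samples / sample_rate_hz

    # e.g. a hypothetical latency of 3072 samples at 48 kHz is 64 ms
    offset_ms = audio_presentation_offset_ms(3072)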

ISO BMFF integration for this profile is provided following the requirements and recommendations in ISO/IEC 23090-2 [12], clause 10.2.2.3.

6.2.2.2 File Format Signaling and Encapsulation

6.2.2.2.1 General

3GPP VR Tracks conforming to this media profile and used in the context of this specification shall conform to the ISO BMFF [17] with the following further requirements:

–	The audio track shall comply with the Bitstream requirements and recommendations for the Operation Point as defined in clause 6.1.4.

– The sample entry ‘mhm1’ shall be used for encapsulation of MHAS packets into ISO BMFF files, per ISO/IEC 23008‑3 [19], clause 20.6.

– All ISO Base Media File Format constraints specified in ISO/IEC 23090-2 [12], clause 10.2.2.3 shall apply.

– ISO BMFF Tracks shall be encoded following the requirements in ISO/IEC 23090-2 [12], clause 10.2.2.3.1.

6.2.2.2.2 Configuration change constraints

A configuration change takes place in an audio stream when the content setup or the Audio Scene Information changes (e.g., when changes occur in the channel layout, the number of objects etc.), and therefore new PACTYP_MPEGH3DACFG and PACTYP_AUDIOSCENEINFO packets are required upon such occurrences. A configuration change usually happens at program boundaries, but it may also occur within a program.

Configuration change constraints specified in ISO/IEC 23090-2 [12], clause 10.2.2.3.2 shall apply.

6.2.2.3 Multi-stream constraints

The multi-stream-enabled MPEG‑H Audio System is capable of handling Audio Programme Components delivered in several different elementary streams (e.g., the main MHAS stream containing one complete audio main, and one or more auxiliary MHAS streams, containing different languages and audio descriptions). The MPEG-H Audio Metadata information (MAE) allows the MPEG‑H Audio Decoder to correctly decode several MHAS streams.

The sample entry ‘mhm2’ shall be used in cases of multi-stream delivery, i.e., the MPEG‑H Audio Scene is split into two or more streams for delivery as described in ISO/IEC 23008-3 [19], clause 14.6. All constraints for file formats using the sample entry ‘mhm2’ specified in ISO/IEC 23090-2 [12], clause 10.2.2.3.3 shall apply.

6.2.2.3a Additional Restrictions for DASH Representations

DASH Integration is provided following the requirements and recommendations in ISO/IEC 23090-2 [12], clause B.2.1. All constraints in ISO/IEC 23090-2 [12], clause B.2.1 shall apply.

6.2.2.4 DASH Adaptation Set Constraints

6.2.2.4.1 General

An instantiation of the OMAF 3D Audio Baseline Media Profile in DASH should be represented as one Adaptation Set. If so, the Adaptation Set should provide the signalling according to ISO/IEC 23090-2 [12] and ISO/IEC 23008-3 [19], clause 21, as shown in Table 6.2-2.

Table 6.2-2: MPEG-H Audio MIME parameter according to RFC 6381 and ISO/IEC 23008-3

Codec                                         | MIME type | codecs parameter | profiles | ISO BMFF Encapsulation
----------------------------------------------|-----------|------------------|----------|-----------------------
MPEG-H Audio LC Profile Level 1               | audio/mp4 | mhm1.0x0B        | ‘oabl’   | ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 2               | audio/mp4 | mhm1.0x0C        | ‘oabl’   | ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 3               | audio/mp4 | mhm1.0x0D        | ‘oabl’   | ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 1, multi-stream | audio/mp4 | mhm2.0x0B        | ‘oabl’   | ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 2, multi-stream | audio/mp4 | mhm2.0x0C        | ‘oabl’   | ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 3, multi-stream | audio/mp4 | mhm2.0x0D        | ‘oabl’   | ISO/IEC 23008-3

Mapping of relevant MPD elements and attributes to MPEG-H Audio as well as the Preselection Element and Preselection descriptor are specified in ISO/IEC 23090-2 [12], clause B.2.1.2.
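The codecs entries in Table 6.2-2 follow a simple pattern, sketched below. The helper function is hypothetical; the sample entries and level indications are those of this media profile.

    # mpegh3daProfileLevelIndication per LC level (see clause 6.1.4.2).
    LEVEL_INDICATION = {1: 0x0B, 2: 0x0C, 3: 0x0D}

    def mpegh_codecs_param(level, multi_stream=False):
        """Build the RFC 6381 codecs parameter of Table 6.2-2, e.g.
        mpegh_codecs_param(1) -> 'mhm1.0x0B',
        mpegh_codecs_param(3, multi_stream=True) -> 'mhm2.0x0D'."""
        sample_entry = "mhm2" if multi_stream else "mhm1"
        return f"{sample_entry}.0x{LEVEL_INDICATION[level]:02X}"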

6.2.2.4.2 DASH Adaptive Bitrate Switching

MPEG-H 3D Audio enables seamless bitrate switching in a DASH environment with different Representations (i.e., bitstreams encoded at different bitrates) of the same content, i.e., Representations that are part of the same Adaptation Set.

If the decoder receives a DASH Segment of another Representation of the same Adaptation Set, the decoder shall perform an adaptive switch according to ISO/IEC 23008-3 [19], clause 5.5.6.