Y.2 Architecture and Interfaces


Definitions, reference and coordinate systems, video signal representation and audio signal representation as described in clause 4.1 of TS 26.118 [180] are applicable.

Figure Y.1 provides a possible sender architecture that produces the RTP streams containing 360-degree video and immersive speech/audio as applicable to an ITT4RT client in terminal. VR content acquisition includes capture of 360-degree video and immersive speech/audio, as well as other relevant content such as overlays. Following VR content pre-processing and encoding of 360-degree video and immersive speech/audio components, the corresponding elementary streams are generated. For 360-degree projected video, pre-processing may include video stitching, rotation or other translations, and the pre-processed 360-degree video is then passed into the projection functionality in order to map 360-degree video into 2D textures using a mathematically specified projection format. Optionally, the resulting projected video may be further mapped region-wise onto a packed video. For 360-degree fisheye video, circular videos captured by fisheye lenses are not stitched, but directly mapped onto a 2D texture, without the use of the projection and region-wise packing functionalities (as described in clause 4.3 of ISO/IEC 23090-2 [179]). In this case, pre-processing may include arranging the circular images captured by fisheye lenses onto 2D textures, and the functionality for projection and mapping is not needed. For audio, no stitching process is needed, since the captured signals are inherently immersive and omnidirectional. Following the HEVC/AVC encoding of the 2D textures and the EVS encoding of the immersive speech/audio, along with the relevant immersive media metadata (e.g., SEI messages), the resulting video and audio elementary streams are encapsulated into respective RTP streams and transmitted.
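As an illustration of the projection step for 360-degree projected video, the following minimal sketch maps a spherical sample direction (azimuth, elevation) to normalized equirectangular (ERP) texture coordinates. The function name and the normalization convention (azimuth increasing to the left, texture origin at the top-left) are assumptions for illustration only and are not defined by this clause.

```python
import math


def erp_project(azimuth_deg: float, elevation_deg: float) -> tuple[float, float]:
    """Map a sphere direction to normalized equirectangular texture coordinates.

    Assumed convention (illustrative only): azimuth in [-180, 180) increases to
    the left, elevation in [-90, 90] increases upward, and (u, v) lies in [0, 1]
    with the origin at the top-left of the 2D texture.
    """
    u = 0.5 - azimuth_deg / 360.0
    v = 0.5 - elevation_deg / 180.0
    return u, v


# Example: the front of the sphere (azimuth 0, elevation 0) lands in the
# middle of the ERP texture.
print(erp_project(0.0, 0.0))    # (0.5, 0.5)
print(erp_project(90.0, 45.0))  # (0.25, 0.25)
```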

Figure Y.1: Reference sender architecture for ITT4RT client in terminal

Figure Y.2 provides an overview of a possible receiver architecture that reconstructs the 360-degree video and immersive speech/audio in an ITT4RT client in terminal. Note that this figure does not represent an actual implementation, but a logical set of receiver functions. Based on one or more received RTP media streams, the UE parses, possibly decrypts and feeds the elementary video stream into the HEVC/AVC decoder and the speech/audio stream into the EVS decoder. The HEVC/AVC decoder obtains the decoder output signal, referred to as the "2D texture", as well as the decoder metadata. Likewise, the EVS decoder output signal contains the immersive speech/audio. The decoder metadata for video contains the Supplemental Enhancement Information (SEI) messages, i.e., information carried in the omnidirectional video specific SEI messages, to be used in the rendering phase. In particular, the decoder metadata may be used by the Texture-to-Sphere Mapping function to generate a 360-degree video (or part thereof) based on the decoded output signal, i.e., the texture. The viewport is then generated from the 360-degree video signal (or part thereof) by taking into account the pose information from sensors, display characteristics as well as possibly other metadata.
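The viewport generation described above can be pictured as sampling the decoded 2D texture once per viewport pixel, using the pose information to determine which sphere direction each pixel corresponds to. The sketch below inverts the equirectangular mapping for a given head pose; the pose representation, the field-of-view handling and the nearest-neighbour sampling are simplifying assumptions for illustration, not an implementation of the rendering process defined elsewhere.

```python
import math


def render_viewport(texture, pose_yaw_deg, pose_pitch_deg,
                    fov_deg=90.0, out_w=640, out_h=640):
    """Render a rectilinear viewport from an equirectangular texture.

    texture: 2D list (rows of pixel values) holding the decoded ERP picture.
    pose_yaw_deg / pose_pitch_deg: viewing direction reported by the sensors.
    Returns a 2D list of sampled pixel values (nearest-neighbour, no filtering).
    """
    tex_h, tex_w = len(texture), len(texture[0])
    f = 0.5 * out_w / math.tan(math.radians(fov_deg) / 2)  # pinhole focal length
    yaw, pitch = math.radians(pose_yaw_deg), math.radians(pose_pitch_deg)
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            # Direction of this viewport pixel in camera coordinates.
            dx, dy, dz = x - out_w / 2, out_h / 2 - y, f
            # Rotate by pitch (looking up/down), then by yaw (looking left/right).
            dy, dz = (dy * math.cos(pitch) + dz * math.sin(pitch),
                      -dy * math.sin(pitch) + dz * math.cos(pitch))
            dx, dz = (dx * math.cos(yaw) - dz * math.sin(yaw),
                      dx * math.sin(yaw) + dz * math.cos(yaw))
            # Convert the direction to sphere angles, then to ERP coordinates
            # (same illustrative convention as the sender-side sketch above).
            azimuth = math.atan2(-dx, dz)
            elevation = math.atan2(dy, math.hypot(dx, dz))
            u = 0.5 - azimuth / (2 * math.pi)
            v = 0.5 - elevation / math.pi
            out[y][x] = texture[min(int(v * tex_h), tex_h - 1)][min(int(u * tex_w), tex_w - 1)]
    return out
```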

For 360-degree video, the following components are applicable:

– The RTP stream contains an HEVC or an AVC bitstream with omnidirectional video specific SEI messages. In particular, the omnidirectional video specific SEI messages as defined in ISO/IEC 23008-2 [119] and ISO/IEC 14496-10 [24] may be present; a sketch of locating such SEI messages in the bitstream follows this list.

– The video elementary stream(s) are encoded following the requirements in clause Y.3.
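As an illustration of the first item above, the sketch below scans an Annex B HEVC elementary stream for prefix SEI NAL units, which is where the omnidirectional video specific SEI messages would be carried. The function names are assumptions for illustration; the actual SEI payload types and syntax are defined in ISO/IEC 23008-2 [119] and are not parsed here.

```python
PREFIX_SEI_NUT = 39   # HEVC prefix SEI NAL unit type
SUFFIX_SEI_NUT = 40   # HEVC suffix SEI NAL unit type


def iter_hevc_nal_units(bitstream: bytes):
    """Yield (nal_unit_type, payload) for each NAL unit in an Annex B HEVC stream.

    Simplified: trailing zero bytes from 4-byte start codes are not trimmed.
    """
    starts, i = [], 0
    while True:
        i = bitstream.find(b"\x00\x00\x01", i)
        if i < 0:
            break
        starts.append(i + 3)
        i += 3
    for k, start in enumerate(starts):
        end = (starts[k + 1] - 3) if k + 1 < len(starts) else len(bitstream)
        nal = bitstream[start:end]
        if len(nal) >= 2:
            nal_unit_type = (nal[0] >> 1) & 0x3F  # 6-bit type field of the 2-byte header
            yield nal_unit_type, nal[2:]


def contains_prefix_sei(bitstream: bytes) -> bool:
    """True if the elementary stream carries any prefix SEI NAL units."""
    return any(t == PREFIX_SEI_NUT for t, _ in iter_hevc_nal_units(bitstream))
```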

Figure Y.2: Reference receiver architecture for ITT4RT client in terminal

The output signal, i.e., the decoded picture or "texture", is then rendered using the Decoder Metadata information in the relevant SEI messages contained in the video elementary streams, as well as the relevant information signalled at the RTP/RTCP level (in the viewport-dependent case). The Decoder Metadata is used when performing rendering operations such as region-wise unpacking, projection de-mapping and rotation for 360-degree projected video, or fisheye video processing based on the fisheye video information for 360-degree fisheye video, toward creating spherical content for each eye. Details of such sample location remapping operations are described in clause D.3.41.7 of ISO/IEC 23008-2 [119].
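As a simplified illustration of the region-wise unpacking step, the sketch below copies each packed region of the decoded texture back to its position in the projected picture. The region description is a deliberately reduced, hypothetical subset of the region-wise packing metadata (no transform types, guard bands or per-region resampling), so it only conveys the general idea rather than the normative process.

```python
from dataclasses import dataclass


@dataclass
class PackedRegion:
    """Hypothetical, reduced description of one packed region (illustration only)."""
    packed_x: int   # top-left position in the decoded (packed) picture
    packed_y: int
    proj_x: int     # top-left position in the reconstructed projected picture
    proj_y: int
    width: int      # region size; real region-wise packing also allows resampling
    height: int


def unpack_regions(packed, proj_w, proj_h, regions):
    """Rebuild the projected picture from a packed picture, region by region.

    packed: 2D list of pixel values (the decoder output / "texture").
    regions: list of PackedRegion entries derived from the Decoder Metadata.
    """
    projected = [[0] * proj_w for _ in range(proj_h)]
    for r in regions:
        for dy in range(r.height):
            for dx in range(r.width):
                projected[r.proj_y + dy][r.proj_x + dx] = \
                    packed[r.packed_y + dy][r.packed_x + dx]
    return projected
```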

Viewport-dependent 360-degree video processing could be supported for both point-to-point conversational sessions and multiparty conferencing scenarios. It is achieved by the ITT4RT-Rx client sending RTCP feedback messages with viewport information, and by the ITT4RT-Tx client or the ITT4RT-MRF then encoding and sending the corresponding viewport. This is expected to deliver higher resolutions for the desired viewport than the viewport-independent approach. Since the video generated, encoded and streamed by the ITT4RT-Tx client may cover a larger area than the desired viewport, the RTP stream transmitted by the ITT4RT-Tx client or the ITT4RT-MRF may also include information on the region of the 360-degree video that is encoded in higher quality. Viewport-dependent processing is realized via RTP/RTCP based protocols that are supported by ITT4RT clients. The use of RTP/RTCP based protocols for viewport-dependent processing is further described in clause Y.7.2.
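The feedback loop described above can be sketched as follows. The message layout, field names and the sender-side region selection are purely hypothetical simplifications for illustration; the actual RTCP viewport feedback format and its negotiation are specified in clause Y.7.2.

```python
from dataclasses import dataclass


@dataclass
class ViewportFeedback:
    """Hypothetical viewport report sent by the ITT4RT-Rx client (illustration only)."""
    azimuth_deg: float         # viewport centre azimuth
    elevation_deg: float       # viewport centre elevation
    azimuth_range_deg: float   # horizontal extent of the viewport
    elevation_range_deg: float # vertical extent of the viewport


def select_high_quality_region(fb: ViewportFeedback, margin_deg: float = 20.0):
    """Sender-side sketch: choose the sphere region to encode in higher quality.

    The region covers the reported viewport plus a margin, since the encoded
    and streamed video may cover a larger area than the desired viewport.
    """
    return {
        "centre_azimuth": fb.azimuth_deg,
        "centre_elevation": fb.elevation_deg,
        "azimuth_range": fb.azimuth_range_deg + 2 * margin_deg,
        "elevation_range": fb.elevation_range_deg + 2 * margin_deg,
    }


# Example: the receiver reports a 90x90 degree viewport looking slightly left
# and up; the sender pads it before encoding that region at higher quality.
print(select_high_quality_region(ViewportFeedback(30.0, 10.0, 90.0, 90.0)))
```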