4.1.1 Overview

3GPP TS 26.118 Release 17: Virtual Reality (VR) profiles for streaming applications

Virtual reality is a rendered version of a delivered visual and audio scene. The rendering is designed to mimic the visual and audio sensory stimuli of the real world as naturally as possible to an observer or user as they move within the limits defined by the application.

Virtual reality usually, but not necessarily, assumes that the user wears a head-mounted display (HMD), which completely replaces the user's field of view with a simulated visual component, and headphones, which provide the accompanying audio, as shown in Figure 4.1-1.

Figure 4.1-1: Reference System

Some form of head and motion tracking of the user in VR is usually also necessary, so that the simulated visual and audio components can be updated to ensure that, from the user's perspective, items and sound sources remain consistent with the user's movements. Sensors are typically able to track the user's pose in the reference system. Additional means to interact with the virtual reality simulation may be provided but are not strictly necessary.

VR users are expected to be able to look around from a single observation point in 3D space defined by either a producer or the position of one or multiple capturing devices. When VR media including video and audio is consumed with a head-mounted display or a smartphone, only the area of the spherical video that corresponds to the user’s viewport is rendered, as if the user were in the spot where the video and audio were captured.
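To illustrate why only the viewport area needs to be rendered, the following sketch maps a viewing direction to the corresponding pixel in an equirectangular spherical video frame. This is an illustrative example only, not part of the specification; the function name and the assumption of an equirectangular projection are the author's own.

```python
def viewport_center_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to the corresponding pixel in an
    equirectangular (spherical) video frame.

    yaw_deg:   look left/right, -180..180 (0 = frame centre)
    pitch_deg: look up/down, -90..90 (0 = horizon)
    """
    # Horizontal position: yaw spans the full 360-degree frame width.
    u = (yaw_deg + 180.0) / 360.0
    # Vertical position: pitch spans the 180-degree frame height (top = +90).
    v = (90.0 - pitch_deg) / 180.0
    return (int(u * (width - 1)), int(v * (height - 1)))

# Looking straight ahead in a 3840x1920 frame hits the frame centre.
print(viewport_center_pixel(0.0, 0.0, 3840, 1920))  # (1919, 959)
```

A renderer would crop and reproject the region of the frame around this pixel that corresponds to the display's field of view, leaving the rest of the sphere undecoded or unrendered.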

This ability to look around and listen from a centre point in 3D space is defined as 3 degrees of freedom (3DOF). With reference to Figure 4.1-1:

– tilting side to side on the X-axis is referred to as Rolling, also expressed as γ;

– tilting forward and backward on the Y-axis is referred to as Pitching, also expressed as β;

– turning left and right on the Z-axis is referred to as Yawing, also expressed as α.
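The three rotations above can be combined into a single orientation. The sketch below composes a rotation matrix from the yaw (α), pitch (β) and roll (γ) angles; the application order Rz·Ry·Rx and the axis conventions are assumptions for illustration and are not mandated by this clause.

```python
import math

def rotation_matrix(alpha, beta, gamma):
    """3x3 rotation matrix from yaw alpha (about Z), pitch beta (about Y)
    and roll gamma (about X), applied as R = Rz @ Ry @ Rx. Radians."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    rz = [[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]]   # yaw
    ry = [[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]]   # pitch
    rx = [[1, 0, 0], [0, cg, -sg], [0, sg, cg]]   # roll

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]

    return matmul(rz, matmul(ry, rx))

# A 90-degree yaw turns the forward (X) axis onto the Y axis.
fwd = [1.0, 0.0, 0.0]
r = rotation_matrix(math.pi / 2, 0.0, 0.0)
rotated = [sum(r[i][j] * fwd[j] for j in range(3)) for i in range(3)]
```

A renderer applies the inverse of this matrix to the scene (or the matrix itself to the viewing direction) each time the tracked pose changes.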

It is worth noting that this centre point is not necessarily static; it may be moving. Users or producers may also select among a few different observation points, but each observation point in 3D space permits the user only 3 degrees of freedom. For a full 3DOF VR experience, such video content may be combined with simultaneously captured audio, binaurally rendered with an appropriate Binaural Room Impulse Response (BRIR). The third relevant aspect is interactivity: only if the user's movements are reflected in the rendering without perceptible delay will the user perceive a fully immersive experience. For details on immersive rendering latencies, refer to TR 26.918 [2].