7 Transport

3GPP TS 26.142, Dynamic and Interactive Multimedia Scenes (DIMS), Release 17

7.1 Overview

The transport mechanisms support rich media delivery in the following modes: unicast download (HTTP/TCP [12] or MMS [6] protocol), broadcast/multicast download (FLUTE/UDP [24]), unicast streaming, and broadcast/multicast streaming (RTP/UDP [15]). For download mode, reliability is guaranteed by existing mechanisms in the transport and network layers, and no error resilience tools need to be designed at the application layer for rich media delivery. Rich media transport in streaming mode is more challenging because UDP is unreliable. Therefore, the RTP design provides some error resilience tools to help the media decoder cope with unreliable transport.

Rich media is a combination of continuous media and discrete media, and the relevant transport mechanism should be used for each of these two media types. Rich media streaming is thus naturally realized by:

a) streaming continuous media such as scene streams, video and audio; and

b) downloading the discrete media, such as images.

DIMS Units can be classified as either used in normal processing, or used only for ‘redundant’ processing. For a given DIMS data-stream, these two kinds of DIMS Units can be managed either:

a) in a single transport; or

b) in two separate transports.

7.2 Storage in ISO Base Media File Format Files

7.2.1 Introduction

DIMS streams, both primary streams (those containing SVG scenes) and secondary streams (which normally carry only updates), are carried in files of the ISO Base Media File Format [10] (including 3GP files [8]) according to this subclause.

Either one or two tracks are used in the file for the normal and redundant DIMS Units.

7.2.2 Stream Type

Scenes are carried in scene tracks in ISO family files. They therefore use:

a) a video media header ‘vmhd’;

b) a media handler type of ‘sdsm’ (scene description media handler);

c) a derivative of the base SampleEntry in the sample description box.

The timescale for the stream should be suitably chosen to achieve the desired accuracy of timing.

7.2.3 Track and Media Header fields

The width and height in the track header shall be set in the desired ratio, and indicate the suggested minimum display size. A player on a system with an indefinitely large display, in the absence of a fullscreen request, could use this size as a suggested initial display size.

If the presentation has an expected, reasonable duration, then it is encoded as the track duration. Otherwise the ISO file format recommendation of maxint for the duration, when it is indeterminate, should be used.

The language code of the track should be set appropriately if the presentation is language-specific, or else the value ‘und’ (undetermined) or ‘mul’ (multiple) should be used.

7.2.4 Sample Dependency Table

The sample dependency table may be used. The ‘unknown’ field values may be needed under some circumstances. The fields have the following semantics for DIMS streams:

sample_depends_on should be set according to whether the sample contains a normal DIMS Unit (not is-redundant) with is-RAP set to 1:

0: unknown;

1: this sample does not contain a normal RAP;

2: this sample does contain a normal RAP;

3: reserved.

sample_is_depended_on should be set according to the value of the P-bit in the DIMS Unit headers:

0: unknown;

1: one or more DIMS Units have the P-bit set to 1;

2: no DIMS Unit has the P-bit set to 1 (low-priority sample);

3: reserved.

sample_has_redundancy should be set to indicate whether the sample contains redundant DIMS Units:

0: unknown;

1: one or more DIMS Units have the is-redundant bit set to 1;

2: no DIMS Unit has the is-redundant bit set to 1;

3: reserved.
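
As an informative illustration, the mapping above can be sketched in Python. The `DimsUnitFlags` class and its field names are hypothetical stand-ins for the is-RAP, is-redundant and P bits of a DIMS Unit header; they are not defined by the present document. The sketch never emits the value 0 ("unknown"), which a writer would use when the unit headers have not been inspected.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DimsUnitFlags:
    """Hypothetical view of the header bits of one DIMS Unit."""
    is_rap: bool
    is_redundant: bool
    priority: bool  # the P-bit


def dependency_fields(units: List[DimsUnitFlags]) -> Tuple[int, int, int]:
    """Return (sample_depends_on, sample_is_depended_on, sample_has_redundancy)."""
    # A "normal RAP" is a unit that is a RAP and is not marked redundant.
    normal_rap = any(u.is_rap and not u.is_redundant for u in units)
    sample_depends_on = 2 if normal_rap else 1
    sample_is_depended_on = 1 if any(u.priority for u in units) else 2
    sample_has_redundancy = 1 if any(u.is_redundant for u in units) else 2
    return sample_depends_on, sample_is_depended_on, sample_has_redundancy
```

A sample holding one normal, high-priority RAP and one redundant unit would map to (2, 1, 1) under this sketch.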

7.2.5 Sample Entry Name and Format

The sample entry four-character code for scenes is ‘dims’. The configuration box shall be present in the sample entry.

class SceneConfiguration extends FullBox(‘dimC’, version = 0, 0) {
    unsigned int(8) profile;
    unsigned int(8) level;
    unsigned int(4) pathComponents;
    unsigned int(1) useFullRequestHost;
    unsigned int(1) stream_type;
    unsigned int(2) contains_redundant;
    string text_encoding;
    string content_coding;
}

class MPEG4BitRateBox extends Box(‘btrt’) {
    unsigned int(32) bufferSizeDB;
    unsigned int(32) maxBitrate;
    unsigned int(32) avgBitrate;
}

class DIMSScriptTypes extends Box(‘diST’) {
    string content_script_types;
}

class DIMSSampleEntry() extends SampleEntry(‘dims’) {
    SceneConfiguration config;    // mandatory
    DIMSScriptTypes scripts;      // optional
    MPEG4BitRateBox bitrateinfo;  // optional
}

The fields have the following semantics:

– profile – Specifies the profile of DIMS used, for example the value indicating the Mobile Profile as defined in subclause 8.1.

– level – Specifies the minimum DIMS level needed to be able to display the scene as defined in section 8.2.

– stream_type – takes the value 1 for primary streams, and the value 0 for secondary streams. Files containing secondary streams are not normally playable by themselves, outside the context of the scene(s) they are designed to update.

– contains_redundant – takes the value 1 ("main") if the stream contains only DIMS Units with is-redundant set to 0, the value 2 if the stream contains only DIMS Units with is-redundant set to 1 ("redundant"), and takes the value 3 ("main+redundant") if both occur. The value 0 is reserved. Note that streams containing only redundant units must be linked to the main stream for which they are redundant (see subclause 7.2.9).

– text_encoding – is a null terminated string with possible values taken from the XML specification for character encoding in entities (e.g. subclause 4.3.3 of XML 1.0 Fourth edition [25]). It describes the text encoding after the content has been de-compressed (e.g. after inflating deflate-compressed content). This field is only applicable if the content is transmitted as (possibly encoded) text. An empty string shall be used otherwise.

– content_coding – this field provides the identification of the compression scheme. It is a null terminated string specifying the encoding (compression) format of the content. It is defined in the same way as the content-coding header in HTTP (subclause 3.5 of [12]). An empty string indicates that no compression is used.

– content_script_types – is a null terminated string that identifies the scripting languages used. It is a comma-separated list of MIME types [18] from the IANA registry, such as "application/ecmascript" (see [13]). It shall provide a complete listing of the script types that the terminal must support in order to process the stream. If the box is not present, the set of required script types is unknown. If the box contains the empty string, then the stream does not require any script processing.

– bufferSizeDB gives the size of the decoding buffer for the elementary stream in bytes. This is the size of the largest buffer needed to hold a sample in textual format, in bytes (i.e. after any de-compression).

– maxBitrate gives the maximum rate in bits/second over any window of one second.

– avgBitrate gives the average rate in bits/second over the entire presentation.

– useFullRequestHost and pathComponents are defined in subclause 5.5.2.

The text_encoding is required to be consistent over all the DIMS units described by this sample entry. This simplifies processing. It is an error to have a mismatch between this value and those present in the XML of the DIMS units themselves.
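
As an informative illustration, the following Python sketch reads the body of a ‘dimC’ box according to the syntax above. It assumes that the input starts at the FullBox version byte (the box size and type have already been consumed), that bit fields are packed most significant bit first, and that strings are UTF-8; the function and class names are illustrative and are not defined by the present document.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SceneConfiguration:
    profile: int
    level: int
    path_components: int
    use_full_request_host: bool
    stream_type: int
    contains_redundant: int
    text_encoding: str
    content_coding: str


def _read_cstring(buf: bytes, pos: int) -> Tuple[str, int]:
    """Read a null terminated string; return it and the next read position."""
    end = buf.index(b"\x00", pos)  # raises ValueError if unterminated
    return buf[pos:end].decode("utf-8"), end + 1


def parse_dimc_payload(payload: bytes) -> SceneConfiguration:
    """Parse a 'dimC' box body, starting at the FullBox version byte."""
    if len(payload) < 7:
        raise ValueError("dimC payload too short")
    if payload[0] != 0:
        raise ValueError(f"unsupported dimC version {payload[0]}")
    # payload[1:4] holds the 24-bit FullBox flags (0 in the syntax above).
    profile, level, packed = payload[4], payload[5], payload[6]
    text_encoding, pos = _read_cstring(payload, 7)
    content_coding, _ = _read_cstring(payload, pos)
    return SceneConfiguration(
        profile=profile,
        level=level,
        path_components=packed >> 4,                    # upper 4 bits
        use_full_request_host=bool((packed >> 3) & 1),  # next 1 bit
        stream_type=(packed >> 2) & 1,                  # next 1 bit
        contains_redundant=packed & 0x3,                # lower 2 bits
        text_encoding=text_encoding,
        content_coding=content_coding,
    )
```

The four bit fields together occupy exactly one byte (4 + 1 + 1 + 2 bits), which is why the sketch reads them from a single packed value.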

7.2.6 Sample Format

A sample is a concatenated sequence of one or more DIMS Units associated with the same media time, with a two-byte length field in network (big-endian) format preceding each DIMS Unit. The length is the length of the DIMS Unit not including the length field itself (that is, the combined length of the DIMS Unit Body and DIMS Unit Header).

A sample may contain one or more Normal DIMS Units or Redundant DIMS Units, or both, associated with the same media time.
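
As an informative illustration, the sample format can be sketched in Python as follows. Each DIMS Unit is treated as an opaque byte string (header plus body); the function names are illustrative only.

```python
import struct
from typing import List


def build_sample(units: List[bytes]) -> bytes:
    """Concatenate DIMS Units, each preceded by a two-byte big-endian length."""
    out = bytearray()
    for unit in units:
        if len(unit) > 0xFFFF:
            raise ValueError("DIMS Unit too large for a two-byte length field")
        out += struct.pack(">H", len(unit)) + unit
    return bytes(out)


def split_sample(sample: bytes) -> List[bytes]:
    """Split a sample back into its DIMS Units."""
    units = []
    pos = 0
    while pos < len(sample):
        if pos + 2 > len(sample):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from(">H", sample, pos)
        pos += 2
        if pos + length > len(sample):
            raise ValueError("DIMS Unit runs past the end of the sample")
        units.append(sample[pos:pos + length])
        pos += length
    return units
```

The two-byte length field limits a single DIMS Unit to 65 535 bytes in this storage format.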

7.2.7 Other Resources

Other resources may be carried in the meta-data directories of ISO files, in the track containing the scene, the movie containing that track, or the file containing that movie. If there is no actual meta-data (the meta-data block is there merely to carry resources), the meta-data handler type ‘null’ may be used.

URL forms to address these resources are defined in the ISO specification, and are relative to the file containing the resource.

The meta data box may also be used for multi-scene presentations where the meta box includes the initial SVG scene, and one of the tracks provides the updates.

7.2.8 Sync Samples

The sync sample table marks samples in which any of the DIMS Units have the is-RAP bit set to 1.

NOTE: The use of the shadow sync box is deprecated.

7.2.9 Separate Redundant Track

Redundant DIMS Units may be stored in the file format using a separate track. The redundant track shall be linked to the matching main track by a track reference of type ‘swto’ in the redundant track.

Redundant tracks are identified by this track reference, and shall also have contains_redundant set to "redundant data only" in their sample entry. The track they link to shall have contains_redundant set to "main data only".

If a stream is converted from a single track to two tracks, some small adjustment may be needed. Specifically, any ‘normal’ DIMS Units following the redundant-exit indication in the same sample will need to be copied into the ‘redundant’ track, marked as ‘redundant’ DIMS Units, and the redundant-exit indication moved to the last such DIMS Unit in the sample.

A terminal may perform tune-in etc. using the ‘redundant’ track by:

a) finding the random access point in the redundant track, closely preceding the desired play point, by using the sync sample table;

b) processing DIMS Units from the redundant track until the redundant-exit indication;

c) following the ‘swto’ track reference and commencing processing at the temporally next sample in the linked (main) track.
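
As an informative illustration, the three steps above can be sketched in Python. The `Sample` class is hypothetical: `is_sync` stands for an entry in the sync sample table and `has_redundant_exit` for a sample carrying the redundant-exit indication. The sketch assumes both tracks are sorted by time, takes "closely preceding" to mean the last sync sample at or before the play point, and takes "temporally next" to mean strictly later than the exit sample.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    time: int
    is_sync: bool = False             # listed in the sync sample table
    has_redundant_exit: bool = False  # carries the redundant-exit indication


def plan_tune_in(redundant: List[Sample], main: List[Sample],
                 play_time: int) -> Tuple[int, int, Optional[int]]:
    """Return (first redundant index, last redundant index, first main index)."""
    # a) last random access point at or before the desired play point
    start = None
    for i, sample in enumerate(redundant):
        if sample.time > play_time:
            break
        if sample.is_sync:
            start = i
    if start is None:
        raise LookupError("no random access point at or before play_time")
    # b) process redundant samples up to the redundant-exit indication
    end = start
    while end < len(redundant) and not redundant[end].has_redundant_exit:
        end += 1
    if end == len(redundant):
        raise LookupError("no redundant-exit indication found")
    # c) continue at the temporally next sample of the linked main track
    exit_time = redundant[end].time
    main_start = next(
        (i for i, s in enumerate(main) if s.time > exit_time), None)
    return start, end, main_start
```

A `main_start` of `None` indicates that the main track has no later sample in this sketch.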

7.3 RTP Payload format for DIMS Streams

7.3.1 Priority

The counter (CTR) field is used to detect the loss of high priority DIMS units. Encoders and decoders keep a running value of the counter; the encoder places in each packet the current value of the counter; after being placed in the packet, the running counter is incremented by one if that packet contains one or more DIMS Units with high priority. The decoder compares the CTR field of each incoming packet with its running counter, and thereby checks for high-priority loss. After the check, the decoder’s running counter is incremented by one if the received packet contains one or more DIMS Units with high priority.

NOTE: A discontinuity in the sequence number indicates a lost packet. A discontinuity in the CTR field indicates the number of prioritized packets which have been lost.

An example of the use of the CTR and priority (P) bits is shown below:

Figure 7-1: Example of prioritization including detection of lost prioritized packets

Note that loss is only detected on the next packet to arrive; if the content has long periods in which no packets are sent, or is otherwise bursty, it may be inadvisable to have a high-priority packet before a long silence interval, as its loss cannot be detected until the first packet after that interval, at the earliest.
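
As an informative illustration, the counter handling can be sketched in Python. The sketch assumes that the counter wraps modulo 8 (the CTR field is 3 bits wide), that both sides start from the same value, and that the decoder resynchronizes its running counter to the received CTR after reporting a loss; these choices and the class names are assumptions, not requirements of the present document.

```python
CTR_MODULUS = 8  # the CTR field is 3 bits wide


class CtrEncoder:
    def __init__(self, start: int = 0) -> None:
        self.counter = start % CTR_MODULUS

    def next_ctr(self, has_high_priority: bool) -> int:
        """Return the CTR value for the next packet, then update the counter."""
        ctr = self.counter
        if has_high_priority:
            self.counter = (self.counter + 1) % CTR_MODULUS
        return ctr


class CtrDecoder:
    def __init__(self, start: int = 0) -> None:
        self.counter = start % CTR_MODULUS

    def receive(self, ctr: int, has_high_priority: bool) -> int:
        """Return the number of lost high-priority packets (modulo 8)."""
        lost = (ctr - self.counter) % CTR_MODULUS
        self.counter = ctr  # resynchronize (assumption of this sketch)
        if has_high_priority:
            self.counter = (self.counter + 1) % CTR_MODULUS
        return lost
```

Because the field is only 3 bits wide, a burst that drops a multiple of eight high-priority packets would go unnoticed by this check alone; the RTP sequence number still reveals that packets were lost.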

7.3.2 RTP Packet format

Introduction

In the context of the present document (specifically the MIME type defined in subclause 11.1), the units carried by the RTP Payload Format are DIMS Units. The RTP payload format defines two basic packet structures:

a) packets containing one or more entire units;

b) packets containing a single fragment of a unit.

Depending on the underlying network and the unit size, it may be desirable to split units or aggregate them.

RTP Header Usage

The RTP header is defined in [15] and its use in this payload format is described below.

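 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ….                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+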

Figure 7-2: RTP HEADER

Marker bit (M): 1 bit – The marker bit is set for the last packet associated with a timestamp.

NOTE: This is useful when a scene is sent as a combination of a smaller scene and a series of scene commands in separate packets. In this case the marker bit of the packet containing the last scene command is to be set. This is in line with the normal use of the marker bit in video coding and enables efficient buffering.

Timestamp: 32 bits – The timestamp indicates the rendering instant of the unit(s).

The usage of the remaining RTP header fields follows the rules of [15].

Common Packet Header

The RTP payload begins with a common header, which has the following format:

 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|R|A|  T  | CTR |
+-+-+-+-+-+-+-+-+


R: 1 bit

The R bit is reserved, shall be set to 0, and shall be ignored by the receiver.

A: 1 bit

When set to one, the A bit indicates that the packet contains one or more random access points (in DIMS, DIMS Units with is-RAP set), or the first fragment of a random access point.

T: 3 bits

The payload type, as defined in table 2. Reserved values shall not be used, and packets with reserved values of the type field shall be discarded and not processed.

Table 2: Summary of RTP Payload Types and Descriptions




Type      Description
0         Aggregation packet
1         Fragmentation start packet
2         Fragmentation continuing packet
3         Fragmentation end packet
4 to 7    Reserved

CTR: 3 bits

The CTR is used to detect the loss of one or more high-priority units as documented in subclause 7.3.1.

Aggregation Packet

These packets contain one or more complete units with the same timestamp. The common header values are:

– Type: 0.

– A (RAP): as needed.

The RTP payload is presented below.

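 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Header(Type=0)|       first Unit length       |               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
:                          first unit                           :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |…OPTIONAL RTP padding          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+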

Figure 7-4: Aggregation Packet payload format
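
As an informative illustration, the following Python sketch packs and parses the one-byte common header and splits an aggregation packet payload into its units, each of which is prefixed with a two-byte big-endian length. It assumes that the R bit is the most significant bit of the header byte (the usual reading of such bit diagrams) and that any RTP padding has already been removed; the function names are illustrative only.

```python
import struct
from typing import List, NamedTuple, Tuple


class CommonHeader(NamedTuple):
    a: int     # random access point flag
    type: int  # payload type (0 = aggregation packet)
    ctr: int   # high-priority loss counter


def pack_common_header(a: int, type_: int, ctr: int) -> int:
    """Build the header byte; the reserved R bit (bit 7) stays 0."""
    return ((a & 1) << 6) | ((type_ & 7) << 3) | (ctr & 7)


def parse_common_header(octet: int) -> CommonHeader:
    return CommonHeader(a=(octet >> 6) & 1, type=(octet >> 3) & 7, ctr=octet & 7)


def parse_aggregation_payload(payload: bytes) -> Tuple[CommonHeader, List[bytes]]:
    """Return the common header and the list of complete units."""
    if not payload:
        raise ValueError("empty RTP payload")
    header = parse_common_header(payload[0])
    if header.type != 0:
        raise ValueError("not an aggregation packet")
    units, pos = [], 1
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from(">H", payload, pos)
        pos += 2
        if pos + length > len(payload):
            raise ValueError("unit runs past the end of the payload")
        units.append(payload[pos:pos + length])
        pos += length
    return header, units
```

A receiver built on this sketch would discard packets whose type field is in the reserved range 4 to 7 before any further processing.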

The units are placed in the RTP payload, in sequence, possibly followed by RTP padding. Each unit is preceded by a two-byte length in network (big-endian) byte order. The length is the length of the following unit (both header and body), not including the length field itself.

Fragmentation Packets

Units that exceed the network's maximum transmission unit (MTU) size should be fragmented before transmission. By fragmenting at the RTP level one does not need to rely on lower layer fragmentation, e.g. IP.

The payload format defines fragmentation of units into two or more RTP packets.

NOTE: Fragmentation on the RTP level should however be seen as a solution only when fragmentation on the DIMS level is not possible. Fragmentation can be performed by splitting, for example, a scene into a scene and a number of scene updates. In this way packets can be created that are smaller than MTUs and can be decoded individually, which gives better error resilience when packets are lost.

The common header values are as follows.

– Type: 1, 2, or 3.

– A (RAP): as needed in first fragment, and 0 in all other fragments.

– CTR: shall be identical in all the packets of a fragmented unit; increments after the last fragment depending on the priority of the unit.

Fragments consist of an integer number of consecutive octets of a unit. Fragments of a unit shall be sent as a group and in consecutive order with respect to RTP sequence numbers. The first fragment shall be marked as type 1 and the last fragment shall be marked as type 3. Other fragments shall be marked as type 2.

The unit is complete with header. The header is not repeated in fragments. There is no length field.

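 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Header(type=1) |    Header     |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                     Partial Unit payload                      |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :…OPTIONAL RTP padding          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Hdr (type=2/3) |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|                     Partial Unit payload                      |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :…OPTIONAL RTP padding          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+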
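
As an informative illustration, fragmentation and reassembly of a single unit can be sketched in Python. The sketch assumes the R bit is the most significant bit of the common header byte, that RTP padding is absent, and that the receiver is handed all fragments of one unit in sequence-number order with none lost; `max_fragment` is a hypothetical parameter for the number of unit octets per packet.

```python
from typing import Iterable, List

FRAG_START, FRAG_CONT, FRAG_END = 1, 2, 3


def fragment_unit(unit: bytes, max_fragment: int, a: int, ctr: int) -> List[bytes]:
    """Split one unit (header included) into fragment packet payloads."""
    if max_fragment < 1:
        raise ValueError("max_fragment must be positive")
    chunks = [unit[i:i + max_fragment] for i in range(0, len(unit), max_fragment)]
    if len(chunks) < 2:
        raise ValueError("unit fits in one packet; use an aggregation packet")
    payloads = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            type_, a_bit = FRAG_START, a  # A bit only in the first fragment
        elif i == len(chunks) - 1:
            type_, a_bit = FRAG_END, 0
        else:
            type_, a_bit = FRAG_CONT, 0
        # The same CTR value is carried in every fragment of the unit.
        header = ((a_bit & 1) << 6) | (type_ << 3) | (ctr & 7)
        payloads.append(bytes([header]) + chunk)
    return payloads


def reassemble(payloads: Iterable[bytes]) -> bytes:
    """Rebuild the unit from its fragments; there is no length field to strip."""
    payloads = list(payloads)
    if len(payloads) < 2:
        raise ValueError("a fragmented unit spans at least two packets")
    types = [(p[0] >> 3) & 7 for p in payloads]
    expected = [FRAG_START] + [FRAG_CONT] * (len(payloads) - 2) + [FRAG_END]
    if types != expected:
        raise ValueError("fragments missing or out of order")
    return b"".join(p[1:] for p in payloads)
```

A real receiver would also use the RTP sequence numbers to confirm that no fragment between the start and end packets was lost before passing the unit on.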


7.3.3 SDP Parameters

The Session Description specifies the clock rate, version profile and level. The fields in the Session Description Protocol (SDP) are defined as follows:

– The media name in the "m=" line of SDP shall be video.

– The encoding name in the "a=rtpmap" line of SDP shall be richmedia+xml.

The clock rate in the "a=rtpmap" line is not specified in the present document. The resolution of the clock should be sufficient for the desired synchronization accuracy and for measuring packet arrival jitter. The clock rate of the referenced continuous media files within the presentation needs to be considered. For example, if the presentation contains referenced video which is to be synchronized with the presentation, the clock rate should be no less than 90,000.

The MIME parameters in subclause 11.1, when present, shall be included in the "a=fmtp" line of SDP. These parameters are expressed as a MIME media type string, in the form of a semicolon separated list of parameter=value pairs.

An example of a media-level description in SDP format is shown below.

m=video 12345 RTP/AVP 96
a=rtpmap:96 richmedia+xml/100000
a=fmtp:96 Version-profile=10; Level=10;

7.3.4 Separate Redundant Stream

Redundant DIMS Units may be carried in RTP in a separate stream. If there is more than one main stream, the redundant stream(s) shall be linked to the matching main stream(s) that they repair, using the media identification and group attributes as specified in [22]. For any RTP packet A in a redundant stream and any RTP packet B in the corresponding main stream, when the redundant DIMS units in packet A and the main DIMS units in packet B have the same media time, packet A and packet B shall have the same RTP timestamp value.

Redundant streams have contains-redundant set to "redundant data only". The stream they are connected to shall have contains-redundant set to "main data only".

A terminal may perform tune-in etc. using the ‘redundant’ stream by:

a) looking for a random access point in the redundant stream, or main stream;

b) if the random access point was in the redundant stream, processing DIMS Units from the redundant stream until the redundant-exit indication;

c) continuing processing at the temporally next DIMS Unit in the main stream.