13 HTTP streaming extensions
3GPP TS 26.244, Release 17: Transparent end-to-end Packet-switched Streaming Service (PSS); 3GPP file format (3GP)
13.1 Introduction
This clause describes extensions to the 3GP file format related to Dynamic Adaptive Streaming over HTTP as specified in 3GPP TS 26.247 [49] using HTTP [48] as delivery protocol for segments.
13.2 Segment types
It is possible in HTTP streaming to form files from segments – or concatenated segments – which would not necessarily form 3GP compliant files (e.g. they do not contain a movie box). If such segments are stored in separate files (e.g. on a standard HTTP server) it is recommended that these ‘segment files’ start with a segment-type box, to enable identification of those files, and declaration of the specifications with which they are compliant.
A segment-type box has the same format as an ‘ftyp’ box [7], except that it takes the box type ‘styp’. The brands within it should include the same brands that were included in the ‘ftyp’ box that preceded the ‘moov’ box, and may also include additional brands to indicate the compatibility of this segment with various specification(s) such as the 3GP Media Segment Profile defined in clause 5.4.10 of this specification.
Valid segment type boxes shall be the first box in a segment. Segment type boxes may be removed if segments are concatenated (e.g. to form a full 3GP file), but this is not required. Segment type boxes that are not first in their files may be ignored.
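As an illustration of the identification step described above, the following minimal Python sketch reads a leading ‘styp’ box using the ‘ftyp’ layout (32-bit size, box type, major brand, minor version, list of compatible brands). The function name parse_segment_type and the brand values in the usage note are illustrative assumptions and are not defined by this specification; boxes using the 64-bit size form are not handled.

```python
import struct

def parse_segment_type(data: bytes):
    """Parse a leading 'styp' box; return None if the data does not start with one.

    Assumption: the box uses the 32-bit size form (no 64-bit largesize).
    """
    if len(data) < 16:
        return None
    size, box_type = struct.unpack(">I4s", data[:8])
    if box_type != b"styp" or size < 16 or size > len(data):
        return None
    major_brand, minor_version = struct.unpack(">4sI", data[8:16])
    # The remaining payload is a list of 4-byte compatible brands.
    compatible = [data[i:i + 4].decode("ascii") for i in range(16, size - 3, 4)]
    return {
        "major_brand": major_brand.decode("ascii"),
        "minor_version": minor_version,
        "compatible_brands": compatible,
    }
```

For example, a 24-byte box built as struct.pack(">I4s4sI", 24, b"styp", b"3gp6", 0) + b"isom" + b"3gp6" would be reported with two compatible brands, whereas data starting with any other box type yields None, consistent with ignoring segment type boxes that are not first.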
13.3 Track Fragment Adjustment Box
Track Fragment Adjustment Boxes describe the relative time difference of the first samples of tracks within a movie fragment. When randomly accessing a 3GP file or a Media Segment at a movie fragment that contains a Track Fragment Adjustment Box, the Track Fragment Adjustment Box provides instructions on how the timeline of one or more of the tracks may be modified to generate synchronization between the tracks. For example, if, in the previous fragment, one track ended later than another, the first sample of that track in this fragment will need to be presented later also; an edit-list in the track fragment adjustment box containing an empty edit, and then a media edit, achieves that effect.
The syntax of a Track Fragment Adjustment Box, as described below, is identical to that of edit-lists. However, unlike edit-lists, which must always be applied when present to adjust the timelines of the containing tracks, a Track Fragment Adjustment Box may only be applied when randomly accessing a 3GP file or a Media Segment at a movie fragment containing the Track Fragment Adjustment Box. In continuous playback, wherein the track alignment is known (e.g. from decoding the previous segment) and sync between tracks has been achieved, the Track Fragment Adjustment Box shall not be applied.
The container of the Track Fragment Adjustment Box is the Track Fragment Box. If present, the Track Fragment Adjustment Box should be positioned after the Track Fragment Header Box and before the first Track Fragment Run box. The Track Fragment Adjustment Box is a container for the Track Fragment Media Adjustment Boxes.
aligned(8) class TrackFragmentAdjustmentBox extends Box('tfad') {
}
The Track Fragment Media Adjustment Box provides explicit timeline offsets. By indicating ‘empty’ time, or by defining a ‘dwell’, the offset can delay the playback time of the media in the track so that media in different tracks can be synchronized. Alternatively, the media_time value may be used to discard part of the "earlier" tracks.
aligned(8) class TrackFragmentMediaAdjustmentBox extends FullBox('tfma', version, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        if (version==1) {
            unsigned int(64) segment_duration;
            int(64) media_time;
        } else { // version==0
            unsigned int(32) segment_duration;
            int(32) media_time;
        }
        int(16) media_rate_integer;
        int(16) media_rate_fraction = 0;
    }
}
version is an integer that specifies the version of this box (0 or 1).
entry_count is an integer that gives the number of entries in the following table.
segment_duration is an integer that specifies the duration of this adjustment segment in units of the timescale in the Movie Header Box. "Adjustment segment" in this context does not refer to the "Media Segment" that contains the ‘tfma’ but refers to the operation that is performed to place the track at appropriate composition time.
media_time is an integer containing the starting time within the media of this adjustment segment (in media time scale units, in composition time). If this field is set to -1, it is an empty edit. The last adjustment in a track shall never be an empty edit.
media_rate_integer specifies the relative rate at which to play the media corresponding to this adjustment segment. If this value is 0, then the adjustment is specifying a ‘dwell’: the media at media-time is presented for the segment-duration. Otherwise this field shall contain the value 1.
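The syntax and semantics above can be sketched in Python as follows. This is a minimal, non-normative illustration that assumes the caller has already located the ‘tfma’ box and passes only its body (the bytes following the 8-byte size/type header); the names parse_tfma and leading_empty_duration are hypothetical helpers, not part of this specification.

```python
import struct

def parse_tfma(payload: bytes):
    """Parse the body of a 'tfma' box into a list of adjustment entries."""
    version = payload[0]                       # FullBox: 1-byte version, 3-byte flags
    (entry_count,) = struct.unpack_from(">I", payload, 4)
    # Version 1 uses 64-bit duration/time fields, version 0 uses 32-bit fields.
    fmt = ">Qqhh" if version == 1 else ">Iihh"
    entry_size = struct.calcsize(fmt)
    entries = []
    offset = 8
    for _ in range(entry_count):
        seg_dur, media_time, rate_int, rate_frac = struct.unpack_from(fmt, payload, offset)
        entries.append({
            "segment_duration": seg_dur,       # movie timescale units
            "media_time": media_time,          # -1 marks an empty edit
            "media_rate_integer": rate_int,    # 0 marks a dwell, otherwise 1
            "media_rate_fraction": rate_frac,
        })
        offset += entry_size
    return entries

def leading_empty_duration(entries):
    """Sum segment_duration over the leading empty edits (media_time == -1)."""
    total = 0
    for entry in entries:
        if entry["media_time"] != -1:
            break
        total += entry["segment_duration"]
    return total
```

Under these assumptions, a player performing random access could delay the start of the track by leading_empty_duration(entries) ticks of the movie timescale before presenting the media selected by the first non-empty entry.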
13.4 Segment Index Box
The Segment Index box (‘sidx’) provides a compact index of one track within the media segment to which it applies. The index refers to movie fragments and other Segment Index Boxes in a segment.
Each Segment Index Box documents how a (sub)segment is divided into one or more subsegments (which may themselves be subdivided using Segment Index boxes). A subsegment is defined as a time interval of the containing (sub)segment, and corresponds to a single range of bytes of the containing (sub)segment. The durations of all the subsegments sum to the duration of the containing (sub)segment.
Specifically for this file format, a subsegment is a self-contained set of one or more consecutive movie fragment boxes with corresponding Media Data box(es). A Media Data Box containing data referenced by a Movie Fragment Box must follow that Movie Fragment box and precede the next Movie Fragment box containing information about the same track. The presentation times documented in the Segment Index are in the movie timeline; that is, they are composition times after the application of any edit list for the track.
Each entry in the Segment Index box contains a reference type that indicates whether the reference points directly to the media bytes of a referenced leaf subsegment, or to a Segment Index box that describes how the referenced subsegment is further subdivided; as a result, the segment may be indexed in a ‘hierarchical’ or ‘daisy-chain’ or other form by documenting time and byte offset information for other Segment Index boxes applying to portions of the same (sub)segment.
A Segment Index box provides information about a single track of the Segment, referred to as the reference stream. If provided, the first Segment Index box in a segment, for a given track, shall document the entirety of that track in the segment, and shall precede any other Segment Index box in the segment for the same track.
If a Segment Index is present for at least one track but not all tracks in the segment, then normally a track in which not every sample is independently coded, such as video, is selected to be indexed. For any track for which no segment index is present, referred to as a non-indexed stream, the track associated with the first Segment Index box in the segment serves as a reference stream in the sense that it also describes the subsegments for any non-indexed track.
A Segment Index box contains a sequence of references to subsegments of the (sub)segment documented by the box. The referenced subsegments are contiguous in presentation time. Similarly, the bytes referred to by a Segment Index box are always contiguous within the segment. The referenced size gives the count of the number of bytes in the material referenced.
NOTE: A media segment may be indexed by more than one "top-level" Segment Index box, each independent of the others and each indexing one track within the media segment. In segments containing multiple tracks the referenced bytes may contain media from multiple tracks, even though the Segment Index box provides timing information for only one track.
The anchor point for a Segment Index box is the first byte after that box.
Within the two constraints (a) that, in time, the subsegments are contiguous, that is, each entry in the loop is consecutive from the immediately preceding one and (b) within a given segment the referenced bytes are contiguous, there are a number of possibilities, including:
1) a reference to a segment index box may include, in its byte count, immediately following Segment Index boxes that document subsegments;
2) using the first_offset field, it is possible to separate Segment Index boxes from the media that they refer to;
3) it is possible to locate Segment Index boxes for subsegments close to the media they index.
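The two contiguity constraints mean that byte ranges and presentation-time intervals can be derived by simple accumulation from the anchor point. The Python sketch below illustrates this under stated assumptions: the Reference class and the locate_references function are hypothetical names, the anchor is taken to be the file offset of the first byte after the Segment Index box, and the field names follow the syntax given later in this clause.

```python
from dataclasses import dataclass

@dataclass
class Reference:
    reference_type: int        # 0 = media (movie fragment), 1 = Segment Index box
    referenced_size: int       # bytes
    subsegment_duration: int   # ticks of the Segment Index timescale

def locate_references(anchor, first_offset, earliest_presentation_time, refs):
    """Return (first_byte, last_byte, start_time, end_time) for each reference.

    Byte positions are inclusive, as used in HTTP Range requests; times are in
    the timescale of the Segment Index box.
    """
    located = []
    pos = anchor + first_offset
    t = earliest_presentation_time
    for ref in refs:
        located.append((pos, pos + ref.referenced_size - 1, t, t + ref.subsegment_duration))
        pos += ref.referenced_size      # referenced bytes are contiguous
        t += ref.subsegment_duration    # subsegments are contiguous in time
    return located
```

With an anchor at byte 152, first_offset of 0 and two references of 1000 and 2500 bytes, the sketch yields the inclusive byte ranges 152 to 1151 and 1152 to 3651.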
The Segment Index box documents the presence of Stream Access Points (SAPs), as specified in Annex G.6 of TS 26.247 [49], in the referenced subsegments. The annex specifies characteristics of SAPs, such as ISAU, ISAP and TSAP, as well as SAP types, which are all used in the semantics below. A subsegment starts with a SAP when the subsegment contains a SAP, and for the first SAP, ISAU is the index of the first sample that follows ISAP, and ISAP is contained in the subsegment.
The container for the ‘sidx’ box is the file or segment directly.
aligned(8) class SegmentIndexBox extends FullBox('sidx', version, 0) {
    unsigned int(32) reference_ID;
    unsigned int(32) timescale;
    if (version==0) {
        unsigned int(32) earliest_presentation_time;
        unsigned int(32) first_offset;
    } else {
        unsigned int(64) earliest_presentation_time;
        unsigned int(64) first_offset;
    }
    unsigned int(16) reserved = 0;
    unsigned int(16) reference_count;
    for (i=1; i <= reference_count; i++) {
        bit(1) reference_type;
        unsigned int(31) referenced_size;
        unsigned int(32) subsegment_duration;
        bit(1) starts_with_SAP;
        unsigned int(3) SAP_type;
        unsigned int(28) SAP_delta_time;
    }
}
reference_ID provides the track_ID for the reference track; if this Segment Index box is referenced from a "parent" Segment Index box, the value of reference_ID shall be the same as the value of reference_ID of the "parent" Segment Index box.
timescale provides the timescale, in ticks per second, for the time and duration fields within this box; it is recommended that this match the timescale of the reference track, i.e. the timescale field of the Media Header Box of the track.
earliest_presentation_time is the earliest presentation time of any sample in the reference track in the first subsegment, expressed in the timescale of the timescale field.
first_offset is the distance in bytes from the first byte following the containing Segment Index Box, to the first byte of the first referenced box.
reference_count: the number of referenced elements listed in the loop that follows.
reference_type: when set to 0 indicates that the reference is to a movie fragment (‘moof’) box; when set to 1 indicates that the reference is to a segment index (‘sidx’) box.
referenced_size: the distance in bytes from the first byte of the referenced box to the first byte of the next referenced box or in the case of the last entry, the first byte not indexed by this Segment Index Box.
subsegment_duration: when the reference is to a Segment Index Box, this field carries the sum of the subsegment_duration fields in that box; when the reference is to a subsegment, this field carries the difference between the earliest presentation time of any sample of the reference track in the next subsegment (or in the first subsegment of the next segment, if this is the last subsegment of the segment, or the end composition time of the reference track, if this is the last subsegment of the representation) and the earliest presentation time of any sample of the reference track in the referenced subsegment; the duration is expressed in the timescale value in this box.
starts_with_SAP: indicates whether the referenced subsegments start with a SAP. For the detailed semantics of this field in combination with other fields, see the table below.
SAP_type: indicates a SAP type as specified in TS 26.247 [49], Annex G.6, or the value 0. Other type values are reserved. For the detailed semantics of this field in combination with other fields, see the table below.
SAP_delta_time: indicates TSAP of the first SAP, in decoding order, in the referenced subsegment for the reference stream. If the referenced subsegments do not contain a SAP, SAP_delta_time is reserved with the value 0; otherwise SAP_delta_time is the difference between the earliest presentation time of the subsegment, and the TSAP (note that this difference may be zero, in the case that the subsegment starts with a SAP).
Table 13.1: Semantics of SAP and reference type combinations
| starts_with_SAP | SAP_type | reference_type | Meaning |
|---|---|---|---|
| 0 | 0 | 0 or 1 | No information of SAPs is provided. |
| 0 | 1 to 6 | 0 (media) | The subsegment contains (but may not start with) a SAP of the given SAP_type and the first SAP of the given SAP_type corresponds to SAP_delta_time. |
| 0 | 1 to 6 | 1 (index) | All the referenced subsegments contain a SAP of at most the given SAP_type and none of these SAPs is of an unknown type. |
| 1 | 0 | 0 (media) | The subsegment starts with a SAP of an unknown type. |
| 1 | 0 | 1 (index) | All the referenced subsegments start with a SAP which may be of an unknown type. |
| 1 | 1 to 6 | 0 (media) | The referenced subsegment starts with a SAP of the given SAP_type. |
| 1 | 1 to 6 | 1 (index) | All the referenced subsegments start with a SAP of at most the given SAP_type and none of these SAPs is of an unknown type. |
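The Segment Index syntax and the bit-packed fields used in Table 13.1 can be unpacked as in the following Python sketch. It is a minimal, non-normative illustration that assumes the caller passes only the body of the ‘sidx’ box (the bytes following the 8-byte size/type header); parse_sidx is a hypothetical helper name.

```python
import struct

def parse_sidx(payload: bytes):
    """Parse the body of a 'sidx' box into a dictionary of its fields."""
    version = payload[0]                        # FullBox: 1-byte version, 3-byte flags
    reference_id, timescale = struct.unpack_from(">II", payload, 4)
    if version == 0:
        ept, first_offset = struct.unpack_from(">II", payload, 12)
        pos = 20
    else:
        ept, first_offset = struct.unpack_from(">QQ", payload, 12)
        pos = 28
    _reserved, reference_count = struct.unpack_from(">HH", payload, pos)
    pos += 4
    references = []
    for _ in range(reference_count):
        # Each entry is three 32-bit words; the first and third are bit-packed.
        word1, duration, word3 = struct.unpack_from(">III", payload, pos)
        references.append({
            "reference_type": word1 >> 31,
            "referenced_size": word1 & 0x7FFFFFFF,
            "subsegment_duration": duration,
            "starts_with_SAP": word3 >> 31,
            "SAP_type": (word3 >> 28) & 0x7,
            "SAP_delta_time": word3 & 0x0FFFFFFF,
        })
        pos += 12
    return {
        "reference_ID": reference_id,
        "timescale": timescale,
        "earliest_presentation_time": ept,
        "first_offset": first_offset,
        "references": references,
    }
```

A client could, for instance, select entries with reference_type 0, starts_with_SAP 1 and a SAP_type between 1 and 6 as candidate random access points, in line with the corresponding row of Table 13.1.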
13.5 Track Fragment Decode Time Box
The Track Fragment Base Media Decode Time (‘tfdt’) Box provides the decode time of the first sample in the track fragment. This can be useful, for example, when performing random access in a file; it is not necessary to sum the sample durations of all preceding samples in previous fragments to find this value (where the sample durations are the deltas in the Decoding Time to Sample Box and the sample_durations in the preceding track runs).
The Track Fragment Base Media Decode Time Box, if present, shall be positioned after the Track Fragment Header Box and before the first Track Fragment Run box.
NOTE: The decode timeline is a media timeline, established before any explicit or implied mapping of media time to presentation time, for example by an edit list or similar structure.
aligned(8) class TrackFragmentBaseMediaDecodeTimeBox
    extends FullBox('tfdt', version, 0) {
    if (version==1) {
        unsigned int(64) baseMediaDecodeTime;
    } else { // version==0
        unsigned int(32) baseMediaDecodeTime;
    }
}
version is an integer that specifies the version of this box (0 or 1 in this specification).
baseMediaDecodeTime is an integer equal to the sum of the decode durations of all earlier samples in the media, expressed in the media’s timescale. It does not include the samples added in the enclosing track fragment.
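The relationship between baseMediaDecodeTime and the durations of earlier samples can be illustrated with the following Python sketch. It is non-normative and assumes the caller passes only the body of the ‘tfdt’ box (the bytes following the 8-byte size/type header); parse_tfdt and expected_base_media_decode_time are hypothetical helper names.

```python
import struct

def parse_tfdt(payload: bytes) -> int:
    """Return baseMediaDecodeTime from the body of a 'tfdt' box."""
    version = payload[0]                  # FullBox: 1-byte version, 3-byte flags
    fmt = ">Q" if version == 1 else ">I"  # 64-bit field in version 1, 32-bit in version 0
    (base_media_decode_time,) = struct.unpack_from(fmt, payload, 4)
    return base_media_decode_time

def expected_base_media_decode_time(earlier_sample_durations):
    """Sum the decode durations of all samples preceding the track fragment.

    Durations are in the media timescale; samples of the enclosing track
    fragment itself are not included.
    """
    return sum(earlier_sample_durations)
```

For example, if the preceding fragments carried 90 samples of 3000 ticks each, the box would be expected to signal 270000, which a reader obtains directly from parse_tfdt without summing the earlier durations.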