5 Compressed Data Streams

23.0423GPPCompression algorithm for text messaging servicesRelease 17TS

This clause provides:

– A detailed specification of the algorithms and data structures that implement compression and decompression mechanisms.

5.1 Structure

A Compressed Data Stream (CDS) comprises three key components:

– a Compression Header (CH) containing a variable number of octets, the content of which defines the nature of the compressed data;

– the Compressed Data (CD) which is a bit stream of variable length;

– a Compression Footer (CF) which is used to signal the number of bits in the last octet of the CDS that form part of the compressed data.

5.2 Compression Header

The Compression Header (CH) comprises a variable number of octets that define the nature of the compressed data.

The compression header allows for a wide range of compression alternatives, however of these alternatives only one is defined as the basic mandatory form of compression that shall be supported by all implementations. This is the use of the basic Huffman algorithm initialized with no prior knowledge of character distribution. This case can be signalled directly by setting a single octet(octet 1) for the compression header with the value of 120 (decimal).

5.2.1 Compression Header – Octet 1

The first CH octet is mandatory and is defined as follows:

Table 1: CH octet

7

6

5

4

3

2

1

0

Description

0

There is no subsequent CH octet

1

A further CH octet follows

n

n

n

n

The "Compression Language Context" this is described below

0

Punctuation processing disabled

1

Punctuation processing enabled

0

Keyword processing disabled

1

Keyword processing enabled

0

Character group processing disabled

1

Character group processing enabled

As noted in clause 4, the compression algorithms can be configured to operate in a variety of ways and may rely on end-to-end knowledge of "prior" information such as which key word dictionary is to be used.

A requirement that all configuration information be explicitly stated in the CH is less efficient (in terms of compression ratio) than if a default configuration is known and only variations from this need be signalled. However, a major determinant of configuration is the language in which the original message to be compressed is composed. For example, different keyword dictionaries would be required for French and opposed to German and character frequency distributions for English texts may vary greatly from those for Swedish texts. From this it can be seen that a universal "default" configuration would be of little value.

To address this, the Compression Language Context (CLC) allows a default configuration to be specified for each of the languages defined in 3GPP TS 23.038 [1] in relation to the Cell Broadcast Data Coding Scheme as follows:

– The CLC in bits 6 to 3 of the CH specify the language as per 3GPP TS 23.038 [1] in the case where bits 7 to 4 of the Cell Broadcast Data Coding Scheme octet are set to 0000.

– If and when required, higher order bits of the CLC can be signalled by a subsequent CH octet as described below.

– The CLC value 1111 (language unspecified) will indicate a "default" configuration that is language independent. This is specified in annex R and involves the basic Huffman (de-)coding with no initial character frequency distribution, see example below.

Table 2: Huffman (de-)coding with no initial character frequency distribution

7

6

5

4

3

2

1

0

Description

0

1

1

1

1

0

0

0

Basic Huffman (de-)coding only.

5.2.2 Compression Header – Octets 2 to n

Any second and subsequent CH octets are used to vary the configuration defaults established by the CLC. These octets all comprise a continuation bit followed by a Type, Value structure as follows:

Table 3: Value structure

7

6

5

4

3

2

1

0

Description

0

There is no subsequent CH octet

1

A further CH octet follows

n

n

n

CH Extension Type

n

n

n

n

CH Extension Value

The bits of the semi-octet CH Extension value are interpreted left to right, MSB to LSB. If the CH contains more than one octet of the same CH Extension type, the CH Extension value of a subsequent CH octet, is interpreted as being next most significant semi-octet of the composite value being signalled.

For example if the CLC in CH octet 1 indicates that the default Huffman Initialization ID is 1 (decimal) and the required HI-ID is 37 (decimal), then the following octets (in the range 2 to n) would also be required in the CH.

Table 4: CH extension octets (Example)

7

6

5

4

3

2

1

0

Description

1

0

1

1

0

1

0

1

The default HI-ID is replaced with the value 0101

0

0

1

1

0

0

1

0

The current HI-ID value (0101) is extended to 0010 0101

The following values are defined for the CH Extension Type:

000 Extend CLC. The CH Extension Value contains higher order bits that are to be pre-pended to the current CLC value.

NOTE: for 1st occurrence of the Extend CLC CH Extension Type in the CH, the value for the CLC specified in CH octet 1 is not replaced but rather the process of "extension" begins directly. Thus is the CLC to be used is 18, octets 1 and 2 of the CH would contain:

Table 5: CLC extension (Example)

7

6

5

4

3

2

1

0

Description

1

0

0

1

0

The least significant semi-octet of the CLC is 0010

0

0

0

0

0

0

0

1

The CLC value (0010) is extended to 0001 0010

001 Change Character Set. The CLC defines a default character set (UCS2 or otherwise) within which compression will operate. The Change Character Set CH Extension Type indicates that this should be overridden by the character set specified by the CH Extension Value. If a CH contains more than one Change Character Set CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits and are to be pre-pended to the value of the new character set.

The following Character Sets are defined:

0000 No character set defined. To be used where original message content is binary data and compression is solely via Huffman coding with no initial frequency training and thus there is no requirement to ensure consistent use of character set by coder and decoder.

0001 GSM 7 bit default alphabet (3GPP TS 23.038 [1])

0010 Codepage 437

0011 Codepage 850

All other values are reserved – see section 5.2.2.1

A Change Character Set to UCS2 codepoint is not defined here. Where the CLC indicates a character set other than UCS2 and there is a need to change to UCS2 then this is achieved using the Change UCS2 row parameter described below.

010 Change UCS2 Row. The CLC defines a default character set (UCS2 or otherwise) within which compression will operate. The Change UCS2 Row CH Extension Type indicates that this should be overridden by the use of UCS2 and the UCS2 row value for the first character in the input stream is that specified by the CH Extension Value. If a CH contains more than one Change UCS2 Row CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial UCS2 Row value and are to be pre-pended to the current value.

NOTE: Change UCS2 Row CH Extension Type octet effectively overrides any prior Change Character Set CH Extension Type octet and vice versa so these types are logically mutually exclusive within a given CH.

011 Change Huffman Initialization. The CLC defines a default set of parameters for the initialization of the Huffman (de)coder. The Change Huffman Initialization CH Extension Type indicates that this should be overridden by the set of initialization parameters identified by the Huffman Initialization ID contained in the CH Extension Value. If a CH contains more than one Change Huffman Initialization CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial Huffman Initialization ID value and are to be pre-pended to the current value.

100 Change Keyword Dictionary. The CLC defines a default set of parameters for the initialization of the Keyword (de)coder. The Change Keyword Dictionary CH Extension Type indicates that this should be overridden by the set of initialization parameters identified by the Keyword Dictionary ID contained in the CH Extension Value. If a CH contains more than one Change Keyword Dictionary CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial Keyword Dictionary ID value and are to be pre-pended to the current value.

101 Change Punctuator. The CLC defines a default set of parameters for the initialization of the punctuation (de)coder. The Change Punctuator CH Extension Type indicates that this should be overridden by the set of initialization parameters identified by the Punctuator ID contained in the CH Extension Value. If a CH contains more than one Punctuator CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial Punctuator ID value and are to be pre-pended to the current value.

110 Change Character Group. The CLC defines a default set of parameters for the initialization of the Character Group (de)coder. The Change Character Group CH Extension Type indicates that this should be overridden by the set of initialization parameters identified by the Character Group ID contained in the CH Extension Value. If a CH contains more than one Change Character Group CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial Character Group ID value and are to be pre-pended to the current value.

111 Reserved, see section 5.2.2.1

5.2.2.1 Compression Header reserved extension types and values

Any currently undefined values in the range 0 to 255 decimal are reserved.

Values above 255 are available for user to user requirements.

5.2.3 Identifying unique parameter sets

The four component compression algorithms (Huffman, Keywords, Character Groups and Punctuation) may all have a variety of initialization options. For each algorithm, a given set of initialization options needs to be identified for the processing of a given input stream.

Initialization and operation of the algorithms depends not only on the language in which the original source text is composed but also the character set (UCS2 or otherwise) that is to be used during processing. Thus the Huffman Initialization ID (HI-ID), Keyword Dictionary ID (KD-ID), Punctuator ID (PU-ID) and Character Group ID (CG-ID) only define unique values within the context of a given character set (the default established by the CLC or subsequently amended via Change Character Set or Change UCS2 Row CH Extension types) and within the context of the language indicated by the CLC.

5.3 Compressed Data

The Compressed Data (CD) is a stream bits of variable length that represent either an encoding of the content original input stream or control information indication that the operation of some algorithm should vary in some manner.

Control information is signalled within the CD by Huffman encoded symbols (characters) whose value is greater than 255 decimal. Huffman encoded symbols in the range 0 to 255 are of course characters from the original input stream.

The following control symbols are defined:

Table 6: Compressed Data: control symbols

Decimal value

Significance

256

New 7 bit character.

On encoding, if a character (octet) from the input stream in the range 0 to 127 does not exist in the Huffman tree, then the New 7 bit character symbol is Huffman encoded to the CD and bits 6 to 0 of the original octet are copied unchanged to the CD. The Huffman tree would then be updated to include the new character as described in the sections below.

On decoding the New 7 bit character symbol, the symbol itself is discarded and the next 7 bits of the CD are copied unchanged to bits 6-0 of the octet to be output, bit 7 of which is zero. The Huffman tree would then be updated to include the new character.

257

New 8 bit character.

The operation of this is identical to that of the New 7 bit character except that on encoding, the input character is in the range 128-255 and on decoding, bit 7 of the output character is set to 1.

258

Keyword.

This symbol (Huffman encoded) Huffma\n encodedprefixes a sequence of bits of variable length in the CD that define a representation of characters in the uncompressed stream by an entry in a keyword dictionary.

On encoding, if a sequence of characters in the input stream can be represented by an entry in a keyword dictionary, the Keyword symbol is Huffman encoded to the CD followed by the bit sequence describing the keyword entry (this is described below).

On decoding the Keyword symbol, the symbol itself is discarded and the bit sequence describing the keyword entry is passed to the Keyword processor to recovery the original character sequence to be placed in the output stream.

259 to 265

Character Group Transitions.

These symbols signal transitions between groups of characters defined within the Character Group processor. For example, if 2 groups are defined to be the lower case and upper case characters then the input stream:

"abcdefABCDEF" would become "abcdef<Change Group>abcdef"

On encoding, Character Group Transition symbols are generated by the Character Group processor and simply passed to the Huffman processor for encoding.

On decoding a Character Group Transition symbol, it is simply passed from the Huffman processor to the Character Group processor which takes the appropriate action based its current state and the group transition indicated.

266

New UCS2 Row.

On encoding, if the next UCS2 character in the input stream has a "row octet" of a different value to that of the previous character in the input stream, the New UCS2 Row symbol is Huffman encoded to the CD and the 8 bit of the new row octet are copied unchanged to the CD. The new row octet is stored by the UCS2 processor as the "current row octet" and subsequent input characters within the current row are Huffman encoded as the 8 bit value of the character within the "current row".

On decoding the New UCS2 Row symbol, the symbol is discarded and the next 8 bits are read from the CD and stored by the UCS2 processor as the "current row octet". Subsequent UCS2 characters are decoded by treating the 8 bit character values decoded by the Huffman processor as characters within the "current row".

5.4 Compression Footer

Although Compressed Data Stream Length (CDSL) – the total number of octets that contain the CDS – is known, the CD element of the CDS is a bit stream and therefore may not end on an octet boundary. The Compression Footer (CF) is used to indicate the end of the CD as follows:

– Calculate the number of meaningful bits in the last octet of the CD (i.e. total CD bits modulo 8).

– If the number of meaningful bits is >0 and <6 store the number of meaningful bits in bits 2 to 0 of the last octet. Otherwise extend the CD by adding 1 octet and store the number of meaningful bits in bits 2 to 0 of this new octet. In the case where the number of meaningful bits is 8 then bits 2 to 0 of the new octet are set to zero.

For example if there are 4 meaningful bits in the last CD octet, the CF will be constructed to occupy the shaded area in table 7.

Table 7: CF with >0 and <6 meaningful bits in last octet (Example)

0

7

6

5

4

3

2

1

0

X

X

X

X

X

1

0

0

Alternatively if there are 6 meaningful bits in the last CD octet, a new octet needs to be added. The CF will be constructed to occupy the shaded area in table 8.

Table 8: CF with >5 meaningful bits in last octet (Example)

0

7

6

5

4

3

2

1

0

7

6

5

4

3

2

1

0

X

X

X

X

X

X

X

1

1

0

If there are 8 meaningful bits in the last CD octet, a new octet needs to be added. The CF will be constructed to occupy the shaded area in table 8a.

Table 8a: CF with 8 meaningful bits in last octet (Example)

0

7

6

5

4

3

2

1

0

7

6

5

4

3

2

1

0

X

X

X

X

X

X

X

X

X

0

0

0

In all the tables above, the bits in the shaded area which have no bit value defined are set according to the particular bearer being used to transport compressed data. e.g. CBS. Where no particular reference is made regarding the value of those bits they may be set to any value.