6 Compression processes

23.0423GPPCompression algorithm for text messaging servicesRelease 17TS

This clause defines the detailed operation of the various compression algorithms.

6.1 Overview

This subclause describes how the various compression algorithms are combined.

6.1.1 Compression

Table 9: Compression

Input

1) The nature of the compression to be performed.

2) The input stream of characters to be compressed.

Step 1

Construct the Compression Header so as to fully describe the nature of the compression to be performed as requested by higher software layers.

Note that it is the responsibility of higher software layers that use the compression algorithms to ensure that only those aspects of the compression algorithms that are supported by a particular implementation are requested.

Step 2

Initialize as defined by the CH the following components:

1) Character Set Converter

2) Punctuation Processor

3) Keyword Processor

4) UCS2 Processor

5) Character Group Processor

6) Huffman Processor

Step 3

If the Character set in which input stream is composed is different from that specified in the CH, convert the input stream so that it is rendered in the Character set (UCS2 or otherwise) specified in the CH.

Note that if characters in the input stream cannot be rendered in the character set specified in the CH, it is the responsibility of higher software layers that use the compression algorithms to detect this situation and take appropriate action.

Step 4

If the Punctuation Processor is enabled, use it to encode the character set converted input stream produced by Step 3 above.

Step 5

Set the current character position to the start of the character stream produced as the output of Step 4 above.

Step 6

If the Keyword processor is not enabled goto Step 7.

Examine the sequence of characters starting at the current character position in the input stream and determine if they can be represented by an entry in the keyword dictionary.

If an appropriate keyword is not found goto Step 7.

If the Character Group processor is enabled, pass it the Keyword symbol and Huffman encode to the CD the sequence of symbols output by it.

Huffman encode the Keyword symbol to the CD and then copy the bit sequence describing the keyword entry to the CD.

Goto Step 10.

Step 7

If the input stream is not UCS2 goto Step 8.

If the character at the current character position in the input stream has a different UCS2 row octet from the previous character Huffman encode the New UCS2 Row symbol to the CD and then copy the new row octet to the CD.

Remove the row octet from the character at the current character position in the input stream which will subsequently treated as an 8 bit value.

Step 8

If the Character Group processor is not enabled goto Step 9.

Pass the character at the current character position in the input stream to the Character Group processor and Huffman encode to the CD the sequence of symbols output by it.

Goto Step 10.

Step 9

Huffman encode the character at the current character position in the input stream.

Step 10

Increment the current character position by the number if input characters processed in steps 6 to 9 above.

If the entire input stream has not been processed goto Step 6 above.

Step 11

Construct the Compression Footer.

Output

The completed Compressed Data Stream.

Note that the possibility exists that the CDS may be larger than the original input stream. In this case it is the responsibility of higher software layers that use the compression algorithms to detect this situation and take appropriate action.

6.1.2 Decompression

Table 10: Decompression

Input

The Compressed Data Stream

Step 1

Interpret the Compression Header to determine the nature of the decompression to be performed.

Note that it is the responsibility of higher software layers that use the decompression algorithms to handle appropriately the case where the nature of the decompression to be performed is not supported by a particular implementation.

Step 2

Initialize as defined by the CH the following components:

1) Character Set Converter

2) Punctuation Processor

3) Keyword Processor

4) UCS2 Processor

5) Character Group Processor

6) Huffman Processor

Step 3

Interpret the Compression Footer to determine the total number of significant bits in the Compressed Data (CD). Set the total number of bits processed to zero.

Step 4

Read bits from the CD passing them to the Huffman decoder to generate the "current symbol". The bits should be read in the order bit 7 to bit 0 within each CD octet. CD octets are processed in the order 1 to n.

Step 5

If the Keyword processor is not enabled, goto Step 6.

If the "current symbol" is the Keyword symbol, read the bit sequence describing the keyword entry from the CD. Pass the keyword entry description to the Keyword processor for decoding and add the resulting sequence of characters representing the keyword to the output stream.

Goto Step 9.

Step 6

If the Character Group processor is not enabled goto Step 7.

If the "current symbol" is a Character Group Transition symbol, pass it to the Character Group processor so that the current group can be updated and goto Step 9.

If the value of the "current symbol" is in the range 0 to 255 (i.e. not a control symbol), pass the "current symbol" to the Character Group processor and set the new value of the "current symbol" to that returned by the Character Group processor.

Step 7

If the output stream is not UCS2 goto Step 8.

If the "current symbol" is the New USC2 Row symbol, read the new "current UCS2 row octet" from the CD and goto Step 9.

Pre-pend the "current UCS2 row octet" to the 8 bit value of the "current symbol" to produce a 16 bit UCS2 character.

Step 8

Add the "current symbol" to the output stream.

Step 9

Increment the total number of bits processed by the number of bits read from the CD in steps 4 to 8 above.

If the total number of bits processed is less than the total number of significant bits in the CD goto Step 4.

Step 10

If the Punctuation Processor is enabled, use it to decode output stream produced by steps 3 to 9 above.

Step 11

If the Character set (UCS2 or otherwise) specified in the CH, is different from that required by higher level software layers, convert the output stream produced by step 10 above so that it is rendered in the Character set (UCS2 or otherwise) required by higher level software layers.

Note that if characters in the stream cannot be converted, it is the responsibility of higher software layers that use the compression algorithms to detect this situation and take appropriate action.

Output

The decompressed original input stream.

6.2 Character sets

The need for character set conversion arises in that a number of the compression algorithms operate on the basis of "prior information" about the nature of human readable texts. For example Huffman frequency initializations may specify the an initial relative frequency for the letter "e" as opposed to the letter "x". Similarly, a keyword dictionary may contain the word "meeting".

Consider the case where a keyword dictionary contains the entry "£10,000" composed using the Code Page 850 character set. If an input stream containing the string "£10,000" also composed in Code Page 850 is processed, the string will be replace in the CD by a reference to the keyword entry. In contrast if the input string is composed using the GSM 7 bit default alphabet (3GPP TS 23.038 [1]) than a match between the input string and the keyword entry will not be found as the value of the "£" symbol in Code Page 850 is 156 decimal whereas in the GSM 7 bit default alphabet it is 2 decimal.

There can be no assumption that higher level software layers responsible for composing the original input stream to be compressed and displaying the resulting decompressed output stream use the same character set.

Thus:

– The character set used to compose initialization parameter sets and used for the compression of a given input stream shall be the same for both compression and decompression.

– Where an input stream is composed using a character set that is different from that used for compression it shall be converted prior to compression.

– Where an output stream is required in a character set that is different from that used for compression it shall be converted after decompression.

There is an additional requirement in that a number of the compression algorithms perform upper / lower case conversions upon the characters within the character set used for compression. The mapping between "lower" and "upper" case characters needs to therefore be known.

6.2.1 Initialization

Initialization of character set conversion processing will typically involve identifying and loading the appropriate tables to a) convert between character sets and b) convert between upper and lower case characters.

As the character set(s) in which uncompressed data is required to be rendered is largely an implementation specific matter, so is the precise specification of the tables to convert these to/from the character set specified for compression. However, they need to be sufficient to support the following functions:

6.2.2 Character set conversion

Table 11: Character set conversion

Input

1) The value of the source character.

2) The character set in which the source character is rendered.

3) The character set in which the source character is to be rendered.

Output

1) The value of the converted character.

2) A Boolean value indicating whether a successful conversion has been performed.

Process

If the source character can be rendered in the target character set, its value in the target characterset is returned and a successful conversion is indicated.

Otherwise, the value of the source character is returned unchanged, a conversion failure is indicated and higher software layers need to take appropriate action.

For example:

– The character "A", 65 decimal in Code Page 850 is rendered in the GSM 7 bit default alphabet also as 65 decimal so this value is returned and a successful conversion is indicated.

– The character "£", 156 decimal in Code Page 850 is rendered in the GSM 7 bit default alphabet as 1 decimal so the value 1 is returned and a successful conversion is indicated.

– The character "Û" 234 decimal in Code Page 850 cannot be rendered in the GSM 7 bit default alphabet so the value 234 is returned unchanged and a conversion failure is indicated.

6.2.3 Character case conversion

Conversion between upper and lower case for characters within the character set used for compression will also typically be supported by conversion tables that indicate for each character in the character set, the value of any lower case or upper case equivalent character such that the following function can be supported.

Table 12: Character case conversion

Input

1) The value of the source character.

2) The case (lower or upper) in which the source character is to be rendered.

Output

1) The value of the case converted character.

Process

If the character can be rendered in the case requested and the value of this case converted character is different from that of the source character, the value of the case converted character is returned.

Otherwise (i.e. the source character is already in the requested case or the character does not have upper and lower case equivalents), the value of the source character is returned unchanged.

6.3 Punctuation processing

The punctuation processor achieves compression by using the "prior information" that the uncompressed stream is human readable and is constructed of sentences that conform to a known set of punctuation rules. Essentially this means that certain characters within the input stream, of themselves imply information about subsequent characters and this may therefore be omitted from the compressed stream. In this way the algorithm achieves some significant compression in a very simple manner.

However, because the algorithm operates on information about sentence structure rather than the exact sequence of characters used to render this, it is non-symmetric. In other words, although the overall meaning of the human readable input stream is preserved between compression and decompression, the exact sequence of characters is not. Higher level software layers or even user inspection may therefore be required to determine if the use of this processor is appropriate for a given input stream.

In addition to the ability to handle the conversion of characters between upper and lower case (as described in the previous subclause), the processor requires that certain characters (expressed in the character set to be used for compression) are assigned special attributes. These are:

Table 13: special attributes

Attribute

Description

PU-IWS

Inter-word separator. A character with this attribute is that typically used to separate words within the input stream.

Only one character in the character set may have this attribute.

This attribute is typically set for the "space" character (32 decimal).

PU-LST

Last Sentence Terminator. A character with this attribute is that typically used to terminate the last sentence in the input stream.

Only one character in the character set may have this attribute.

This attribute is typically set for the "." full stop character (46 decimal).

PU-WSF

Word Separator Follows. A character with this attribute is expected to be followed by one or more characters which have the PU-IWS attribute set.

Any number of characters within the character set may have this attribute.

Examples of characters that would normally have this attribute set are the exclamation mark (!), comma (,), full stop (.), colon (:), semi-colon (;) and question mark (?).

PU-UCF

Upper Case Follows. A character with this attribute is expected to be followed by an upper case character such as occurs at the start of a sentence or paragraph.

Any number of characters within the character set may have this attribute.

Typically, characters with this attribute set will also have the PU-WSF attribute set. Examples are the exclamation mark (!), full stop (.), and question mark (?).

Other examples associated with new paragraphs might include the carriage return (13 decimal) and line feed (10 decimal) symbols.

PU-UCW

Upper Case Word. A character with this attribute set is expected to be upper case if it is a word i.e. if it is both preceded and succeeded by character with the PU-IWS attribute set.

Any number of characters within the character set may have this attribute.

An example in the English language is the letter "I".

PU-NSI

No Separator Insertion. A character with this attribute set is does not have the PU-IWS attribute set but is none the less expected to be preceded by a character for which the PU-WSF attribute is set.

Any number of characters within the character set may have this attribute.

Typically, characters with this attribute set will be numeric digits so that the case can be resolved where characters which have the PU-WSF attribute set such as comma (,) and full stop (.) can be used in number formatting as in the case of the string "£10,000.25".

6.3.1 Initialization

Initialization of the punctuation processor will typically involve loading a table containing the combination of attributes defined for each character in the character set to be used for compression for the language defined by the CLC.

6.3.2 Compression

For compression, the punctuation processor operates as follows:

Table 14: compression punctuation processor

Input

The input stream of characters to be compressed, rendered in the appropriate character set.

Step 1

Set the current character position to the start of the input stream.

Step 2

Determine the attributes of the current character.

If some previous character in the input stream has not had the PU-IWS attribute set goto Step 3.

If the current character has the PU-IWS attribute set goto Step 8.

Convert the current character to lower case and store the returned value as that of the "previous character". Store the attributes of the current character as those of the "previous character" after clearing any PU-UCW attribute.

Goto Step 8.

Step 3

If the previous character has the PU-WSF attribute and the current character has the PU-IWS attribute goto Step 8.

Otherwise clear the PU-WSF attribute for the "previous character".

Step 4

If the previous character has the PU-UCF attribute, convert the current character to lower case and clear the PU-UCF attribute for the "previous character".

Step 5

If the previous character has the PU-UCW attribute and the current character has the PU-IWS attribute, convert the previous character to lower case.

Step 6

If the previous character has the PU-IWS attribute and the current character has the PU-IWS attribute, goto Step 8.

Otherwise add the previous character to the output stream and set the value of the previous character to that of the current character.

Step 7

If the current character has the PU-UCW attribute and the previous character attributes do not contain the PU-IWS attribute, clear the PU-UCW attribute for the current character.

Set the attributes for the "previous character" to those of the current character.

Step 8

If the current character is the last character in the input stream and if some previous character in the input stream has not had the PU-IWS attribute set and if the previous character attributes contain neither the PU-IWS not the PU-LST attribute, add the previous character to the output stream.

Step 9

If the current character is not the last character in the input stream, read the next character from the input stream, set the current character to this value and goto Step 2.

Output

The de-punctuated data stream.

6.3.3 Decompression

For decompression, the punctuation processor operates as follows:

Table 15: decompression punctuation processor

Input

The de-punctuated stream of characters to be punctuated, rendered in the character set used for compression.

Step 1

Set the current character position to the start of the de-punctuated stream.

Step 2

Determine the attributes of the current character.

If the current character is the first character in the stream, convert it to upper case and goto Step 8.

Step 3

If the current character has the PU-IWS attribute and the "previous character" attributes has the PU-UCW attribute, convert the stored value of the "previous character" to upper case.

Step 4

If the "previous character" attributes contain the PU-UCF attribute, and the current character was not generated by Step 10 below, convert the current character to upper case and clear the PU-UCF attribute for the "previous character" attributes.

Step 5

If the "previous character" was generated as a result of Step 10 and the current character contains the PU-NSI attribute goto Step 7.

Step 6

Add the "previous character" value to the output stream.

Step 7

If "previous character" attributes contain the PU-IWS attribute and the current character has the PU-UCW attribute, add the PU-UCW attribute to those of the "previous character". Otherwise clear any PU-UCW attribute stored for the "previous character".

Step 8

Set the value of the "previous character" to be that of the current character.

Step 9

If the attributes of the current character contain the PU-UCF attribute set this attribute for the "previous character".

Step 10

If the attributes of the current character contain the PU-WSF attribute and the current character is not the last character in the de-punctuated stream, insert the character containing the PU-IWS attribute at the position following the current character in the de-punctuated stream.

Step 11

If the current character is not the last character in the de-punctuated stream, read the next character from the stream, set the current character to this value and goto Step 2.

Step 12

Add the previous character to the output stream.

If the current character attributes do not contain the PU-UCF attribute or the previous character value equals that of the character which has the PU-LST attribute set, add the character which has the PU-LST attribute set to the output stream.

Output

The punctuated data stream.

6.4 Keywords

The operation of the Keyword processor is controlled by the set of parameters defined by a Keyword Dictionary that is uniquely defined (within a CLC) by the value of the Keyword Dictionary ID (KD-ID) specified in the CH.

6.4.1 Dictionaries

A Keyword Dictionary specifies the following items:

1) Character Set ID

This is the character set in which the dictionary is composed and shall therefore be equal to the character set to be used for compression as specified in the CH.

2) Match Options

This is a collection of bit flags that control how text in the input stream is to be matched against key word dictionary entries. These are described in the table below in which Bit 0 is considered to be the lease significant bit of the Match Options value.

Table 16: Match options

Bit

Description

0

If set, input stream text shall exactly match the dictionary entry.

1

If set, input stream text may match the lower case conversion of a dictionary entry.

2

If set, input stream text may match the upper case conversion of a dictionary entry.

3

If set, input stream text may match the upper case conversion of the 1st character of a dictionary entry followed by the lower case conversion of the remaining characters of the dictionary entry.

4

If set, input stream text may match a dictionary entry prefixed by the keyword prefix characters (if any) described below.

5

If set, input stream text may match a dictionary entry suffixed by the keyword suffix characters (if any) described below.

6

If set, input stream text may match a part of a dictionary entry. A partial match occurs when, a dictionary entry contains n characters and a match is found with the first m characters where m is less than n.

7-

All other bits are reserved.

3) Keyword Prefix

The 1st octet is the Keyword Prefix Length which specifies the number of characters that form the prefix string. The length octet is followed by the actual characters of the prefix string.

4) Keyword Suffix

The 1st octet is the Keyword Suffix Length which specifies the number of characters that form the suffix string. The length octet is followed by the actual characters of the suffix string.

5) Keyword Threshold

This value determines the minimum number of characters in the input stream that needs to be replaced by a full match with a keyword entry. For a partial match the value of the threshold needs to be incremented by 2.

If a match occurs involving fewer characters than that specified by the threshold, keyword substitution does not take place.

6) Maximum Partial Match Length

This value determines the maximum number of characters in the input stream that needs to be replaced by a partial match with a keyword entry.

If a partial match occurs involving fewer characters than that specified by this value, keyword substitution does not take place.

7) Key Word Group List

The actual key word dictionary entries are not directly specified within the Keyword Dictionary. Instead, a set of key word dictionary entries is explicitly identified by a Key Word Group ID – an octet value that is unique within the language specified by the CLC. This approach allows the same set of keyword dictionary entries to be used in conjunction with different values for the parameters specified within the Keyword Dictionary and for Keyword Dictionaries to be defined that combine multiple Key Word Groups.

The 1st octet of the Key Word Group List specifies the number of Key Word Group IDs that follow, each of the following octets specifies a Key Word Group ID.

6.4.2 Groups

A Keyword Group specifies the following items:

1) Character Set ID

This is the character set in which the keyword dictionary entries are composed and shall therefore be equal to the character set to be used for compression as specified in the CH.

2) Number of Entries

The value specifies the number of keyword dictionary entries contained in the Keyword Group.

3) Keyword Entry

The 1st octet is the Keyword Entry Length which specifies the number of characters that form the keyword entry string. The length octet is followed by the actual characters of the entry string.

The sequence of entries within a dictionary needs to be known by both coder and decoder. Thus keyword entries in a Keyword Group needs to be sorted in ascending sequence of the actual characters of the entry string. Furthermore if a dictionary defines multiple Keyword Groups, the combined set of entries needs to be resorted as part of initialization of the Keyword processor so that the ascending alphanumeric sequence of entries is achieved for all entries in the combined set.

A further requirement is that all entries in the combined set shall be unique.

6.4.3 Matches

A Keyword Match specifies how a sequence of characters in the input stream is represented by a keyword dictionary entry. A Keyword Match is a bit stream that is interpreted left to right as described on the table below wherein Bit 0 refers to the most significant, left most bit.

Table 17:

Bits

Description

0 to N1

Case conversion.

If bit 0 of the Dictionary Match Options is set (i.e. Exact matching is enabled), the Case conversion bits are omitted and the Keyword Match starts with the Keyword Entry ID described below.

Otherwise, if the match involves a lower case conversion, a single Case conversion bit with value 0 is used.

Otherwise, 2 case conversion bits are used with the following value:

10 Upper Case.

11 1st character Upper case, remainder Lower case.

N1+1 to N2

Keyword Entry ID.

This value represents the position in the list of keyword dictionary entries of the entry with which a match has been found. A value of 0 indicates the first entry.

The number of bits used to express the Keyword Entry ID is minimum number of bits required to represent the total number of keyword dictionary entries defined for the Keyword Dictionary minus 1.

N2+1 to N3

Prefix Match.

If bit 4 of the Dictionary Match Options is set (i.e. Prefix matching is enabled), a single bit is used to indicate whether a prefix match applies (1) or not (0).

If prefix matching is not enabled, this bit is omitted from the Keyword Match.

N3+1 to N4

Partial Match.

If bit 6 of the Dictionary Match Options is set (i.e. Partial matching is enabled), a single bit is used to indicate whether a partial match has occurred (1) or not (0).

If partial matching is not enabled, this bit is omitted from the Keyword Match.

If partial matching is enabled and a full match has occurred, no further bits are required to describe the match.

If partial matching is enabled and a partial match has occurred, it is necessary to encode the length of the partial match as follows:

The partial match length equals the total number of characters in the input stream represented by the Keyword Match (excluding any characters represented by any prefix and suffix matches) less the value of the partial match threshold (i.e. Keyword Threshold +2).

If the partial match length is less than 8 a single bit (0) is added to the bit stream to indicate this fact followed by 3 bits containing the partial match length.

Otherwise a single bit (1) is added to the bit stream to indicate that more than 3 bits follow containing the partial match length. In this case the number of bits used to represent the partial match length is the minimum number of bits required to represent the value (Maximum Partial Match Length – (Keyword Threshold +2))

N4+1 to N5

Suffix Match.

If bit 5 of the Dictionary Match Options is set (i.e. Suffix matching is enabled), a single bit is used to indicate whether a suffix match applies (1) or not (0).

If suffix matching is not enabled, this bit is omitted from the Keyword Match.

6.4.4 Initialization

Initialization of the Keyword processor involves loading the various parameters specified by the KD-ID contained in the CH.

As noted above, if the dictionary is composed on more than 1 Keyword Group, the combined set of keyword entries needs to be resorted so that the full set conforms to an ascending alphanumeric sequence.

Clearly,as it is the total combined and sorted set of keyword entries that is required, implementors may choose to construct this from the component keyword groups at run time or to produce such a combination and use it directly as indicated by the constituent keyword group ID’s.

6.4.5 Compression

For compression, the Keyword processor operates as follows:

Table 18: compression Keyword processor

Input

A offset into the input stream of characters from which a matching keyword is to be found.

Step 1

Set the current character position to the input offset.

Step 2

If Prefix matching is not enabled goto Step 3.

If the string starting at the current character position exactly matches Keyword Prefix, record this fact and increment the current character position by the length of the prefix string.

Step 3

Identify the Keyword Entry ID and if enabled Case Conversion and Partial Match details for the longest match (i.e. that what whereby the greatest number of characters in the input stream are represented) between a dictionary entry and the string starting at the current character position subject to the following rules:

1) An exact match shall be greater than or equal to the Keyword Threshold to be considered.

2) A partial match shall be greater than or equal to the Keyword Threshold +2 to be considered.

3) If more than 1 partial match of equal length is found, the one with the greater Keyword Entry ID is used.

4) If an exact match and a partial match are found, the length of the partial match shall be at least 2 greater than that of the exact match for it to be used.

5) Although the case of more than 1 exact match of equal length being found is not possible as entries are unique, should such a case arise, the one with the greater Keyword Entry ID is used.

If the longest match is a partial match with length greater than the Maximum Partial Match Length, the match length is limited to the Maximum Partial Match Length.

If no match has been found goto Step 5.

Step 4

If Suffix matching is not enabled goto Step 5.

If the string starting at the current character position exactly matches Keyword Prefix, record this fact and increment the current character position by the length of the prefix string.

Step 5

If a matching keyword has been found, construct the Keyword Match bitstream.

Output

A Keyword Match bitstream or an indication that no suitable match is available.

6.4.6 Decompression

For decompression, the Keyword processor operates as follows:

Table 19: decompression Keyword processor

Input

A Keyword Match bitstream.

Step 1

Interpret the Keyword Match bitstream to determine if there is a Prefix match. If so add the Keyword Prefix string to the string to be output.

Step 2

Interpret the Keyword Match bitstream to identify the dictionary entry or part thereof as indicated by any Partial Match details.

Perform any case conversion (indicated by the Keyword Match bitstream) on the dictionary entry string and add the resulting string to the string to be output.

Step 3

Interpret the Keyword Match bitstream to determine if there is a Suffix match. If so add the Keyword Suffix string to the string to be output.

Output

The character string represented by the input Keyword Match bitstream.

6.5 UCS2

6.5.1 Initialization

Initialization of the USC2 processor involves storing the default UCS2 row as specified by the CH.

6.5.2 Compression

For compression, the UCS2 processor operates as follows:

Table 20:

Input

A 16 bit UCS2 character value.

Step 1

If the row octet of the input character is different from the "current UCS2 row" store the row octet of the input character as the new "current UCS2 row".

Output

A Boolean value indicating whether the current UCS2 row has been changed.

6.5.3 Decompression

For decompression, the USC2 processor needs to set and sense the "current UCS2 row" as required by the higher level software described in subclause 6.1.2 above.

6.6 Character group processing

The operation of the Character Group processor is controlled by the set of parameters defined by a Character Group that is uniquely defined (within a CLC) by the value of the Character Group ID (CG-ID) specified in the CH.

Character grouping operates by defining 2 or more subsets (groups) of characters within the character set used for compression with the following properties:

– Each sub set contains the same number of characters.

– One subset (referred to as Group 0 or the "base group" contains the characters expected to have higher frequencies in a input stream than those of the characters in other subsets.

– Input stream are expected to contain contiguous sequences of characters belonging to a single group.

Compression is achieved by assigning a 1:1 mapping between the characters in the base group and those in the other groups and when appropriate signalling a transition between groups and then continuing to encode base group characters. This has the effect of improving the performance of the Huffman encoder by reducing the need to add new characters to the tree and by maintaining a smaller overall tree with a more distinct frequency distribution.

For example, assume that we have a character set that comprises just the numeric digits 0 to 9 and the letters A to B and 3 groups containing the digits 1 to 3, 4 to 6 and 0 and 7 to 9. The digits 1 to 3 are considered to be the most frequent and are therefore the base group. The digit 0 is defined to exist in all the groups and the letters A and B do not occur in any group.

Encoding and decoding of characters is achieved using the various items in table 21.

Table 21: Encoding and decoding of characters

Item

Element

Comment

Value

0

1

2

3

4

5

6

7

8

9

10

11

Decimal character value

Character

0

1

2

3

4

5

6

7

8

9

A

B

Character symbol

Group 0

1

1

1

1

0

0

0

0

0

0

0

0

Bit flags for Group 0

Group 1

1

0

0

0

1

1

1

0

0

0

0

0

Bit flags for Group 1

Group 2

1

0

0

0

0

0

0

1

1

1

0

0

Bit flags for Group 2

Fold 0

0

1

2

3

1

2

3

1

2

3

A

B

Group 0 Conversions

Fold 1

0

4

5

6

4

5

6

7

8

9

A

B

Group 1 Conversions

Fold 2

0

7

8

9

4

5

6

7

8

9

A

B

Group 2 Conversions

The items Group 0, Group 1 and Group 2 simply enable the determination of whether a given character is a member of the given group by checking the value of the Group x element associated with the value of the character.

The elements of the Fold 0 item associated with the members of a given group represent the characters within Group 0 to which the characters of the given group are mapped. For example character 4 in Group 1 is mapped to character 1 in Group 0.

The elements of the Fold 1 and Fold 2 items provide the reverse mapping in that the elements associated with membership of Group 0 represent the characters in Groups 1 or 2 that are associated with the Group 0 characters.

Thus if the "current group" is Group x, a character with value c can be encoded as follows:

– If c is a member of Group x or not a member of any group, element c of Fold 0 is output.

– If c is not a member of Group x it can be output as a "literal" which is element c of Fold y where Group y has c as a member alternatively a change of group can be signalled.

Similarly, if the "current group" is Group x, a character with value c can be decoded as follows:

– If c is a member of Group x or x is not 0 then, element c of Fold x is output.

– Otherwise the value c is output unchanged.

The detailed operation of the Character Group processor (described below) primarily extends these simple rules to optimize the case where a choice between a "literal" or a group change arises.

6.6.1 Character Groups

A Character Group specifies the following items:

1) Character Set ID

This is the character set in which the character group is composed and shall therefore be equal to the character set to be used for compression as specified in the CH.

2) Number of Groups

This value specifies the number of groups to be defined. The maximum value is 8.

3) Group Transition Controls

Group transitions are signalled through the use of the Character Group Transition symbols in the decimal range 259 to 265.

If the Number of Groups is N, (N-1) Character Group Transition symbols shall be specified such that if the "current group" is x one Character Group Transition symbol is allocated to signify a transition to each of the other (N-1) groups.

4) Fold Tables

These are the inter-group character conversion tables described above. One is required for each group defined.

5) Group Membership

This is an array of octets, one for each character in the character set. The 1st octet in the array contains bit flags indicating the group membership of the character value 0 and so on.

Within each octet, bit 0 (least significant) indicates membership of Group 0, bit 1 that of Group 1 and so on.

6.6.2 Initialization

Initialization of the Character Group processor involves loading the various parameters specified by the CG-ID contained in the CH.

Additionally on initialization, the "current group" is assumed to be Group 0.

6.6.3 Compression

For compression, the Character Group processor operates as follows:

Table 22: compression Character Group processor

Input

1) A single symbol to be encoded.

2) An indication that this is the last symbol to be encoded.

Step 1

Set the number of output symbols to zero.

Step 2

If the input symbol is not the Keyword symbol, goto Step 3.

If a previous input symbol is being held, add this as a "literal" to the output sequence by calculating the value of the element indicated by the value of the previous symbol in the fold table associated with the group of the previous symbol and increment the number of output symbols and clear the previous symbol.

Goto Step 9.

Step 3

If the input symbol is a member of no group or a member of the current group, set the group for the input symbol to be the current group.

Otherwise, if a previous input symbol is being held and the input symbol is a member of the group of the previous symbol, set the group for the input symbol to be that of the previous symbol.

Otherwise, test the input symbol for membership of each group in ascending order of groups starting with group 0 and set the group for the input symbol to be that for which membership is first detected.

Step 4

If a previous input symbol is not being held goto Step 5.

If the input symbol group equals the previous symbol group:

– Add the Character Group Transition symbol that indicates a transition from the current group to the previous symbol group to the output sequence and increment the number of output symbols.

– Set the current group to the previous symbol group.

– Encode the previous symbol by calculating the value of the element indicated by the value of the previous symbol in the fold table associated with the base group and add this value to the output sequence and increment the number of output symbols.

– Encode the input symbol by calculating the value of the element indicated by the value of the input symbol in the fold table associated with the base group and add this value to the output sequence and increment the number of output symbols.

– Clear the previous symbol.

– Goto Step 9.

Otherwise, encode the previous symbol as a "literal" by calculating the value of the element indicated by the value of the previous symbol in the fold table associated with the group of the previous symbol group and add this value to the output sequence and increment the number of output symbols and clear the previous symbol.

Step 5

If the input symbol group is the base group and the current group is not the base group, add the Character Group Transition symbol that indicates a transition from the current group to the base group to the output sequence and increment the number of output symbols. Set the current group to be the base group.

Step 6

If the input symbol group is the base group or the current group:

– Encode the input symbol by calculating the value of the element indicated by the value of the input symbol in the fold table associated with the base group and add this value to the output sequence and increment the number of output symbols.

– Goto Step 9.

Step 7

If the input symbol is the last symbol to be encoded:

– Encode the input symbol as a "literal" by calculating the value of the element indicated by the value of the input symbol in the fold table associated with the group of the input symbol and add this value to the output sequence and increment the number of output symbols.

– Goto Step 9.

Step 8

Set the previous symbol to be the value of the input symbol and set the group for the previous symbol to be that of the input symbol.

Step 9

Output the number of output symbols and the associated symbols.

Output

A count of the number of encoded symbols output and a sequence of encoded symbols.

6.6.4 Decompression

For decompression, the Character Group processor operates as follows:

Table 23: Decompression Character Group processor

Input

A single symbol to be decoded.

Step 1

If the symbol is a Character Group Transition symbol, update the "current group" to be that indicated by the Character Group Transition.

Goto Step 3.

Step 2

If the input symbol is a member of the "current group" or the "current group" is not the base group, calculate the value of the decoded symbol as that given by the element indicated by the value of the input symbol in the fold table associated with the "current group".

Otherwise set the value of the decoded symbol to that of the input symbol.

Step 3

If a decoded symbol has been generated indicate this fact.

Output

The decoded symbol or an indication that no symbol has been generated.

6.7 Huffman coding

As described in subclause 4.2, Huffman encoding requires the set of characters that may be encoded to be represented within a binary tree structure. The tree is constructed of "nodes" which have the following properties:

– A Parent node. A node that has no parent is the "root" node.

– Up to 2 Child nodes. A node that has no children is a "leaf" node.

– Character value. If the node is a leaf node it represents a character represented within the tree.

– Weight. If the node is a leaf node, the weight is the frequency with which the associated character has occurred in the input stream. Otherwise the weight is simply the sum of the weights of the nodes children.

Typically, a tree will be implemented as an array of node structures and parent / child details for a given node will be represented by the index of the appropriate node within the array.

Every node in the tree (except the root node or in the case where the tree contains just a single leaf node) has a "sibling" – the other node that shares the same parent node.

For the binary tree to be a Huffman tree its construction needs to display a further property. This is that the nodes can be listed in ascending order of weight and in so doing every node is adjacent to its sibling in the list. This property needs to be preserved at all times – when the tree is initially created, when a new leaf node is added to the tree to represent a new character and when the frequency of a leaf node is incremented as a new instance of that character is processed.

The ordering of nodes is also significant in that it will determine which of the siblings is the "left-hand" as opposed to "right-hand" of the sibling pair. Encoding a symbol involves navigating the tree from leaf to root and emitting a bit to the encoded stream the value of which depends on whether the current node is the left or right hand sibling. If the node is a left hand node, the bit value is 0 and if it is a right hand node, the bit value is 1. Assuming that the 1st element of the array of nodes has an index value of 0, this means that left hand nodes will have even numbered indices and right hand nodes will have odd numbered indices.

Node weights are assumed to be 16 bit unsigned values and this means that the potential exists for these values to overflow. To handle this case, the algorithm defines a maximum weight value for the root node. If this is to be exceeded, the weights of all leaf nodes are divided by 2 and the tree is rebuilt. The maximum value for the root weight is defined to be 8000 (hex).

Although the bit sequence representing the encoded symbol is discovered in the order of traversing the tree from leaf to root, for decoding the bit sequence needs to be processed in the order that describes the navigation of the tree from root to leaf. Thus the entire encoding bit sequence needs to be collected in some temporary variable and emitted to the output stream in reverse order. For example if the passage from leaf to root is described by the sequence 010011, the bits added to the output stream would be 110010. The need to collect the bits in a temporary variable also introduces the potential for this value to overflow. Given the maximum value for the root node weight described above, a 32bit variable is suitable of containing all possible bit sequences.

If a symbol that does not already exist in the tree is to be encoded, either the "New 7bit Character" or the "New 8bit Character" is encoded, the lower 7 bits of the new character value are then added literally to the out put stream and the new character needs to be added to the tree. This is done by splitting the "lightest" node (the first node in the list ordered by ascending weight) such that it becomes a parent node whose right hand child is the leaf node that was originally represented by the node being split and the left hand child is a new leaf node representing the new character. The new leaf is initially created with a weight of 0 but this is immediately updated as described below.

If a new symbol has been added to the tree or a new instance of an existing symbol processed, the weight for the associated leaf node needs to be incremented and the tree updated to preserve the "sibling" property.

The tree is updated in the following manner. If the node a position x in the ascending weight ordered list has had its weight incremented by 1, the list needs to be scanned from position x in ascending weight order to identify the node at position y such that the node at position (y+1) is the first node encountered that has a weight greater than or equal to the new weight of the node at position x. The nodes at x and y are then "swapped" in terms of their position in the list and their parents while maintaining all other attributes. This process of weight increment and swapping is then repeated for the parent of the node at position y until the root node is reached.

The operation of the Huffman processor is controlled by the set of parameters defined by a Huffman Initialization that is uniquely defined (within a CLC) by the value of the Huffman Initialization ID (HI-ID) specified in the CH.

6.7.1 Initialization Overview

A Huffman Initialization specifies the following items:

1) Character Set ID

This is the character set in which the Huffman Initialization is composed and shall therefore be equal to the character set to be used for compression as specified in the CH.

2) Options

This is a collection of bit flags that control how the processor is to operate. These are described in table 24 in which Bit 0 is considered to be the lease significant bit of the Match Options value.

Table 24: collection of bit flags

Bit

Description

0

If set, weights for leaf nodes representing control symbols (other than New 7 bit character and New 8 bit character symbols) are to be updated.

1

If set, weights for leaf nodes representing control symbols are to be updated.

2

All other bits are reserved.

3) The Character Group ID with which these initializations may operate.

4) Number of initial symbol frequencies

2 values representing the cases where the Character Group processor is enabled or disabled.

These are counts of the number of characters or control symbols for which there are following initial frequencies defined.

As this initializations will vary significantly depending on whether the Character Group processor is enabled 2 sets of initializations are provided to cover both cases.

5) Initial frequencies

Two sets of initialization values are supplied as described above.

Any control symbol that may occur when processing an input stream needs to be represented within the tree, prior to the first character of the input stream being processed. These symbols shall therefore be handled by the initialization process. This is achieved by :

– The frequency initialization data will always include all control symbols that might occur for any stream. Thus the New 7bit character, New 8bit character, New UCS2 Row and Keyword symbols will always be included and if the initialization set is that for the case where the specified Character Group ID is enabled, the associated Character Group Transition symbols will also be included.

– For a given input stream, the frequency initialization process (described in subclause 6.7.2 below) will determine whether a control symbol contained in the frequency initialization data can occur in the input stream based on the information contained in the CH. If it is determined that a control symbol contained in the frequency initialization data can NOT occur in the input stream, this symbol will not be added to the Huffman tree.

Frequency initialization data comprises the value of the character or symbol and the initial frequency for that symbol.

– The order in which character or symbol values and their associated initial frequencies are stated is significant and this order must be preserved when these items are loaded as part of the Huffman Initialisation process. Frequency Initialisation data must be stated in ascending order of character or symbol initial frequency.

6.7.2 Initialization

Initialization of the Huffman processor involves loading the various parameters specified by the HI-ID contained in the CH.

The appropriate set of frequency initialization data is selected depending on whether the Character Group processor is enabled.

Leaf nodes are created for each symbol for which a frequency initialization is specified, subject to the following rules:

– Leaf nodes must be created within the array of Huffman tree nodes in exactly the same ascending order in which they are stated in the Huffman Initialisation data.

– If the character set specified for compression is the GSM 7 bit default alphabet, leaf nodes are not created for the New 8bit Character and the New UCS2 Row symbols.

– If the character set specified for compression is not UCS2 a leaf node is not created for the New UCS2 Row symbol.

– If the Keyword processor is disabled, no leaf node is created for the Keyword symbol.

The initial tree is then built as described below – rescaling is not indicated.

6.7.3 Build Tree

To build the tree, the Huffman processor operates as follows:

Table 25: Build Tree, Huffman processor operation

Input

1) The array of Huffman tree nodes.

2) A Boolean value indicating whether frequencies need to be rescaled as a result of the root node weight becoming the maximum value.

Step 1

Assemble all leaf nodes, preserving their ascending weight order at the start of the node array. This is achieved by setting the “current node” and “assembled leaf” node position to the base of the array. If the current node is a leaf node, set the symbol and frequency associated with assembled leaf node to those of the current node and increment the assembled leaf node position. Increment the current node position and repeat this process until the current node becomes the root node.

If rescaling is requested recalculate each leaf node weight as (current weight+1)/2.

Set the current node to the start of the array.

Step 2

Create a parent node for the current node and the next node and insert it into the array at position x where the node at position (x+1) is the first node with a weight greater than that of the newly created node.

If the newly created node is not the root node, increment the current node by 2 and goto Step 2.

Output

A completed Huffman tree.

6.7.4 Update Tree

To update the tree, the Huffman processor operates as follows:

Table 26: Update Tree, Huffman processor operation

Input

The symbol whose frequency is to be incremented by 1.

Step 1

If the weight of the root node +1 is greater than 0 x 8000 build the tree indicating that resealing is required.

Step 2

Increment the weight of the leaf node associated with the input symbol by 1 and "swap" it with the node at position y such that the node at position (y+1) is the first node encountered in the order list that has a weight greater than or equal to the new weight of the incremented leaf node.

Repeat this process of weight increment and "swap" for the parent of the node at position y until the node at position y becomes the root node.

Output

An updated Huffman tree.

6.7.5 Add New Node

To add a new node, the Huffman processor operates as follows:

Table 27: Add New Node, Huffman processor operation

Input

The symbol to be added to the tree.

Step 1

Splitting the "lightest" node (the first node in the list ordered by ascending weight) such that it becomes a parent node whose right hand child is the leaf node that was originally represented by the node being split and the left hand child is a new leaf node representing the new input symbol. The new leaf node is initially created with a weight of 0.

Step 2

Update the tree (as above) passing the new symbol as the input parameter.

Output

An updated Huffman tree.

6.7.6 Compression

For compression, the Huffman processor operates as follows:

Table 28: Compression, Huffman processor operation

Input

A character from the input stream or control symbol.

Step 1

If there is no existing leaf node for the input symbol set the "source" symbol to be either the New 7bit or New 8bit symbol depending on the value of the input symbol.

Otherwise set the source symbol to be the input symbol.

Step 2

Traverse the tree from the leaf node associated with the source symbol to the root node while generating the Huffman bit sequence.

Step 3

Reverse the generated Huffman bit sequence and add it to the output bitstream.

Step 4

If the source symbol equals the input symbol goto Step 5.

Add the lower 7 bits of the input symbol to the output bitstream.

Add a new node for the input symbol.

Update the tree for the input symbol.

Goto Output.

Step 5

If the input symbol value is less than 256 and bit 0 of the Huffman Initialization Options value is set, update the tree for the input symbol and goto Output.

Step 6

If the input symbol value is greater than or equal 256 and bit 1 of the Huffman Initialization Options value is set, update the tree for the input symbol.

Output

A Huffman bitstream.

6.7.7 Decompression

For decompression, the Huffman processor operates as follows:

Table 29: Decompression, Huffman processor operation

Input

A bit stream.

Step 1

Traverse the tree from the root node to a leaf node as indicated by the value of the bits read from the front of the input bitstream.

Step 2

If the symbol associated with the leaf node identified in step 1 is neither the New 7bit nor New 8bit symbol, goto Step 3.

Set the lower 7 bits of the output symbol to be next 7 bits read from the input bitstream and set bit 7 as indicated.

Add a new node for the output symbol.

Update the tree for the output symbol.

Goto Output.

Step 3

Set the output symbol to the symbol associated with the leaf node from Step 1.

Step 4

If the output symbol value is less than 256 and bit 0 of the Huffman Initialization Options value is set, update the tree for the output symbol and goto Output.

Step 5

If the input symbol value is greater than or equal 256 and bit 1 of the Huffman Initialization Options value is set update the tree for the output symbol.

Output

A decoded symbol.