Clever Geek Handbook

ASCII85

Ascii85 (also known as "Base85") is a binary-to-text encoding developed by Paul E. Rutter for the btoa utility. Because it uses 5 ASCII characters to encode 4 bytes of data (the encoded data is ¹⁄₄ larger than the original, assuming 8-bit ASCII characters), it is more efficient than uuencode or Base64, which encode 3 bytes as 4 characters (an increase of ¹⁄₃ under the same conditions).

It is mainly used in Adobe's PostScript and Portable Document Format (PDF) file formats.


The main idea

Text encoding of data is needed mainly to transmit binary data over existing protocols that were designed exclusively for text (for example, e-mail). Such protocols can only guarantee the transmission of 7-bit values (and ASCII control characters must be avoided), may require line breaks to be inserted to limit line length, and may also allow whitespace to be inserted as indentation. As a result, only the 94 printable ASCII characters can be used.

4 bytes can hold 2³² = 4,294,967,296 different values. 5 digits in base 85 give 85⁵ = 4,437,053,125 different values, which is enough to represent every 32-bit value uniquely. Five digits in base 84 provide only 84⁵ = 4,182,119,424 values. Therefore 85 is the smallest base in which 4 bytes can be encoded with five digits, which is why it was chosen.
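The arithmetic above can be checked directly. The following is a minimal sketch; the names `LIMIT` and `smallest_base` are illustrative and do not come from any specification.

```python
# Find the smallest base b for which five base-b digits
# can represent every 32-bit value (i.e. b**5 >= 2**32).
LIMIT = 2 ** 32

smallest_base = next(b for b in range(2, 256) if b ** 5 >= LIMIT)

print(smallest_base)       # 85
print(84 ** 5 < LIMIT)     # True: base 84 falls short
print(85 ** 5 >= LIMIT)    # True: base 85 is sufficient
```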

When encoding, the data stream is divided into groups of 4 bytes, and each group is treated as a 32-bit number with the most significant byte first. Repeated division by 85 yields the 5 digits of that number in base 85. Each digit is then encoded as a printable ASCII character and written to the output stream, most significant digit first.

The base-85 digits are converted to ASCII characters by adding 33, which gives the characters with codes from 33 (“!”) to 117 (“u”).

Since runs of zeros are not rare, an exception is made for the sake of extra compression: a group of four zero bytes is encoded as the single character “z” instead of “!!!!!”.
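The encoding of one complete group, as described above, might be sketched as follows. The function name `encode_group` is hypothetical, and the sketch handles only full 4-byte groups.

```python
def encode_group(group: bytes) -> str:
    """Encode exactly 4 bytes as 5 Ascii85 characters (or 'z' for all zeros)."""
    if len(group) != 4:
        raise ValueError("a complete group must be exactly 4 bytes")
    value = int.from_bytes(group, "big")  # most significant byte first
    if value == 0:
        return "z"  # shortcut for four zero bytes
    digits = []
    for _ in range(5):
        value, digit = divmod(value, 85)  # least significant digit first
        digits.append(digit)
    # Emit the most significant digit first, shifting each digit by 33.
    return "".join(chr(d + 33) for d in reversed(digits))


print(encode_group(b"Man "))   # 9jqo^
print(encode_group(bytes(4)))  # z
```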

A group of characters that decodes to a value greater than 2³² − 1 (which is itself encoded as “s8W-!”) causes a decoding error, as does the character “z” inside a group. All whitespace between characters is ignored and can be inserted freely for convenient formatting.

The main drawback of Ascii85 is that the encoded text can contain characters (such as the slash and quotation marks) that have special meanings in programming languages and text protocols.
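The decoding rules and error cases above can be illustrated with a short sketch. The name `decode_group` is hypothetical; the function assumes whitespace has already been stripped and that it receives either a single “z” or one complete five-character group.

```python
def decode_group(chars: str) -> bytes:
    """Decode 'z' or one 5-character Ascii85 group into 4 bytes."""
    if chars == "z":
        return bytes(4)  # shortcut for four zero bytes
    if len(chars) != 5:
        raise ValueError("a complete group must be exactly 5 characters")
    value = 0
    for ch in chars:
        digit = ord(ch) - 33
        if not 0 <= digit < 85:
            # Also rejects 'z' inside a group, since ord('z') - 33 == 89.
            raise ValueError(f"invalid Ascii85 character: {ch!r}")
        value = value * 85 + digit
    if value > 0xFFFFFFFF:
        raise ValueError("group decodes to a value above 2**32 - 1")
    return value.to_bytes(4, "big")


print(decode_group("9jqo^"))  # b'Man '
print(decode_group("s8W-!"))  # b'\xff\xff\xff\xff'
```

The group `s8W-"` is one step above “s8W-!” and therefore raises `ValueError` in this sketch.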

btoa

The original btoa program always encoded complete groups (the last one was padded with zeros). It wrote the line “xbtoa Begin” before the encoded text and “xbtoa End” after it, followed by the length of the source file (in decimal and hexadecimal) and three 32-bit checksums. The decoder used the source length to determine how many padding zeros had been added.

This program also supported the special value “ z ” for encoding zeros (0x00000000), as well as “ y ” for a group of four spaces (0x20202020).

Adobe

Adobe adapted the btoa encoding with some changes and named it Ascii85. In particular, the delimiter “~>” was added to mark the end of the encoded string and to determine where to truncate the decoded data to its correct length. This works as follows: if the last block contains fewer than 4 bytes, it is padded with zero bytes before encoding, and after encoding as many characters are removed from the end of the final group of five as zero bytes were added.

When decoding, the last block is padded to a length of 5 with the character “u” (digit value 84, ASCII code 117), and after decoding the same number of bytes is removed (see the example below).

Note: The padding character is not arbitrary. In Base64, re-encoding simply regroups the bits; neither their order nor their values change (the high-order bits of the input do not affect the low-order bits of the result). When converting to base 85 (85 is not a power of two), the high-order bits of the input do affect the low-order digits of the result, and likewise in the reverse conversion. Padding with the minimum value (0) when encoding and with the maximum digit (84) when decoding ensures that the high-order bits are preserved.

Whitespace and line breaks can be inserted anywhere in a block of Ascii85 text, including inside a five-character group. They must simply be ignored.
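The padding and truncation rules for the final incomplete block can be sketched as below. The names `encode_tail` and `decode_tail` are hypothetical, and the sketch assumes a tail of 1 to 3 bytes (or 2 to 4 characters) with no surrounding “~>” delimiter.

```python
def encode_tail(tail: bytes) -> str:
    """Encode a final block of 1-3 bytes as 2-4 Ascii85 characters."""
    n = len(tail)
    if not 1 <= n <= 3:
        raise ValueError("tail must be 1 to 3 bytes")
    value = int.from_bytes(tail + bytes(4 - n), "big")  # pad with zero bytes
    digits = []
    for _ in range(5):
        value, digit = divmod(value, 85)
        digits.append(digit)
    chars = "".join(chr(d + 33) for d in reversed(digits))
    return chars[: n + 1]  # drop one character per padding byte


def decode_tail(chars: str) -> bytes:
    """Decode a final block of 2-4 Ascii85 characters into 1-3 bytes."""
    k = len(chars)
    if not 2 <= k <= 4:
        raise ValueError("tail must be 2 to 4 characters")
    value = 0
    for ch in chars + "u" * (5 - k):  # pad with the maximum digit, 84
        value = value * 85 + (ord(ch) - 33)
    return value.to_bytes(4, "big")[: k - 1]  # drop one byte per padding 'u'


print(encode_tail(b"."))   # /c
print(decode_tail("/c"))   # b'.'
```

Padding with “u” rather than “!” matters here: decoding “/c!!!” would give a value below 771,751,936 and so a first byte of 45 instead of 46.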

The Adobe specification does not contain the “ y ” extension for four spaces.

Example

For example, the historical slogan of Wikipedia,

Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.

encoded in Ascii85, looks like this:

<~9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
O<DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c~>

Text:                  "Man " ... "sure"
ASCII codes:           77 97 110 32 ... 115 117 114 101
Binary:                01001101 01100001 01101110 00100000 ... 01110011 01110101 01110010 01100101
Decimal value:         1 298 230 816 = 24 × 85⁴ + 73 × 85³ + 80 × 85² + 78 × 85 + 61 ... 1 937 076 837 = 37 × 85⁴ + 9 × 85³ + 17 × 85² + 44 × 85 + 22
Base-85 digits (+33):  24 (57) 73 (106) 80 (113) 78 (111) 61 (94) ... 37 (70) 9 (42) 17 (50) 44 (77) 22 (55)
Ascii85:               9 j q o ^ ... F * 2 M 7

Since the last group of four bytes is incomplete, it must be padded with zeros:

Text:                  . \0 \0 \0
ASCII codes:           46 0 0 0
Binary:                00101110 00000000 00000000 00000000
Decimal value:         771 751 936 = 14 × 85⁴ + 66 × 85³ + 56 × 85² + 74 × 85 + 46
Base-85 digits (+33):  14 (47) 66 (99) 56 (89) 74 (107) 46 (79)
Ascii85:               / c Y k O

We added 3 bytes when encoding, so we must remove the last three characters, 'YkO', from the result.

Decoding is completely symmetrical, except for the last group of five, which we pad with 'u' characters:

Ascii85:               / c u u u
Base-85 digits (+33):  14 (47) 66 (99) 84 (117) 84 (117) 84 (117)
Decimal value:         771 955 124 = 14 × 85⁴ + 66 × 85³ + 84 × 85² + 84 × 85 + 84
Binary:                00101110 00000011 00011001 10110100
ASCII codes:           46 3 25 180
Text:                  . [ETX] [EM] (not defined in ASCII)

Since we added 3 'u' characters, we must remove the last 3 bytes from the result. This gives a message of the original length.

The original example contained no group of four zero bytes, so the abbreviated form 'z' does not appear in the result.
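The groups worked through in this example can be cross-checked against an existing implementation. The sketch below uses the `a85encode` and `a85decode` functions from Python's standard `base64` module, whose `adobe=True` option adds or expects the `<~` and `~>` framing and whose `foldspaces=True` option enables the btoa “y” shortcut.

```python
import base64

# A complete group and the trailing partial group from the example.
print(base64.a85encode(b"Man "))    # b'9jqo^'
print(base64.a85encode(b"sure."))   # b'F*2M7/c'

# Adobe framing; whitespace inside the encoded text is ignored on decoding.
print(base64.a85encode(b"sure.", adobe=True))          # b'<~F*2M7/c~>'
print(base64.a85decode(b"<~F*2M 7/c~>", adobe=True))   # b'sure.'

# Shortcuts for four zero bytes and (btoa only) four spaces.
print(base64.a85encode(bytes(4)))                      # b'z'
print(base64.a85encode(b"    ", foldspaces=True))      # b'y'
```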

Compatibility

Ascii85 encoding is compatible with both 7-bit and 8-bit MIME, with less size overhead than Base64.

The only potential problem is that Ascii85 output may contain characters that must be escaped in markup languages such as XML or SGML, for example single and double quotes, angle brackets and ampersands ('"<>&).

The April Fools' RFC 1924 for writing IPv6 addresses

Published on April 1, 1996, the informational RFC 1924, “A Compact Representation of IPv6 Addresses”, proposes encoding IPv6 addresses as numbers in base 85 (base-85, by analogy with base-64). The proposal differs from the schemes above in two ways: it uses a different set of 85 ASCII characters, and it treats all 128 bits as a single number, converted to 20 characters, rather than as groups of 32 bits. Whitespace is also not allowed.

The proposed character set, in ascending order of digit value, is: 0-9, A-Z, a-z and 23 more characters !#$%&()*+-;<=>?@^_`{|}~ . The largest value that fits into the 128 bits of an IPv6 address, 2¹²⁸ − 1 = 74 × 85¹⁹ + 53 × 85¹⁸ + 5 × 85¹⁷ + ..., is written as =r54lj&NUUO~Hi%c2ym0 .

The character set was chosen to exclude the most problematic characters ( "',./:[]\ ) that need to be escaped in some formats, for example JSON. However, the set still contains characters that must be escaped in SGML-based markup languages such as XML.
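The RFC 1924 scheme described above might be sketched as follows. The names `RFC1924_ALPHABET`, `ipv6_to_base85` and `base85_to_ipv6` are illustrative; the alphabet string is the character set listed above, and address parsing relies on Python's standard `ipaddress` module.

```python
import ipaddress

# 0-9, A-Z, a-z, then the 23 additional characters, in order of digit value.
RFC1924_ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+-;<=>?@^_`{|}~"
)


def ipv6_to_base85(address: str) -> str:
    """Write an IPv6 address as a single 20-digit base-85 number."""
    value = int(ipaddress.IPv6Address(address))
    chars = []
    for _ in range(20):
        value, digit = divmod(value, 85)  # least significant digit first
        chars.append(RFC1924_ALPHABET[digit])
    return "".join(reversed(chars))


def base85_to_ipv6(text: str) -> str:
    """Inverse of ipv6_to_base85; returns the compressed address form."""
    if len(text) != 20:
        raise ValueError("expected exactly 20 characters")
    value = 0
    for ch in text:
        value = value * 85 + RFC1924_ALPHABET.index(ch)
    return str(ipaddress.IPv6Address(value))


encoded = ipv6_to_base85("1080::8:800:200c:417a")
print(encoded, base85_to_ipv6(encoded))
```

Unlike Ascii85, this treats the whole 128-bit address as one number, so the result is not the same as encoding the address in four 32-bit groups.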

See also

  • Base64

Links

  • btoa and atob: sources of the original program (1990)
  • PostScript Language Reference (Adobe) - see ASCII85Encode Filter
  • implementations of encoding and decoding in different programming languages:
    • awk
    • C module for Python
    • F#
    • Java (documentation)
    • JavaScript
    • Pascal
    • Perl
Source - https://ru.wikipedia.org/w/index.php?title=ASCII85&oldid=100791343

