Clever Geek Handbook

UTF-8

UTF-8 (Unicode Transformation Format, 8-bit) is a widely used text encoding standard that stores and transmits Unicode characters compactly, using a variable number of bytes (from 1 to 4), while remaining fully backward compatible with 7-bit ASCII. UTF-8 is specified in RFC 3629 and in ISO/IEC 10646 Annex D. It is now the dominant encoding on the web and is also widely used in UNIX-like operating systems [1]. The format was devised on September 2, 1992 by Ken Thompson and Rob Pike and implemented in Plan 9 [2]. On Windows, the encoding identifier (code page) for UTF-8 is 65001 [3].

Compared with UTF-16, UTF-8 is the most compact for text in Latin script: Latin letters without diacritics, digits, and the most common punctuation marks are each encoded in a single byte, and the codes of these characters match their ASCII codes [4] [5].
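The size difference can be checked with a short sketch. It relies only on Python's built-in codecs; the helper name `encoded_sizes` and the sample strings are illustrative assumptions, and `utf-16-le` is chosen so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes for the given text."""
    # "utf-16-le" writes no byte order mark, so only character data is counted.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


latin = "Hello, world!"   # 13 ASCII characters
cyrillic = "Привет"       # 6 Cyrillic letters

print(encoded_sizes(latin))     # (13, 26)
print(encoded_sizes(cyrillic))  # (12, 12)
```

For the ASCII-only string UTF-8 needs half the bytes of UTF-16, while for the Cyrillic sample both encodings use two bytes per letter.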

Contents

  • 1 Coding Algorithm
    • 1.1 Coding Examples
  • 2 UTF-8 Marker
  • 3 Fifth and Sixth Bytes
  • 4 Notes
  • 5 Links

Coding Algorithm

The UTF-8 encoding algorithm is standardized in RFC 3629 and consists of three stages:

1. Determine the number of octets (bytes) required to encode the character. The character number is taken from the Unicode standard.

Character number range (hexadecimal)   Octets required
00000000-0000007F                      1
00000080-000007FF                      2
00000800-0000FFFF                      3
00010000-0010FFFF                      4

For Unicode characters numbered U+0000 to U+007F (encoded in one byte whose high bit is zero), the UTF-8 encoding is identical to 7-bit US-ASCII.
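The first stage can be sketched as a simple range lookup. The function name `octets_required` is a hypothetical helper, not something defined by RFC 3629, and this sketch does not reject the surrogate range U+D800 to U+DFFF, which a complete encoder would also have to refuse.

```python
def octets_required(code_point: int) -> int:
    """Return how many octets UTF-8 needs for a Unicode character number."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("character number is outside U+0000..U+10FFFF")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


print(octets_required(0x24))     # 1
print(octets_required(0x20AC))   # 3

# For U+0000..U+007F the UTF-8 bytes equal the US-ASCII bytes.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```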

2. Set the most significant bits of the first octet in accordance with the required number of octets determined in the first stage:

  • 0xxxxxxx - if one octet is required for encoding;
  • 110xxxxx - if two octets are required for encoding;
  • 1110xxxx - if three octets are required for encoding;
  • 11110xxx - if four octets are required for encoding.

If encoding requires more than one octet, the two most significant bits of octets 2-4 are always set to binary 10 (10xxxxxx). This makes it easy to find the first octet of each character in a stream, because the most significant bits of a first octet are never binary 10.

Number of octets   Significant bits   Template
1                  7                  0xxxxxxx
2                  11                 110xxxxx 10xxxxxx
3                  16                 1110xxxx 10xxxxxx 10xxxxxx
4                  21                 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
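The bit templates above make it possible to classify any octet by its leading bits alone. The following is a minimal sketch; `classify_octet` and `character_starts` are hypothetical helper names, and the classification looks only at the prefix bits, so it does not catch every octet value that a strict validator would reject.

```python
def classify_octet(octet: int) -> str:
    """Classify one octet by the leading bits listed in the template table."""
    if (octet & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx
    if (octet & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (octet & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx
    if (octet & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx
    if (octet & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx
    return "invalid"


def character_starts(data: bytes) -> list:
    """Return the offsets of all octets that are not continuation octets."""
    return [i for i, octet in enumerate(data) if (octet & 0b1100_0000) != 0b1000_0000]


sample = "$¢€".encode("utf-8")        # 24 C2 A2 E2 82 AC
print([classify_octet(b) for b in sample])
print(character_starts(sample))       # [0, 1, 3]
```

Because continuation octets are the only ones that start with binary 10, a reader that lands in the middle of a stream can skip forward to the next character boundary without decoding anything.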

3. Set the significant bits of the octets according to the Unicode character number expressed in binary. Start with the low-order bits of the character number, placing them in the low-order bits of the last octet, and continue from right to left up to the first octet. Fill any bits of the first octet that remain unused with zeros.
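The three stages can be combined into one short encoder. This is a sketch under stated assumptions rather than a reference implementation: `encode_utf8` is a hypothetical name, surrogate values are not rejected, and the result is compared with Python's built-in codec only as a sanity check.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode character number following the three stages."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("character number is outside U+0000..U+10FFFF")

    # Stages 1 and 2: pick the octet count and the prefix of the first octet.
    if code_point <= 0x7F:
        return bytes([code_point])          # 0xxxxxxx
    if code_point <= 0x7FF:
        count, prefix = 2, 0b1100_0000      # 110xxxxx
    elif code_point <= 0xFFFF:
        count, prefix = 3, 0b1110_0000      # 1110xxxx
    else:
        count, prefix = 4, 0b1111_0000      # 11110xxx

    # Stage 3: fill from the last octet to the first, six bits at a time.
    octets = []
    remaining = code_point
    for _ in range(count - 1):
        octets.append(0b1000_0000 | (remaining & 0b0011_1111))  # 10xxxxxx
        remaining >>= 6
    octets.append(prefix | remaining)       # unused bits stay zero
    return bytes(reversed(octets))


print(encode_utf8(0x20AC).hex())                      # e282ac
print(encode_utf8(0x20AC) == "€".encode("utf-8"))     # True
```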

Coding Examples

Symbol   Code point   Binary character code    UTF-8 in binary                       UTF-8 in hexadecimal
$        U+0024       10 0100                  00100100                              24
¢        U+00A2       1010 0010                11000010 10100010                     C2 A2
€        U+20AC       10 0000 1010 1100        11100010 10000010 10101100            E2 82 AC
𐍈        U+10348      1 0000 0011 0100 1000    11110000 10010000 10001101 10001000   F0 90 8D 88
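The rows of the examples table can be reproduced with Python's built-in UTF-8 codec. The helper name `utf8_row` is an assumption made for this sketch; the characters are written as escapes so the snippet does not depend on the encoding of the source file.

```python
def utf8_row(symbol: str) -> tuple:
    """Return (code point, UTF-8 in binary, UTF-8 in hexadecimal) for one character."""
    encoded = symbol.encode("utf-8")
    binary = " ".join(f"{octet:08b}" for octet in encoded)
    hexadecimal = " ".join(f"{octet:02X}" for octet in encoded)
    return f"U+{ord(symbol):04X}", binary, hexadecimal


# U+0024 dollar sign, U+00A2 cent sign, U+20AC euro sign, U+10348 Gothic letter hwair
for symbol in ("\u0024", "\u00A2", "\u20AC", "\U00010348"):
    print(*utf8_row(symbol), sep=" | ")
```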

UTF-8 Marker

To indicate that a file or stream contains Unicode characters, a byte order mark (BOM) can be inserted at its beginning; in UTF-8 the mark takes the form of three bytes: EF BB BF (hexadecimal).

                   1st byte    2nd byte    3rd byte
Binary code        1110 1111   1011 1011   1011 1111
Hexadecimal code   EF          BB          BF
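A reader of UTF-8 data may want to detect and drop this marker before processing the text. The sketch below uses the `codecs.BOM_UTF8` constant and the `utf-8-sig` codec from Python's standard library; `strip_utf8_bom` is a hypothetical helper name.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading EF BB BF marker, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = codecs.BOM_UTF8 + "text".encode("utf-8")

print(" ".join(f"{octet:02X}" for octet in codecs.BOM_UTF8))  # EF BB BF
print(strip_utf8_bom(raw))                                    # b'text'
# The "utf-8-sig" codec removes the marker while decoding.
print(raw.decode("utf-8-sig"))                                # text
```

Decoding the same bytes with the plain `utf-8` codec keeps the marker as the character U+FEFF at the start of the string, which is why the explicit strip or the `utf-8-sig` codec is useful.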

Fifth and Sixth Bytes

Originally, UTF-8 allowed up to six bytes for encoding one character. In November 2003, RFC 3629 prohibited the five- and six-byte forms and limited the range of encodable characters to U+10FFFF. This was done to ensure compatibility with UTF-16.
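The effect of this restriction can be observed with Python's built-in decoder, which follows the four-byte limit. In this sketch the five-octet sequence is a hand-built example of the old form and `decodes_as_utf8` is a hypothetical helper name.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the built-in decoder accepts data as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 111110xx lead octet followed by four continuation octets (old five-octet form)
old_five_octet_form = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
print(decodes_as_utf8(old_five_octet_form))          # False

# U+10FFFF is the largest character number and fits in four octets.
print("\U0010FFFF".encode("utf-8").hex())            # f48fbfbf
# In UTF-16 the same character is the last possible surrogate pair.
print("\U0010FFFF".encode("utf-16-be").hex())        # dbffdfff
```

The last line illustrates the compatibility argument: U+10FFFF is the highest value a UTF-16 surrogate pair can express, so longer UTF-8 forms would describe characters that UTF-16 cannot represent.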

Notes

  1. Usage Statistics of Character Encodings for Websites, June 2011.
  2. http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
  3. Code Page Identifiers - Windows applications | Microsoft Docs.
  4. Robert O'Callahan. String Theory. Well, I'm Back (March 1, 2008). Retrieved March 1, 2008. Archived August 23, 2011.
  5. Rostislav Chebykin. All encodings are encoding. UTF-8: modern, competent, convenient. HTML and CSS. Retrieved March 22, 2009. Archived August 23, 2011.

Links

  • UTF-8 encoding table and Unicode characters
  • UTF-8: Encoding and Decoding
  • UTF-8, UTF-16, UTF-32 & BOM - Questions and Answers
  • Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)
  • Full description of the Unicode standard
  • UTF-8 Everywhere Manifesto
  • RFC 3629 "UTF-8, a transformation format of ISO 10646"
Source - https://ru.wikipedia.org/w/index.php?title=UTF-8&oldid=102273678


Clever Geek | 2019