Clever Geek Handbook

UTF-32

UTF-32 (Unicode Transformation Format, 32 bits), also called UCS-4 (Universal Character Set, 4 bytes), is a Unicode character encoding that uses exactly 32 bits to encode every code point. The other Unicode encodings, UTF-8 and UTF-16, use a variable number of bytes per character. A UTF-32 code unit is a direct representation of the character's code point.
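The relationship between code units and code points can be checked with a minimal Python sketch. The helper names utf32_code_units and bytes_per_character and the sample string are illustrative assumptions, not part of any standard API; only Python's built-in codecs are used.

```python
def utf32_code_units(text):
    """Split the big-endian UTF-32 encoding of text into 32-bit code units."""
    raw = text.encode("utf-32-be")  # explicit byte order, so no BOM is prepended
    return [int.from_bytes(raw[i:i + 4], "big") for i in range(0, len(raw), 4)]


def bytes_per_character(ch):
    """Return how many bytes one character takes in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(codec)) for codec in ("utf-8", "utf-16-be", "utf-32-be"))


# Hypothetical sample: 'A', 'é', '€', and an emoji outside plane 0.
sample = "A\u00e9\u20ac\U0001F600"

# Every UTF-32 code unit is numerically equal to the code point it encodes.
print(utf32_code_units(sample) == [ord(ch) for ch in sample])

# UTF-8 and UTF-16 lengths vary per character; the UTF-32 length is always 4.
for ch in sample:
    print(f"U+{ord(ch):04X}", bytes_per_character(ch))
```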

The main advantage of UTF-32 over variable-length encodings is that Unicode characters are directly indexable: retrieving the nth code point takes constant time. Variable-length encodings, by contrast, require a sequential scan to reach the nth code point. This makes replacing characters in UTF-32 strings simple, with an integer used as the index, as is usually done for ASCII strings.
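A rough sketch of the difference, assuming well-formed big-endian UTF-32 and well-formed UTF-8 input. All three function names are hypothetical helpers written for this illustration.

```python
def nth_code_point_utf32(data, n):
    """Constant time: the nth code point always starts at byte offset 4 * n."""
    start = 4 * n
    if n < 0 or start + 4 > len(data):
        raise IndexError(n)
    return int.from_bytes(data[start:start + 4], "big")


def nth_code_point_utf8(data, n):
    """Linear time: bytes must be scanned from the start to count code points."""
    if n < 0:
        raise IndexError(n)
    seen = -1
    for start, byte in enumerate(data):
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, belongs to the previous code point
        seen += 1
        if seen == n:
            end = start + 1
            while end < len(data) and data[end] & 0xC0 == 0x80:
                end += 1
            return ord(data[start:end].decode("utf-8"))
    raise IndexError(n)


def replace_code_point_utf32(buf, n, code_point):
    """Overwrite the nth code point of a UTF-32 bytearray in place."""
    if n < 0 or 4 * n + 4 > len(buf):
        raise IndexError(n)
    buf[4 * n:4 * n + 4] = code_point.to_bytes(4, "big")
```

Replacement in place works because every code point occupies the same four bytes; in UTF-8 the same operation may change the length of the buffer.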

The main drawback of UTF-32 is its inefficient use of space, since four bytes are used to store every character. Characters outside the zero (basic) plane of the code space are rare in most texts, so the doubling of string size compared with UTF-16 is not justified.
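The space cost can be measured directly. The sketch below is only an illustration: encoded_sizes is a hypothetical helper, and the sample phrase is an arbitrary text that stays within the zero plane.

```python
def encoded_sizes(text):
    """Return the size in bytes of text in each Unicode encoding (no BOM)."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


# Arbitrary sample: 9 Cyrillic letters plus a comma and a space.
sizes = encoded_sizes("Привет, мир")
print(sizes)

# For text confined to the zero plane, UTF-32 takes twice the space of UTF-16.
print(sizes["UTF-32"] == 2 * sizes["UTF-16"])
```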

Using a fixed number of bytes per character is convenient, but less so than it seems. Truncating a string is easier than in UTF-8 and UTF-16. However, this does not make finding a specific offset in a string any faster, since the offset can also be calculated for fixed-size encodings. Nor does it make computing the displayed width of a string easier, except in a limited number of cases, because even a “fixed-width” character can be produced by combining a regular character with a modifier that has no width of its own. For example, the letter “й” can be composed from the letter “и” and the combining diacritic breve. Because such combinations exist, text editors cannot treat a 32-bit code as the unit of editing. Editors that limit themselves to left-to-right languages and precomposed characters can rely on fixed-size characters. But such editors are unlikely to support characters outside the zero (basic) plane of the code space, and so could work equally well with UTF-16.
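The combining-character case can be reproduced with Python's standard unicodedata module. In the sketch below, code_point_count and base_character_count are hypothetical helpers; counting code points whose canonical combining class is zero is only a crude approximation of what a user perceives as one character, not a full text segmentation algorithm.

```python
import unicodedata


def code_point_count(text):
    """Number of 32-bit code units the text occupies in UTF-32."""
    return len(text.encode("utf-32-be")) // 4


def base_character_count(text):
    """Count code points whose canonical combining class is zero."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "\u0438\u0306"  # Cyrillic 'и' followed by a combining breve
precomposed = "\u0439"       # the single precomposed letter 'й'

# Two UTF-32 code units, yet one visible letter.
print(code_point_count(decomposed), base_character_count(decomposed))

# Canonical composition (NFC) turns the pair into the precomposed letter.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```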

History

The ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character is represented by a 32-bit code value in the range 0 to 0x7FFFFFFF.

Since only 17 planes are actually used, all character codes lie in the range 0 to 0x10FFFF. UTF-32 is the subset of UCS-4 that uses only this range. Because the JTC1/SC2/WG2 document states that all future character assignments will be limited to the zero (basic) plane of the code space or the first 14 supplementary planes, UTF-32 will be able to represent all Unicode characters. Accordingly, UCS-4 and UTF-32 are currently identical, except that the UTF-32 standard carries additional Unicode semantics.
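The two ranges and the plane arithmetic can be written out as a small sketch. The constant and function names are assumptions made for this illustration, and the check deliberately covers only the numeric ranges described above.

```python
UCS4_MAX = 0x7FFFFFFF   # largest 31-bit UCS-4 value
UTF32_MAX = 0x10FFFF    # largest code point in the 17 planes actually used
PLANE_SIZE = 0x10000    # each plane holds 65,536 code positions


def is_ucs4_value(value):
    """True if value fits the 31-bit UCS-4 code space."""
    return 0 <= value <= UCS4_MAX


def is_utf32_value(value):
    """True if value lies in the range UTF-32 uses.

    Note: this range check ignores the surrogate block 0xD800-0xDFFF,
    which Unicode also excludes from well-formed UTF-32.
    """
    return 0 <= value <= UTF32_MAX


def plane_of(code_point):
    """Plane number of a code point: 0 is the basic plane, 16 is the last one."""
    return code_point >> 16


print(17 * PLANE_SIZE == UTF32_MAX + 1)   # 17 planes cover exactly 0..0x10FFFF
print(plane_of(0x1F600), plane_of(UTF32_MAX))
```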

Usage

UTF-32 is used mainly not for character strings but in internal APIs where the data is a single code point or glyph. For example, the last step of text rendering builds a list of structures, each containing the x and y positions, attributes, and a single UTF-32 character that identifies the glyph to draw. The “unused” 11 bits of each 32-bit character are often used to store additional information.
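The 11 spare bits follow from the fact that 0x10FFFF fits in 21 bits. One possible way to use them is sketched below; the layout (extra data in the top 11 bits) and the names pack_glyph_word and unpack_glyph_word are assumptions for illustration, not a description of any particular rendering API.

```python
CODE_POINT_BITS = 21                          # 0x10FFFF fits in 21 bits
EXTRA_BITS = 32 - CODE_POINT_BITS             # the 11 "unused" bits
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1  # 0x1FFFFF


def pack_glyph_word(code_point, extra):
    """Store a code point in the low 21 bits and extra data in the top 11 bits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of UTF-32 range")
    if not 0 <= extra < (1 << EXTRA_BITS):
        raise ValueError("extra data does not fit in 11 bits")
    return (extra << CODE_POINT_BITS) | code_point


def unpack_glyph_word(word):
    """Return (code_point, extra) from a packed 32-bit word."""
    return word & CODE_POINT_MASK, word >> CODE_POINT_BITS


word = pack_glyph_word(0x1F600, 0x7FF)
print(hex(word), unpack_glyph_word(word) == (0x1F600, 0x7FF))
```

A word packed this way is no longer valid UTF-32 on its own, so the extra bits have to be masked off before the value is treated as a code point.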

UTF-32 is used to store strings on Unix systems where the wchar_t type is defined as 32 bits wide. Python versions up to and including 3.2 could be compiled to use UTF-32 instead of UTF-16. Since version 3.3, UTF-16 support has been removed and strings are stored in UTF-32, with leading zero bytes optimized away when they are not needed. On Windows, where wchar_t is 16 bits wide, UTF-32 strings are almost never used.
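The effect of that optimization can be observed, as a rough sketch, by measuring how much memory a string gains per added character. This relies on sys.getsizeof and on the CPython 3.3+ string layout, which is an implementation detail and an assumption here; other Python implementations may report different numbers. The helper name is hypothetical.

```python
import sys


def bytes_per_stored_character(ch, n=1000):
    """Estimate the per-character storage cost of strings made only of ch.

    Two strings of the same kind share the same fixed overhead, so the
    size difference comes only from the n extra characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


# Characters needing 1, 1, 2 and 4 significant bytes respectively.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", bytes_per_stored_character(ch))
```

On CPython the full four bytes per character are used only when a string contains at least one character outside the zero plane.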

Not Using UTF-32 in HTML5

The HTML5 standard states that "authors should not use UTF-32, because the encoding algorithms described in this specification do not distinguish it from UTF-16."

Links

  • Full description of the Unicode standard
Source - https://ru.wikipedia.org/w/index.php?title=UTF-32&oldid=88274950


Clever Geek | 2019