UTF-16/UCS-2

From Wikipedia, the free encyclopedia

In computing, UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding form maps each character to a sequence of 16-bit words. Characters are identified by code points, and the 16-bit words are known as code units. For characters in the Basic Multilingual Plane (BMP), the resulting encoding is a single 16-bit word. For characters in the other planes, the encoding results in a pair of 16-bit words, together called a surrogate pair. All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.

As many uses in computing require units of bytes (octets), there are three related encoding schemes that map to octet sequences instead of words: UTF-16, UTF-16BE, and UTF-16LE. They differ only in the byte order chosen to represent each 16-bit unit and in whether they make use of a Byte Order Mark (BOM). All of the schemes result in either a 2-byte or a 4-byte sequence for any given character.
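The 2-byte versus 4-byte split can be checked with a short sketch. It assumes Python 3 and its built-in `utf-16-be` codec; the helper name `utf16_byte_length` is made up for this illustration.

```python
def utf16_byte_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-16BE (no BOM)."""
    return len(ch.encode("utf-16-be"))

bmp_len = utf16_byte_length("z")                     # U+007A, inside the BMP
supplementary_len = utf16_byte_length("\U0001D11E")  # U+1D11E, outside the BMP
print(bmp_len, supplementary_len)                    # prints: 2 4
```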

UTF-16 is officially defined in Annex Q of the international standard ISO/IEC 10646-1. It is also described in The Unicode Standard version 2.0 and higher, as well as in the IETF's RFC 2781.

UCS-2 (2-byte Universal Character Set) is a similar yet older character encoding that was superseded by UTF-16 in Unicode version 2.0, though it remains in use. The UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence, it is a fixed-length encoding that always encodes a character as a single 16-bit value. As with UTF-16, there are three related encoding schemes (UCS-2, UCS-2BE, UCS-2LE) that map characters to a specific byte sequence.
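As a rough illustration of the fixed-length property, the sketch below encodes text as big-endian UCS-2 and rejects anything outside the BMP. The function `encode_ucs2_be` is a hypothetical helper written for this example, not part of any standard library, and it assumes Python 3.

```python
import struct

def encode_ucs2_be(text: str) -> bytes:
    """Hypothetical helper: encode text as big-endian UCS-2 without a BOM."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no surrogate pairs, so planes 1-16 are unreachable.
            raise ValueError(f"U+{cp:04X} is outside the BMP; no UCS-2 form exists")
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogate code points are not characters.
            raise ValueError(f"U+{cp:04X} is a surrogate code point")
        out += struct.pack(">H", cp)  # exactly one 16-bit unit per character
    return bytes(out)

ucs2_bytes = encode_ucs2_be("z\u6c34")
```

For BMP-only input the result matches what a UTF-16BE encoder produces, which is the upward compatibility the text describes.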

Because of the technical similarities and upwards compatibility from UCS-2 to UTF-16, the two encodings are often erroneously conflated and used as if interchangeable, so that strings encoded in UTF-16 are sometimes misidentified as being encoded in UCS-2.

For both UTF-16 and UCS-2, all 65,536 code points contained within the BMP (Plane 0), excluding the 2,048 special surrogate code points, are assigned to code units in a one-to-one correspondence with the 16-bit non-negative integers with the same values. Thus code point U+0000 is encoded as the number 0, and U+FFFF is encoded as 65535 (which is 0xFFFF in hexadecimal).
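The one-to-one correspondence can be checked exhaustively. This is a small sketch assuming Python 3 and its `utf-16-be` codec; the variable names are illustrative only.

```python
import struct

SURROGATES = range(0xD800, 0xE000)  # the 2,048 surrogate code points

# Every BMP code point that is not a surrogate.
bmp_scalars = [cp for cp in range(0x10000) if cp not in SURROGATES]

# Collect any code point whose single UTF-16 code unit differs from its value.
mismatches = [
    cp for cp in bmp_scalars
    if struct.unpack(">H", chr(cp).encode("utf-16-be"))[0] != cp
]

print(len(bmp_scalars), len(mismatches))  # prints: 63488 0
```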

Encoding of characters outside the BMP

The improvement that UTF-16 made over UCS-2 is its ability to encode characters in planes 1–16, not just those in plane 0 (the BMP). This was done by taking an unassigned portion of the 16-bit UCS-2 space, the range 0xD800–0xDFFF, and using pairs of values from it. The table below gives the code point (in hexadecimal) selected by each combination of first surrogate (rows) and second surrogate (columns):

First \ Second   DC00     DC01     …   DFFF
D800             010000   010001   …   0103FF
D801             010400   010401   …   0107FF
  ⋮                ⋮        ⋮            ⋮
DBFF             10FC00   10FC01   …   10FFFF

UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair. First, 0x10000 is subtracted from the code point to give a 20-bit value. This is then split into two separate 10-bit values, each of which is represented as a surrogate, with the most significant half placed in the first surrogate. To allow safe use of simple word-oriented string processing, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate and 0xDC00–0xDFFF for the second, least significant surrogate.

For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, and the character at U+10FFFD, the highest code point available for characters (U+10FFFE and U+10FFFF are noncharacters), becomes the sequence 0xDBFF 0xDFFD. Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair never represents a character.
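The subtraction and 10-bit split described above can be sketched in a few lines. The function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and chosen for this illustration; the code assumes Python 3.

```python
from typing import Tuple

def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a non-BMP code point into (first, second) surrogate code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    v = cp - 0x10000               # 20-bit value
    first = 0xD800 | (v >> 10)     # most significant 10 bits
    second = 0xDC00 | (v & 0x3FF)  # least significant 10 bits
    return first, second

def from_surrogate_pair(first: int, second: int) -> int:
    """Recombine a surrogate pair into the code point it encodes."""
    if not (0xD800 <= first <= 0xDBFF and 0xDC00 <= second <= 0xDFFF):
        raise ValueError("not a valid first/second surrogate combination")
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)

pair_low = to_surrogate_pair(0x10000)    # (0xD800, 0xDC00)
pair_high = to_surrogate_pair(0x10FFFD)  # (0xDBFF, 0xDFFD)
```

Because the two surrogate ranges do not overlap, a decoder can tell from a single code unit whether it is the first or the second half of a pair.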

Because the most commonly used characters are all in the Basic Multilingual Plane, code is often not tested thoroughly with surrogate pairs. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software[1].

Byte order encoding schemes

The UTF-16 and UCS-2 encoding forms produce a sequence of 16-bit words or code units. These are not directly usable as a byte or octet sequence because the endianness of these words varies with the computer architecture, which may be big-endian or little-endian. To account for this choice of endianness, each encoding form defines three related encoding schemes: for UTF-16 there are the schemes UTF-16, UTF-16BE, and UTF-16LE, and for UCS-2 there are the schemes UCS-2, UCS-2BE, and UCS-2LE.

The UTF-16 (and UCS-2) encoding scheme allows either endian representation to be used, but mandates that the byte order should be explicitly indicated by prepending a Byte Order Mark (BOM) before the first serialized character. This BOM is the encoded version of the Zero-Width No-Break Space (ZWNBSP) character, code point U+FEFF, chosen because it should never legitimately appear at the beginning of any character data. This results in the byte sequence FE FF (in hexadecimal) for big-endian architectures, or FF FE for little-endian. The BOM at the beginning of UTF-16 or UCS-2 encoded data is considered to be a signature separate from the text itself; it is for the benefit of the decoder. Technically, with the UTF-16 scheme the BOM prefix is optional, but omitting it is not recommended, as UTF-16LE or UTF-16BE should be used instead. If the BOM is missing, barring any indication of byte order from higher-level protocols, big-endian is recommended to be used or assumed. The BOM is not optional in the UCS-2 scheme.
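A decoder for data labeled plain "UTF-16" might apply these rules as in the sketch below. It assumes Python 3; `decode_utf16_scheme` is a hypothetical name, while `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are the standard library's constants for the two BOM byte sequences.

```python
import codecs

def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes labeled 'UTF-16': honor a BOM, otherwise assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")    # BOM is a signature, not text
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    # No BOM and no higher-level hint: fall back to the recommended big-endian.
    return data.decode("utf-16-be")

from_le = decode_utf16_scheme(b"\xff\xfe\x34\x6c\x7a\x00")
from_be = decode_utf16_scheme(b"\xfe\xff\x6c\x34\x00\x7a")
from_bare = decode_utf16_scheme(b"\x6c\x34\x00\x7a")
```

All three calls yield the same two-character string, "水z".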

The UTF-16BE and UTF-16LE encoding schemes (and correspondingly UCS-2BE and UCS-2LE) are similar to the UTF-16 (or UCS-2) encoding scheme. However, rather than using a BOM prepended to the data, the byte order used is implicit in the name of the encoding scheme (LE for little-endian, BE for big-endian). Since a BOM is specifically not to be prepended in these schemes, an encoded ZWNBSP character found at the beginning of data encoded by these schemes is not to be considered a BOM, but is instead part of the text itself. In practice, most software will ignore these "accidental" BOMs.
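The difference in how a leading U+FEFF is treated can be observed with Python 3's built-in codecs, used here only as one example of a conforming decoder:

```python
data = b"\xfe\xff\x00\x7a"  # FE FF followed by U+007A in big-endian order

# Explicit UTF-16BE: the leading U+FEFF is kept as part of the text.
as_utf16be = data.decode("utf-16-be")

# Plain UTF-16: the leading FE FF is read as a BOM and consumed.
as_utf16 = data.decode("utf-16")

print(len(as_utf16be), len(as_utf16))  # prints: 2 1
```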

The IANA has approved UTF-16, UTF-16BE, and UTF-16LE for use on the Internet, by those exact names (case insensitively). The aliases UTF_16 or UTF16 may be meaningful in some programming languages or software applications, but they are not standard names in Internet protocols.

Use in major operating systems and environments

UTF-16 is the native internal representation of text in Microsoft Windows 2000/XP/2003/Vista/CE; the Qualcomm BREW operating system; the Java and .NET bytecode environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.[2][3][citation needed]

Symbian OS, used in Nokia S60 handsets and Sony Ericsson UIQ handsets, uses UCS-2.

The Joliet file system, used in CD-ROM media, encodes file names using UCS-2BE (up to 64 Unicode characters per file).

Older Windows NT systems (prior to Windows 2000) only support UCS-2.[4] In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.[citation needed]

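The Python language environment has officially used only UCS-2 internally since version 2.1, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Python can be compiled to use UCS-4 (UTF-32), but this is commonly done only on Unix systems. This describes Python 2 and Python 3 releases before 3.3; from Python 3.3 onward, strings are indexed by code point regardless of how the interpreter was built.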

Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0. In source code, however, each surrogate half of a non-BMP character must be written as its own escape, for example "\uD834\uDD1E" for U+1D11E.[5] C# additionally allows syntax such as "\U0001D11E", that is, an upper-case 'U' instead of a lower-case one, followed by 8 hex digits instead of 4.[6]

All of these implementations return the number of 16-bit "code units" rather than the number of Unicode "characters" when the equivalent of strlen() is applied to their strings, and indexing into a string returns the indexed 16-bit word, not the indexed "character".[7][8][9] This is often considered to mean that UTF-16 is not really supported, but any practical API requires measuring memory in fixed-size units, and therefore these results are correct.[citation needed] A "character" is an undefined unit in Unicode, due to combining characters, invisible characters, and the need to handle invalid encodings. Most of the confusion is due to obsolete ASCII-era documentation using the word "character" when a fixed-size "byte" was intended.
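The gap between code units, code points, and what a reader perceives as a character can be made concrete. The sketch assumes Python 3.3 or later, where `len()` on a string counts code points; `utf16_code_units` is a hypothetical helper that reports what a UTF-16-based length function would return.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

s = "\u6c34z\U0001D11E"        # water, z, musical G clef
code_points = len(s)           # 3 code points
code_units = utf16_code_units(s)  # 4 code units: the clef needs a surrogate pair

accented = "e\u0301"           # 'e' plus combining acute accent
accented_points = len(accented)   # 2 code points, displayed as one glyph
```

Neither count matches the number of glyphs in every case, which is why the text calls "character" an undefined unit.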

Examples

code point   character         UTF-16 code value(s)   glyph*
U+007A       small Z (Latin)   007A                   z
U+6C34       water (Chinese)   6C34                   水
U+1D11E      musical G clef    D834 DD1E              𝄞

"水z𝄞" (water, z, G clef), UTF-16 encoded

labeled encoding   byte order                byte sequence
UTF-16LE           little-endian             34, 6C, 7A, 00, 34, D8, 1E, DD
UTF-16BE           big-endian                6C, 34, 00, 7A, D8, 34, DD, 1E
UTF-16             little-endian, with BOM   FF, FE, 34, 6C, 7A, 00, 34, D8, 1E, DD
UTF-16             big-endian, with BOM      FE, FF, 6C, 34, 00, 7A, D8, 34, DD, 1E

* Appropriate font and software are required to see the correct glyphs.
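The byte sequences in the examples can be reproduced with Python 3's built-in codecs. This is only a cross-check sketch; `hex_bytes` is a hypothetical formatting helper.

```python
import codecs

def hex_bytes(data: bytes) -> str:
    """Format bytes as comma-separated upper-case hex pairs."""
    return ", ".join(f"{b:02X}" for b in data)

text = "\u6c34z\U0001D11E"  # water, z, musical G clef

le = text.encode("utf-16-le")          # UTF-16LE: no BOM
be = text.encode("utf-16-be")          # UTF-16BE: no BOM
le_bom = codecs.BOM_UTF16_LE + le      # UTF-16, little-endian, with BOM
be_bom = codecs.BOM_UTF16_BE + be      # UTF-16, big-endian, with BOM

for label, data in [("UTF-16LE", le), ("UTF-16BE", be),
                    ("UTF-16 (LE, BOM)", le_bom), ("UTF-16 (BE, BOM)", be_bom)]:
    print(f"{label:18} {hex_bytes(data)}")
```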

Example UTF-16 encoding procedure

The character at code point U+64321 (hexadecimal) is to be encoded in UTF-16. Since it is above U+FFFF, it must be encoded with a surrogate pair, as follows:

v  = 0x64321
v′ = v - 0x10000
   = 0x54321
   = 0101 0100 0011 0010 0001
vh = 01 0101 0000 // higher 10 bits of v′
vl = 11 0010 0001 // lower  10 bits of v′
w1 = 0xD800 // the resulting 1st word is initialized with the high bits
w2 = 0xDC00 // the resulting 2nd word is initialized with the low bits
w1 = w1 | vh
   = 1101 1000 0000 0000
   |        01 0101 0000
   = 1101 1001 0101 0000
   = 0xD950
w2 = w2 | vl
   = 1101 1100 0000 0000
   |        11 0010 0001
   = 1101 1111 0010 0001
   = 0xDF21

The correct UTF-16 encoding for this character is thus the following word sequence:

0xD950 0xDF21

Since the character is above U+FFFF, the character cannot be encoded in UCS-2.
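The worked example can be cross-checked against an existing encoder. The sketch below repeats the arithmetic with the procedure's own variable names and compares the result with Python 3's `utf-16-be` codec, which is assumed here to be a correct reference.

```python
import struct

code_point = 0x64321

# Manual procedure, mirroring the steps of the worked example.
v = code_point - 0x10000   # 0x54321
vh = v >> 10               # higher 10 bits
vl = v & 0x3FF             # lower 10 bits
w1 = 0xD800 | vh
w2 = 0xDC00 | vl

# Reference: let the codec encode the same code point, then read two 16-bit words.
codec_w1, codec_w2 = struct.unpack(">2H", chr(code_point).encode("utf-16-be"))

# Round trip: the two words decode back to a single code point.
decoded = struct.pack(">2H", w1, w2).decode("utf-16-be")

print(hex(w1), hex(w2))  # prints: 0xd950 0xdf21
```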
