UTF-8 (8-bit Unicode Transformation Format)

UTF-8
Standard	Unicode Standard
Classification	Unicode Transformation Format, extended ASCII, variable-width encoding
Extends	US-ASCII
Transforms / Encodes	ISO 10646 (Unicode)
Preceded by	UTF-1

UTF-8 stands for 8-bit Unicode Transformation Format. The encoding was designed by Rob Pike and Ken Thompson to represent the Unicode standard, which covers the alphabets of most of the world's languages. Each character is encoded in one to four bytes.

The length of a character's encoding is determined by its first byte, with bits numbered from 8 (most significant) down to 1:

If the value of the first byte is less than 128, meaning bit 8 is zero, that byte is the complete encoding of the character, so the length is one byte. The ASCII values fall in this range.
If the value of the first byte is greater than 127, meaning bit 8 is one, the character has a multi-byte encoding, as follows:
- The first byte may not have bit 8 equal to one and bit 7 equal to zero. Such a value in the first byte means there is an error either in the encoding or in the way it is being read; these values are allowed in the second, third, and fourth bytes, but not in the first.
- If bit 8 and bit 7 of the first byte are one and bit 6 is zero, the encoding is 2 bytes long.
- If bits 8, 7, and 6 of the first byte are one and bit 5 is zero, the encoding is 3 bytes long.
- If bits 8, 7, 6, and 5 of the first byte are one and bit 4 is zero, the encoding is 4 bytes long.
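
The rules above can be sketched as a small Python function. This is only an illustration: the name `utf8_sequence_length` is hypothetical, and the function checks bit patterns alone, so it does not reject first bytes such as C0, C1, or F5 to F7 that match a pattern but are still not valid in UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by the first byte of an encoding.

    Hypothetical helper: it inspects only the leading bits of the byte.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if (first_byte & 0b1000_0000) == 0:             # 0xxxxxxx: single byte (ASCII)
        return 1
    if (first_byte & 0b1110_0000) == 0b1100_0000:   # 110xxxxx: 2 bytes
        return 2
    if (first_byte & 0b1111_0000) == 0b1110_0000:   # 1110xxxx: 3 bytes
        return 3
    if (first_byte & 0b1111_1000) == 0b1111_0000:   # 11110xxx: 4 bytes
        return 4
    # 10xxxxxx is only allowed after the first byte; 11111xxx matches no rule.
    raise ValueError(f"0x{first_byte:02X} cannot be the first byte")


lengths = [utf8_sequence_length(b) for b in (0x24, 0xC2, 0xE2, 0xF0)]
print(lengths)  # [1, 2, 3, 4]
```

Passing a byte of the form 10xxxxxx, such as 0x80, raises `ValueError`, which corresponds to the error case in the first bullet.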

Examples

Examples of UTF-8 encoding
Character	Code point	Binary code point	Binary UTF-8	Hex UTF-8
$	U+0024	010 0100	00100100	24
¢	U+00A2	000 1010 0010	11000010 10100010	C2 A2
ह	U+0939	0000 1001 0011 1001	11100000 10100100 10111001	E0 A4 B9
€	U+20AC	0010 0000 1010 1100	11100010 10000010 10101100	E2 82 AC
한	U+D55C	1101 0101 0101 1100	11101101 10010101 10011100	ED 95 9C
𐍈	U+10348	0 0001 0000 0011 0100 1000	11110000 10010000 10001101 10001000	F0 90 8D 88
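
The rows of the table can be reproduced with a short Python sketch. The function name `encode_code_point` is made up for illustration; in practice Python's built-in `str.encode("utf-8")` does the same job, and the loop below uses it as a cross-check.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:         # up to 7 bits: one byte, 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:        # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:      # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The six code points from the table above.
for cp in (0x24, 0xA2, 0x939, 0x20AC, 0xD55C, 0x10348):
    encoded = encode_code_point(cp)
    assert encoded == chr(cp).encode("utf-8")  # cross-check with the built-in codec
    print(f"U+{cp:04X}", encoded.hex(" ").upper())
```

Each printed line should match the Hex UTF-8 column of the corresponding row.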

Octal

Because UTF-8 uses six bits per continuation byte to carry the bits of the encoded code point, octal notation (which uses 3-bit groups) can aid in comparing UTF-8 sequences with one another and in manual conversion.^[1]

Octal code point <-> Octal UTF-8 conversion
First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
0	177	xxx
200	3777	3xx	2xx
4000	77777	34x	2xx	2xx
100000	177777	35x	2xx	2xx
200000	4177777	36x	2xx	2xx	2xx

With octal notation, the arbitrary octal digits, marked with x in the table, will remain unchanged when converting to or from UTF-8.

Example: € = U+20AC is 20254 in octal, written in two-digit groups as 02 02 54; it is encoded as 342 202 254 in UTF-8 (E2 82 AC in hex).
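
The Euro example can be checked with a few lines of Python. This sketch covers only the three-byte row of the table (pattern 34x 2xx 2xx); the variable names are illustrative.

```python
euro = "€"
code_point_octal = format(ord(euro), "o")                     # code point in octal
bytes_octal = [format(b, "03o") for b in euro.encode("utf-8")]

# Drop the fixed prefixes: "34" from the first byte, "2" from each continuation byte.
payload = bytes_octal[0][2:] + "".join(b[1:] for b in bytes_octal[1:])

print(code_point_octal)  # 20254
print(bytes_octal)       # ['342', '202', '254']
print(payload)           # 20254
```

The digits left after removing the prefixes are exactly the octal digits of the code point, which is what the x positions in the table express.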

Codepage layout

The following table summarizes usage of UTF-8 code units (individual bytes or octets) in a code page format. The upper half (0_ to 7_) is for bytes used only in single-byte codes, so it looks like a normal code page; the lower half is for continuation bytes (8_ to B_) and leading bytes (C_ to F_), and is explained further in the legend below.

UTF-8
	_0	_1	_2	_3	_4	_5	_6	_7	_8	_9	_A	_B	_C	_D	_E	_F
(1 byte) 0_	NUL 0000	SOH 0001	STX 0002	ETX 0003	EOT 0004	ENQ 0005	ACK 0006	BEL 0007	BS 0008	HT 0009	LF 000A	VT 000B	FF 000C	CR 000D	SO 000E	SI 000F
(1) 1_	DLE 0010	DC1 0011	DC2 0012	DC3 0013	DC4 0014	NAK 0015	SYN 0016	ETB 0017	CAN 0018	EM 0019	SUB 001A	ESC 001B	FS 001C	GS 001D	RS 001E	US 001F
(1) 2_	SP 0020	! 0021	" 0022	# 0023	$ 0024	% 0025	& 0026	' 0027	( 0028	) 0029	* 002A	+ 002B	, 002C	- 002D	. 002E	/ 002F
(1) 3_	0 0030	1 0031	2 0032	3 0033	4 0034	5 0035	6 0036	7 0037	8 0038	9 0039	: 003A	; 003B	< 003C	= 003D	> 003E	? 003F
(1) 4_	@ 0040	A 0041	B 0042	C 0043	D 0044	E 0045	F 0046	G 0047	H 0048	I 0049	J 004A	K 004B	L 004C	M 004D	N 004E	O 004F
(1) 5_	P 0050	Q 0051	R 0052	S 0053	T 0054	U 0055	V 0056	W 0057	X 0058	Y 0059	Z 005A	[ 005B	\ 005C	] 005D	^ 005E	_ 005F
(1) 6_	` 0060	a 0061	b 0062	c 0063	d 0064	e 0065	f 0066	g 0067	h 0068	i 0069	j 006A	k 006B	l 006C	m 006D	n 006E	o 006F
(1) 7_	p 0070	q 0071	r 0072	s 0073	t 0074	u 0075	v 0076	w 0077	x 0078	y 0079	z 007A	{ 007B	| 007C	} 007D	~ 007E	DEL 007F
8_	• +00	• +01	• +02	• +03	• +04	• +05	• +06	• +07	• +08	• +09	• +0A	• +0B	• +0C	• +0D	• +0E	• +0F
9_	• +10	• +11	• +12	• +13	• +14	• +15	• +16	• +17	• +18	• +19	• +1A	• +1B	• +1C	• +1D	• +1E	• +1F
A_	• +20	• +21	• +22	• +23	• +24	• +25	• +26	• +27	• +28	• +29	• +2A	• +2B	• +2C	• +2D	• +2E	• +2F
B_	• +30	• +31	• +32	• +33	• +34	• +35	• +36	• +37	• +38	• +39	• +3A	• +3B	• +3C	• +3D	• +3E	• +3F
(2) C_	2 0000	2 0040	Latin 0080	Latin 00C0	Latin 0100	Latin 0140	Latin 0180	Latin 01C0	Latin 0200	IPA 0240	IPA 0280	IPA 02C0	accents 0300	accents 0340	Greek 0380	Greek 03C0
(2) D_	Cyril 0400	Cyril 0440	Cyril 0480	Cyril 04C0	Cyril 0500	Armeni 0540	Hebrew 0580	Hebrew 05C0	Arabic 0600	Arabic 0640	Arabic 0680	Arabic 06C0	Syriac 0700	Arabic 0740	Thaana 0780	N'Ko 07C0
(3) E_	Indic 0800	Misc. 1000	Symbol 2000	Kana… 3000	CJK 4000	CJK 5000	CJK 6000	CJK 7000	CJK 8000	CJK 9000	Asian A000	Hangul B000	Hangul C000	Hangul D000	PUA E000	Forms F000
(4) F_	SMP… 10000	񀀀 40000	򀀀 80000	SSP… C0000	SPU… 100000	4 140000	4 180000	4 1C0000	5 200000	5 1000000	5 2000000	5 3000000	6 4000000	6 40000000

Blue cells are 7-bit (single-byte) sequences. They must not be followed by a continuation byte.^[2]

Orange cells with a large dot are continuation bytes.^[3] The hexadecimal number shown after the + symbol is the value of the 6 bits they add. A continuation byte never occurs as the first byte of a multi-byte sequence.

White cells are the leading bytes for a sequence of multiple bytes,^[4] the length shown at the left edge of the row. The text shows the Unicode blocks encoded by sequences starting with this byte, and the hexadecimal code point shown in the cell is the lowest character value encoded using that leading byte.

Red cells must never appear in a valid UTF-8 sequence. The first two red cells (C0 and C1) could be used only for a 2-byte encoding of a 7-bit ASCII character, which should be encoded in 1 byte; as described below, such "overlong" sequences are disallowed.^[5] To see why, consider code point 128 (hex 80, binary 1000 0000). To encode it in 2 bytes, the low six bits are stored in the second byte as 10 000000 (which is 128 itself), and the upper two bits are stored in the first byte as 110 00010, making the minimum first byte C2. The red cells in the F_ row (F5 to FD) indicate leading bytes of 4-byte or longer sequences that cannot be valid because they would encode code points larger than the U+10FFFF limit of Unicode (a limit derived from the maximum code point encodable in UTF-16^[6]). FE and FF do not match any allowed pattern and are therefore not valid start bytes.^[7]

Pink cells are the leading bytes for a sequence of multiple bytes, of which some, but not all, possible continuation sequences are valid. E0 and F0 could start overlong encodings; in this case the lowest non-overlong-encoded code point is shown. F4 can start code points greater than U+10FFFF, which are invalid. ED can start the encoding of a code point in the range U+D800–U+DFFF; these are invalid since they are reserved for UTF-16 surrogate halves.^[8]
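
The legend can be summarized as a classification of all 256 byte values. The following Python sketch is one way to write it down; the function name and the category labels are hypothetical, and the bytes shown in pink (E0, ED, F0, F4) are counted as ordinary leading bytes here because their validity depends on the byte that follows.

```python
from collections import Counter


def classify_byte(b: int) -> str:
    """Classify a byte value by its role in UTF-8 (labels are illustrative)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"        # 7-bit, complete on its own
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx, adds 6 bits
    if b in (0xC0, 0xC1):
        return "invalid"       # could only start an overlong 2-byte sequence
    if b <= 0xDF:
        return "lead-2"
    if b <= 0xEF:
        return "lead-3"
    if b <= 0xF4:
        return "lead-4"
    return "invalid"           # F5 to FF: beyond U+10FFFF or no valid pattern


counts = Counter(classify_byte(b) for b in range(256))
print(counts)
```

Under these rules, 13 byte values (C0, C1, and F5 to FF) never appear in valid UTF-8.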

Overlong encodings

In principle, it would be possible to inflate the number of bytes in an encoding by padding the code point with leading 0s. To encode the Euro sign € from the above example in four bytes instead of three, it could be padded with leading 0s until it was 21 bits long – 000 000010 000010 101100, and encoded as 11110000 10000010 10000010 10101100 (or F0 82 82 AC in hexadecimal). This is called an overlong encoding.
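
As a rough illustration in Python, the overlong byte sequence from this example yields the Euro sign's code point when the payload bits are extracted without any checks, while Python's built-in strict UTF-8 decoder rejects it. The variable names are illustrative.

```python
overlong = bytes([0xF0, 0x82, 0x82, 0xAC])   # padded 4-byte form from the text
shortest = "€".encode("utf-8")               # the 3-byte form, E2 82 AC

# Unchecked extraction of the 21 payload bits of a 4-byte sequence.
b0, b1, b2, b3 = overlong
naive_code_point = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) \
    | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
print(hex(naive_code_point))  # 0x20ac

# The built-in decoder refuses the overlong form.
try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```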

References

[1] "BinaryString (flink 1.9-SNAPSHOT API)". ci.apache.org. Retrieved 2021-03-24.
[2] "Chapter 3", The Unicode Standard, p. 54
[3] "Chapter 3", The Unicode Standard, p. 55
[4] "Chapter 3", The Unicode Standard, p. 55
[5] "Chapter 3", The Unicode Standard, p. 54
[6] Yergeau, F. (November 2003), UTF-8, a transformation format of ISO 10646, IETF, doi:10.17487/RFC3629. Retrieved 2020-08-20.
[7] "Chapter 3", The Unicode Standard, p. 55
[8] Yergeau, F. (November 2003), UTF-8, a transformation format of ISO 10646, IETF, doi:10.17487/RFC3629. Retrieved 2020-08-20.

External links

Original UTF-8 paper (or pdf) for Plan 9 from Bell Labs
UTF-8 test pages:
Unix/Linux: UTF-8/Unicode FAQ, Linux Unicode HOWTO, UTF-8 and Gentoo
Characters, Symbols and the Unicode Miracle at YouTube


