Hong Kong Supplementary Character Set

The Hong Kong Supplementary Character Set (香港增補字符集; commonly abbreviated to HKSCS) is a set of Chinese characters (4,702 in total in the initial release) used in Cantonese, as well as when writing the names of some places in Hong Kong (whether in written Cantonese or standard written Chinese sentences).[1]

It evolved from the preceding Government Chinese Character Set (政府通用字庫), or GCCS, a set of supplementary Chinese characters coded in the user-defined areas of the Big5 character set. GCCS was originally used within the Hong Kong Government and was later adopted by the public. It became the Hong Kong Supplementary Character Set when the characters in the set were submitted to ISO 10646 for coding.

History and versions

The HKSCS has gone through a few iterations.[2]

Version | Total characters | Big5-HKSCS characters | Publication date
GCCS | 3,049 | 3,049 | 1995
HKSCS-1999 | 4,702 | 4,702 | 09/1999
HKSCS-2001 | 4,818 | 4,818 | 12/2001
HKSCS-2004 | 4,941 | 4,941 | 05/2005
HKSCS-2008 | 5,009 | 5,009 | 12/2009
HKSCS-2016 | 5,033 | 5,009 | 05/2017
MSCS-2020 | 442 | 59 | 03/2021

Big-5 extensions (1995–2009)

HKSCS Big-5 extension
MIME / IANA: Big5-HKSCS
Alias(es): big5hk, csBig5HKSCS
Language(s): Traditional Chinese, Cantonese
Classification: 8-bit CJK DBCS
Extends: Big5 ETen

HKSCS versions up to HKSCS-2008 are encoded in Big5 (Big5-HKSCS,[3] big5hk)[4] and ISO 10646 (Unicode).
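As a minimal sketch of what this dual encoding looks like in practice, the snippet below converts between Big5-HKSCS bytes and Unicode text using the `big5hkscs` codec that ships with Python's standard library. Which HKSCS edition the codec's table follows depends on the installed interpreter and is not stated in this article. The helper names are illustrative only.

```python
import codecs


def to_big5_hkscs(text: str) -> bytes:
    """Encode Unicode text as Big5-HKSCS bytes with Python's bundled codec."""
    return text.encode("big5hkscs")


def from_big5_hkscs(data: bytes) -> str:
    """Decode Big5-HKSCS bytes back into Unicode text."""
    return data.decode("big5hkscs")


if __name__ == "__main__":
    sample = "香港增補字符集"
    raw = to_big5_hkscs(sample)
    print(raw.hex())                          # two bytes per character
    print(from_big5_hkscs(raw) == sample)     # round trip check
    # Python normalises the IANA-style label to its own codec name.
    print(codecs.lookup("Big5-HKSCS").name)
```

Text containing characters that the codec's table does not cover raises `UnicodeEncodeError`, which is one practical reason to store such text as Unicode instead.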

GCCS

Due to the inherent differences between standard written Chinese and written Cantonese, the Government of Hong Kong recognised the need for a standardised set of proprietary characters that would streamline electronic communication. At the time, the Big5 Chinese encoding scheme did not contain the vast majority of these characters (some were erroneously cross-listed with similar characters).

The Government Chinese Character Set (政府通用字庫), or GCCS, was thus developed by the government. The character set consists of Chinese characters commonly used in Hong Kong. Some characters are Cantonese-specific, while others are alternative forms of other characters. The set was not well organised, and its characters were not closely examined.

HKSCS-1999

Subsequently, the HKSCS-1999 (HKSCS 1999 specification) was developed. 106 GCCS characters were removed in HKSCS-1999 as a result of unification, and their Big5 code points are reserved for compatibility.[5][6]

Retired "not verifiable" GCCS characters are found in UTC Sources (UTC-00877–UTC-00898),[7] where they are sourced from Adobe-CNS1-1,[8] an Adobe-CNS1 supplement implemented to support GCCS.[9]

HKSCS-2001 and HKSCS-2004

Following the acceptance of HKSCS-1999, newer revisions were released in 2001 (adding 116 new characters) and in 2004 (adding 123 new characters), totalling 4,941 characters.[5]

Starting from HKSCS-2004, all characters previously using the Private Use Area (PUA) section of Unicode (via Microsoft's mapping of the Unicode PUA over the private-use ranges of Big5) are remapped, with many of them reassigned to characters in the Supplementary Ideographic Plane, such as in the CJK Unified Ideographs Extension B or CJK Compatibility Ideographs Supplement Unicode blocks.[10] However, to preserve compatibility with programs that generated PUA code points, the already-allocated code points are reserved, and no new characters will be mapped to the private use area.
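The practical consequence of the remapping can be sketched as a check on where a decoded character lands in Unicode. The example below is an illustration only: it classifies a character by block using the published Unicode ranges for the Private Use Area, CJK Unified Ideographs Extension B and CJK Compatibility Ideographs Supplement. The function names and the idea of flagging PUA code points as "legacy" are assumptions made for this sketch, not part of any HKSCS tooling.

```python
# Unicode ranges relevant to the HKSCS-2004 remapping.
PUA = range(0xE000, 0xF900)             # BMP Private Use Area, U+E000..U+F8FF
EXT_B = range(0x20000, 0x2A6E0)         # CJK Unified Ideographs Extension B
COMPAT_SUPP = range(0x2F800, 0x2FA20)   # CJK Compatibility Ideographs Supplement
SIP = range(0x20000, 0x30000)           # Supplementary Ideographic Plane (plane 2)


def classify(ch: str) -> str:
    """Return a coarse label for the Unicode region a character belongs to."""
    cp = ord(ch)
    if cp in PUA:
        return "private use"
    if cp in EXT_B:
        return "extension B"
    if cp in COMPAT_SUPP:
        return "compatibility supplement"
    if cp in SIP:
        return "other plane 2"
    return "other"


def legacy_pua_positions(text: str) -> list:
    """Indexes of characters still sitting on PUA code points (hypothetical helper)."""
    return [i for i, ch in enumerate(text) if classify(ch) == "private use"]


if __name__ == "__main__":
    mixed = "A" + chr(0xE000) + chr(0x20000)
    print([classify(ch) for ch in mixed])
    print(legacy_pua_positions(mixed))
```

A non-empty result from `legacy_pua_positions` suggests the text was produced with a pre-2004 mapping and may need conversion before being exchanged with Unicode-only software.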

HKSCS-2008

Since around 2005, many Hong Kong and Macau websites have switched encoding from Big5-HKSCS to Unicode, including HKGolden.

The last edition of HKSCS to encode all of its characters in Big5 was HKSCS-2008.[11]

Unicode subsets (2015 onwards)

HKSCS-2016

By 2015, efforts were underway in Hong Kong to migrate away from Big5-HKSCS and towards a defined subset of Unicode, at the time tentatively termed the Hong Kong Character Set (HKCS). This was planned to be published by the end of 2015 as "HKCS-2015", and to have four parts differentiated by different Unihan source prefixes:[12]

  • Source prefix H followed by Big5 hexadecimal: character repertoire of HKSCS-2008 in the narrow sense
  • Source prefix HB followed by Big5 hexadecimal: character repertoire of Big5-ETEN
  • Source prefix HC followed by a four-digit decimal incremental accession number: post-2008 vertical extensions (i.e. newly submitted CJK Unified Ideographs)
  • Source prefix HD followed by hexadecimal Unicode code point: post-2008 horizontal extensions (i.e. the addition of a Hong Kong reference glyph and source reference to an existing CJK Unified Ideograph)

In particular, 22 horizontal extensions were already planned for inclusion as of June 2015. These were all minor variants of existing Big5 characters containing the 昷, 兑/兌 and 吿/告 components, for which Hong Kong font conventions were closer to those of mainland China than to those of Taiwan, but for which the preferred versions had been encoded separately from the Big5 versions in the Unified Repertoire and Ordering due to the source separation rule.[12][13] Of these 22 characters, 14 were considered "core" characters for Hong Kong use.[13]

By November, a total of 78 requested additions to HKSCS had been received by the Hong Kong government, all of which already existed in Unicode. By this point, the planned horizontal extension was being referred to as an "HKSCS" version, rather than as "HKCS".[14] Ultimately, HKSCS-2016 added a total of 24 characters relative to HKSCS-2008, 22 of which were the source-separated variants of existing Big5 characters.[13] Since all 24 characters already existed in Unicode, all received a source reference prefixed with HD; the HC source prefix was not used.[15]

As such, the characters added in HKSCS-2016 are referenced to Unicode only, and were not added to the Big5 extension.[11]
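The four-prefix scheme proposed for HKCS-2015 can be sketched as a small parser. This is an illustration under stated assumptions: the hyphen between prefix and value, the exact digit counts for the hexadecimal fields, and every name in the code are hypothetical choices for this sketch, and real Unihan data may format source references differently. The HC entry is included only because the proposal defined it; as noted above, it was not used in HKSCS-2016.

```python
import re
from typing import NamedTuple, Optional


class HSource(NamedTuple):
    prefix: str
    value: str
    meaning: str


# Prefix -> (pattern for the value part, description from the proposal).
# The field widths here are assumptions for illustration.
_PREFIXES = {
    "H": (r"[0-9A-F]{4}", "HKSCS-2008 repertoire (Big5 hex)"),
    "HB": (r"[0-9A-F]{4}", "Big5-ETEN repertoire (Big5 hex)"),
    "HC": (r"[0-9]{4}", "vertical extension (accession number)"),
    "HD": (r"[0-9A-F]{4,6}", "horizontal extension (Unicode code point)"),
}


def parse_hsource(ref: str) -> Optional[HSource]:
    """Split 'PREFIX-VALUE' and validate VALUE; return None if malformed."""
    prefix, sep, value = ref.partition("-")
    if not sep or prefix not in _PREFIXES:
        return None
    pattern, meaning = _PREFIXES[prefix]
    if re.fullmatch(pattern, value) is None:
        return None
    return HSource(prefix, value, meaning)


if __name__ == "__main__":
    for ref in ("H-8BC3", "HD-2F800", "HC-0001", "HX-0001"):
        print(ref, "->", parse_hsource(ref))
```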

Macao Supplementary Character Set

As in Hong Kong, Macau needs characters that are included in neither Big5 nor HKSCS. The Macao Supplementary Character Set was therefore developed, building on HKSCS with additional Unicode-mapped characters. The first batch of 121 MSCS characters was submitted for addition to or horizontal extension in Unicode (as appropriate) in 2009.[16] At the time, the term Macao Information Systems Character Set (MISCS) was in use for the entire character set, while "MSCS" referred more narrowly to the additional characters only.[16]

The first final version of MSCS, MSCS-2020, was established in 2021, and uses the following Unihan source prefixes.[11][17] Although the potential scope of these source prefixes collectively comprises a superset of HKSCS-2016, "MSCS" in a strict sense does not cover the Big5 or HKSCS characters (since it is intended to be combined with HKSCS) except those which are used as base characters for ideographic variation sequences.[11][17]

  • Source prefix MA followed by Big5 hexadecimal: character repertoire of HKSCS-2008 in the narrow sense (same as H)
  • Source prefix MB followed by level number and Big5 hexadecimal: character repertoire of Big5-ETEN (same as HB)
  • Source prefix MC followed by a five-digit decimal incremental accession number: Macau-specific vertical extensions. This prefix had initially been MAC in 2009,[16] but was shortened to MC for vertical extensions in 2020, while horizontal extensions had their source references replaced with MD references.[11]
  • Source prefix MD followed by hexadecimal Unicode code point: Macau-specific horizontal extensions
  • Source prefix MDH followed by hexadecimal Unicode code point: horizontal extensions from HKSCS-2016 (same as HD)
  • Source prefix ME followed by hexadecimal Unicode code point and a three-digit decimal number: used for variation sequences registered in the Ideographic Variation Database (IVD)
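The correspondence between the Macao and Hong Kong prefixes in the list above can be written down as a lookup table. The sketch below is illustrative only; the dictionary and function names are hypothetical. It maps only the prefixes, not the values that follow them, because the MB form carries a level number that the HB description does not mention.

```python
from typing import Optional

# MSCS prefixes that the list above describes as "same as" an HKSCS prefix.
MSCS_TO_HKSCS = {"MA": "H", "MB": "HB", "MDH": "HD"}

# MSCS prefixes with no Hong Kong counterpart in the list above.
MACAU_ONLY = {"MC", "MD", "ME"}


def hkscs_equivalent(mscs_prefix: str) -> Optional[str]:
    """Return the matching HKSCS prefix, or None for Macau-specific prefixes."""
    if mscs_prefix in MSCS_TO_HKSCS:
        return MSCS_TO_HKSCS[mscs_prefix]
    if mscs_prefix in MACAU_ONLY:
        return None
    raise ValueError(f"not an MSCS-2020 source prefix: {mscs_prefix!r}")


if __name__ == "__main__":
    for prefix in ("MA", "MB", "MC", "MD", "MDH", "ME"):
        print(prefix, "->", hkscs_equivalent(prefix))
```

The superseded 2009 prefix MAC is deliberately rejected here, since it was replaced by MC and MD in 2020.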

Compatibility

Operating systems

In Microsoft Windows 98, NT 4.0, 2000 and XP, HKSCS support can be enabled using Microsoft's patch. In Microsoft's implementation, applications using code page 950 automatically use a hidden code page 951 table for the Big5 encoding of the HKSCS extensions. The table supports all code points in HKSCS-2001, except for the compatibility code points specified by the standard.[18] In addition, the MingLiU font is altered by Microsoft's patch. This patch is known to create conflicts in applications such as Microsoft Office, or any application using fonts supporting simplified Chinese characters (e.g. SimSun). If the target environment contains custom fonts mapped to the code points affected by Microsoft's patch, the custom fonts can undo the patch. Furthermore, the patch breaks the EUDC Editor supplied with the affected versions of Windows.[19]

Starting with Windows Vista, HKSCS-2004 characters are only supported as Unicode 4.1 or later. HKSCS-2001 and HKSCS-1999 characters are supported as both Big5-HKSCS and Unicode, but Big5-HKSCS is available only if "Language for non-Unicode programs" is set to "Hong Kong" or "Macau".[20][21] All characters are assigned standard, non-PUA code points. The characters are displayed with the MingLiU font and can be entered via the keyboard. The patch that provides Big5 encoding of HKSCS is unsupported in Windows Vista and later. A utility provided by Microsoft is available to convert HKSCS and Unicode PUA-encoded characters to their Unicode 4.1 equivalents.[22]

In 2010, Microsoft published an HKSCS-2004 patch for Windows XP and Windows Server 2003.[23] It replaces the Windows XP versions of MingLiU, PMingLiU and MingLiU_HKSCS (if the HKSCS-2001 patch was applied) with the Windows 7 versions of those fonts. In addition, the MingLiU-ExtB, MingLiU_HKSCS-ExtB and PMingLiU-ExtB fonts are added to the target system. However, the IME is not updated as it was with the HKSCS-2001 patch, and the fonts come from a pre-release of Windows 7. For earlier versions of the OS, HKSCS support requires the use of Microsoft's patch, or the Hong Kong government's Digital 21's utilities.

IBM assigns CCSID 5471 to the HKSCS-2001 Big5 code page (with CPGID 1374 as CCSID 5470 as the double byte component),[24][25] CCSID 9567 to the HKSCS-2004 code page (with CPGID 1374 as CCSID 9566 as the double byte component),[26] and CCSID 13663 to the HKSCS-2008 code page (with CPGID 1374 as CCSID 13662 as the double byte component),[27] while CCSID 1375 (with CPGID 1374 as CCSID 1374 as its double byte component) is assigned to a growing HKSCS code page, currently equivalent to CCSID 13663.[28]

HKSCS support was added to glibc in 2000, but it has not been updated since then. HKSCS-2004 support is handled as Unicode 4.1 and later. On freedesktop.org-based setups, the AR PL ShanHeiSun Uni font has fully supported HKSCS-2004 since version 0.1-0.dot.1, with the latest revision of HKSCS-2004 supported in version 0.1.20060903-1. Modern desktop distributions (e.g. Ubuntu) include Arphic Technology's HKSCS-compliant UKai and UMing fonts out of the box when Traditional Chinese language support is selected during installation. They can also be installed manually at a later time.

Mac OS X 10.0–10.2 supports HKSCS-1999, and 10.3–10.4 supports HKSCS-2001. Some of the characters added in HKSCS-2004 are supported via the Unicode PUA in OS X 10.4. Starting with OS X 10.5, all the HKSCS-2004 characters are supported via standard Unicode 4.1 code points.

Applications and the Web

Mozilla 1.5 and above support HKSCS, with HKSCS-2004 support added in the Gecko 1.8.1 code base.[29] Unlike Microsoft's patch described above, Mozilla uses its own code page table. However, the fix for bug 343129 does not support characters mapped to code points above the Basic Multilingual Plane.[30]

Qt 3.x-based applications (e.g. KDE) only support characters mapped to code points U+FFFF or lower. In Qt 4, characters outside the BMP are supported via surrogates. The Big5-HKSCS Text Codec supported HKSCS-1999 as early as Qt-2.3.x, but it arrived too late in the Qt development schedule to be officially included in the Qt-2.3.x series, so official support began with Qt-3.0.1. HKSCS-2001 support was added in Qt-3.0.5.[31]
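The surrogate mechanism mentioned above is the standard UTF-16 rule for code points beyond U+FFFF, which is where the HKSCS characters in the Supplementary Ideographic Plane live. The following sketch shows the arithmetic; it is written in Python for illustration and is not taken from Qt, and the function names are hypothetical.

```python
from typing import Tuple


def needs_surrogates(cp: int) -> bool:
    """True if the code point lies outside the Basic Multilingual Plane."""
    return cp > 0xFFFF


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


if __name__ == "__main__":
    cp = 0x20000  # first code point of CJK Unified Ideographs Extension B
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> {high:04X} {low:04X}")
    # Cross-check against Python's own UTF-16 encoder.
    print(chr(cp).encode("utf-16-be").hex())
```

Software that stores text as 16-bit units but does not pair surrogates, as described for Qt 3.x, cannot represent these characters at all.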

GNOME supports HKSCS characters in Unicode ranges, except those mapped to the Basic Multilingual Plane compatibility block. Patches to support characters mapped above the Basic Multilingual Plane were introduced during Pango 1.1.[32]

The WHATWG Encoding Standard (used by HTML5) includes HKSCS in its definition of Big5 (used even with the plain Big5 label). However, only its decoder uses all HKSCS extensions, while its encoder explicitly excludes those with lead bytes below 0xA1 (thus excluding most of the HKSCS extensions but including, for example, those inherited from Big5 ETEN).[33] Newer browsers follow this standard, including Firefox.
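The decoder/encoder asymmetry can be sketched with the pointer arithmetic that the Encoding Standard uses for its Big5 index. The code below is an illustration based on a reading of that standard, not a copy of it: it reproduces only the byte-to-pointer formulas and the lead-byte cut-off, and it does not include the index table itself, so it cannot tell whether a given pointer is actually assigned to a character. All function names are hypothetical.

```python
from typing import Optional, Tuple

# Pointers below this value correspond to lead bytes 0x81..0xA0,
# which the encoder is described as excluding.
ENCODER_MIN_POINTER = (0xA1 - 0x81) * 157


def big5_pointer(lead: int, trail: int) -> Optional[int]:
    """Map a lead/trail byte pair to an index pointer, or None if invalid."""
    if not 0x81 <= lead <= 0xFE:
        return None
    if 0x40 <= trail <= 0x7E:
        offset = 0x40
    elif 0xA1 <= trail <= 0xFE:
        offset = 0x62
    else:
        return None
    return (lead - 0x81) * 157 + (trail - offset)


def pointer_to_bytes(pointer: int) -> Tuple[int, int]:
    """Inverse of big5_pointer: recover the lead and trail bytes."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    offset = 0x40 if trail < 0x3F else 0x62
    return lead, trail + offset


def encoder_may_emit(lead: int, trail: int) -> bool:
    """Range rule only: would the encoder consider this byte pair at all?"""
    pointer = big5_pointer(lead, trail)
    return pointer is not None and pointer >= ENCODER_MIN_POINTER


if __name__ == "__main__":
    for lead, trail in ((0x88, 0x62), (0xA1, 0x40), (0xF9, 0xFE)):
        print(f"{lead:02X}{trail:02X}",
              big5_pointer(lead, trail),
              encoder_may_emit(lead, trail))
```

Under this rule a page can decode a byte pair such as 0x88 0x62 but will never produce it when encoding, which is one reason round-tripping Big5-HKSCS text through a browser form is not guaranteed to be lossless.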

References

  1. FAQs about GovHK Online Services – Other Technical Questions and Trouble Shooting
  2. "OGCIO - Development of HKSCS". Archived from the original on 22 August 2017. Retrieved 21 August 2017.
  3. "Character Sets". IANA.
  4. "SDK components".
  5. "Big5CMP.txt". Archived from the original on 13 September 2016. Found at Mapping table - HKSCS-2008
  6. "HKSCS-2004 Annex IV. Compatibility Points for GCCS" (PDF). Archived from the original (PDF) on 30 September 2016. Retrieved 29 September 2016.
  7. "Group:Big5-GCCS外字". Retrieved 30 September 2016.
  8. "U-source glyphs" (PDF). Retrieved 30 September 2016.
  9. "The Adobe-CNS1-6 Character Collection" (PDF). Retrieved 30 September 2016.
  10. "Big5-HKSCS:2004".
  11. Macao Special Administrative Region Government (2 July 2020) [2020-06-11]. "Submission of Macao's Vertical Extension (UNC Characters), Horizontal Extension, and IVSes Registration for MSCS" (PDF). ISO/IEC JTC 1/SC 2/WG 2 IRGN 2430.
  12. Lu, Qin (8 June 2015). "The Proposed Hong Kong Character Set" (PDF). ISO/IEC JTC1/SC2/WG2/IRG N2074.
  13. Lunde, Ken. "Exploring IICore—Part 4". CJK Type Blog. Adobe Inc.
  14. "Activity Report of the Hong Kong Special Administrative Region (HKSAR)" (PDF). 16 November 2015. ISO/IEC JTC1/SC2/WG2/IRG N2098.
  15. Lunde, Ken; Cook, Richard (31 July 2024). "kIRG_HSource". Unicode Han Database (Unihan). Revision 37. Unicode Consortium.
  16. Computer Chinese Characters Encoding Workgroup (12 June 2009). "Submission of Characters from Macao Information Systems Character Set" (PDF). ISO/IEC JTC 1/SC 2/WG 2 IRGN 1580.
  17. "Macao SAR Activity Report (IRG Meeting #56)" (PDF). 14 March 2021. ISO/IEC JTC1/SC2/WG2/IRG N2456.
  18. Steele, Shawn. "CP 951 & HKSCS". I'm not a Klingon. MS Dev Blog. Retrieved 13 September 2016.
  19. 華通資訊網: 小心!有人悄悄換掉了你的Windows系統字型
  20. Microsoft: Hong Kong Supplementary Character Set – Support for Windows Platform
  21. "Big5-HKSCS編碼初探(上)-黑暗執行緒". blog.darkthread.net. 27 February 2014. Retrieved 3 September 2024.
  22. Microsoft Character Code Conversion Routines For HKSCS-2004
  23. Windows XP Font Pack for ISO 10646:2003 + Amendment 1 Traditional Chinese Support
  24. "CCSID 5471: Mixed Big-5 ext for HKSCS-2001". IBM Globalization - Coded character set identifiers. IBM. Archived from the original on 29 November 2014.
  25. International Components for Unicode (ICU), ibm-5471_P100-2006.ucm, 9 May 2007
  26. "CCSID 9567: Mixed Big-5 ext for HKSCS-2004". IBM Globalization - Coded character set identifiers. IBM. Archived from the original on 29 November 2014.
  27. "CCSID 13663: Mixed Big-5 ext for HKSCS-2008". IBM Globalization - Coded character set identifiers. IBM. Archived from the original on 29 November 2014.
  28. "CCSID 1375: Mixed Big-5 ext for HKSCS". IBM Globalization - Coded character set identifiers. IBM. Archived from the original on 29 November 2014.
  29. Mozilla.org: Bug 343129 – Big5-HKSCS 2004 <==> Unicode Table Update
  30. Bug 162431 – add non-BMP Unicode (plane 1 and above. surrogate) support to charset encoder/decoder
  31. "Qt 4.7: Big5-HKSCS Text Codec". Archived from the original on 4 March 2016. Retrieved 10 November 2011.
  32. Bug 101081 – Non-BMP (plane 1 thru plane 16) characters are not supported
  33. van Kesteren, Anne. "Encoding Standard". WHATWG.