Latin, FarEast, Complex unicode ranges

Jialiang Ge [MSFT]

Hello Dave,

From your description, you are asking about the character ranges of some
standardized subsets of Unicode, such as CJK and Latin.

The complete Unicode 5.0 character code charts can be found at
http://www.unicode.org/charts/, which is maintained by the Unicode Consortium.
Clicking the name of a subset shows its range, disclaimer, fonts,
and terms of use. For instance, CJK Ideographs Ext. A ranges from
3400-4DBF; Basic Latin ranges from 0000-007F.
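
As a small illustration, the two ranges quoted above can be put into a lookup table and used to test code points. This is only a sketch in Python: `BLOCKS` and `block_of` are names made up for this example, and only the two blocks mentioned in this reply are listed.

```python
# Block ranges quoted in this reply (inclusive code points).
# BLOCKS and block_of are illustrative names, not part of any library.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("CJK Ideographs Ext. A", 0x3400, 0x4DBF),
]


def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, start, end in BLOCKS:
        if start <= cp <= end:
            return name
    return None


if __name__ == "__main__":
    # U+0E01 is a Thai letter; it is outside both listed blocks.
    for ch in ("A", "\u3400", "\u0e01"):
        print(f"U+{ord(ch):04X}: {block_of(ch)}")
```

A fuller table would have to be built from the charts page; the two entries here are just the ones named in the text.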

If you have any other concerns, please feel free to let me know.

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

==================================================
For MSDN subscribers whose posts are left unanswered, please check this
document: http://blogs.msdn.com/msdnts/pages/postingAlias.aspx

Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/subscriptions/managednewsgroups/default.aspx#notifications.
If you are using Outlook Express/Windows Mail, please make sure
you clear the check box "Tools/Options/Read: Get 300 headers at a time" to
see your reply promptly.

Note: The MSDN Managed Newsgroup support offering is for non-urgent issues
where an initial response from the community or a Microsoft Support
Engineer within 1 business day is acceptable. Please note that each follow
up response may take approximately 2 business days as the support
professional working with you may need further investigation to reach the
most efficient resolution. The offering is not appropriate for situations
that require urgent, real-time or phone-based interactions or complex
project analysis and dump analysis issues. Issues of this nature are best
handled working with a dedicated Microsoft Support Engineer by contacting
Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/subscriptions/support/default.aspx.
==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.
 
Jialiang Ge [MSFT]

Hello Dave,

The Unicode subset bitfield settings are described in
Office Open XML Part4 - Markup Language Reference (tagged).pdf, Page 758,
section 2.8.2.16 sig (Supported Unicode Subranges and Code Pages).
See its attributes usb0, usb1, usb2, and usb3 on Page 761. Together they form a
128-bit Unicode subset bitfield (USB) defined for each font.

In the Word document package, \word\fontTable.xml contains something like:
<w:font w:name="Times New Roman">
<w:panose1 w:val="02020603050405020304" />
<w:charset w:val="00" />
<w:family w:val="roman" />
<w:pitch w:val="variable" />
<w:sig w:usb0="20002A87" w:usb1="C0007841" w:usb2="00000009"
w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000" />
</w:font>

w:usb0 through w:usb3 define the Unicode subsets for this font (Times New Roman).
For instance,
w:usb0="20002A87"
is "00100000000000000010101010000111" in binary, which corresponds to:
Basic Latin, Latin-1 Supplement, Latin Extended-A, Basic Greek, Cyrillic,
Basic Hebrew, Basic Arabic, Latin Extended Additional.
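
The decoding above can be reproduced with a few lines of Python. This is a minimal sketch, not an official tool: `USB0_NAMES`, `set_bits`, and `describe_usb0` are names invented here, and the table only holds the eight bits relevant to this example (bit 0 is the least significant bit of w:usb0).

```python
# Bit positions in w:usb0 and the subset names used in this reply.
# Only the bits needed for the Times New Roman example are listed.
USB0_NAMES = {
    0: "Basic Latin",
    1: "Latin-1 Supplement",
    2: "Latin Extended-A",
    7: "Basic Greek",
    9: "Cyrillic",
    11: "Basic Hebrew",
    13: "Basic Arabic",
    29: "Latin Extended Additional",
}


def set_bits(hex_value):
    """Return the positions of the bits set in a 32-bit hex string."""
    value = int(hex_value, 16)
    return [bit for bit in range(32) if value & (1 << bit)]


def describe_usb0(hex_value):
    """Name each set bit of w:usb0; unknown bits are reported as 'bit N'."""
    return [USB0_NAMES.get(bit, f"bit {bit}") for bit in set_bits(hex_value)]


if __name__ == "__main__":
    print(format(int("20002A87", 16), "032b"))
    print(describe_usb0("20002A87"))
```

Bits that are not in the small table are reported by number, so they can be looked up by hand in the full bitfield table.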

As for the setting of "Latin text" in Word, it corresponds to the sig
element:
<w:sig w:usb0="00000003" w:usb1="00000000" w:usb2="00000000"
w:usb3="00000000" w:csb0="00000001" w:csb1="00000000" />
"East Asian text" refers to
<w:sig w:usb0="00000001" w:usb1="080F0000" w:usb2="00000010"
w:usb3="00000000" w:csb0="00040000" w:csb1="00000000" />

For each bit in the Unicode Subset Bitfields, we can look up its Unicode
subrange in the table:
http://msdn2.microsoft.com/en-us/library/ms776439(VS.85).aspx
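
To read these values programmatically, something like the following Python sketch could be used. It assumes that the sig elements sit inside w:font elements in the standard WordprocessingML namespace, and that usb0 holds bits 0-31, usb1 bits 32-63, and so on. `SAMPLE` is a cut-down stand-in for a real fontTable.xml, and `read_usb` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in fontTable.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Cut-down stand-in for \word\fontTable.xml, using the values quoted above.
SAMPLE = f"""<w:fonts xmlns:w="{W_NS}">
  <w:font w:name="Times New Roman">
    <w:sig w:usb0="20002A87" w:usb1="C0007841" w:usb2="00000009"
           w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000" />
  </w:font>
</w:fonts>"""


def read_usb(xml_text):
    """Return {font name: 128-bit USB integer} for each w:font with a w:sig."""
    root = ET.fromstring(xml_text)
    result = {}
    for font in root.iter(f"{{{W_NS}}}font"):
        sig = font.find(f"{{{W_NS}}}sig")
        if sig is None:
            continue
        value = 0
        for i in range(4):
            # Assumption: usbN contributes bits 32*N .. 32*N + 31.
            part = int(sig.get(f"{{{W_NS}}}usb{i}"), 16)
            value |= part << (32 * i)
        result[font.get(f"{{{W_NS}}}name")] = value
    return result


if __name__ == "__main__":
    for name, usb in read_usb(SAMPLE).items():
        bits = [b for b in range(128) if usb & (1 << b)]
        print(name, bits)
```

Each bit number printed can then be looked up in the table linked above.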

Hope it helps.

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

=================================================
When responding to posts, please "Reply to Group" via your newsreader
so that others may learn and benefit from your issue.
=================================================
This posting is provided "AS IS" with no warranties, and confers no rights.
 
Jialiang Ge [MSFT]

Hello Dave,

The sig element (see my last reply) for Thai characters generated by Word
is defined as:
<w:sig w:usb0="81000003" w:usb1="00000000" w:usb2="00000000"
w:usb3="00000000" w:csb0="00010001" w:csb1="00000000" />
The first 32 bits, defined in w:usb0, are: 10000001000000000000000000000011.
According to the table in
http://msdn2.microsoft.com/en-us/library/ms776439.aspx, it corresponds to
the Unicode subrange:
Basic Latin, Latin-1 Supplement, *Thai (0E00-0E7F)* and
General/Supplemental Punctuation

Therefore, the Unicode subrange for Thai characters is 0E00-0E7F. Because
Thai text also uses some Basic Latin characters (e.g. 'a', 'b') and
punctuation, Basic Latin, Latin-1 Supplement, and General/Supplemental
Punctuation are also included in its sig element.

As we know, each font defines the Unicode ranges that it can be applied
to. We can retrieve the Unicode ranges of a font by calling the
GetFontUnicodeRanges API
(http://www.pinvoke.net/default.aspx/gdi32/GetFontUnicodeRanges.html). Word
will choose the best font installed on the computer that fits the
specified Unicode range. On my machine, Word chooses the font "Cordia New" for
Thai (0E00-0E7F):
<w:font w:name="Cordia New">
<w:panose1 w:val="020B0304020202020204" />
<w:charset w:val="00" />
<w:family w:val="swiss" />
<w:pitch w:val="variable" />
<w:sig w:usb0="81000003" w:usb1="00000000" w:usb2="00000000"
w:usb3="00000000" w:csb0="00010001" w:csb1="00000000" />
</w:font>

The font may differ on your side if you do not have the "Cordia New"
font installed.

But how can we find a suitable font for a given Unicode subrange?
One approach I can think of is to:
1. List all the fonts installed on the computer.
http://msdn2.microsoft.com/en-us/library/0yf5t4e8.aspx
2. Iterate over the fonts, call "GetUnicodeRanges"
http://msdn2.microsoft.com/en-us/library/ms533944(VS.85).aspx for each one, and
see if it fits the given Unicode range.
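
The matching step in the two-step method above could look roughly like this in Python. The sketch deliberately leaves out the Windows API calls: it assumes the ranges of each installed font have already been collected into a dictionary. The font names and ranges in `INSTALLED` are invented sample data, and `supports` and `fonts_for` are hypothetical helper names. Here "fits" is taken to mean that the font covers every character of a sample string, which is only one possible interpretation.

```python
# Invented sample data: font name -> inclusive (first, last) code point ranges.
# Real values would come from enumerating fonts and querying their ranges.
INSTALLED = {
    "Sample Thai Font": [(0x0020, 0x007E), (0x0E01, 0x0E3A), (0x0E3F, 0x0E5B)],
    "Sample Latin Font": [(0x0020, 0x007E), (0x00A0, 0x00FF)],
}


def supports(font_ranges, text):
    """True if every character of text falls inside one of the font's ranges."""
    return all(
        any(first <= ord(ch) <= last for first, last in font_ranges)
        for ch in text
    )


def fonts_for(text, installed):
    """Return the names of all fonts whose ranges cover every character of text."""
    return [name for name, ranges in installed.items() if supports(ranges, text)]


if __name__ == "__main__":
    print(fonts_for("\u0e01\u0e02", INSTALLED))  # two Thai letters
    print(fonts_for("abc", INSTALLED))
```

Checking a sample string rather than the whole block 0E00-0E7F avoids rejecting fonts merely because the block contains unassigned code points.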

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
David Thielen

Hi;

Thank you for the font info - that helps. But my main problem is the
following:

If a run of text has the setting <w:rFonts w:ascii="Arial"
w:fareast="Verdana" w:h-ansi="Courier New" w:cs="Arial Unicode MS"/>

How do I know which font to use for a character? I know what ascii is
(0-127) and I know CJK is fareast and Hebrew/Arabic is cs. But what are Thai,
Bengali, etc.?

And which ranges does Word assign to ascii/h-ansi or cs depending on what the
characters before them are?

And h-ansi is everything that is not ascii/fareast/cs - correct?

--
thanks - dave
david_at_windward_dot_net
http://www.windwardreports.com

Cubicle Wars - http://www.windwardreports.com/film.htm
 
Jialiang Ge [MSFT]

Hello Dave,

The EA font slot defines the font used for the Unicode ranges for the JPN,
CHT, CHS, or KOR languages. In general, characters in the PUA and the
corresponding Unicode planes are formatted with the EA font.

The CS font slot defines the font used for any script that uses combining
diacritics, is laid out right-to-left, or builds characters vertically.
This includes a variety of script groupings as defined in Unicode. Examples
are Thai, Arabic, Hebrew, and the Indic scripts.

The "hansi" font slot is better thought of as "other" and defines the font
used for any other script, including Latin, Cyrillic, and Greek.
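
To make the three descriptions above concrete, here is a deliberately simplified Python sketch that assigns a character to a slot by Unicode block. It is not how Word decides: the block lists are a small hand-picked subset, the "ascii" slot for 0-127 follows the question earlier in this thread, and `font_slot`, `EAST_ASIAN`, and `COMPLEX_SCRIPT` are names invented for this example.

```python
# Inclusive code point ranges; a small hand-picked subset of Unicode blocks.
EAST_ASIAN = [
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Ideographs Ext. A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xE000, 0xF8FF),  # Private Use Area (PUA)
]
COMPLEX_SCRIPT = [
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x0900, 0x097F),  # Devanagari
    (0x0980, 0x09FF),  # Bengali
    (0x0E00, 0x0E7F),  # Thai
]


def in_any(cp, ranges):
    """True if code point cp lies in one of the inclusive ranges."""
    return any(first <= cp <= last for first, last in ranges)


def font_slot(ch):
    """Guess a font slot for ch: 'EA', 'CS', 'ascii', or 'hansi' (other)."""
    cp = ord(ch)
    if in_any(cp, EAST_ASIAN):
        return "EA"
    if in_any(cp, COMPLEX_SCRIPT):
        return "CS"
    if cp <= 0x7F:
        return "ascii"
    return "hansi"


if __name__ == "__main__":
    for ch in ("a", "\u0416", "\u05d0", "\u0e01", "\u0995", "\u4e2d"):
        print(f"U+{ord(ch):04X} -> {font_slot(ch)}")
```

Under this rough mapping, Thai and Bengali both land in the CS slot, which matches the examples given above; context-dependent cases are not handled at all.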

Which font slots are exposed in the Word UI is determined by the enabled
languages on the system.

If you want to dig into how Word itself decides which font to use for a
Unicode character, I am sorry, but that is not something I can share publicly.
The internal implementation may change from time to time. If software is
developed against the current internal implementation of the Office system,
it may break if the internal design changes.

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
