Non-standard characters on Web


Peter Jamieson

Glad you said that.

John McGhie said:
I'm going to China in a
couple of weeks, so I will let you know when I get back :)

Good luck!

Peter Jamieson
John McGhie said:
Hi Peter:

Yes, that's my understanding too.

UTF-8 uses a "Shift" character to express high-order characters as
double-byte (16 bit) but expresses all ANSI characters as single-byte.
Since the majority of characters in English text ARE ANSI characters, it's
half the size.

UTF-16 encodes every character as 16-bits (two bytes) and is thus close to
double the size. And because it can be either "Big endian" or "Little
endian", it relies on the recipient application getting the byte order
correct.

Things can (and do...) go wrong along the way and one can get some
problems with badly-coded applications.

I believe that Asian applications will do better with UTF-16 because the
majority of their characters are double-byte. I'm going to China in a
couple of weeks, so I will let you know when I get back :)

Cheers
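
For anyone who wants to check the size comparison above, here is a minimal
Python sketch. The sample strings and the helper name encoded_sizes are
assumptions made up for illustration; they are not from the thread. It
counts the bytes each encoding needs for the same text. UTF-8 takes between
one and four bytes per character, so the saving depends on the script.

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16.
# The sample strings below are illustrative assumptions.
samples = {
    "English": "Hello, world",
    "French": "m'\u00e9crire",               # e-acute, U+00E9
    "Chinese": "\u4f60\u597d\u4e16\u754c",   # four CJK ideographs
}

def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without a BOM)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),  # explicit byte order, so no BOM is added
    )

for name, text in samples.items():
    chars, utf8_len, utf16_len = encoded_sizes(text)
    print(f"{name}: {chars} chars, UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

With these samples the English text is half the size in UTF-8, while the
Chinese text is smaller in UTF-16, because each of those ideographs takes
three bytes in UTF-8 and two in UTF-16.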

As I understood it, UTF-8 and UTF-16 are both just encodings primarily
intended for compression; either of them can be used to encode any
Unicode character. Is that not the case?

Peter Jamieson

Corentin Cras-Méneur said:
[...]
What I usually do is save directly out of Word, but I set the
encoding to UTF-8. That will support almost any character in
the known universe :)

(that would be UTF-16 John ;-) )


Corentin

--
--- Mac:MS MVP (Francophone) http://www.cortig.net/wordpress/ ---
http://www.mvps.org - http://mvp.support.microsoft.com MVPs
are not MS employees - Les MVP ne travaillent pas pour MS Remove
"NoSpam" to e-mail me - Retirez "NoSpam" pour m'écrire

--
Don't wait for your answer, click here: http://www.word.mvps.org/

Please reply in the group. Please do NOT email me unless I ask you to.

John McGhie, Consultant Technical Writer
McGhie Information Engineering Pty Ltd
http://jgmcghie.fastmail.com.au/
Sydney, Australia. S33°53'34.20 E151°14'54.50
+61 4 1209 1410, mailto:[email protected]
 

Corentin Cras-Méneur

John McGhie said:
No, I did mean UTF-8 :)

UTF-16 plays up on many applications, because it relies upon a byte-order
mark which often gets screwed up :)


Oh, UTF-16 is not perfect. I'm not even saying it is "better" than
UTF-8. I'm just saying that UTF-8 will not "support almost any character
in the known universe" because some of them are coded over two bytes which
requires something like UTF-16 (and since you're going to China in a
couple of weeks, you should be able to play with that there ;-) ).


Corentin
 

Peter Jamieson

Well, I tried, but obviously failed...

Corentin Cras-Méneur said:
because some of them are coded over two bytes

Unicode was originally a 16-bit character set, i.e. there would be at most
65535/6 Unicode characters.

The original Version 1.0, volume 1 standard (I have the book here) does not
even mention "UTF" - AFAIK it came later.

However, obviously someone decided that it would be sensible to have a good
encoding that allowed any 16-bit Unicode character to be encoded as a
sequence of octets, and that's UTF-8.

In other words, it doesn't really matter how wide "Unicode" is. It might for
example eventually include the 975,657,228,345,367,926 different characters
known in our galaxy (OK, my initial estimate :) ), but it would still be
possible to represent any of those characters using a sequence of octets.
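
To make the "sequence of octets" idea concrete, here is a rough Python
sketch of a hand-written UTF-8 encoder for a single code point, checked
against Python's built-in codec. The function name utf8_encode_code_point
and the test characters are assumptions for illustration only. There is no
separate shift character involved: the high bits of the first octet say how
many octets follow.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point (an int) as UTF-8 bytes, by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # up to 7 bits: one octet, identical to ASCII
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: two octets
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: three octets
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: four octets
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare the hand-written encoder with the built-in one.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    by_hand = utf8_encode_code_point(ord(ch))
    assert by_hand == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} ->", " ".join(f"{b:02x}" for b in by_hand))
```

A wider character set only needs longer sequences; the octets themselves
stay 8 bits wide.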

As it happens, Unicode moved beyond 16 bits to a 32-bit standard, and as far
as I know, every Unicode character can be represented as a unique 32-bit
number, and can also be represented using either UTF-8 or UTF-16 encoding.
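
A quick way to check this claim is the Python sketch below. The helper names
round_trips and utf16_code_units and the sample characters are assumptions
for illustration. It encodes and decodes characters from inside and outside
the original 16-bit range with both encodings. A character above U+FFFF
takes four bytes in UTF-8 and two 16-bit code units (a surrogate pair) in
UTF-16.

```python
def round_trips(text):
    """True if text survives both a UTF-8 and a UTF-16 encode/decode cycle."""
    return (text.encode("utf-8").decode("utf-8") == text
            and text.encode("utf-16").decode("utf-16") == text)

def utf16_code_units(ch):
    """Number of 16-bit code units UTF-16 needs for a one-character string."""
    return len(ch.encode("utf-16-le")) // 2

# U+0041 (Latin), U+4E2D (CJK, 16-bit range), U+20000 (CJK, beyond 16 bits)
for ch in ["A", "\u4e2d", "\U00020000"]:
    print(f"U+{ord(ch):04X}: round trip {round_trips(ch)}, "
          f"UTF-8 {len(ch.encode('utf-8'))} bytes, "
          f"UTF-16 {utf16_code_units(ch)} code unit(s)")
```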

Luckily, we no longer have to consider processors that only deal with 4-bit
characters or hex digits (originally known in some circles as "nybles"). But
there's no reason why, in principle, any 32-bit Unicode character should not
be encoded as a sequence of nybles or hex digits. Unfortunately, because of
the absence of some key standards, when encoding Unicode as UTF-16 we have
to pay attention to the sequence of 8-bit bytes within those 16-bit "words".

Peter Jamieson
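
The byte-order point can be seen in a small Python sketch. The sample
character and variable names are assumptions for illustration. The same
character gives two different byte sequences in UTF-16, and reading them in
the wrong order silently produces a different character unless a byte-order
mark is present.

```python
import codecs

text = "\u4e2d"  # one character, U+4E2D

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first
print(big.hex(), little.hex())

# The same two bytes read with the wrong byte order give a different character.
misread = big.decode("utf-16-le")
print(f"U+{ord(misread):04X}")

# A byte-order mark (U+FEFF) at the start tells the reader which order was used;
# Python's generic "utf-16" decoder consumes it and picks the matching order.
marked = codecs.BOM_UTF16_BE + big
print(marked.decode("utf-16") == text)
```

UTF-8 has no such ambiguity, because its unit is a single octet.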
 
