How do I tell Word 2003 my mail-merge text is NOT UTF-8?

Jolyon Cox

I am using Word 2000 automation to invoke a mail-merge from a number of C++
applications. For various operational reasons not relevant here, the
applications create a tab-separated temporary file and call a library
function which uses the OpenDataSource method of the MailMerge object to
connect it to the template before using Execute to perform the merge. The
version of Word actually in use is Word 2003.

It all works fine in the majority of cases, but one of the merge fields is
8-bit text used to print an i2of5 barcode using a special font - in this
format each 8-bit character encodes 2 barcode digits. Every now and again
the value in this field gets truncated - this is triggered by certain values
in the first data line of the file (i.e. the second line, as the first
contains field names), and thereafter applies to every instance of the field
in the data source. Moving the offending data line further down the file
"cures" the problem, but is not an option in practice.

By dropping the offending field into a simple text file and opening it
interactively with Word, I have established that Word is wrongly guessing the
file (or maybe only the field ?) to be encoded in UTF-8 - it brings up an
interactive dialogue which shows the guessed encoding and previews the text
as truncated. Selecting other encodings displays the text in various other
ways, but (critically) does not truncate it. Similarly, most values for the
barcode text are guessed by Word to be in various encodings which do not
truncate.

I have tried explicitly setting the WdOpenFormat parameter of OpenDataSource
to wdOpenFormatText, but it makes no difference - in fact, whatever value is
in this parameter seems to be ignored. Is there any other way I can get
these characters passed through without being corrupted, other than by
writing the data source file in 16-bit Unicode in the first place ? I am
reluctant to do this as it would mean either changing many programs or making
the library routine transcribe the entire data source file, which can be
quite large.
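
For reference, the "write it as 16-bit Unicode" alternative mentioned above can be done as a streaming copy, so a large file never has to be held in memory. This is only a minimal sketch in Python rather than the C++ of the real library; the function name `transcode_to_utf16` and the assumption that the source bytes are Windows-1252 are illustrative, not taken from the applications described here. Note that Python's `cp1252` codec rejects the five byte values that Windows-1252 leaves undefined, so barcode data using those would need a different source encoding.

```python
def transcode_to_utf16(src_path, dst_path, src_encoding="cp1252",
                       chunk_chars=64 * 1024):
    """Copy a text file to UTF-16LE with a byte-order mark, chunk by chunk.

    src_encoding is an assumption about the 8-bit data (Windows-1252 here).
    newline="" disables newline translation so CR/LF pairs pass through as-is.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-16-le", newline="") as dst:
        dst.write("\ufeff")  # byte-order mark, written once at the start
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```

Whether the extra pass is acceptable depends on the file sizes involved; the copy itself is linear in the size of the data source.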

For the record, a value of the text string which causes problems is "(7å<Ó)"
(i.e. hex 28 37 e5 3c d3 29), whereas values such as "(7èÌJ)" (i.e. hex 28
37 e8 cc 4a 29) work OK. My PC is running Windows XP SP2 with UK English
regional settings (locale ID 0x0809).
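
A quick way to check which bytes a given barcode string corresponds to, and whether those bytes are well-formed UTF-8, is sketched below. It assumes the 8-bit text is Windows-1252 (the usual ANSI code page for UK English settings); the helper names `hex_dump` and `is_valid_utf8` are made up for illustration. It does not attempt to reproduce Word's encoding-detection heuristic, which is not documented in this thread.

```python
def hex_dump(text, encoding="cp1252"):
    """Return the bytes of `text` in the given 8-bit encoding as spaced hex."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


def is_valid_utf8(data):
    """True if `data` decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # The two sample values quoted above, written with escapes so the
    # script does not depend on the encoding of its own source file.
    samples = ["(7\u00e5<\u00d3)", "(7\u00e8\u00ccJ)"]
    for sample in samples:
        raw = sample.encode("cp1252")
        print(hex_dump(sample), "| strict UTF-8:", is_valid_utf8(raw))
```

Neither sample is well-formed UTF-8 under a strict decoder, which suggests that whatever guess Word makes is based on a looser test than full validation.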

Any help anyone can offer will be much appreciated.

Regards,

Jolyon
 
Peter Jamieson

I am using Word 2000 automation to invoke a mail-merge from a number of C++
applications. For various operational reasons not relevant here, the
applications create a tab-separated temporary file and call a library
function which uses the OpenDataSource method of the MailMerge object to
connect it to the template before using Execute to perform the merge. The
version of Word actually in use is Word 2003.

Is it Word 2000, as you first mention, or Word 2003? From your description
it sounds like 2003, but there are significant differences.

It is probably worth seeing whether setting the DefaultCPG registry value
described in the following article helps:

http://support.microsoft.com/kb/290981/en-us

(It's also possible that opening the document in Word with an explicit
encoding, saving it as a Word document, then using that as the data source
for a merge, as described in that article, might do the trick.)

If your data source has 255 columns or fewer, and you are using Word 2003,
you can try the approach using .odc and SCHEMA.INI that I described in the
conversation beginning at

http://groups.google.com/group/micr...q=jamieson+SCHEMA.INI+odc+text+unicode&rnum=1

If you are using Word 2000, that can't work because it doesn't support OLE
DB and .odc files. Although you can use ODBC to connect using a similar
SCHEMA.INI, the ODBC driver does not seem to recognise all the entries in
the .INI file that the OLE DB provider does, and it doesn't have the same
character encoding support anyway AFAIK.

In this particular case, it's possible that the 8-bit encoding will screw up
any encoding choices you make anyway, if Word does recognise the characters
as being part of the character set implicitly or explicitly specified.
However, you can but try.

Peter Jamieson
 
Peter Jamieson

BTW, to use the .odc sample I reference, you will probably need to replace
the reference to the ACE OLE DB provider in the connection string by the Jet
one, i.e. replace

Microsoft.ACE.OLEDB.12.0

with

Microsoft.Jet.OLEDB.4.0

Peter Jamieson
 
Jolyon Cox

Thanks for the reply. In answer to your points:

1) We are using the Word 2000 automation interface because that is what is
supported by the tool in which the programs are developed (Borland Developer
Studio 2006). In this version, for instance, the OpenDataSource() method has
only 14 parameters rather than the current 16. However, we actually have
Office 2003 installed because this fixes some (though not all) other problems
to do with mail-merge.

2) I tried adding the registry key mentioned in the KB article - for Word 11
as well as Word 10 - but it makes no difference.

3) I don't think your suggestion for using .odc is feasible - the items in
the data source come from a variety of places. Many of them are retrieved
from an Ingres database via ODBC, but some are created on the fly by
application code. Anyway, I do not have the resources to rewrite and re-test
all the affected applications, not to mention retraining all the developers.

All I am trying to do is prevent Word from making wild (and wrong) guesses
about the content of a mail-merge data source - surely this is not an
unreasonable thing to expect ? I do note that even the latest implementation
of OpenDataSource() does not have an encoding parameter as Documents.Open()
has - this seems to be the real problem.

I will keep plugging away and let you know of any progress...

Jolyon
 
Peter Jamieson

All I am trying to do is prevent Word from making wild (and wrong) guesses
about the content of a mail-merge data source - surely this is not an
unreasonable thing to expect ?
I do note that even the latest implementation of OpenDataSource() does not
have an encoding parameter as Documents.Open() has - this seems to be the
real problem.

Yes, I would also prefer it if Word let you specify the encoding and
everything could be kept very simple, but unfortunately
a. I don't work for Microsoft - I'm just a volunteer - so I am also stuck
with the way Word actually works.
b. the .odc approach is the only one I know that has a chance of solving
the specific problem you described in a reasonably simple way.

(FWIW
c. several of the parameters in OpenDataSource have no effect and are
probably only there because someone in the WordBasic era decided that
OpenDataSource would probably need much the same parameters as Open.
d. Arguably the whole problem with OpenDataSource, ODBC and OLE DB is that
between them they don't abstract the business of opening an arbitrary data
source anything like well enough. For example, even if you had a character
encoding parameter, Word would have to know that it would have to be able to
provide it to its external text converter via one mechanism, and to OLE DB
via another, and that would be data source-dependent. For example, when
opening a .txt file via OLE DB there is no parameter you can specify in the
connection string that says "use this character encoding".)
3) I don't think your suggestion for using .odc is feasible - the items in
the data source come from a variety of places. Many of them are retrieved
from an Ingres database via ODBC, but some are created on the fly by
application code. Anyway, I do not have the resources to rewrite and re-test
all the affected applications, not to mention retraining all the developers.

OK, but in your original post you described a specific situation where you
were creating a tab-separated temp file with barcode data and using that as
your data source - in that case I would hope that you would be able to limit the
use of .odc to the specific situation where you are creating the data source
on-the-fly. If you're connecting on-the-fly using a library routine under
your control then if absolutely necessary you could consider adding a .odc
and an entry in a SCHEMA.INI on-the-fly as well. As far as the .odc is
concerned, it is in effect just a text file with the path name of the text
file's folder and the file name of the .txt file, i.e. no nasty binary stuff
to create, and the SCHEMA.INI is a standard .INI with one section per file.
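
To give an idea of how little is involved on the SCHEMA.INI side, here is a minimal sketch in Python of building the section for one tab-separated file. The entry names (`Format`, `ColNameHeader`, `CharacterSet`) are standard Text driver SCHEMA.INI entries, but the function name `schema_ini_section`, the choice of `ANSI` as the character set, and the assumption that the first line holds field names are illustrative only. The .odc file is not shown because its exact contents are not given in this thread.

```python
def schema_ini_section(file_name, character_set="ANSI"):
    """Build the SCHEMA.INI section describing one tab-delimited text file.

    file_name     -- name of the data file, without its folder; SCHEMA.INI
                     has to sit in the same folder as the file it describes
    character_set -- assumed value; ANSI means the system ANSI code page
    """
    lines = [
        f"[{file_name}]",
        "Format=TabDelimited",      # fields are separated by tab characters
        "ColNameHeader=True",       # first line contains the field names
        f"CharacterSet={character_set}",
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Hypothetical temp-file name, purely for illustration.
    print(schema_ini_section("merge_data.txt"), end="")
```

A library routine that already writes the temporary data file could append such a section to a SCHEMA.INI in the same folder before connecting.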
1) We are using the Word 2000 automation interface because that is what is
supported by the tool in which the programs are developed (Borland Developer
Studio 2006). In this version, for instance, the OpenDataSource() method has
only 14 parameters rather than the current 16. However, we actually have
Office 2003 installed because this fixes some (though not all) other problems
to do with mail-merge.

OK, I'm not sure whether using the Word 2000 interface would make any
difference as far as encoding issues are concerned, unless it prevented you
from using the OLE DB connectivity available in Word 2003 (in which case you
could not use the .odc approach), /or/ you needed to connect using DDE or
ODBC and the inability to specify the Subtype parameter prevented you from
doing that. I would have thought that with Borland you would be able to
specify a .olb or .tlb with the correct parameter list if necessary, somehow
or other (with the older versions of Delphi, for example, there was support
for the dispatch method of Automation and I don't think the compiler did any
type checking at all, but that approach may be significantly harder to use
in more recent versions, and/or with C++ rather than Delphi).

Peter Jamieson
 
Jolyon Cox

Peter,

My apologies - on re-checking, I noticed that I had inadvertently applied
the default code page registry fix under HKLM instead of HKCU. Doing it
under HKCU fixes the problem. Many thanks for the insight - I will re-rate
your original reply.
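
For anyone else hitting this, the fix can be captured as a .reg file so that it lands under HKCU rather than HKLM. The sketch below, in Python, only generates the text of such a file. The key path `Software\Microsoft\Office\<version>\Word\Options` and the use of a DWORD holding the code page number are my reading of the KB article and should be checked against it; the function name `default_cpg_reg` and the choice of code page 1252 are illustrative assumptions.

```python
def default_cpg_reg(code_page=1252, office_versions=("10.0", "11.0")):
    """Return .reg file text that sets DefaultCPG under HKEY_CURRENT_USER.

    code_page       -- assumed default code page (1252 = Windows Western)
    office_versions -- Office version keys to write (10.0 and 11.0 are
                       the "Word 10" and "Word 11" tried in this thread)
    """
    out = ["Windows Registry Editor Version 5.00", ""]
    for version in office_versions:
        # Key path assumed from the KB article; verify before relying on it.
        key = "\\".join(["HKEY_CURRENT_USER", "Software", "Microsoft",
                         "Office", version, "Word", "Options"])
        out.append(f"[{key}]")
        out.append(f'"DefaultCPG"=dword:{code_page:08x}')
        out.append("")
    return "\r\n".join(out)


if __name__ == "__main__":
    print(default_cpg_reg())
```

Because the value lives under HKCU, it has to be applied for each user account that runs the merge, not once per machine.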

Just for the record, I have replied below to your latest points. Many
thanks again.

Jolyon

Peter Jamieson said:
Yes, I would also prefer it if Word let you specify the encoding and
everything could be kept very simple, but unfortunately
a. I don't work for Microsoft - I'm just a volunteer - so I am also stuck
with the way Word actually works.
b. the .odc approach is the only one I know that has a chance of solving
the specific problem you described in a reasonably simple way.

Fair enough - sorry, didn't mean to take out my frustrations on you...
(FWIW
c. several of the parameters in OpenDataSource have no effect and are
probably only there because someone in the WordBasic era decided that
OpenDataSource would probably need much the same parameters as Open.
d. Arguably the whole problem with OpenDataSource, ODBC and OLE DB is that
between them they don't abstract the business of opening an arbitrary data
source anything like well enough. For example, even if you had a character
encoding parameter, Word would have to know that it would have to be able to
provide it to its external text converter via one mechanism, and to OLE DB
via another, and that would be data source-dependent. For example, when
opening a .txt file via OLE DB there is no parameter you can specify in the
connection string that says "use this character encoding".)


OK, but in your original post you described a specific situation where you
were creating a tab-separated temp file with barcode data and using that as
your data source - in that case I would hope that you would be able to limit the
use of .odc to the specific situation where you are creating the data source
on-the-fly. If you're connecting on-the-fly using a library routine under
your control then if absolutely necessary you could consider adding a .odc
and an entry in a SCHEMA.INI on-the-fly as well. As far as the .odc is
concerned, it is in effect just a text file with the path name of the text
file's folder and the file name of the .txt file, i.e. no nasty binary stuff
to create, and the SCHEMA.INI is a standard .INI with one section per file.

OK - maybe I've misunderstood what is involved here - it's not an area I'm
familiar with...
OK, I'm not sure whether using the Word 2000 interface would make any
difference as far as encoding issues are concerned, unless it prevented you
from using the OLE DB connectivity available in Word 2003 (in which case you
could not use the .odc approach), /or/ you needed to connect using DDE or
ODBC and the inability to specify the Subtype parameter prevented you from
doing that. I would have thought that with Borland you would be able to
specify a .olb or .tlb with the correct parameter list if necessary, somehow
or other (with the older versions of Delphi, for example, there was support
for the dispatch method of Automation and I don't think the compiler did any
type checking at all, but that approach may be significantly harder to use
in more recent versions, and/or with C++ rather than Delphi).

Yes, of course I could have faked up an interface definition for the latest
version if it would have fixed it, though it would be a bit tedious and
harder to maintain.
 
