Character entities in Infopath XML

Nick Head

We've got some Word RTF docs that we are saving as HTML and then transforming
to IP-compatible XML for editing. A problem occurs when extended characters
are found in the XML file.

For example, the character ≥ is exported by MS Word as &ge; when you save as
filtered HTML. Then, when trying to open any document containing an entity
like this in IP, I get the error:

"The form contains schema validation errors - Reference to undefined entity
'ge'."

'Fair enough!' I thought and so added a DTD character entity reference to
the document so that it knew how to handle the character. My resulting XML
looks like this (with IP processing instructions and namespace references
removed for clarity):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE myFields [
<!ENTITY ge "≥" >
]>
<my:myFields>
<my:legacyContent>
3 &ge; 2
</my:legacyContent>
</my:myFields>

However I still get the same issue. I tried opening this XML file in IE to
test it, and sure enough it displays perfectly with no validation errors.
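
For comparison, here is a minimal sketch of how a generic XML parser treats the two cases. It uses Python's standard xml.etree.ElementTree, which is an assumption for illustration only and not the parser InfoPath uses. The my: prefix is dropped so the sample stays well-formed without namespace declarations, and the entity value is written as &#8805; to keep the sample pure ASCII.

```python
import xml.etree.ElementTree as ET

# No DTD: the named entity is undeclared, so parsing should fail.
UNDECLARED = "<myFields><legacyContent>3 &ge; 2</legacyContent></myFields>"

# Internal DTD subset declares 'ge', so the reference can be expanded.
DECLARED = (
    '<!DOCTYPE myFields [ <!ENTITY ge "&#8805;"> ]>'
    "<myFields><legacyContent>3 &ge; 2</legacyContent></myFields>"
)

def parses(xml_source):
    """Return True if the document is accepted by the parser."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False

def legacy_text(xml_source):
    """Return the text of the legacyContent element."""
    return ET.fromstring(xml_source).findtext("legacyContent")
```

That a general-purpose parser accepts the declared version says nothing about InfoPath itself; it only shows that the XML above is reasonable as plain XML.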

Has anyone managed to successfully do this in IP before? Or does IP just not
handle DTD character entity references?

TIA
Nick
 
Nick Head

Perfect! Thanks Matthew

Not technically an Infopath issue here at all but I'll outline the solution
for google:

First process the HTML using HTMLTidy with the 'numeric-entities' switch
turned on. Instead of generating named entities such as &ge;, it outputs the
numeric version, e.g. &#8805; (code point U+2265).

These characters can now be read and edited within IP without requiring a
DTD reference.
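
If running HTMLTidy is not an option, the same idea can be sketched in a few lines of Python. This is a stand-in for Tidy's numeric-entities behaviour, not a copy of it: the function name to_numeric_refs is hypothetical, and it assumes the HTML 4 entity table in html.entities covers the names Word emits.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML predefines must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_numeric_refs(text):
    """Replace known HTML named entities with decimal character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names untouched
        return "&#%d;" % name2codepoint[name]
    return _NAMED_ENTITY.sub(replace, text)

def parse_content(fragment):
    """Wrap a converted fragment in an element and return its parsed text."""
    return ET.fromstring("<legacyContent>%s</legacyContent>" % fragment).text
```

For example, to_numeric_refs("3 &ge; 2") gives "3 &#8805; 2", which a plain XML parser reads without any DTD.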

As an aside, InfoPath will read these numeric references from a plain
single-byte ASCII file. However, when you save or insert new special symbols,
the saved file looks like UTF-16 rather than UTF-8: instead of taking up 7
bytes as a numeric reference such as &#8805;, a special character takes only
2, e.g. 0x65 0x22 (in UTF-8 the same character would be the three bytes 0xE2
0x89 0xA5). But of course your file size will roughly double anyway, as every
other character will have taken on an extra byte.
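
The byte counts are easy to check. A small sketch in Python, used here only as a calculator (the names REFERENCE, CHAR, byte_sizes and hex_bytes are made up for the example): the two-byte form 0x65 0x22 is what U+2265 looks like in little-endian UTF-16, while UTF-8 needs three bytes for it.

```python
REFERENCE = "&#8805;"   # numeric character reference, pure ASCII
CHAR = "\u2265"         # the character it stands for (greater-than-or-equal)

def byte_sizes():
    """Bytes needed for one greater-than-or-equal sign in each form."""
    return {
        "ascii numeric reference": len(REFERENCE.encode("ascii")),
        "utf-8": len(CHAR.encode("utf-8")),
        "utf-16-le": len(CHAR.encode("utf-16-le")),
    }

def hex_bytes(text, encoding):
    """Format the encoded bytes of text as space-separated hex values."""
    return " ".join("0x%02X" % b for b in text.encode(encoding))

# Ordinary ASCII text doubles in size under UTF-16.
PLAIN = "3 > 2"
PLAIN_SIZES = (len(PLAIN.encode("ascii")), len(PLAIN.encode("utf-16-le")))
```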

Cheers
Nick

Matthew Blain (Serriform) said:
If you can, consider saving as WordML, though that may be harder to
transform.
Alternately, tidy can save out XHTML including using numeric Unicode
references instead of named entities.

I have no idea if InfoPath supports DTDs, perhaps someone else here does.

--Matthew Blain
http://tips.serriform.com/
http://www.developingsolutionswithinfopath.com/
