Office 2007 and later store each part of a Word document as XML inside a zip container. If you study the layout (use WinZip to open the .docx file, then extract it to a temporary location for analysis), you might find a better way to do this search and replace directly in the XML. If so, a tool like TextPipe Pro (which can find and replace directly in the XML of Office 2007 documents and zip files) could be handy.
Hope this helps!
PrincessLeah wrote:
Frustration with regexes in Word, VBA, VBScript
02-Feb-09
Previous Posts In This Thread:
Frustration with regexes in Word, VBA, VBScript
I am trying to do some fairly simple (in the scheme of things) regular
expressions in VBA for Word. For 15 years or more, I have been frustrated by
many things about Word, and right now it's back to limitations respecting
regular expressions.
I am looking for various patterns such as "section 54", "subsections 56(4)
and 62.12(2)", "paragraph 23(2)(b)".
The keywords may be singular or plural, the numbers may or may not have
decimal values. I mark the section numbers (such as 62.12) with unique
characters on either side (I use upside-down exclamation and question marks),
then do a single Word find-and-replace to remove the unique characters and
mark the numbers with a character style.
I have done all this very nicely with VBScript regexes incorporated into
VBA, but the problem is that any character-based formatting in my text is
destroyed. I know why this happens, but it's frustrating nonetheless --
Microsoft should have developed a way to avoid the destruction.
Unfortunately, my frustration is only increased by the REAL cause of the
problem -- the lack of competent regexes in Word itself. I mean, why is it
that after all these years, Word regexes are still so stunted that they don't
even have a zero-or-more option and an OR function??? Surely Microsoft could
have spent a tiny bit of money on fixing this long-standing omission,
preferably by exposing VBScript regexes in Word itself, as an alternative to
what's there.
Yes, I can accomplish my task with a pile of individual search-and-replace
operations, but that is inefficient, inelegant, frustrating and downright
stupid.
My main regex is the following:
strFindExpr = "(section|subsection|paragraph|subparagraph)(s)?"
strFindExpr2 = "[\x20\xA0]+([1-9][0-9]{0,2})(\.[1-9][0-9]{0,2})?"
objRegEx.Pattern = strFindExpr & strFindExpr2
(I know I don't need the backslash before the period, but I find it a
useful holdover from Perl-type regexes.)
I am hoping that there is something I am unaware of that would allow me to
use VBScript-style regexes to do what I'm trying to do, without losing my
character formatting.
Is there such a facility, or am I just relegated to either a brute-force
pile of Word regex statements or programming the recognition at each find?
Thanks for any help you can provide, even if it's just to confirm that there
is no other possibility within the realm of Word and VBA.
RE: Frustration with regexes in Word, VBA, VBScript
By the way, I realize that the expression:
strFindExpr = "(section|subsection|paragraph|subparagraph)(s)?"
should be just:
strFindExpr = "(section|paragraph)(s)?"
At the moment, it's just a bit of "self-documentation" to be removed later.
PL
:
Another frustration with Word's regex is that it seems impossible to disable
case sensitivity. Is there some way to make it case insensitive?
Thanks.
PL
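For what it's worth, here is a rough sketch of the usual workaround. As far as I know, Word's wildcard search is always case-sensitive, so the common trick is to spell out both cases in a bracket set. The macro name FindKeywordAnyCase is made up for illustration, and only the first letter is made case-tolerant here; a fully case-insensitive match would need a bracket set for every letter.

```vba
Sub FindKeywordAnyCase()
    ' Sketch only: lists every "section" or "Section" in the main text.
    ' With MatchWildcards on, the search is case-sensitive, so the
    ' first letter is matched with a bracket set instead.
    Dim rng As Range
    Set rng = ActiveDocument.Content
    With rng.Find
        .ClearFormatting
        .Text = "[Ss]ection"
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindStop
        ' Each successful Execute redefines rng to the text just found.
        Do While .Execute
            Debug.Print rng.Start, rng.Text
        Loop
    End With
End Sub
```

Without wildcards, setting .MatchCase = False on the Find object gives an ordinary case-insensitive literal search.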
:
Larry, thanks for your efforts.
Your program code seems to be a simple replace operation in an SGML/XML DTD
(the CDATA keyword), which would not contain any formatting codes (unless
done for documentation purposes, in which case you probably wouldn't be doing
that replacement).
I've done lots of VBScript regexes in VBA, and they work fine if I don't
care about losing character formatting.
Your suggestion about "salvaging" character coding is possible, but more
complex than programming the solution in VBA by doing a series of finds and
parsing found strings in code rather than via regexes (which I have done).
So I have a working program -- I'm just really frustrated that I seem to
have to use a more complex solution than I should have to, just because
Microsoft seems to have not implemented "fully-competent" regexes in Word.
Even a "zero-or-more" operator would be a huge improvement.
I can't understand this lack of capability in the most important
text-handling program in the world.
Of course, this is one case where I really hope I am wrong, and someone can
tell me that there is a way to use "fully-competent" regexes in Word without
losing character formatting.
I'm not holding my breath, though.
PL
:
Sorry, I really don't want to "steal" the thread, but I didn't find
anything about RegExp in Word VBA help. It would be very useful to me if
RegExp were available in Word VBA. Where could I find some
documentation about it?
Regards,
--
Pablo Cardellino
Florianópolis, SC
Brazil
PrincessLeah wrote:
Hi Pablo,
The "trick" is that RegExp is supplied as a COM object from the Scripting
library, C:\Windows\System32\vbscript.dll. VBA can use it, either by
assigning the result of CreateObject("VBScript.RegExp") to an Object
variable, or by going into the Tools > References dialog and setting a
reference to "Microsoft VBScript Regular Expressions". But because it isn't
literally a part of VBA, there's no VBA help topic for it.
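As a minimal sketch of the two routes just described (the macro names are only illustrative): the first needs no reference at all, while the second compiles only after the reference has been set in Tools > References.

```vba
' Late binding: no reference needed; the class is resolved at run time.
Sub RegExpLateBound()
    Dim rx As Object
    Set rx = CreateObject("VBScript.RegExp")
    rx.Pattern = "sections?"
    rx.IgnoreCase = True
    ' Test returns True when the pattern occurs anywhere in the string.
    Debug.Print rx.Test("See Sections 5 and 6")
End Sub

' Early binding: requires the "Microsoft VBScript Regular Expressions"
' reference, and in return gives IntelliSense and compile-time checking.
Sub RegExpEarlyBound()
    Dim rx As RegExp
    Set rx = New RegExp
    rx.Pattern = "sections?"
    rx.IgnoreCase = True
    Debug.Print rx.Test("See Sections 5 and 6")
End Sub
```

Late binding is the easier one to hand to other people, since the macro does not depend on a reference being set on their machine.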
Try these articles:
http://msdn.microsoft.com/en-us/library/ms974570.aspx
http://www.vbaexpress.com/forum/showthread.php?t=6805
http://msdn.microsoft.com/en-us/library/yab2dx62(VS.85).aspx (the official
Help topic)
The problem to which PrincessLeah referred is that RegExp operates only on
strings within VBA, not on formatted ranges within documents. So if you use
RegExp to find some text in a document and replace it with some other text,
you'll lose any character formatting the original text had (and possibly
paragraph/style formatting, if you're unlucky enough to replace paragraph
marks). Word's built-in Find object can preserve or modify formatting, but
its search syntax is comparatively brain-dead and there's no sign that
anyone is looking at fixing it.
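One possible way around this, offered only as an untested sketch: let RegExp do nothing but locate the matches in the document's text, then apply the formatting through a Range built from the match position, so no text is ever replaced. The macro name MarkSectionNumbers and the character style "SectionNumber" are invented for illustration, and the style must already exist. The sketch also assumes that offsets into Range.Text line up with document positions, which can fail when the text contains fields, tables, footnote references or similar objects.

```vba
Sub MarkSectionNumbers()
    ' Assumptions: late-bound RegExp; a character style named
    ' "SectionNumber" exists (hypothetical name); the main story has
    ' nothing that makes Range.Text offsets drift from Range positions.
    Dim rx As Object, matches As Object, m As Object
    Dim body As Range, hit As Range
    Dim num As String, numStart As Long

    Set rx = CreateObject("VBScript.RegExp")
    rx.Global = True
    rx.IgnoreCase = True
    ' The \. keeps the period literal; a bare . would match any character.
    rx.Pattern = "(section|paragraph)(s)?" & _
                 "[\x20\xA0]+([1-9][0-9]{0,2})(\.[1-9][0-9]{0,2})?"

    Set body = ActiveDocument.Content
    Set matches = rx.Execute(body.Text)

    For Each m In matches
        ' The number is the tail of the match: the integer part plus the
        ' optional decimal part (empty when that group did not take part).
        num = m.SubMatches(2) & m.SubMatches(3)
        numStart = body.Start + m.FirstIndex + m.Length - Len(num)
        Set hit = ActiveDocument.Range(numStart, numStart + Len(num))
        ' Only formatting changes, so existing character formatting
        ' elsewhere in the document is left alone.
        hit.Style = "SectionNumber"
    Next m
End Sub
```

Because the text itself is never rewritten, the match positions stay valid throughout the loop and the matches can be processed in any order.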
--
Regards,
Jay Freedman
Microsoft Word MVP
Email cannot be acknowledged; please post all follow-ups to the newsgroup so
all may benefit.
Pablo Cardellino wrote:
Hi, Jay,
thanks, your explanation will be very useful. One more question: if I use
this object in Word 2003, will the macro run successfully under both Word
2003 and 2000?
Regards
--
Pablo Cardellino
Florianópolis, SC
Brazil
Jay Freedman wrote:
That's correct, there is no difference in the way any of the VBA-enabled
applications work with external COM objects. (Word 95 was the last of the
WordBasic-using versions.)
Princess, to protect significant local formatting, I've used a macro
that surrounds all italic text with <i>...</i>, all bold with <b>...</b>,
etc. Such an approach will surely complicate your regex, but
perhaps you can come up with something more subtle that will be seen
by your regex as just garden-variety text but which a later macro will
be able to recognise and recast back to italic, bold, etc.
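A rough, untested sketch of that tagging idea, using Word's own Find so that nothing outside the italic runs is touched. The macro names TagItalics and UntagItalics are invented for illustration. The first wraps each italic run in literal markers; the second strips the markers again and reapplies italic to what was between them.

```vba
Sub TagItalics()
    ' Wrap every italic run in literal <i>...</i> markers.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = ""                          ' match on formatting alone
        .Font.Italic = True
        .Replacement.Text = "<i>^&</i>"     ' ^& stands for the found text
        .Format = True
        .MatchWildcards = False
        .Forward = True
        .Wrap = wdFindStop
        .Execute Replace:=wdReplaceAll
    End With
End Sub

Sub UntagItalics()
    ' Remove the markers and restore italic on the enclosed text.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "\<i\>(*)\</i\>"            ' < and > must be escaped in wildcard mode
        .Replacement.Text = "\1"
        .Replacement.Font.Italic = True
        .Format = True
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindStop
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

Bold and other attributes would need their own pair of passes, and any regex run in between has to tolerate the markers sitting inside the text it is matching.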
Also, I dredged up this snippet from one of my macros (I know I wrote
it, I just don't remember much about it):
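Dim rx As RegExp
Dim strReplace As String
' $1 re-inserts the captured name after the CDATA-wrapped ampersand
strReplace = "<![CDATA[&]]>$1"
Set rx = New RegExp
With rx
    ' The end of the pattern was cut off in the post; only the closing
    ' parenthesis has been restored here so that the pattern is valid
    .Pattern = "&([A-Z][A-Z0-9._\-]*)"
    .IgnoreCase = True
    .Global = True
End With
wraptext = rx.Replace(wraptext, strReplace)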
I referenced Microsoft VBScript Regular Expressions 5.5 to get the
RegExp class. Is this just what you're already doing?
Try OpenOffice
Have you tried a different word processor, such as OpenOffice?