Removing redundant characters

ralf

Hi!

I'm trying to remove all redundant characters like too many spaces or
paragraph signs from a word doc.
This is my code so far:

Dim wdApp As Word.Application
Dim wdDatei As Word.Document
Try
    wdApp = New Word.Application
    wdDatei = wdApp.Documents.Open(txtSource.Text)
Catch ex As Exception
    MessageBox.Show(ex.Message)
    Exit Sub
End Try
Try
    ' Hold one Range: every call to wdDatei.Range returns a new object,
    ' so settings made on one Find would not reach a later Execute.
    Dim rng As Word.Range = wdDatei.Range
    With rng.Find
        .Text = "^13^13"
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = Word.WdFindWrap.wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
        .Execute(Replace:=Word.WdReplace.wdReplaceAll)
    End With
Catch ex As Exception
    MessageBox.Show(ex.Message)
Finally
    wdDatei.SaveAs(txtTarget.Text)
    wdDatei.Close(Word.WdSaveOptions.wdDoNotSaveChanges)
    wdApp.Quit()
    wdDatei = Nothing
    wdApp = Nothing
End Try

When it gets to the line
.Text = "^13^13"
it throws an exception.

I'm using Word 2000 and the Microsoft Word 9.0 Object Library.
Can anybody tell me what I'm doing wrong?
ralf
 
Word Heretic

G'day ralf,

A repost from Word PC-L and AusTechWriter mailing lists

In trying to explain how Word's Find and Replace (FnR) wildcard
mechanism works, I'll also present a practical solution to the
multitude of problems encountered by the seemingly innocuous ^p^p to
^p, whose usual objective is to remove unnecessary blank lines. In
doing so, we shall traverse the width of Word's pitfalls that never
fail to trip up a traveller.

First up, the Word Help System has some excellent help on wildcards.
It is a complete PITA to access, but you can find something. In Word
2k:

F1 - help > Answer Wizard | Index > Search on: wildcard
The second topic down is the master list of all FnR stuff. Select it.
Pick the Wildcard Characters topic down that list.
Now select the _type a wildcard_ hyperlink.
Hooray. Print the damn thing. Use it as a guide from now on :) You
have just found the first excellent Quick Reference in the help
system.


The very last two paragraphs are the key to what I am attempting here.

For our replace a double para with a single para, we would think that
Find ^p^p and replace with ^p would do the job right?

Well, not really. If you do it via VBA you find yourself stalling
forever if your document is terminated by a blank paragraph as you
have to perform it iteratively until you get a Not .Found condition.
Why does it fail to replace the last paragraph mark? Well, you can't
delete the last paragraph mark - ever. When you start a brand new
virgin document and turn on View Formatting, that paragraph mark you
see is the End Of Document paragraph mark. As the document exists and
has a finite end point, that magic pilcrow (backwards P) has to
appear. It is also the marker point in memory to place the nasty
little objects we infest our nice clean ascii text with. Style
definitions, table formatting, list templates, graphical objects and
the list goes on. See Alt + F11 > F2 > Enter for more information.

So, to get around the VBA problem, we simply pre-process the final
paragraph. If it is blank, just a para mark, then kill the second last
character - which must be the penultimate paragraph mark. Manually,
press Ctrl+End and use the backspace key as often as required.
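
From the VB.NET side, the pre-processing step above might be sketched as follows. This is only an illustration: `TrimTrailingBlankParagraphs` is a made-up name, the `Word` alias is assumed to match the namespace used in the question, and the code has not been checked against a particular Word version.

```vbnet
' Assumption: the primary interop assembly is referenced. If your interop
' assembly already exposes a Word namespace, drop this line.
Imports Word = Microsoft.Office.Interop.Word

Module TrailingParagraphs
    ' Hypothetical helper, not part of the Word object model.
    ' Removes empty paragraphs at the very end of the document.
    Sub TrimTrailingBlankParagraphs(doc As Word.Document)
        ' An empty paragraph's text is just its paragraph mark.
        Do While doc.Paragraphs.Count > 1 AndAlso
                 doc.Paragraphs.Last.Range.Text = vbCr
            Dim before As Integer = doc.Paragraphs.Count
            Dim docEnd As Integer = doc.Content.End
            ' The final mark sits at docEnd - 1 and cannot be deleted,
            ' so delete the character just before it instead.
            doc.Range(docEnd - 2, docEnd - 1).Delete()
            ' Guard: if nothing was removed (for example the character
            ' was a table or section boundary), stop rather than loop forever.
            If doc.Paragraphs.Count >= before Then Exit Do
        Loop
    End Sub
End Module
```

In the question's code this would be called as `TrimTrailingBlankParagraphs(wdDatei)` before the find and replace runs.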

The main problem with the simple FnR replace postulation is similar.
If you just delete a para mark, you lose the style for that paragraph.
So, we can get around this by ensuring it is always the trailing
paragraph that gets deleted. It won't do the final blank paragraph in
a document, but this is solved above.

First, we need to understand how the brackets work, and the help topic
does that nicely. So let us put the guide into good use. (^p)^p means
that we have marked the first para mark as our first 'text chunk'. If
we use \1 in the replace string, it means to leave the first text
chunk, the para mark with the holy styling applied, in place.
Unfortunately for us, we still haven't got there yet.

We get an error, we can't use ^p if we are using wildcards. Bastards.
So we have to use ^013 instead. Herein lies our next problem -
paragraph marks that aren't! Oh yes kiddies, just because you see a
pilcrow does not mean you are looking at a paragraph mark. Oh no. Not
with Paste Special and even weirder applications handing in clipboard
data streams without thought. Word dutifully displays a pilcrow when
it encounters an ASCII 013, but the background machinery may not have
resolved into a paragraph object to be kept dynamically updated.

How do I know it is ASCII 013? Well, I cheat. I select the paragraph
mark, or whatever character I need to know, and use VBA. Alt + F11 (VB
Editor or the VBE). Ctrl+G (Immediate Window). Enter:
?AscW(Selection)

I use ASCW() rather than ASC() because I want the full Unicode value.
For ASCII characters the Unicode value is the same. Go ahead, work out
the wildcards' ASCII numbers and write it on yer guide.
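
The same lookup can be done away from Word for characters you can type. The small console sketch below (module name is arbitrary) prints the caret codes for the characters this post uses, padded to three digits the way they are written here.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module CharCodes
    Sub Main()
        ' Characters discussed in this post: paragraph mark, bra, ket, colon.
        For Each c As Char In New Char() {ControlChars.Cr, "("c, ")"c, ":"c}
            ' AscW returns the Unicode code point; "000" pads it to three digits.
            Console.WriteLine("^" & AscW(c).ToString("000"))
        Next
        ' Prints ^013, ^040, ^041 and ^058, one per line.
    End Sub
End Module
```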

So, if we are going to use replace (^013)^013 with ^013 we have to
make sure every ASCII 13 is a damn paragraph mark. Without wildcards
on, find ^013 and replace it with ^p. Honest paragraphs will see no
change, fake paragraphs get converted to your will on the spot.
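
As a rough VB.NET sketch of that normalising pass, using the search and replacement strings exactly as given above. `NormaliseParagraphMarks` is a made-up name, the `Word` alias is assumed to match the question's namespace, and the code is untested against any specific Word version.

```vbnet
' Assumption: the primary interop assembly is referenced. If your interop
' assembly already exposes a Word namespace, drop this line.
Imports Word = Microsoft.Office.Interop.Word

Module ParagraphMarks
    ' Hypothetical helper, not part of the Word object model.
    Sub NormaliseParagraphMarks(doc As Word.Document)
        ' Keep one Range so the settings and Execute share a Find object.
        Dim rng As Word.Range = doc.Content
        With rng.Find
            .ClearFormatting()
            .Replacement.ClearFormatting()
            .Text = "^013"               ' any character 13, real paragraph or not
            .Replacement.Text = "^p"     ' a genuine paragraph mark
            .Forward = True
            .Wrap = Word.WdFindWrap.wdFindStop
            .Format = False
            .MatchWildcards = False      ' this pass runs with wildcards off
            .Execute(Replace:=Word.WdReplace.wdReplaceAll)
        End With
    End Sub
End Module
```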

Now you can get serious and stick yer wildcard search on. Replace
(^013)^013 with \1 and we're in the clear. Done.

In a similar fashion, the much simpler exercise of replacing a colon
that occurs after a ket - a ) char - without destroying the ket
itself, would be to use wildcards, and replace (^041)^058 with \1.

However, if we were searching for a bra, a ( character, we run into
another peculiar little Word problem with managing RTF strings. If you
insert a symbol from the Wingdings range, or many other non-Unicode
graphical fonts, Word actually stores a marker there instead, and then
stores the actual font character off beyond the end of section mark.
That marker is ASCII 40, our unfortunate bra. So an ^040^058 sequence
could very well be any damn symbol followed by a colon.

If we were using two blank paragraphs before every heading and no
space before to ensure our new pages always start at the very top no
matter the method used to page break, and we wanted to get rid of
scads of three or more blank paras in excess of a single hit (are we
listening VBA people?) we could do something evil and wicked like
this: find (^013{2,2})(^013)@ and replace it with \1. This leaves us
with a maximum of two following blank paragraphs anywhere in the
document, even at the end - in one single find operation.

Interestingly enough, for those still able to follow,
(^013{2,2})^013{1,} fails with an invalid pattern. I forced it with
the brackets for the above solution.

Which then brings us to the final solution for technical writers
seeking to mass destroy all blank lines. It has taken a while, but boy
haven't we learnt a lot of useless stuff about Word on the way. Find
(^013)(^013)@ and replace with \1 to kill all blank paras in the
document in a single pass, with the exception of the first paragraph
(there is no start of document paragraph mark to give us a
two-in-a-row target) and the last paragraph mark (which is forbidden
from the find range).
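
Applied to the code in the original question, that single-pass replace might look like the sketch below. The pattern and replacement are taken verbatim from the paragraph above; `CollapseBlankParagraphs` is a made-up name, the `Word` alias is assumed to match the question's namespace, and none of this has been verified on Word 2000.

```vbnet
' Assumption: the primary interop assembly is referenced. If your interop
' assembly already exposes a Word namespace, drop this line.
Imports Word = Microsoft.Office.Interop.Word

Module BlankParagraphs
    ' Hypothetical helper, not part of the Word object model.
    Sub CollapseBlankParagraphs(doc As Word.Document)
        ' Keep one Range so the settings and Execute share a Find object.
        Dim rng As Word.Range = doc.Content
        With rng.Find
            .ClearFormatting()
            .Replacement.ClearFormatting()
            .Text = "(^013)(^013)@"      ' a mark followed by one or more marks
            .Replacement.Text = "\1"     ' keep only the first mark of each run
            .Forward = True
            .Wrap = Word.WdFindWrap.wdFindStop
            .Format = False
            .MatchWildcards = True       ' needed for ( ), @ and \1
            .Execute(Replace:=Word.WdReplace.wdReplaceAll)
        End With
    End Sub
End Module
```

In the question's code the call would be `CollapseBlankParagraphs(wdDatei)`, placed before `SaveAs` and after the ^013 to ^p clean-up described earlier.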


Steve Hudson

Word Heretic, Sydney, Australia
Tricky stuff with Word or words for you.
www.wordheretic.com
ABN: 86 453 419 554
"Qualified Good Tech Writer Dude"
Free Association of Words
Without prejudice



 
ralf

Thanks.
Another error was that I used Word 2000.
When I tried it on Word 2002 I could at least replace a run of spaces with
a single space and delete the Chr(11) characters, and the application
did not throw an exception, which it did with Word 2000.
I'll try your proposal for the ^p^p.
 
