Finding word position (start/end) in a word document (II)

Fernando Cabral

About my previous question (finding word position (start/end) in a word
document),
a very simple solution would be an assignment that could replace the one
I am using. That is, instead of

s = ActiveDocument.Content.Text ' which does not work

something like

s = ActiveDocument.Content.TextWithEnbeddedStuff

If such a property/method exists, then my program would work
with this single modification.

Thank you

- fernando
 
Dave Lett

Hi Fernando,

You might be able to use something like the following, which is more
efficient:

Dim lStart As Long
Dim lEnd As Long
Dim oWrd As Range
' Each item in Words is a Range, so Start/End give its position directly
For Each oWrd In ActiveDocument.Words
    lStart = oWrd.Start
    lEnd = oWrd.End
Next oWrd

HTH,
Dave
 
Fernando Cabral

Dave

Your solution worked much better than mine. Time was down from about
one hour to 1 minute, 3 seconds, 141 thousandths of a second.
Nevertheless, it is still too slow for any practical ("real-time") application.
Also, it still compares very poorly with my original solution, which
does the same thing in about 4 seconds. (But it can't be used
as a generic solution because it fails when the text has embedded images.)

There must be a faster solution. Otherwise spellchecker/find/replace
and the like would be too slow to be acceptable.

But your solution was very useful anyway. On one hand, I can
use it for other applications. On the other hand, if I can't
find an alternative, I can still have a solution that works during lunch time.
That is, I start it before lunch and get the results after lunch :)

Thank you.

- fernando
 
Helmut Weber

Hi Fernando

It seems (!) that every object other than
an ordinary text object (word, character, etc ...)
in the selection or in a range is represented by Chr(1).

Like:
MsgBox InStr(Selection.Range.Text, Chr(1))

which would return x, the position of
an inline shape in the range, for example.

I don't know whether this can help you.

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
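
A single InStr call only reports the first Chr(1). As a minimal sketch, assuming you want every marker in a string, the search can be repeated from the position after each hit. The function name FindObjectMarkers is hypothetical and not part of Word's object model. Note that InStr positions are 1-based, while Range.Start is 0-based.

```vba
' Hypothetical helper: collect the 1-based position of every Chr(1)
' (object placeholder) found in sText.
Function FindObjectMarkers(ByVal sText As String) As Collection
    Dim colPos As New Collection
    Dim lPos As Long

    lPos = InStr(1, sText, Chr(1))
    Do While lPos > 0
        colPos.Add lPos
        ' Resume the search just past the marker that was found
        lPos = InStr(lPos + 1, sText, Chr(1))
    Loop

    Set FindObjectMarkers = colPos
End Function
```

It could be called as FindObjectMarkers(Selection.Range.Text), for example, and the resulting collection walked with For Each.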
 
Fernando Cabral

Helmut and Klaus

I put your suggestions (as well as Dave's) to good use.
They helped me shrink processing time from about one hour
to about 1 minute. That's still too much, but much closer to
what I need. (I need something that does the whole thing in
less than one minute. This leaves about 4 or 5 seconds for
the word-collecting phase.)

Now, to the specifics. Helmut wrote:

"it seems (!), that every object other than
an ordinary text object (word, character, etc ...)
in the selection or in a range is represented by chr(1)."

I found 28 positions for a drawing. See the dump below:

0,27:^A
27,28:^M
28,29:^M
29,41:pRESIDÊNCIA
41,44:DA
44,53:REPÚBLICA
53,54:^M

^A is Chr(1). It takes 28 positions (from 0 to 27).
Also, each word occupies two positions more. See PRESIDÊNCIA. It is
11 characters long, but goes from 29 to 41. I still can't understand why.

See a dump from the TOC:

1950,1972:Apresentação
1972,1973:^I
1973,2006:VIII
2006,2007:^M
2007,2014:Sinais
2014,2016:e
2016,2029:Abreviaturas
2029,2039:Empregados
2039,2040:^I
2040,2071:IX

VIII and IX are page numbers. They occupy 31 positions. Nevertheless,
I can't see where those positions go. The same happens if the page number is in Arabic numerals,
as seen for the positions consumed by 2 (2111,2141):

2089,2102:COMUNICAÇÕES
2102,2110:OFICIAIS
2110,2111:^I
2111,2141:2
2141,2142:^M
2142,2151:CAPÍTULO


Still, the solution you guys suggested takes care of those issues. It works
even if I can't understand why those gaps are there.

If I can't find a faster way to do it, I'll have to make do with
a few minutes instead of the expected few seconds. :-(

Thank you guys. (if you still have ideas I can try, keep sending them)

- fernando
 
Fernando Cabral

Here's how I solved the problem:

Where I had

s = ActiveDocument.Content.Text

I now have

GetTheWholeThing s

The sub being:

Sub GetTheWholeThing(ByRef s As String)
    Dim oWrd As Word.Range

    ' Pre-fill s with spaces, one per character position in the document
    s = Space(ActiveDocument.Characters.Last.End)

    ' Copy each word to its own position (Start is 0-based, Mid is 1-based)
    For Each oWrd In ActiveDocument.Words
        Mid(s, oWrd.Start + 1) = oWrd.Text
    Next oWrd
End Sub

It works. The document that gave me a headache (the largest
and most complex one) is now being processed in 65 seconds.
Barely acceptable. On the bright side, the whole program
(more than 2000 lines) is still working as it used to when the
documents didn't have images, a TOC and other things that
introduce "phantom" spaces that you cannot reach using
ActiveDocument.Content.Text...

I still hope for some improvement...

Thank you all, guys.

- fernando
 
Klaus Linke

Yes, the Chr(1) characters are objects.
The ^M are paragraph marks, Chr(13), and the ^I are tabs, Chr(9).

The page numbers are probably fields. If you turn on field codes in Tools >
Options > View (or Alt+F9), you can see the field code.
The opening field brace is Chr(19) , the closing brace Chr(21) .

If you read the stuff into a string, you'll get either the field code (with
the braces), or the result, depending on whether you view field codes or
not.
If you go through the ranges, you'll see a range of length 1 for the opening
brace, then the field code, a range of length 1 for the field separator
(Chr(20)), the field result, followed by another range of length 1 for the
closing field brace.

The TOC is a field too, and may contain additional fields (say if you have
hyperlinks back to the headings).

Greetings,
Klaus
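
To see where the fields sit, the code and result ranges of each field can be printed. This is only a sketch: the sub name DumpFieldPositions is hypothetical, ActiveDocument.Fields is assumed to cover the part of the document you care about, and the error guard around Result is a precaution for fields that have no result, not documented behavior.

```vba
' Hypothetical helper: print type, code range and result range of each field.
Sub DumpFieldPositions()
    Dim oFld As Field
    Dim lResStart As Long, lResEnd As Long

    For Each oFld In ActiveDocument.Fields
        ' -1 means "no result range could be read" (assumed convention)
        lResStart = -1: lResEnd = -1
        On Error Resume Next
        lResStart = oFld.Result.Start
        lResEnd = oFld.Result.End
        On Error GoTo 0

        Debug.Print oFld.Type; oFld.Code.Start; oFld.Code.End; lResStart; lResEnd
    Next oFld
End Sub
```

The output goes to the Immediate window of the VBA editor and can be compared against a word-by-word dump of Start and End positions.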
 
Klaus Linke

You could speed it up more, but it would take quite a bit of work.

If you know where the fields and object anchors are, you can "map" s1 =
ActiveDocument.Content.Text to the string you want to build (which is as
"long" as the range).
Then you don't need to get and place the words individually (Mid(s,
oWrd.Start + 1) = oWrd.Text), which is what takes so long. Instead you could
build the new string completely with string functions from s1, which is
much, much quicker.

For that, you need to loop through the fields and shapes and collect their
positions (which takes a bit of time, but is usually pretty quick).

Why didn't I mention tables? End-of-cell markers and end-of-row markers take
two characters in the string, but only a range of length one each. But you
can remove that problem by simply replacing Chr(13) & Chr(7) with some
character of your choice.

One problem you may run into is that there are lots of different kinds of
fields. Some have no Result, only Code, so you will have to treat those
differently.

It took me about a day to work it out, and I haven't really debugged the
code thoroughly. If you want, I can mail you what I have (... no promise
that it'll work reliably: You might be better off writing it yourself
instead of debugging mine).
Send me a mail if you are interested.

Regards,
Klaus
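
The table-marker replacement described above can be sketched as a single Replace call. The function name CollapseCellMarkers and the choice of Chr(13) as the substitute character are assumptions made for illustration; any single character would keep the string aligned with the ranges in the same way.

```vba
' Hypothetical helper: turn each two-character cell/row marker
' (Chr(13) & Chr(7)) into one character, so that the marker takes
' one string position, matching its range length of one.
Function CollapseCellMarkers(ByVal sText As String) As String
    CollapseCellMarkers = Replace(sText, Chr(13) & Chr(7), Chr(13))
End Function
```

It could be applied as s1 = CollapseCellMarkers(ActiveDocument.Content.Text) before any further mapping of fields and object anchors.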
 
