Finding word positions (start/end) in a Word document

Fernando Cabral

By doing

For i = 1 To ActiveDocument.Words.Count
    ActiveDocument.Words(i).Select
    word(i).start = ActiveDocument.Words(i).Start
    word(i).end = ActiveDocument.Words(i).End
Next i

I create an array with pointers to every word in a Word document.
Problem: SLLLLLLOOOOOWWWWWW. It takes forever even for a "small"
document with (say) 300 pages.
I can do the same by first copying the whole text into a variable and then
tokenizing it. Say:

Dim s As String
s = ActiveDocument.Content.Text
For i = 1 To Len(s)
    ' NextToken stands for my own tokenizer (pseudocode)
    word(i).start = NextToken(s).start
    word(i).end = NextToken(s).end
Next i

The second method is a hundred times faster.
Problems arise when the document is not "plain" text, that is, when it also
contains pictures, drawings, a TOC, etc.

In this case each non-textual element adds an additional offset in the first
method, but not in the second. As we move towards the end of the text
the offset increases as we pass by each non-textual element.

Question: is there a way for me to get how many objects there are in the
text, where they are, how many bytes they take?

- fernando
 
Klaus Linke

Hi Fernando,
> By doing
>
> For i = 1 To ActiveDocument.Words.Count
>     ActiveDocument.Words(i).Select
>     word(i).start = ActiveDocument.Words(i).Start
>     word(i).end = ActiveDocument.Words(i).End
> Next i
>
> I create an array with pointers to every word in a Word document.
> Problem: SLLLLLLOOOOOWWWWWW. It takes forever even for a "small"
> document with (say) 300 pages.

Dim myWord As Range
For Each myWord In ActiveDocument.Words
    ' Do something with myWord
Next myWord

would be quite a bit faster.
In your code above, Word has to locate Words(i) in each iteration by
counting words from the start, and that takes longer and longer.
There's also probably no reason to select anything... that only takes time.
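
To make that concrete, here is a minimal, untested sketch of the For Each loop filling an array with the start and end of every word. The WordPos type, the positions array and the CollectWordPositions name are made up for illustration; only Words, Start and End come from the Word object model.

```vba
Option Explicit

' Hypothetical record type for illustration (not from the original post)
Private Type WordPos
    StartPos As Long
    EndPos As Long
End Type

Sub CollectWordPositions()
    Dim positions() As WordPos
    Dim w As Range
    Dim i As Long

    ' Size the array once, up front
    ReDim positions(1 To ActiveDocument.Words.Count)

    ' Walk the collection sequentially instead of indexing Words(i),
    ' and do not select anything
    For Each w In ActiveDocument.Words
        i = i + 1
        If i > UBound(positions) Then Exit For   ' defensive guard
        positions(i).StartPos = w.Start
        positions(i).EndPos = w.End
    Next w

    Debug.Print i & " words visited"
End Sub
```

The positions stored this way are document positions, so they already account for pictures, fields and tables and can be passed straight to ActiveDocument.Range(StartPos, EndPos).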

> I can do the same by first copying the whole text into a variable and then
> tokenizing it. Say:
>
> Dim s As String
> s = ActiveDocument.Content.Text
> For i = 1 To Len(s)
>     word(i).start = NextToken(s).start
>     word(i).end = NextToken(s).end
> Next i
>
> The second method is a hundred times faster.
> Problems arise when the document is not "plain" text, that is, when it also
> contains pictures, drawings, a TOC, etc.
>
> In this case each non-textual element adds an additional offset in the
> first method, but not in the second. As we move towards the end of the text
> the offset increases as we pass by each non-textual element.
>
> Question: is there a way for me to get how many objects there are in the
> text, where they are, how many bytes they take?


In principle: yes. I think there's one extra character for each shape anchor
and each inline graphic/object, two characters for each table cell, and an
additional two for each table row. Then there is one character for each
field's opening brace and one for its closing brace, plus one between the
field code and the field result (both of which will be in the string).
It's doable, but not too easy. Depending on what you're doing, you might
check whether you have other options (say, working with Range.XML or with the
HTML code of the document to get at all the formatting and other stuff).

Regards,
Klaus
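
As a starting point for the "how many objects, and where" part of the question, the following untested sketch lists the elements mentioned above together with their document positions. The macro name is made up, and it only looks at the main text story (headers, footers, footnotes and text boxes are separate stories). It reports counts and positions only; the number of characters each element contributes should be checked against the estimates above rather than taken from this code.

```vba
Option Explicit

Sub ListNonTextElements()
    Dim doc As Document
    Dim ils As InlineShape
    Dim shp As Shape
    Dim fld As Field
    Dim tbl As Table

    Set doc = ActiveDocument

    ' Inline graphics/objects sit in the text flow
    Debug.Print "Inline shapes: " & doc.InlineShapes.Count
    For Each ils In doc.InlineShapes
        Debug.Print "  inline shape at " & ils.Range.Start
    Next ils

    ' Floating shapes are tied to the text through their anchor
    Debug.Print "Floating shapes: " & doc.Shapes.Count
    For Each shp In doc.Shapes
        Debug.Print "  anchor at " & shp.Anchor.Start
    Next shp

    ' Fields (a TOC is a field too): report where the field code sits
    Debug.Print "Fields: " & doc.Fields.Count
    For Each fld In doc.Fields
        Debug.Print "  field code at " & fld.Code.Start & "-" & fld.Code.End
    Next fld

    ' Tables: report extent plus row and cell counts
    Debug.Print "Tables: " & doc.Tables.Count
    For Each tbl In doc.Tables
        Debug.Print "  table at " & tbl.Range.Start & "-" & tbl.Range.End & _
                    ", rows: " & tbl.Rows.Count & _
                    ", cells: " & tbl.Range.Cells.Count
    Next tbl
End Sub
```

Run it from the VBA editor and read the output in the Immediate window; comparing those positions with the indices found in the tokenized string should show where the two numbering schemes drift apart.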
 
cradino

Dear Fernando Cabral,
Wherever you are, you will understand this text. Would you, out of
Lusophone solidarity, give me a little help in understanding what was
discussed in this conversation?
This is not an answer but a request for help.
I would ask you to "dress up" the macro so that I can use it and understand
what it does ... without too much work! Thank you.
Regards,
Arcindo Lucas

"Klaus Linke" wrote:
 
