Bullets key code

Adnan Hebibovic

Hi professionals

I am using C# to access the text in a Microsoft Word document. Parsing the
text is no problem, but how do I avoid special signs such as bullets?

How can I detect bullets? Is there some kind of key code I can use to skip
those signs? I have to select all the text and get it using the WholeStory
method, but the text may contain bullets.

Thanks in advance

Adnan
 
Jim Vierra

You should use the Document object to parse text. It has a model that lets you get at each of the elements without dealing with the formatting issues. See Word VBA Help.

The Document object lets you easily navigate around the formatting and even alter the formatting on whatever object or level you want. If you set a reference to the Word library you will get all of the constants and be able to browse the library and help.

The sample below is in VB, but it works the same in C#; just change the syntax. If you use the Characters collection you can test each character against the printable range: A-Z, a-z, 0-9 and the punctuation list. Look at an ASCII chart to see the Chr value ranges for the characters you want. Most printables are in the lower 7 bits of the 8-bit character code. The high and low ends are special characters like control characters. See http://www.lookuptables.com/ You will need to be sensitive to Unicode (wide) characters. Use ChrB to convert.

char c = 'D';
// Printable 7-bit ASCII runs from 32 (space) to 126 (~).
if (c >= 32 && c <= 126)
{
    System.Console.Write("It's Printable");
}


Private Sub ParseDoc(doc As Word.Document)
    Dim strSentence As String
    Dim n As Long
    Dim s2 As String

    strSentence = doc.Characters(1).Sentences(1).Text
    n = doc.Characters(1).Sentences(1).Characters.Count
    s2 = Left(strSentence, n - 2) 'without CrLf
    MsgBox s2

End Sub
 
Adnan Hebibovic

Thanks
 
Klaus Linke

Looping the characters is terribly slow, and probably not necessary for the job.
Getting the text into a string (strDoc=ActiveDocument.Content.Text -- and
perhaps look at the other StoryRanges, too) should be enough.

If you don't have tables and fields in the document, you should even be able to
get at some character once you have determined its position in the string: Its
Range.End is the same as its position in the string.

What codes to look out for depends on what you mean by "bullets and special
symbols".
Even old fonts contain a lot more than only letters and digits (like punctuation
characters, currency signs, some fractions, ...).

You can usually assume that fonts that originated on the Mac contain at least
all characters from
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/Roman.TXT
while fonts that originated in Windows have at least all characters in
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

If you are especially interested in bullets from symbol fonts like "Symbol" or
"Wingdings", look for characters with codes between &HF000 and &HF0FF.
Word uses that code page (in a "private use" area of the Unicode standard) for
all such fonts.
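
As a rough illustration of that range check, here is a minimal C# sketch. The
class and method names (SymbolFilter, IsSymbolFontChar, StripSymbolFontChars)
and the sample string are made up for the example. It only works on a string
that has already been pulled out of the document and does not touch the Word
object model.

```csharp
using System;
using System.Text;

static class SymbolFilter
{
    // True when the character lies in U+F000..U+F0FF, the private-use
    // range described above for symbol fonts.
    public static bool IsSymbolFontChar(char c)
    {
        return c >= '\uF000' && c <= '\uF0FF';
    }

    // Returns a copy of the text with all characters in that range removed.
    public static string StripSymbolFontChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (!IsSymbolFontChar(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Hypothetical sample: one symbol-font character followed by text.
        string sample = "\uF0B7 First item";
        Console.WriteLine(StripSymbolFontChars(sample));
    }
}
```

Ordinary Unicode bullets such as U+2022 fall outside this range, so they pass
through untouched and would need their own check.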

Greetings,
Klaus
 
Jim Vierra

If you walk the story (or stories), then the paragraphs, and parse the chars,
you will not get any objects or formatting - only odd characters typed in
outside of the formatting. If you are just trying to extract the text of a
document, then save it to a text file and parse the text the same way
using the filter I sent you.
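
A minimal C# sketch of running such a filter over a whole string of exported
text might look like the following. The names TextFilter and
KeepPrintableAscii are invented for the example, and the choice to keep codes
32 to 126 plus tab, CR and LF is an assumption about what counts as
"printable" here. Note that it also drops accented letters and anything else
outside 7-bit ASCII.

```csharp
using System;
using System.Text;

static class TextFilter
{
    // Keeps printable 7-bit ASCII (32..126) plus tab, CR and LF;
    // every other character is dropped.
    public static string KeepPrintableAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            bool printable = c >= ' ' && c <= '~';
            bool lineControl = c == '\t' || c == '\r' || c == '\n';
            if (printable || lineControl)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Hypothetical exported text with hand-typed bullet characters.
        string raw = "\u2022 First item\r\n\u2022 Second item\r\n";
        Console.Write(KeepPrintableAscii(raw));
    }
}
```

Building the result in a StringBuilder handles the text in one pass, which
avoids going back to Word for each character.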
 
Klaus Linke

A lot depends on what kinds of objects Adnan's documents contain, how large they
are, and what exactly he wants to do.

If the documents are large, accessing the text through the Characters
collection should IMO be a last resort when everything else has failed (and
there are a lot more options than have been mentioned), since it will be very
slow. I wanted to mention one of the other options in case speed is an
important objective.

Regards,
Klaus
 
Jim Vierra

Klaus - I agree. I would do it with a C program and go after the character
stream directly if I needed speed. I have done this in the past, but it takes
a while to set up C to skip all of the formatting and other character
streams. When I want text I just dump to a text file, stream through it and
clean it up. Of course all of the formatting gets lost, but I can extract
the text. Another way, if you need formatting, is to convert to plain HTML
and use an XSLT on it. That's a little better sometimes, as it goes after
objects instead of characters, but it can't easily get at junk remaining
in the text like hand-inserted bullet chars. By using C and the Word DOM
you can speed up the cleanup considerably. We tried it with VB and it
was horribly slow. I ran the same concept in C and it was like lightning.
The Word object is pretty efficient, but VBA and VB6 don't deal with it well.
I think the better answer is to make people learn how to use Word without
embedding characters. Then all those funny things stay in the formatting
and out of the word lists.
 
Klaus Linke

Hi Jim,

I agree... and plan to do more using WordprocessingML in the future. It's hard
to learn, and not a very "nice" XML for further processing (no proper start and
end tags for formatting), but processing it is sure to be much faster than
anything you can achieve with Word VBA and the object model.

Greetings,
Klaus
 
Jim Vierra

I would use the XML Document API. MSXML 5.0 is out and should be pretty
efficient. Also, we all have to get more proficient with XML anyway. It
will hopefully reduce the number of APIs that have to be used.

Still think XSL is the ugliest thing on the planet.

--
Jim Vierra