Maciej Bliziński
Hello!
Let's say I have some ".doc" files that I want to extract words from. I
don't need a doc2txt utility, as I don't need any formatting extracted.
The only thing I need is the words, separated in some way (spaces, other
characters, whatever). The other thing is that I need those words
encoded in ISO-8859-2, as they contain Polish letters (like ó).
Everything has to be done on a Linux server, so I will have to parse the
files with my own utility. When I open a doc file in a text editor, I can
see lots of rubbish and the text, but the letters are separated by a
binary byte. It looks like this:
^@w^@o^@r^@d^@s^@ ^@a^@r^@e^@ ^@h^@e^@r^@e^@
I will need those letters put together if I'm going to extract
words from the file:
"words are here"
It doesn't matter if there is rubbish around the words:
"#$^#@$%&^@$words@#$%#$are@#$#@$%here#$%^"
            ^^^^^      ^^^       ^^^^
because this kind of output is just fine for me.
What I need to know is how to transform the binary doc file into a file
that contains the words in ISO-8859-2. The words will then be found with
the regular expression:
([a-zA-Z0-9ąćęłńóśżźĄĆĘŁŃÓŚŻŹ]{3,})
There are Polish letters between 9 and ].
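To show what I am after, here is a rough sketch, not a working solution.
It assumes (and I am not sure this holds for every file) that the zero
bytes between the letters mean the text is stored as UTF-16LE, starting
at an even offset. The function name extract_words is made up for
illustration, and the character class is the one from the regular
expression above, written with Unicode Polish letters.

```python
import re

# Letters, digits and Polish diacritics, three or more in a row.
WORD_RE = re.compile(r"[a-zA-Z0-9ąćęłńóśżźĄĆĘŁŃÓŚŻŹ]{3,}")


def extract_words(data):
    """Return the words found in raw file bytes, encoded as ISO-8859-2.

    Assumption: the text is UTF-16LE and starts at an even byte offset.
    Bytes that do not form valid UTF-16 are replaced, so the binary
    rubbish around the text is tolerated rather than rejected.
    """
    text = data.decode("utf-16-le", errors="replace")
    # Every character the pattern accepts exists in ISO-8859-2,
    # so encoding the matches cannot fail.
    return [word.encode("iso-8859-2") for word in WORD_RE.findall(text)]


# Usage with a hypothetical file name:
#     with open("example.doc", "rb") as f:
#         words = extract_words(f.read())
```

Text stored at an odd offset, or as single-byte characters, would not be
found by this sketch.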
The program will be written in Python on Linux. Any help will be greatly
appreciated.
Regards,
Maciej Bliziński