How do I discover repeating text portions across text files?

paddys

Suppose I have a number of text (.doc) files of limited size (say, not
more than 500 words each), and I have to discover whether any text
portions (i.e., a set of words, phrases, clauses, or entire sentences)
repeat across these files. In other words, it is a search, within a given
set of text files, for the files that share repeated text portions. The
challenge is to discover them intelligently, without any pre-specified
text portions. A simple 'find' is cumbersome and tedious, especially when
you have to search many text files.
 
Helmut Weber

Hi Paddys,

in my very humble opinion,
there is no intelligent way, just a brute force approach.

If you want to know which text parts appear where,
you have to set up a list of all text parts
and check all files for all items in the list.

That is, collect all words, phrases, clauses, and sentences
in a list without duplicates first, and then go on searching.

I did something similar once for words.
Google for "Corpus Linguistics".

One could, of course, when setting up the list,
remember which file a new item comes from
and exclude this file from the search for that item,
but whether this exchange of simplicity for speed
is worth the effort, I don't know.
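
The brute-force idea above can be sketched roughly as follows. This is only an illustration in Python, not Word VBA, and it assumes the text of each file has already been extracted into a plain string. The names `word_ngrams` and `shared_portions` are made up for this sketch, and a "text part" is assumed to be any run of n consecutive words.

```python
from collections import defaultdict
import re

def word_ngrams(text, n):
    """Yield every run of n consecutive words, lower-cased."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def shared_portions(texts, n=3):
    """Map each n-word portion to the set of files containing it,
    keeping only portions that occur in two or more files."""
    index = defaultdict(set)
    for name, text in texts.items():
        for gram in word_ngrams(text, n):
            index[gram].add(name)
    return {gram: files for gram, files in index.items() if len(files) > 1}

# Hypothetical sample data; real input would come from the .doc files.
texts = {
    "A": "The cat sat on the mat. The dog sat on the cat.",
    "B": "The rat ate the cat.",
}
print(shared_portions(texts, n=2))
```

Because every portion is stored together with the files it came from, one pass over the files is enough and no file has to be searched a second time.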

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
Richard Relpht

I would create a database table (or a plain old CSV file or a tab separated
file if you have semicolons in the data!) with three fields, named
1. "PhraseClauseWord"
2. "FromFileName" - and -
3. "Counter"

VBScript can do that with a text file.
And if you can handle VBA, then you can handle VBScript.

So....
Then open each file and
scan through it once looking for phrases,
output (append) the results to the database,
entering, for each phrase, the phrase, the FromFileName and a counter set at
1.

Then do the same for clauses, then the same for words, appending the data
to the database table. (I don't know how you would define a clause ...)

Do that for each file.
When that's finished, you have everything in the database table.
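
As a rough sketch of the steps above (in Python rather than VBScript, and assuming each file's text is already available as a plain string), the table could be built like this. The function names `rows_for_file` and `write_table` are hypothetical, the sentence rule is deliberately crude, and clauses are left out because no definition of a clause is given.

```python
import csv
import io
import re

def rows_for_file(name, text):
    """Return (PhraseWord, File, Ctr) rows: one per word, then one per sentence."""
    rows = [(word, name, 1) for word in re.findall(r"[A-Za-z']+", text)]
    # Crude sentence rule (an assumption): text up to the next '.', '!' or '?'.
    rows += [(s.strip(), name, 1) for s in re.findall(r"[^.!?]+[.!?]", text)]
    return rows

def write_table(texts, out):
    """Write the rows for every file to a tab-separated table; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["PhraseWord", "File", "Ctr"])
    total = 0
    for name, text in texts.items():
        rows = rows_for_file(name, text)
        writer.writerows(rows)
        total += len(rows)
    return total

# Hypothetical sample data; real input would come from the .doc files.
texts = {
    "A": "The cat sat on the mat. The dog sat on the cat.",
    "B": "The rat ate the cat.",
}
buffer = io.StringIO()  # stands in for a real text file on disk
row_count = write_table(texts, buffer)
print(buffer.getvalue())
```

A tab is used as the separator so that commas or semicolons inside a sentence do not break the columns.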

Then you look at this data through an Excel pivot table.
Using the external data option.

If File A contains :
The cat sat on the mat.
The dog sat on the cat.


and
File B contains
The rat ate the cat.


Then your table will look like this:

PhraseWord               File  Ctr
The                      A     1
cat                      A     1
sat                      A     1
on                       A     1
the                      A     1
mat                      A     1
The                      A     1
dog                      A     1
sat                      A     1
on                       A     1
the                      A     1
cat                      A     1
The                      B     1
rat                      B     1
ate                      B     1
the                      B     1
cat                      B     1
The rat ate the cat.     B     1
The cat sat on the mat.  A     1
The dog sat on the cat.  A     1


So your pivot table can look like this
File     PhraseWord               Total
A        cat                      2
         dog                      1
         mat                      1
         on                       2
         sat                      2
         The                      4
         The cat sat on the mat.  1
         The dog sat on the cat.  1
Total A                           14
B        ate                      1
         cat                      1
         rat                      1
         The                      2
         The rat ate the cat.     1
Total B                           6
Total                             20


or this
PhraseWord               A    B    Total
ate                           1    1
cat                      2    1    3
dog                      1         1
mat                      1         1
on                       2         2
rat                           1    1
sat                      2         2
The                      4    2    6
The rat ate the cat.          1    1
The cat sat on the mat.  1         1
The dog sat on the cat.  1         1
Total                    14   6    20



or this
PhraseWord               A    B    Total
The                      4    2    6
cat                      2    1    3
on                       2         2
sat                      2         2
ate                           1    1
The rat ate the cat.          1    1
rat                           1    1
The cat sat on the mat.  1         1
mat                      1         1
dog                      1         1
The dog sat on the cat.  1         1
Total                    14   6    20


which is the same thing, only sorted from most to least frequent,
so that items that appear only once (i.e. unique occurrences)
are at the bottom of the table.
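
If Excel is not at hand, the same kind of cross-tabulation can be approximated in a few lines of code. The following is only a sketch in Python; `items` and `crosstab` are hypothetical names, the file texts are assumed to be plain strings already, and upper and lower case are merged on the assumption that "The" and "the" should be counted together, as in the tables above.

```python
from collections import Counter
import re

def items(text):
    """Return the words of a text followed by its sentences (crude rules)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
    return words + sentences

def crosstab(texts):
    """Return (item, {file: count}) pairs, most frequent item first."""
    counts = Counter()
    for name, text in texts.items():
        for item in items(text):
            counts[(item.lower(), name)] += 1
    table = {}
    for (item, name), n in counts.items():
        table.setdefault(item, {})[name] = n
    # Sort by grand total (descending), then alphabetically for ties.
    return sorted(table.items(), key=lambda kv: (-sum(kv[1].values()), kv[0]))

# Hypothetical sample data; real input would come from the .doc files.
texts = {
    "A": "The cat sat on the mat. The dog sat on the cat.",
    "B": "The rat ate the cat.",
}
for item, per_file in crosstab(texts):
    print(item, per_file, sum(per_file.values()))
```

Items that have a count in more than one file column are the ones that repeat across files.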

etc, etc.
This will probably be illegible once posted, due to plain-text hassles in
newsgroups, but if you want a private mail, just ask in the newsgroup.

HTH
Richard.
 
paddys

Thanks, Weber.
I have an idea, but I am not sure how to create and validate the code,
whether as part of a custom Word program or as a command-line program.
Please advise whether and how it could work.

1. First, accept the directory or folder containing the text files; assume
[or validate] that all of them are .doc files.
2. Also prompt for and accept the initial search string, subject to a limit
of, say, 50 words; the default could be the words contained in the first two
lines (identified by end-of-line marks) of the first file. Store this in
StringA.
3. Build an array or table of three dimensions, viz., name of file, number
of words, and number of lines, either by reading the files one by one or by
getting 'property values'. I have no clue how to do this part, but believe
there must be a way. Call this Array1. Also create an empty Array2, of two
dimensions, to hold the names of the 'original file' and the 'copy file'.
4. Now begin a two-level nested loop over ALL files. Beginning with StringA
of file1, compare it with StringB of file2 (formed first as per the default
value, and then replaced by the next set). If they match, then file2
contains at least one repeated portion; note this in Array2 and exclude
file2 from further comparison. As in a typical sort procedure, the inner
loop compares every 'source string' of file1 with the 'target strings' of
every other file (ranging from 2 to n, or fewer, depending on the matches
that have occurred), while the outer loop builds and completes Array2,
which contains the answers to my search.
5. I know the core code works in simple Basic or Visual FoxPro, but I do not
know how to embed it as a feature in Word.
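
The nested loop in step 4 could look something like the sketch below. It is written in Python purely for illustration, not as a Word feature, and it makes several assumptions: the text of each file is already a plain string, a 'string' is a sliding window of `size` consecutive words, and a single shared window is enough to call the later file a copy. The names `windows` and `find_copies` are invented for the example.

```python
import re

def windows(text, size):
    """Return the set of all runs of `size` consecutive words, lower-cased."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def find_copies(texts, size=4):
    """Return (original, copy) pairs, playing the role of Array2.
    A file flagged as a copy is excluded from further comparison."""
    names = list(texts)
    win = {name: windows(texts[name], size) for name in names}
    flagged = set()
    pairs = []
    for i, original in enumerate(names):        # outer loop: source file
        if original in flagged:
            continue
        for candidate in names[i + 1:]:         # inner loop: target files
            if candidate in flagged:
                continue
            if win[original] & win[candidate]:  # at least one shared window
                pairs.append((original, candidate))
                flagged.add(candidate)
    return pairs

# Hypothetical sample data; real input would come from the .doc files.
texts = {
    "file1": "the quick brown fox jumps over the lazy dog",
    "file2": "yesterday the quick brown fox ran away",
    "file3": "nothing in common here at all",
}
print(find_copies(texts, size=4))
```

Comparing sets of windows avoids having to prompt for an initial search string: every window of every file acts as a candidate StringA.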

Could you help, please? Thanks again.
Paddys
 
paddys

Thanks, Richard.
1. Opening and checking each file manually could be tedious and also
prone to errors and omissions; I have some ideas about doing the whole
thing as a 'batch' process. Please read my post to Weber.

I would welcome any help in coding a custom solution to be made part of my
Word program. Thanks.

paddys.
 
Helmut Weber

Hi Paddys,

Sorry, but this is asking for too much at once.
Split it all up into several questions and
ask for help with each in turn in the groups.

1. First, accept the directory or folder containing the text files; assume
[or validate] all of them are dot doc files.

As a start, see:
http://word.mvps.org/faqs/macrosvba/BatchFR.htm
http://www.gmayor.com/batch_replace.htm

Also, tell us which Word version you are talking about.
Furthermore,
.... it would be rather unusual to process dot-files that way.
.... what do you mean by "accept the directory"?
Probably getting the name of the directory into your code somehow.
Is the program just for you?
You could type the name of the directory into an input box.
You could use one of several controls that let you pick a directory
with the mouse, but you'll need a userform for that.
.... What should happen if not all files in a directory are doc-files?

To get a list of all docs in "c:\test\word"
into a text-file,
you may use in the command shell:
c:\test\word\>dir *.doc /b > c:\dir.txt

There are other ways, and the doc-files may also be
spread over several subdirectories.
Then ...
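
For the subdirectory case, one possible way to get the list is sketched below in Python, outside of Word. It only collects file names and does not open the documents. The function name `list_doc_files` is hypothetical, and the folder is the example path used above.

```python
from pathlib import Path

def list_doc_files(root):
    """Return all *.doc files under root, including subdirectories, sorted."""
    return sorted(p for p in Path(root).rglob("*.doc") if p.is_file())

if __name__ == "__main__":
    # Example folder from the post above; adjust to your own directory.
    for path in list_doc_files(r"c:\test\word"):
        print(path)
```

Files with other extensions are simply skipped, which is one answer to the question of what to do when not all files in a directory are doc-files.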

Still, Word's definition of "word" and "sentence"
is different from the fuzzy human concept of them.

"Clause" and "phrase" Word doesn't know at all.

Sorry, it ain't that easy.

....

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
Jeff Mathewson

This wouldn't be too hard to do.

All you need to do is create an array of sentences, for example
arySentences(Sentence, counter). From there, use a For Each loop to go
through all the sentences (For Each oSen In ActiveDocument.Sentences). Add
any new sentence you find to the array; for any duplicate sentence,
increment its counter in the array.

That's just the basics, but once you play around with it, it shouldn't be
that hard. I have such a macro that goes one level up, by Words, and
collects the vocabulary of the document(s). So it can be done.
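
The sentence-counter idea can be tried out in a few lines before writing the macro. The sketch below is in Python, not VBA, and does not use Word's Sentences collection; it assumes each document is already a plain string and splits sentences with a simple rule (whitespace after '.', '!' or '?'). The function names are made up for the example.

```python
from collections import Counter
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a crude rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def count_sentences(documents):
    """Count how often each sentence occurs over all documents."""
    counts = Counter()
    for text in documents:
        counts.update(split_sentences(text))
    return counts

def duplicates(counts):
    """Keep only the sentences that occur more than once."""
    return {sentence: n for sentence, n in counts.items() if n > 1}

# Hypothetical sample documents.
docs = [
    "The cat sat on the mat. The dog barked.",
    "The dog barked. It rained.",
]
print(duplicates(count_sentences(docs)))
```

The counter plays the role of the two-column array: the key is the sentence and the value is how many times it has been seen.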
 
