Hi all - not sure if this is the appropriate place to seek help, but a Google
Groups search suggested it might be!
I remember doing a lot of this at university, but that was a good few years
ago now, and I haven't had any practical experience since I left. Reading
through my old course notes hasn't really helped, and I'm at a bit of an
impasse - if anyone can offer any suggestions, I'll be *very* grateful!
Scenario: I have a number of documents. I need to locate particular words in
each document and then record their positions. The data I'm working with
looks like this:
- Document_Number (an arbitrary, unique identifier)
- Document_Location (room/cabinet/shelf)
- Word (a given word appearing in the document)
- Word_Length (length of each identified word)
- Word_Frequency (number of times a given word appears in the document)
- Word_Offset (the position at which a given word appears in the document,
counting one word at a time from top-left to bottom-right and ignoring
punctuation, e.g. 'today' might appear 3 times, as the 20th word, the 43rd
word, and the 92nd word)
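To make the offset definition concrete, here is a minimal Python sketch of how I understand the counting. The function name word_offsets and the sample sentence are made up for illustration, and treating words case-insensitively is my assumption rather than a requirement.

```python
import re
from collections import defaultdict


def word_offsets(text):
    """Map each word to the list of 1-based positions at which it appears."""
    offsets = defaultdict(list)
    # Runs of letters, digits and apostrophes count as words;
    # punctuation and whitespace are skipped.
    words = re.findall(r"[A-Za-z0-9']+", text)
    for position, word in enumerate(words, start=1):
        # Assumption: 'Today' and 'today' are the same word.
        offsets[word.lower()].append(position)
    return dict(offsets)


# Hypothetical sample text, not taken from a real document.
sample = "Today is fine. Yesterday was not; today, though, is fine."
offsets = word_offsets(sample)

print(offsets["today"])        # [1, 7]
print(len(offsets["today"]))   # Word_Frequency for 'today': 2
print(len("today"))            # Word_Length for 'today': 5
```

So Word_Frequency is just the number of offsets recorded for a word in a document, and Word_Length depends only on the word itself.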
I keep ending up with three tables that look like this (have a feeling this
will wrap horribly...):
'documents'
    Document_Number    <--- Primary Key
    Document_Location

'occurrences'
    Document_Number    <--- Compound Key (references 'Document_Number' in 'documents')
    Word               <--- Compound Key (references 'Word' in 'words')
    Word_Frequency

'words'
    Word               <--- Primary Key
    Word_Length

...but I've got no idea how to deal with the 'Word_Offset's. Each word
might appear many times in a document, and therefore have many offsets, but
I just can't work out where to put the offset values, and it's driving me nuts!
I've been trying to work it out mechanically, by following the rules (for
1st, 2nd and 3rd NF) without really thinking about what the resulting table
structures imply about the data, so I'm not even sure that what I've got
makes any sense. To restate the relationships: each document may contain
many words, and each word may appear in many documents, and many times in
each document.
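For anyone who wants to poke at it, the three tables as I currently have them can be sketched with Python's built-in sqlite3 module. The column types, the NOT NULL constraints and the sample row values (including the location string) are my assumptions for illustration only. The sketch deliberately leaves the offsets out, since that is the part I can't place.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.executescript("""
CREATE TABLE documents (
    Document_Number   INTEGER PRIMARY KEY,
    Document_Location TEXT NOT NULL
);
CREATE TABLE words (
    Word        TEXT PRIMARY KEY,
    Word_Length INTEGER NOT NULL
);
CREATE TABLE occurrences (
    Document_Number INTEGER NOT NULL REFERENCES documents (Document_Number),
    Word            TEXT    NOT NULL REFERENCES words (Word),
    Word_Frequency  INTEGER NOT NULL,
    PRIMARY KEY (Document_Number, Word)   -- the compound key
);
""")

# Hypothetical sample data: 'today' appears 3 times in document 1.
conn.execute("INSERT INTO documents VALUES (1, 'room 2/cabinet 4/shelf 1')")
conn.execute("INSERT INTO words VALUES ('today', 5)")
conn.execute("INSERT INTO occurrences VALUES (1, 'today', 3)")
conn.commit()

row = conn.execute(
    "SELECT Word_Frequency FROM occurrences "
    "WHERE Document_Number = ? AND Word = ?",
    (1, "today"),
).fetchone()
print(row)  # (3,)
```

The single 'occurrences' row can say that 'today' appears 3 times in document 1, but it has nowhere to hold the three separate offsets (20, 43 and 92 in my earlier example).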
Any help very much appreciated - sorry for the long post, folks.
Kind regards
David