Fernando Cabral
My documents have accented characters. I found an inconsistency that
completely destroyed my performance. On my machine the letter
"a" (lowercase 'a') comes before "à" (lowercase 'a' with a grave accent)
when sorted. That's how it should work. Fine.
Nevertheless, when I sort "a tarde" against "à tarde" the positions
are reversed! For those of you who are not familiar with accents
(or perhaps can't see them on your display), it is like having
"a"
"b"
but
"b c"
"a c"
That is, the collating sequence for "à" changes place depending on
the character that follows it!
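The property being assumed here can be written down as a small check. The sketch below is in Python rather than VBA, purely for illustration, and it uses plain code-point comparison (Python's default for strings). That is an assumption for the example and not what Word's sort does. The helper name order_is_stable is made up.

```python
def order_is_stable(x, y, suffix):
    """True if appending the same suffix to x and y keeps their relative order.

    Uses Python's default string comparison (by Unicode code point),
    which is an assumption for this sketch, not Word's collation.
    """
    return (x < y) == (x + suffix < y + suffix)


# "a" is U+0061 and "à" is U+00E0, so "a" sorts first by code point,
# and the order is the same once " tarde" is appended to both.
print(order_is_stable("a", "à", " tarde"))
```

A comparison with this property is what the skipping strategy further down relies on. In VBA, StrComp with vbBinaryCompare is the closest equivalent of a code-point comparison.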
Now for the practical problem.
I have two lists. One consists of single words. It is usually huge,
like hundreds of elements. The other may have either words or
sentences. This last one may be as small as a single word or as big
as several thousand words/sentences.
My mission is to find in the first list every occurrence of words and
sentences that are in the second list.
Example: if the first list has a, b, c, d, e and the second list contains b, d,
then I'll have to find them (b and d).
Now, if the first list has 100000 elements and the second one has 10000,
this entails 100000 x 10000 = 1,000,000,000 comparisons. For strings
sometimes as long as 50 characters, this takes a long time to complete.
Now, the simplest way to improve this is "shortening" the second list
at each pass. Since both lists are sorted, I should be able to say: well,
next word from the first list begins with letter "d". This means from now
on I can skip all the elements in the second list that are less than "d".
Additionally I can stop comparing as soon as the word in the first
list is bigger than the last word in the second list.
So, instead of having to do 1,000,000,000 comparisons, I may be
able to make do with 100,000 or even fewer. More realistically,
perhaps 500,000.
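The skipping idea above can be sketched as a single walk over both sorted lists. This is a minimal illustration in Python, not the VBA code in question, and it assumes both lists were sorted with the very same comparison the walk uses (here, code-point order). The function name find_common and the step counter are hypothetical.

```python
def find_common(first, second):
    """Walk two sorted lists once and collect entries of `first` found in `second`.

    Assumes both lists are sorted with the same comparison used here
    (Python's default code-point order). Returns (matches, steps), where
    steps counts loop iterations and never exceeds len(first) + len(second).
    """
    matches = []
    steps = 0
    i = j = 0
    while i < len(first) and j < len(second):
        steps += 1
        if first[i] == second[j]:
            matches.append(first[i])
            i += 1          # keep j, so repeated words in `first` match again
        elif first[i] < second[j]:
            i += 1          # this word cannot appear later in `second`
        else:
            j += 1          # skip entries of `second` that are already too small
    return matches, steps


matches, steps = find_common(["a", "b", "c", "d", "e"], ["b", "d"])
print(matches, steps)   # ['b', 'd'] 6
```

With 100000 and 10000 elements the walk takes at most 110,000 steps, which matches the estimate above. It only works if the sort order and the comparison inside the loop agree.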
It works fine, as long as I don't have words that begin either
with an "a" or with an "à". Alas! I can't expect that to happen in any
real text.
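One possible way around this, assuming the lists can be re-sorted before the search, is to check each list against the comparison the search uses and re-sort it when the check fails. The sketch below is Python for illustration only. The list word_sorted is a hypothetical stand-in for the order described above, and is_sorted is a made-up helper.

```python
def is_sorted(items):
    """True if items are in non-decreasing order by code point."""
    return all(items[k] <= items[k + 1] for k in range(len(items) - 1))


# Hypothetical order, standing in for the reversed result described above.
word_sorted = ["à tarde", "a tarde", "b"]

if not is_sorted(word_sorted):
    # Re-sort with the same comparison the search loop will use.
    fixed = sorted(word_sorted)
else:
    fixed = word_sorted

print(fixed)   # ['a tarde', 'b', 'à tarde']
```

The resulting order is not alphabetical (accented letters land after "z"), but it is consistent, and consistency is all the skipping logic needs.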
For the time being I am stuck with the inefficient solution.
Question: is there a way to change that behaviour, that is,
change the collating sequence from VBA, Word or perhaps
wearing the Windows XP administrator's hat?
- fernando