Collating sequence: feature or bug?

Fernando Cabral · Jul 16, 2006

My documents have accented characters. I found an inconsistency that
completely destroyed my performance. On my machine the letter
"a" (lowercase 'a') comes before "à" (lowercase 'a' with a grave accent)
when sorted. That's how it should work. Fine.

Nevertheless, when I sort "a tarde" against "à tarde" the positions
are reversed! For those of you who are not familiar with accents
(or perhaps can't see them on your display), it is like having

"a"
"b"

but

"b c"
"a c"

That is, the collating sequence for "à" changes place depending on
the character that follows it!
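
One way to narrow this down from VBA is to compare the same pairs with StrComp under both comparison modes. This is only a minimal sketch; the procedure name CheckCollation is made up for illustration. Binary mode compares character codes, so its result cannot depend on what follows the first letter. Text mode follows the Windows locale, so no results are shown for it here.

```vba
Option Explicit

Sub CheckCollation()
    Dim aGrave As String
    aGrave = ChrW(&HE0)   ' lowercase a with grave accent (code 224)

    ' Binary mode compares character codes: "a" is 97, a-grave is 224.
    Debug.Print StrComp("a", aGrave, vbBinaryCompare)                   ' -1
    Debug.Print StrComp("a tarde", aGrave & " tarde", vbBinaryCompare)  ' -1

    ' Text mode follows the system locale; results may differ by machine.
    Debug.Print StrComp("a", aGrave, vbTextCompare)
    Debug.Print StrComp("a tarde", aGrave & " tarde", vbTextCompare)
End Sub
```

If the two text-mode lines disagree in sign, the reversal comes from the locale-aware comparison rather than from the list-handling code.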

Now for the practical problem.

I have two lists. One is made up of single words. It is usually huge,
like hundreds of elements. The other may have either words or
sentences. This last one may be as small as a single word or as big
as several thousand words/sentences.

My mission is to find in the first list every occurrence of words and
sentences that are in the second list.

Example: if the first list has a, b, c, d, e and the second list contains b, d,
then I'll have to find them (b and d).

Now, if the first list has 100000 elements and the second one has 10000,
this entails 100000 x 10000 = 1,000,000,000 comparisons. For strings
sometimes as long as 50 characters, this takes a long time to complete.

Now, the simplest way to improve this is "shortening" the second list
at each pass. Since both lists are sorted, I should be able to say: well,
next word from the first list begins with letter "d". This means from now
on I can skip all the elements in the second list that are less than "d".

Additionally I can stop comparing as soon as the word in the first
list is bigger than the last word in the second list.

So, instead of having to do 1,000,000,000 comparisons, I may be
able to make do with 100,000 or even fewer. More realistically,
perhaps 500,000.
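
The skipping idea described above amounts to a merge-style walk over two sorted lists. Here is a minimal sketch, assuming both lists are held in String arrays and both were sorted under the same ordering that StrComp uses for the comparison (binary here). The names SortedMatches and DemoSortedMatches are invented for illustration. Each step advances one of the two indexes, so the number of comparisons is at most the sum of the two list lengths (110,000 for the figures above).

```vba
Option Explicit

' Returns every entry of firstList that also occurs in secondList.
' Assumption: both arrays are sorted ascending under vbBinaryCompare,
' the same ordering StrComp is asked to use below.
Function SortedMatches(firstList() As String, secondList() As String) As Collection
    Dim found As New Collection
    Dim i As Long, j As Long

    i = LBound(firstList)
    j = LBound(secondList)

    Do While i <= UBound(firstList) And j <= UBound(secondList)
        Select Case StrComp(firstList(i), secondList(j), vbBinaryCompare)
            Case Is < 0
                i = i + 1              ' first-list word is smaller: move on
            Case Is > 0
                j = j + 1              ' second-list entry is smaller: skip it
            Case Else
                found.Add firstList(i) ' match: record it, keep j for repeats
                i = i + 1
        End Select
    Loop

    Set SortedMatches = found
End Function

Sub DemoSortedMatches()
    Dim a() As String, b() As String
    a = Split("a b c d e", " ")
    b = Split("b d", " ")
    Debug.Print SortedMatches(a, b).Count   ' 2 (b and d)
End Sub
```

The walk is only correct when the sort order and the comparison agree. If the lists were sorted by a locale-aware sort but compared in binary mode (or the other way round), entries can be skipped by mistake.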

It works fine, as long as I don't have words that begin either
with an "a" or with an "à". Alas! I can't expect that to happen in any
real text.

For the time being I am stuck with the inefficient solution.

Question: is there a way to change that behaviour, that is,
change the collating sequence from VBA, Word or perhaps
wearing the Windows XP administrator's hat?

- fernando

Jezebel · Jul 16, 2006

I can't replicate the problem. On my machine, the sort sequence is
consistently in this order, whether or not text follows (and whatever it is):
a
á
à
â
ä
ã
å

Helmut Weber · Jul 16, 2006

Hi Fernando,

What sorting algorithm are you using?

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

John Nurick · Jul 16, 2006

Hi Fernando,

It sounds as if you are sorting your list and then using a sequential
search. ISTM it would be much faster to put the first list into a
Dictionary object and then just look up the words in the second list.
Pseudoaircode:

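Function FindMatches(FirstList As Variant, SecondList As Variant) As Object
    'FirstList and SecondList are arrays of strings
    Dim WordList As Object
    Dim WordsFound As Object
    Dim entry As Variant
    Dim token As Variant

    Set WordList = CreateObject("Scripting.Dictionary")
    Set WordsFound = CreateObject("Scripting.Dictionary")

    'Build dictionary of words in first list
    '(Add needs a key and an item, and fails on a duplicate key)
    For Each entry In FirstList
        If Not WordList.Exists(entry) Then WordList.Add entry, entry
    Next entry

    'Compare words in second list with dictionary
    For Each entry In SecondList
        For Each token In Split(entry, " ")
            If WordList.Exists(token) Then
                If Not WordsFound.Exists(token) Then WordsFound.Add token, token
            End If
        Next token
    Next entry

    'WordsFound now contains the words in the second list that
    'exist in the first list
    Set FindMatches = WordsFound
End Function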
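
A short sketch of why the Dictionary route sidesteps the collation question: lookups are by key equality, not by sort order, and the default CompareMode is binary, so "a" and "à" are simply two different keys. The procedure name DemoDictionaryLookup is made up for illustration.

```vba
Option Explicit

Sub DemoDictionaryLookup()
    Dim dict As Object
    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbBinaryCompare   ' the default; must be set while empty

    dict.Add "a", True
    dict.Add ChrW(&HE0), True            ' a-grave is a separate key

    Debug.Print dict.Count               ' 2
    Debug.Print dict.Exists("a tarde")   ' False
End Sub
```

Neither list needs to be sorted for this to work.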

On Sat, 15 Jul 2006 18:42:01 -0700, Fernando Cabral

[snip]
