Similar / Fuzzy "Match" Logic

C

cw

I would like to develop a query that finds similar match's across 2 files.

Table 1
Company Name: The Atlas Company

Table 2
Company Name: Atlas

Query would see these as possible match and display them in the query
results. I can not figure out how to get this to work.

Any help is Greatly Appreciated.
 
J

John Nurick

There's no easy way to do this.

One approach would be to strip out all terms such as "The" and
"Company", then compare the remainder using one or more of various
techniques:

-exact matches
-longest exact submatches starting at first character (e.g. "Atlas"
and "Atlas International" match the first 5)
-longest submatches starting anywhere (e.g. "International Atlas
Tools" and "Atlas Tools International" match on a run of 11
characters)
-least Levenshtein distance (Google for it)
 
C

cw

Thank you John,
This is sort of what I thought but was hoping I was wrong :)
Thank you very much!
 
D

Dale Fye

cw,

A while back, I wrote a function (fnSimilarity) that might help. Basically,
it accepts two variants (to accomdate NULL values in my fields) and parses
them into their component parts (delimited by a space). Then compares each
'word' in each string and counts the number of matches compared to the number
of combinations. Unfortunately, the more words in the values passed, the
lower the "similarity", unless they are exact matches. You could modify this
to ignore 'and', 'the', or other words in both parameters. It might at least
give you a start. You might use it like:

SELECT tbl1.*, tbl2.*
FROM tbl1, tbl2
WHERE Similarity(tbl1.field1, tbl2.Field1) > csng([Cutoff value?])
Order by Similarity(tbl1.field1, tbl2.field1) DESC

Public Function Similarity(FirstValue As Variant, SecondValue As Variant) As
Single

Dim FArray() As String, SArray() As String
Dim intCounter As Integer, intMatches As Integer
Dim intLoop1 As Integer, intLoop2 As Integer

If IsNull(FirstValue) Or IsNull(SecondValue) Then
Similarity = 0
Exit Function
ElseIf FirstValue = SecondValue Then
Similarity = 99
Exit Function
End If

FArray() = Split(FirstValue, " ")
SArray() = Split(SecondValue, " ")

For intLoop1 = LBound(FArray) To UBound(FArray)
For intLoop2 = LBound(SArray) To UBound(SArray)
intCounter = intCounter + 1
If FArray(intLoop1) = SArray(intLoop2) Then
intMatches = intMatches + 2
ElseIf InStr(FArray(intLoop1), SArray(intLoop2)) > 0 Then
intMatches = intMatches + 1
ElseIf InStr(SArray(intLoop2), FArray(intLoop1)) > 0 Then
intMatches = intMatches + 1
End If
Next intLoop2
Next intLoop1

Similarity = IIf(intCounter = 0, 0, CSng(intMatches) / CSng(intCounter))

End Function
 
C

cw

AWESOME!!!!! Thanks VERY MUCH - this should do the trick!

Dale Fye said:
cw,

A while back, I wrote a function (fnSimilarity) that might help. Basically,
it accepts two variants (to accomdate NULL values in my fields) and parses
them into their component parts (delimited by a space). Then compares each
'word' in each string and counts the number of matches compared to the number
of combinations. Unfortunately, the more words in the values passed, the
lower the "similarity", unless they are exact matches. You could modify this
to ignore 'and', 'the', or other words in both parameters. It might at least
give you a start. You might use it like:

SELECT tbl1.*, tbl2.*
FROM tbl1, tbl2
WHERE Similarity(tbl1.field1, tbl2.Field1) > csng([Cutoff value?])
Order by Similarity(tbl1.field1, tbl2.field1) DESC

Public Function Similarity(FirstValue As Variant, SecondValue As Variant) As
Single

Dim FArray() As String, SArray() As String
Dim intCounter As Integer, intMatches As Integer
Dim intLoop1 As Integer, intLoop2 As Integer

If IsNull(FirstValue) Or IsNull(SecondValue) Then
Similarity = 0
Exit Function
ElseIf FirstValue = SecondValue Then
Similarity = 99
Exit Function
End If

FArray() = Split(FirstValue, " ")
SArray() = Split(SecondValue, " ")

For intLoop1 = LBound(FArray) To UBound(FArray)
For intLoop2 = LBound(SArray) To UBound(SArray)
intCounter = intCounter + 1
If FArray(intLoop1) = SArray(intLoop2) Then
intMatches = intMatches + 2
ElseIf InStr(FArray(intLoop1), SArray(intLoop2)) > 0 Then
intMatches = intMatches + 1
ElseIf InStr(SArray(intLoop2), FArray(intLoop1)) > 0 Then
intMatches = intMatches + 1
End If
Next intLoop2
Next intLoop1

Similarity = IIf(intCounter = 0, 0, CSng(intMatches) / CSng(intCounter))

End Function

--
Email address is not valid.
Please reply to newsgroup only.


cw said:
I would like to develop a query that finds similar match's across 2 files.

Table 1
Company Name: The Atlas Company

Table 2
Company Name: Atlas

Query would see these as possible match and display them in the query
results. I can not figure out how to get this to work.

Any help is Greatly Appreciated.
 
F

FBxiii

Hi.

I have used this piece of code to do some 'fuzzy' matching.

In my recordset, I am getting values of 0, 1 or 99 in the Similarity field.
Can you explaing what the Cut Off value is please?

Thanks,
Steve.



Dale Fye said:
cw,

A while back, I wrote a function (fnSimilarity) that might help. Basically,
it accepts two variants (to accomdate NULL values in my fields) and parses
them into their component parts (delimited by a space). Then compares each
'word' in each string and counts the number of matches compared to the number
of combinations. Unfortunately, the more words in the values passed, the
lower the "similarity", unless they are exact matches. You could modify this
to ignore 'and', 'the', or other words in both parameters. It might at least
give you a start. You might use it like:

SELECT tbl1.*, tbl2.*
FROM tbl1, tbl2
WHERE Similarity(tbl1.field1, tbl2.Field1) > csng([Cutoff value?])
Order by Similarity(tbl1.field1, tbl2.field1) DESC

Public Function Similarity(FirstValue As Variant, SecondValue As Variant) As
Single

Dim FArray() As String, SArray() As String
Dim intCounter As Integer, intMatches As Integer
Dim intLoop1 As Integer, intLoop2 As Integer

If IsNull(FirstValue) Or IsNull(SecondValue) Then
Similarity = 0
Exit Function
ElseIf FirstValue = SecondValue Then
Similarity = 99
Exit Function
End If

FArray() = Split(FirstValue, " ")
SArray() = Split(SecondValue, " ")

For intLoop1 = LBound(FArray) To UBound(FArray)
For intLoop2 = LBound(SArray) To UBound(SArray)
intCounter = intCounter + 1
If FArray(intLoop1) = SArray(intLoop2) Then
intMatches = intMatches + 2
ElseIf InStr(FArray(intLoop1), SArray(intLoop2)) > 0 Then
intMatches = intMatches + 1
ElseIf InStr(SArray(intLoop2), FArray(intLoop1)) > 0 Then
intMatches = intMatches + 1
End If
Next intLoop2
Next intLoop1

Similarity = IIf(intCounter = 0, 0, CSng(intMatches) / CSng(intCounter))

End Function

--
Email address is not valid.
Please reply to newsgroup only.


cw said:
I would like to develop a query that finds similar match's across 2 files.

Table 1
Company Name: The Atlas Company

Table 2
Company Name: Atlas

Query would see these as possible match and display them in the query
results. I can not figure out how to get this to work.

Any help is Greatly Appreciated.
 
D

Dale Fye

Well, a zero means that one or both of the values passed to the function is
NULL, or that none of the words in the passed values match. A 99 means that
they match exactly (not case sensitive).

If neither one of those conditions is met, then the algorithm compares each
element of each value looking for exact matches (worth 2 points) or for
subsets (worth 1 point). Finally, the algorithm divides the sum of these
values by the number of comparisons.

Can you give me an example of some of the values you are passing to the
function, and the value you are getting back?

Dale
--
Don''t forget to rate the post if it was helpful!

email address is invalid
Please reply to newsgroup only.



FBxiii said:
Hi.

I have used this piece of code to do some 'fuzzy' matching.

In my recordset, I am getting values of 0, 1 or 99 in the Similarity field.
Can you explaing what the Cut Off value is please?

Thanks,
Steve.



Dale Fye said:
cw,

A while back, I wrote a function (fnSimilarity) that might help. Basically,
it accepts two variants (to accomdate NULL values in my fields) and parses
them into their component parts (delimited by a space). Then compares each
'word' in each string and counts the number of matches compared to the number
of combinations. Unfortunately, the more words in the values passed, the
lower the "similarity", unless they are exact matches. You could modify this
to ignore 'and', 'the', or other words in both parameters. It might at least
give you a start. You might use it like:

SELECT tbl1.*, tbl2.*
FROM tbl1, tbl2
WHERE Similarity(tbl1.field1, tbl2.Field1) > csng([Cutoff value?])
Order by Similarity(tbl1.field1, tbl2.field1) DESC

Public Function Similarity(FirstValue As Variant, SecondValue As Variant) As
Single

Dim FArray() As String, SArray() As String
Dim intCounter As Integer, intMatches As Integer
Dim intLoop1 As Integer, intLoop2 As Integer

If IsNull(FirstValue) Or IsNull(SecondValue) Then
Similarity = 0
Exit Function
ElseIf FirstValue = SecondValue Then
Similarity = 99
Exit Function
End If

FArray() = Split(FirstValue, " ")
SArray() = Split(SecondValue, " ")

For intLoop1 = LBound(FArray) To UBound(FArray)
For intLoop2 = LBound(SArray) To UBound(SArray)
intCounter = intCounter + 1
If FArray(intLoop1) = SArray(intLoop2) Then
intMatches = intMatches + 2
ElseIf InStr(FArray(intLoop1), SArray(intLoop2)) > 0 Then
intMatches = intMatches + 1
ElseIf InStr(SArray(intLoop2), FArray(intLoop1)) > 0 Then
intMatches = intMatches + 1
End If
Next intLoop2
Next intLoop1

Similarity = IIf(intCounter = 0, 0, CSng(intMatches) / CSng(intCounter))

End Function

--
Email address is not valid.
Please reply to newsgroup only.


cw said:
I would like to develop a query that finds similar match's across 2 files.

Table 1
Company Name: The Atlas Company

Table 2
Company Name: Atlas

Query would see these as possible match and display them in the query
results. I can not figure out how to get this to work.

Any help is Greatly Appreciated.
 
F

FBxiii

Hi. Thanks for the info. I beleive the function is working correctly. I
have posted some values below and their similarity score (Expr1). The last
one looks like a 'fuzzy' match, but it has a score of 0. I am guessing the
function only compares tbl2 with tbl1 and not the other way round...

tbl1.Field1 tbl2.Field1 Expr1
M016A0286900A6 M016A0286900A6 99
G4A03654160601 G4A03654160601 99
00337409 00337409 99
A683978 A683978 99
G4K16925190601 G4K16925190601 99
0062987 0062987 99
8003028S 8003028S 99
8002197S 002197 1
L9401182976S 1182976 1
0 0000995 1
51200144700000 2001447 1
8004720S 8004720 1
0 0873259 1
0434632S 434632 1
0184622S 84622 1
01312846S 0312846 0



Dale Fye said:
Well, a zero means that one or both of the values passed to the function is
NULL, or that none of the words in the passed values match. A 99 means that
they match exactly (not case sensitive).

If neither one of those conditions is met, then the algorithm compares each
element of each value looking for exact matches (worth 2 points) or for
subsets (worth 1 point). Finally, the algorithm divides the sum of these
values by the number of comparisons.

Can you give me an example of some of the values you are passing to the
function, and the value you are getting back?

Dale
--
Don''t forget to rate the post if it was helpful!

email address is invalid
Please reply to newsgroup only.



FBxiii said:
Hi.

I have used this piece of code to do some 'fuzzy' matching.

In my recordset, I am getting values of 0, 1 or 99 in the Similarity field.
Can you explaing what the Cut Off value is please?

Thanks,
Steve.



Dale Fye said:
cw,

A while back, I wrote a function (fnSimilarity) that might help. Basically,
it accepts two variants (to accomdate NULL values in my fields) and parses
them into their component parts (delimited by a space). Then compares each
'word' in each string and counts the number of matches compared to the number
of combinations. Unfortunately, the more words in the values passed, the
lower the "similarity", unless they are exact matches. You could modify this
to ignore 'and', 'the', or other words in both parameters. It might at least
give you a start. You might use it like:

SELECT tbl1.*, tbl2.*
FROM tbl1, tbl2
WHERE Similarity(tbl1.field1, tbl2.Field1) > csng([Cutoff value?])
Order by Similarity(tbl1.field1, tbl2.field1) DESC

Public Function Similarity(FirstValue As Variant, SecondValue As Variant) As
Single

Dim FArray() As String, SArray() As String
Dim intCounter As Integer, intMatches As Integer
Dim intLoop1 As Integer, intLoop2 As Integer

If IsNull(FirstValue) Or IsNull(SecondValue) Then
Similarity = 0
Exit Function
ElseIf FirstValue = SecondValue Then
Similarity = 99
Exit Function
End If

FArray() = Split(FirstValue, " ")
SArray() = Split(SecondValue, " ")

For intLoop1 = LBound(FArray) To UBound(FArray)
For intLoop2 = LBound(SArray) To UBound(SArray)
intCounter = intCounter + 1
If FArray(intLoop1) = SArray(intLoop2) Then
intMatches = intMatches + 2
ElseIf InStr(FArray(intLoop1), SArray(intLoop2)) > 0 Then
intMatches = intMatches + 1
ElseIf InStr(SArray(intLoop2), FArray(intLoop1)) > 0 Then
intMatches = intMatches + 1
End If
Next intLoop2
Next intLoop1

Similarity = IIf(intCounter = 0, 0, CSng(intMatches) / CSng(intCounter))

End Function

--
Email address is not valid.
Please reply to newsgroup only.


:

I would like to develop a query that finds similar match's across 2 files.

Table 1
Company Name: The Atlas Company

Table 2
Company Name: Atlas

Query would see these as possible match and display them in the query
results. I can not figure out how to get this to work.

Any help is Greatly Appreciated.
 
J

John Spencer

The Score of 0 is correct for the pair
01312846S and 0312846
They are not an exact match and neither one is contained in the other. The
similarity function matches "whole words" and not individual characters.

You might try using Levenshtein Distance. This calculates the number of
"changes" required to convert one string to another.

There's an article and sample database of Doug Steele's here (the April 2005
link) http://www.accessmvp.com/djsteele/SmartAccess.html
which might be of help.


Fuzzy Matching - Levenshtein Distance
See http://www.merriampark.com/ld.htm#VB

Paste the following into a new module and save the module.
'*******************************
'*** Get minimum of three values
'*******************************

Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim mi As Integer

mi = a
If b < mi Then
mi = b
End If
If c < mi Then
mi = c
End If

Minimum = mi

End Function

'********************************
'*** Compute Levenshtein Distance
'********************************

Public Function LD(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim n As Integer ' length of s
Dim i As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

' Step 1
n = Len(s)
m = Len(t)
If n = 0 Then
LD = m
Exit Function
End If

If m = 0 Then
LD = n
Exit Function
End If

ReDim d(0 To n, 0 To m) As Integer

' Step 2
For i = 0 To n
d(i, 0) = i
Next i

For j = 0 To m
d(0, j) = j
Next j

' Step 3
For i = 1 To n

s_i = Mid$(s, i, 1)

' Step 4
For j = 1 To m

t_j = Mid$(t, j, 1)

' Step 5
If s_i = t_j Then
cost = 0
Else
cost = 1
End If

' Step 6
d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) +
cost)

Next j

Next i

' Step 7
LD = d(n, m)

Erase d

End Function
--
John Spencer
Access MVP 2002-2005, 2007-2008
Center for Health Program Development and Management
University of Maryland Baltimore County
..
 
F

FBxiii

Thanks for the reply. Do you know where the code may be for the Levenshtein
distance?

I have created a function that looks for the given number of characters
within String1 and String2 and returns a percentage match. This picks up the
one that did not have an exact match...

Public Function Partial_Match(txtString1 As Variant, txtString2 As Variant,
intChars_to_Match As Integer) As Double

' This function compares txtString1 and txtString2 for intChars_to_Match
matching characters
' It passes a value back which is a Score based on how the values match.

If (IsNull(txtString1) = True Or txtString1 = "") Or (IsNull(txtString2)
= True Or txtString2 = "") Then
' String1 or 2 are null. No values to match so Score of 0.
Partial_Match = 0
Exit Function
ElseIf txtString1 = txtString2 Then
' String 1 and 2 match exactly. Give it a score of 100%
Partial_Match = 100
Exit Function
End If

Dim intTemp1 As Integer
Dim intTemp2 As Integer

' Compare txtString1 against txtString2
For intTemp1 = 1 To (Len(txtString1) - intChars_to_Match)

For intTemp2 = 1 To (Len(txtString2) - intChars_to_Match)

If Mid(txtString1, intTemp1, intChars_to_Match) =
Mid(txtString2, intTemp2, intChars_to_Match) Then
Partial_Match = Format((intChars_to_Match / Len(txtString1)
* 100), "0.00")
End If

Next intTemp2

Next intTemp1

End Function
 
J

John Spencer

The function named LD in my post computes the Levenshtein Distance and Doug
Steeele's article and sample database at the URL I posted also has a
rsoutine to do that.

--
John Spencer
Access MVP 2002-2005, 2007-2008
Center for Health Program Development and Management
University of Maryland Baltimore County
..
 
F

FBxiii

Hi.

Thanks for the replies. With your example (John), is the returned number
the number of changes required to string s or t?

Can I use the len() function to determine how far out the string is? i.e.
len(s) - no. chars out?
 
J

John Spencer

I believe the number of changes is the same to change String A to String B
as to change StringB to StringA.

I guess you could use
Len(StringA) - LD(StringA,StringB)
but I'm not sure what that would tell you.

two strings (A and B)
"abc" "abcdef"
LD would return 3, so
Len(A) - 3 = 0
Len(B) - 3 = 3

Perhaps a ratio of the number of changes to the total length would be more
significant.
"ABCDEFGHIJ" "BCDEFGHIJ"
LD(A,B) / (Len(A) + Len(B))
will return .0526 (1 change in 19 letters)

The closer to zero the better the match; the closer to 1 the worse the
match. And even then you might need to adjust the significance on the
length of the two strings.


--
John Spencer
Access MVP 2002-2005, 2007-2008
Center for Health Program Development and Management
University of Maryland Baltimore County
..
 
J

jenniferspnc

I'm trying this but I keep getting this error:

compile error. in query expression 'similarity (emea1000.companyname,
uk1000.companyname) > csng([cutoffvalue?])

here is my sql. have a feeling it's to do with the cutoffvalue?, just not
sure what to input there (if that's the case).

SELECT emea1000.*, uk1000.*
FROM emea1000, uk1000
WHERE similarity(emea1000.companyname, uk1000.companyname) > csng([Cutoff
value?])
Order by similarity(emea1000.companyname, uk1000.companyname) DESC

here is the code in the module labeled fnsimilarity
Public Function similarity(FirstValue As Variant, SecondValue As Variant) As
Single

Dim FArray() As String, SArray() As String
Dim intCounter As Integer, intMatches As Integer
Dim intLoop1 As Integer, intLoop2 As Integer

If IsNull(FirstValue) Or IsNull(SecondValue) Then
similarity = 0
Exit Function
ElseIf FirstValue = SecondValue Then
similarity = 99
Exit Function
End If

FArray() = Split(FirstValue, " ")
SArray() = Split(SecondValue, " ")

For intLoop1 = LBound(FArray) To UBound(FArray)
For intLoop2 = LBound(SArray) To UBound(SArray)
intCounter = intCounter + 1
If FArray(intLoop1) = SArray(intLoop2) Then
intMatches = intMatches + 2
ElseIf InStr(FArray(intLoop1), SArray(intLoop2)) > 0 Then
intMatches = intMatches + 1
ElseIf InStr(SArray(intLoop2), FArray(intLoop1)) > 0 Then
intMatches = intMatches + 1
End If
Next intLoop2
Next intLoop1

similarity = IIf(intCounter = 0, 0, CSng(intMatches) / CSng(intCounter))

End Function

Dale Fye said:
Well, a zero means that one or both of the values passed to the function is
NULL, or that none of the words in the passed values match. A 99 means that
they match exactly (not case sensitive).

If neither one of those conditions is met, then the algorithm compares each
element of each value looking for exact matches (worth 2 points) or for
subsets (worth 1 point). Finally, the algorithm divides the sum of these
values by the number of comparisons.

Can you give me an example of some of the values you are passing to the
function, and the value you are getting back?

Dale
--
Don''t forget to rate the post if it was helpful!

email address is invalid
Please reply to newsgroup only.



FBxiii said:
Hi.

I have used this piece of code to do some 'fuzzy' matching.

In my recordset, I am getting values of 0, 1 or 99 in the Similarity field.
Can you explaing what the Cut Off value is please?

Thanks,
Steve.



Dale Fye said:
cw,

A while back, I wrote a function (fnSimilarity) that might help. Basically,
it accepts two variants (to accomdate NULL values in my fields) and parses
them into their component parts (delimited by a space). Then compares each
'word' in each string and counts the number of matches compared to the number
of combinations. Unfortunately, the more words in the values passed, the
lower the "similarity", unless they are exact matches. You could modify this
to ignore 'and', 'the', or other words in both parameters. It might at least
give you a start. You might use it like:

SELECT tbl1.*, tbl2.*
FROM tbl1, tbl2
WHERE Similarity(tbl1.field1, tbl2.Field1) > csng([Cutoff value?])
Order by Similarity(tbl1.field1, tbl2.field1) DESC

Public Function Similarity(FirstValue As Variant, SecondValue As Variant) As
Single

Dim FArray() As String, SArray() As String
Dim intCounter As Integer, intMatches As Integer
Dim intLoop1 As Integer, intLoop2 As Integer

If IsNull(FirstValue) Or IsNull(SecondValue) Then
Similarity = 0
Exit Function
ElseIf FirstValue = SecondValue Then
Similarity = 99
Exit Function
End If

FArray() = Split(FirstValue, " ")
SArray() = Split(SecondValue, " ")

For intLoop1 = LBound(FArray) To UBound(FArray)
For intLoop2 = LBound(SArray) To UBound(SArray)
intCounter = intCounter + 1
If FArray(intLoop1) = SArray(intLoop2) Then
intMatches = intMatches + 2
ElseIf InStr(FArray(intLoop1), SArray(intLoop2)) > 0 Then
intMatches = intMatches + 1
ElseIf InStr(SArray(intLoop2), FArray(intLoop1)) > 0 Then
intMatches = intMatches + 1
End If
Next intLoop2
Next intLoop1

Similarity = IIf(intCounter = 0, 0, CSng(intMatches) / CSng(intCounter))

End Function

--
Email address is not valid.
Please reply to newsgroup only.


:

I would like to develop a query that finds similar match's across 2 files.

Table 1
Company Name: The Atlas Company

Table 2
Company Name: Atlas

Query would see these as possible match and display them in the query
results. I can not figure out how to get this to work.

Any help is Greatly Appreciated.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top