fuzzy match

M

Marcy

Somebody posted something like this last week and it was never solved. Is
this too much for Excel?

A1 thru An contain the names of companies. B1 thru Bn contain another list
of companies. Each list was created by different people at different times
from different sources. Most of the names in the first list should be in the
second list. Is it possible to match the names to get something showing the
% match? Maybe counting the number of consecutive letters that match even if
they're in a different position or something. Or maybe ignore things like
"Inc." So the end result would look like this:

Company_list_1 Company_list_2 Matches with
%
AT&T AT&T B2
100%
IBM Wal-Mart B4
100%
WalMart IBM, Inc B3
90%
Sears Ford
Ford Motors Southwest Airlines B4
40%
 
C

CLR

This is tough because of the different ways that different (and even the
same) people do type the same thing in.........it will take some creative
effort........you can use the Edit > Replace feature to eliminate things like
the "Inc.", by just replacing them with nothing (leave that window
blanik)........you can show the number of characters in each cell by =LEN(A1)
and copying it down....you can also use the Data > TextToColumns >
SpaceDelimited to separate out each term into it's own column, and the LEFT
and RIGHT and VLOOKUP formulas might help separate things out further, and
even AutoFilter using "contains" as a Custom Selection might
help..........there's no magic bullet for this kind of thing, and one must
have the data in front of them to do much good, and even then, it's hit and
miss..........

Also, maybe there is a Street Address column or phone number column or
Vendor number column that might have the same numbers in it for the same
company no matter how the company name is spelled

hth
Vaya con Dios,
Chuck, CABGx3
 
P

Pete

Ok, just building on Chuck's comments, you could do Edit | Replace
repeatedly to get rid of spaces and punctuation symbols like hyphens
(better to copy column A to a new column B and carry out these actions
on B and C, so that you have the original column A to fall back on).
You can convert all characters to upper case with =UPPER(B1) etc. You
can sort both columns B and C independently to get all names beginning
with "A" etc. together. Scanning through B and C, you can insert a cell
in column C to move the rest of the cells in that column down whenever
you come across an entry in B which is not in C (and vice-versa, though
you have to insert a cell in both A and B to keep them lined up).

There is no easy formulaic way to achieve this.

Hope this helps.

Pete
 
K

Kevin B

You can try the find function and use the asterisk wild card character as the
first and last entry to do the closest thing to a "like" search.

*Find Text*
 
S

Sloth

These might get you close to what you want.

=TEXT(SUMPRODUCT(ROW($A$2:$A$6),--(((SUBSTITUTE(A2,"-","")=SUBSTITUTE($B$2:$B$6,"-",""))+IF(ISERROR(FIND(A2,$B$2:$B$6)),0,1)+IF(ISERROR(FIND($B$2:$B$6,A2)),0,1))>0)),"B#;;")

copy it into the cell and confirm it with ctrl+shift+enter. Brackets will
automatically be entered to show it is an array function. This function
basically does three comparisons.

1. Compares the A2 (minus "-") with each item in B (minus "-").
Example: Walmart and Wal-Mart.
2. Compares A2 to see if any part of each item in B contains A2.
Example: IBM and IBM, Inc.
3. Compares each item of B to see if it is part of A2.
Example: Ford Motors and Ford

=IF(C2="","",MIN(LEN(A2)/LEN(INDIRECT(C2)),LEN(INDIRECT(C2))/LEN(A2)))

this simply compares the length of the two names (slightly different than
what you have).

They result in something like this...

Company_list_1 Company_list_2 Matches_With %
AT&T AT&T B2 100%
IBM Wal-Mart B4 33%
WalMart IBM, Inc. B3 88%
Sears Ford
Ford Motors Southwest Airlines B5 36%

If you have more than two items that are close to each other, then the
output might be incorrect. Your problem is with the versatility you want not
with excel. You need to figure out exactly what constitutes a match, exactly
what can be ignored.

It would be a lot easier if you could do a find and replace, and remove
anything you want ignored. For instance enter - in the find (or ", inc.")
and nothing in the replace, and then hit replace all.

This might not be exactly what you want, but I hope it will get you close to
it.
 
B

Bill Martin

Your problem is of course impossible to solve perfectly. However, you might
consider using a Soundex algorithm when the issue is misspelling. It takes a
character string and returns a value of the sound of that word in English. It
would not be a complete solution, but might be part of a larger solution.

Here's a Soundex implementation in VBA:

http://j-walk.com/ss/excel/tips/tip77.htm
Bill
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top