Your current table is in fact correctly normalized insofar as the Crop and
County columns are concerned. The fact that "each crop name appears hundreds
of times" is perfectly acceptable. You don't say what the other columns are
so I can't comment on them. If for instance the table includes a State
column then that is redundant as County determines State.
What you are lacking are referenced Crops and Counties tables to allow
referential integrity to be enforced, and a States table if your current
table includes a State column. I'll assume that it does as far as the
examples below are concerned, so the first steps would be to create a States
table, with column State (primary key); a Counties table with columns County
(primary key) and State; and a Crops table with column Crop (primary key).
I've assumed that county names are unique, i.e. no two states have a county
of the same name.
The general principle underlying the process of decomposing a table into a
set of normalized tables is that the referenced tables are filled first, then
the referencing tables. In your case your existing table is the referencing
table of course, so it doesn't need filling. To fill the refaced tables
you'd execute three 'append' queries:
1. Fill the States table if necessary with:
INSERT INTO States(State)
SELECT DISTINCT State
FROM YourCurrentTable
ORDER BY State;
2. File the Counties table with
INSERT INTO Counties(County,State)
SELECT DISTINCT County,State
FROM YourCurrentTable
ORDER BY County, State;
3. Fill the Crops table with:
INSERT INTO Crops(Crop)
SELECT DISTINCT Crop
FROM YourCurrentTable
ORDER BY Crop;
In each case the ORDER BY clause isn't really necessary; its just for
neatness.
The next step is to create relationships between Countes and States on State
if necessary; between YourCurrentTable and Counties on County; and between
YourCurrentTable and Crops on Crop. In each case enforce referential
integrity and cascade updates. Finally, once you are sure the data in the
new tables is correct delete the redundant State column from YourCurrentTable
if necessary.
The above model uses natural keys, which is perfectly OK in this case as the
values of the Crop, County and State are all unique. The primary key of
YourCurrentTable is a composite one of County and Crop. What this table is
modelling in fact is a many-to-many relationship type between the Counties
and Crops entity types. As it stands the table can of course record only one
instance of a crop per county. I'd have expected a column such as
PlantingYear for instance so that the table can record plantings over time,
in which case the PlantingYear column would also be part of the table's
primary key. If by any chance you are creating separate tables for this then
that's a bad design as its encoding data as table names. Data must be stored
as explicit values at column positions in rows in tables and in no other way;
it was Codd's Rule 1, the Information Rule, when he first proposed the
database relational model in 1971.
As regards your second point you can if you wish include surrogate numeric
keys in place of the 'natural' keys, but its not necessary, and natural keys
do have certain advantages (when creating 'correlated' combo boxes for
instance). Surrogate keys are necessary where the text values are not
unique, e.g. if a county name occurs in more than one state, but that appears
not to be the case. They are necessary with cities for instance, as city
names are widely duplicated both nationally and internationally. If you do
wish to use surrogate keys then you'd add autonumber columns to each of the
referenced tables and corresponding long integer numeric columns to the
referencing table, join the referring table to a referenced table in an
'append' query on the natural 'candidate' keys, and update the numeric
foreign key in the referencing table to the value of the primary key in the
referenced table, e.g. to update a CountyID column in YourCurrentTable:
UPDATE YourCurrentTable INNER JOIN Counties
ON YourCurrentTable.County = Counties.County
SET YourCurrentTable,CountyID = Counties.CountyID;
You can then delete the redundant County column from YourCurrentTable.
Ken Sheridan
Stafford, England