Reverse-engineering a source table from a large database

KG · Mar 21, 2009

I have a large database (12 columns x 8700 rows) listing planted acres by
CROP in each COUNTY of the contiguous United States. Because there are 12
different crops, each county name may appear 12 different times. Each crop
name appears hundreds of times.

1) How can I go back and create new tables in which each county and each
crop is listed ONLY ONCE? (I don't believe that I can do this by searching
for unique entries because each crop/county combination is in fact unique)
2) Once Step #1 has been accomplished, can I assign a unique ID number
(autonumber?) after the fact so that I can create relationships between the
tables?

Many thanks in advance.

Albert D. Kallal · Mar 21, 2009

Just create a new query.

Base the query on this table.

Drop the County field, and then drop in the crop field.

Whack the sideways M (summation) in the menu bar.

(both fields should have a "group by" appear under them.

You done. Try running the query (hit the !)

If you want to export this list into another table, then simply flip the
above back into design view..and then change the query into an append
query....

KG · Mar 21, 2009

Many thanks, I'll ty it.

Ken Sheridan · Mar 21, 2009

Your current table is in fact correctly normalized insofar as the Crop and
County columns are concerned. The fact that "each crop name appears hundreds
of times" is perfectly acceptable. You don't say what the other columns are
so I can't comment on them. If for instance the table includes a State
column then that is redundant as County determines State.

What you are lacking are referenced Crops and Counties tables to allow
referential integrity to be enforced, and a States table if your current
table includes a State column. I'll assume that it does as far as the
examples below are concerned, so the first steps would be to create a States
table, with column State (primary key); a Counties table with columns County
(primary key) and State; and a Crops table with column Crop (primary key).
I've assumed that county names are unique, i.e. no two states have a county
of the same name.

The general principle underlying the process of decomposing a table into a
set of normalized tables is that the referenced tables are filled first, then
the referencing tables. In your case your existing table is the referencing
table of course, so it doesn't need filling. To fill the refaced tables
you'd execute three 'append' queries:

1. Fill the States table if necessary with:

INSERT INTO States(State)
SELECT DISTINCT State
FROM YourCurrentTable
ORDER BY State;

2. File the Counties table with

INSERT INTO Counties(County,State)
SELECT DISTINCT County,State
FROM YourCurrentTable
ORDER BY County, State;

3. Fill the Crops table with:

INSERT INTO Crops(Crop)
SELECT DISTINCT Crop
FROM YourCurrentTable
ORDER BY Crop;

In each case the ORDER BY clause isn't really necessary; its just for
neatness.

The next step is to create relationships between Countes and States on State
if necessary; between YourCurrentTable and Counties on County; and between
YourCurrentTable and Crops on Crop. In each case enforce referential
integrity and cascade updates. Finally, once you are sure the data in the
new tables is correct delete the redundant State column from YourCurrentTable
if necessary.

The above model uses natural keys, which is perfectly OK in this case as the
values of the Crop, County and State are all unique. The primary key of
YourCurrentTable is a composite one of County and Crop. What this table is
modelling in fact is a many-to-many relationship type between the Counties
and Crops entity types. As it stands the table can of course record only one
instance of a crop per county. I'd have expected a column such as
PlantingYear for instance so that the table can record plantings over time,
in which case the PlantingYear column would also be part of the table's
primary key. If by any chance you are creating separate tables for this then
that's a bad design as its encoding data as table names. Data must be stored
as explicit values at column positions in rows in tables and in no other way;
it was Codd's Rule 1, the Information Rule, when he first proposed the
database relational model in 1971.

As regards your second point you can if you wish include surrogate numeric
keys in place of the 'natural' keys, but its not necessary, and natural keys
do have certain advantages (when creating 'correlated' combo boxes for
instance). Surrogate keys are necessary where the text values are not
unique, e.g. if a county name occurs in more than one state, but that appears
not to be the case. They are necessary with cities for instance, as city
names are widely duplicated both nationally and internationally. If you do
wish to use surrogate keys then you'd add autonumber columns to each of the
referenced tables and corresponding long integer numeric columns to the
referencing table, join the referring table to a referenced table in an
'append' query on the natural 'candidate' keys, and update the numeric
foreign key in the referencing table to the value of the primary key in the
referenced table, e.g. to update a CountyID column in YourCurrentTable:

UPDATE YourCurrentTable INNER JOIN Counties
ON YourCurrentTable.County = Counties.County
SET YourCurrentTable,CountyID = Counties.CountyID;

You can then delete the redundant County column from YourCurrentTable.

Ken Sheridan
Stafford, England

KG · Mar 21, 2009

Fantastic! Instructive and extremely helpful. Thank you.

KG · Mar 21, 2009

PS: yes, there is a State column. The remaining columns are sales potential
in $ by year for 2009-2014. I am not yet sure how I can deconstruct the
referencing table without losing the sales $ values for each combination of
county/crop/year.

Many thanks. It is greatly appreciated.

John W. Vinson · Mar 21, 2009

Surrogate keys are necessary where the text values are not
unique, e.g. if a county name occurs in more than one state, but that appears
not to be the case.

Actually it is, Ken - I've been in Benton County, Arkansas and in Benton
County, Washington. <checking my USCounties table> Heh. There are 31 states
with a Washington County.

My table was derived (with some effort, the source is nasty!) from the US
Census Bureau listing at
http://www.census.gov/datamap/fipslist/AllSt.txt
If you Google for County FIPS there's lots of hits; including "Access" in the
google search finds several commercial sites with Access (or compatible)
databases for a fee.

John Spencer · Mar 21, 2009

You could build the necessary table with a UNION query.

SELECT County, State, Crop, "2009" as TheYear, [fld2009]
FROM [SourceTable]
UNION
SELECT County, State, Crop, "2010" as TheYear, [fld2010]
FROM [SourceTable]
UNION
SELECT County, State, Crop, "2011" as TheYear, [fld2011]
FROM [SourceTable]
UNION
....
SELECT County, State, Crop, "2014" as TheYear, [fld2014]
FROM [SourceTable]

That is only 9 columns, but that is the best I can do from your post -
you mention county, state, crop, and year columns from 2009 to 2014.

If you want a query of just the unique County and State combinations
then use a query like the following.
SELECT DISTINCT County, State
FROM [SourceTable]

'====================================================
John Spencer
Access MVP 2002-2005, 2007-2009
The Hilltop Institute
University of Maryland Baltimore County
'====================================================

KG · Mar 21, 2009

Thank you guys, I really appreciate it. I think that, in reference to viable
options, I am guilty of omitting one important detail: my database in
question is LINKED to an Excel source that will be updated on a regular
basis. I can develop tables and queries from the linked database but I cannot
do anything that loses the links. Am I stuck with the original design?!

Paul Shapiro · Mar 21, 2009

That would make sense. Even if the data doesn't yet exist, the natural key
is the combination of {country, state, county}, or you could omit the
country if all data was guaranteed to be for a single country, with NO
possibility of EVER including any other country. Since each state is
independent, there is no reason to assume they could not EVER use the same
county name. Similarly for state names, which can only be guaranteed unique
within a single country. So the surrogate key for counties is not necessary,
but it might be an option worth considering for the programmer's convenience
when the natural key is more than a single attribute.

John W. Vinson · Mar 22, 2009

Thank you guys, I really appreciate it. I think that, in reference to viable
options, I am guilty of omitting one important detail: my database in
question is LINKED to an Excel source that will be updated on a regular
basis. I can develop tables and queries from the linked database but I cannot
do anything that loses the links. Am I stuck with the original design?!

You can't update a linked Excel spreadsheet (since Microsoft lost a lawsuit a
couple of years ago and had to turn off that capability); but you can indeed
create Queries linking a main spreadsheet table to local Access tables of
counties or states or zipcodes.

What you *can't* do is enforce referential integrity between an external table
(such as your spreadsheet) and a local Access table, which vitiaties a lot of
the advantage of having such a table. There's no way to keep someone in Excel
from updating the spreadsheet to have a record for "Canion County, Idaho" even
though it's actually Canyon County.

Ken Sheridan · Mar 22, 2009

John:

You've confirmed what I suspected. I'd have been surprised if in a federal
structure like the USA county names were distinct, unlike in the UK, where
counties, as the primary local governmental unit do have distinct names. In
the context of the current thread, however, it appears at first sight, on the
basis of the information given, that a distinct subset of 725 (8700/12)
counties are involved, as "each crop/county combination is in fact unique".
Databases do of course invariably model a subset of the real world, so if the
above statement will always be true in the scope of this database the use of
a natural key would be valid, notwithstanding the duplication of county names
in the wider context, on the same basis that allows department names to be
valid keys in a database which models one particular company, despite the
fact that a million and one companies will use the same department names in
the external world.

On the other hand, the above statement might not be quite what it seems; it
could perhaps be referring to each county not as the value of the column,
i.e. the county name, but the physical geographical entity of the county, in
which case, in terms of the data values, what is unique is not 'each
crop/county combination' but 'each crop/county/state combination'. The
obvious solution here would be a surrogate CountyID key, but its worth noting
that there is also the alternative 'purist' solution of a composite
county/state key.

Ken Sheridan
Stafford, England

Ken Sheridan · Mar 22, 2009

What you can do in that sort of situation is include a routine in the Access
database which updates the referencing table from the current Excel
worksheet. As it happens I've been helping a friend of mine in California
with something similar over the last few weeks for importing price data on
products from Excel into an Access database, but the key in this case is the
SKU of each product, so its unlikely you'd get the sort of inconsistencies
that John refers to with his example of 'Canion County' instead of 'Canyon
County'. In my friend's case an unrecognized SKU is almost certainly a new
product added to the range and a new row is inserted into the Products table.
With the sort of sloppy data entry into the Excel worksheet to which John
refers 'Canion County' would be rejected as it doesn't appear in the
referenced Counties table, so you could cater for this in the update routine
by means of a query which returns all rows from the Excel data with
'unmatched' counties. You could then give the user then opportunity to
accept this as a new county, and insert a row into the Counties table, or
change it to a value already in the Counties table such as 'Canyon County'.
The same can be done for crops. As John points out, however, you can't edit
the Excel data, but you can do as I have with my friend's database, and
import the Excel data into a single Access 'middleman' table which can then
be edited to before being used to update the working tables.

Ken Sheridan
Stafford, England

John W. Vinson · Mar 22, 2009

On the other hand, the above statement might not be quite what it seems; it
could perhaps be referring to each county not as the value of the column,
i.e. the county name, but the physical geographical entity of the county, in
which case, in terms of the data values, what is unique is not 'each
crop/county combination' but 'each crop/county/state combination'. The
obvious solution here would be a surrogate CountyID key,

Such a key does exist: the US Census assigned and maintained FIPS code.

but its worth noting
that there is also the alternative 'purist' solution of a composite
county/state key.

I would hope that would work too. There might be a lot of Washington Counties
but there should only be one in each state!

OTOH I've been in Los Alamos, New Mexico and driven past the dirt road to the
*other* Los Alamos, New Mexico. <g>

Advice on creating a database	1	Dec 7, 2009
Re-Reverse Engineering from rebuilt Database	1	Nov 17, 2008
Visio reverse engineering of MYSQL database does not see views	0	Jan 22, 2010
relation problems	1	Apr 20, 2010
Database for a Testing Center - Tracking dates	1	Jun 1, 2010
Eliminating Cartesian Effect when Querying Multiple Tables	8	Dec 14, 2009
Linking Large Text Database to Excel	6	May 6, 2010
how to 'add shapes' after reverse engineering a large db	1	Mar 29, 2007

Reverse-engineering a source table from a large database

KG

Albert D. Kallal

KG

Ken Sheridan

KG

KG

John W. Vinson

John Spencer

KG

Paul Shapiro

John W. Vinson

Ken Sheridan

Ken Sheridan

John W. Vinson

Ask a Question

Similar Threads