P
Petr Danes
I have seen repeated questions about locating gaps in a sequential series of
numbers, and have had to deal with the problem myself on occasion. Most of
the suggestions involve methods that I do not want to use, for a number of
reasons.
Matching against a static table of 1-n is fairly quick and simple, but
requires the storage of such a table, mostly containing numbers that will
never be used for anything and naturally, does not handle the possibility
that a given sequence will go beyond whatever is preloaded into such a
table.
VBA code does an acceptable job, but is slow and can be cumbersome to call.
I wanted a SQL-only solution and have created one that works quite nicely.
It is done as a nested set of three simple queries, all joined into one SQL
statement. No temporary tables are created, so there is no need to clean up
anything afterwards.
The queries presuppose a table named [Sequence] with a single numeric field
named [SeqNum]. More complicated tables or ordering different from ascending
positive integer would require some modifications, but the principles will
work on any well-defined sequential ordering method.
Here is my sample dataset:
1
2
3
6
7
8
10
11
12
13
14
50
51
52
1. First query, named [DemoGapBorders]:
SELECT S1.SeqNum+1 AS GapBorder, "F" AS Edge
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum+1=S2.SeqNum
WHERE S2.SeqNum Is Null
UNION ALL
SELECT S1.SeqNum-1, "L"
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum-1=S2.SeqNum
WHERE S2.SeqNum Is Null;
2. Second query, named [DemoGapsIntermediate]:
SELECT GF.GapBorder AS Start,
GL.GapBorder AS Ends
FROM DemoGapBorders AS GF
INNER JOIN DemoGapBorders AS GL
ON GL.GapBorder>=GF.GapBorder
WHERE GF.Edge='F'
AND GL.Edge='L';
( Optional: ORDER BY GF.GapBorder, GL.GapBorder )
3. Third query, named [DemoGaps]:
SELECT Start,
Min(Ends) AS End,
Min(Ends)-Start+1 as Length
FROM DemoGapsIntermediate
GROUP BY Start
UNION ALL
SELECT Max(SeqNum)+1, "(oo)", "(oo)"
FROM Sequence;
Storing each of these queries under the indicated name and simply calling
the third query will return a table like this:
Start End Length
4 5 2
9 9 1
15 49 35
53 (oo) (oo)
Anyone wanting to modify this for themselves, I would highly recommend
starting with this configuration before proceeding to the last step of
merging them, outlined below. And actually, the three chained queries work
fine; there is no real need to merge them, except that I wanted to see if I
could do it.
Comments on the procedures:
1) First query:
a) The first query generates a set of rows marked with 'F' and 'L' to
indicate whether the number marks the beginning or end of a gap.
b) Both the first and the last numbers of a dataset are marked as an edge of
a gap.
c) For debugging purposes, UNION may be used instead of UNION ALL to force
the result set into numerical order with alternating edges, but performance
will suffer slightly.
2) Second query:
a) This does a self-join on an end-of-gap greater than or equal to
start-of-gap condition, resulting in multiple matches for a given
start-of-gap. I was unable to think of a way around this, there does not
seem to be a way to match the smallest end-of-gap greater than or equal to
start-of-gap, only all ends-of-gaps greater than or equal to start-of-gap.
So I feed the results of this to a third query, which groups to eliminate
the multiple ends for each start.
b) A gap of one single number results in the start-of-gap and end-of-gap
equal, since the single missing member is both the first and last in the
sequence of the missing, so the self-join needs a greater than or equal to
condition, instead of a simple greater than.
c) The leading and trailing edges of the dataset are dropped here, since
neither has a corresponding other-edge-of-gap.
d) The optional ORDER BY clause may be helpful in debugging, but would serve
no purpose in a production version.
3) Third query:
a) This groups to eliminate the duplicate starts from the second query,
selecting always the smallest end-of-gap that goes with each start-of-gap.
b) The union re-adds a single line for the unbounded start-of-gap at the end
of the sequence, which is initially generated by the first query, but then
discarded by the second query. The string "(oo)" is a simple-minded
representation of the infinity symbol. I needed this line for the project
that prompted me to invent his whole scheme, but it can be discarded with no
ill effects on the remainder of the query. It also has a side effect that it
causes the second and third columns to be delivered as strings, rather than
numbers. Without this line, all three columns are returned as numbers.
c) To generate a similar line for the unbounded end-of-gap before the
sequence, the following analogous phrase could be prepended to indicate
negative infinity:
SELECT "-(oo)" AS Start, Min(SeqNum)-1 AS End, "(oo)" as Length
FROM Sequence
UNION ALL
This will also result in all three columns being strings. The use of names
is necessary, since SQL uses the field names from the first query in a
union. The AS name segments could then be discarded from the main SELECT
clause.
Finally, to put it all together, I substituted the entire text of the first
query (except for the trailing semicolon) for the query name
[DemoGapBorders] into the two places in the FROM clause of the second query
where the first query is named, then substituted that entire string (again,
except for the trailing semicolon) for the query name [DemoGapsIntermediate]
into the first FROM clause of the third query. I ran it through this online
SQL formatter
http://www.dpriver.com/pp/sqlformat.htm
which laid it out fairly nicely, but messed up some of my capitalization and
didn't get all of the spacing quite right, so I had to do a bit more manual
dolling up afterwards. The result is this (more readable in a fixed-width
font):
SELECT Start,
Min(Ends) AS End,
Min(Ends) - Start + 1 AS Length
FROM (SELECT GF.GapBorder AS Start,
GL.GapBorder AS Ends
FROM (SELECT S1.SeqNum + 1 AS GapBorder,
"F" AS Edge
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum + 1 = S2.SeqNum
WHERE S2.SeqNum IS NULL
UNION ALL
SELECT S1.SeqNum - 1,
"L"
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum - 1 = S2.SeqNum
WHERE S2.SeqNum IS NULL) AS GF
INNER JOIN (SELECT S1.SeqNum + 1 AS GapBorder,
"F" AS Edge
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum + 1 = S2.SeqNum
WHERE S2.SeqNum IS NULL
UNION ALL
SELECT S1.SeqNum - 1,
"L"
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum - 1 = S2.SeqNum
WHERE S2.SeqNum IS NULL) AS GL
ON GL.GapBorder >= GF.GapBorder
WHERE GF.Edge = 'F'
AND GL.Edge = 'L')
GROUP BY Start
UNION ALL
SELECT Max(SeqNum)+1,
"(oo)",
"(oo)"
FROM Sequence;
There it is; a single SQL query that locates and displays start, end and
length of all gaps in a numerical sequence. I'm using it on a table of over
80,000 records and it responds in less than one second.
Pete
numbers, and have had to deal with the problem myself on occasion. Most of
the suggestions involve methods that I do not want to use, for a number of
reasons.
Matching against a static table of 1-n is fairly quick and simple, but
requires the storage of such a table, mostly containing numbers that will
never be used for anything and naturally, does not handle the possibility
that a given sequence will go beyond whatever is preloaded into such a
table.
VBA code does an acceptable job, but is slow and can be cumbersome to call.
I wanted a SQL-only solution and have created one that works quite nicely.
It is done as a nested set of three simple queries, all joined into one SQL
statement. No temporary tables are created, so there is no need to clean up
anything afterwards.
The queries presuppose a table named [Sequence] with a single numeric field
named [SeqNum]. More complicated tables or ordering different from ascending
positive integer would require some modifications, but the principles will
work on any well-defined sequential ordering method.
Here is my sample dataset:
1
2
3
6
7
8
10
11
12
13
14
50
51
52
1. First query, named [DemoGapBorders]:
SELECT S1.SeqNum+1 AS GapBorder, "F" AS Edge
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum+1=S2.SeqNum
WHERE S2.SeqNum Is Null
UNION ALL
SELECT S1.SeqNum-1, "L"
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum-1=S2.SeqNum
WHERE S2.SeqNum Is Null;
2. Second query, named [DemoGapsIntermediate]:
SELECT GF.GapBorder AS Start,
GL.GapBorder AS Ends
FROM DemoGapBorders AS GF
INNER JOIN DemoGapBorders AS GL
ON GL.GapBorder>=GF.GapBorder
WHERE GF.Edge='F'
AND GL.Edge='L';
( Optional: ORDER BY GF.GapBorder, GL.GapBorder )
3. Third query, named [DemoGaps]:
SELECT Start,
Min(Ends) AS End,
Min(Ends)-Start+1 as Length
FROM DemoGapsIntermediate
GROUP BY Start
UNION ALL
SELECT Max(SeqNum)+1, "(oo)", "(oo)"
FROM Sequence;
Storing each of these queries under the indicated name and simply calling
the third query will return a table like this:
Start End Length
4 5 2
9 9 1
15 49 35
53 (oo) (oo)
Anyone wanting to modify this for themselves, I would highly recommend
starting with this configuration before proceeding to the last step of
merging them, outlined below. And actually, the three chained queries work
fine; there is no real need to merge them, except that I wanted to see if I
could do it.
Comments on the procedures:
1) First query:
a) The first query generates a set of rows marked with 'F' and 'L' to
indicate whether the number marks the beginning or end of a gap.
b) Both the first and the last numbers of a dataset are marked as an edge of
a gap.
c) For debugging purposes, UNION may be used instead of UNION ALL to force
the result set into numerical order with alternating edges, but performance
will suffer slightly.
2) Second query:
a) This does a self-join on an end-of-gap greater than or equal to
start-of-gap condition, resulting in multiple matches for a given
start-of-gap. I was unable to think of a way around this, there does not
seem to be a way to match the smallest end-of-gap greater than or equal to
start-of-gap, only all ends-of-gaps greater than or equal to start-of-gap.
So I feed the results of this to a third query, which groups to eliminate
the multiple ends for each start.
b) A gap of one single number results in the start-of-gap and end-of-gap
equal, since the single missing member is both the first and last in the
sequence of the missing, so the self-join needs a greater than or equal to
condition, instead of a simple greater than.
c) The leading and trailing edges of the dataset are dropped here, since
neither has a corresponding other-edge-of-gap.
d) The optional ORDER BY clause may be helpful in debugging, but would serve
no purpose in a production version.
3) Third query:
a) This groups to eliminate the duplicate starts from the second query,
selecting always the smallest end-of-gap that goes with each start-of-gap.
b) The union re-adds a single line for the unbounded start-of-gap at the end
of the sequence, which is initially generated by the first query, but then
discarded by the second query. The string "(oo)" is a simple-minded
representation of the infinity symbol. I needed this line for the project
that prompted me to invent his whole scheme, but it can be discarded with no
ill effects on the remainder of the query. It also has a side effect that it
causes the second and third columns to be delivered as strings, rather than
numbers. Without this line, all three columns are returned as numbers.
c) To generate a similar line for the unbounded end-of-gap before the
sequence, the following analogous phrase could be prepended to indicate
negative infinity:
SELECT "-(oo)" AS Start, Min(SeqNum)-1 AS End, "(oo)" as Length
FROM Sequence
UNION ALL
This will also result in all three columns being strings. The use of names
is necessary, since SQL uses the field names from the first query in a
union. The AS name segments could then be discarded from the main SELECT
clause.
Finally, to put it all together, I substituted the entire text of the first
query (except for the trailing semicolon) for the query name
[DemoGapBorders] into the two places in the FROM clause of the second query
where the first query is named, then substituted that entire string (again,
except for the trailing semicolon) for the query name [DemoGapsIntermediate]
into the first FROM clause of the third query. I ran it through this online
SQL formatter
http://www.dpriver.com/pp/sqlformat.htm
which laid it out fairly nicely, but messed up some of my capitalization and
didn't get all of the spacing quite right, so I had to do a bit more manual
dolling up afterwards. The result is this (more readable in a fixed-width
font):
SELECT Start,
Min(Ends) AS End,
Min(Ends) - Start + 1 AS Length
FROM (SELECT GF.GapBorder AS Start,
GL.GapBorder AS Ends
FROM (SELECT S1.SeqNum + 1 AS GapBorder,
"F" AS Edge
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum + 1 = S2.SeqNum
WHERE S2.SeqNum IS NULL
UNION ALL
SELECT S1.SeqNum - 1,
"L"
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum - 1 = S2.SeqNum
WHERE S2.SeqNum IS NULL) AS GF
INNER JOIN (SELECT S1.SeqNum + 1 AS GapBorder,
"F" AS Edge
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum + 1 = S2.SeqNum
WHERE S2.SeqNum IS NULL
UNION ALL
SELECT S1.SeqNum - 1,
"L"
FROM Sequence AS S1
LEFT JOIN Sequence AS S2
ON S1.SeqNum - 1 = S2.SeqNum
WHERE S2.SeqNum IS NULL) AS GL
ON GL.GapBorder >= GF.GapBorder
WHERE GF.Edge = 'F'
AND GL.Edge = 'L')
GROUP BY Start
UNION ALL
SELECT Max(SeqNum)+1,
"(oo)",
"(oo)"
FROM Sequence;
There it is; a single SQL query that locates and displays start, end and
length of all gaps in a numerical sequence. I'm using it on a table of over
80,000 records and it responds in less than one second.
Pete