Most cost-efficient way to page through a poorly ordered table?
Essentially, you are asking if you can perform a single ordered scan through the data overall, while making no copies of the data, and returning 'x' disjoint sets of rows from the full set on each call. This is exactly the behaviour of an appropriately-configured API cursor.
For example, using the AdventureWorks table Person.EmailAddress to return sets of 1,000 rows:
DECLARE
    @cur integer,
    -- FAST_FORWARD | AUTO_FETCH | AUTO_CLOSE
    @scrollopt integer = 16 | 8192 | 16384,
    -- READ_ONLY, CHECK_ACCEPTED_OPTS, READ_ONLY_ACCEPTABLE
    @ccopt integer = 1 | 32768 | 65536,
    @rowcount integer = 1000,
    @rc integer;

-- Open the cursor and return the first 1,000 rows
EXECUTE @rc = sys.sp_cursoropen
    @cur OUTPUT,
    N'
    SELECT *
    FROM AdventureWorks2012.Person.EmailAddress
        WITH (INDEX([IX_EmailAddress_EmailAddress]))
    ORDER BY EmailAddress;
    ',
    @scrollopt OUTPUT,
    @ccopt OUTPUT,
    @rowcount OUTPUT;

IF @rc <> 16 -- FastForward cursor automatically closed
BEGIN
    -- Name the cursor so we can use CURSOR_STATUS
    EXECUTE sys.sp_cursoroption
        @cur,
        2,
        'MyCursorName';

    -- Until the cursor auto-closes
    WHILE CURSOR_STATUS('global', 'MyCursorName') = 1
    BEGIN
        EXECUTE sys.sp_cursorfetch
            @cur,
            2,
            0,
            1000;
    END;
END;
Each fetch operation returns a maximum of 1,000 rows, remembering the position of the scan from the previous call.
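If you chose not to include AUTO_CLOSE (16384) in @scrollopt, the cursor would remain open after the last fetch and you would release it yourself. A minimal sketch, assuming @cur still holds the handle returned by sp_cursoropen:

EXECUTE sys.sp_cursorclose @cur;  -- explicitly release the cursor handle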
Without knowing the purpose behind the windowing, it's going to be difficult to be specific. Considering you're looking at twenty thousand rows at a time, I'm guessing this is a batch process and not for human viewing.
If there is an index on the email address then it is already sorted. Indexes are B-trees and they maintain an order internally. That order is the sort order of the column's collation (which is likely, but not necessarily, the default collation of the database).
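You can confirm which collation applies to the column (and therefore which order an index on it will deliver) with a quick check against the catalog views; this is only a sketch, with dbo.mytable and address_name standing in for your actual names:

SELECT c.name AS column_name,
       c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.mytable')
  AND c.name = N'address_name';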
Temporary tables - both #table and @table - will have a presence in tempdb. Also, large result sets will spill out of memory into tempdb.
If by "statistics" you mean SQL Server's internal statistics it maintains on indexes or through the create statistics..
statement then I don't think that will fly. Those statistics only have a few hundred buckets (forgotten the correct limit just now) where as you will need 39,000 "windows" to read read your full table. If you intend to maintain your own row-to-window mapping through triggers, this is achievable but the overhead may be significant.
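If you want to see how coarse that histogram actually is for your index, you can inspect it directly; the table and statistics names below are placeholders, not from your schema:

DBCC SHOW_STATISTICS (N'dbo.mytable', N'IX_mytable_address_name')
WITH HISTOGRAM;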
The traditional way to page through a large dataset is to remember the largest key value from each batch and read from there onward. If the email address column is not unique (i.e. one address can occur more than once) you have a couple of options: a) process each batch row-by-row in the application and skip duplicates, or b) filter them out in the SQL. Option b) will require a sort, but if the data is read in key sequence that sort may be optimised away:
declare @MaxKey varchar(255) = '';  -- wide enough to hold an email address

while exists (select 1 from mytable where address_name > @MaxKey)
begin
    ;with NewBatch as
    (
        select top (20000)          -- whatever size a "window" must be
            address_name
        from mytable
        where address_name > @MaxKey
        order by address_name
    )
    select distinct
        address_name
    into #NewBatch
    from NewBatch;

    select address_name from #NewBatch;  -- process this window, then...

    select @MaxKey = max(address_name)   -- ...remember where it ended
    from #NewBatch;

    drop table #NewBatch;
end;
The iteration can happen in SQL or in your application, depending on your architecture.
If many columns are required, beyond just the email address, you may consider a cursor declared with the KEYSET or STATIC option. This will still use resources in tempdb, however.
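As a rough sketch of that option (table and column names are placeholders, not from your schema), a server-side KEYSET cursor over the ordered key looks like this; note that the keyset is materialised in tempdb when the cursor is opened:

DECLARE wide_cur CURSOR KEYSET READ_ONLY FOR
    SELECT address_name, first_name, last_name  -- whatever extra columns you need
    FROM mytable
    ORDER BY address_name;

OPEN wide_cur;  -- the ordered set of key values is built in tempdb at this point

FETCH NEXT FROM wide_cur;
WHILE @@FETCH_STATUS = 0
    FETCH NEXT FROM wide_cur;  -- rows come back one at a time; batch them in the application

CLOSE wide_cur;
DEALLOCATE wide_cur;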
Taking a step backward, SSIS is specifically designed to process large rowsets efficiently. Defining a package that meets your requirements may be the best long-term answer.