controlling the fetching of a large set of rows/records when batch processing

Discussion of open issues, suggestions and bugs regarding usage of dbExpress drivers for SQL Server in Delphi and C++Builder
jkmcsw
Posts: 13
Joined: Thu 16 Nov 2006 22:21

controlling the fetching of a large set of rows/records when batch processing

Post by jkmcsw » Thu 16 Nov 2006 22:30

My app in some settings needs to batch process a rather large number of records. The records themselves are not small, so for a selection of, say, 380,000 records, my app can really bloat in memory.

What is the best way to control fetching? What I'd like to do is grab the first X records, process them, grab another X, and so on. I've looked at the parameters at my disposal and am at a loss as to how to do this. The one way that seems like it would work is using TClientDataSets, but I would prefer not to involve them.

Thanks very much for your help, Cheers! :D

Jackson
Posts: 512
Joined: Thu 26 Jan 2006 10:06

Post by Jackson » Fri 17 Nov 2006 12:14

Use the TSQLQuery component.
TSQLQuery is a unidirectional dataset.
This component does not store multiple records in memory.
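For reference, a minimal sketch of traversing a unidirectional TSQLQuery (the query text is a placeholder):

Query1.SQL.Text := 'SELECT ...';  // your selection
Query1.Open;
while not Query1.Eof do
begin
  // read the current record's fields here
  Query1.Next;  // forward only; a unidirectional dataset cannot scroll back
end;
Query1.Close;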

jkmcsw
Posts: 13
Joined: Thu 16 Nov 2006 22:21

Post by jkmcsw » Mon 20 Nov 2006 17:09

I am using TSQLQuery already.
After the query is set to active, my app's Mem Usage in the Windows Task Manager climbs in proportion to the set of records requested by the SQL select statement. The problem I'm having occurs for selections of hundreds of thousands of records (each of which is not small). I realize that TSQLQuery is a unidirectional dataset, but a lot of data is nevertheless written to memory in my app, not just one record. I'm thinking that maybe setting FetchAll to False would do the trick, but what bothers me is that I have no idea how many records would then get read in, and I want to keep the I/O of the program efficient.

Jackson
Posts: 512
Joined: Thu 26 Jan 2006 10:06

Post by Jackson » Tue 21 Nov 2006 14:55

Try to set the FetchAll parameter value to False.
Use the 'RowsetSize' parameter to specify the number of records that will be transferred across the network at the same time.
Don't forget to convert your TSQLConnection component to TCRSQLConnection (context menu of TSQLConnection component).
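A minimal sketch of that setup, assuming the driver reads these options from the connection's Params (component names are illustrative; check your driver version for the exact parameter names):

CRSQLConnection1.Params.Values['FetchAll'] := 'False';
CRSQLConnection1.Params.Values['RowsetSize'] := '1000';  // rows per network round trip
Query1.SQLConnection := CRSQLConnection1;
Query1.Open;  // rows now arrive in blocks of RowsetSize instead of all at once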

jkmcsw
Posts: 13
Joined: Thu 16 Nov 2006 22:21

Post by jkmcsw » Wed 22 Nov 2006 00:12

Setting FetchAll=False and RowsetSize=1000 seemed to start working, but then a secondary problem ensued. After the first fetch I was not able to post an update to the database inside a transaction within the same connection. The error message arose from the SQL Server database: Cannot create new connection because in manual or distributed transaction mode.

So, I'm wondering whether you know off the top of your head the best way around this problem. I can think of two possible ways; maybe there are others.

1) remove the transaction(s) from my updates and update within the same TCRSQLConnection (as I do when FetchAll = True, etc.)
2) do my updates in a separate connection.

EvgeniyM -- thanks very much for your (dobra=good?) assistance!

Jackson
Posts: 512
Joined: Thu 26 Jan 2006 10:06

Post by Jackson » Wed 22 Nov 2006 11:27

You are right, a query execution in FetchAll = False mode blocks the current session.
In order to avoid blocking, OLE DB creates an additional session.
The first way does not seem to be the best way to solve the problem.
Each update would cause the creation of an additional session, performing the update, and closing this session.
Moreover, the additional session runs outside of the transaction context, so TCRSQLConnection.Commit and TCRSQLConnection.Rollback operations in the main session won't apply changes made in additional sessions.
The second way is the most acceptable.
Both ways may cause deadlocks if your updates concern tables from which you are fetching.
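For the second way, a sketch of the wiring (component names are illustrative):

Query1.SQLConnection := CRSQLConnection1;  // fetching connection, FetchAll = False
Query2.SQLConnection := CRSQLConnection2;  // separate connection for posting updates

Keep in mind the deadlock warning above if Query2 updates the tables Query1 is still fetching.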

jkmcsw
Posts: 13
Joined: Thu 16 Nov 2006 22:21

Post by jkmcsw » Sat 16 Dec 2006 19:57

In my app I applied a separate, secondary TCRSQLConnection to the TSQLQuery that uses FetchAll and RowsetSize. All the other TSQLQueries remain with the original TCRSQLConnection, and this one (the original) uses transactions for posting updates. So, I ran the app and it just hung. Note -- it didn't give me an error message; it just hung.
Is it obvious why it would do that?

jkmcsw
Posts: 13
Joined: Thu 16 Nov 2006 22:21

Post by jkmcsw » Sat 16 Dec 2006 20:04

Ultimately what I'm trying to do is this:
from my selection set:
1. fetch the first X rows
2. iterate the first X rows, processing as I iterate
3. iterate the next X rows, processing as I iterate
etc. until the whole selection set is iterated and processed

And, just to be clear, I wouldn't have to do this if my selection didn't eat up all my memory when I selected the entire set at once. Please note: it's important that I be able to process the whole thing (via some technique) because I need to do summary reports from the whole.

Thanks so much!

Jackson
Posts: 512
Joined: Thu 26 Jan 2006 10:06

Post by Jackson » Mon 18 Dec 2006 13:50

The most probable reason for such behaviour is a deadlock.
While the results of a query execution are not fetched completely, all the tables from the query are blocked (FetchAll = False mode).
If your updates concern a table which is currently being fetched, it will cause a deadlock.

jkmcsw
Posts: 13
Joined: Thu 16 Nov 2006 22:21

Post by jkmcsw » Mon 18 Dec 2006 14:59

My logic is this:
var
  TD: TTransactionDesc;
  UpdateCount: Integer;
begin
  TD.TransactionID := 1;
  TD.IsolationLevel := xilREADCOMMITTED;
  Query1.SQL.Text := 'select statement';  // placeholder for the real SELECT
  Query1.Active := True;
  Query1.First;
  Connection2.StartTransaction(TD);
  UpdateCount := 0;
  while not Query1.Eof do
  begin
    // process current row
    Query2.ExecSQL;  // query2: update current row in the database
    Inc(UpdateCount);
    if UpdateCount >= 100 then
    begin
      Connection2.Commit(TD);            // commit this batch of updates
      Connection2.StartTransaction(TD);  // start a new transaction
      UpdateCount := 0;
    end;
    Query1.Next;
  end;
  Connection2.Commit(TD);  // commit the final partial batch
end;
I changed my code so that query1 and query2 have different connections.
So -- are you saying that even though the two queries have different connections a deadlock is still possible?
If this is true, how do people process very large tables?

Thanks again

Jackson
Posts: 512
Joined: Thu 26 Jan 2006 10:06

Post by Jackson » Wed 20 Dec 2006 11:39

A query execution in FetchAll = False mode causes locking of all the tables from the query on the server side until all data is fetched.
It also causes session locking on the client side, so you can't perform any operations using this blocked session.
An additional session allows you to avoid the client-side session blocking, but it can't help you avoid the server-side lock if you perform updates that concern the tables being fetched.
For example, you have fetched 10 of 100 rows from some table (for now this table is locked on the server and nobody can modify it), then you create an additional session and try to perform updates for this table through this session.
Your updates are waiting until the table is unlocked; at the same time your application is waiting for completion of the UPDATE operation to perform the fetch of the next 10 rows. This is the deadlock.
Usually similar logic is implemented on the server side by stored procedures.

jkmcsw
Posts: 13
Joined: Thu 16 Nov 2006 22:21

Post by jkmcsw » Wed 20 Dec 2006 14:53

Thanks very much Evgeniy for being so clear.

I found that in SQL Server 2005 there is a ROW_NUMBER() function that can be used to partition a large query, so that in effect I execute the select query repeatedly, asking for row_number between x and y, and vary x and y...
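A sketch of that pattern, assuming hypothetical names (MyTable, Id) and PageSize as the X rows per pass:

const
  PageSize = 1000;
var
  FirstRow, RowsInPage: Integer;
begin
  Query1.SQL.Text :=
    'SELECT * FROM (' +
    '  SELECT ROW_NUMBER() OVER (ORDER BY Id) AS rn, * FROM MyTable' +
    ') t WHERE t.rn BETWEEN :FirstRow AND :LastRow';
  FirstRow := 1;
  repeat
    Query1.Params.ParamByName('FirstRow').AsInteger := FirstRow;
    Query1.Params.ParamByName('LastRow').AsInteger := FirstRow + PageSize - 1;
    Query1.Open;
    RowsInPage := 0;
    while not Query1.Eof do
    begin
      // process the current row here
      Inc(RowsInPage);
      Query1.Next;
    end;
    Query1.Close;
    Inc(FirstRow, PageSize);
  until RowsInPage < PageSize;  // a short (or empty) page means we're done
end;

Each pass holds at most PageSize rows in memory, and updates can be posted between passes without colliding with an open fetch.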

I don't suppose you know of any other way to handle processing very large
queries?

Thanks...

Jackson
Posts: 512
Joined: Thu 26 Jan 2006 10:06

Post by Jackson » Fri 22 Dec 2006 14:46

In our opinion, using a stored procedure is the most convenient way for such tasks.

jkmcsw
Posts: 13
Joined: Thu 16 Nov 2006 22:21

Post by jkmcsw » Thu 28 Dec 2006 18:55

I'm at a loss to imagine how that would work.
With a stored procedure, how would you control accessing blocks/ranges of records? Would a stored procedure send X records at a time somehow?

Jackson
Posts: 512
Joined: Thu 26 Jan 2006 10:06

Post by Jackson » Fri 29 Dec 2006 16:06

For example, you can use server cursors and implement all logic on the server side without fetching data to client side at all.
For more information please see MSDN or MS SQL Server Books Online.
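As an illustration only (not a complete solution): a stored procedure that walks the rows with a server cursor, so nothing is fetched to the client side; the procedure, table and column names here are hypothetical:

Query2.SQL.Clear;
Query2.SQL.Add('CREATE PROCEDURE usp_ProcessAllRows AS');
Query2.SQL.Add('BEGIN');
Query2.SQL.Add('  DECLARE @Id int;');
Query2.SQL.Add('  DECLARE c CURSOR LOCAL FAST_FORWARD FOR SELECT Id FROM MyTable;');
Query2.SQL.Add('  OPEN c;');
Query2.SQL.Add('  FETCH NEXT FROM c INTO @Id;');
Query2.SQL.Add('  WHILE @@FETCH_STATUS = 0');
Query2.SQL.Add('  BEGIN');
Query2.SQL.Add('    UPDATE MyTable SET Processed = 1 WHERE Id = @Id;  -- the per-row work');
Query2.SQL.Add('    FETCH NEXT FROM c INTO @Id;');
Query2.SQL.Add('  END;');
Query2.SQL.Add('  CLOSE c; DEALLOCATE c;');
Query2.SQL.Add('END');
Query2.ExecSQL;  // create the procedure once

Query2.SQL.Text := 'EXEC usp_ProcessAllRows';
Query2.ExecSQL;  // all row-by-row work happens on the server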
