deleting duplicate records

Discussion of open issues, suggestions and bugs regarding UniDAC (Universal Data Access Components) for Delphi, C++Builder, Lazarus (and FPC)
sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

deleting duplicate records

Post by sandy771 » Wed 28 Jul 2010 15:59

I have a largish database (10 million+ records).

I need to add a few thousand records to the database and then remove any duplicates. This is an operation that I have to do reasonably often.

I am after ideas as to the most efficient way to do this.

At the moment I drop the index, append the records, sort, and then step through and remove duplicates. This takes a couple of minutes (acceptable), but I have a problem.

I am using UniQuery->next() to go through the table

If I check the current record against the previous one and see that it is a duplicate, I could use an SQL query to delete the current record based on its ROWID (SQLite). As I understand it, the only way to do this is with a separate UniQuery component - otherwise I would lose the cursor position when my query changes. Is this right?

Is there a better way of doing this?
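
Something like this is what I have in mind - a rough sketch only, with made-up table, field and component names (qryScan does the ordered scan, qryDelete does the rowid deletes):

// Sketch only: qryScan steps through the sorted rows while a second
// query, qryDelete, removes duplicates by rowid, so the scanning
// cursor is never disturbed. "mytable" and "key1" are made up.
qryScan->SQL->Text = "SELECT rowid, key1 FROM mytable ORDER BY key1";
qryScan->Open();
qryDelete->SQL->Text = "DELETE FROM mytable WHERE rowid = :rid";

String prev;
bool first = true;
while (!qryScan->Eof)
{
    String cur = qryScan->FieldByName("key1")->AsString;
    if (!first && cur == prev)
    {
        // duplicate of the previous row - delete it by its rowid
        qryDelete->ParamByName("rid")->Value =
            qryScan->FieldByName("rowid")->Value;
        qryDelete->Execute();
    }
    first = false;
    prev = cur;
    qryScan->Next();
}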
Last edited by sandy771 on Thu 29 Jul 2010 09:22, edited 1 time in total.

DepSoft
Posts: 20
Joined: Tue 27 Jul 2010 03:01
Location: Western Australia

Post by DepSoft » Thu 29 Jul 2010 04:07

I'm not familiar with SQLite, but I would think it would be more efficient to do something like this:

create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table

It depends on the nature of the records, the primary keys/indexes, and whether your duplicate check needs extra code to compare records, etc.

It should be faster, since more of the work is done on the server without fetching all the records back - roughly as sketched below.
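
An untested sketch of that flow (I don't know SQLite, so the table, column and component names are all placeholders):

// Untested sketch of the staging-table idea; "main_table", "staging",
// "key1"/"key2" and UniQuery1 are placeholders.
UniQuery1->SQL->Text =
    "CREATE TEMP TABLE staging AS SELECT * FROM main_table WHERE 0";
UniQuery1->Execute();

// ... bulk-insert the new batch of records into staging here ...

// drop staged rows whose key already exists in the main table
UniQuery1->SQL->Text =
    "DELETE FROM staging WHERE EXISTS "
    "(SELECT 1 FROM main_table m "
    "WHERE m.key1 = staging.key1 AND m.key2 = staging.key2)";
UniQuery1->Execute();

// move the remaining (genuinely new) rows across, then clean up
UniQuery1->SQL->Text = "INSERT INTO main_table SELECT * FROM staging";
UniQuery1->Execute();
UniQuery1->SQL->Text = "DROP TABLE staging";
UniQuery1->Execute();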

Regards, Paul

sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

Post by sandy771 » Thu 29 Jul 2010 08:09

DepSoft wrote: I'm not familiar with SQLite, but I would think it would be more efficient to do something like this:

create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table

It depends on the nature of the records, the primary keys/indexes, and whether your duplicate check needs extra code to compare records, etc.

It should be faster, since more of the work is done on the server without fetching all the records back.

Regards, Paul
There is a real possibility that the temp table would be too big.

I have considered adding a flag field to each record, setting it for each duplicate, and then deleting every record with the flag set. But I am still looking for ideas.

DepSoft
Posts: 20
Joined: Tue 27 Jul 2010 03:01
Location: Western Australia

Post by DepSoft » Thu 29 Jul 2010 09:18

The temp table would be created empty and only populated with your batch of records for insertion into the main table.

If those records would make a temp table too big, wouldn't they also cause your main table to be too big once inserted?

Regards, Paul.

sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

Post by sandy771 » Thu 29 Jul 2010 09:24

Ah, OK, sorry - I have now read your first post properly. That may work; I'll have a play.

sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

Post by sandy771 » Sat 31 Jul 2010 11:43

The solution I went for in the end - which works well - was to sort the database, step through it a row at a time, set a flag on each row that is the same as the one before, and then do a delete from table where flag = 1.

On 8 million rows it still completes in about a minute.
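
For reference, the pass looks roughly like this - a simplified sketch with made-up names, assuming an integer flag column (default 0), a single sort key, and an updatable query:

// Simplified sketch of the flag pass; "mytable", "key1", "flag" and
// the component names are illustrative, not the real code.
qryScan->SQL->Text = "SELECT rowid, key1, flag FROM mytable ORDER BY key1";
qryScan->Open();
String prev;
bool first = true;
while (!qryScan->Eof)
{
    String cur = qryScan->FieldByName("key1")->AsString;
    if (!first && cur == prev)
    {
        qryScan->Edit();                        // mark the duplicate in place
        qryScan->FieldByName("flag")->AsInteger = 1;
        qryScan->Post();
    }
    first = false;
    prev = cur;
    qryScan->Next();
}
qryScan->Close();

// one bulk delete at the end instead of millions of single-row deletes
qryExec->SQL->Text = "DELETE FROM mytable WHERE flag = 1";
qryExec->Execute();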

bork
Devart Team
Posts: 649
Joined: Fri 12 Mar 2010 07:55

Post by bork » Mon 02 Aug 2010 09:26

Hello

For the table:
CREATE TABLE test_table (
ID INTEGER NOT NULL,
Name VARCHAR(250) NOT NULL
)

You can select duplicates:
select rowid, * from test_table
where rowid not in (select min(rowid) from test_table group by id, name)

You can delete duplicates:
delete from test_table
where rowid not in (select min(rowid) from test_table group by id, name)

For your own table, you should list all of its fields in the GROUP BY clause.
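
For example, if your table had three columns (ID, Name, Price - hypothetical names), the same delete run from C++Builder through a TUniQuery would be:

// Hypothetical three-column example of the same delete, run via UniDAC
UniQuery1->SQL->Text =
    "DELETE FROM test_table "
    "WHERE rowid NOT IN (SELECT MIN(rowid) FROM test_table "
    "GROUP BY id, name, price)";
UniQuery1->Execute();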

sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

Post by sandy771 » Mon 02 Aug 2010 10:00

Thanks bork, I'll look at that.
