deleting duplicate records

Discussion of open issues, suggestions and bugs regarding UniDAC (Universal Data Access Components) for Delphi, C++Builder, Lazarus (and FPC)
sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

deleting duplicate records

Post by sandy771 » Wed 28 Jul 2010 15:59

I have a largish database (10 million+ records).

I need to add a few thousand records to the database and then remove any duplicates. This is an operation that I have to do reasonably often.

I am after ideas as to the most efficient way to do this.

At the moment I drop the index, append the records, sort, and then step through and remove duplicates. This takes a couple of minutes (acceptable), but I have a problem.

I am using UniQuery->next() to go through the table

If I check the current record against the previous one and see that it is a duplicate, I could use an SQL query to delete the current record based on its ROWID (SQLite). As I understand it, the only way to do this is with a separate UniQuery component - otherwise I would lose the cursor position when my query changes. Is this right?

Is there a better way of doing this?
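
Something like this is what I have in mind - a rough sketch only, with made-up table, field and component names (qryScan does the ordered scan, qryDelete does the rowid deletes):

// Sketch only: qryScan steps through the sorted rows while a second
// query, qryDelete, removes duplicates by rowid, so the scanning
// cursor is never disturbed. "mytable" and "key1" are made up.
qryScan->SQL->Text = "SELECT rowid, key1 FROM mytable ORDER BY key1";
qryScan->Open();
qryDelete->SQL->Text = "DELETE FROM mytable WHERE rowid = :rid";

String prev;
bool first = true;
while (!qryScan->Eof)
{
    String cur = qryScan->FieldByName("key1")->AsString;
    if (!first && cur == prev)
    {
        // duplicate of the previous row - delete it by its rowid
        qryDelete->ParamByName("rid")->Value =
            qryScan->FieldByName("rowid")->Value;
        qryDelete->Execute();
    }
    first = false;
    prev = cur;
    qryScan->Next();
}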
Last edited by sandy771 on Thu 29 Jul 2010 09:22, edited 1 time in total.

DepSoft
Posts: 20
Joined: Tue 27 Jul 2010 03:01
Location: Western Australia

Post by DepSoft » Thu 29 Jul 2010 04:07

I'm not familiar with SQLite, but I would think it would be more efficient to do something like this:

create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table

It depends on the nature of the records, the primary keys/indexes, and whether your duplicate check needs extra code to compare records, etc.

It should be faster, since more of the work is done on the server without fetching all the records back - roughly as sketched below.
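
An untested sketch of that flow (I don't know SQLite, so the table, column and component names are all placeholders):

// Untested sketch of the staging-table idea; "main_table", "staging",
// "key1"/"key2" and UniQuery1 are placeholders.
UniQuery1->SQL->Text =
    "CREATE TEMP TABLE staging AS SELECT * FROM main_table WHERE 0";
UniQuery1->Execute();

// ... bulk-insert the new batch of records into staging here ...

// drop staged rows whose key already exists in the main table
UniQuery1->SQL->Text =
    "DELETE FROM staging WHERE EXISTS "
    "(SELECT 1 FROM main_table m "
    "WHERE m.key1 = staging.key1 AND m.key2 = staging.key2)";
UniQuery1->Execute();

// move the remaining (genuinely new) rows across, then clean up
UniQuery1->SQL->Text = "INSERT INTO main_table SELECT * FROM staging";
UniQuery1->Execute();
UniQuery1->SQL->Text = "DROP TABLE staging";
UniQuery1->Execute();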

Regards, Paul

sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

Post by sandy771 » Thu 29 Jul 2010 08:09

DepSoft wrote: I'm not familiar with SQLite, but I would think it would be more efficient to do something like this:

create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table

It depends on the nature of the records, the primary keys/indexes, and whether your duplicate check needs extra code to compare records, etc.

It should be faster, since more of the work is done on the server without fetching all the records back.

Regards, Paul
There is a real possibility that the temp table would be too big.

I have considered adding a flag field to each record, setting it for each duplicate, and then deleting every record with the flag set. But I am still looking for ideas.

DepSoft
Posts: 20
Joined: Tue 27 Jul 2010 03:01
Location: Western Australia

Post by DepSoft » Thu 29 Jul 2010 09:18

The temp table would be created empty and only populated with your batch of records for insertion into the main table.

If those records would make a temp table too big, wouldn't they also cause your main table to be too big once inserted?

Regards, Paul.

sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

Post by sandy771 » Thu 29 Jul 2010 09:24

Ah, OK, sorry - I have now read your first post properly. That may work; I'll have a play.

sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

Post by sandy771 » Sat 31 Jul 2010 11:43

The solution I went for in the end - which works well - was to sort the database, step through it a row at a time, set a flag on each row that is the same as the one before, and then do a delete from table where flag = 1.

On 8 million rows it still completes in about a minute.
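
For reference, the pass looks roughly like this - a simplified sketch with made-up names, assuming an integer flag column (default 0), a single sort key, and an updatable query:

// Simplified sketch of the flag pass; "mytable", "key1", "flag" and
// the component names are illustrative, not the real code.
qryScan->SQL->Text = "SELECT rowid, key1, flag FROM mytable ORDER BY key1";
qryScan->Open();
String prev;
bool first = true;
while (!qryScan->Eof)
{
    String cur = qryScan->FieldByName("key1")->AsString;
    if (!first && cur == prev)
    {
        qryScan->Edit();                        // mark the duplicate in place
        qryScan->FieldByName("flag")->AsInteger = 1;
        qryScan->Post();
    }
    first = false;
    prev = cur;
    qryScan->Next();
}
qryScan->Close();

// one bulk delete at the end instead of millions of single-row deletes
qryExec->SQL->Text = "DELETE FROM mytable WHERE flag = 1";
qryExec->Execute();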

bork
Devart Team
Posts: 649
Joined: Fri 12 Mar 2010 07:55

Post by bork » Mon 02 Aug 2010 09:26

Hello

For the table:
CREATE TABLE test_table (
ID INTEGER NOT NULL,
Name VARCHAR(250) NOT NULL
)

You can select duplicates:
select rowid, * from test_table
where rowid not in (select min(rowid) from test_table group by id, name)

You can delete duplicates:
delete from test_table
where rowid not in (select min(rowid) from test_table group by id, name)

For your own table, you should list all of its fields in the GROUP BY clause.
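
For example, if your table had three columns (ID, Name, Price - hypothetical names), the same delete run from C++Builder through a TUniQuery would be:

// Hypothetical three-column example of the same delete, run via UniDAC
UniQuery1->SQL->Text =
    "DELETE FROM test_table "
    "WHERE rowid NOT IN (SELECT MIN(rowid) FROM test_table "
    "GROUP BY id, name, price)";
UniQuery1->Execute();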

sandy771
Posts: 194
Joined: Tue 22 May 2007 13:57

Post by sandy771 » Mon 02 Aug 2010 10:00

Thanks bork, I'll look at that.
