deleting duplicate records
deleting duplicate records
I have a largeish database (10 million+ records)
I need to add a few thousand records to the database and then remove any duplicates. This is an operation that I have to do reasonably often.
I am after ideas as to the most efficient way to do this.
At the moment drop the index, append the records, sort and then step through and remove duplicates this would take a couple of minutes (acceptable) but I have a problem.
I am using UniQuery->next() to go through the table
If I check the current record against the previous and see that there is a duplicate I could use an sql query to delete the current record based on ROWID (sqlite). As I understand it the only way to do this is with a separate uniquery component - otherwise I would lose the cursor position when my query changes. Is this right
Is there a better way of doing this?
I need to add a few thousand records to the database and then remove any duplicates. This is an operation that I have to do reasonably often.
I am after ideas as to the most efficient way to do this.
At the moment drop the index, append the records, sort and then step through and remove duplicates this would take a couple of minutes (acceptable) but I have a problem.
I am using UniQuery->next() to go through the table
If I check the current record against the previous and see that there is a duplicate I could use an sql query to delete the current record based on ROWID (sqlite). As I understand it the only way to do this is with a separate uniquery component - otherwise I would lose the cursor position when my query changes. Is this right
Is there a better way of doing this?
Last edited by sandy771 on Thu 29 Jul 2010 09:22, edited 1 time in total.
I'm not familiar with SQLite but I would think it would be more efficient to do it something like this:
create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table
It depends on the nature of the records and primary keys/indexes and whether your check for duplicates needs code to check duplicity, etc
It should be faster since more of the work is being done on the server without fetching all the records back
Regards, Paul
create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table
It depends on the nature of the records and primary keys/indexes and whether your check for duplicates needs code to check duplicity, etc
It should be faster since more of the work is being done on the server without fetching all the records back
Regards, Paul
There is a real possibilitythatthe temp table would be too big.DepSoft wrote:I'm not familiar with SQLite but I would think it would be more efficient to do it something like this:
create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table
It depends on the nature of the records and primary keys/indexes and whether your check for duplicates needs code to check duplicity, etc
It should be faster since more of the work is being done on the server without fetching all the records back
Regards, Paul
I have considered adding a flag field to each record, setting it for each duplicate and then clearing each record with the flag set. But looking for ideas.
Hello
For the table:
CREATE TABLE test_table (
ID INTEGER NOT NULL,
Name VARCHAR(250) NOT NULL
)
You can select duplicates:
select rowid, * from test_table
where rowid not in (select min(rowid) from btest0 group by id, name)
You can delete duplicates:
delete from test_table
where rowid not in (select min(rowid) from btest0 group by id, name)
For your table you should add all your fields in the group by statement.
For the table:
CREATE TABLE test_table (
ID INTEGER NOT NULL,
Name VARCHAR(250) NOT NULL
)
You can select duplicates:
select rowid, * from test_table
where rowid not in (select min(rowid) from btest0 group by id, name)
You can delete duplicates:
delete from test_table
where rowid not in (select min(rowid) from btest0 group by id, name)
For your table you should add all your fields in the group by statement.