Thursday, March 22, 2012

(almost) duplicates

I have a script to remove duplicate records from a table. I have since
found out that I have dups with a date/time stamp that are a few seconds off
(thus, I guess technically making them not duplicates). How can I properly
get rid of the later records. Again the only difference is the seconds in a
date/time field.
JeffPlease post your table structures and sample data along with the script
which you use to distingush the duplicates. Without clear specifications and
an usable repro, it is hard for others to understand what would
"technically" make some rows non-duplicates when in reality, they are.
Anith|||Any other column to identitfy the row?
delete t
where exists (select * from t as t1 where t1.col1 = t.col1 and
t1.col_datetime < t.col_datetime)
AMB
"J. Clarke" wrote:

> I have a script to remove duplicate records from a table. I have since
> found out that I have dups with a date/time stamp that are a few seconds o
ff
> (thus, I guess technically making them not duplicates). How can I properl
y
> get rid of the later records. Again the only difference is the seconds in
a
> date/time field.
> Jeff
>
>|||Hmmm...wouldn't this delete everything except the most recent record (or am
I missreading this)?
"Alejandro Mesa" <AlejandroMesa@.discussions.microsoft.com> wrote in message
news:E562DCB5-4F62-4DD2-B859-2F357856DBF4@.microsoft.com...[vbcol=seagreen]
> Any other column to identitfy the row?
> delete t
> where exists (select * from t as t1 where t1.col1 = t.col1 and
> t1.col_datetime < t.col_datetime)
>
> AMB
> "J. Clarke" wrote:
>|||I was afraid of this. Basically all the fields in a multiple rows contain
the same values EXCEPT the datetime field. So technically their not
duplicates (the datetimes are a couple of seconds off from each other).
However, I know the front end application had an error that was sticking
them in. I need to keep the 1st record it stuck in and get rid of the rest
that are a few seconds off from the 1st record.
"Anith Sen" <anith@.bizdatasolutions.com> wrote in message
news:%239EpuQ7MFHA.2656@.TK2MSFTNGP10.phx.gbl...
> Please post your table structures and sample data along with the script
> which you use to distingush the duplicates. Without clear specifications
> and an usable repro, it is hard for others to understand what would
> "technically" make some rows non-duplicates when in reality, they are.
> --
> Anith
>|||No, it wouldn't . this is a correlated subquery... Take a look at the table
alias t and t1... the same table but treated as 2 different tables...
This takes a row from table t and is trying to decide whether or not to
delete it... It looks to find a row in the same table ( but aliased to t1)
that has the same col1 ( supposedly the id column) but which has a date
which is < than the date of the row in t you are considering for
deletion... If that expression is true, that means there is another row
with the same key but which has an earlier date, so this row must be the
additionaly row that was added later , and there fore should be deleted.
Hope this helps..
Wayne Snyder, MCDBA, SQL Server MVP
Mariner, Charlotte, NC
www.mariner-usa.com
(Please respond only to the newsgroups.)
I support the Professional Association of SQL Server (PASS) and it's
community of SQL Server professionals.
www.sqlpass.org
"J. Clarke" <jaclarke01@.hotmail.comNOSPAM> wrote in message
news:u8lQjW$MFHA.2604@.TK2MSFTNGP10.phx.gbl...
> Hmmm...wouldn't this delete everything except the most recent record (or
> am I missreading this)?
> "Alejandro Mesa" <AlejandroMesa@.discussions.microsoft.com> wrote in
> message news:E562DCB5-4F62-4DD2-B859-2F357856DBF4@.microsoft.com...
>|||On Mon, 28 Mar 2005 10:35:43 -0500, J. Clarke wrote:

>I have a script to remove duplicate records from a table. I have since
>found out that I have dups with a date/time stamp that are a few seconds of
f
>(thus, I guess technically making them not duplicates). How can I properly
>get rid of the later records. Again the only difference is the seconds in
a
>date/time field.
>Jeff
>
Hi Jeff,
Assuming Col1, Col2 and Col3 are exactly the same and Col4 is the
datetime column with a few seconds difference, and that you want to
delete the duplicates if the time difference is no more than 20 seconds,
use:
DELETE FROM MyTable
WHERE EXISTS
(SELECT *
FROM MyTable AS b
WHERE b.Col1 = MyTable.Col1
AND b.Col2 = MyTable.Col2
AND b.Col3 = MyTable.Col3
AND b.Col4 < MyTable.Col4
AND b.Col4 >= DATEDIFF(second, 20, MyTable.Col4))
(untested)
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)|||similar to Hugo's idea you could also try the following:
DELETE MyTable
from MyTable my
LEFT JOIN (
select Col1, Col2, Col3, MIN(Col4)as MinCol4 from MyTable
group by Col1, Col2, Col3) as gb
ON my.Col1 = gb.Col1,
and my.Col2 = gb.Col2,
and my.Col3 = gb.Col3,
and my.Col4 = gb.MinCol4
where gb.MinCol4 IS NULL
This does not take into account any specific time lag. This would find
duplicates that are days apart not just 20 seconds.
This uses a subquery "group by" to identify distinct records and the
"earliest" (min) date stamp and then matches it back and drops all the
matching records that are not the earliest datetime.
The magic comes from the LEFT JOIN and the fact that MinCol4 IS NULL
Good luck.
Message posted via http://www.droptable.com|||Just to make you feel better, you can try this test to prove the theory:
(also note: I inadvertantly had commas in the "and" section of the left
join)
if exists(select name from sysobjects where name = 'MyTable' and type = 'U')
drop table MyTable
go
create table MyTable(Col1 int, Col2 varchar(10), Col3 int, Col4 datetime)
insert MyTable select 1, 'Lucy', 101, '1/1/2005'
insert MyTable select 1, 'Lucy', 101, '1/2/2005'
insert MyTable select 2, 'Ricky', 102, '1/1/2005'
insert MyTable select 2, 'Ricky', 102, '1/2/2005'
insert MyTable select 2, 'Ricky', 102, '1/3/2005'
insert MyTable select 3, 'Fred', 103, '1/1/2005'
insert MyTable select 4, 'Ethel', 104, '1/1/2005'
insert MyTable select 4, 'Ethel', 104, '1/2/2005'
insert MyTable select 4, 'Ethel', 104, '1/3/2005'
insert MyTable select 4, 'Ethel', 104, '1/4/2005'
DELETE MyTable
from MyTable my
LEFT JOIN (
select Col1, Col2, Col3, MIN(Col4)as MinCol4 from MyTable
group by Col1, Col2, Col3) as gb
ON my.Col1 = gb.Col1
and my.Col2 = gb.Col2
and my.Col3 = gb.Col3
and my.Col4 = gb.MinCol4
where gb.MinCol4 IS NULL
select * from MyTable
Message posted via http://www.droptable.com|||Ah...Thanks for the explaination Wayne. That helps alot.
"Wayne Snyder" <wayne.nospam.snyder@.mariner-usa.com> wrote in message
news:%232IH7KGNFHA.1476@.TK2MSFTNGP09.phx.gbl...
> No, it wouldn't . this is a correlated subquery... Take a look at the
> table alias t and t1... the same table but treated as 2 different
> tables...
> This takes a row from table t and is trying to decide whether or not to
> delete it... It looks to find a row in the same table ( but aliased to t1)
> that has the same col1 ( supposedly the id column) but which has a date
> which is < than the date of the row in t you are considering for
> deletion... If that expression is true, that means there is another row
> with the same key but which has an earlier date, so this row must be the
> additionaly row that was added later , and there fore should be deleted.
> Hope this helps..
> --
> Wayne Snyder, MCDBA, SQL Server MVP
> Mariner, Charlotte, NC
> www.mariner-usa.com
> (Please respond only to the newsgroups.)
> I support the Professional Association of SQL Server (PASS) and it's
> community of SQL Server professionals.
> www.sqlpass.org
> "J. Clarke" <jaclarke01@.hotmail.comNOSPAM> wrote in message
> news:u8lQjW$MFHA.2604@.TK2MSFTNGP10.phx.gbl...
>

No comments:

Post a Comment