nixCraft Linux Forum

Remove Duplicate Files From 2 Partitions

  #1 (permalink)  
Old 25-09-2009, 02:15 AM
jaysunn
Powered By Linux
User
 
Join Date: Apr 2009
Location: 41.332032,-73.089775
OS: RHEL - OSX
Scripting language: BASH - Learning Ruby
Posts: 600
Thanks: 61
Thanked 78 Times in 70 Posts
Rep Power: 10

Hello Nixcraft,

I have 2 partitions that contain thousands of files in a folder structure as follows:

Code:
/data1/wcnn/*.mp3
/data1/wxxr/*.mp3
/data1/trrn/*.mp3

/data2/wcnn/*.mp3
/data2/wxxr/*.mp3
/data2/trrn/*.mp3
So basically there is data1/station/*.mp3 and data2/station/*.mp3

Each station's abbreviation is 4 characters long, and that part of the path is the same on both partitions. We have duplicate MP3s residing on both partitions.

So there may be e.g.:

Code:
/data1/wbcn/666998.mp3 and in /data2/wbcn/666998.mp3
Is it possible to search for duplicate files and write the filenames that are on both data1 and data2 to a large text file?

I tried using the diff command, but this data is just way too large: almost 1 TB per partition.

Jaysunn
__________________
Have a look at what I have been working on
http://www.shellasaurus.com
  #2 (permalink)  
Old 25-09-2009, 11:55 AM
nixcraft
Never say die
User
 
Join Date: Jan 2005
Location: BIOS
OS: RHEL
Scripting language: Bash and Python
Posts: 2,710
Thanks: 11
Thanked 244 Times in 183 Posts
Rep Power: 10

Run the find command on both partitions in the background and write each result to a text file. Once both are done, use those text files to create a diff or uniq view.
Code:
find /path/to/partition1 -iname "*.mp3" -print0 >output1.txt 2>error1.txt &
find /path/to/partition2 -iname "*.mp3" -print0 >output2.txt 2>error2.txt &
Now run
Code:
diff output1.txt output2.txt > diff.txt
You can also use the sort and uniq commands to sort the lists and get a list of unique files.
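
A minimal sketch of the sort-based idea, assuming the lists hold one path per line (plain -print output rather than -print0) and that the mount points are /data1 and /data2 as in the original question. The function name common_paths and the file names are illustrative only. Running diff on the raw lists would flag every line, because the leading directory differs, so the prefix is stripped first and comm -12 then prints only the lines present in both sorted lists.

```shell
#!/bin/bash
# Hypothetical helper: print relative paths that appear in both lists.
# $1 = list file for the first partition,  $2 = its mount-point prefix
# $3 = list file for the second partition, $4 = its mount-point prefix
common_paths() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || return 1
    # Strip the mount-point prefix so the lists are comparable, then sort.
    sed "s|^$2/||" "$1" | LC_ALL=C sort > "$tmp1"
    sed "s|^$4/||" "$3" | LC_ALL=C sort > "$tmp2"
    # comm -12 suppresses lines unique to either file, leaving the common ones.
    LC_ALL=C comm -12 "$tmp1" "$tmp2"
    rm -f "$tmp1" "$tmp2"
}

# Example (paths are assumptions):
#   find /data1 -iname "*.mp3" > data1.txt
#   find /data2 -iname "*.mp3" > data2.txt
#   common_paths data1.txt /data1 data2.txt /data2 > duplicate_names.txt
```

This only shows that the same relative path exists on both partitions; it says nothing about whether the contents match.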
__________________
Vivek Gite
Linux Evangelist
Be proud RHEL user, and let the world know about your enterprise choices! Join RedHat user group.
Always use CODE tags for posting system output and commands!
Do you run a Linux? Let's face it, you need help
  #3 (permalink)  
Old 25-09-2009, 06:58 PM
jaysunn

Thanks for your replies. I executed the find commands that Vivek posted and generated two files.

Code:
data1.txt = 14MB
data2.txt = 4.5MB
These files were pretty large. Next I attempted to create the diff output file with this command. Here are the results. Please advise.

Code:
[root@podcast2 ~]# diff data1.txt data2.txt > duplicate_mp3.txt
Code:
[root@podcast2 ~]# cat duplicate_mp3.txt 
Binary files data1.txt and data2.txt differ
[root@podcast2 ~]#
I am currently researching this error. However, if you have a quick solution, I would appreciate it.

Thanks for your support,

Jaysunn
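
A likely cause of the "Binary files ... differ" message: find -print0 separates file names with NUL bytes instead of newlines, so diff treats the lists as binary data. A small sketch of one way around it, assuming the lists were produced with -print0 as in the find commands earlier in the thread; the function name nul_to_lines is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: convert a NUL-separated list (find -print0 output)
# into a newline-separated, sorted list that diff and comm can read as text.
nul_to_lines() {
    tr '\0' '\n' < "$1" | LC_ALL=C sort
}

# Example (file names taken from the post above):
#   nul_to_lines data1.txt > data1.lines.txt
#   nul_to_lines data2.txt > data2.lines.txt
```

Re-running find with plain -print would give the same newline-separated result. Either form assumes no file name contains a newline.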
  #4 (permalink)  
Old 25-09-2009, 08:42 PM
nixcraft

Jay,

I think you need to write a bit of shell script. I had something like this for finding duplicate mp3s, but my collection is 2-3 GB max.

Code:
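#!/bin/bash
# Lists of .mp3 paths, one per line; run the script from the directory that holds them
F1=sda1.txt
F2=sdb2.txt

# Print each file from F2 whose name also appears in F1 (the /dev/sda1 list)
while IFS= read -r line
do
    cf=$(basename "$line")
    # -F matches the name as a fixed string; the leading / avoids partial-name hits
    grep -qF -- "/$cf" "$F1" && echo "$line"
done < "$F2"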
Here is what I do to create those two files:
Code:
find /mnt/sda1 -iname "*.mp3" >/tmp/sda1.txt &
find /mnt/sdb2 -iname "*.mp3" >/tmp/sdb2.txt &
Once they are generated, I run the script as follows to create a list of the duplicates on the second partition:
Code:
/path/to/script > dups.txt

I suggest you run this on a small data set first, such as 20 or 30 mp3 files in the /tmp/d1 and /tmp/d2 directories (copy them manually). Remove or add a few duplicates in d2 and test the script. Once everything is working, try it on your actual data set.
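
One way to set up such a trial run, sketched with made-up file names. The d1 and d2 directories and the list names sda1.txt and sdb2.txt come from the post; the station directory, the file contents, and the function name are assumptions. The base directory is an argument so the test set can live somewhere other than /tmp.

```shell
#!/bin/bash
# Hypothetical helper: build a tiny test set with two known duplicates.
make_test_set() {
    base=${1:-/tmp}
    mkdir -p "$base/d1/wbcn" "$base/d2/wbcn" || return 1
    # Files present on both sides (same name, same content)
    for n in 1 2; do
        echo "track $n" > "$base/d1/wbcn/$n.mp3"
        cp "$base/d1/wbcn/$n.mp3" "$base/d2/wbcn/$n.mp3"
    done
    # Files present on one side only
    echo "only on d1" > "$base/d1/wbcn/3.mp3"
    echo "only on d2" > "$base/d2/wbcn/4.mp3"
    # Write the two lists the script expects
    find "$base/d1" -iname "*.mp3" > "$base/sda1.txt"
    find "$base/d2" -iname "*.mp3" > "$base/sdb2.txt"
}
```

Run from the base directory, the script should then report 1.mp3 and 2.mp3 under d2 and nothing else.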

You may also need to consider md5 checksums and not just filenames. Are those exact duplicates? For example, /tmp/foo and /data/foo have the same name, but they might have different content. In that case you need to run an md5 checksum on both files. Let me know...
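
A sketch of the checksum idea, assuming GNU md5sum is available and that a duplicate means the same relative path under both mount points with identical content. The function name and its arguments are illustrative, not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: for every .mp3 under $2 that also exists at the same
# relative path under $1, print the $2 path if the MD5 checksums match.
same_name_same_md5() {
    find "$2" -type f -iname "*.mp3" | while IFS= read -r f2
    do
        f1="$1/${f2#"$2"/}"          # same relative path on the first partition
        [ -f "$f1" ] || continue     # no counterpart, so not a duplicate
        sum1=$(md5sum < "$f1") || continue
        sum2=$(md5sum < "$f2") || continue
        [ "$sum1" = "$sum2" ] && echo "$f2"
    done
    return 0
}

# Example (mount points from the original question):
#   same_name_same_md5 /data1 /data2 > duplicate_mp3.txt
```

Hashing reads every candidate file in full, so on a data set of this size it will take far longer than comparing names alone.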
Last edited by nixcraft; 25-09-2009 at 08:45 PM.
  #5 (permalink)  
Old 25-09-2009, 08:50 PM
nixcraft

I think md5sum and cmp are the two commands you should look into. Take a look at the following:


Dr. Dobb's | Finding Duplicate Files | December 1, 2003
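
When the two candidate paths are already known, cmp can stand in for a checksum: it compares byte by byte and stops at the first difference, so differing files are rejected without reading both in full. A minimal sketch with a hypothetical function name:

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) only if both files exist and are
# byte-for-byte identical. cmp -s prints nothing and reports via exit status.
is_exact_duplicate() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# Example (path from the original question):
#   is_exact_duplicate /data1/wbcn/666998.mp3 /data2/wbcn/666998.mp3 && echo duplicate
```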
The Following 2 Users Say Thank You to nixcraft For This Useful Post:
chiku (28-09-2009), jaysunn (25-09-2009)
  #6 (permalink)  
Old 25-09-2009, 10:50 PM
jaysunn

This did the trick. I have generated a wonderful list of duplicates. Thank you so, so much.

You are a genius.

Jaysunn
  #7 (permalink)  
Old 25-09-2009, 11:34 PM
jaysunn

Hey, using your concept I came up with this, and I created another great list. Thanks again.


Code:
[root@podcast2 ~]# cat duplicate_finder.sh
#!/bin/bash

F1=data1_copy.txt
F2=data2.txt

while IFS= read -r line
do
    cf=$line
    # Is this path also listed in F1?
    grep -q "$cf" "$F1"
    if [ $? == 0 ]
    then
        # Compare the file with its counterpart under data1
        diff "$line" "$(echo "$line" | sed 's/data2/data1/g')"
        if [ $? == 0 ]
        then
            echo "$line"
        fi
    fi
done < "$F2"
[root@podcast2 ~]#


Jaysunn
©2005-2010 nixCraft. All rights reserved
