flat assembler
Message board for the users of flat assembler.

flat assembler > Projects and Ideas > file compare utility, loop through drive's folders

sleepsleep



Joined: 05 Oct 2006
Posts: 7722
maybe there are a thousand utilities like this out there,
this one is different only because i will probably code it myself Laughing

so, it is a spark of an idea, because file management in 2012 is a tedious job.

we store lots of pictures, songs, documents, binary files, videos, movies and maybe a whole lot of other stuff on our hard disks.

we do lots of backups, we are so scared to lose them, it is better to have more copies of the same file.

the worst and most chaotic place is inside your computer, your hard disk,

and i actually believe you can understand a person's behaviour through the way he stores things inside his pc.

so, what i want to achieve here is,

1. a tool to loop through a defined drive and store the file name and file size in an sqlite database (unicode is a must). i don't think i need file checksums in the beginning.

2. once we have the data in the sqlite db, we let the user go through it ordered by most-duplicated file, with a click to open all the folders at once, or delete all and leave only 1 copy, or other features to manage those duplicates.

3. just thinking about being able to thumbnail picture files and store the thumbnails in another sqlite db.
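Step 1 above could be sketched like this in Python (a minimal sketch; `scan_drive` and the `files` table schema are names I made up for illustration, not anything from the thread; SQLite stores TEXT as UTF-8, so Unicode file names work as long as paths stay `str`):

```python
import os
import sqlite3

def scan_drive(root, db_path):
    """Walk `root` and record every file's path, name and size in an SQLite DB."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, name TEXT, size INTEGER)"
    )
    with con:  # one transaction for the whole scan
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    continue  # unreadable entry: skip it, don't abort the scan
                con.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                    (path, name, size),
                )
    con.close()
```

With name and size indexed, the "most duplicated" ordering in step 2 becomes a single `GROUP BY name, size` query over the table.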
Post 19 Nov 2012, 01:38
kalambong



Joined: 08 Nov 2008
Posts: 165
Sleepsleep, when you are awake, maybe you might consider expanding your file-compare utility a bit:

What is needed right now is a bit-by-bit compare utility that compares an ISO file (essentially a DVD image file) to the image that has been written to a DVD-R (or DVD-RW).

Disc burning utilities such as Nero have this function built in. Unfortunately, there is no standalone utility that can compare an ISO file with the DVD (or CD) disc itself.
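Such a standalone verify could be sketched like this (an illustration, assuming the OS exposes the disc as a raw device readable as a file, e.g. `/dev/sr0` on Linux or `\\.\D:` on Windows with sufficient privileges; the device can report padding past the image end, so only `len(iso)` bytes are compared):

```python
def compare_iso_to_device(iso_path, device_path, chunk=1024 * 1024):
    """Bit-for-bit compare an ISO image against a raw disc device.

    Returns None if identical for the full image length, otherwise
    the byte offset of the first difference.
    """
    with open(iso_path, "rb") as iso, open(device_path, "rb") as dev:
        offset = 0
        while True:
            a = iso.read(chunk)
            if not a:
                return None  # end of image reached with no mismatch
            b = dev.read(len(a))
            if a != b:
                # locate the first differing byte inside this chunk
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                return offset + min(len(a), len(b))  # one side ran short
            offset += len(a)
```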
Post 23 Jan 2013, 05:58
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16106
Location: Squiddler's Patch
If you have a database of your drive's contents then you only need to store each file's hash and associate it with the name and path. After that, finding duplicates is easy with a simple function to sort the hashes and expose the duplicates.

You could also extend this to backup drives to ensure that the backups have a copy of each file, by comparing the hashes and exposing singletons that exist in only one place.

I don't know how you can keep the hash table synchronised with file system updates. This may be the really difficult part: making sure that the current hashes are up-to-date each time a file is changed, deleted or added.

Perhaps if you use an FS that supports alternate streams then the hash can be put into a new stream with a date/time field to show when the hash was last computed? But that will require you to periodically scan the drive to find outdated hashes. However, this suffers from the problem that the normal last-modified date/time file attribute is writeable by applications and might be a false value. I know that TrueCrypt can be set to do this, so a simple last-modified date/time search would not show any new updates.
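The hash-and-bucket part of this is straightforward; a minimal sketch (SHA-256 chosen as one example of a "good" hash; the synchronisation problem described above is deliberately not addressed here):

```python
import hashlib
from collections import defaultdict

def find_duplicates(paths):
    """Group file paths by SHA-256 content hash.

    Returns {hex_digest: [paths]} for every hash shared by more than
    one file; singletons (files existing in only one place) are the
    complement of this result.
    """
    by_hash = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # hash in 1 MiB blocks so large files don't need to fit in RAM
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        by_hash[h.hexdigest()].append(path)
    return {d: group for d, group in by_hash.items() if len(group) > 1}
```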
Post 23 Jan 2013, 06:14
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4634
Location: Argentina
revolution, maybe this (and this for Linux)?

(I don't have experience with either of them)
Post 23 Jan 2013, 06:35
ejamesr



Joined: 04 Feb 2011
Posts: 52
Location: Provo, Utah, USA
Matching hash codes do NOT tell you that two files are identical, but mismatching hashes CAN tell you they are not identical. When the hash values match, it is still possible the files differ, so they should be compared bit by bit to determine whether they are, in fact, duplicates. Comparing the file sizes helps, too (if they differ, of course the files differ). But even when both the file size and the hash match, the files can still be different, so a byte comparison is needed.

As revolution pointed out, you need some fool-proof way of making sure that whatever heuristics you use are based upon correct, up-to-date information. Otherwise, you're subject to accidentally destroying data.
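The conservative pipeline described above (size, then hash, then bytes) can be sketched as follows; `definitely_identical` is an illustrative name, and the size and hash steps exist only to reject non-duplicates cheaply before the expensive full read:

```python
import filecmp
import hashlib
import os

def _sha256(path):
    """Content hash, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.digest()

def definitely_identical(path_a, path_b):
    """Size check, then hash check, then byte-by-byte comparison.

    Only the final full comparison (shallow=False forces a byte-level
    read) proves the files are duplicates; a False from the earlier
    steps proves they are not.
    """
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    if _sha256(path_a) != _sha256(path_b):
        return False
    return filecmp.cmp(path_a, path_b, shallow=False)
```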
Post 25 Jan 2013, 22:53
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16106
Location: Squiddler's Patch
ejamesr wrote:
Comparing hash codes does NOT tell you that two files are identical ...
Indeed this is theoretically true. But for practical purposes this won't be an issue when using "good" hashes like SHA256, Whirlpool, etc.
Post 29 Jan 2013, 05:51
baldr



Joined: 19 Mar 2008
Posts: 1651
sleepsleep,

Beware of hard/symlinks and mount points (in general, reparse points for NTFS).
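This caution matters because two directory entries sharing the same `(st_dev, st_ino)` pair are one on-disk file (a hardlink), not a duplicate worth deleting; treating them as duplicates could destroy the only copy. A sketch of a link-aware walk (illustrative name; note that `os.walk` already does not follow symlinked directories unless `followlinks=True`):

```python
import os

def unique_regular_files(root):
    """Yield each regular file under `root` exactly once.

    Symlinks are skipped outright, and hardlinked entries are
    de-duplicated by their (device, inode) identity.
    """
    seen = set()
    for dirpath, _dirs, files in os.walk(root, followlinks=False):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # a link, not a second copy of the data
            st = os.stat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # hardlink to a file we already yielded
            seen.add(key)
            yield path
```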
Post 29 Jan 2013, 17:27
pelaillo
Missing in inaction


Joined: 19 Jun 2003
Posts: 863
Location: Colombia
Use git for that. Fast and reliable.
Post 29 Jan 2013, 19:01
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
ejamesr wrote:
...some fool-proof way of making sure that whatever heuristics you use...
for example the approach used in forensic tools
(clusters, sectors etc.); a general glossary: http://www.cnwrecovery.com/html/ntfs_forensic.html
Cheers,
Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 29 Jan 2013, 23:48
ejamesr



Joined: 04 Feb 2011
Posts: 52
Location: Provo, Utah, USA
revolution wrote:
Indeed this is theoretically true. But for practical purposes this won't be an issue when using "good" hashes like SHA256, Whirlpool, etc.

Hashes have traditionally been used to determine whether a file has changed at all, and the more hash output bits, the greater the confidence (assuming you have a good way of determining that the hash algorithm is good). But it is still possible for two totally different files to have the same hash...

You are probably right; I don't know the real probabilities here, but I'm not so sure whether this "won't be an issue" or merely "shouldn't be an issue". To me, it still seems safer to perform a bit comparison before deleting a file that a hash comparison says is an exact duplicate. Or at least, in a commercial product, let the end user choose which method is used to identify duplicates, thereby shifting the burden onto the user.
Post 30 Jan 2013, 01:44
sleepsleep



Joined: 05 Oct 2006
Posts: 7722
i still think i need such a tool, after 5 years Laughing
Post 21 Jun 2018, 19:59


Copyright © 1999-2018, Tomasz Grysztar.

Powered by rwasa.