flat assembler
Message board for the users of flat assembler.

flat assembler > Projects and Ideas > file compare utility, loop through drive's folders

sleepsleep



Joined: 05 Oct 2006
Posts: 7722
maybe there are a thousand utilities like this out there,
this one is different only because i will probably code it myself Laughing

so, it is a spark of an idea, because file management in 2012 is a tedious job.

we store lots of pictures, songs, documents, binary files, videos, movies and maybe a whole lot of other stuff on our hard disks.

we do lots of backups, we are so scared to lose them, it is better to have more copies of the same file.

the worst and most chaotic place is inside your computer, your hard disk,

and i actually believe you can understand a person's behaviour through the way he stores things inside his pc.

so, what i want to achieve here is,

1. a tool to loop through a defined drive and store the file name and file size in an sqlite database (unicode is a must). i don't think i need file checksums in the beginning.

2. once we have the data in the sqlite db, we let the user go through it ordered by most-duplicated file, with a click to open all the folders at once, or delete all and leave only 1 copy, or other features to manage those duplicates.

3. just thinking about being able to thumbnail picture files and store the thumbnails in another sqlite db.
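Step 1 above could be sketched like this in Python (a minimal sketch; `scan_drive` and the `files` table schema are names I made up for illustration, not anything from the thread; SQLite stores TEXT as UTF-8, so Unicode file names work as long as paths stay `str`):

```python
import os
import sqlite3

def scan_drive(root, db_path):
    """Walk `root` and record every file's path, name and size in an SQLite DB."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, name TEXT, size INTEGER)"
    )
    with con:  # one transaction for the whole scan
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    continue  # unreadable entry: skip it, don't abort the scan
                con.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                    (path, name, size),
                )
    con.close()
```

With name and size indexed, the "most duplicated" ordering in step 2 becomes a single `GROUP BY name, size` query over the table.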
Post 19 Nov 2012, 01:38
kalambong



Joined: 08 Nov 2008
Posts: 165
Sleepsleep, when you are awake, maybe you might consider expanding your file-compare utility a bit:

What is needed right now is a bit-by-bit compare utility that compares an ISO file (essentially a DVD image file) to the image that has been written to a DVD-R (or DVD-RW).

Disc burning utilities such as Nero have this function built in. Unfortunately, there is no standalone utility that can compare an ISO file with the DVD (or CD) disc itself.
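Such a standalone verify could be sketched like this (an illustration, assuming the OS exposes the disc as a raw device readable as a file, e.g. `/dev/sr0` on Linux or `\\.\D:` on Windows with sufficient privileges; the device can report padding past the image end, so only `len(iso)` bytes are compared):

```python
def compare_iso_to_device(iso_path, device_path, chunk=1024 * 1024):
    """Bit-for-bit compare an ISO image against a raw disc device.

    Returns None if identical for the full image length, otherwise
    the byte offset of the first difference.
    """
    with open(iso_path, "rb") as iso, open(device_path, "rb") as dev:
        offset = 0
        while True:
            a = iso.read(chunk)
            if not a:
                return None  # end of image reached with no mismatch
            b = dev.read(len(a))
            if a != b:
                # locate the first differing byte inside this chunk
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                return offset + min(len(a), len(b))  # one side ran short
            offset += len(a)
```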
Post 23 Jan 2013, 05:58
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16106
Location: Squiddler's Patch
If you have a database of your drive's contents then you only need to store each file's hash and associate it with the name and path. After that, finding duplicates is easy with a simple function to sort the hashes and expose the duplicates.

You could also extend this to backup drives to ensure that the backups have a copy of each file, by comparing the hashes and exposing singletons that exist in only one place.

I don't know how you can keep the hash table synchronised with file system updates. This may be the really difficult part: making sure that the current hashes are up-to-date each time a file is changed, deleted or added.

Perhaps if you use an FS that supports alternate streams then the hash can be put into a new stream with a date/time field to show when the hash was last computed? But that will require you to periodically scan the drive to find outdated hashes. However, this suffers from the problem that the normal last-modified date/time file attribute is writeable by applications and might be a false value. I know that TrueCrypt can be set to do this, so a simple last-modified date/time search would not show any new updates.
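The hash-and-bucket part of this is straightforward; a minimal sketch (SHA-256 chosen as one example of a "good" hash; the synchronisation problem described above is deliberately not addressed here):

```python
import hashlib
from collections import defaultdict

def find_duplicates(paths):
    """Group file paths by SHA-256 content hash.

    Returns {hex_digest: [paths]} for every hash shared by more than
    one file; singletons (files existing in only one place) are the
    complement of this result.
    """
    by_hash = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # hash in 1 MiB blocks so large files don't need to fit in RAM
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        by_hash[h.hexdigest()].append(path)
    return {d: group for d, group in by_hash.items() if len(group) > 1}
```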
Post 23 Jan 2013, 06:14
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4634
Location: Argentina
revolution, maybe this (and this for Linux)?

(I don't have experience with either of them)
Post 23 Jan 2013, 06:35
ejamesr



Joined: 04 Feb 2011
Posts: 52
Location: Provo, Utah, USA
Matching hash codes do NOT tell you that two files are identical, but mismatching hashes CAN tell you they are not identical. When the hash values match, it is still possible the files differ, so they should be compared bit by bit to determine whether they are, in fact, duplicates. Comparing the file sizes helps, too (if they differ, of course the files differ). But even when both the file size and the hash match, the files can still be different, so a byte comparison is needed.

As revolution pointed out, you need some fool-proof way of making sure that whatever heuristics you use are based upon correct, up-to-date information. Otherwise, you're subject to accidentally destroying data.
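The conservative pipeline described above (size, then hash, then bytes) can be sketched as follows; `definitely_identical` is an illustrative name, and the size and hash steps exist only to reject non-duplicates cheaply before the expensive full read:

```python
import filecmp
import hashlib
import os

def _sha256(path):
    """Content hash, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.digest()

def definitely_identical(path_a, path_b):
    """Size check, then hash check, then byte-by-byte comparison.

    Only the final full comparison (shallow=False forces a byte-level
    read) proves the files are duplicates; a False from the earlier
    steps proves they are not.
    """
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    if _sha256(path_a) != _sha256(path_b):
        return False
    return filecmp.cmp(path_a, path_b, shallow=False)
```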
Post 25 Jan 2013, 22:53
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16106
Location: Squiddler's Patch
ejamesr wrote:
Comparing hash codes does NOT tell you that two files are identical ...
Indeed this is theoretically true. But for practical purposes this won't be an issue when using "good" hashes like SHA256, Whirlpool, etc.
Post 29 Jan 2013, 05:51
baldr



Joined: 19 Mar 2008
Posts: 1651
sleepsleep,

Beware of hard/symlinks and mount points (in general, reparse points for NTFS).
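This caution matters because two directory entries sharing the same `(st_dev, st_ino)` pair are one on-disk file (a hardlink), not a duplicate worth deleting; treating them as duplicates could destroy the only copy. A sketch of a link-aware walk (illustrative name; note that `os.walk` already does not follow symlinked directories unless `followlinks=True`):

```python
import os

def unique_regular_files(root):
    """Yield each regular file under `root` exactly once.

    Symlinks are skipped outright, and hardlinked entries are
    de-duplicated by their (device, inode) identity.
    """
    seen = set()
    for dirpath, _dirs, files in os.walk(root, followlinks=False):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # a link, not a second copy of the data
            st = os.stat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # hardlink to a file we already yielded
            seen.add(key)
            yield path
```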
Post 29 Jan 2013, 17:27
pelaillo
Missing in inaction


Joined: 19 Jun 2003
Posts: 863
Location: Colombia
Use git for that. Fast and reliable.
Post 29 Jan 2013, 19:01
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
ejamesr wrote:
...some fool-proof way of making sure that whatever heuristics you use...
for example the approach used in forensic tools
(clusters, sectors etc.); a general glossary: http://www.cnwrecovery.com/html/ntfs_forensic.html
Cheers,
Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 29 Jan 2013, 23:48
ejamesr



Joined: 04 Feb 2011
Posts: 52
Location: Provo, Utah, USA
revolution wrote:
Indeed this is theoretically true. But for practical purposes this won't be an issue when using "good" hashes like SHA256, Whirlpool, etc.

Hashes have traditionally been used to determine whether a file has changed at all, and the more hash output bits, the greater the confidence (assuming you have a good way of determining that the hash algorithm is good). But it is still possible for two totally different files to have the same hash...

You are probably right; I don't know the real probabilities here, but I'm not so sure whether this "won't be an issue" or merely "shouldn't be an issue". To me, it still seems safer to perform a bit comparison before deleting a file that a hash comparison says is an exact duplicate. Or at least, in a commercial product, let the end user choose which method is used to identify duplicates, thereby shifting the burden onto the user.
Post 30 Jan 2013, 01:44
sleepsleep



Joined: 05 Oct 2006
Posts: 7722
i still think i need such a tool, after 5 years Laughing
Post 21 Jun 2018, 19:59


Copyright © 1999-2018, Tomasz Grysztar.

Powered by rwasa.