MOBZHash
Hashes
A hash value is basically a value, computed from some content (typically a byte array), that changes radically when that content changes. A hash value is also meant to be unpredictable: the only way to obtain it is to calculate it.
This makes it suitable for 'fingerprinting'. If you supply content plus the accompanying hash value, anyone can re-compute the hash on the content and verify that the content was unchanged. That way, tampering can be detected. The longer the hash (the more bits in it), the less likely it is that someone can substitute different content that just happens to have the same hash value.
The most commonly used hash algorithms are MD5 and SHA; the latter comes in a variety of bit lengths: SHA1 (160 bits), SHA256, SHA384, and SHA512. These are all supported natively by the .NET Framework.
Hashing files
Hashing a file is easy, since a hash is usually calculated from a byte stream. You can read the contents of the file and then hash the resulting byte array, but that requires the whole file to reside in memory. Fortunately, you can just open the file for reading and compute the hash on its stream. That's a few lines of .NET code.
Making an application around that therefore mainly involves argument and exception handling, as you can see in the code. The result is a console application that displays hashes.
Usage
When MOBZHash is started without arguments, you'll see:
MOBZHash v1.0.1 (64-bit) by MOBZystems - http://www.mobzystems.com/Tools/MOBZHash
Usage:
MOBZHash [-v] [-r] [-d] hash-type [file-spec [file-spec ...]]
hash-type: hash type to use. All .NET hash algorithms are supported
Examples: MD5, SHA1 (default), SHA256, SHA384, SHA512
file-spec: A file name or wildcard. Defaults to all files in current directory
Switches:
-v: verbose
-r: recursive
-d: find duplicates. Display a list of all files with an identical hash
In a lot of ways this is similar to the way DIR works: you provide a file specification such as
c:\temp\*.txt
or
..\*.xml
and MOBZHash will calculate the hash of all matching files and display them. Which hash algorithm is used is determined by the first argument, which can be MD5, SHA1, SHA256, SHA384, or SHA512. (There may be more supported hash algorithms in the .NET Framework - or in a future version of it. Those can be used by MOBZHash too, simply by passing the right hash type name.)
Since SHA1 is the default hash and "*.*" is the default file specification,
MOBZHash MD5
will show MD5 hashes for all files in the current directory, while
MOBZHash -v
will show SHA1 hashes for all files in the current directory in verbose mode. That particular example produces output like this:
MOBZHash v1.0.1 (64-bit) by MOBZystems - http://www.mobzystems.com/Tools/MOBZHash
Computing SHA1 hash for *.*...
E9-5C-92-68-35-D4-D2-A6-37-45-B3-CE-86-4A-48-59-8E-41-FB-8E: MOBZHash.exe
62-16-50-F9-F4-98-E0-7B-0D-BE-00-D2-BE-40-81-E1-2E-8E-EB-4A: MOBZHash.exe.config
EB-AA-45-68-58-4F-9A-9D-D0-8D-34-B8-54-17-18-58-FD-AA-40-85: MOBZHash.pdb
27-35-2B-A0-E6-80-25-B6-03-38-37-2D-D8-1A-40-D2-B5-B5-5F-21: MOBZHash.vshost.exe
5D-BD-3D-03-59-38-DE-C5-B0-7E-E4-78-9A-B1-14-93-6D-2D-E6-68: MOBZHash.xml
(It may seem a little strange to display the hash before the file name, but the hash is always of the same length so output looks neater that way.)
If you specify -r on the command line, files in subdirectories will be hashed as well.
Finding duplicates
Because hash 'collisions' are so unlikely (or at least, meant to be), finding two identical hashes is a strong indication that the two corresponding contents are equal. Storing and comparing hashes is much cheaper than comparing byte arrays (maybe not so much time-wise, but certainly memory-wise), so comparing files by comparing their hashes makes sense.
And since we're walking a directory tree collecting hashes, it's really very easy to store those hashes on the fly and to report any duplicates found.
That's what the -d switch is for. It makes MOBZHash collect the hashes first, and then report any files that have identical hashes. The output looks like this:
SHA1 5D-BD-3D-03-59-38-DE-C5-B0-7E-E4-78-9A-B1-14-93-6D-2D-E6-68: 2 files
- MOBZHash - Copy.xml
- MOBZHash.xml
The combination of the -r and -d switches, plus the ability to handle multiple file specifications, makes MOBZHash a useful tool to scan for (likely!) duplicate files in one or more directory trees.
As usual, MOBZHash is free open source.
Enjoy!
BTW: .NET FTW, because writing MOBZHash took all of two hours, publishing on CodePlex another hour. And the (really wide-spaced) code prints to three pages.
Update January 25, 2018: Once again, these tasks are also easily accomplished using Powershell. For regular hashing, there's the Get-FileHash cmdlet. For finding duplicates, the following script (much less than three printed pages) also does the trick:
[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true)]
    [string]$Path,
    [switch]$Recurse = $false,
    [string]$Algorithm = 'SHA256'
)

# Use a Powershell version 5 class for the result.
# For lower versions, you can use a PSCustomObject
class FileDuplicate {
    [string]$Hash
    [string[]]$FullNames
}

Function FindDuplicates()
{
    # Write-Verbose "Finding duplicate files in $Path"

    # Get the files in the specified path (-File skips directories)
    $files = Get-ChildItem -Path $Path -File -Recurse:$Recurse

    # Set up a hashtable to store hashes and file names
    [HashTable]$hashes = @{}

    # Loop over the files
    $files | ForEach-Object {
        # Hash the file
        $hash = Get-FileHash -Path $_.FullName -Algorithm $Algorithm
        # If we already know the hash, add this file name to it;
        # if not, create a new hash entry for just this file
        if ($hashes.ContainsKey($hash.Hash)) {
            # Write-Verbose "$($_.FullName) is a duplicate"
            $list = $hashes[$hash.Hash]
        } else {
            $list = New-Object System.Collections.ArrayList
            $hashes[$hash.Hash] = $list
        }
        $list.Add($hash.Path) | Out-Null
    }

    # Create a new ArrayList for the results
    $result = New-Object System.Collections.ArrayList

    # Loop over the hashes (the keys)
    $hashes.Keys | ForEach-Object {
        $hash = $_
        $list = $hashes[$hash]
        # If there's more than one file involved, store it in the result
        # and report it verbosely
        if ($list.Count -gt 1) {
            $d = New-Object FileDuplicate
            $d.Hash = $hash
            $d.FullNames = $list.ToArray()
            $result.Add($d) | Out-Null

            Write-Verbose "$($list.Count) files share hash $($hash):"
            $list | ForEach-Object { Write-Verbose "- $_" }
        }
    }

    if ($result.Count -eq 0) {
        Write-Verbose "No duplicates found"
    }

    return $result
}

Set-StrictMode -Version Latest

return FindDuplicates
The main difference from the VB.NET code is the total lack of argument parsing - Powershell does that for us, and that saves a lot of code!
Update 2: We can really make use of Powershell if we use Select-Object, Group-Object and Where-Object! How about this?
[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true)]
    [string]$Path,
    [switch]$Recurse = $false,
    [string]$Algorithm = 'SHA256'
)

Get-ChildItem $Path -File -Recurse:$Recurse |      # Get the files in the path (-File skips directories)
    Select-Object -ExpandProperty FullName |       # Select only their full names
    ForEach-Object {                               # Hash each file
        Get-FileHash -Path $_ -Algorithm $Algorithm
    } |
    Group-Object -Property Hash |                  # Group by hash
    Where-Object { $_.Count -gt 1 } |              # Select only hashes with more than 1 file
    ForEach-Object {                               # Return a hash and a list of file names per duplicate
        [pscustomobject]@{ Hash = $_.Name; FullNames = $_.Group.Path }
    }
Six commands in a single Powershell pipeline - now that's the "Power" in Powershell, I guess!
Download MOBZHash
Happy with this tool? Consider buying us a cup of coffee