du gives two different results for the same file
You really should use something like md5sum or sha1sum to check integrity.
If you really want to use the size, use ls -l or du -b.
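As a rough sketch of the checksum approach, the following compares two files by their SHA-1 digests. The scratch directory and the file names are made up for illustration, and it assumes GNU coreutils (sha1sum, mktemp) is available.

```shell
# Hypothetical demo files in a scratch directory
workdir=$(mktemp -d)
printf 'some data\n' > "$workdir/original"
cp "$workdir/original" "$workdir/copy"

# sha1sum prints "<digest>  <name>"; keep only the digest
sum_original=$(sha1sum < "$workdir/original" | cut -d' ' -f1)
sum_copy=$(sha1sum < "$workdir/copy" | cut -d' ' -f1)

if [ "$sum_original" = "$sum_copy" ]; then
    result="identical"
else
    result="different"
fi
echo "$result"
```

Swapping sha1sum for md5sum should work the same way, since both print the digest as the first field.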
The du utility normally only shows the disk usage of the file, i.e. how much of the file system it occupies. This value depends entirely on the backing file system and on other factors such as sparse files.
Example:
$ truncate -s 512M foo
$ cat foo >bar
$ ls -l foo bar
-rw-r--r-- 1 michas users 536870912 23. Dez 00:06 bar
-rw-r--r-- 1 michas users 536870912 23. Dez 00:03 foo
$ du foo bar
0 foo
524288 bar
$ du -b foo bar
536870912 foo
536870912 bar
We have two files, both containing 512 MiB of zeros. The first one is stored sparsely and does not use any disk space, while the second stores each byte explicitly on disk. Same content, but completely different disk usage.
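The same difference can be read directly from stat, which reports both the byte size and the number of allocated blocks. This is only a sketch assuming GNU stat and truncate. The scratch paths are hypothetical, and the allocated figures depend on the file system (one that compresses or detects zeros may not allocate space for the second file either).

```shell
workdir=$(mktemp -d)
truncate -s 1M "$workdir/sparse"          # a hole only, no data written
cat "$workdir/sparse" > "$workdir/dense"  # the zeros are written out explicitly

for f in "$workdir/sparse" "$workdir/dense"; do
    apparent=$(stat -c %s "$f")   # size in bytes, as ls -l reports it
    blocks=$(stat -c %b "$f")     # number of allocated blocks
    unit=$(stat -c %B "$f")       # size in bytes of each block counted by %b
    echo "$(basename "$f"): apparent=$apparent allocated=$((blocks * unit))"
done
```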
The -b option might be good for you:
-b, --bytes
equivalent to '--apparent-size --block-size=1'
--apparent-size
print apparent sizes, rather than disk usage; although the apparent
size is usually smaller, it may be larger due to holes in
('sparse') files, internal fragmentation, indirect blocks, and the
like
This is a common problem when you put the same data on 2 different HDDs. You'll want to run the du command with an additional switch, assuming it has it, which it should given these are Linux nodes.
The switch?
--apparent-size
print apparent sizes, rather than disk usage; although the
apparent size is usually smaller, it may be larger due to holes in
('sparse') files, internal fragmentation, indirect blocks, and the
like
Example
$ du -sh --apparent-size /home/sam/scsconfig.log ~/scsconfig.log
93K /home/sam/scsconfig.log
93K /root/scsconfig.log
The two filesystems above are a local disk (/root) and an NFS share from my NAS (/home/sam).
$ df -h . /home/sam
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
222G 118G 92G 57% /
mulder:/export/raid1/home/sam
917G 566G 305G 65% /home/sam
So what's up?
This confuses a lot of people, but remember that when files are stored on a disk they consume whole blocks of space, even if they only fill part of the last block. When you run du without --apparent-size, you get the size based on the disk blocks allocated to the file(s), not the number of bytes they actually contain.
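A one-byte file makes the block rounding easy to see. This is a sketch assuming GNU du, with a hypothetical scratch file. The on-disk figure is often one full block, but it varies with the file system and may even be 0 right after the write.

```shell
workdir=$(mktemp -d)
printf 'x' > "$workdir/tiny"     # a file holding exactly one byte

# du separates its columns with a tab, which is cut's default delimiter
apparent=$(du -b "$workdir/tiny" | cut -f1)              # bytes in the file
on_disk=$(du --block-size=1 "$workdir/tiny" | cut -f1)   # bytes allocated on disk
echo "apparent: $apparent bytes, on disk: $on_disk bytes"
```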
Using a checksum instead?
This is likely a better option if you're concerned about comparing 2 trees of files. You can use this command to calculate a checksum for all the files, and then calculate a final checksum of checksums. This example uses sha1sum, but you could just as easily use md5sum instead.
$ cd /some/dir
$ find . -type f \( -exec sha1sum "{}" \; \) | sort -k2,2 | sha1sum
Example
$ cd ~/dir1
$ find . -type f \( -exec sha1sum "{}" \; \) | sort -k2,2 | sha1sum
55e2672f8d6fccff6d83f0bffba1b67aeab87911 -
$ cd ~/dir2
$ find . -type f \( -exec sha1sum "{}" \; \) | sort -k2,2 | sha1sum
55e2672f8d6fccff6d83f0bffba1b67aeab87911 -
So we can see that the 2 trees are identical.
(Note: the find command lists files in the order they appear in the file system. So if you are comparing two directories on different file systems (e.g. ext3 vs. APFS), you need to sort the list before computing the final sha1sum. Added by Xianjun Dong.)
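The sort order itself can also differ between systems with different locales, so a variant of the pipeline that forces a byte-wise sort may be more portable. The sketch below wraps it in a function. The name tree_sum and the demo directories are made up for illustration, and it assumes GNU coreutils and a find that supports -exec ... {} +.

```shell
# tree_sum DIR: print one SHA-1 digest summarizing all regular files under DIR
tree_sum() {
    (
        cd "$1" || exit 1
        # LC_ALL=C makes sort compare bytes, so the order is the same everywhere
        find . -type f -exec sha1sum {} + | LC_ALL=C sort -k2 | sha1sum | cut -d' ' -f1
    )
}

# Hypothetical demo trees with identical contents
workdir=$(mktemp -d)
mkdir -p "$workdir/dir1/sub" "$workdir/dir2/sub"
printf 'a\n' > "$workdir/dir1/sub/file"
printf 'a\n' > "$workdir/dir2/sub/file"

sum1=$(tree_sum "$workdir/dir1")
sum2=$(tree_sum "$workdir/dir2")
if [ "$sum1" = "$sum2" ]; then
    echo "trees match"
else
    echo "trees differ"
fi
```

Because the file names are part of what gets hashed, a renamed file changes the final digest just as a modified one does.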
The short answer: don't test the file size, test the return status of the command. The return status is the only reliable indication of whether the copy succeeded (short of comparing the two files byte by byte, directly or indirectly, which is redundant if the copy succeeded).
Checking the file size is not a very useful way of checking whether a copy succeeded. In some cases, it may be a useful sanity check, for example when you download a file from the web. But here there's a better way.
All Unix commands return a status to indicate whether they succeeded: 0 for success, 1 or more for errors. So check the exit status of cp. cp will normally have printed an error message if it failed, indicating what the error is. In a script, the exit status of the last command is in the magic variable $?.
cp -v traj.trr ~/mysimulation1/
if [ $? -ne 0 ]; then
echo 1>&2 "cp failed due to the error above"
exit 2
fi
Instead of checking whether $? is zero, you can use boolean operators.
cp -v traj.trr ~/mysimulation1/ || exit 2
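Another way to write the same check is to test the command directly in the if, which avoids $? entirely. The function name copy_or_report and the scratch paths below are hypothetical; traj.trr is only a stand-in file created for the demo.

```shell
# copy_or_report SRC DEST: copy a file, complain on stderr if cp fails
copy_or_report() {
    if ! cp -- "$1" "$2"; then
        echo "copy of $1 failed, see the error above" >&2
        return 2
    fi
}

workdir=$(mktemp -d)
printf 'data\n' > "$workdir/traj.trr"   # stand-in for the real file
mkdir "$workdir/mysimulation1"

status_ok=0
copy_or_report "$workdir/traj.trr" "$workdir/mysimulation1/" || status_ok=$?
status_bad=0
copy_or_report "$workdir/traj.trr" "$workdir/no-such-dir/" || status_bad=$?
echo "existing target: $status_ok, missing target: $status_bad"
```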
If you're running a script and want the script to stop if any command fails, run set -e. If any command fails (i.e. returns a non-zero status), the script will exit immediately with the same status as the command.
set -e
…
cp -v traj.trr ~/mysimulation1/
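To see the effect of set -e without aborting your own shell, it can be tried in a child shell. This is only an illustration: the scratch directory and the missing source file are made up so that cp is guaranteed to fail.

```shell
workdir=$(mktemp -d)

status=0
# The child shell stops at the failing cp, so the echo after it never runs
output=$(sh -c 'set -e
cp "$1/missing" "$1/copy" 2>/dev/null
echo "after cp"' sh "$workdir") || status=$?

echo "child exit status: $status, output: '$output'"
```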
As for the reason your copied file was larger, it must be because it was a sparse file. Sparse files are a crude form of compression where blocks containing only null bytes are not stored. When you copy a file, the cp command reads and writes null bytes, so where the original had missing blocks, the copy has blocks full of null bytes. Under Linux, the cp command tries to detect sparse files, but it doesn't always succeed; cp --sparse=always makes it try harder at the expense of a very slight increase in CPU time.
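A small experiment along these lines, assuming GNU cp, stat and truncate, and using hypothetical scratch files: a sparse file is first copied with cat, which writes the zeros out, and then copied again with cp --sparse=always. Whether the block counts actually differ depends on the file system.

```shell
workdir=$(mktemp -d)
truncate -s 1M "$workdir/sparse"             # sparse original
cat "$workdir/sparse" > "$workdir/dense"     # copy with the zeros written out
cp --sparse=always "$workdir/dense" "$workdir/resparsed"   # holes recreated where possible

# The contents are identical in all three files; only the allocation may differ
if cmp "$workdir/sparse" "$workdir/resparsed"; then
    echo "same contents"
fi
stat -c '%n: %s bytes, %b blocks' "$workdir/sparse" "$workdir/dense" "$workdir/resparsed"
```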
More generally, du could return different results due to other forms of compression. Compressed filesystems are rare, though. If you want to know the size of a file as in the number of bytes in the file, as opposed to the number of disk blocks it uses, use ls -l instead of du.