du gives two different results for the same file

You really should use something like md5sum or sha1sum to check integrity.

If you really want to compare sizes, use ls -l or du -b.
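A minimal sketch of the checksum approach for a single pair of files. The helper name same_checksum is made up for illustration; sha1sum is assumed to be available (GNU coreutils), and md5sum could be swapped in the same way.

```shell
# same_checksum FILE1 FILE2: succeed if both files are readable and have the same SHA-1.
# (same_checksum is a hypothetical helper name, not a standard command.)
same_checksum() {
  # Refuse to compare if either file cannot be read.
  [ -r "$1" ] && [ -r "$2" ] || return 1
  # Reading from stdin keeps the file name out of sha1sum's output,
  # so the two output lines are equal exactly when the hashes are equal.
  [ "$(sha1sum < "$1")" = "$(sha1sum < "$2")" ]
}

# Example call (paths are placeholders):
#   same_checksum original.dat copy.dat && echo "contents match"
```

Unlike du, the result does not depend on how either file system happens to store the data.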

The du utility normally only shows the disk usage of the file, i.e. how much of the file system is used by it. This value totally depends on the backing file system and other factors like sparse files.

Example:

$ truncate -s 512M foo
$ cat foo >bar
$ ls -l foo bar
-rw-r--r-- 1 michas users 536870912 23. Dez 00:06 bar
-rw-r--r-- 1 michas users 536870912 23. Dez 00:03 foo
$ du foo bar
0       foo
524288  bar
$ du -b foo bar
536870912       foo
536870912       bar

We have two files, both containing 512 MiB of zeros. The first is stored sparsely and uses no disk space, while the second stores every byte explicitly on disk. Same contents, but completely different disk usage.

The -b option might be good for you:

   -b, --bytes
          equivalent to '--apparent-size --block-size=1'

   --apparent-size
          print apparent sizes, rather than disk usage; although the apparent
          size is  usually  smaller,  it  may  be  larger  due  to  holes  in
          ('sparse')  files, internal fragmentation, indirect blocks, and the
          like
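To see both numbers side by side for a single file, stat can report the apparent size and the allocated blocks directly. This is a sketch assuming GNU coreutils (truncate, stat -c); the file name sparse.img is made up for illustration.

```shell
# Assumes GNU coreutils; sparse.img is a hypothetical file name.
tmpdir=$(mktemp -d)
truncate -s 1M "$tmpdir/sparse.img"          # 1 MiB file with no data written to it

apparent=$(stat -c %s "$tmpdir/sparse.img")  # size in bytes, the number ls -l shows
blocks=$(stat -c %b "$tmpdir/sparse.img")    # number of blocks actually allocated
unit=$(stat -c %B "$tmpdir/sparse.img")      # size in bytes of each block counted by %b

echo "apparent size: $apparent bytes"
echo "allocated:     $((blocks * unit)) bytes"

rm -rf "$tmpdir"
```

The apparent size is what du -b reports; the allocated figure is what plain du is based on, and it varies with the file system.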

This is a common problem when you put the same data on two different disks. You'll want to run the du command with an additional switch, assuming your version has it, which it should given that these are Linux nodes.

The switch?

   --apparent-size
          print  apparent  sizes,  rather  than  disk  usage;  although the 
          apparent size is usually smaller, it may be larger due to holes in
          ('sparse') files, internal fragmentation, indirect blocks, and the 
          like

Example

$ du -sh --apparent-size /home/sam/scsconfig.log ~/scsconfig.log 
93K /home/sam/scsconfig.log
93K /root/scsconfig.log

The two files above live on different filesystems: /root is on a local disk, while /home/sam is an NFS share from my NAS.

$ df -h . /home/sam
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      222G  118G   92G  57% /
mulder:/export/raid1/home/sam
                      917G  566G  305G  65% /home/sam

So what's up?

This confuses a lot of people, but remember that files stored on a disk consume whole blocks, even if they only fill part of the last one. When you run du without --apparent-size, you get the amount of block space allocated on the disk, not the number of bytes the file(s) actually contain.
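The block rounding is easy to observe with a one-byte file. This is a sketch assuming GNU du; the file name tiny is hypothetical, and the on-disk figure depends on the file system's block size (some file systems store very small files inline and may report 0).

```shell
# Assumes GNU coreutils; "tiny" is a hypothetical file name.
tmpdir=$(mktemp -d)
printf 'x' > "$tmpdir/tiny"                 # exactly one byte of data

bytes=$(du -b "$tmpdir/tiny" | cut -f1)     # apparent size in bytes
usage=$(du -k "$tmpdir/tiny" | cut -f1)     # disk usage in KiB, counted in whole blocks

echo "apparent: $bytes byte(s), on disk: $usage KiB"

rm -rf "$tmpdir"
```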

Using a checksum instead?

This is likely a better option if you're concerned about comparing 2 trees of files. You can use this command to calculate a checksum for all the files, and then calculate a final checksum of checksums. This example uses sha1sum but you could just as easily use md5sum instead.

$ cd /some/dir
$ find . -type f \( -exec sha1sum "{}" \; \) | sort -k2,2 | sha1sum

Example

$ cd ~/dir1
$ find . -type f \( -exec sha1sum "{}" \; \) | sort -k2,2 | sha1sum
55e2672f8d6fccff6d83f0bffba1b67aeab87911  -

$ cd ~/dir2
$ find . -type f \( -exec sha1sum "{}" \; \) | sort -k2,2 | sha1sum
55e2672f8d6fccff6d83f0bffba1b67aeab87911  -

So we can see that the 2 trees are identical.

(Note: find lists files in the order in which the file system returns them. So if you are comparing two directories on different file systems (e.g. ext3 vs. APFS), you need to sort the list before the final sha1sum. Added by Xianjun Dong.)
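The same idea can be wrapped in a pair of small functions, so that the sort order does not depend on the locale or the file system. This is only a sketch: checksum_list and compare_trees are made-up helper names, and sha1sum from GNU coreutils is assumed.

```shell
# checksum_list DIR: print "hash  ./path" for every regular file under DIR,
# sorted by path in the C locale so the order is the same on every system.
# (checksum_list and compare_trees are hypothetical helper names.)
checksum_list() (
  cd "$1" && find . -type f -exec sha1sum {} + | LC_ALL=C sort -k2
)

# compare_trees DIR1 DIR2: succeed if both trees contain the same paths
# with the same contents. The body runs in a subshell, so a and b stay local.
compare_trees() (
  a=$(checksum_list "$1") && b=$(checksum_list "$2") && [ "$a" = "$b" ]
)

# Example call (directories are placeholders):
#   compare_trees ~/dir1 ~/dir2 && echo "trees match"
```

Saving the two lists to files and running diff on them shows which files differ, which a single checksum of checksums cannot tell you.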


The short answer: don't test the file size, test the return status of the command. The return status is the only reliable indication of whether the copy succeeded (short of comparing the two files byte by byte, directly or indirectly, which is redundant if the copy succeeded).

Checking the file size is not a very useful way of checking whether a copy succeeded. In some cases, it may be a useful sanity check, for example when you download a file from the web. But here there's a better way.

All Unix commands return a status to indicate whether they succeeded: 0 for success, 1 or more for errors. So check the exit status of cp. cp will normally have printed an error message if it failed, indicating what the error is. In a script, the exit status of the last command is in the magic variable $?.

cp -v traj.trr ~/mysimulation1/
if [ $? -ne 0 ]; then
  echo "cp failed due to the error above" >&2
  exit 2
fi

Instead of checking whether $? is zero, you can use boolean operators.

cp -v traj.trr ~/mysimulation1/ || exit 2

If you're running a script and want the script to stop if any command fails, run set -e. If any command fails (i.e. returns a non-zero status), the script will exit immediately with the same status as the command.

set -e
…
cp -v traj.trr ~/mysimulation1/
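The effect of set -e can be seen without touching the current shell by running a tiny script in a child sh process. In this sketch the paths under /nonexistent are placeholders chosen so that cp fails.

```shell
# Run a two-command script in a child shell; the cp is expected to fail
# because the (placeholder) source path does not exist.
status=0
output=$(sh -c 'set -e; cp /nonexistent/src /nonexistent/dst 2>/dev/null; echo "not reached"') || status=$?

# With set -e the script stops at the failing cp, so nothing is printed
# and the script's exit status is non-zero.
echo "output: '$output', exit status: $status"
```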

As for the reason your copied file was larger, it is most likely because the original was a sparse file. Sparse files are a crude form of compression where blocks containing only null bytes are not stored. When you copy a file, the cp command reads and writes null bytes, so where the original had missing blocks, the copy has blocks full of null bytes. Under Linux, the cp command tries to detect sparse files, but it doesn't always succeed; cp --sparse=always makes it try harder at the expense of a very slight increase in CPU time.
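A small experiment makes the difference visible. This is a sketch assuming GNU coreutils; the file names are hypothetical, and the disk usage figures depend on whether the file system supports sparse files at all.

```shell
# Assumes GNU coreutils (truncate, cp --sparse, du, stat); file names are hypothetical.
tmpdir=$(mktemp -d)
truncate -s 8M "$tmpdir/orig"                      # sparse source: 8 MiB, no data written
cat "$tmpdir/orig" > "$tmpdir/dense"               # cat writes every null byte explicitly
cp --sparse=always "$tmpdir/orig" "$tmpdir/copy"   # cp recreates holes where it finds null blocks

# Disk usage in KiB; these numbers depend on the file system.
du -k "$tmpdir/orig" "$tmpdir/dense" "$tmpdir/copy"

# The byte counts are identical regardless of how the blocks are stored.
orig_bytes=$(stat -c %s "$tmpdir/orig")
dense_bytes=$(stat -c %s "$tmpdir/dense")
copy_bytes=$(stat -c %s "$tmpdir/copy")
echo "bytes: orig=$orig_bytes dense=$dense_bytes copy=$copy_bytes"

rm -rf "$tmpdir"
```

On a file system with sparse file support, one would expect orig and copy to show little or no disk usage while dense shows roughly the full 8 MiB.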

More generally, du could return different results due to other forms of compression. Compressed filesystems are rare, though. If you want to know the size of a file as in the number of bytes in the file, as opposed to the number of disk blocks it uses, use ls -l instead of du.
