Why is my sorted file bigger?
While your original file has lines that end with \n, your sorted file has \r\n. The addition of the \r is what changes the size.
To illustrate, here's what happens when I run your command on my Linux system:
$ sort < file.txt | uniq > sorted-file.linux.txt
$ ls -l file.txt sorted-file.linux.txt
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
$ wc -l file.txt sorted-file.linux.txt
273882 file.txt
271576 sorted-file.linux.txt
As you can see, the sorted de-duped file is 2306 lines shorter and, consequently, 16227 bytes smaller. Your file, however, is different:
$ wc -l sorted-file.linux.txt sorted-file.txt
271576 sorted-file.linux.txt
271576 sorted-file.txt
The two files have exactly the same number of lines, but:
$ ls -l file.txt sorted-file.linux.txt sorted-file.txt
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
-rw-r--r-- 1 terdon terdon 3213965 Jul 10 12:11 sorted-file.txt
The sorted-file.txt, the one I downloaded from your link, is larger. If we now examine the first line, we can see the extra \r:
$ head -n1 sorted-file.txt | od -c
0000000 a \r \n
0000003
That \r isn't present in the one I created on Linux:
$ head -n1 sorted-file.linux.txt | od -c
0000000 a \n
0000002
If we now remove the \r from your file:
$ tr -d '\r' < sorted-file.txt > new-sorted-file.txt
We get the expected result, a file that is smaller than the original, just like the one I created on my system:
$ ls -l sorted-file.linux.txt new-sorted-file.txt file.txt
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:19 new-sorted-file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
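The numbers above also line up exactly: 3213965 - 2942389 = 271576, which is the line count wc -l reported, so your file carries one extra byte per line. Here is a minimal sketch that reproduces the effect on a throwaway sample. The file names and sample contents are made up for illustration, and awk is just one way to append the \r:

```shell
# Hypothetical sample: three LF-terminated lines in a scratch directory.
tmpdir=$(mktemp -d)
printf 'a\naa\naaa\n' > "$tmpdir/lf.txt"

# Make a CRLF copy by writing \r before each newline.
awk '{ printf "%s\r\n", $0 }' "$tmpdir/lf.txt" > "$tmpdir/crlf.txt"

# $(( ... )) strips the padding some wc implementations add.
lines=$(( $(wc -l < "$tmpdir/lf.txt") ))
lf_bytes=$(( $(wc -c < "$tmpdir/lf.txt") ))
crlf_bytes=$(( $(wc -c < "$tmpdir/crlf.txt") ))

# The CRLF copy should be larger by exactly one byte per line.
echo "lines=$lines extra_bytes=$(( crlf_bytes - lf_bytes ))"

# Same check with the sizes from the ls -l output above.
echo "$(( 3213965 - 2942389 ))"

rm -rf "$tmpdir"
```

On the sample, both values in the first line of output should be equal, just as the size difference between your file and mine equals the line count.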
hexdump reveals it!
$ hexdump -cn 32 file.txt
0000000 a d h d \n a d s l \n a m v b \n a
0000010 o v \n a o w \n a r o b \n a s f a
0000020
$ hexdump -cn 32 my-sorted.txt
0000000 a \n a a \n a a a \n a a d \n a a d
0000010 s \n a a f j e \n a a f j e s \n a
0000020
$ hexdump -cn 32 sorted-file.txt
0000000 a \r \n a a \r \n a a a \r \n a a d \r
0000010 \n a a d s \r \n a a f j e \r \n a a
0000020
Your sorted file is bigger because it uses Windows line endings \r\n (two bytes) instead of Linux line endings \n (one byte).
Could it be that you ran the command above under Windows, using a tool like Cygwin or the new Windows Subsystem for Linux on Windows 10? Or did you maybe run something in Wine?
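If you would rather not read a hex dump, one way to check a file for Windows line endings is to count the lines that end in \r. This is only a sketch: the sample file is a hypothetical stand-in for your sorted-file.txt, and it assumes a POSIX shell with grep and wc available:

```shell
# Hypothetical sample with CRLF line endings.
sample=$(mktemp)
printf 'a\r\naa\r\naaa\r\n' > "$sample"

# Hold a literal carriage return in a variable (works in any POSIX shell).
cr=$(printf '\r')

# Count lines whose last character before the newline is a carriage return.
crlf_lines=$(grep -c "${cr}\$" "$sample")
total_lines=$(( $(wc -l < "$sample") ))

echo "$crlf_lines of $total_lines lines end in CRLF"

rm -f "$sample"
```

If the two counts match, every line in the file has a Windows line ending.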