How do I remove the first 300 million lines from a 700 GB txt file on a system with 1 TB disk space?

Removing the first n lines (or bytes) can be done in-place using dd (or, alternatively, using loop devices). It does not need a temporary file and there is no size limit; however, it is dangerous because there is no record of progress, and any error leaves you with a broken file.

Example: Create a sample file with 1000 lines:

$ seq 1 1000 > 1000lines.txt
$ head -n 3 1000lines.txt
1
2
3
$ tail -n 3 1000lines.txt
998
999
1000

We want to remove the first 300 lines. How many bytes does that correspond to?

$ stat -c %s 1000lines.txt
3893 # total bytes
$ head -n 300 1000lines.txt | wc -c
1092 # first 300 lines bytes
$ echo $((3893-1092))
2801 # target filesize after removal

The file is 3893 bytes, we want to remove the first 1092 bytes, leaving us with a new file of 2801 bytes.

To remove these bytes, we use GNU dd with conv=notrunc; otherwise, since the input and output are the same file, dd would truncate it to zero length before any data could be copied out of it:

$ dd conv=notrunc iflag=skip_bytes skip=1092 if=1000lines.txt of=1000lines.txt
5+1 records in
5+1 records out
2801 bytes (2.8 kB, 2.7 KiB) copied, 8.6078e-05 s, 32.5 MB/s

This shifts the remaining lines to the start of the file, but the file keeps its old size, so the last 1092 bytes are now stale duplicates. The file still needs to be truncated:

$ truncate -s 2801 1000lines.txt

This reduces the file to its final size, removing the duplicated lines at the end of the file.

The result:

$ stat -c %s 1000lines.txt 
2801

$ head -n 3 1000lines.txt
301
302
303

$ tail -n 3 1000lines.txt
998
999
1000

The process for a larger file is similar. You may need to set a larger block size for better performance (the block-size option for dd is bs).
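
For the 700 GB file the same two commands apply. Here is a minimal sketch, assuming the file is named file.txt, bs=64K as an arbitrary larger block size, and that $OFFSET already holds the byte offset of line 300,000,001 (how to obtain it is shown below):

$ SIZE=$(stat -c %s file.txt)        # original size in bytes
$ dd conv=notrunc iflag=skip_bytes skip="$OFFSET" bs=64K if=file.txt of=file.txt
$ truncate -s $((SIZE - OFFSET)) file.txt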

The main issue is determining the correct byte offset for the target line number. In general this can only be done by reading and counting, so with this method the entire file is read at least once even though a huge chunk of it is being discarded.
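
For example, a single counting pass over the first 300 million lines gives the offset used in the sketch above (this pass alone will take a while on a 700 GB file):

$ OFFSET=$(head -n 300000000 file.txt | wc -c)   # bytes occupied by the first 300 million lines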


If you have enough free space to compress the file, compression should free a significant amount of space and give you room for the remaining operations. You can try this:

gzip file && zcat file.gz | tail -n +300000001 | gzip > newFile.gz

That will first gzip the original input file (file) to create file.gz, deleting the uncompressed original once it succeeds. Then, you zcat the newly created file.gz, pipe it through tail -n +300000001 to remove the first 300 million lines, and compress the result as newFile.gz to save disk space. The && ensures that you only continue if the gzip operation was successful (it will fail if you run out of space).
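
If you eventually need the result as plain text again, a possible follow-up (using the file names from the command above) is to spot-check newFile.gz, drop the full compressed copy to reclaim space, and decompress once there is room:

$ zcat newFile.gz | head -n 3   # spot-check the first remaining lines
$ rm file.gz                    # reclaim the space held by the full compressed copy
$ gunzip newFile.gz             # recreates newFile as plain text, if there is enough space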

Note that text files are very compressible. For example, I created a test file using seq 400000000 > file, which prints the numbers from 1 to 400,000,000, resulting in a 3.7G file. When I compressed it using the commands above, file.gz was only 849M and the newFile.gz I created only 213M.


On some filesystems, such as ext4 or XFS, you can instead use the fallocate() system call with the FALLOC_FL_COLLAPSE_RANGE flag. It removes a range of blocks from the file in place, without rewriting the rest of the data, but the offset and length of the collapsed range must be multiples of the filesystem block size, so only a block-aligned prefix can be dropped this way.
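
A sketch using the fallocate(1) utility from util-linux; $OFFSET is the byte offset from above, and the block-size lookup via stat is an assumption worth verifying on your system:

$ BLK=$(stat -fc %S file.txt)         # filesystem block size (often 4096)
$ ALIGNED=$(( OFFSET / BLK * BLK ))   # largest block-aligned length not past the offset
$ fallocate --collapse-range --offset 0 --length "$ALIGNED" file.txt

The remaining unaligned bytes in front of the target line would still need to be shifted out with the dd method above.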