Java : Read last n lines of a HUGE file

I found the simplest way is to use ReversedLinesFileReader from the Apache Commons IO library. It returns the lines of a file from bottom to top, and you can set n_lines to the number of lines you want.

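import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.ReversedLinesFileReader;

File file = new File("D:\\file_name.xml");
int n_lines = 10;
// Assumes the file is UTF-8; pass the charset that matches your file.
try (ReversedLinesFileReader reader = new ReversedLinesFileReader(file, StandardCharsets.UTF_8)) {
    String line;
    // readLine() returns null once the start of the file is reached,
    // so stop there instead of printing "null" for short files.
    for (int counter = 0; counter < n_lines && (line = reader.readLine()) != null; counter++) {
        System.out.println(line); // lines are printed last-to-first
    }
}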

RandomAccessFile is a good place to start, as described by the other answers. There is one important caveat though.

If your file is not encoded with a one-byte-per-character encoding, the readLine() method is not going to work for you. And readUTF() won't work in any circumstances. (It reads a string in modified UTF-8 preceded by a two-byte length ...)
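To make the readLine() caveat concrete, here is a small sketch (the class and method names are made up for the example). RandomAccessFile.readLine() turns each byte into one char, so a multi-byte UTF-8 character comes back as several wrong characters. For UTF-8 specifically, the original bytes can be recovered by mapping the chars back through ISO-8859-1 and decoding again; this is only a workaround under that assumption and does not help with UTF-16 or UTF-32.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class ReadLineCaveat {
    /** Returns the first line exactly as RandomAccessFile.readLine() produces it. */
    static String rawFirstLine(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            // Each byte becomes one char with the high eight bits set to zero.
            return raf.readLine();
        }
    }

    /** Recovers the text, assuming the bytes in the file were UTF-8. */
    static String redecodeUtf8(String raw) {
        // ISO-8859-1 maps chars 0..255 back to the same byte values.
        byte[] originalBytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }
}
```

With a file whose first line is "héllo" in UTF-8, rawFirstLine returns six chars instead of five, because the two bytes of "é" become two separate characters.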

Instead, you will need to make sure that you look for end-of-line markers in a way that respects the encoding's character boundaries. For fixed length encodings (e.g. flavors of UTF-16 or UTF-32) you need to extract characters starting from byte positions that are divisible by the character size in bytes. For variable length encodings (e.g. UTF-8), you need to search for a byte that must be the first byte of a character.

In the case of UTF-8, the first byte of a character will be 0xxxxxxx or 110xxxxx or 1110xxxx or 11110xxx. Anything else is either a second / third / fourth byte, or an illegal UTF-8 sequence. See The Unicode Standard, Version 5.2, Section 3.9, Table 3-7. This means, as the comment discussion points out, that any 0x0A and 0x0D bytes in a properly encoded UTF-8 stream will represent a LF or CR character. Thus, simply counting the 0x0A and 0x0D bytes is a valid implementation strategy (for UTF-8) if we can assume that the other kinds of Unicode line separator (U+2028, U+2029 and U+0085) are not used. If you can't assume that, the code becomes more complicated.
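As a rough sketch of the UTF-8 rules above (the class and method names are invented for illustration): isLeadByte tests the four first-byte patterns, nextBoundary skips forward from an arbitrary offset to the next byte that can start a character, and countLineFeeds counts 0x0A bytes. Counting only LF is an assumption made here so that CRLF pairs are not counted twice.

```java
final class Utf8Bytes {
    /** True if b matches 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx. */
    static boolean isLeadByte(byte b) {
        int u = b & 0xFF; // treat the byte as unsigned
        return (u & 0x80) == 0x00
            || (u & 0xE0) == 0xC0
            || (u & 0xF0) == 0xE0
            || (u & 0xF8) == 0xF0;
    }

    /** Index of the first lead byte at or after 'from', or buf.length if none. */
    static int nextBoundary(byte[] buf, int from) {
        int i = from;
        while (i < buf.length && !isLeadByte(buf[i])) {
            i++;
        }
        return i;
    }

    /** Counts 0x0A bytes; in well-formed UTF-8 each one is a real LF. */
    static int countLineFeeds(byte[] buf) {
        int count = 0;
        for (byte b : buf) {
            if (b == 0x0A) {
                count++;
            }
        }
        return count;
    }
}
```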

Having identified a proper character boundary, you can then just call new String(...) passing the byte array, offset, count and encoding, and then repeatedly call String.lastIndexOf(...) to count end-of-lines.
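A minimal sketch of that step, with hypothetical names: lastLines decodes the bytes starting at a known character boundary with new String(bytes, offset, count, charset), then startOfLastLines walks backwards with String.lastIndexOf to find where the last n lines begin. It assumes '\n' terminates lines (a '\r' from CRLF files stays attached to the line) and returns -1 / null when the buffer does not contain n complete lines, which is the signal to read a bigger chunk.

```java
import java.nio.charset.Charset;

final class TailDecoder {
    /** Index in text where the last n lines start, or -1 if fewer than n newlines precede them. */
    static int startOfLastLines(String text, int n) {
        int pos = text.length();
        // Ignore a final line terminator so it does not count as an extra empty line.
        if (pos > 0 && text.charAt(pos - 1) == '\n') {
            pos--;
        }
        for (int i = 0; i < n; i++) {
            int newline = text.lastIndexOf('\n', pos - 1);
            if (newline < 0) {
                return -1; // the first line in the buffer may be incomplete
            }
            pos = newline;
        }
        return pos + 1;
    }

    /** Decodes buf[offset, offset + count) and returns the last n lines, or null if there are not enough. */
    static String lastLines(byte[] buf, int offset, int count, Charset charset, int n) {
        String text = new String(buf, offset, count, charset);
        int start = startOfLastLines(text, n);
        return start < 0 ? null : text.substring(start);
    }
}
```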

If you use a RandomAccessFile, you can use length and seek to get to a specific point near the end of the file and then read forward from there.

If you find there weren't enough lines, back up from that point and try again. Once you've figured out where the Nth last line begins, you can seek to there and just read-and-print.

An initial best-guess assumption can be made based on your data properties. For example, if it's a text file, the lines may well average no more than 132 characters so, to get the last five lines, start 660 characters before the end. Then, if you were wrong, try again at 1320 (you can even use what you learned from the last 660 characters to adjust that - example: if those 660 characters were just three lines, the next try could be 660 / 3 * 5, plus maybe a bit extra just in case).
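The seek-and-retry idea might look roughly like this. It is a sketch under several assumptions, not a definitive implementation: the file is UTF-8, lines end in LF (an optional CR before it is stripped), the window is simply doubled on each retry rather than adjusted from the observed line lengths, and the names SeekTail, tail and guessBytesPerLine are made up for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SeekTail {
    /** Returns up to n last lines of the file, oldest first. */
    static List<String> tail(String path, int n, int guessBytesPerLine) throws IOException {
        List<String> lines = new ArrayList<>();
        if (n <= 0) return lines;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            if (length == 0) return lines;
            long window = Math.max(1L, (long) n * guessBytesPerLine); // initial guess
            while (true) {
                long start = Math.max(0L, length - window);
                if (length - start > Integer.MAX_VALUE - 8) {
                    throw new IOException("last " + n + " lines are too large to buffer");
                }
                byte[] buf = new byte[(int) (length - start)];
                raf.seek(start);
                raf.readFully(buf); // read forward from the guessed position

                int end = buf.length;
                if (buf[end - 1] == '\n') end--; // ignore the final terminator
                int found = 0;
                int lineStart = -1;
                for (int i = end - 1; i >= 0; i--) {
                    if (buf[i] == '\n' && ++found == n) {
                        lineStart = i + 1; // byte right after the n-th LF from the end
                        break;
                    }
                }
                if (lineStart < 0 && start > 0) {
                    window *= 2; // not enough lines: back up further and retry
                    continue;
                }
                if (lineStart < 0) lineStart = 0; // whole file has fewer than n lines

                // lineStart follows an LF byte (or is offset 0), so it is a character boundary.
                String text = new String(buf, lineStart, end - lineStart, StandardCharsets.UTF_8);
                for (String line : text.split("\n", -1)) {
                    lines.add(line.endsWith("\r") ? line.substring(0, line.length() - 1) : line);
                }
                return lines;
            }
        }
    }
}
```

For example, tail("app.log", 5, 132) first reads the last 660 bytes and only goes further back if those bytes hold fewer than five complete lines.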

The ReversedLinesFileReader can be found in the Apache Commons IO Java library.

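    int n_lines = 1000;
    StringBuilder result = new StringBuilder(); // avoids quadratic String concatenation
    // try-with-resources closes the reader; UTF-8 is an assumption, use your file's charset.
    try (ReversedLinesFileReader reader =
             new ReversedLinesFileReader(new File(path), StandardCharsets.UTF_8)) {
        for (int i = 0; i < n_lines; i++) {
            String line = reader.readLine();
            if (line == null) // reached the start of the file
                break;
            // readLine() strips the line terminator, so add one back.
            // Lines are appended last-to-first.
            result.append(line).append('\n');
        }
    }
    return result.toString();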
