Reading files with a BOM in Go

No standard way, IIRC (and the standard library would really be a wrong layer to implement such a check in) so here are two examples of how you could deal with it yourself.

One is to use a buffered reader above your data stream:

import (
    "bufio"
    "os"
    "log"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer closeOrDie(fd)
    br := bufio.NewReader(fd)
    r, _, err := br.ReadRune()
    if err != nil {
        log.Fatal(err)
    }
    if r != '\uFEFF' {
        br.UnreadRune() // Not a BOM -- put the rune back
    }
    // Now work with br as you would do with fd
    // ...
}

Another approach, which works with objects implementing the io.Seeker interface, is to read the first three bytes and if they're not BOM, io.Seek() back to the beginning, like in:

import (
    "os"
    "log"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer closeOrDie(fd)
    bom := [3]byte
    _, err = io.ReadFull(fd, bom[:])
    if err != nil {
        log.Fatal(err)
    }
    if bom[0] != 0xef || bom[1] != 0xbb || bom[2] != 0xbf {
        _, err = fd.Seek(0, 0) // Not a BOM -- seek back to the beginning
        if err != nil {
            log.Fatal(err)
        }
    }
    // The next read operation on fd will read real data
    // ...
}

This is possible since instances of *os.File (what os.Open() returns) support seeking and hence implement io.Seeker. Note that that's not the case for, say, Body reader of HTTP responses since you can't "rewind" it. bufio.Buffer works around this feature of non-seekable streams by performing some buffering (obviously) — that's what allows you yo UnreadRune() on it.

Note that both examples assume the file we're dealing with is encoded in UTF-8. If you need to deal with other (or unknown) encoding, things get more complicated.

I thought I would add here the way to strip the Byte Order Mark sequence from a string -- rather than messing around with bytes directly (as shown above).

package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "\uFEFF is a string that starts with a Byte Order Mark"
    fmt.Printf("before: '%v' (len=%v)\n", s, len(s))

    ByteOrderMarkAsString := string('\uFEFF')

    if strings.HasPrefix(s, ByteOrderMarkAsString) {

        fmt.Printf("Found leading Byte Order Mark sequence!\n")
        
        s = strings.TrimPrefix(s, ByteOrderMarkAsString)
    }
    fmt.Printf("after: '%v' (len=%v)\n", s, len(s)) 
}

Other "strings" functions should work as well.

And this is what prints out:

before: ' is a string that starts with a Byte Order Mark (len=50)'
Found leading Byte Order Mark sequence!
after: ' is a string that starts with a Byte Order Mark (len=47)'

Cheers!

There's no standard way of doing this in the Go core packages. Follow the Unicode standard.

Unicode Byte Order Mark (BOM) FAQ

You can use utfbom package. It wraps io.Reader, detects and discards BOM as necessary. It can also return the encoding detected by the BOM.

Reading files with a BOM in Go

Tags:

Unicode

Byte Order Mark

Go

Related

Recent Posts