Linux file command classifying files
file uses several kinds of test:
1: If file does not exist, cannot be read, or its file status could not be determined, the output shall indicate that the file was processed, but that its type could not be determined.
This will be output like cannot open file: No such file or directory.
2: If the file is not a regular file, its file type shall be identified. The file types directory, FIFO, socket, block special, and character special shall be identified as such. Other implementation-defined file types may also be identified. If file is a symbolic link, by default the link shall be resolved and file shall test the type of file referenced by the symbolic link. (See the -h and -i options below.)
This will be output like .: directory and /dev/sda: block special. The format for this and the previous point is partially defined by POSIX, so you can rely on certain strings being in the output.
3: If the length of file is zero, it shall be identified as an empty file.
This is foo: empty.
4: The file utility shall examine an initial segment of file and shall make a guess at identifying its contents based on position-sensitive tests. (The answer is not guaranteed to be correct; see the -d, -M, and -m options below.)
5: The file utility shall examine file and make a guess at identifying its contents based on context-sensitive default system tests. (The answer is not guaranteed to be correct.)
These two use magic number identification and are the most interesting part of the command. A magic number is a special sequence of bytes in a known place in a file that identifies its type. Traditionally that place is the first two bytes, but the term has been extended to include longer strings and other locations. See this other question for more detail about magic numbers in the file command.
The file command has a database of these numbers and what type they correspond to; that database is usually in /usr/share/misc/magic (often compiled into magic.mgc; the exact location varies by system), and maps file contents to MIME types. The output there (often part of file -i if you don't get it by default) will be a defined media type or an extension. "Context-sensitive tests" use the same sort of approach, but are a bit fuzzier. None of these are guaranteed to be right, but they're intended to be good guesses.
file also has a database mapping those types to names, by which it will know that a file it has identified as application/pdf can be described as a PDF document. Those human-readable names may be localised to another language too. These will always be some high-level description of the file type in a way a person will understand, rather than a machine.
The majority of the different outputs you can get will come from these stages. You can look at the magic file for a list of supported types and how they're identified: my system knows 376 different types. The names given and the types supported are determined by your system packaging and configuration, so your system may support more or fewer than mine, but there are generally a lot of them. libmagic also includes additional hard-coded tests.
6: The file shall be identified as a data file.
This is foo: data, which appears when file failed to figure out anything at all about the file.
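The ordering of the six steps quoted above can be sketched in Python. This is only an illustration of the control flow, not how file is actually implemented: the function name classify, the content_test callback standing in for steps 4 and 5, and the 4096-byte read size are all assumptions made for the example.

```python
import os
import stat

# Labels loosely follow the example outputs quoted in the text above.
NON_REGULAR_TYPES = [
    (stat.S_ISDIR, "directory"),
    (stat.S_ISFIFO, "fifo"),
    (stat.S_ISSOCK, "socket"),
    (stat.S_ISBLK, "block special"),
    (stat.S_ISCHR, "character special"),
]


def classify(path, content_test=None):
    """Describe path using a simplified version of the six-step order."""
    # Step 1: the file status cannot be determined (missing file, etc.).
    try:
        status = os.stat(path)  # follows symbolic links, like file's default
    except OSError as exc:
        return f"cannot open ({exc.strerror})"

    # Step 2: non-regular files are reported by their type alone.
    for is_type, label in NON_REGULAR_TYPES:
        if is_type(status.st_mode):
            return label

    # Step 3: zero-length regular files.
    if status.st_size == 0:
        return "empty"

    # Steps 4 and 5: content-based guesses, delegated to a hypothetical callback.
    if content_test is not None:
        try:
            with open(path, "rb") as handle:
                head = handle.read(4096)  # arbitrary "initial segment" size
        except OSError as exc:
            return f"cannot open ({exc.strerror})"
        guess = content_test(head)
        if guess:
            return guess

    # Step 6: nothing matched.
    return "data"
```

For example, classify(".") returns "directory", and a non-empty file with no content_test supplied falls through to "data".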
There are also other little tags that can appear. An executable (+x) file will include "executable" in the output, usually comma-separated. The file implementation may also know extra things about some file formats to be able to describe additional points about them, as in your "PDF document, version 1.4".
Man pages are usually terse references, not introductions. Start with the Wikipedia page.
file looks only at the file content, not at the file name. (It also looks at some file metadata such as the file type: directory, symbolic link, named pipe, etc. But in the cases you're interested in, it's the content that matters.)
file typically guesses the format of a file by looking at the first few bytes and comparing them with a built-in table of magic numbers. For example, if the file begins with %PDF, then file reports “PDF document” (and goes digging further to report the minimum version). For file types that don't start with magic numbers, it contains heuristics, e.g. report “ASCII text” if the first few bytes are all in the printable ASCII range.
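As a rough illustration of the magic-number lookup and the ASCII fallback described above, here is a minimal Python sketch. The table entries, the description strings, the function name guess_from_content, and the choice to count tab, newline and carriage return as text are assumptions for the example; the real table used by file is far larger and more subtle.

```python
import re

# (offset, magic bytes, description): a tiny illustrative table,
# not the real database used by file.
MAGIC_TABLE = [
    (0, b"%PDF", "PDF document"),
    (0, b"\xff\xd8\xff", "JPEG image data"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (257, b"ustar", "POSIX tar archive"),  # a magic number not at offset 0
]

# Printable ASCII plus tab, newline and carriage return (an assumption).
TEXT_BYTES = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def guess_from_content(head: bytes) -> str:
    """Guess a description from the first bytes of a file."""
    for offset, magic, description in MAGIC_TABLE:
        if head[offset:offset + len(magic)] == magic:
            if description == "PDF document":
                # Dig a little further for the version, e.g. "%PDF-1.4".
                match = re.match(rb"%PDF-(\d+\.\d+)", head)
                if match:
                    description += ", version " + match.group(1).decode("ascii")
            return description
    # No magic number matched: fall back to a simple text heuristic.
    if head and all(byte in TEXT_BYTES for byte in head):
        return "ASCII text"
    return "data"
```

Calling guess_from_content(b"%PDF-1.4\n") gives "PDF document, version 1.4", while plain printable input gives "ASCII text".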
The output of file is fragile: it can vary from unix variant to unix variant and from version to version. On Linux, Cygwin and *BSD, the file command supports an option -i which produces predictable output in the form of a MIME media type (IANA manages the list of standard media types). There aren't as many details and the output is less human-friendly, but it is predictable and computer-friendly.
$ file -i somefile.csv
somefile.csv: text/plain; charset=us-ascii
$ file -i somefile.jpg
somefile.jpg: image/jpeg; charset=binary
$ file -i somefile.pdf
somefile.pdf: application/pdf; charset=binary
Use file --mime-type if you only want the MIME type itself without encoding information, e.g. application/pdf. Pass the option -b if you don't want to display the file name at the beginning of the line.
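Because the -i output is predictable, it is easy to consume from a script. The following Python sketch parses one output line of the shape shown above; the function name parse_mime_line is hypothetical, and the sketch assumes each line has the form "name: type/subtype; charset=value".

```python
def parse_mime_line(line: str):
    """Split one line of `file -i` output into (name, media_type, charset)."""
    # Split on the last ": " so file names containing ": " still work.
    name, separator, rest = line.rstrip("\n").rpartition(": ")
    if not separator:
        raise ValueError(f"unexpected line: {line!r}")

    # The media type comes first, followed by ";"-separated parameters.
    media_type, *parameters = [part.strip() for part in rest.split(";")]
    charset = None
    for parameter in parameters:
        key, _, value = parameter.partition("=")
        if key.strip() == "charset":
            charset = value.strip()
    return name.rstrip(), media_type, charset
```

With the sample output above, parse_mime_line("somefile.pdf: application/pdf; charset=binary") returns ("somefile.pdf", "application/pdf", "binary").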
I suggest reading the answer from here. Some excerpts from it:
From the man page of the file command:
The file command actually performs three tests to determine the file type.
First test
The filesystem tests are based on examining the return from a stat(2) system call.
Second test
The magic number tests are used to check for files with data in particular fixed formats.
Third test
The language tests look for particular strings (cf names.h) that can appear anywhere in the first few blocks of a file. For example, the keyword .br indicates that the file is most likely a troff(1) input file, just as the keyword struct indicates a C program.
The output of the file command is based on the first of these tests that succeeds.
Now, assuming the C++ program starts like this, and the third test succeeds,
#include <iostream>
bla
bla
Per the third test, the keyword #include marks the file as a C program, even though we have a C++ program in hand. Now, when I check,
$ file example.cpp
example.cpp: ASCII C program text
Now, object-oriented constructs such as classes are specific to C++. Let us create a file that uses one.
I start my C++ program as,
class something
{
}
bla
bla
Now, when I issue
$ file example.cpp
The output is,
example.cpp: ASCII C++ program text
This basically explains how the file command works on similar files (in this example, C and C++ programs are treated alike until we use the object-oriented features specific to C++).
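The behaviour seen in the two experiments can be mimicked with a toy keyword scan in Python. The rules, their order, the 4096-byte scan limit and the name language_guess are assumptions chosen to reproduce the examples above; the real language tests in file are more elaborate and differ between versions.

```python
import re

# Ordered keyword rules: more specific hints (C++) come before generic ones (C).
# These patterns are illustrative only, not the rules file really uses.
KEYWORD_RULES = [
    (re.compile(rb"\bclass\s+\w+"), "C++ program text"),
    (re.compile(rb"#include\b"), "C program text"),
    (re.compile(rb"\bstruct\b"), "C program text"),
    (re.compile(rb"^\.br\b", re.MULTILINE), "troff input text"),
]


def language_guess(head: bytes, scan_limit: int = 4096):
    """Return a description for the first matching rule, or None."""
    window = head[:scan_limit]  # keywords may appear anywhere in this window
    for pattern, description in KEYWORD_RULES:
        if pattern.search(window):
            return description
    return None
```

Under these toy rules, a file starting with #include <iostream> is reported as C program text, and adding a class declaration flips the guess to C++ program text, matching the two runs shown above.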