What are the end bytes of the *.docx file format?
A .docx file is just a .zip file, so its end bytes are the end bytes of a Zip file.
The end of a Zip file is indicated by the end of central directory record (EOCD). The EOCD is at least 22 bytes long, but its length is variable because it can end with a comment of up to 65535 bytes. The comment length and the comment are the last two fields in the EOCD layout below:
+---------+--------+--------------------------------------------------------------------+
| Offset  | Bytes  | Description                                                        |
+---------+--------+--------------------------------------------------------------------+
| 0       | 4      | End of central directory signature = 0x06054b50                    |
| 4       | 2      | Number of this disk                                                |
| 6       | 2      | Disk where central directory starts                                |
| 8       | 2      | Number of central directory records on this disk                   |
| 10      | 2      | Total number of central directory records                          |
| 12      | 4      | Size of central directory (bytes)                                  |
| 16      | 4      | Offset of start of central directory, relative to start of archive |
| 20      | 2      | Comment length (n)                                                 |
| 22      | n      | Comment                                                            |
+---------+--------+--------------------------------------------------------------------+
Table from Wikipedia » Zip (file format) » End of central directory record (EOCD)
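You can get the end of a Zip file by looking for 0x06054b50 (the signature at the beginning of the EOCD), then skipping the 16 bytes that follow the 4-byte signature. Set the next two bytes (the comment length, at offset 20) to 0x0000 to ignore the comment. The file ends right after those two bytes, and you should now have the end of a valid Zip file.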
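The steps above can be sketched in Python. This is only a minimal sketch, not a recovery tool: the function name `trim_to_eocd` and its `start` parameter are hypothetical, it assumes the data is unfragmented and that `start` points at the beginning of the archive, and it simply takes the first signature it finds, so a stray copy of the signature bytes inside the file data would mislead it.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"  # 0x06054b50 as it is stored on disk
EOCD_FIXED_SIZE = 22            # size of an EOCD with an empty comment


def trim_to_eocd(data: bytes, start: int = 0) -> bytes:
    """Cut data[start:] off at the end of the first EOCD found after
    `start`, forcing the comment length to zero (hypothetical helper)."""
    pos = data.find(EOCD_SIGNATURE, start)
    if pos == -1:
        raise ValueError("no end of central directory signature found")
    if pos + EOCD_FIXED_SIZE > len(data):
        raise ValueError("EOCD record is truncated")
    # Keep the 4-byte signature plus the 16 bytes of fields after it,
    # then write 0x0000 as the comment length so the archive ends here.
    return data[start:pos + 20] + struct.pack("<H", 0)
```

Anything after the comment length field, including an existing comment, is discarded, which matches the "set the next two bytes to 0x0000" step.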
Note: This does not take file system fragmentation into account. Your recovery approach will not work if the .docx/.zip file was fragmented on the disk, because the data between the signatures you're finding would not be contiguous. You would need some information from the file system in order to piece together fragmented files; beginning and end signatures don't have this information.
PhotoRec is a tool I've used before that has some tricks for piecing together fragmented files. Crucially for you, PhotoRec has built-in support for Zip files, so you might want to try TestDisk/PhotoRec if your current signature search strategy isn't working for you.
Deltik's answer is correct. Some potentially helpful information:
The sequence of bytes for the end of central directory signature will actually appear as 50 4b 05 06 (reverse order, because the value is stored little-endian) when viewed with a hex tool such as xxd, or in any byte-addressed sequence.
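A quick way to see the byte order is to pack the signature value as a little-endian 32-bit integer. This is just an illustration; the variable names are arbitrary.

```python
import struct

signature = 0x06054B50
on_disk = struct.pack("<I", signature)  # little-endian unsigned 32-bit

print(on_disk.hex())  # 504b0506
print(on_disk)        # b'PK\x05\x06'
```

The first two bytes are the ASCII letters "PK", which is why searching for `PK\x05\x06` is the usual way to look for this record.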
In a valid Office Open XML file, such as a .docx file, there is never an end-of-central-directory comment. (See ECMA-376, Part 2, page 76: a "ZIP file comment" should not be produced. However, consumers are supposed to support reading a file containing such a comment anyway.)
Also, multi-disk archives are not supported (see page 75), so the "Number of this disk" field and the "Disk where central directory starts" field are always 0. Moreover, the "Number of central directory records on this disk" and the "Total number of central directory records" fields should be equal.
All told, the final 22 bytes of any .docx file should always have the form
50 4b 05 06 00 00 00 00 ## ## ## ## ## ## ## ## ## ## ## ## 00 00
| signature |disk |CD   |num. |num. |size of CD |CD offset  |comment
|           |num. |disk |recs |recs |           |           |length
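The form above can be checked mechanically. The following is a minimal sketch, assuming you already have the last 22 bytes of a candidate file; the function name `looks_like_docx_tail` is hypothetical, and a match only means the tail is consistent with the constraints listed above, not that the whole file is a valid .docx.

```python
import struct

# signature, four 16-bit fields, two 32-bit fields, 16-bit comment length
EOCD_FORMAT = "<4sHHHHIIH"
assert struct.calcsize(EOCD_FORMAT) == 22


def looks_like_docx_tail(tail: bytes) -> bool:
    """Check 22 bytes against the expected EOCD form (hypothetical helper)."""
    if len(tail) != 22:
        return False
    (sig, disk_no, cd_disk, recs_here, recs_total,
     cd_size, cd_offset, comment_len) = struct.unpack(EOCD_FORMAT, tail)
    return (
        sig == b"PK\x05\x06"
        and disk_no == 0             # multi-disk archives are not supported
        and cd_disk == 0
        and recs_here == recs_total  # both record counts should be equal
        and comment_len == 0         # no ZIP file comment
    )
```

To use it on a file, open the file in binary mode, seek to 22 bytes before the end (for example `f.seek(-22, os.SEEK_END)`), read the remaining bytes, and pass them to the function.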