What are the performance implications for millions of files in a modern file system?
Solution 1:
The reason one would create this sort of directory structure is that filesystems must locate a file within a directory, and the larger the directory is, the slower that operation.
How much slower depends on the filesystem design.
The ext4 filesystem uses a B-tree to store directory entries. A lookup on this table is expected to take O(log n) time, which most of the time is less than the naive linear table that ext3 and previous filesystems used (and when it isn't, the directory is too small for it to really matter).
The XFS filesystem uses a B+tree instead. The advantage of this over a hash table or B-tree is that any node may have multiple children b, where in XFS b varies and can be as high as 254 (or 19 for the root node; and these numbers may be out of date). This gives you a time complexity of O(logb n), a vast improvement.
Either of these filesystems can handle tens of thousands of files in a single directory, with XFS being significantly faster than ext4 on a directory with the same number of inodes. But you probably don't want a single directory with 3M inodes, as even with a B+tree the lookup can take some time. This is what led to creating directories in this manner in the first place.
As for your proposed structures, the first option you gave is exactly what is shown in nginx examples. It will perform well on either filesystem, though XFS will still have a bit of an advantage. The second option may perform slightly better or slightly worse, but it will probably be pretty close, even on benchmarks.
Solution 2:
In my experience, one of the scaling factors is the size of the inodes given a hash-name partitioning strategy.
Both of your proposed options creates up to three inode entries for each created file. Also, 732 files will create an inode that is still less than the usual 16KB. To me, this means either option will perform the same.
I applaud you on your short hash; previous systems I've worked on have taken the sha1sum of the given file and spliced directories based on that string, a much harder problem.
Solution 3:
Certainly either option will help reduce the number of files in a directory to something that seems reasonable, for xfs or ext4 or whatever file system. It is not obvious which is better, would have to test to tell.
Benchmark with your application simulating something like the real workload is ideal. Otherwise, come up with something that simulates many small files specifically. Speaking of that, here's an open source one called smallfile. Its documentation references some other tools.
hdparm
doing sustained I/O isn't as useful. It won't show the many small I/Os or giant directory entries associated with very many files.