Is there a faster way than this to find all the files in a directory and all subdirectories?
Try this iterator block version that avoids recursion and the Info objects:
// Requires: using System.Collections.Generic; using System.IO;
public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    // Breadth-first traversal with an explicit queue instead of recursion.
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        try
        {
            tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        }
        catch (UnauthorizedAccessException)
        {
            // Skip directories we don't have permission to read.
            continue;
        }
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}
Note also that .NET 4.0 has built-in iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system, and fewer intermediate arrays).
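For example, a minimal sketch of the 4.0 approach (note that, unlike the queue version above, SearchOption.AllDirectories will abort on the first folder it cannot read):

// Streams matches lazily; no FileInfo objects or intermediate arrays are built.
// Throws UnauthorizedAccessException at the first inaccessible directory.
foreach (string path in Directory.EnumerateFiles(rootFolderPath, fileSearchPattern, SearchOption.AllDirectories))
{
    Console.WriteLine(path);
}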
Cool question.
I played around a little and, by leveraging iterator blocks and LINQ, I appear to have improved your revised implementation by about 40%. I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.
Here is the meat of it:
// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
    var rootDir = new DirectoryInfo(rootFolderPath);
    // GetDirectories returns only subfolders, so prepend the root itself,
    // or its top-level files would be missed.
    var dirList = new[] { rootDir }.Concat(rootDir.GetDirectories("*", SearchOption.AllDirectories));
    // SelectMany flattens the per-directory FileInfo[] batches into one lazy sequence.
    return ReturnFiles(dirList, searchPattern).SelectMany(files => files);
}

private static IEnumerable<FileInfo[]> ReturnFiles(IEnumerable<DirectoryInfo> dirList, string fileSearchPattern)
{
    foreach (DirectoryInfo dir in dirList)
    {
        yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
}
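To compare the two versions, here is a rough timing harness (a sketch only; the share path is a placeholder, and Count() forces full enumeration so deferred execution doesn't skew the numbers):

// Requires: using System.Diagnostics; using System.Linq;
var sw = Stopwatch.StartNew();
int count = GetFileList("*.*", @"\\server\share").Count();
sw.Stop();
Console.WriteLine("{0} files in {1} ms", count, sw.ElapsedMilliseconds);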
The short answer of how to improve the performance of that code is: you can't.
The real performance hit you're experiencing is the latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is, of course, excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now.)
Given the original constraints, there are several solutions already posted that more or less elegantly wrap the iteration process and reduce the number of objects created. (However, since I assume we're reading from a single hard drive, parallelism will NOT help traverse a directory tree more quickly, and may even increase the time, since you now have two or more threads fighting over data on different parts of the drive as it seeks back and forth.) Still, if we evaluate how the function will be consumed by the end developer, there are some optimizations and generalizations we can come up with.
First, we can delay execution by returning an IEnumerable; yield return accomplishes this by compiling a state-machine enumerator into a compiler-generated class that implements IEnumerable and is returned when the method executes. Most LINQ methods are written to delay execution until the iteration is performed, so the code in a Select or SelectMany will not run until the IEnumerable is iterated through. The benefit of delayed execution is only felt if you take a subset of the data at a later time; for instance, if you only need the first 10 results, a delayed query that would return several thousand results never iterates past the tenth.
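For example, a small sketch of that effect using the queue-based GetFileList from the first answer (the path is a placeholder):

// Nothing touches the disk until the foreach runs, and Take(10)
// stops the underlying directory walk after the tenth match.
var firstTen = GetFileList("*.txt", @"C:\SomeRoot").Take(10);
foreach (string path in firstTen) // enumeration (and disk I/O) happens here
{
    Console.WriteLine(path);
}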
Now, given that you want to do a subfolder search, I can also infer that it may be useful to be able to specify the search depth. Doing so generalizes my problem, but it also necessitates a recursive solution. Then, later, when someone decides the search needs to go two directories deep because we increased the number of files and added another layer of categorization, you can simply make a slight modification instead of rewriting the function.
In light of all that, here is the solution I came up with, which is more general than some of the others above:
// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
{
    return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
}

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
{
    // depth == 0 searches only this directory; each additional level
    // recurses one level deeper into the subdirectories.
    return depth == 0
        ? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
        : directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
            directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
}
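Hypothetical usage (the path and pattern are placeholders): depth 1 covers the root folder plus its immediate subfolders, and the depth argument is all that changes when another layer of categorization appears.

foreach (FileInfo file in BetterFileList("*.log", new DirectoryInfo(@"C:\Data"), 1))
{
    Console.WriteLine(file.FullName);
}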
On a side note, something else that hasn't been mentioned by anyone so far is file permissions and security. Currently there is no checking, handling, or requesting of permissions, and the code will throw an UnauthorizedAccessException if it encounters a directory it doesn't have permission to enumerate.
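One possible way to plug that gap (SafeGetFiles is a hypothetical helper, not part of the code above) is to wrap the per-directory call and skip unreadable folders, much like the queue-based answer does:

// Hypothetical helper: returns an empty batch instead of throwing,
// so the walk continues past directories it cannot read.
private static FileInfo[] SafeGetFiles(DirectoryInfo dir, string fileSearchPattern)
{
    try
    {
        return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
    catch (UnauthorizedAccessException)
    {
        return new FileInfo[0]; // skip inaccessible directories
    }
}

A similar wrapper around GetDirectories would be needed for the recursive step.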