A Faster way of Directory walking instead of os.listdir?

You should measure directly on the machines (OSs, filesystems and caches thereof, etc) of your specific interest -- whether or not os.walk is faster than os.listdir on a specific and totally different machine / OS / FS will tell you very little about performance on yours.
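As a rough starting point for that kind of measurement, here is a minimal sketch that times an os.walk traversal against a hand-rolled os.listdir traversal of the same tree. The helper names (count_with_walk, count_with_listdir, time_call) are made up for this illustration. Results on a warm cache can differ a lot from a cold one, so run it several times on the real target machine.

```python
import os
import time

def count_with_walk(top):
    """Count files and directories below top using os.walk."""
    total = 0
    for _root, dirs, files in os.walk(top):
        total += len(dirs) + len(files)
    return total

def count_with_listdir(top):
    """Count entries below top using os.listdir plus os.path.isdir."""
    total = 0
    stack = [top]
    while stack:
        current = stack.pop()
        try:
            names = os.listdir(current)
        except OSError:
            # Unreadable directory: skip it, as os.walk does by default.
            continue
        for name in names:
            total += 1
            path = os.path.join(current, name)
            # Descend into real directories only, not symlinks to them.
            if os.path.isdir(path) and not os.path.islink(path):
                stack.append(path)
    return total

def time_call(func, top):
    """Return (result, elapsed_seconds) for a single call of func(top)."""
    start = time.perf_counter()
    result = func(top)
    return result, time.perf_counter() - start
```

For example, time_call(count_with_walk, "/some/path") and time_call(count_with_listdir, "/some/path") give the two timings to compare, where "/some/path" stands for whatever tree you actually care about.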

Not sure what you mean by cachedir.listdir -- there is no standard library module or function by that name. os.listdir already reads the whole directory in one gulp (it has to return a complete list, which comes back in arbitrary order), as does os.walk (it must separate subdirectories from files). If your platform gives you a fast way of being notified about file/directory changes, then it's probably worth building the tree once and editing it incrementally as change notifications arrive... but that depends on the relative frequency of changes vs requests, which is, again, totally dependent on your specific application circumstances.

I was just trying to figure out how to speed up os.walk on a largish file system (350,000 files spread across around 50,000 directories). I'm on a Linux box using an ext3 file system. I discovered that there is a way to speed this up for MY case.

Specifically, using a top-down walk, any time os.walk returns a list of more than one directory, I use os.stat to get the inode number of each directory, and sort the directory list in place by inode number. This makes walk mostly visit the subdirectories in inode order, which reduces disk seeks.

For my use case, it sped up my complete directory walk from 18 minutes down to 13 minutes...
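The inode-ordering trick described above might look roughly like the following. This is a sketch under assumptions, not the poster's actual code: walk_by_inode and inode_of are hypothetical names. It relies on the documented behaviour that, in a top-down walk, changing the dirs list in place changes the order in which os.walk visits the subdirectories.

```python
import os

def inode_of(path):
    """Return the inode number of path, or 0 if it cannot be stat'ed."""
    try:
        return os.stat(path).st_ino
    except OSError:
        # The directory may have vanished between listing and stat.
        return 0

def walk_by_inode(top):
    """Like os.walk(top), but visit subdirectories in inode order."""
    for root, dirs, files in os.walk(top, topdown=True):
        if len(dirs) > 1:
            # Sort in place: os.walk descends into dirs in list order.
            dirs.sort(key=lambda name: inode_of(os.path.join(root, name)))
        yield root, dirs, files
```

Whether this helps at all depends on the file system and on the storage underneath. The gain reported above was on ext3; on an SSD or with a warm cache the extra stat calls may cost more than the ordering saves.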

Did you check out scandir (previously betterwalk)? I have not tried it myself, but it has been discussed in a couple of places. It claims a speedup of 3-10x on Mac OS X/Linux and 7-50x on Windows by avoiding redundant calls to os.stat(). It has also been included in the standard library, as os.scandir(), since Python 3.5.

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling listdir() on each directory -- it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X.

From the project's readme.
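To illustrate the idea in the quoted readme, here is a minimal sketch of a walk built on os.scandir, which takes the file type from the directory entry where the OS provides it instead of calling stat on every name. The function name scandir_walk is made up for this example. Unlike os.walk, it lists symlinks to directories under files rather than dirs. Since Python 3.5, os.walk itself is implemented on top of os.scandir, so on a modern interpreter the built-in already gets most of this benefit.

```python
import os

def scandir_walk(top):
    """Yield (root, dirs, files) tuples top-down, using os.scandir."""
    dirs, files = [], []
    try:
        with os.scandir(top) as entries:
            for entry in entries:
                # is_dir() normally uses the type from the directory
                # entry, so no extra stat system call is needed.
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.name)
                else:
                    files.append(entry.name)
    except OSError:
        # Unreadable or vanished directory: skip it silently.
        return
    yield top, dirs, files
    for name in dirs:
        yield from scandir_walk(os.path.join(top, name))
```

The recursion keeps the sketch short. For very deep trees an explicit stack would avoid hitting Python's recursion limit.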

