Haven’t posted in a million years, so I’ll start off with an easy one. We were listing the files in a directory using `find [path] -type f -ls` and dumping the output to a file. Should work fine, right? Except it was taking days (without exaggeration) to run, which was sort of cramping our style.
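(For the record, the whole job was basically one command - the path and output filename below are placeholders, not the real ones:)

```bash
# Essentially the original job: a single find walking everything
# under the mount, dumping an ls-style listing to a file.
# The path and filename are placeholders.
find /mnt/nfs/bigdir -type f -ls > file-listing.txt
```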
The problems were twofold:
- Lots of subdirectories
- No client-side caching over NFS
Specifically, the mount was an NFS volume, and one of the directories we were listing had lots (several hundred thousand!) of subdirectories. This is generally considered “not best practice”, but it’s legacy and we haven’t rebuilt it to work w/ a different directory structure yet. (We’ll get to that shortly.)
So for those of you that don’t know, directories don’t like having lots of subdirectories (or files) in them. Different filesystems and operating systems handle it differently, but they pretty much all suck once you get past more than a few thousand entries. Normally, things like find get around this by leaning on the OS’s built-in filesystem cache. The problem is that, as a general rule, clients don’t cache things when accessing an NFS mount - since the filesystem isn’t local and can’t be locked, the cache can’t be consistently marked dirty and updated, etc. etc. etc. yadda yadda yadda.
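(If you’re curious whether one of your own directories is in that territory, counting entries without asking ls to sort them is the gentle way to check - the path below is a placeholder:)

```bash
# Count entries in a huge directory. -f lists them in directory
# order without sorting, which keeps ls itself from choking.
# (The count includes . and .. since -f implies -a.)
# /mnt/nfs/bigdir is a placeholder path.
ls -f /mnt/nfs/bigdir | wc -l
```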
So every time find worked through a subdirectory (or a few of them), it ended up re-listing the contents of that one big directory (which sat above all these other directories in the path, right?), with no cache to save it. Sigh.
The ghetto solution? I did an ls of the big directory and wrote a quick shell script to iterate through it w/ a few parallel finds, rather than letting a single find do it all by itself. The end result of this experiment? A performance improvement of nearly 10x (yes, an order of magnitude). Instead of finishing in days, it now finishes in hours. This is one of those things that’s obvious when you think about it, but really, how many people spend a lot of time thinking about running a “routine” find?
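Here’s a sketch of the shape of that script (not the actual one) - the path, the output location, and the parallelism of 8 are all placeholder assumptions:

```bash
#!/bin/bash
# Sketch of the workaround: read the huge top-level directory once,
# then fan out a separate find per subdirectory instead of letting
# one find walk the whole tree. TOP, OUTDIR, and the parallelism
# of 8 are placeholders, not the real values.

TOP=/mnt/nfs/bigdir      # the directory with the huge number of subdirs
OUTDIR=/tmp/listings     # one output file per subdirectory
mkdir -p "$OUTDIR"

# Run up to 8 finds in parallel, each walking a single subdirectory.
ls "$TOP" | xargs -P 8 -I{} \
    sh -c 'find "$1/$2" -type f -ls > "$3/$2.ls"' _ "$TOP" {} "$OUTDIR"

# Stitch the pieces back into one listing. find -exec ... + batches
# the arguments, so a few hundred thousand files won't blow ARG_MAX.
find "$OUTDIR" -name '*.ls' -exec cat {} + > file-listing.txt
```

Writing a separate output file per subdirectory also means the parallel finds never fight over a single output stream, so the listing doesn’t come out interleaved.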