Had several recent enquiries from the field asking about low-impact methods to count the number of files in large directories (those containing hundreds of thousands to millions of files).
Unfortunately, there’s no ‘silver bullet’ command or data source available that will provide that count instantaneously: something has to perform a treewalk to gather these stats. That said, there are several approaches to this, each with its pros and cons:
- If the customer has a SmartQuotas license, they can configure an advisory directory quota on the directories they want to check. As mentioned above, the first quota scan will require walking the directory tree, but fast, low-impact file count reports are then available on demand, as sketched below.
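For example, something along these lines should do the trick. The path and threshold here are purely illustrative, and flag names can vary between OneFS releases, so verify with ‘isi quota quotas create --help’:
# isi quota quotas create /ifs/data/manyfiles directory --advisory-threshold 10G
Once the initial QuotaScan job has completed, the file count is then reported in the quota’s usage details:
# isi quota quotas view /ifs/data/manyfiles directory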
- Another approach is to use traditional UNIX commands, either from the OneFS CLI or, less desirably, from a UNIX client. The following two commands will both take time to run:
# ls -f /path/to/directory | wc -l
# find /path/to/directory -type f | wc -l
It’s worth noting that, when counting files with ls, you’ll probably get faster results by omitting the ‘-l’ flag and using the ‘-f’ flag instead. This is because ‘-l’ resolves UIDs and GIDs to display user and group names, which creates more work and thereby slows the listing. In contrast, ‘-f’ allows the ‘ls’ command to avoid sorting its output. This should be faster, and it reduces memory consumption when listing extremely large numbers of files.
Ultimately, there really is no quick way to walk a file system and count the files, especially since both ‘ls’ and ‘find’ are single-threaded commands. Running either of them in the background with output redirected to a file is probably the best approach.
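For instance, a minimal sketch using hypothetical paths, with ‘nohup’ so the scan survives a disconnected session:
# nohup find /path/to/directory -type f > /ifs/tmp/filelist.out 2>/dev/null &
# wc -l /ifs/tmp/filelist.out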
Depending on the arguments you use with the ls or find command, you can gather a comprehensive set of context info and metadata in a single pass:
# find /path/to/scan -ls > output.file
It will take quite a while for the command to complete, but once you have the output stashed in a file you can pull all sorts of useful data from it.
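For example, since the standard ‘find -ls’ layout puts the file size in bytes in the seventh column, a quick awk one-liner can report both the entry count and the total capacity consumed. The output file name here is hypothetical, and note that the count includes directories unless the scan was restricted with ‘-type f’:
# awk '{ bytes += $7 } END { printf "%d entries, %.1f GB\n", NR, bytes/1024^3 }' output.file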
Assuming a latency of 10ms per file, it would take around 33 minutes for 200,000 files (200,000 × 10ms = 2,000 seconds). And this estimate may well be conservative: there are typically multiple protocol ops that need to be performed against each file, and they quickly add up. Plus, as mentioned before, ‘ls’ is a single-threaded command.
- If possible, ensure the directories of interest reside on a file pool that keeps at least one of the metadata mirrors on SSD (the metadata-read SSD strategy), as sketched below.
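If SmartPools is licensed, a file pool policy can apply this SSD strategy to a specific tree. Treat the following as a rough sketch with a hypothetical path and policy name, and confirm the exact filter syntax for your release with ‘isi filepool policies create --help’:
# isi filepool policies create mdread --begin-filter --path=/ifs/data/manyfiles --end-filter --data-ssd-strategy=metadata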
- Windows Explorer can also enumerate the files in a directory tree surprisingly quickly. All you get is a file count, but it can work pretty well.
- If the directory you wish to know the file count for just happens to be /ifs, you can run the LinCount job, which will tell you how many LINs there are in the file system.
LinCount (relatively) quickly scans the file system and returns the total count of LINs (logical inodes). The LIN count is essentially equivalent to the total file and directory count on a cluster. The job itself runs at LOW priority by default, and is the fastest method of determining object count on OneFS, assuming a comparable treewalk job hasn’t already completed and reported a count.
The following syntax can be used to kick off the LinCount job from the OneFS CLI:
# isi job start lincount
The output from this will be along the lines of “Added job [52]”.
Note: The number in square brackets is the job ID.
To view results, run the following command from the CLI:
# isi job reports view [job ID]
For example:
# isi job reports view 52
LinCount phase 1 (2021-07-06T09:33:33)
------------------------------------------
Elapsed time     1 seconds
Errors           0
Job mode         LinCount
LINs traversed   1722
SINs traversed   0
The “LINs traversed” metric indicates that 1722 files and directories were found.
Note: The LinCount job will also include snapshot revisions of LINs in its count.
Alternatively, if another treewalk job has run against the directory you wish to know the count for, you might be in luck.
At any rate, hundreds of thousands of files is a large number to store in one directory. To reduce directory enumeration time, where possible, divide the files up into multiple subdirectories.
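As a rough illustration, the following one-liner buckets the files in a flat directory into subdirectories keyed on the first character of the filename. The path is hypothetical, and this simplistic sketch assumes filenames without embedded spaces or newlines:
# find /path/to/directory -maxdepth 1 -type f | while IFS= read -r f; do d="/path/to/directory/$(basename "$f" | cut -c1)"; mkdir -p "$d" && mv "$f" "$d/"; done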
When it comes to NFS, the behavior will partially depend on whether the client is issuing READDIRPLUS operations or plain READDIR. READDIRPLUS is useful if the client is going to need the metadata. However, if all you’re trying to do is list the filenames, it actually makes that operation much slower.
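On a Linux client, ‘nfsstat’ will show which of the two operations is actually being issued, via its per-op counters (assuming an NFSv3 mount):
# nfsstat -c -3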
If you only read the filenames in the directory, and you don’t attempt to stat any associated metadata, then this requires a relatively small amount of I/O to pull the names from the meta-tree, and should be fairly fast.
If this has already been done recently, some or all of the blocks are likely to already be in L2 cache. As such, a subsequent operation won’t need to read from hard disk and will be substantially faster.
NFS is more complicated regarding what it will and won’t cache on the client side, particularly with the attribute cache and the timeouts that are associated with it.
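On a Linux client, for example, the ‘actimeo’ mount option sets all four attribute cache timeouts in seconds (the export and mount point below are hypothetical):
# mount -t nfs -o vers=3,actimeo=60 cluster:/ifs/data/manyfiles /mnt/manyfiles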
Here are some options from fastest to slowest:
- If NFS is using READDIR, as opposed to READDIRPLUS, and the ‘ls’ command is invoked with the appropriate arguments to prevent it polling metadata or sorting the output, execution will be relatively swift.
- If ‘ls’ polls the metadata (or if NFS uses READDIRPLUS) but doesn’t sort the results, output will begin appearing almost immediately, but the listing will take longer to complete overall.
- If ‘ls’ sorts the output, nothing will be displayed until ls has read and sorted everything, at which point you’ll get the output in a deluge at the end.
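To see the difference for yourself, time the variants against the same directory (hypothetical path, and note the first run will also warm the cache for subsequent ones):
# time ls -f /path/to/directory | wc -l
# time ls -l /path/to/directory > /dev/null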