Enumerate Directory with a large number of files

#24850
Posted: 05/05/2013 20:11:51
by Kenny Kim (Standard support level)
Joined: 08/19/2009
Posts: 38

Good morning.

I have a question about enumerating a directory with 100 000 or more files.

Basically, a CBFS drive is capable of displaying even 200 000 files, but it takes quite a long time to enumerate them all.

I have tested the same number of files on an SMB share drive.
What the SMB drive does is add files to Explorer's file list on the fly (is it "NotifyDirectoryChange"?).
For example, when you double-click the target folder, Explorer is not blocked.
You can open, read, and write files while the number of files in the folder continues to grow, until all of them are enumerated.

Is it possible to mimic that behaviour in a CBFS drive too?

Thank you.
#24851
Posted: 05/06/2013 02:37:26
by Volodymyr Zinin (EldoS Corp.)

CallbackFS should work similarly to SMB. Windows performs directory enumeration in the following way:
1. Opens a directory.
2. Performs several "enumerate directory" calls to enumerate files in the directory.
3. Closes the directory.

These "enumerate directory" calls are not blocking calls. I.e. in parallel it's possible to do any other operations with files/directories on the disk. But it's necessary to have free worker threads for it (the CallbackFileSystem.ThreadPoolSize property must be greater than 1, but maybe it's better to set it to 10 or even more) and the CallbackFileSystem.SerializeCallbacks property must be set to false.

Quote
Ulughbek Muslimov wrote:
For example, when you double-click the target folder, Explorer is not blocked.

Perhaps Explorer is blocked because your directory enumeration is too slow. During one "enumerate directory" request, which Explorer performs synchronously, several files are requested at once (actually it's the ZwQueryDirectoryFile API call). But CallbackFS, in order to simplify the implementation of the user callbacks, calls the OnEnumerateDirectory callback for each file being enumerated. So Explorer waits until enumeration of those several files (usually about 1 to 10, depending on the buffer size passed to ZwQueryDirectoryFile) has finished. Try to process the OnEnumerateDirectory callback as fast as possible.
Another possible reason is that you use a local type of mounting point (i.e. any one except those created with the CBFS_SYMLINK_NETWORK flag). In that case Explorer "thinks" the disk is local (i.e. fast), and during enumeration it also opens each enumerated file and reads its thumbnail.
#24852
Posted: 05/06/2013 03:34:08
by Kenny Kim (Standard support level)
Joined: 08/19/2009
Posts: 38

Thank you for the detailed answer.

When the directory is enumerated for the first time and the context is created in the EnumerateDirectory callback
Code
...
context = new DirectoryEnumerationContext(mRootPath + DirectoryInfo.FileName, Mask);
...

this constructor
Code
private FileSystemInfo[] mFileList;
private int mIndex;

public DirectoryEnumerationContext(string DirName, string Mask)
{
    DirectoryInfo dirinfo = new DirectoryInfo(DirName);

    // Read the whole listing up front; entries are then served by index
    // from the EnumerateDirectory callback.
    mFileList = dirinfo.GetFileSystemInfos(Mask);
    mIndex = 0;
}

fills mFileList pretty fast (tested with a directory containing just 10 000 files).

But in our case we have to use the POSIX readdir() function inside DirectoryEnumerationContext() to build mFileList.
This is where our drive blocks.

Is it safe to return mFileList with, let's say, 10 000 FileInfos first, and then push the FileInfos from 10 001 to the end of mFileList using a worker thread?
What negative issues might come from using a worker thread inside the callbacks?
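Roughly, I mean something like this (a simplified sketch; NativeDir stands for our native readdir() wrapper and is not real code, error handling omitted):

Code
using System.Collections.Generic;
using System.Threading;

public class DirectoryEnumerationContext
{
    private readonly List<string> mFileList = new List<string>();
    private readonly object mLock = new object();
    private volatile bool mCompleted;

    public DirectoryEnumerationContext(string DirName)
    {
        // NativeDir.ReadNames is a hypothetical wrapper around our readdir() calls.
        IEnumerator<string> it = NativeDir.ReadNames(DirName).GetEnumerator();

        // Read the first 10 000 names synchronously so the callback has data to return.
        int count = 0;
        while (count < 10000 && it.MoveNext())
        {
            mFileList.Add(it.Current);
            count++;
        }

        // Keep reading the rest of the directory on a worker thread.
        new Thread(() =>
        {
            while (it.MoveNext())
                lock (mLock) { mFileList.Add(it.Current); }
            mCompleted = true;
        }) { IsBackground = true }.Start();
    }

    public bool Completed { get { return mCompleted; } }

    // Called from the EnumerateDirectory callback with an increasing index.
    public bool TryGetEntry(int index, out string name)
    {
        lock (mLock)
        {
            if (index < mFileList.Count)
            {
                name = mFileList[index];
                return true;
            }
        }
        name = null;
        return false; // not produced yet, or (if Completed is true) enumeration is over
    }
}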

Thank you.
#24854
Posted: 05/06/2013 04:49:14
by Volodymyr Zinin (EldoS Corp.)

Quote
Ulughbek Muslimov wrote:
What negative issues might come from using a worker thread inside the callbacks?

Explorer will be "frozen" until the OnEnumerateDirectory callback finishes.

Quote
Ulughbek Muslimov wrote:
Is it safe to return mFileList with, let's say, 10 000 FileInfos first, and then push the FileInfos from 10 001 to the end of mFileList using a worker thread?

In the case of 10 000 files it seems OK. I'm not an expert in .NET, but as I understand it, the GetFileSystemInfos method allocates an object for each enumerated file. Let's suppose each object is about 50 bytes long (~15 characters for the file name plus 20 bytes of overhead). So 50 * 10 000 = 500 000 bytes, which is not that much for desktop/server systems.
#24916
Posted: 05/13/2013 07:39:48
by Oleg Savelos (Standard support level)
Joined: 08/25/2008
Posts: 21

I had a similar problem while enumerating directories, and the best solution I came up with, in terms of performance and stability, was to use the native API to enumerate directory contents. It's much faster and lets you have a real enumeration context for a directory, something .NET itself lacked.

I haven't used the latest .NET additions to directory enumeration, but it seems to me that the problem will remain in your case, since you still can't have a real enumeration context with them. You should check out the FindFirstFile function on MSDN and build a custom enumeration context based on it.
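Roughly something like this (a simplified sketch of the P/Invoke approach; error handling and "." / ".." filtering are left out):

Code
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct WIN32_FIND_DATA
{
    public uint dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
    public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
    public string cAlternateFileName;
}

public class NativeEnumerationContext : IDisposable
{
    private static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll")]
    private static extern bool FindClose(IntPtr hFindFile);

    private IntPtr mHandle = INVALID_HANDLE_VALUE;
    private WIN32_FIND_DATA mData;
    private bool mFirst = true;

    public NativeEnumerationContext(string DirName, string Mask)
    {
        // The search handle itself acts as the enumeration context.
        mHandle = FindFirstFile(System.IO.Path.Combine(DirName, Mask), out mData);
    }

    // Returns the next entry, one per call - which maps nicely onto
    // one OnEnumerateDirectory invocation per file.
    public bool GetNext(out WIN32_FIND_DATA entry)
    {
        entry = mData;
        if (mHandle == INVALID_HANDLE_VALUE)
            return false;
        if (mFirst)
        {
            mFirst = false;
            return true;
        }
        if (FindNextFile(mHandle, out mData))
        {
            entry = mData;
            return true;
        }
        return false;
    }

    public void Dispose()
    {
        if (mHandle != INVALID_HANDLE_VALUE)
        {
            FindClose(mHandle);
            mHandle = INVALID_HANDLE_VALUE;
        }
    }
}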
#24920
Posted: 05/13/2013 18:15:10
by Kenny Kim (Standard support level)
Joined: 08/19/2009
Posts: 38

Hello Oleg.

I agree with you that the native API is much faster than .NET when it comes to file I/O.
But the thing is, I do already use a native API library.
That library is used to communicate with our Linux servers.

I have solved the problem partially.
What I did is: DirectoryEnumerationContext() now builds mFileList with only the file names in it, which is really fast.
All other file information is retrieved in the EnumerateDirectory() callback.

Originally, mFileList contained the full file info, which took a long time to build, because in our native library you first get the file name (readdir()) and then request the file statistics (statfile()) for that name.
If those two were combined in DirectoryEnumerationContext() alone, it would take at least twice as long to enumerate the directory (100 000 requests for file names + 100 000 requests for file stats).
Moving statfile() to EnumerateDirectory() has solved the problem. It still takes a lot of time to enumerate all 100 000 files, but that is not a CBFS problem.
The good thing is that Explorer is not "freezing" anymore. It is possible to open, read, and write files while the number of files in the directory continues to grow.
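In a simplified form the split looks like this (NativeLib stands for our wrapper around readdir()/statfile() and is not real code; error handling omitted):

Code
public class DirectoryEnumerationContext
{
    private readonly string mDirName;
    private readonly string[] mFileList; // names only - cheap to build with readdir()
    private int mIndex;

    public DirectoryEnumerationContext(string DirName)
    {
        mDirName = DirName;
        // One pass over the directory, names only (readdir()).
        mFileList = NativeLib.ReadDir(DirName);
        mIndex = 0;
    }

    // Called once per file from the EnumerateDirectory() callback;
    // the expensive statfile() request is made here, per entry.
    public bool GetNextEntry(out string name, out NativeLib.FileStat stat)
    {
        if (mIndex < mFileList.Length)
        {
            name = mFileList[mIndex++];
            stat = NativeLib.StatFile(mDirName, name);
            return true;
        }
        name = null;
        stat = default(NativeLib.FileStat);
        return false;
    }
}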

Thank you.