<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:11:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14610] Make mpiFileUtils better support statahead</title>
                <link>https://jira.whamcloud.com/browse/LU-14610</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;mpiFileUtils uses the open source MPI-based library &lt;b&gt;libcircle&lt;/b&gt;, an API for distributing embarrassingly parallel workloads using self-stabilization, to implement distributed, scalable tree walking. libcircle provides an efficient distributed queue on a cluster and is currently used in production to quickly traverse and operate on file trees that contain several hundred million file nodes. Each MPI rank (process) maintains its own queue of work items and can exchange work items with a random process without a central coordinator. To keep the load balanced, libcircle sends work requests to random nodes when a rank needs work. If the requested node has work in its queue, it randomly splits that queue with the requestor.&lt;/p&gt;

&lt;p&gt;Conceptually, the workload distribution in mpiFileUtils can be described with two queues. One queue is a global queue that spans all nodes in the distributed system. The other queue is a queue which is local to each process (MPI rank).&lt;/p&gt;

&lt;p&gt;The libcircle API defines a function to produce a work item initially and a function to process a work item. The libcircle work item is an array of characters describing the job to be performed. mpiFileUtils wraps libcircle, and functionality that is common to multiple tools is moved to the common library, libmfu. The key data structure in libmfu is a distributed file queue. This structure represents a list of files, each with stat-like metadata, that is distributed among a set of MPI ranks. Each MPI rank &quot;owns&quot; a portion of the queue, and there are routines to step through the entries owned by that process. This portion is referred to as the &quot;local&quot; queue. Functions exist to get and set properties of the items in the local list, for example to get the path name, type, and size of a file. Functions dealing with the local list can be called by the MPI process independently of other MPI processes. Other functions operate on the global list in a collective fashion, such as deleting all items in a file list. All processes in the MPI job must invoke these functions simultaneously.&lt;/p&gt;

&lt;p&gt;The following uses &lt;b&gt;dwalk&lt;/b&gt; as an example to show how mpiFileUtils performs distributed tree walking.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Walk_stat_process(CIRCLE Queue) {
    path = Queue.dequeue();
    stat(path);
    record stat() result &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;final&lt;/span&gt; summary via MPI reduce.
    &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; path is a directory
        dir = opendir(path);
        &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (dent = readdir(dir)) != NULL &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
            Queue.enqueue(path + &lt;span class=&quot;code-quote&quot;&gt;&apos;/&apos;&lt;/span&gt; + dent.name);
        end &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt;
        closedir(dir);
    fi
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;When walking a tree with &lt;b&gt;dwalk&lt;/b&gt;, the function that produces the initial work item takes the input parameter (a directory) passed in by the user and enqueues a string describing that directory. The function that processes a work item dequeues the full path name of a file and calls stat() on it. If the work item is a directory, the function walks a single level within it and adds each dentry under the directory to the global libcircle work queue to be processed independently.&lt;/p&gt;

&lt;p&gt;libcircle automatically and dynamically balances the tree walk workload across many nodes in a large distributed system. The minimal work set for the current parallel tree walk with stat() in dwalk is a single file. As a result, the files within the same directory may be distributed randomly among different MPI ranks (different processes or nodes), which breaks the sequential stat() calls in readdir() order.&lt;/p&gt;

&lt;p&gt;We improve mpiFileUtils so that the minimal splittable work set is a directory. Within a directory, it uses the FLAT statahead algorithm to accelerate tree walking with stat(). The pseudocode for the algorithm is as follows:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Walk_Flat_stat_process(CIRCLE Queue) {
    path = Queue.dequeue();
    dir = opendir(path);
    &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (dent = readdir(dir)) != NULL &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
        fullpath = path + &lt;span class=&quot;code-quote&quot;&gt;&quot;/&quot;&lt;/span&gt; + dent.name;
        localQueue.enqueue(fullpath);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; dent is a directory
            Queue.enqueue(fullpath);
        fi
    end &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt;
    &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; |localQueue| &amp;gt; 0 &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
        path = localQueue.dequeue();
        stat(path);
        record stat() result &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;final&lt;/span&gt; summary via MPI reduce.
    end &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt;
    closedir(dir);
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The function that processes a work item first dequeues the full path of a directory and then iterates over a single level within that directory. All child entries are added to the local queue of this MPI rank, while each subdirectory is also added to the libcircle work queue to be processed in a distributed fashion.&lt;/p&gt;
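
&lt;p&gt;The directory-granularity walk described above can be modeled in a single process as follows. This is only a sketch in Python, not the mpiFileUtils implementation: the names walk_flat, work_queue, local_queue and stat_results are hypothetical, a plain deque stands in for the libcircle work queue, and a dictionary stands in for the MPI reduce summary.&lt;/p&gt;

```python
import os
from collections import deque


def walk_flat(root):
    """Single-process model of a walk whose work items are directories.

    work_queue stands in for the libcircle global queue and local_queue
    for the per-rank queue; both names are illustrative only.
    """
    work_queue = deque([root])   # directories waiting to be scanned
    stat_results = {}            # path to st_size, standing in for the summary
    while work_queue:
        path = work_queue.popleft()
        local_queue = deque()
        # One readdir() pass: every child goes to the local queue in
        # readdir() order; subdirectories also become new work items.
        with os.scandir(path) as entries:
            for dent in entries:
                local_queue.append(dent.path)
                if dent.is_dir(follow_symlinks=False):
                    work_queue.append(dent.path)
        # stat() the children back to back, in the order readdir() gave them.
        while local_queue:
            child = local_queue.popleft()
            stat_results[child] = os.lstat(child).st_size
    return stat_results
```

&lt;p&gt;Because all children of one directory are stat()ed by the same loop, the readdir() plus stat() order within a directory is preserved, which is the property the per-file work items lose.&lt;/p&gt;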

&lt;p&gt;Although a sequential traversal of a directory, which follows the readdir() plus stat() access pattern, can be optimized by our statahead algorithm, it still becomes time-consuming when the directory is extremely large. A trade-off strategy is proposed to balance parallelization against the statahead speedup. A tuning parameter named &apos;&lt;b&gt;stmax&lt;/b&gt;&apos; is defined. When the number of entries under the directory is lower than &lt;b&gt;stmax&lt;/b&gt;, the algorithm above is used for the tree walk. When it is larger than &lt;b&gt;stmax&lt;/b&gt;, the first &lt;b&gt;stmax&lt;/b&gt; entries are placed in the local queue and stat()ed sequentially using the FLAT statahead algorithm, while the remaining entries are added to the global queue to be stat()ed in a distributed fashion. &apos;&lt;b&gt;stmax&lt;/b&gt;&apos; can be set through an input parameter of dwalk. The default value is 5000. The pseudocode for the algorithm is as follows:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
struct Elem {
    bool localQueued;
    &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; type;
    &lt;span class=&quot;code-object&quot;&gt;char&lt;/span&gt; path[];
}
ElemEnqueue(Queue, type, localQueued, fullpath) {
    elem = &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; Elem();
    elem.localQueued = localQueued;
    elem.type = type;
    elem.path = fullpath;
    Queue.enqueue(elem);
}
Walk_Xfast_stat_process(CIRCLE Queue) {
    elem = Queue.dequeue();
    &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (elem.localQueued == &lt;span class=&quot;code-keyword&quot;&gt;false&lt;/span&gt;) {
        stat(elem.path);
        record stat() result &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;final&lt;/span&gt; summary via MPI reduce.
    }
    &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (elem.type == DT_DIR) {
        dir = opendir(elem.path);
        count = 0;
        &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (dent = readdir(dir)) != NULL &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
            fullpath = elem.path + &lt;span class=&quot;code-quote&quot;&gt;&quot;/&quot;&lt;/span&gt; + dent.name;
            &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (count &amp;lt; stmax) {
                count++;
                localQueue.enqueue(fullpath);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (dent.type == DT_DIR)
                    ElemEnqueue(Queue, dent.type, &lt;span class=&quot;code-keyword&quot;&gt;true&lt;/span&gt;, fullpath);
            } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; {
                ElemEnqueue(Queue, dent.type, &lt;span class=&quot;code-keyword&quot;&gt;false&lt;/span&gt;, fullpath);
            }
        end &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt;
        &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; |localQueue| &amp;gt; 0 &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
            path = localQueue.dequeue();
            stat(path);
            record stat() result &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;final&lt;/span&gt; summary via MPI reduce.
        end &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt;
        closedir(dir);
    }
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
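
&lt;p&gt;The stmax split on its own can be illustrated with a small Python function. This is a sketch under stated assumptions rather than the dwalk code: the name split_entries and the tuple layout (name, is_dir, local_queued) are hypothetical, and the input is assumed to be the directory entries in readdir() order.&lt;/p&gt;

```python
def split_entries(entries, stmax=5000):
    """Split (name, is_dir) pairs, given in readdir() order, by stmax.

    Returns (local_queue, global_queue). local_queue holds the names the
    scanning rank stat()s itself; global_queue holds work items of the
    form (name, is_dir, local_queued) that any rank may pick up.
    """
    local_queue = []
    global_queue = []
    head, tail = entries[:stmax], entries[stmax:]
    for name, is_dir in head:
        local_queue.append(name)
        if is_dir:
            # Already stat()ed locally, but still has to be scanned later.
            global_queue.append((name, is_dir, True))
    for name, is_dir in tail:
        # Beyond stmax: whichever rank dequeues the item must stat() it.
        global_queue.append((name, is_dir, False))
    return local_queue, global_queue
```

&lt;p&gt;For a directory with fewer than stmax entries the tail is empty, so only subdirectories reach the global queue and the behavior matches the FLAT walk.&lt;/p&gt;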

</description>
                <environment></environment>
        <key id="63767">LU-14610</key>
            <summary>Make mpiFileUtils better support statahead</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="qian_wc">Qian Yingjin</assignee>
                                    <reporter username="qian_wc">Qian Yingjin</reporter>
                        <labels>
                    </labels>
                <created>Tue, 13 Apr 2021 07:50:23 +0000</created>
                <updated>Fri, 16 Apr 2021 08:44:00 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="298634" author="adilger" created="Tue, 13 Apr 2021 08:22:27 +0000"  >&lt;p&gt;It should be noted that it is possible to segment a large directory in Lustre for parallel processing by dividing the hash space [0, 2^63) evenly among the ranks. For example, if there are 8 ranks walking a single directory, each rank covers a hash range of size 2^63/8 = 2^60. That means rank 0 would read until directory cookie 2^60-1, rank 1 would seek to 2^60 and read until 2^61-1, etc. See for example the pfind code in &lt;a href=&quot;https://github.com/VI4IO/pfind.git&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/VI4IO/pfind.git&lt;/a&gt;, which leverages this ability.&lt;/p&gt;
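
&lt;p&gt;The even split of the hash space can be computed as in the following Python sketch. The function name hash_ranges and the hash_bits parameter are hypothetical, and the code only does the arithmetic; it does not call seekdir() or touch Lustre.&lt;/p&gt;

```python
def hash_ranges(nranks, hash_bits=63):
    """Return one half-open (start, end) hash range per rank."""
    space = 2 ** hash_bits
    # Integer arithmetic keeps the ranges contiguous and covers the whole
    # space even when nranks does not divide it evenly.
    return [(rank * space // nranks, (rank + 1) * space // nranks)
            for rank in range(nranks)]
```

&lt;p&gt;With 8 ranks, hash_ranges(8)[1] is (2**60, 2**61), so rank 1 would seek to 2^60 and stop before 2^61, as in the example above.&lt;/p&gt;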

&lt;p&gt;Since the hash cookie is not returned as part of &lt;tt&gt;readdir()&lt;/tt&gt;, and calling &lt;tt&gt;telldir()&lt;/tt&gt; for every file isn&apos;t very efficient, one approach would be for rank n+1 to seek to the start of its range, call &lt;tt&gt;readdir()&lt;/tt&gt; once to get the first entry, then pass this entry name back to rank n as the &quot;end&quot; marker of its readdir() processing.&lt;/p&gt;</comment>
                            <comment id="298635" author="qian_wc" created="Tue, 13 Apr 2021 08:34:56 +0000"  >&lt;p&gt;Agreed,&lt;/p&gt;

&lt;p&gt;That&apos;s another statahead strategy I considered implementing for mpiFileUtils or pFind:&#160;add a new Lustre ladvise() for an opened directory file handle, with the hash range [start, end) as a hint telling the kernel to launch a statahead thread that does readdir() + stat() over the position range [start, end).&lt;/p&gt;

&lt;p&gt;The usage in mpiFileUtils may be as follows:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
dir = opendir(path);
fd = dirfd(dir);
llapi_ladvise(fd, STATAHEAD_HASH_RANGE, start, end);
seekdir(dir, start);
&lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (dent = readdir(dir)) != NULL &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
   stat(dent.name);
   &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (telldir(dir) &amp;gt;= end)
       &lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
end &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt;
closedir(dir);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="298651" author="adilger" created="Tue, 13 Apr 2021 14:13:05 +0000"  >&lt;p&gt;I think we wouldn&apos;t need a new ladvise for this, since hash order &lt;b&gt;is&lt;/b&gt; readdir() order. The main issue is that statahead only starts at the beginning of the directory; it doesn&apos;t start statahead for hash-order operations that begin in the middle of the directory.&lt;/p&gt;</comment>
                            <comment id="298656" author="qian_wc" created="Tue, 13 Apr 2021 14:30:07 +0000"  >&lt;p&gt;Then how would we detect that applications such as mpiFileUtils and pFind are doing readdir() + stat() in hash order starting in the middle of the directory, and how would we know when to terminate statahead?&lt;/p&gt;</comment>
                            <comment id="298677" author="adilger" created="Tue, 13 Apr 2021 15:31:26 +0000"  >&lt;p&gt;In the same way that statahead currently detects if the process is &lt;b&gt;not&lt;/b&gt; doing hash-order stats, it could detect that the process did a &lt;tt&gt;readdir()&lt;/tt&gt; call and then &lt;tt&gt;stat()&lt;/tt&gt; entries in that order. It could detect this on a per-readdir basis instead of only at the start of the directory?&lt;/p&gt;</comment>
                            <comment id="298980" author="qian_wc" created="Fri, 16 Apr 2021 08:44:00 +0000"  >&lt;p&gt;Yingjin Qian (qian@ddn.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43170&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43170&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14380&quot; title=&quot;Make statahead better support Breadth First Search (BFS) or Depth First Search (DFS)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14380&quot;&gt;LU-14380&lt;/a&gt; statahead: divide hash space evenly among the ranks&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: f113be1941c5b18b65e94d5c364b9d69e49151c3&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01s1r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>