  Lustre / LU-14610

Make mpiFileUtils better support statahead


Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

    Description

      mpiFileUtils uses the open-source MPI-based library libcircle, an API for distributing embarrassingly parallel workloads using self-stabilization, to implement distributed, scalable tree walking. libcircle provides an efficient distributed queue on a cluster and is currently used in production to quickly traverse and operate on file trees containing several hundred million file nodes. Each MPI rank (process) maintains its own queue of work items and can exchange work items with random processes without a central coordinator. To keep the workload balanced, libcircle makes work requests to random ranks when work is needed. If the requested rank has work in its queue, it randomly splits that queue with the requestor.

      Conceptually, the workload distribution in mpiFileUtils can be described with two queues. One is a global queue that spans all nodes in the distributed system. The other is a queue local to each process (MPI rank).

      The libcircle API defines a function to produce the initial work item and a function to process a work item. A libcircle work item is an array of characters describing the job to be performed. mpiFileUtils wraps libcircle, and functionality that is common to multiple tools is moved into the common library, libmfu. The key data structure in libmfu is a distributed file queue. This structure represents a list of files, each with stat-like metadata, that is distributed among a set of MPI ranks. Each MPI rank "owns" a portion of the queue, and there are routines to step through the entries owned by that process. This portion is referred to as the "local" queue. Functions exist to get and set properties of the items in the local list, for example to get the path name, type, and size of a file. Functions dealing with the local list can be called by an MPI process independently of other MPI processes. Other functions operate on the global list in a collective fashion, such as deleting all items in a file list; all processes in the MPI job must invoke these functions simultaneously.
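
      For reference, the following is a minimal sketch of how a tool built on libcircle registers its two callbacks, based on the public libcircle API (the placeholder root path and the omitted error handling are assumptions, not part of mpiFileUtils):

      ```
      /* Minimal libcircle usage sketch: a create callback seeds the global
       * queue and a process callback consumes work items. */
      #include <string.h>
      #include <libcircle.h>

      static void create_work(CIRCLE_handle *handle)
      {
          /* Seed the distributed queue with the user-supplied root directory
           * ("/lustre/testdir" is just a placeholder). */
          char path[CIRCLE_MAX_STRING_LEN];
          strncpy(path, "/lustre/testdir", sizeof(path));
          handle->enqueue(path);
      }

      static void process_work(CIRCLE_handle *handle)
      {
          char path[CIRCLE_MAX_STRING_LEN];
          handle->dequeue(path);
          /* stat(path) here; if it is a directory, enqueue its entries
           * back onto 'handle' as new work items. */
      }

      int main(int argc, char **argv)
      {
          CIRCLE_init(argc, argv, CIRCLE_DEFAULT_FLAGS);
          CIRCLE_cb_create(&create_work);
          CIRCLE_cb_process(&process_work);
          CIRCLE_begin();      /* runs until the global queue drains on all ranks */
          CIRCLE_finalize();
          return 0;
      }
      ```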

      In the following, we take dwalk as an example to demonstrate how mpiFileUtils performs distributed tree walking.

       

      Walk_stat_process(CIRCLE Queue) {
          path = Queue.dequeue();
          stat(path);
          record stat() result for final summary via MPI reduce;
          if path is a directory then
              dir = opendir(path);
              while (dent = readdir(dir)) != NULL do
                  Queue.enqueue(path + '/' + dent.name);
              end while
              closedir(dir);
          fi
      }
      

       

      When walking a tree with dwalk, the function that produces the initial work item takes the input parameter (a directory) passed in by the user and enqueues a string describing that directory. The function that processes a work item dequeues the full path name of a file and calls stat() on it. If the current work item is a directory, it walks a single level within that directory, and each dentry under the directory is added to the global libcircle work queue to be processed independently.
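
      A minimal C sketch of such a per-file process callback is given below; it assumes the libcircle handle API shown earlier, uses stat() as in the pseudo code above, and omits error handling and the MPI reduction of the stat() results:

      ```
      /* Sketch of the per-file Walk_stat_process callback: every dequeued
       * work item is a single path; directory entries become new work items. */
      #include <dirent.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/stat.h>
      #include <libcircle.h>

      static void walk_stat_process(CIRCLE_handle *handle)
      {
          char path[CIRCLE_MAX_STRING_LEN];
          handle->dequeue(path);

          struct stat st;
          if (stat(path, &st) != 0)
              return;
          /* ... record st for the final summary (reduced via MPI) ... */

          if (S_ISDIR(st.st_mode)) {
              DIR *dir = opendir(path);
              if (dir == NULL)
                  return;
              struct dirent *dent;
              while ((dent = readdir(dir)) != NULL) {
                  if (strcmp(dent->d_name, ".") == 0 || strcmp(dent->d_name, "..") == 0)
                      continue;
                  char child[CIRCLE_MAX_STRING_LEN];
                  snprintf(child, sizeof(child), "%s/%s", path, dent->d_name);
                  handle->enqueue(child);   /* each entry is its own work item */
              }
              closedir(dir);
          }
      }
      ```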

      libcircle automatically and dynamically balances the tree-walk workload across many nodes in a large distributed system. However, the minimal unit of work for the current parallel tree walk with stat() in dwalk is a single file. As a result, the files within the same directory may be randomly distributed among different MPI ranks (different processes or nodes), which breaks the sequential stat()-in-readdir()-order access pattern.

      We improve mpiFileUtils so that the minimal splittable unit of work is a directory. Within a directory, it uses a FLAT statahead algorithm to accelerate tree walking with stat(). The pseudo code for the algorithm is as follows:

       

      Walk_Flat_stat_process(CIRCLE Queue) {
          path = Queue.dequeue();
          dir = opendir(path);
          while (dent = readdir(dir)) != NULL do
              fullpath = path + "/" + dent.name;
              localQueue.enqueue(fullpath);
              if dent is a directory then
                  Queue.enqueue(fullpath);
              fi
          end while
          while |localQueue| > 0 do
              path = localQueue.dequeue();
              stat(path);
              record stat() result for final summary via MPI reduce;
          end while
          closedir(dir);
      }
      

      The function that processes a work item first dequeues the full path of a directory and then iterates over a single level within that directory. All child files are added to the local queue of this MPI rank, while each subdirectory is added to the global libcircle work queue to be processed in a distributed fashion.
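
      A minimal C sketch of this per-directory callback is given below; the rank-local queue is represented by a plain array kept in readdir() order, directories are detected with dent->d_type (the DT_UNKNOWN fallback, error handling, and the MPI reduction are omitted), and none of the identifiers are taken from the actual patch:

      ```
      /* Sketch of the FLAT statahead process callback: one work item is a whole
       * directory, children are stat()ed sequentially in readdir() order, and
       * only subdirectories go back onto the global libcircle queue. */
      #include <dirent.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/stat.h>
      #include <libcircle.h>

      static void walk_flat_stat_process(CIRCLE_handle *handle)
      {
          char path[CIRCLE_MAX_STRING_LEN];
          handle->dequeue(path);

          DIR *dir = opendir(path);
          if (dir == NULL)
              return;

          /* Rank-local queue of children, kept in readdir() order. */
          char (*local)[CIRCLE_MAX_STRING_LEN] = NULL;
          size_t count = 0;

          struct dirent *dent;
          while ((dent = readdir(dir)) != NULL) {
              if (strcmp(dent->d_name, ".") == 0 || strcmp(dent->d_name, "..") == 0)
                  continue;
              local = realloc(local, (count + 1) * sizeof(*local));
              snprintf(local[count], sizeof(local[count]), "%s/%s", path, dent->d_name);
              if (dent->d_type == DT_DIR)        /* subdirectories walk on */
                  handle->enqueue(local[count]);
              count++;
          }
          closedir(dir);

          /* Sequential stat() in readdir() order: the statahead-friendly pattern. */
          for (size_t i = 0; i < count; i++) {
              struct stat st;
              if (stat(local[i], &st) == 0) {
                  /* ... record st for the final summary (reduced via MPI) ... */
              }
          }
          free(local);
      }
      ```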

      Although the sequential traversal of a directory, which follows the readdir()-plus-stat() access pattern, can be optimized with our statahead algorithm, it can still become time-consuming when the directory is extremely large. A trade-off strategy is proposed to balance parallelization against the statahead speedup. A tuning parameter named 'stmax' is defined. When the number of entries under the directory is lower than stmax, the algorithm above is used for the tree walk; when it is larger than stmax, the first stmax entries are placed in the local queue and stat()ed sequentially using the FLAT statahead algorithm, while the remaining entries are added to the global queue so that their stat() calls are performed in a distributed fashion. 'stmax' can be set by an input parameter of dwalk; the default value is 5000. The pseudo code for the algorithm is as follows:

      ```

      struct Elem {
          bool localQueued;
          int type;
          char path[];
      };

      ElemEnqueue(Queue, type, localQueued, fullpath) {
          elem = new Elem();
          elem.localQueued = localQueued;
          elem.type = type;
          elem.path = fullpath;
          Queue.enqueue(elem);
      }

      Walk_Xfast_stat_process(CIRCLE Queue) {
          elem = Queue.dequeue();
          if (elem.localQueued == false) {
              stat(elem.path);
              record stat() result for final summary via MPI reduce;
          }
          if (elem.type == DT_DIR) {
              dir = opendir(elem.path);
              count = 0;
              while (dent = readdir(dir)) != NULL do
                  fullpath = elem.path + "/" + dent.name;
                  if (count < stmax) {
                      count++;
                      localQueue.enqueue(fullpath);
                      if (dent.type == DT_DIR)
                          ElemEnqueue(Queue, dent.type, true, fullpath);
                  } else {
                      ElemEnqueue(Queue, dent.type, false, fullpath);
                  }
              end while
              closedir(dir);
          }
          while |localQueue| > 0 do
              path = localQueue.dequeue();
              stat(path);
              record stat() result for final summary via MPI reduce;
          end while
      }
      

      ```
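
      Since a libcircle work item is a flat character array, the Elem record above has to be encoded into and decoded from that buffer. The following is a minimal sketch of one possible encoding; the "%d:%d:path" layout and the function names are illustrative assumptions only, not the format used by the actual patch:

      ```
      /* Pack/unpack an Elem-like record into the character buffer that
       * libcircle hands to handle->enqueue()/handle->dequeue(). */
      #include <stdio.h>
      #include <libcircle.h>

      struct elem {
          int local_queued;   /* already queued for a local, sequential stat()? */
          int type;           /* DT_DIR, DT_REG, ... */
          char path[CIRCLE_MAX_STRING_LEN];
      };

      /* Encode an elem before handing it to handle->enqueue(). */
      static void elem_encode(const struct elem *e, char *buf, size_t len)
      {
          snprintf(buf, len, "%d:%d:%s", e->local_queued, e->type, e->path);
      }

      /* Decode a work item obtained from handle->dequeue(). */
      static void elem_decode(const char *buf, struct elem *e)
      {
          sscanf(buf, "%d:%d:%[^\n]", &e->local_queued, &e->type, e->path);
      }
      ```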

            People

              Assignee: Qian Yingjin (qian_wc)
              Reporter: Qian Yingjin (qian_wc)
              Votes: 0
              Watchers: 5
