  1. Lustre
  2. LU-16429

Create-ahead: create massive files by batched RPC


Details

    • New Feature
    • Resolution: Unresolved
    • Minor

    Description

      To improve efficiency and overall throughput, a create-ahead mechanism is proposed to optimize file creation in Lustre.

      In the traditional POSIX API, a user generally invokes the open() system call with the O_CREAT flag to create a new file.
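
      As a point of reference, the conventional path can be sketched as below: each file is created by its own open(O_CREAT) call. The helper name create_files and the temporary directory are illustrative only. On Lustre, each such call would normally need its own round trip to the metadata server, which is the cost create-ahead targets.

```python
import os
import tempfile


def create_files(directory, names, mode=0o644):
    """Create each file with its own open(O_CREAT) call; return the count."""
    created = 0
    for name in names:
        path = os.path.join(directory, name)
        # O_EXCL makes the call fail if the file already exists, so every
        # successful call here really is a creation.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, mode)
        os.close(fd)
        created += 1
    return created


# Illustrative run in a local temporary directory (not a Lustre mount).
with tempfile.TemporaryDirectory() as tmp:
    names = [f"mdtest.0.{i}" for i in range(4)]
    count = create_files(tmp, names)
    listing = sorted(os.listdir(tmp))
```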

      We use the open cache mechanism to cache the results of create-ahead files with strong consistency. The application can then use the valid cached open handle for a create-ahead file directly, without interacting with the server.

      To avoid dependencies on the metadata hierarchy, our create-ahead mechanism only creates ahead files under the same directory. To begin create-ahead, the client must know the file names in advance. Some applications with a batch access pattern follow certain naming rules. The mdtest creation benchmark is a typical example, in which the file naming format is mdtest.$rank.$i. If the file naming in the creation sequence is predictable, the kernel can automatically detect such an I/O pattern and then enable create-ahead for the directory.
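
      One possible way to detect a predictable naming sequence is sketched below in user-space Python. It is not the kernel's detection logic, which this ticket does not specify. The function predict_next_names and its rule (a shared prefix followed by a consecutively increasing, non-zero-padded integer) are assumptions made for illustration.

```python
import re

# A name is split into a prefix and a trailing decimal index,
# e.g. "mdtest.3.7" -> ("mdtest.3.", 7).
_TRAILING_INDEX = re.compile(r"^(.*?)(\d+)$")


def predict_next_names(recent_names, count):
    """Predict the next `count` names, or return [] if no pattern is seen.

    The pattern accepted here is deliberately narrow: all recent names
    share one prefix and end in consecutive integers (step of +1).
    Zero-padded indices such as "file007" are rejected for simplicity.
    """
    if len(recent_names) < 2:
        return []
    parsed = []
    for name in recent_names:
        match = _TRAILING_INDEX.match(name)
        if match is None:
            return []
        digits = match.group(2)
        if len(digits) > 1 and digits.startswith("0"):
            return []
        parsed.append((match.group(1), int(digits)))
    prefix = parsed[0][0]
    for (prev_prefix, prev_idx), (cur_prefix, cur_idx) in zip(parsed, parsed[1:]):
        if prev_prefix != prefix or cur_prefix != prefix:
            return []
        if cur_idx != prev_idx + 1:
            return []
    last = parsed[-1][1]
    return [f"{prefix}{last + k}" for k in range(1, count + 1)]
```

      For example, after seeing mdtest.3.0, mdtest.3.1 and mdtest.3.2, this sketch predicts mdtest.3.3 and mdtest.3.4. A sequence with a gap or with differing prefixes yields no prediction.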

      Other applications, such as mpiFileUtils/dcp, usually take a file name list as input and perform batch creations with irregular file names. For this case, a simple API is provided through which an application can inform the kernel of the list of file names that will be created. The file name list must be given in the same order as the creation sequence in the application.
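
      The ticket does not define the hint API itself, so it is not shown here. The sketch below only illustrates how an application might prepare its input: because create-ahead works per directory and expects names in creation order, a path list can be split into one ordered name list per parent directory. The function name group_names_by_directory and the example paths are hypothetical.

```python
import posixpath


def group_names_by_directory(paths):
    """Map each parent directory to its file names, in creation order.

    Python dicts and lists preserve insertion order, so the per-directory
    lists keep the order in which the application will create the files.
    """
    groups = {}
    for path in paths:
        parent, name = posixpath.split(path)
        groups.setdefault(parent, []).append(name)
    return groups


# Hypothetical creation sequence with irregular names.
creation_order = [
    "/mnt/lustre/out/b.dat",
    "/mnt/lustre/out/a.dat",
    "/mnt/lustre/logs/x.log",
    "/mnt/lustre/out/c.dat",
]
hints = group_names_by_directory(creation_order)
```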

       

      The create-ahead mechanism works as follows:

      Once create-ahead is enabled for a directory, either by automatic detection or via the API, the client starts a dedicated kernel thread to do the create-ahead work. The thread packs file names, builds a set of creations into one compound RPC, and sends it to the metadata server asynchronously. For each sub-creation request, the server first creates the file with the given name and then returns an open lock with the appropriate extra inode bits. If the file is created with a DoM layout and it is detected or hinted that the application will subsequently write the file, the server returns a <PW, OPEN|DOM> lock to the client; the DOM ibits lock can be reused for the later write, avoiding extra lock traffic in the future. Otherwise, it grants a <CR, OPEN> lock for batch read access or a <CW, OPEN> lock for batch write access.
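
      The packing step and the lock choice described above can be modelled roughly as follows. This is a user-space sketch, not Lustre code: SubCreate, pack_batches, choose_lock and the max_per_rpc limit are hypothetical names, and the lock is represented as a plain (mode, ibits) tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCreate:
    """One sub-creation request inside a compound RPC (illustrative)."""
    name: str


def pack_batches(names, max_per_rpc):
    """Split an ordered name list into batches, one per compound RPC."""
    if max_per_rpc < 1:
        raise ValueError("max_per_rpc must be at least 1")
    return [
        [SubCreate(name) for name in names[start:start + max_per_rpc]]
        for start in range(0, len(names), max_per_rpc)
    ]


def choose_lock(dom_layout, will_write):
    """Return the (mode, ibits) granted for one created file.

    Mirrors the rule in the text: a DoM file that will be written gets
    <PW, OPEN|DOM>; otherwise <CW, OPEN> for batch write access or
    <CR, OPEN> for batch read access.
    """
    if dom_layout and will_write:
        return ("PW", ("OPEN", "DOM"))
    if will_write:
        return ("CW", ("OPEN",))
    return ("CR", ("OPEN",))


batches = pack_batches([f"mdtest.0.{i}" for i in range(5)], max_per_rpc=2)
```

      With five names and a limit of two sub-requests per RPC, the sketch produces three batches, so five creations cost three asynchronous RPCs instead of five synchronous ones.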

       

      Since create-ahead files are created in the same order as the application creates them, a sliding window is used to control create-ahead progress. The control algorithm is the same as for statahead. Each directory with create-ahead enabled maintains a hash table of create-ahead files. When a file is created ahead, a corresponding file entry is inserted into the hash table, keyed by file name.

      When an application calls open() with the O_CREAT flag on a new file whose parent directory has create-ahead enabled, the client first searches the hash table. If there is no file entry with the same name, the client must contact the server to perform the creation synchronously. Otherwise, if create-ahead for this file entry has finished, the application can open the file locally without interacting with the server; if it has not finished, the application must wait for the result of the asynchronous compound RPC. A successful create-ahead hit moves the sliding window forward and drops the corresponding file entry from the hash table.
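
      The hash table lookup and sliding window can be simulated in a few lines. The sketch below makes simplifying assumptions: the window has a fixed size (the actual statahead-style control algorithm is not reproduced here), asynchronous replies are simulated by an explicit complete() call, and the class and method names are hypothetical.

```python
from collections import deque

PENDING, DONE = "pending", "done"


class CreateAheadDir:
    """Per-directory create-ahead state (user-space simulation)."""

    def __init__(self, names, window):
        self._todo = deque(names)  # names not yet sent, in creation order
        self._window = window      # maximum entries kept in the hash table
        self.entries = {}          # hash table: file name -> PENDING or DONE
        self._fill_window()

    def _fill_window(self):
        # Issue create-ahead for further names while the window has room.
        while self._todo and len(self.entries) < self._window:
            self.entries[self._todo.popleft()] = PENDING

    def complete(self, name):
        """Simulate the asynchronous RPC reply for one file arriving."""
        if name in self.entries:
            self.entries[name] = DONE

    def open_create(self, name):
        """Return how an open(O_CREAT) on `name` would be served."""
        state = self.entries.get(name)
        if state is None:
            return "sync"  # miss: create synchronously on the server
        # Hit: either the result is already cached ("local") or the caller
        # has to wait for the in-flight compound RPC ("wait").
        outcome = "local" if state == DONE else "wait"
        del self.entries[name]  # drop the entry from the hash table
        self._fill_window()     # the window slides forward
        return outcome


d = CreateAheadDir(["f0", "f1", "f2", "f3"], window=2)
d.complete("f0")
results = [d.open_create("f0"), d.open_create("f1"), d.open_create("zzz")]
```

      In this run, f0 is served locally, f1 has to wait for its reply, and the unknown name zzz falls back to a synchronous create. After the two hits, the window has advanced to f2 and f3.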

       

      By using the open cache and RPC compounding, our create-ahead mechanism eliminates many costly round trips, which saves substantial latency and significantly improves creation performance.

       

      Combined with the batching of massive small buffered writes (LU-16355), this should improve the performance of IO500 mdtest-hard-write.

       

      TODO:

      Open replay data for batched open/create...

          People

            qian_wc Qian Yingjin