Lustre / LU-6984

Failure to delete over a million files in a DNE2 directory.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Environment: pre-2.8 clients with DNE2 directories which contain 1 million or more files
    • Severity: 3

    Description

      In my testing of DNE2 I'm seeing problems when creating 1 million+ files per directory. After clearing out the debug logs, I can see the problem is only on the client side. When running an application I see:

      command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
      Path: /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
      FS: 21.8 TiB Used FS: 0.2% Inodes: 58.7 Mi Used Inodes: 4.6%

      10 tasks, 1000000 files/directories
      aprun: Apid 3172: Caught signal Window changed, sending to application
      08/03/2015 10:34:45: Process 0(nid00028): FAILED in create_remove_directory_tree, Unable to remove directory: No such file or directory
      Rank 0 [Mon Aug 3 10:34:45 2015] [c0-0c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      _pmiu_daemon(SIGCHLD): [NID 00028] [c0-0c0s1n2] [Mon Aug 3 10:34:45 2015] PE RANK 0 exit signal Aborted
      aprun: Apid 3172: Caught signal Interrupt, sending to application
      _pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Mon Aug 3 10:50:50 2015] PE RANK 7 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00018] [c0-0c0s6n2] [Mon Aug 3 10:50:50 2015] PE RANK 9 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00013] [c0-0c0s6n1] [Mon Aug 3 10:50:50 2015] PE RANK 8 exit signal Interrupt

      After the test fails, any attempt to remove the files created by the test also fails. When I attempt to remove the files I see the following errors in dmesg.

      LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) Skipped 7 previous similar messages
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2

      Attachments

        1. lctldump.20150813
          0.2 kB
        2. LU-6381.log
          0.2 kB
        3. LU-6984-backtrace.log
          83 kB
        4. lu-6984-Sept-18-2015.tgz
          0.2 kB


          Activity

            di.wang Di Wang added a comment -

            Both would be best. If not, then only the client would be OK. Thanks


            simmonsja James A Simmons added a comment -

            On the MDS or the client?
            di.wang Di Wang added a comment -

            Hmm, during slave revalidation, it seems the striped directory has been locked with both LOOKUP and UPDATE locks. I do not understand why the master stripe's nlink drops to 1 at that time.

            James: could you please collect the debug log when the failure happens? -1 would be best, but if there is a race, just collect the default one please. Thanks!


            simmonsja James A Simmons added a comment -

            It doesn't matter how many client nodes; I use 400 below, but use whatever you want. What matters is the number of files per directory. Remember this is with remote_dir=-1 and remote_dir_gid=-1. Try using 8 MDS servers, though any number greater than 1 will do:

            lfs setdirstripe -c 8 /lustre/whatever/jsimmons/dne2_8_mds_md_test
            lfs setdirstripe -c 8 -D /lustre/whatever/jsimmons/dne2_8_mds_md_test (to make all directories under it the same)
            mkdir /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400
            mpi_run -n 400 mdtest -I 2500 -i 5 -d /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400

            When mdtest goes to delete the files, it fails. At least it does for me.

            di.wang Di Wang added a comment -

            Could you please tell me how to reproduce the problem? Are you still using mdtest with a single thread on one node? Thanks.


            simmonsja James A Simmons added a comment -

            Better yet, I did a dump_stack() where the code fails, so you have complete backtraces. I attached the file to this ticket.
            di.wang Di Wang added a comment -

            James, thanks for the update, very useful information. Do you happen to know who calls ll_prep_inode()? Thanks!


            simmonsja James A Simmons added a comment -

            The ll_prep_inode errors came back, so I backtracked to see what was causing them. From my cheesy debugging I found this:

            [ 251.338486] Lustre: looks like a bad stripe
            [ 251.342779] Lustre: lmv_revalidate_slaves failed
            [ 251.347499] Lustre: md_merge_attr failed to validate the lsm
            [ 251.353263] Lustre: ll_update_lsm_md failed when called from ll_update_inode
            [ 251.360412] Lustre: ll_update_inode return -2
            [ 251.364877] Lustre: ll_read_inode2 failed
            [ 251.369061] Lustre: Skipped 1 previous similar message
            [ 251.374321] LustreError: 18546:0:(llite_lib.c:2305:ll_prep_inode()) new_inode -fatal: rc -2

            So the problem is from lmv_revalidate_slaves(). It is reporting back -ENOENT due to the body->mbo_nlink < 2 condition. The comment there points to a race between close(unlink) and getattr.
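
            For anyone following along, here is a paraphrased, self-contained sketch of the check being described. The stand-in struct and helper names below are mine; only the mbo_nlink field and the -ENOENT result come from the ticket, and this is not verbatim Lustre source:

            #include <errno.h>

            /* Stand-in for the relevant field of struct mdt_body (illustrative only). */
            struct mdt_body_view {
                    unsigned int mbo_nlink;
            };

            /*
             * Paraphrase of the stripe sanity check described above: every stripe
             * of a striped directory is itself a directory, so its nlink should be
             * at least 2 ("." plus its link from the master).  A smaller value is
             * treated as a stripe caught racing with close(unlink), and -ENOENT is
             * returned; that is the -2 which then surfaces as
             * "new_inode -fatal: rc -2" in ll_prep_inode().
             */
            static int check_stripe_nlink(const struct mdt_body_view *body)
            {
                    if (body->mbo_nlink < 2)
                            return -ENOENT;
                    return 0;
            }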


            simmonsja James A Simmons added a comment -

            Good news: I can duplicate this problem on a non-Cray node. I'm debugging it and will let you know what I find.
            di.wang Di Wang added a comment -

            Hmm, it seems there might be two places in mdtest.c that can trigger the problem.

            #else
                  if (rmdir(dir) == -1) {
                    FAIL("Unable to remove directory");
                  }
            #endif
            

            and

                    if (rmdir(temp_path) == -1) {
                      FAIL("Unable to remove directory");
                    }
            

            So could you please add some debug information here to print out dir or temp_path when these rmdir failures happen?
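
            For illustration, one minimal way to do that is to wrap the failing call so the path and errno get printed before mdtest's FAIL() fires. The rmdir_verbose() helper below is hypothetical, not from mdtest.c or from any patch on this ticket:

            #include <errno.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            /* Hypothetical debugging helper: report which directory rmdir()
             * refused to remove, and why, before the caller hits FAIL(). */
            static int rmdir_verbose(const char *path)
            {
                    int rc = rmdir(path);

                    if (rc == -1)
                            fprintf(stderr, "rmdir(\"%s\") failed: %s\n",
                                    path, strerror(errno));
                    return rc;
            }

            Replacing the rmdir(dir) and rmdir(temp_path) calls above with rmdir_verbose(dir) and rmdir_verbose(temp_path) would then show exactly which path is coming back with ENOENT.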

            If rmdir() fails on the kernel side, there are probably two possible causes:

            1. Path resolution failed for "dir" or "temp_path". I am not sure there is a good way to printk the error message without changing the kernel, but let's see what these directories are when the failure happens.

            2. rmdir fails in ll_rmdir(). For this failure, you can simply add a patch like this:

            --- a/lustre/llite/namei.c
            +++ b/lustre/llite/namei.c
            @@ -1198,7 +1198,10 @@ static int ll_rmdir(struct inode *dir, struct dentry *dchild)
                     if (rc == 0) {
                             ll_update_times(request, dir);
                             ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_RMDIR, 1);
            -        }
            +        } else {
            +               CERROR("unlink fails :name=%.*s, dir="DFID"(%p) rc: %d\n",
            +                       name->len, name->name, PFID(ll_inode2fid(dir)), dir, rc);
            +       }
             
                     ptlrpc_req_finished(request);
                     RETURN(rc);
            

            TBH, I suspect the failure is caused by (1). Sigh, I cannot reproduce it locally.


            simmonsja James A Simmons added a comment -

            With the latest master and a few patches this problem still exists. The "ll_prep_inode" errors are gone now. I tried to debug it with lctl debug, but turning on debugging hides the race condition. Do you know which areas I should add old-fashioned printks to in order to see what is causing the failure?

            People

              Assignee: di.wang Di Wang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 11
