
[LU-6984] Failure to delete over a million files in a DNE2 directory.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Environment: pre-2.8 clients with DNE2 directories which contain 1 million or more files.
    • Severity: 3

    Description

      In my testing of DNE2 I'm seeing problems when creating 1 million+ files per directory. After clearing out the debug logs, I see the problem is only on the client side. When running an application I see:

      command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
      Path: /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
      FS: 21.8 TiB Used FS: 0.2% Inodes: 58.7 Mi Used Inodes: 4.6%

      10 tasks, 1000000 files/directories
      aprun: Apid 3172: Caught signal Window changed, sending to application
      08/03/2015 10:34:45: Process 0(nid00028): FAILED in create_remove_directory_tree, Unable to remove directory: No such file or directory
      Rank 0 [Mon Aug 3 10:34:45 2015] [c0-0c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      _pmiu_daemon(SIGCHLD): [NID 00028] [c0-0c0s1n2] [Mon Aug 3 10:34:45 2015] PE RANK 0 exit signal Aborted
      aprun: Apid 3172: Caught signal Interrupt, sending to application
      _pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Mon Aug 3 10:50:50 2015] PE RANK 7 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00018] [c0-0c0s6n2] [Mon Aug 3 10:50:50 2015] PE RANK 9 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00013] [c0-0c0s6n1] [Mon Aug 3 10:50:50 2015] PE RANK 8 exit signal Interrupt

      After the test failed, any attempt to remove the files created by these tests fails. When I attempt to remove the files I see the following errors in dmesg.

      LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) Skipped 7 previous similar messages
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2

      Attachments

        1. lctldump.20150813
          0.2 kB
        2. LU-6381.log
          0.2 kB
        3. LU-6984-backtrace.log
          83 kB
        4. lu-6984-Sept-18-2015.tgz
          0.2 kB


          Activity

            [LU-6984] Failure to delete over a million files in a DNE2 directory.
            di.wang Di Wang added a comment -

            James: thanks. And usually how soon did you hit the failure? After a few minutes? A few hours after starting the test?

            simmonsja James A Simmons added a comment -

            I did you one better. Grab my source rpm at http://www.infradead.org/~jsimmons/lustre-2.7.59-1_g703195a.src.rpm
            di.wang Di Wang added a comment -

            Ok, I tried to reproduce it on the OpenSFS cluster with 8 MDTs (4 MDS), 4 OSTs (2 OSS), and 9 clients. I just started the test; it has been running for an hour and I still cannot see this problem. I will check tomorrow morning to see how it goes.

            James: could you please tell me all of your patches (based on master)? Thanks.

            di.wang Di Wang added a comment -

            Both would be best. If not, then just the client would be OK. Thanks


            simmonsja James A Simmons added a comment -

            On the MDS or the client?
            di.wang Di Wang added a comment -

            Hmm, during slave revalidation, it seems the striped directory has been locked with both LOOKUP and UPDATE locks. I do not understand why the master stripe's nlink turns to 1 at that time.

            James: Could you please collect the debug log when the failure happens? (-1) would be best, but if there is a race, just collect the default one please. Thanks!

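            For reference, capturing the requested full debug log on a client typically looks something like the following (a sketch; the buffer size and dump path are arbitrary choices):

            lctl set_param debug=-1            # enable all debug flags for a full (-1) log
            lctl set_param debug_mb=1024       # optionally enlarge the debug buffer so the failure window is not overwritten
            lctl clear                         # clear the buffer right before reproducing the failure
            lctl dk /tmp/lu-6984-client.log    # dump the kernel debug buffer to a file once the failure has occurred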

            simmonsja James A Simmons added a comment -

            It doesn't matter how many client nodes you use. I use 400 below, but use whatever you want. What matters is the number of files per directory. Remember this is with remote_dir=-1 and remote_dir_gid=-1. Try using 8 MDS servers, but any number greater than 1 will do:

            lfs setdirstripe -c 8 /lustre/whatever/jsimmons/dne2_8_mds_md_test
            lfs setdirstripe -c 8 -D /lustre/whatever/jsimmons/dne2_8_mds_md_test (to make all directories under it the same)
            mkdir /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400
            mpi_run -n 400 mdtest -I 2500 -i 5 -d /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400

            When mdtest goes to delete the files, it fails. At least it does for me.

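            Before running mdtest, the striped layout can be double-checked with lfs getdirstripe (a sketch; the paths are the example paths above):

            lfs getdirstripe /lustre/whatever/jsimmons/dne2_8_mds_md_test                    # stripe count and MDT indices of the parent directory
            lfs getdirstripe -D /lustre/whatever/jsimmons/dne2_8_mds_md_test                 # the default layout new subdirectories will inherit
            lfs getdirstripe /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400   # confirm the test directory picked up the 8-stripe layout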
            di.wang Di Wang added a comment -

            Could you please tell me how to reproduce the problem? Do you still use mdtest with a single thread on 1 node? Thanks.


            simmonsja James A Simmons added a comment -

            Better yet, I did a dump_stack() where the code fails, so you have complete backtraces. I attached the file to this ticket.
            di.wang Di Wang added a comment -

            James, thanks for updating, very useful information. Do you happen to know who calls ll_prep_inode()? Thanks!


            simmonsja James A Simmons added a comment -

            The ll_prep_inode errors came back, so I backtracked to see what was causing the error. From my cheesy debugging I found this:

            [ 251.338486] Lustre: looks like a bad stripe
            [ 251.342779] Lustre: lmv_revalidate_slaves failed
            [ 251.347499] Lustre: md_merge_attr failed to validate the lsm
            [ 251.353263] Lustre: ll_update_lsm_md failed when called from ll_update_inode
            [ 251.360412] Lustre: ll_update_inode return -2
            [ 251.364877] Lustre: ll_read_inode2 failed
            [ 251.369061] Lustre: Skipped 1 previous similar message
            [ 251.374321] LustreError: 18546:0:(llite_lib.c:2305:ll_prep_inode()) new_inode -fatal: rc -2

            So the problem is from lmv_revalidate_slaves. It's reporting back ENOENT due to the body->mbo_nlink < 2 condition. The comment points to a race between close(unlink) and getattr.

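            For anyone following along, the link count the client currently reports for the striped directory and the related console errors can be pulled with standard tools (a sketch; the path is the example path from the reproduction steps above):

            stat -c 'nlink=%h' /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400   # link count the client reports for the striped directory
            dmesg | grep -E 'lmv_revalidate_slaves|ll_prep_inode'                              # collect the related console messages after a failed unlink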

            People

              Assignee: di.wang Di Wang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 11
