LU-6984: Failure to delete over a million files in a DNE2 directory.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Environment: pre-2.8 clients with DNE2 directories which contain 1 million or more files.
    • Severity: 3

    Description

      In my testing of DNE2 I'm seeing problems when creating 1 million+ files per directory. Going through the debug logs, I see the problem is only on the client side. When running an application I see:

      command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
      Path: /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
      FS: 21.8 TiB Used FS: 0.2% Inodes: 58.7 Mi Used Inodes: 4.6%

      10 tasks, 1000000 files/directories
      aprun: Apid 3172: Caught signal Window changed, sending to application
      08/03/2015 10:34:45: Process 0(nid00028): FAILED in create_remove_directory_tree, Unable to remove directory: No such file or directory
      Rank 0 [Mon Aug 3 10:34:45 2015] [c0-0c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      _pmiu_daemon(SIGCHLD): [NID 00028] [c0-0c0s1n2] [Mon Aug 3 10:34:45 2015] PE RANK 0 exit signal Aborted
      aprun: Apid 3172: Caught signal Interrupt, sending to application
      _pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Mon Aug 3 10:50:50 2015] PE RANK 7 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00018] [c0-0c0s6n2] [Mon Aug 3 10:50:50 2015] PE RANK 9 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00013] [c0-0c0s6n1] [Mon Aug 3 10:50:50 2015] PE RANK 8 exit signal Interrupt

      After the test fails, any attempt to remove the files created by the test also fails. When I attempt to remove the files, I see the following errors in dmesg:

      LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) Skipped 7 previous similar messages
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
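
      The rc -2 in these ll_prep_inode() messages is -ENOENT, the same "No such file or directory" that mdtest reports. For illustration only (not part of the original report; the file name below is a placeholder), a minimal user-space check of the errno an application sees when such a removal fails:

          #include <errno.h>
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          /* Illustrative only: unlink a path the MDS claims is gone and print
           * the errno.  On this bug, rm hits ENOENT (errno 2), which matches
           * the "rc -2" reported by ll_prep_inode() in dmesg. */
          int main(int argc, char *argv[])
          {
                  const char *path = argc > 1 ? argv[1] : "leftover_mdtest_file";

                  if (unlink(path) < 0)
                          fprintf(stderr, "unlink(%s) failed: errno=%d (%s)\n",
                                  path, errno, strerror(errno));
                  else
                          printf("unlink(%s) succeeded\n", path);
                  return 0;
          }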

      Attachments

        1. lctldump.20150813
          0.2 kB
        2. LU-6381.log
          0.2 kB
        3. LU-6984-backtrace.log
          83 kB
        4. lu-6984-Sept-18-2015.tgz
          0.2 kB

        Issue Links

          Activity

            [LU-6984] Failure to delete over a million files in a DNE2 directory.

            gerrit Gerrit Updater added a comment:

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16490
            Subject: LU-6984 lmv: remove nlink check in lmv_revalidate_slaves
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d676ea2ab1f55dfa8e04ed5fa074444315808329

            di.wang Di Wang (Inactive) added a comment:

            Ah, it seems this check in lmv_revalidate_slaves is not correct:

            if (unlikely(body->mbo_nlink < 2)) {
                    /* If this is a bad stripe, most likely due
                     * to the race between close(unlink) and
                     * getattr, let's return -ENOENT, so llite
                     * will revalidate the dentry, see
                     * ll_inode_revalidate_fini() */
                    CDEBUG(D_INODE, "%s: nlink %d < 2 bad stripe %d"
                           DFID ":" DFID"\n",
                           obd->obd_name, body->mbo_nlink, i,
                           PFID(&lsm->lsm_md_oinfo[i].lmo_fid),
                           PFID(&lsm->lsm_md_oinfo[0].lmo_fid));

                    if (it.d.lustre.it_lock_mode && lockh) {
                            ldlm_lock_decref_and_cancel(lockh,
                                    it.d.lustre.it_lock_mode);
                            it.d.lustre.it_lock_mode = 0;
                    }

                    GOTO(cleanup, rc = -ENOENT);
            }

            Because

            /*
             * The DIR_NLINK feature allows directories to exceed LDISKFS_LINK_MAX
             * (65000) subdirectories by storing "1" in i_nlink if the link count
             * would otherwise overflow. Directory traversal tools understand
             * that (st_nlink == 1) indicates that the filesystem does not track
             * the hard link count on the directory, and will not abort subdirectory
             * scanning early once (st_nlink - 2) subdirs have been found.
             *
             * This also has to properly handle the case of inodes with nlink == 0
             * in case they are being linked into the PENDING directory.
             */
            

            I will remove this.
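
            (Not the actual patch from change 16490; just an illustrative user-space sketch of the DIR_NLINK convention the comment above describes: st_nlink == 1 on a heavily populated directory means the subdirectory count is no longer tracked, and only st_nlink == 0 indicates an unlinked inode.)

                #include <stdio.h>
                #include <sys/stat.h>

                /* Illustrative only: interpret a directory's st_nlink the way the
                 * DIR_NLINK comment describes.  nlink == 1 means the filesystem
                 * stopped tracking subdirectory counts (overflow), so it must not
                 * be treated as an error; only nlink == 0 means the inode is
                 * being unlinked (e.g. moved to PENDING). */
                static const char *nlink_meaning(nlink_t nlink)
                {
                        if (nlink == 0)
                                return "unlinked (e.g. in PENDING)";
                        if (nlink == 1)
                                return "link count not tracked (DIR_NLINK overflow)";
                        return "normal: nlink - 2 subdirectories";
                }

                int main(int argc, char *argv[])
                {
                        struct stat st;
                        const char *path = argc > 1 ? argv[1] : ".";

                        if (stat(path, &st) < 0) {
                                perror("stat");
                                return 1;
                        }
                        printf("%s: st_nlink=%lu -> %s\n", path,
                               (unsigned long)st.st_nlink,
                               nlink_meaning(st.st_nlink));
                        return 0;
                }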


            simmonsja James A Simmons added a comment:

            The first run of mdtest takes a while before failure. Once it fails, you can reproduce the failure by running rm -rf on the leftover files from mdtest.

            I attached the logs for my latest test from the client node and all the MDS servers I have.


            di.wang Di Wang (Inactive) added a comment:

            James: thanks. Usually how soon did you hit the failure? After a few minutes? A few hours after starting the test?

            simmonsja James A Simmons added a comment:

            I did you one better. Grab my source rpm at http://www.infradead.org/~jsimmons/lustre-2.7.59-1_g703195a.src.rpm

            di.wang Di Wang (Inactive) added a comment:

            OK, I tried to reproduce it on the OpenSFS cluster with 8 MDTs (4 MDS), 4 OSTs (2 OSS) and 9 clients. I just started the test; it has been an hour and I still cannot see this problem. I will check tomorrow morning to see how it goes.

            James: could you please tell me all of your patches (based on master)? Thanks.


            di.wang Di Wang (Inactive) added a comment:

            Both would be best. If not, then only the client would be OK. Thanks.


            simmonsja James A Simmons added a comment:

            On the MDS or the client?


            di.wang Di Wang (Inactive) added a comment:

            Hmm, during slave revalidation, it seems the striped directory has been locked with both LOOKUP and UPDATE locks. I do not understand why the master stripe's nlink drops to 1 at that time.

            James: Could you please collect the debug log when the failure happens? (-1) would be best, but if there is a race, just collect the default one please. Thanks!


            simmonsja James A Simmons added a comment:

            It doesn't matter how many client nodes; I use 400 below, but use whatever you want. What matters is the number of files per directory. Remember this is with remote_dir=-1 and remote_dir_gid=-1. Try using 8 MDS servers, but any number greater than 1 will do:

            lfs setdirstripe -c 8 /lustre/whatever/jsimmons/dne2_8_mds_md_test
            lfs setdirstripe -c 8 -D /lustre/whatever/jsimmons/dne2_8_mds_md_test (to make all directories under it the same)
            mkdir /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400
            mpi_run -n 400 mdtest -I 2500 -i 5 -d /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400

            When mdtest goes to delete the files, it will fail. At least it does for me.
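
            (For reference only; this is not part of the recipe above and the directory path is a placeholder. A minimal single-client C sketch of the same create-then-remove pattern, in case a non-MPI exerciser is handy.)

                #include <errno.h>
                #include <fcntl.h>
                #include <stdio.h>
                #include <stdlib.h>
                #include <string.h>
                #include <unistd.h>

                /* Illustrative single-client exerciser (not mdtest): create NFILES
                 * files in the target striped directory, then unlink them all,
                 * which is roughly the create/remove pattern that trips this bug. */
                int main(int argc, char *argv[])
                {
                        const char *dir = argc > 1 ? argv[1] : "/lustre/whatever/striped_dir";
                        long nfiles = argc > 2 ? atol(argv[2]) : 1000000;
                        char path[4096];
                        long i;

                        for (i = 0; i < nfiles; i++) {
                                int fd;

                                snprintf(path, sizeof(path), "%s/file.%ld", dir, i);
                                fd = open(path, O_CREAT | O_WRONLY, 0644);
                                if (fd < 0) {
                                        fprintf(stderr, "create %s: %s\n",
                                                path, strerror(errno));
                                        return 1;
                                }
                                close(fd);
                        }

                        for (i = 0; i < nfiles; i++) {
                                snprintf(path, sizeof(path), "%s/file.%ld", dir, i);
                                if (unlink(path) < 0)
                                        fprintf(stderr, "unlink %s: %s\n",
                                                path, strerror(errno));
                        }
                        return 0;
                }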


            di.wang Di Wang (Inactive) added a comment:

            Could you please tell me how to reproduce the problem? Still use mdtest with a single thread on 1 node? Thanks.


            People

              Assignee: di.wang Di Wang (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 11
