LU-6984: Failure to delete over a million files in a DNE2 directory.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Environment: pre-2.8 clients with DNE2 directories which contain 1 million or more files.
    • Severity: 3

    Description

      In my testing of DNE2 I'm seeing problems when creating 1 million+ files per directory. Going through the debug logs, I see the problem is only on the client side. When running an application I see:

      command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
      Path: /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
      FS: 21.8 TiB Used FS: 0.2% Inodes: 58.7 Mi Used Inodes: 4.6%

      10 tasks, 1000000 files/directories
      aprun: Apid 3172: Caught signal Window changed, sending to application
      08/03/2015 10:34:45: Process 0(nid00028): FAILED in create_remove_directory_tree, Unable to remove directory: No such file or directory
      Rank 0 [Mon Aug 3 10:34:45 2015] [c0-0c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      _pmiu_daemon(SIGCHLD): [NID 00028] [c0-0c0s1n2] [Mon Aug 3 10:34:45 2015] PE RANK 0 exit signal Aborted
      aprun: Apid 3172: Caught signal Interrupt, sending to application
      _pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Mon Aug 3 10:50:50 2015] PE RANK 7 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00018] [c0-0c0s6n2] [Mon Aug 3 10:50:50 2015] PE RANK 9 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00013] [c0-0c0s6n1] [Mon Aug 3 10:50:50 2015] PE RANK 8 exit signal Interrupt

      After the test fails, any attempt to remove the files created by the test also fails. When I attempt to remove the files, I see the following errors in dmesg:

      LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) Skipped 7 previous similar messages
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
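
      The rc -2 in these ll_prep_inode() messages is -ENOENT, the same "No such file or directory" that mdtest reports. For illustration only (not part of the original report; the file name below is a placeholder), a minimal user-space check of the errno an application sees when such a removal fails:

          #include <errno.h>
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          /* Illustrative only: unlink a path the MDS claims is gone and print
           * the errno.  On this bug, rm hits ENOENT (errno 2), which matches
           * the "rc -2" reported by ll_prep_inode() in dmesg. */
          int main(int argc, char *argv[])
          {
                  const char *path = argc > 1 ? argv[1] : "leftover_mdtest_file";

                  if (unlink(path) < 0)
                          fprintf(stderr, "unlink(%s) failed: errno=%d (%s)\n",
                                  path, errno, strerror(errno));
                  else
                          printf("unlink(%s) succeeded\n", path);
                  return 0;
          }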

      Attachments

        1. lctldump.20150813
          0.2 kB
        2. LU-6381.log
          0.2 kB
        3. LU-6984-backtrace.log
          83 kB
        4. lu-6984-Sept-18-2015.tgz
          0.2 kB

        Issue Links

          Activity

            [LU-6984] Failure to delete over a million files in a DNE2 directory.

            gerrit Gerrit Updater added a comment:

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16490
            Subject: LU-6984 lmv: remove nlink check in lmv_revalidate_slaves
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d676ea2ab1f55dfa8e04ed5fa074444315808329

            di.wang Di Wang (Inactive) added a comment:

            Ah, it seems this check in lmv_revalidate_slaves is not correct:

            if (unlikely(body->mbo_nlink < 2)) {
                    /* If this is a bad stripe, most likely due
                     * to the race between close(unlink) and
                     * getattr, let's return -ENOENT, so llite
                     * will revalidate the dentry, see
                     * ll_inode_revalidate_fini() */
                    CDEBUG(D_INODE, "%s: nlink %d < 2 bad stripe %d"
                           DFID ":" DFID"\n",
                           obd->obd_name, body->mbo_nlink, i,
                           PFID(&lsm->lsm_md_oinfo[i].lmo_fid),
                           PFID(&lsm->lsm_md_oinfo[0].lmo_fid));

                    if (it.d.lustre.it_lock_mode && lockh) {
                            ldlm_lock_decref_and_cancel(lockh,
                                    it.d.lustre.it_lock_mode);
                            it.d.lustre.it_lock_mode = 0;
                    }

                    GOTO(cleanup, rc = -ENOENT);
            }

            Because

            /*
             * The DIR_NLINK feature allows directories to exceed LDISKFS_LINK_MAX
             * (65000) subdirectories by storing "1" in i_nlink if the link count
             * would otherwise overflow. Directory traversal tools understand
             * that (st_nlink == 1) indicates that the filesystem does not track
             * the hard link count on the directory, and will not abort subdirectory
             * scanning early once (st_nlink - 2) subdirs have been found.
             *
             * This also has to properly handle the case of inodes with nlink == 0
             * in case they are being linked into the PENDING directory.
             */
            

            I will remove this.
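
            (Not the actual patch from change 16490; just an illustrative user-space sketch of the DIR_NLINK convention the comment above describes: st_nlink == 1 on a heavily populated directory means the subdirectory count is no longer tracked, and only st_nlink == 0 indicates an unlinked inode.)

                #include <stdio.h>
                #include <sys/stat.h>

                /* Illustrative only: interpret a directory's st_nlink the way the
                 * DIR_NLINK comment describes.  nlink == 1 means the filesystem
                 * stopped tracking subdirectory counts (overflow), so it must not
                 * be treated as an error; only nlink == 0 means the inode is
                 * being unlinked (e.g. moved to PENDING). */
                static const char *nlink_meaning(nlink_t nlink)
                {
                        if (nlink == 0)
                                return "unlinked (e.g. in PENDING)";
                        if (nlink == 1)
                                return "link count not tracked (DIR_NLINK overflow)";
                        return "normal: nlink - 2 subdirectories";
                }

                int main(int argc, char *argv[])
                {
                        struct stat st;
                        const char *path = argc > 1 ? argv[1] : ".";

                        if (stat(path, &st) < 0) {
                                perror("stat");
                                return 1;
                        }
                        printf("%s: st_nlink=%lu -> %s\n", path,
                               (unsigned long)st.st_nlink,
                               nlink_meaning(st.st_nlink));
                        return 0;
                }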


            simmonsja James A Simmons added a comment:

            The first run of mdtest takes a while before failure. Once it fails, you can reproduce the failure by running rm -rf on the leftover files from mdtest.

            I attached the logs for my latest test from the client node and all the MDS servers I have.


            di.wang Di Wang (Inactive) added a comment:

            James: thanks. Usually how soon did you hit the failure? After a few minutes? A few hours after starting the test?

            simmonsja James A Simmons added a comment:

            I did you one better. Grab my source rpm at http://www.infradead.org/~jsimmons/lustre-2.7.59-1_g703195a.src.rpm

            di.wang Di Wang (Inactive) added a comment:

            OK, I tried to reproduce it on the OpenSFS cluster with 8 MDTs (4 MDS), 4 OSTs (2 OSS) and 9 clients. I just started the test; it has been an hour and I still cannot see this problem. I will check tomorrow morning to see how it goes.

            James: could you please tell me all of your patches (based on master)? Thanks.


            di.wang Di Wang (Inactive) added a comment:

            Both would be best. If not, then only the client would be OK. Thanks.


            simmonsja James A Simmons added a comment:

            On the MDS or the client?


            di.wang Di Wang (Inactive) added a comment:

            Hmm, during slave revalidation, it seems the striped directory has been locked with both LOOKUP and UPDATE locks. I do not understand why the master stripe's nlink drops to 1 at that time.

            James: Could you please collect the debug log when the failure happens? (-1) would be best, but if there is a race, just collect the default one please. Thanks!


            simmonsja James A Simmons added a comment:

            It doesn't matter how many client nodes; I use 400 below, but use whatever you want. What matters is the number of files per directory. Remember this is with remote_dir=-1 and remote_dir_gid=-1. Try using 8 MDS servers, but any number greater than 1 will do:

            lfs setdirstripe -c 8 /lustre/whatever/jsimmons/dne2_8_mds_md_test
            lfs setdirstripe -c 8 -D /lustre/whatever/jsimmons/dne2_8_mds_md_test (to make all directories under it the same)
            mkdir /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400
            mpi_run -n 400 mdtest -I 2500 -i 5 -d /lustre/whatever/jsimmons/dne2_8_mds_md_test/shared_1000k_400

            When mdtest goes to delete the files, it will fail. At least it does for me.
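
            (For reference only; this is not part of the recipe above and the directory path is a placeholder. A minimal single-client C sketch of the same create-then-remove pattern, in case a non-MPI exerciser is handy.)

                #include <errno.h>
                #include <fcntl.h>
                #include <stdio.h>
                #include <stdlib.h>
                #include <string.h>
                #include <unistd.h>

                /* Illustrative single-client exerciser (not mdtest): create NFILES
                 * files in the target striped directory, then unlink them all,
                 * which is roughly the create/remove pattern that trips this bug. */
                int main(int argc, char *argv[])
                {
                        const char *dir = argc > 1 ? argv[1] : "/lustre/whatever/striped_dir";
                        long nfiles = argc > 2 ? atol(argv[2]) : 1000000;
                        char path[4096];
                        long i;

                        for (i = 0; i < nfiles; i++) {
                                int fd;

                                snprintf(path, sizeof(path), "%s/file.%ld", dir, i);
                                fd = open(path, O_CREAT | O_WRONLY, 0644);
                                if (fd < 0) {
                                        fprintf(stderr, "create %s: %s\n",
                                                path, strerror(errno));
                                        return 1;
                                }
                                close(fd);
                        }

                        for (i = 0; i < nfiles; i++) {
                                snprintf(path, sizeof(path), "%s/file.%ld", dir, i);
                                if (unlink(path) < 0)
                                        fprintf(stderr, "unlink %s: %s\n",
                                                path, strerror(errno));
                        }
                        return 0;
                }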


            di.wang Di Wang (Inactive) added a comment:

            Could you please tell me how to reproduce the problem? Still use mdtest with a single thread on 1 node? Thanks.


            People

              Assignee: di.wang Di Wang (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 11
