  Lustre / LU-6984

Failure to delete over a million files in a DNE2 directory.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.8.0
    • Fix Version/s: Lustre 2.8.0
    • Environment: pre-2.8 clients with DNE2 directories which contain 1 million or more files.
    • Severity: 3
    • 9223372036854775807

    Description

      In my testing of DNE2 I'm seeing problems when creating 1 million+ files per directory. After clearing out the debug logs I see the problem is only on the client side. When running an application I see:

      command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
      Path: /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
      FS: 21.8 TiB Used FS: 0.2% Inodes: 58.7 Mi Used Inodes: 4.6%

      10 tasks, 1000000 files/directories
      aprun: Apid 3172: Caught signal Window changed, sending to application
      08/03/2015 10:34:45: Process 0(nid00028): FAILED in create_remove_directory_tree, Unable to remove directory: No such file or directory
      Rank 0 [Mon Aug 3 10:34:45 2015] [c0-0c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      _pmiu_daemon(SIGCHLD): [NID 00028] [c0-0c0s1n2] [Mon Aug 3 10:34:45 2015] PE RANK 0 exit signal Aborted
      aprun: Apid 3172: Caught signal Interrupt, sending to application
      _pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Mon Aug 3 10:50:50 2015] PE RANK 7 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00018] [c0-0c0s6n2] [Mon Aug 3 10:50:50 2015] PE RANK 9 exit signal Interrupt
      _pmiu_daemon(SIGCHLD): [NID 00013] [c0-0c0s6n1] [Mon Aug 3 10:50:50 2015] PE RANK 8 exit signal Interrupt

      After the test failed, any attempt to remove the files created by the test fails. When I attempt to remove the files I see the following errors in dmesg.

      LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) Skipped 7 previous similar messages
      LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2

      Attachments

        1. lctldump.20150813
          0.2 kB
        2. LU-6381.log
          0.2 kB
        3. LU-6984-backtrace.log
          83 kB
        4. lu-6984-Sept-18-2015.tgz
          0.2 kB

        Issue Links

          Activity

            [LU-6984] Failure to delete over a million files in a DNE2 directory.
            di.wang Di Wang added a comment -

            James, thanks for the update, very useful information. Do you happen to know who calls ll_prep_inode()? Thanks!


            simmonsja James A Simmons added a comment -

            The ll_prep_inode errors came back, so I backtracked to see what was causing them. From my cheesy debugging I found this:

            [ 251.338486] Lustre: looks like a bad stripe
            [ 251.342779] Lustre: lmv_revalidate_slaves failed
            [ 251.347499] Lustre: md_merge_attr failed to validate the lsm
            [ 251.353263] Lustre: ll_update_lsm_md failed when called from ll_update_inode
            [ 251.360412] Lustre: ll_update_inode return -2
            [ 251.364877] Lustre: ll_read_inode2 failed
            [ 251.369061] Lustre: Skipped 1 previous similar message
            [ 251.374321] LustreError: 18546:0:(llite_lib.c:2305:ll_prep_inode()) new_inode -fatal: rc -2

            So the problem is in lmv_revalidate_slaves. It's reporting back an ENOENT due to the body->mbo_nlink < 2 condition. The comment there points to a race between close(unlink) and getattr.
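
            For reference, a rough sketch of the kind of check being described (a paraphrase, not the exact Lustre source; the surrounding function body and error label are assumptions):

            /* Sketch of the condition described above, somewhere inside
             * lmv_revalidate_slaves(): a directory stripe normally has at
             * least two links ("." plus the entry in its parent), so a lower
             * link count is treated as an already-unlinked stripe and the
             * revalidation fails with -ENOENT, which then bubbles up through
             * md_merge_attr() and ll_update_inode() to ll_prep_inode(). */
            if (body->mbo_nlink < 2) {
                    /* likely a race between close(unlink) and getattr */
                    rc = -ENOENT;
                    goto out;
            }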


            simmonsja James A Simmons added a comment -

            Good news: I can duplicate this problem on a non-Cray node. I'm debugging it and will let you know what I find.

            di.wang Di Wang added a comment -

            Hmm, it seems there are two places in mdtest.c which can trigger the problem.

            #else
                  if (rmdir(dir) == -1) {
                    FAIL("Unable to remove directory");
                  }
            #endif
            

            and

                    if (rmdir(temp_path) == -1) {
                      FAIL("Unable to remove directory");
                    }
            

            So could you please add some debug information here to print out dir or temp_path when these rmdir failures happen?
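
            One minimal way to do that on the userspace side, assuming mdtest's existing FAIL() macro and standard libc headers (the message format is only a suggestion):

            /* at the top of mdtest.c, if not already present */
            #include <errno.h>
            #include <stdio.h>
            #include <string.h>

            /* ... and at the failing call site */
                    if (rmdir(temp_path) == -1) {
                      /* log the failing path and errno before aborting, so
                       * the offending directory can be inspected afterwards */
                      fprintf(stderr, "rmdir(%s) failed: %s\n",
                              temp_path, strerror(errno));
                      FAIL("Unable to remove directory");
                    }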

            If rmdir() fails on the kernel side, that is probably for one of two reasons.

            1. Path resolution failed for "dir" or "temp_path". I am not sure there is a good way to printk the error message without changing the kernel, but let's see what these directories are when the failure happens.

            2. rmdir fails in ll_rmdir(); for this failure, you can simply add a patch like this:

            --- a/lustre/llite/namei.c
            +++ b/lustre/llite/namei.c
            @@ -1198,7 +1198,10 @@ static int ll_rmdir(struct inode *dir, struct dentry *dchild)
                     if (rc == 0) {
                             ll_update_times(request, dir);
                             ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_RMDIR, 1);
            -        }
            +        } else {
            +               CERROR("unlink fails :name=%.*s, dir="DFID"(%p) rc: %d\n",
            +                       name->len, name->name, PFID(ll_inode2fid(dir)), dir, rc);
            +       }
             
                     ptlrpc_req_finished(request);
                     RETURN(rc);
            

            TBH, I suspect the failure is caused by (1). Sigh, I cannot reproduce it locally.


            simmonsja James A Simmons added a comment -

            With the latest master and a few patches this problem still exists. The "ll_prep_inode" errors are gone now. I tried to debug it with lctl debug, but turning on debugging hides the race condition. Do you know what areas I should add old-fashioned printks to in order to see what is causing the failure?
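
            For what it's worth, the console messages quoted earlier in this ticket point at the error returns along the ll_prep_inode() path; a hypothetical example of such a printk (the call site and the ll_update_inode() arguments are assumptions, not a patch against a specific tree):

            /* log every error return from ll_update_inode() along the
             * ll_read_inode2() path so the failing FID shows up on the
             * console; CERROR/DFID/PFID are the same macros used in the
             * ll_rmdir() debugging patch above */
            rc = ll_update_inode(inode, &md);
            if (rc != 0)
                    CERROR("ll_update_inode "DFID" failed: rc = %d\n",
                           PFID(ll_inode2fid(inode)), rc);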

            di.wang Di Wang added a comment -

            Just an update: I tried this several times with build 34198, and I cannot reproduce the problem. My cluster has 4 MDSs x 4 MDTs (16 MDTs) and 2 OSSs x 4 OSTs (8 OSTs).

            James: could you please retry this on current master and see if the failure is still there?


            simmonsja James A Simmons added a comment -

            It really does take that long.

            di.wang Di Wang added a comment -

            I tried this on a cluster with 4 MDSs x 4 MDTs (16 MDTs), 2 OSS x 4 OSTs (8 OSTs).

            [root@c01 lustre]# lfs setdirstripe -c 4 -i 5 test1
            [root@c01 lustre]# lfs setdirstripe -c4 -i 5 -D test1
            [root@c01 lustre]# lfs setstripe -c 8 test1
            [root@c01 lustre]# ls
            test1
            [root@c01 lustre]# mpirun -n 1 mdtest -I 1000000 -i 5 -d /mnt/lustre/test1/shared_1000k
            -- started at 08/26/2015 14:56:55 --
            
            mdtest-1.8.3 was launched with 1 total task(s) on 1 nodes
            Command line used: mdtest -I 1000000 -i 5 -d /mnt/lustre/test1/shared_1000k
            Path: /mnt/lustre/test1
            FS: 3.9 TiB   Used FS: 0.1%   Inodes: 58.5 Mi   Used Inodes: 0.0%
            
            1 tasks, 1000000 files/directories
            

            It has been more than 3 hours and it is still running. James: how long did it take in your run? Did I do something wrong here?


            simmonsja James A Simmons added a comment -

            I'm not seeing those errors.

            di.wang Di Wang added a comment -

            I am not sure I can find an environment to do such a test, but I can try. Thanks.

            di.wang Di Wang added a comment -

            James: unfortunately, there is still not much useful information in the lctldump log. Hmm, I checked the code again to see what can cause

            LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
            
            int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
                              struct super_block *sb, struct lookup_intent *it)
            {
            ............
                    *inode = ll_iget(sb, cl_fid_build_ino(&md.body->mbo_fid1,
                                            sbi->ll_flags & LL_SBI_32BIT_API),
                                     &md);
                    if (IS_ERR(*inode)) {
            #ifdef CONFIG_FS_POSIX_ACL
                            if (md.posix_acl) {
                                    posix_acl_release(md.posix_acl);
                                    md.posix_acl = NULL;
                            }
            #endif
                            rc = IS_ERR(*inode) ? PTR_ERR(*inode) : -ENOMEM;
                            *inode = NULL;
                            CERROR("new_inode -fatal: rc %d\n", rc);
                            GOTO(out, rc);
                    }
            }
            

            Then in ll_iget() there are only 3 places which can cause this ENOENT error.

            1. ll_iget() --> ll_read_inode2() --> ll_update_inode() --> ll_update_lsm_md() --> ll_init_lsm_md() --> ll_iget_anon_dir(), and you should see something like "failed get simple inode ...." on the client console. Did you see that on the client console after the failure happened? Hmm, this would be caused by a failure in iget_locked(), which seems unlikely.
            2. ll_iget() --> cl_file_inode_init(), and then you should see something like "Failure to initialize cl object ...." on the client console. Did you see that?
            3. ll_iget() --> ll_update_inode() --> ....., then the same as 1.

            So could you please post the client console log (dmesg output) from when these failures happen? Thanks.


            People

              di.wang Di Wang
              simmonsja James A Simmons
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: