[LU-6984] Failure to delete over a million files in a DNE2 directory. Created: 11/Aug/15 Updated: 20/Feb/17 Resolved: 03/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne2 | ||
| Environment: |
pre-2.8 clients with DNE2 directories which contain 1 million or more files. |
||
| Attachments: |
| Issue Links: |
| Epic/Theme: | dne |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
In my testing of DNE2 I'm seeing problems when creating 1 million+ files per directory. Clearing out the debug logs I see the problem is only on the client side. When running an application I see:

command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
10 tasks, 1000000 files/directories

After the test failed, any attempt to remove the files created by the test fails. When I attempt to remove the files I see the following errors in dmesg:

LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2 |
| Comments |
| Comment by James A Simmons [ 11/Aug/15 ] |
|
I attached to this ticket the client logs from when it failed to delete some of the files in the DNE2 directory. Let me know if the logs are good enough. Please be aware that the deletion of files sometimes works, so it appears to be a race condition. The logs cover both failed and successful attempts to delete files. |
| Comment by Andreas Dilger [ 11/Aug/15 ] |
|
James, how many shards in the striped directory? |
| Comment by James A Simmons [ 11/Aug/15 ] |
|
Each file is striped across 56 OSTs, and the problem exists for any MDS striping greater than one. |
| Comment by Di Wang [ 11/Aug/15 ] |
|
James: I did not see any error cases in the debug log. Will you attach another log? Thanks.

[root@testnode lustre-release_new]# grep -r "ll_prep_inode" /work/LU-6381.log
00000080:00000001:4.0:1438629890.366795:0:12535:0:(llite_lib.c:2252:ll_prep_inode()) Process entered
00000080:00000001:4.0:1438629890.366895:0:12535:0:(llite_lib.c:2319:ll_prep_inode()) Process leaving via out (rc=0 : 0 : 0x0)
00000080:00000001:4.0:1438629890.369895:0:12535:0:(llite_lib.c:2252:ll_prep_inode()) Process entered
00000080:00000001:4.0:1438629890.369977:0:12535:0:(llite_lib.c:2319:ll_prep_inode()) Process leaving via out (rc=0 : 0 : 0x0)
00000080:00000001:4.0:1438629892.079356:0:12597:0:(llite_lib.c:2252:ll_prep_inode()) Process entered
00000080:00000001:4.0:1438629892.079459:0:12597:0:(llite_lib.c:2319:ll_prep_inode()) Process leaving via out (rc=0 : 0 : 0x0)
00000080:00000001:4.0:1438629892.082521:0:12597:0:(llite_lib.c:2252:ll_prep_inode()) Process entered
00000080:00000001:4.0:1438629892.082605:0:12597:0:(llite_lib.c:2319:ll_prep_inode()) Process leaving via out (rc=0 : 0 : 0x0) |
| Comment by James A Simmons [ 11/Aug/15 ] |
|
Sometimes it doesn't print that message when it fails. Do you see anything useful in the log? |
| Comment by Di Wang [ 11/Aug/15 ] |
|
There are only successful cases in the log, so it is not very helpful. |
| Comment by Di Wang [ 11/Aug/15 ] |
|
Could you please tell me how to reproduce the problem? |
| Comment by Di Wang [ 11/Aug/15 ] |
|
"After the test failed any attempt to remove the files create by these test fail. When I attempt to remove the files I see the following errors in dmesg."

James: Could you please collect the client debug log (-1) when you try to remove these files after the failure? Thanks. |
| Comment by James A Simmons [ 13/Aug/15 ] |
|
WangDi, here is another client dump. When enabling full debug the file deletion race becomes less common. I do see this in the log:

00000080:00002000:1.0:1439485513.362411:0:32074:0:(dcache.c:181:ll_ddelete()) keeping dentry mdtest_tree.0 (ffff8803ed204e0

As for reproducing this, you just need one client node. On the client node create a directory with the following:

lfs setdirstripe -c 4 -i 5 dne2_4_mds_md_test

So you end up with a directory striped across 56 OSTs and 4 MDTs starting at MDT 5. This is with one MDT per MDS server. The 56 OSTs are split evenly across 4 OSS servers. In total I have 16 MDS servers, each with one MDT.

mpirun -n 1 mdtest -I 1000000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/

In this case it is running just one thread on one node. I split the file count for -I by the total number of client threads so the total is always about 1 million files. |
| Comment by James A Simmons [ 13/Aug/15 ] |
|
A recent test with striping across 16 MDS servers shows this problem with mdtest in every case, so this might be the best configuration for you to reproduce this error. It fails with even 10k files per directory. |
| Comment by Di Wang [ 13/Aug/15 ] |
|
James: unfortunately, there is still not much useful information in the lctl dump log. Hmm, I checked the code again to see what can cause

LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2

int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
		  struct super_block *sb, struct lookup_intent *it)
{
	............
	*inode = ll_iget(sb, cl_fid_build_ino(&md.body->mbo_fid1,
			 sbi->ll_flags & LL_SBI_32BIT_API),
			 &md);
	if (IS_ERR(*inode)) {
#ifdef CONFIG_FS_POSIX_ACL
		if (md.posix_acl) {
			posix_acl_release(md.posix_acl);
			md.posix_acl = NULL;
		}
#endif
		rc = IS_ERR(*inode) ? PTR_ERR(*inode) : -ENOMEM;
		*inode = NULL;
		CERROR("new_inode -fatal: rc %d\n", rc);
		GOTO(out, rc);
	}
}

Then in ll_iget(), there are only 3 places which can cause this ENOENT error.

1. ll_iget() --

So could you please post the client console log (dmesg output) when these failures happen? Thanks. |
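For context, the "rc -2" in the console message is -ENOENT travelling through the kernel's error-pointer convention: ll_iget() hands back an ERR_PTR(), and the IS_ERR()/PTR_ERR() pair above turns it into the negative errno that CERROR() prints. Below is a minimal, self-contained userspace sketch of that convention; ERR_PTR/IS_ERR/PTR_ERR here are simplified stand-ins for the real macros in <linux/err.h>, and fake_iget() is a made-up placeholder, not the actual ll_iget().

/* Minimal userspace sketch of the kernel "error pointer" convention used by
 * ll_iget()/ll_prep_inode() above. ERR_PTR/IS_ERR/PTR_ERR are simplified
 * stand-ins for the real macros in <linux/err.h>; the point is only to show
 * how a lookup failure becomes the "new_inode -fatal: rc -2" (-ENOENT) line. */
#include <errno.h>
#include <stdio.h>
#include <stdint.h>

#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)      { return (void *)error; }
static inline long  PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int   IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lookup: pretend the inode was not found, as in the bad-stripe case. */
static void *fake_iget(void)
{
	return ERR_PTR(-ENOENT);	/* -2 */
}

int main(void)
{
	void *inode = fake_iget();
	int rc = 0;

	if (IS_ERR(inode)) {
		rc = PTR_ERR(inode);	/* becomes -2 */
		inode = NULL;
		fprintf(stderr, "new_inode -fatal: rc %d\n", rc);
	}
	return rc ? 1 : 0;
}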
| Comment by Di Wang [ 14/Aug/15 ] |
|
I am not sure I can find an env to do such test, but I can try. Thanks. |
| Comment by James A Simmons [ 15/Aug/15 ] |
|
I'm not seeing those error |
| Comment by Di Wang [ 27/Aug/15 ] |
|
I tried this on a cluster with 4 MDSs x 4 MDTs (16 MDTs) and 2 OSSs x 4 OSTs (8 OSTs).

[root@c01 lustre]# lfs setdirstripe -c 4 -i 5 test1
[root@c01 lustre]# lfs setdirstripe -c4 -i 5 -D test1
[root@c01 lustre]# lfs setstripe -c 8 test1
[root@c01 lustre]# ls test1
[root@c01 lustre]# mpirun -n 1 mdtest -I 1000000 -i 5 -d /mnt/lustre/test1/shared_1000k
-- started at 08/26/2015 14:56:55 --
mdtest-1.8.3 was launched with 1 total task(s) on 1 nodes
Command line used: mdtest -I 1000000 -i 5 -d /mnt/lustre/test1/shared_1000k
Path: /mnt/lustre/test1
FS: 3.9 TiB   Used FS: 0.1%   Inodes: 58.5 Mi   Used Inodes: 0.0%
1 tasks, 1000000 files/directories

It has been more than 3 hours and it is still running. James: how long did it take in your run? Did I do something wrong here? |
| Comment by James A Simmons [ 27/Aug/15 ] |
|
It really does take that long |
| Comment by Di Wang [ 29/Aug/15 ] |
|
Just an update: I tried this several times with build 34198, and I cannot reproduce the problem. My cluster has 4 MDSs x 4 MDTs (16 MDTs) and 2 OSSs x 4 OSTs (8 OSTs). James: could you please retry this on current master to see if this failure is still there? |
| Comment by James A Simmons [ 10/Sep/15 ] |
|
With the latest master and a few patches this problem still exists. The "ll_prep_inode" errors are gone now. I tried to debug it with lctl debug, but turning on debugging hides the race condition. Do you know which areas I should add old-fashioned printks to in order to see what is causing the failure? |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Hmm, it seems there might be two places in mdtest.c which can trigger the problem:

#else
	if (rmdir(dir) == -1) {
		FAIL("Unable to remove directory");
	}
#endif

and

	if (rmdir(temp_path) == -1) {
		FAIL("Unable to remove directory");
	}

So could you please add some debug information here to print out the dir or temp_path when these rmdir failures happen (see the sketch after this comment). If rmdir() fails, on the kernel side that is probably caused by one of two reasons:

1. Path resolution failed for "dir" or "temp_path". I am not sure there is a good way to printk the error message without changing the kernel, but let's see what these directories are when the failure happens.

2. rmdir fails in ll_rmdir(). For this failure, you can simply add a patch like this:

--- a/lustre/llite/namei.c
+++ b/lustre/llite/namei.c
@@ -1198,7 +1198,10 @@ static int ll_rmdir(struct inode *dir, struct dentry *dchild)
 	if (rc == 0) {
 		ll_update_times(request, dir);
 		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_RMDIR, 1);
-	}
+	} else {
+		CERROR("unlink fails :name=%.*s, dir="DFID"(%p) rc: %d\n",
+		       name->len, name->name, PFID(ll_inode2fid(dir)), dir, rc);
+	}
 	ptlrpc_req_finished(request);
 	RETURN(rc);

TBH, I suspect the failure is caused by 1. Sigh, I cannot reproduce it locally. |
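A minimal sketch of the userspace debug print being requested above: wrap rmdir() so that the failing path and errno are reported before FAIL() aborts. The dir/temp_path names come from the mdtest snippets quoted in the comment; rmdir_verbose() itself is a hypothetical helper for illustration, not part of mdtest.

/* Illustrative only: a small helper one could drop into mdtest.c so that a
 * failed rmdir() reports which path failed and why, before FAIL() aborts.
 * The dir/temp_path names come from the mdtest snippets quoted above; the
 * helper itself is hypothetical, not part of mdtest. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int rmdir_verbose(const char *path)
{
	if (rmdir(path) == -1) {
		fprintf(stderr, "rmdir(\"%s\") failed: %s (errno %d)\n",
			path, strerror(errno), errno);
		return -1;
	}
	return 0;
}

/* In mdtest.c the call sites would then become, e.g.:
 *
 *	if (rmdir_verbose(dir) == -1) {
 *		FAIL("Unable to remove directory");
 *	}
 */
int main(int argc, char **argv)
{
	/* Standalone demo: try to remove each path given on the command line. */
	for (int i = 1; i < argc; i++)
		rmdir_verbose(argv[i]);
	return 0;
}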
| Comment by James A Simmons [ 16/Sep/15 ] |
|
Good news: I can duplicate this problem on a non-Cray node. I'm debugging it and will let you know what I find. |
| Comment by James A Simmons [ 16/Sep/15 ] |
|
The ll_prep_inode errors came back, so I backtracked to see what was causing the error. From my cheesy debugging I found this:

[ 251.338486] Lustre: looks like a bad stripe

So the problem is from lmv_revalidate_slaves. It's reporting back ENOENT due to the body->mbo_nlink < 2 condition. The comment points to a race between close(unlink) and getattr. |
| Comment by Di Wang [ 16/Sep/15 ] |
|
James, thanks for the update; very useful information. Do you happen to know who calls ll_prep_inode()? Thanks! |
| Comment by James A Simmons [ 16/Sep/15 ] |
|
Better yet, I did a dump_stack() where the code fails, so you have complete backtraces. I attached the file to this ticket. |
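The debugging pattern here is "print a marker, then dump the call stack at the suspected failure point". In the kernel that is a printk()/CERROR() plus dump_stack(); the standalone userspace analogue below uses glibc's backtrace() instead, and report_bad_stripe()/revalidate_slave() are made-up stand-ins for the real Lustre functions.

/* Userspace analogue of the "print a marker + dump_stack()" debugging James
 * describes: at the suspected failure point, emit a message and a backtrace.
 * In the kernel the equivalent calls are CERROR()/printk() plus dump_stack();
 * this standalone demo uses glibc's backtrace() instead. */
#include <execinfo.h>
#include <stdio.h>

static void report_bad_stripe(int nlink)
{
	void *frames[32];
	int n;

	fprintf(stderr, "looks like a bad stripe (nlink=%d)\n", nlink);
	n = backtrace(frames, 32);
	backtrace_symbols_fd(frames, n, 2);	/* write the backtrace to stderr */
}

static void revalidate_slave(int nlink)
{
	/* Stand-in for the lmv_revalidate_slaves() nlink check discussed below. */
	if (nlink < 2)
		report_bad_stripe(nlink);
}

int main(void)
{
	revalidate_slave(1);	/* trigger the diagnostic path */
	return 0;
}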
| Comment by Di Wang [ 16/Sep/15 ] |
|
Could you please tell me how to reproduce the problem? Still using mdtest with a single thread on one node? Thanks. |
| Comment by James A Simmons [ 16/Sep/15 ] |
|
It doesn't matter how many client nodes; I use 400, but use whatever you want. What matters is the number of files per directory. Remember this is with remote_dir=-1 and remote_dir_gid=-1. Try using 8 MDS servers, but any number greater than 1 will do:

lfs setdirstripe -c 8 /lustre/whatever/jsimmons/dne2_8_mds_md_test

When mdtest goes to delete the files, mdtest will fail. At least it does for me. |
| Comment by Di Wang [ 17/Sep/15 ] |
|
Hmm, during slave revalidation, it seems the striped directory has been locked with both LOOKUP and UPDATE locks. I do not understand why the master stripe nlink turns to 1 at that time. James: could you please collect the debug log when the failure happens? (-1) would be best, but if there is a race, just collect the default one please. Thanks! |
| Comment by James A Simmons [ 17/Sep/15 ] |
|
On the MDS or the client? |
| Comment by Di Wang [ 17/Sep/15 ] |
|
Both would be best. If not, then only the client would be OK. Thanks. |
| Comment by Di Wang [ 18/Sep/15 ] |
|
OK, I tried to reproduce it on the OpenSFS cluster with 8 MDTs (4 MDSs), 4 OSTs (2 OSSs), and 9 clients. I just started the test; it has been an hour and I still cannot see this problem. I will check tomorrow morning to see how it goes. James: could you please tell me all of your patches (based on master)? Thanks. |
| Comment by James A Simmons [ 18/Sep/15 ] |
|
I did you one better. Grab my source rpm at http://www.infradead.org/~jsimmons/lustre-2.7.59-1_g703195a.src.rpm |
| Comment by Di Wang [ 18/Sep/15 ] |
|
James: thanks. And usually how soon did you hit the failure? After a few minutes? A few hours after starting the test? |
| Comment by James A Simmons [ 18/Sep/15 ] |
|
The first run of mdtest takes a while before it fails. Once it fails, you can duplicate the failure with rm -rf on the leftover files from mdtest. I attached the logs for my latest test from the client node and all the MDS servers I have. |
| Comment by Di Wang [ 19/Sep/15 ] |
|
Ah, it seems this check in lmv_revalidate_slaves is not correct:

	if (unlikely(body->mbo_nlink < 2)) {
		/* If this is bad stripe, most likely due
		 * to the race between close(unlink) and
		 * getattr, let's return -EONENT, so llite
		 * will revalidate the dentry see
		 * ll_inode_revalidate_fini() */
		CDEBUG(D_INODE, "%s: nlink %d < 2 bad stripe %d"
		       DFID ":" DFID"\n",
		       obd->obd_name, body->mbo_nlink, i,
		       PFID(&lsm->lsm_md_oinfo[i].lmo_fid),
		       PFID(&lsm->lsm_md_oinfo[0].lmo_fid));

		if (it.d.lustre.it_lock_mode && lockh) {
			ldlm_lock_decref_and_cancel(lockh,
					it.d.lustre.it_lock_mode);
			it.d.lustre.it_lock_mode = 0;
		}

		GOTO(cleanup, rc = -ENOENT);
	}

Because:

	/*
	 * The DIR_NLINK feature allows directories to exceed LDISKFS_LINK_MAX
	 * (65000) subdirectories by storing "1" in i_nlink if the link count
	 * would otherwise overflow. Directory tranversal tools understand
	 * that (st_nlink == 1) indicates that the filesystem dose not track
	 * hard links count on the directory, and will not abort subdirectory
	 * scanning early once (st_nlink - 2) subdirs have been found.
	 *
	 * This also has to properly handle the case of inodes with nlink == 0
	 * in case they are being linked into the PENDING directory
	 */

I will remove this. |
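For illustration of the reasoning: under DIR_NLINK, st_nlink == 1 is a legal value for a large directory, so at most nlink == 0 could be treated as a suspicious stripe. The sketch below is plain C, not the lmv code, and not necessarily what patch 16490 does, since the check is simply being removed.

/* Sketch of why "nlink < 2" is the wrong test for a bad directory stripe.
 * Under DIR_NLINK, a directory whose subdirectory count would overflow the
 * on-disk limit reports st_nlink == 1, so 1 is a perfectly valid value.
 * Plain C for illustration; not the Lustre lmv code, and the actual fix in
 * patch 16490 reportedly just drops the check entirely. */
#include <stdbool.h>
#include <stdio.h>

/* Old check: rejects both nlink == 0 and the legal DIR_NLINK value 1. */
static bool stripe_looks_bad_old(unsigned int nlink)
{
	return nlink < 2;
}

/* Narrower alternative: only an unlinked stripe (nlink == 0) is suspicious. */
static bool stripe_looks_bad_narrow(unsigned int nlink)
{
	return nlink == 0;
}

int main(void)
{
	unsigned int samples[] = { 0, 1, 2, 65000 };

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("nlink=%-5u old=%d narrow=%d\n", samples[i],
		       stripe_looks_bad_old(samples[i]),
		       stripe_looks_bad_narrow(samples[i]));
	return 0;
}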
| Comment by Gerrit Updater [ 19/Sep/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16490 |
| Comment by James A Simmons [ 20/Sep/15 ] |
|
So far your fix seems to have resolved this issue. Tomorrow I will run mdtest to see if everything works. |
| Comment by Gerrit Updater [ 02/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16490/ |
| Comment by James A Simmons [ 03/Oct/15 ] |
|
I have tested this patch extensively and it has resolved this issue. Now that it has landed we can close this ticket. |
| Comment by Peter Jones [ 03/Oct/15 ] |
|
Right you are James! |