[LU-13400] sanity test_300d: createmany 10 under striped dir fails with Permission denied Created: 30/Mar/20 Updated: 28/Sep/20 Resolved: 27/Sep/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Olaf Faaland <faaland1@llnl.gov>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/150c0a5d-19ca-4d0e-bbd4-5d98ae175f9a

test_300d failed with the following error: create 10 files failed

== sanity test 300d: check default stripe under striped directory ==================================== 12:17:43 (1585595863)
open(/mnt/diane/d300d.sanity/striped_dir/f1) error: Permission denied
total: 1 open/close in 0.18 seconds: 5.62 ops/second
sanity test_300d: @@@@@@ FAIL: create 10 files failed
Trace dump:
= /home/olaf/lustre/lustre/tests/test-framework.sh:6138:error()
= /home/olaf/lustre/lustre/tests/sanity.sh:19623:test_300d()
= /home/olaf/lustre/lustre/tests/test-framework.sh:6441:run_one()
= /home/olaf/lustre/lustre/tests/test-framework.sh:6490:run_one_logged()
= /home/olaf/lustre/lustre/tests/test-framework.sh:6315:run_test()
= /home/olaf/lustre/lustre/tests/sanity.sh:19639:main()

I can locally reproduce reliably with:

I've seen the issue on average about 15% of runs with zfs-0.8 and less often with zfs-0.7.11. I've reproduced the issue with lustre 2.12.4, 2.13.0, master, and master-next. There are no warnings or errors in dmesg emitted while the test is running (on any node). |
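For context, a minimal sketch of the step that fails, reconstructed from the open/close output above; the exact createmany arguments used by test_300d live in sanity.sh and may differ from this guess:

# createmany is the Lustre test utility used here; -o opens and closes each file.
# The path is taken from the error output above.
createmany -o /mnt/diane/d300d.sanity/striped_dir/f 10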
| Comments |
| Comment by Olaf Faaland [ 30/Mar/20 ] |
|
I've replaced the createmany with a loop that does mkdirs, and find that the mkdirs fail for only one of the two MDTs:

mkdir: cannot create directory '/mnt/diane/d300d.sanity/striped_dir/f.1': Permission denied
mkdir /mnt/diane/d300d.sanity/striped_dir/f.1 failed
mkdir: cannot create directory '/mnt/diane/d300d.sanity/striped_dir/f.3': Permission denied
mkdir /mnt/diane/d300d.sanity/striped_dir/f.3 failed
mkdir: cannot create directory '/mnt/diane/d300d.sanity/striped_dir/f.5': Permission denied
mkdir /mnt/diane/d300d.sanity/striped_dir/f.5 failed
mkdir: cannot create directory '/mnt/diane/d300d.sanity/striped_dir/f.7': Permission denied
mkdir /mnt/diane/d300d.sanity/striped_dir/f.7 failed
mkdir: cannot create directory '/mnt/diane/d300d.sanity/striped_dir/f.9': Permission denied
mkdir /mnt/diane/d300d.sanity/striped_dir/f.9 failed
mkdir: cannot create directory '/mnt/diane/d300d.sanity/striped_dir/f.10': Permission denied
mkdir /mnt/diane/d300d.sanity/striped_dir/f.10 failed

I took a look and the subdirs that were created successfully were on MDT0, so the ones that failed were on MDT1. |
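A minimal sketch of the kind of replacement loop described above; the directory path is taken from the error output, but the exact edit to sanity.sh and the name pattern are assumptions:

# Hypothetical stand-in for the createmany call in test_300d: create the
# entries one at a time so each failure is reported individually.
DIR=/mnt/diane/d300d.sanity/striped_dir
for i in $(seq 1 10); do
    mkdir $DIR/f.$i || echo "mkdir $DIR/f.$i failed"
done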
| Comment by Olaf Faaland [ 31/Mar/20 ] |
|
I'm attempting to compare debug logs for a subdirectory created successfully (f.2) and one which failed (f.1). In both cases, I see lmv_create() correctly identifies the MDT, a metadata modify RPC slot is assigned, an XID is assigned, and the RPC is sent successfully. I believe all that is on the client, but if I'm wrong please let me know.

lmv_create()) CREATE name '<dirname>' [<FID>] on [<SHARDFID>] -> mds #<MDT_INDEX>
obd_get_mod_rpc_slot()) <MDT_EXPORT>: modify RPC slot <SLOT_INDEX> is alloca
ptlrpc_reassign_next_xid()) @@@ reassign xid req@<reqid> x<xid>
ptlrpc_send_new_req()) Sending RPC req@<reqid> and pname:cluuid:pid:xid:nid:
ptlrpc_send_new_req()) Process leaving (rc=0 : 0 : 0)

For the successful mkdir, I then see the request is enqueued (which as I understand it occurs on the MDT):

ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-diane-MDT0000_UUID lock: ffff972604887180/0x56189f42e7cba373 lrc: 3/0,1 mode: CW/CW res:

But for the failed mkdir I do not see that. For the failed mkdir, ptlrpc_check_status() indicates -EACCES, as one might expect:

(client.c:1343:ptlrpc_check_status()) @@@ check status: rc = -13 req@ffff9728e9f40480 x1662629166884800/t0(0) o36->diane-MDT0001-mdc-ffff9725fdff3800@10.0.2.15@tcp:12/10 lens 568/448 e 0 to 0 dl 1585607163 ref 2 fl Rpc:RQU/0/0 rc 0/-13 job:'mkdir.0' |
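For reference, a minimal sketch of one way to capture a client-side Lustre debug log around a single operation; the debug mask, mount path, and output file name here are assumptions, not taken from this ticket:

# Enable verbose debugging, clear the kernel debug buffer, run one operation,
# then dump the buffer for inspection.
lctl set_param debug=-1                  # assumption: full debug; a narrower mask also works
lctl clear
mkdir /mnt/diane/d300d.sanity/striped_dir/f.1
lctl dk > /tmp/mkdir-f.1.debug           # "dk" = debug_kernel; hypothetical file name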
| Comment by Olaf Faaland [ 31/Mar/20 ] |
|
I added the topllnl label. This is not something we are seeing in production (we are currently just testing DNE2) but I would like to track it down or at least help narrow the scope. This test seems to fail fairly frequently. |
| Comment by Olaf Faaland [ 31/Mar/20 ] |
|
Also, in the debug logs for the successful mkdir I see lines from mdt_handler.c, mdt_reint.c, and mdd_permission.c (e.g. mdd_create() and mdt_object_new()).
I made some minor tweaks to the test (e.g. added lctl mark after each mkdir), and in the new logs I do see mdt_object_new(), mdd_create(), and __mdd_permission_internal() calls, the last of which returns -13 as one would expect. I'm not sure whether I overlooked those in the original log, or whether the failure mode was different. |
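The lctl mark trick mentioned above writes a marker line into the Lustre kernel debug log, which makes it easier to line the log up with the test steps. A minimal sketch, with a hypothetical marker string:

for i in $(seq 1 10); do
    lctl mark "== about to mkdir f.$i =="    # marker appears in the debug log output
    mkdir /mnt/diane/d300d.sanity/striped_dir/f.$i
done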
| Comment by Peter Jones [ 31/Mar/20 ] |
|
Lai, could you please advise here? Thanks, Peter |
| Comment by Olaf Faaland [ 01/Apr/20 ] |
|
The -13 is coming from here, in __mdd_permission_internal():

/*
 * Nobody gets write access to an immutable file.
 */
if (mask & MAY_WRITE && la->la_flags & LUSTRE_IMMUTABLE_FL)
        RETURN(-EACCES);
|
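LUSTRE_IMMUTABLE_FL corresponds to the standard immutable file attribute, so one hedged way to sanity-check this theory from a client (an assumption on my part, not something done in this ticket, and it may only reflect the flag on one of the shards) would be:

# lsattr -d shows the attributes of the directory itself; an 'i' in the
# attribute string means the immutable bit is set. chattr -i clears it (root only).
lsattr -d /mnt/diane/d300d.sanity/striped_dir
chattr -i /mnt/diane/d300d.sanity/striped_dir    # hypothetical remediation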
| Comment by Lai Siyao [ 01/Apr/20 ] |
|
Thanks Olaf, this is quite helpful, I'm reviewing the code now. |
| Comment by Lai Siyao [ 01/Apr/20 ] |
|
|
| Comment by Olaf Faaland [ 01/Apr/20 ] |
|
No, not really. I have started looking into whether one of the other tests was not cleaning up properly, but I think bad cleanup does not explain it. In my testing, the subdirs on one MDT succeed and the subdirs on the other fail, so apparently one shard has the immutable bit set and the other does not. |
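One hedged way to confirm which MDT each subdirectory landed on (and therefore which shard is misbehaving) is lfs; option spellings vary a little between Lustre releases, so treat this as a sketch:

# Show the stripe layout of the striped parent, then the MDT index of each child.
lfs getdirstripe /mnt/diane/d300d.sanity/striped_dir
for d in /mnt/diane/d300d.sanity/striped_dir/f.*; do
    echo -n "$d: "
    lfs getstripe -m "$d"    # --mdt-index: prints the MDT the entry's inode lives on
done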
| Comment by Olaf Faaland [ 01/Apr/20 ] |
|
I did notice one other thing. I altered sanity.sh to replace "run_test 300d ..." with this:

for xx in $(seq 1 30); do
    run_test 300d "check default stripe under striped directory"
done

I found that every time I ran auster --only 300, either 300d failed every time (all xx in 1,2,3,...,30) or it never failed at all. I believe that means the problem is actually being triggered in an earlier subtest. I'll see what more I can learn. |
| Comment by Arshad Hussain [ 21/Apr/20 ] |
|
Seen on a master run: https://testing.whamcloud.com/test_sets/50b3635a-8810-4f18-b3fe-da0bf0eac4fe |
| Comment by Andreas Dilger [ 15/May/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/75fa672b-9224-4efc-8223-fa6663f6b220

Side note for Olaf - if submitting patches for debugging/fixing intermittent issues, it is possible to use:

Test-Parameters: testlist=sanity env=ONLY=300d,ONLY_REPEAT=N

to have the specific test run N times in a loop. The test is considered a failure if any of the iterations fails. You can specify a range of subtests like "ONLY=290-300", but "ONLY_REPEAT=N" will run each subtest N times in a row before moving to the next subtest. It looks like auster also has a command-line option "-i N" to rerun a subtest N times without having to edit the test script. |
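A hedged example of the auster form mentioned above (N=30 is just an illustration, and option placement may differ between branches):

# Repeat sanity test_300d 30 times locally without editing sanity.sh.
auster -i 30 sanity --only 300d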
| Comment by Lai Siyao [ 27/Sep/20 ] |
|
This is a duplicate of |