[LU-5371] Failure on test suite parallel-scale test_simul Created: 18/Jul/14  Updated: 17/Oct/14  Resolved: 21/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None
Environment:

client and server: lustre-b2_6-rc2 RHEL6 ldiskfs


Issue Links:
Related
is related to LU-4921 DNE clients should fall back to "try ... Resolved
Severity: 3
Rank (Obsolete): 14975

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c8156066-0dc7-11e4-af8b-5254006e85c2.

The sub-test test_simul failed with the following error:

simul failed! 1

test log:

22:35:15: FAILED in simul_rmdir: too many operations succeeded (2).


 Comments   
Comment by Jodi Levi (Inactive) [ 21/Jul/14 ]

Di,
Could you please look into and comment on this one?
Thank you!

Comment by Sarah Liu [ 21/Jul/14 ]

Also hit this issue in interop testing between a 2.5.2 server and b2_6-rc2:

https://testing.hpdd.intel.com/test_sets/3c3c2f74-0d82-11e4-b3f5-5254006e85c2

Comment by Di Wang [ 22/Jul/14 ]

Hmm, I checked the logs; the failures seem to start around July 3rd (build 2548).

The simul test tries to rmdir a single directory from multiple clients (multiple threads), and only one thread should succeed. In this run two threads succeeded, which is why the test fails.

04:50:57: FAILED in simul_rmdir: too many operations succeeded (2).
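
For reference, here is a minimal sketch of the kind of check simul performs here (not the actual simul source; the mount path is made up): every rank races to rmdir the same shared directory, the successes are summed across ranks, and the test expects exactly one winner.

/*
 * Minimal sketch (not the actual simul source) of the check that
 * produces "too many operations succeeded": every MPI rank tries to
 * rmdir the same shared directory; exactly one rank should win.
 */
#include <mpi.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dir = "/mnt/lustre/simul_rmdir.0";   /* hypothetical path */
    int rank, ok, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        mkdir(dir, 0755);
    MPI_Barrier(MPI_COMM_WORLD);

    /* All ranks race to remove the same directory. */
    ok = (rmdir(dir) == 0) ? 1 : 0;

    MPI_Allreduce(&ok, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) {
        if (total > 1)
            fprintf(stderr, "FAILED in simul_rmdir: "
                    "too many operations succeeded (%d).\n", total);
        else if (total == 0)
            fprintf(stderr, "FAILED in simul_rmdir: no operation succeeded.\n");
    }

    MPI_Finalize();
    return 0;
}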

I checked the debug logs on both clients and on the MDT.

Client1 (rmdir succeeds)

00000080:00200000:1.0:1405575315.343857:0:29632:0:(namei.c:1137:ll_rmdir_generic()) VFS Op:name=simul_rmdir.0, dir=[0x2000013a8:0x143db:0x0](ffff880060e42138)
00010000:00010000:1.0:1405575315.343863:0:29632:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 2/0,0 mode: PR/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x8400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1405575315.343874:0:29632:0:(ldlm_request.c:1172:ldlm_cancel_pack()) ### packing ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 1/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1405575315.343877:0:29632:0:(ldlm_request.c:1176:ldlm_cancel_pack()) 1 locks packed
00010000:00010000:1.0:1405575315.343877:0:29632:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 0/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
00000100:00100000:1.0:1405575315.343885:0:29632:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc simul:49ab2d2b-0df9-c120-4bb6-1f0147383ab7:29632:1473843428301100:10.2.4.203@tcp:36
00000100:00100000:1.0:1405575315.343909:0:29632:0:(client.c:2146:ptlrpc_set_wait()) set ffff88001f1b0440 going to sleep for 43 seconds
00000100:00100000:1.0:1405575315.345000:0:29632:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc simul:49ab2d2b-0df9-c120-4bb6-1f0147383ab7:29632:1473843428301100:10.2.4.203@tcp:36
00000080:00200000:1.0:1405575315.345007:0:29632:0:(llite_lib.c:1425:ll_clear_inode()) VFS Op:inode=[0x2000013a7:0x1c43f:0x0](ffff8800616f1178)

Client2 (rmdir also succeeds)

00000080:00200000:0.0:1405575315.345956:0:18310:0:(namei.c:1137:ll_rmdir_generic()) VFS Op:name=simul_rmdir.0, dir=[0x2000013a8:0x143db:0x0](ffff88005d0bebb8)
00010000:00010000:0.0:1405575315.345961:0:18310:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 2/0,0 mode: PR/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x8400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1405575315.345970:0:18310:0:(ldlm_request.c:1172:ldlm_cancel_pack()) ### packing ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 1/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1405575315.345974:0:18310:0:(ldlm_request.c:1176:ldlm_cancel_pack()) 1 locks packed
00010000:00010000:0.0:1405575315.345975:0:18310:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 0/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
00000100:00100000:0.0:1405575315.345981:0:18310:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc simul:992018ad-6c7b-3f3b-15cf-42922a66aa9a:18310:1473838828202428:10.2.4.203@tcp:36
00000100:00100000:0.0:1405575315.345989:0:18310:0:(client.c:2146:ptlrpc_set_wait()) set ffff88002c1cf200 going to sleep for 43 seconds
00000100:00100000:0.0:1405575315.347314:0:18310:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc simul:992018ad-6c7b-3f3b-15cf-42922a66aa9a:18310:1473838828202428:10.2.4.203@tcp:36
00000080:00200000:0.0:1405575315.347322:0:18310:0:(llite_lib.c:1425:ll_clear_inode()) VFS Op:inode=[0x2000013a7:0x1c43f:0x0](ffff88003ab0c1b8)

Interestingly, when the MDT handled the request from client2 it already returned ENOENT, but client2 seems to ignore that:

00000100:00100000:1.0:1405575315.353820:0:25427:0:(service.c:2144:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_003:992018ad-6c7b-3f3b-15cf-42922a66aa9a+675:18310:x1473838828202428:12345-10.2.4.205@tcp:36 Request procesed in 988us (1036us total) trans 0 rc -2/-2
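
The "rc -2/-2" at the end of that line is -ENOENT. As a tiny local sanity check (hypothetical path, plain local filesystem) of what the second client's rmdir(2) should have returned if that status had been propagated:

/* Demonstrates that rc -2 is -ENOENT and that the losing rmdir should fail. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *dir = "/tmp/simul_rmdir.0";   /* hypothetical path */
    int rc;

    mkdir(dir, 0755);
    rc = rmdir(dir);
    printf("first  rmdir -> %d\n", rc);                     /* 0 */
    rc = rmdir(dir);
    printf("second rmdir -> %d, errno %d (%s)\n",
           rc, errno, strerror(errno));                     /* -1, errno 2 (ENOENT) */
    printf("ENOENT = %d, so rc -2 in the MDT log is -ENOENT\n", ENOENT);
    return 0;
}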

Sigh, there are not enough debug logs to tell me what happened there; I will keep digging.

Comment by Di Wang [ 22/Jul/14 ]

http://review.whamcloud.com/11170

Sorry, this was introduced by
commit 3ea78dd02a57211ae9b55111323d14cfbc71bc43
Author: Wang Di <di.wang@intel.com>
Date: Wed Jun 25 22:35:52 2014 -0700

LU-4921 lmv: try all stripes for unknown hash functions

For unknown hash type, LMV should try all stripes to locate
the name entry. But it will only for lookup and unlink, i.e.
we can only list and unlink entries under striped dir with
unknown hash type.

Signed-off-by: wang di <di.wang@intel.com>
Change-Id: Ifeed7131c24e48277a6cc8fd4c09b7534e31079f
Reviewed-on: http://review.whamcloud.com/10041
Tested-by: Jenkins
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
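
A simplified, hypothetical illustration of how that fallback could produce the symptom above (this is not the actual lmv code; all identifiers are invented): for an unknown hash type the name is probed on every stripe, and if every stripe answers -ENOENT (because the other client already removed the entry), that -ENOENT has to survive the loop; otherwise the second rmdir appears to succeed.

/*
 * Hypothetical sketch of the "try all stripes" fallback for unknown
 * hash types -- not the actual lmv_* code; identifiers are invented.
 */
#include <errno.h>
#include <stdio.h>

struct stripe {
    int has_entry;  /* toy stand-in for "name exists on this stripe" */
};

/* Toy per-stripe unlink: 0 on success, -ENOENT if the name is not there. */
static int stripe_unlink(struct stripe *s, const char *name)
{
    (void)name;
    if (!s->has_entry)
        return -ENOENT;
    s->has_entry = 0;
    return 0;
}

/* Fallback for an unknown directory hash type: probe every stripe. */
static int unlink_try_all_stripes(struct stripe *stripes, int count,
                                  const char *name)
{
    int i, rc = -ENOENT;

    for (i = 0; i < count; i++) {
        rc = stripe_unlink(&stripes[i], name);
        if (rc != -ENOENT)
            break;  /* found it (rc == 0) or hit a real error */
        /* -ENOENT: not on this stripe, keep probing.  Resetting rc to 0
         * here is exactly the kind of slip that would let a second
         * client's rmdir return success instead of -ENOENT. */
    }
    return rc;
}

int main(void)
{
    struct stripe stripes[4] = { {0}, {0}, {1}, {0} };

    /* First removal finds the entry on one stripe and succeeds. */
    printf("first  rmdir: rc = %d\n",
           unlink_try_all_stripes(stripes, 4, "simul_rmdir.0"));
    /* Second removal must report -2 (-ENOENT), matching the MDT reply above. */
    printf("second rmdir: rc = %d\n",
           unlink_try_all_stripes(stripes, 4, "simul_rmdir.0"));
    return 0;
}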

Comment by Jodi Levi (Inactive) [ 21/Aug/14 ]

Patch landed to Master.

Comment by Sarah Liu [ 17/Oct/14 ]

Hit this error in interop testing between a 2.6.0 client and a master server; the patch needs to be back-ported to b2_6 to fix the error:

https://testing.hpdd.intel.com/test_sets/336e7e88-5558-11e4-a570-5254006e85c2
