[LU-5371] Failure on test suite parallel-scale test_simul Created: 18/Jul/14 Updated: 17/Oct/14 Resolved: 21/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
client and server: lustre-b2_6-rc2 RHEL6 ldiskfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14975 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c8156066-0dc7-11e4-af8b-5254006e85c2.

The sub-test test_simul failed with the following error:

test log: 22:35:15: FAILED in simul_rmdir: too many operations succeeded (2). |
| Comments |
| Comment by Jodi Levi (Inactive) [ 21/Jul/14 ] |
|
Di, |
| Comment by Sarah Liu [ 21/Jul/14 ] |
|
Also hit this issue in interop testing between a 2.5.2 server and b2_6-rc2: https://testing.hpdd.intel.com/test_sets/3c3c2f74-0d82-11e4-b3f5-5254006e85c2 |
| Comment by Di Wang [ 22/Jul/14 ] |
|
Hmm, I checked the logs; these failures started occurring around July 3rd (build 2548). The simul test tries to rmdir a single directory from multiple clients (multiple threads), and only one thread should succeed. But in this run two threads succeeded, which is why the test fails:

04:50:57: FAILED in simul_rmdir: too many operations succeeded (2).

I checked the debug logs on both clients and on the MDT side.

Client1 (rmdir succeeds):

00000080:00200000:1.0:1405575315.343857:0:29632:0:(namei.c:1137:ll_rmdir_generic()) VFS Op:name=simul_rmdir.0, dir=[0x2000013a8:0x143db:0x0](ffff880060e42138)
00010000:00010000:1.0:1405575315.343863:0:29632:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 2/0,0 mode: PR/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x8400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1405575315.343874:0:29632:0:(ldlm_request.c:1172:ldlm_cancel_pack()) ### packing ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 1/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1405575315.343877:0:29632:0:(ldlm_request.c:1176:ldlm_cancel_pack()) 1 locks packed
00010000:00010000:1.0:1405575315.343877:0:29632:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 0/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
00000100:00100000:1.0:1405575315.343885:0:29632:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc simul:49ab2d2b-0df9-c120-4bb6-1f0147383ab7:29632:1473843428301100:10.2.4.203@tcp:36
00000100:00100000:1.0:1405575315.343909:0:29632:0:(client.c:2146:ptlrpc_set_wait()) set ffff88001f1b0440 going to sleep for 43 seconds
00000100:00100000:1.0:1405575315.345000:0:29632:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc simul:49ab2d2b-0df9-c120-4bb6-1f0147383ab7:29632:1473843428301100:10.2.4.203@tcp:36
00000080:00200000:1.0:1405575315.345007:0:29632:0:(llite_lib.c:1425:ll_clear_inode()) VFS Op:inode=[0x2000013a7:0x1c43f:0x0](ffff8800616f1178)

Client2 (rmdir also succeeds):

00000080:00200000:0.0:1405575315.345956:0:18310:0:(namei.c:1137:ll_rmdir_generic()) VFS Op:name=simul_rmdir.0, dir=[0x2000013a8:0x143db:0x0](ffff88005d0bebb8)
00010000:00010000:0.0:1405575315.345961:0:18310:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 2/0,0 mode: PR/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x8400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1405575315.345970:0:18310:0:(ldlm_request.c:1172:ldlm_cancel_pack()) ### packing ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 1/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1405575315.345974:0:18310:0:(ldlm_request.c:1176:ldlm_cancel_pack()) 1 locks packed
00010000:00010000:0.0:1405575315.345975:0:18310:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 0/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
00000100:00100000:0.0:1405575315.345981:0:18310:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc simul:992018ad-6c7b-3f3b-15cf-42922a66aa9a:18310:1473838828202428:10.2.4.203@tcp:36
00000100:00100000:0.0:1405575315.345989:0:18310:0:(client.c:2146:ptlrpc_set_wait()) set ffff88002c1cf200 going to sleep for 43 seconds
00000100:00100000:0.0:1405575315.347314:0:18310:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc simul:992018ad-6c7b-3f3b-15cf-42922a66aa9a:18310:1473838828202428:10.2.4.203@tcp:36
00000080:00200000:0.0:1405575315.347322:0:18310:0:(llite_lib.c:1425:ll_clear_inode()) VFS Op:inode=[0x2000013a7:0x1c43f:0x0](ffff88003ab0c1b8)

Interestingly, when the MDT handled the request from client2, it already returned ENOENT (rc -2), but client2 seems to ignore that:

00000100:00100000:1.0:1405575315.353820:0:25427:0:(service.c:2144:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_003:992018ad-6c7b-3f3b-15cf-42922a66aa9a+675:18310:x1473838828202428:12345-10.2.4.205@tcp:36 Request procesed in 988us (1036us total) trans 0 rc -2/-2

Sigh, there is not enough debug log to tell me what happened there; I will keep digging. |
| Comment by Di Wang [ 22/Jul/14 ] |
|
http://review.whamcloud.com/11170

Sorry, this was introduced by the patch "For unknown hash type, LMV should try all stripes to locate" (Signed-off-by: wang di <di.wang@intel.com>). |
| Comment by Jodi Levi (Inactive) [ 21/Aug/14 ] |
|
Patch landed to Master. |
| Comment by Sarah Liu [ 17/Oct/14 ] |
|
Hit this error in an interop test between a 2.6.0 client and a master server; the patch needs to be back-ported to b2_6 to fix the error: https://testing.hpdd.intel.com/test_sets/336e7e88-5558-11e4-a570-5254006e85c2 |