[LU-6359] recovery-small test_10d: FAIL: wrong content found Created: 12/Mar/15 Updated: 09/Jan/17 Resolved: 14/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | zfs |
| Environment: |
Lustre branch: master |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 17819 |
| Description |
|
recovery-small test 10d failed as follows:

CMD: shadow-42vm1,shadow-42vm2.shadow.whamcloud.com df /mnt/lustre; uname -n >> /mnt/lustre/recon
Connected clients: shadow-42vm2.shadow.whamcloud.com shadow-42vm2.shadow.whamcloud.com shadow-42vm1.shadow.whamcloud.com
/mnt/lustre/f10d.recovery-small /tmp/f10d.recovery-small differ: byte 1, line 1
recovery-small test_10d: @@@@@@ FAIL: wrong content found

Console log from OSS:

22:56:53:Lustre: DEBUG MARKER: == recovery-small test 10d: test failed blocking ast == 22:56:44 (1425077804)
22:56:53:LustreError: 7235:0:(ldlm_lockd.c:654:ldlm_handle_ast_error()) ### client (nid 10.1.5.247@tcp) returned error from blocking AST (req status -71 rc -71), evict it ns: filter-lustre-OST0001_UUID lock: ffff88006eac1180/0x2685d9635bea7737 lrc: 4/0,0 mode: PR/PR res: [0x3:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x60000000030020 nid: 10.1.5.247@tcp remote: 0x8c70d8c3c8479ef9 expref: 5 pid: 7235 timeout: 4295762243 lvb_type: 0
22:56:53:LustreError: 7235:0:(ldlm_lockd.c:654:ldlm_handle_ast_error()) Skipped 2 previous similar messages
22:58:04:LustreError: 138-a: lustre-OST0001: A client on nid 10.1.5.247@tcp was evicted due to a lock blocking callback time out: rc -71
22:58:04:LustreError: Skipped 2 previous similar messages
22:58:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-small test_10d: @@@@@@ FAIL: wrong content found

Maloo report: https://testing.hpdd.intel.com/test_sets/a712f6f2-c001-11e4-8be6-5254006e85c2 |
| Comments |
| Comment by Jian Yu [ 12/Mar/15 ] |
|
This is blocking patch review testing on master branch: |
| Comment by Bruno Faccini (Inactive) [ 18/Apr/15 ] |
|
Two more master review runs failed at: |
| Comment by Jian Yu [ 25/Apr/15 ] |
|
More instances on master branch: |
| Comment by James Nunez (Inactive) [ 09/Jun/15 ] |
|
Another instance on master: |
| Comment by Di Wang [ 11/Aug/15 ] |
|
Another failure https://testing.hpdd.intel.com/test_sets/7ddcc5de-3e6d-11e5-870e-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 27/Aug/15 ] |
|
Another on master (pre-2.8): |
| Comment by Oleg Drokin [ 31/Aug/15 ] |
|
So the client does answer with EPROTO from somewhere in ptlrpc without going into ldlm or osc:

00000100:00100000:0.0:1440610522.125668:0:27431:0:(service.c:2076:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104
00000100:00000200:0.0:1440610522.125671:0:27431:0:(service.c:2081:ptlrpc_server_handle_request()) got req 1510587166102920
00010000:00000001:0.0:1440610522.125673:0:27431:0:(ldlm_lockd.c:2146:ldlm_callback_handler()) Process entered
00010000:02000000:0.0:1440610522.125675:0:27431:0:(libcfs_fail.h:96:cfs_fail_check_set()) *** cfs_fail_loc=305, val=0***
00000100:00000001:0.0:1440610522.125678:0:27431:0:(pack_generic.c:331:lustre_pack_reply_v2()) Process entered
02000000:00000001:0.0:1440610522.125679:0:27431:0:(sec.c:2127:sptlrpc_svc_alloc_rs()) Process entered
02000000:00000010:0.0:1440610522.125681:0:27431:0:(sec_null.c:329:null_alloc_rs()) kmalloced 'rs': 488 at ffff88007cabc000.
02000000:00000001:0.0:1440610522.125683:0:27431:0:(sec.c:2163:sptlrpc_svc_alloc_rs()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1440610522.125684:0:27431:0:(pack_generic.c:365:lustre_pack_reply_v2()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:0.0:1440610522.125687:0:27431:0:(ldlm_lib.c:2735:target_pack_pool_reply()) Process entered
00010000:00000001:0.0:1440610522.125689:0:27431:0:(ldlm_lib.c:2754:target_pack_pool_reply()) Process leaving (rc=0 : 0 : 0)
00000100:00001000:0.0:1440610522.125691:0:27431:0:(import.c:1627:at_measured()) add 1 to ffff88007b872150 time=20 v=1 (1 1 0 0)
00000100:00000001:0.0:1440610522.125694:0:27431:0:(connection.c:130:ptlrpc_connection_addref()) Process entered
00000100:00000040:0.0:1440610522.125696:0:27431:0:(connection.c:135:ptlrpc_connection_addref()) conn=ffff880058022ec0 refcount 14 to 10.1.6.94@tcp
00000100:00000001:0.0:1440610522.125698:0:27431:0:(connection.c:137:ptlrpc_connection_addref()) Process leaving (rc=18446612133790756544 : -131939918795072 : ffff880058022ec0)
02000000:00000001:0.0:1440610522.125701:0:27431:0:(sec.c:2176:sptlrpc_svc_wrap_reply()) Process entered
02000000:00000001:0.0:1440610522.125702:0:27431:0:(sec.c:2187:sptlrpc_svc_wrap_reply()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1440610522.125704:0:27431:0:(niobuf.c:57:ptl_send_buf()) Process entered
00000100:00000040:0.0:1440610522.125706:0:27431:0:(niobuf.c:61:ptl_send_buf()) conn=ffff880058022ec0 id 12345-10.1.6.94@tcp
00000100:00000200:0.0:1440610522.125710:0:27431:0:(niobuf.c:83:ptl_send_buf()) Sending 192 bytes to portal 16, xid 1510587166102920, offset 192
00000100:00000001:0.0:1440610522.125721:0:27431:0:(niobuf.c:98:ptl_send_buf()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1440610522.125723:0:27431:0:(connection.c:94:ptlrpc_connection_put()) Process entered
00000100:00000040:0.0:1440610522.125725:0:27431:0:(connection.c:122:ptlrpc_connection_put()) PUT conn=ffff880058022ec0 refcount 13 to 10.1.6.94@tcp
00000100:00000001:0.0:1440610522.125727:0:27431:0:(connection.c:124:ptlrpc_connection_put()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:0.0:1440610522.125729:0:27431:0:(ldlm_lockd.c:2174:ldlm_callback_handler()) Process leaving (rc=0 : 0 : 0)
00000100:00000040:0.0:1440610522.125732:0:27431:0:(lustre_net.h:2448:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff88007bfd2980 x1510587166102920/t0(0) o104->LOV_OSC_UUID@10.1.6.94@tcp:313/0 lens 296/192 e 0 to 0 dl 1440610528 ref 1 fl Interpret:/0/0 rc -71/-71
00000100:00100000:0.0:1440610522.125740:0:27431:0:(service.c:2126:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104 Request procesed in 75us (162us total) trans 0 rc -71/-71 |
| Comment by Oleg Drokin [ 31/Aug/15 ] |
|
So in fact the error comes from the test itself:

stat $DIR2/$tfile >& /dev/null
$LCTL set_param fail_err=71
drop_bl_callback_once "echo -n \\\", world\\\" >> $DIR2/$tfile" |
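For reference, drop_bl_callback_once() arms a one-shot fail_loc around the command it runs. A minimal sketch of that pattern, assuming the usual libcfs conventions (the 0x305 value matches the cfs_fail_loc=305 hit in the debug log above, and OR-ing in 0x80000000 is the standard "fire once" bit); the real test-framework.sh helper may differ in detail:

# sketch only, not the verbatim test-framework.sh helper
drop_bl_callback_once_sketch() {
    local rc=0
    $LCTL set_param fail_loc=0x80000305   # drop exactly one blocking AST on this client
    "$@" || rc=$?                         # run the command expected to trigger the BL AST
    $LCTL set_param fail_loc=0            # disarm again
    return $rc
}

With fail_err=71 set beforehand, the dropped callback shows up at the server as rc -71 (EPROTO), which matches the "returned error from blocking AST (req status -71 rc -71), evict it" eviction in the OSS console log.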
| Comment by Joseph Gmitter (Inactive) [ 31/Aug/15 ] |
|
Hi Niu, |
| Comment by Niu Yawei (Inactive) [ 01/Sep/15 ] |
|
I checked one of the failures: https://testing.hpdd.intel.com/test_sets/ef954b8a-4c26-11e5-b537-5254006e85c2 . From the client log, it seems the fail_loc was triggered even before the script tried to append ", world" to the file from mount_2:

00010000:00000001:0.0:1440610522.125729:0:27431:0:(ldlm_lockd.c:2174:ldlm_callback_handler()) Process leaving (rc=0 : 0 : 0)
00000100:00000040:0.0:1440610522.125732:0:27431:0:(lustre_net.h:2448:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff88007bfd2980 x1510587166102920/t0(0) o104->LOV_OSC_UUID@10.1.6.94@tcp:313/0 lens 296/192 e 0 to 0 dl 1440610528 ref 1 fl Interpret:/0/0 rc -71/-71
00000100:00100000:0.0:1440610522.125740:0:27431:0:(service.c:2126:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104 Request procesed in 75us (162us total) trans 0 rc -71/-71

The fail_loc was triggered.

00000020:00000001:0.0:1440610522.133279:0:409:0:(cl_lock.c:124:cl_lock_init()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:0.0:1440610522.133282:0:409:0:(lov_lock.c:115:lov_sublock_init()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:0.0:1440610522.133284:0:409:0:(lov_lock.c:207:lov_lock_sub_init()) Process leaving (rc=18446612133879364800 : -131939830186816 : ffff88005d4a3cc0)
00000020:00000001:0.0:1440610522.133287:0:409:0:(cl_lock.c:76:cl_lock_slice_add()) Process entered
00000020:00000001:0.0:1440610522.133289:0:409:0:(cl_lock.c:81:cl_lock_slice_add()) Process leaving
00020000:00000001:0.0:1440610522.133290:0:409:0:(lov_lock.c:333:lov_lock_init_raid0()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:0.0:1440610522.133293:0:409:0:(cl_lock.c:124:cl_lock_init()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:0.0:1440610522.133295:0:409:0:(cl_io.c:1433:cl_sync_io_init()) Process entered
00000020:00000001:0.0:1440610522.133297:0:409:0:(cl_io.c:1441:cl_sync_io_init()) Process leaving
00000020:00000001:0.0:1440610522.133298:0:409:0:(cl_lock.c:177:cl_lock_enqueue()) Process entered
00020000:00000001:0.0:1440610522.133300:0:409:0:(lov_lock.c:244:lov_lock_enqueue()) Process entered
00020000:00000001:0.0:1440610522.133302:0:409:0:(lov_io.c:236:lov_sub_get()) Process entered
00020000:00000001:0.0:1440610522.133303:0:409:0:(lov_io.c:247:lov_sub_get()) Process leaving (rc=18446612134369657600 : -131939339894016 : ffff88007a838300)
00000020:00000001:0.0:1440610522.133306:0:409:0:(cl_lock.c:177:cl_lock_enqueue()) Process entered
00000008:00000001:0.0:1440610522.133308:0:409:0:(osc_lock.c:933:osc_lock_enqueue()) Process entered
00000008:00000001:0.0:1440610522.133310:0:409:0:(osc_lock.c:947:osc_lock_enqueue()) Process leaving via enqueue_base (rc=0 : 0 : 0x0)
00000008:00000001:0.0:1440610522.133314:0:409:0:(osc_request.c:1988:osc_enqueue_base()) Process entered

The test script starts to acquire a lock on the file from mount_2. I think we'd better drop unused locks and use drop_bl_callback() instead of drop_bl_callback_once() to avoid unnecessary interference. |
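As a rough illustration of that suggestion (not the verbatim change that landed as http://review.whamcloud.com/16155), the test step could flush unused locks first and keep the failure injection armed for the whole command; cancel_lru_locks is the existing test-framework helper that clears unused DLM locks via ldlm.namespaces.*.lru_size=clear:

# hedged sketch of the proposed test change, not the actual patch
stat $DIR2/$tfile >& /dev/null
cancel_lru_locks osc          # drop cached/unused locks so a stale lock cannot consume the injected failure
$LCTL set_param fail_err=71
drop_bl_callback "echo -n \\\", world\\\" >> $DIR2/$tfile"   # persistent drop for the whole command, not one-shot

The idea is that drop_bl_callback() presumably keeps the fail_loc set for the duration of the command, so a blocking AST on an unrelated cached lock cannot eat the single-shot trigger before the intended append runs.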
| Comment by Gerrit Updater [ 01/Sep/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/16155 |
| Comment by Gerrit Updater [ 14/Sep/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16155/ |
| Comment by Joseph Gmitter (Inactive) [ 14/Sep/15 ] |
|
Landed for 2.8. |