[LU-6359] recovery-small test_10d: FAIL: wrong content found Created: 12/Mar/15  Updated: 09/Jan/17  Resolved: 14/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: zfs
Environment:

Lustre branch: master
Lustre build: https://build.hpdd.intel.com/job/lustre-master/2934/
FSTYPE=zfs


Issue Links:
Duplicate
is duplicated by LU-6233 recovery-small test_10d failed with '... Closed
Related
Severity: 3
Rank (Obsolete): 17819

 Description   

recovery-small test 10d failed as follows:

CMD: shadow-42vm1,shadow-42vm2.shadow.whamcloud.com df /mnt/lustre; uname -n >> /mnt/lustre/recon
Connected clients:
shadow-42vm2.shadow.whamcloud.com
shadow-42vm2.shadow.whamcloud.com
shadow-42vm1.shadow.whamcloud.com
/mnt/lustre/f10d.recovery-small /tmp/f10d.recovery-small differ: byte 1, line 1
 recovery-small test_10d: @@@@@@ FAIL: wrong content found

Console log from OSS:

22:56:53:Lustre: DEBUG MARKER: == recovery-small test 10d: test failed blocking ast == 22:56:44 (1425077804)
22:56:53:LustreError: 7235:0:(ldlm_lockd.c:654:ldlm_handle_ast_error()) ### client (nid 10.1.5.247@tcp) returned error from blocking AST (req status -71 rc -71), evict it ns: filter-lustre-OST0001_UUID lock: ffff88006eac1180/0x2685d9635bea7737 lrc: 4/0,0 mode: PR/PR res: [0x3:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x60000000030020 nid: 10.1.5.247@tcp remote: 0x8c70d8c3c8479ef9 expref: 5 pid: 7235 timeout: 4295762243 lvb_type: 0
22:56:53:LustreError: 7235:0:(ldlm_lockd.c:654:ldlm_handle_ast_error()) Skipped 2 previous similar messages
22:58:04:LustreError: 138-a: lustre-OST0001: A client on nid 10.1.5.247@tcp was evicted due to a lock blocking callback time out: rc -71
22:58:04:LustreError: Skipped 2 previous similar messages
22:58:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_10d: @@@@@@ FAIL: wrong content found

Maloo report: https://testing.hpdd.intel.com/test_sets/a712f6f2-c001-11e4-8be6-5254006e85c2



 Comments   
Comment by Jian Yu [ 12/Mar/15 ]

This is blocking patch review testing on master branch:
https://testing.hpdd.intel.com/test_sets/7d4390e4-c83e-11e4-b4c3-5254006e85c2
https://testing.hpdd.intel.com/test_sets/8fca85ee-c855-11e4-be50-5254006e85c2
https://testing.hpdd.intel.com/test_sets/05535fd6-c7e1-11e4-b4c3-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 18/Apr/15 ]

Two more master patch-review runs failed at:
https://testing.hpdd.intel.com/test_sets/3ad872c0-e458-11e4-a267-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7d841c34-e54b-11e4-8bad-5254006e85c2

Comment by Jian Yu [ 25/Apr/15 ]

Another instance on master branch:
https://testing.hpdd.intel.com/test_sets/1c01aef4-ea87-11e4-95aa-5254006e85c2

Comment by James Nunez (Inactive) [ 09/Jun/15 ]

Another instance on master:
https://testing.hpdd.intel.com/test_sets/28f2641a-0e71-11e5-828a-5254006e85c2

Comment by Di Wang [ 11/Aug/15 ]

Another failure

https://testing.hpdd.intel.com/test_sets/7ddcc5de-3e6d-11e5-870e-5254006e85c2

Comment by James Nunez (Inactive) [ 27/Aug/15 ]

Another on master (pre-2.8):
2015-08-26 17:30:19 - https://testing.hpdd.intel.com/test_sets/ef954b8a-4c26-11e5-b537-5254006e85c2

Comment by Oleg Drokin [ 31/Aug/15 ]

So the client does answer with EPROTO from somewhere in ptlrpc without going into ldlm or osc:

00000100:00100000:0.0:1440610522.125668:0:27431:0:(service.c:2076:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104
00000100:00000200:0.0:1440610522.125671:0:27431:0:(service.c:2081:ptlrpc_server_handle_request()) got req 1510587166102920
00010000:00000001:0.0:1440610522.125673:0:27431:0:(ldlm_lockd.c:2146:ldlm_callback_handler()) Process entered
00010000:02000000:0.0:1440610522.125675:0:27431:0:(libcfs_fail.h:96:cfs_fail_check_set()) *** cfs_fail_loc=305, val=0***
00000100:00000001:0.0:1440610522.125678:0:27431:0:(pack_generic.c:331:lustre_pack_reply_v2()) Process entered
02000000:00000001:0.0:1440610522.125679:0:27431:0:(sec.c:2127:sptlrpc_svc_alloc_rs()) Process entered
02000000:00000010:0.0:1440610522.125681:0:27431:0:(sec_null.c:329:null_alloc_rs()) kmalloced 'rs': 488 at ffff88007cabc000.
02000000:00000001:0.0:1440610522.125683:0:27431:0:(sec.c:2163:sptlrpc_svc_alloc_rs()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1440610522.125684:0:27431:0:(pack_generic.c:365:lustre_pack_reply_v2()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:0.0:1440610522.125687:0:27431:0:(ldlm_lib.c:2735:target_pack_pool_reply()) Process entered
00010000:00000001:0.0:1440610522.125689:0:27431:0:(ldlm_lib.c:2754:target_pack_pool_reply()) Process leaving (rc=0 : 0 : 0)
00000100:00001000:0.0:1440610522.125691:0:27431:0:(import.c:1627:at_measured()) add 1 to ffff88007b872150 time=20 v=1 (1 1 0 0)
00000100:00000001:0.0:1440610522.125694:0:27431:0:(connection.c:130:ptlrpc_connection_addref()) Process entered
00000100:00000040:0.0:1440610522.125696:0:27431:0:(connection.c:135:ptlrpc_connection_addref()) conn=ffff880058022ec0 refcount 14 to 10.1.6.94@tcp
00000100:00000001:0.0:1440610522.125698:0:27431:0:(connection.c:137:ptlrpc_connection_addref()) Process leaving (rc=18446612133790756544 : -131939918795072 : ffff880058022ec0)
02000000:00000001:0.0:1440610522.125701:0:27431:0:(sec.c:2176:sptlrpc_svc_wrap_reply()) Process entered
02000000:00000001:0.0:1440610522.125702:0:27431:0:(sec.c:2187:sptlrpc_svc_wrap_reply()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1440610522.125704:0:27431:0:(niobuf.c:57:ptl_send_buf()) Process entered
00000100:00000040:0.0:1440610522.125706:0:27431:0:(niobuf.c:61:ptl_send_buf()) conn=ffff880058022ec0 id 12345-10.1.6.94@tcp
00000100:00000200:0.0:1440610522.125710:0:27431:0:(niobuf.c:83:ptl_send_buf()) Sending 192 bytes to portal 16, xid 1510587166102920, offset 192
00000100:00000001:0.0:1440610522.125721:0:27431:0:(niobuf.c:98:ptl_send_buf()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1440610522.125723:0:27431:0:(connection.c:94:ptlrpc_connection_put()) Process entered
00000100:00000040:0.0:1440610522.125725:0:27431:0:(connection.c:122:ptlrpc_connection_put()) PUT conn=ffff880058022ec0 refcount 13 to 10.1.6.94@tcp
00000100:00000001:0.0:1440610522.125727:0:27431:0:(connection.c:124:ptlrpc_connection_put()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:0.0:1440610522.125729:0:27431:0:(ldlm_lockd.c:2174:ldlm_callback_handler()) Process leaving (rc=0 : 0 : 0)
00000100:00000040:0.0:1440610522.125732:0:27431:0:(lustre_net.h:2448:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff88007bfd2980 x1510587166102920/t0(0) o104->LOV_OSC_UUID@10.1.6.94@tcp:313/0 lens 296/192 e 0 to 0 dl 1440610528 ref 1 fl Interpret:/0/0 rc -71/-71
00000100:00100000:0.0:1440610522.125740:0:27431:0:(service.c:2126:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104 Request procesed in 75us (162us total) trans 0 rc -71/-71
Comment by Oleg Drokin [ 31/Aug/15 ]

So in fact the error comes from the test itself:

        stat $DIR2/$tfile >& /dev/null
        $LCTL set_param fail_err=71
        drop_bl_callback_once "echo -n \\\", world\\\" >> $DIR2/$tfile"
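The snippet above arms a one-shot failure injection (fail_err=71, i.e. EPROTO) before the append. A toy sketch, not Lustre code, of how such a one-shot injection can be consumed by an unrelated earlier event so the operation the test actually meant to break goes through; the `maybe_fail` helper and its callers are invented for illustration:

```shell
#!/bin/bash
# Toy model of a *_once failure injection: the first caller consumes the
# flag, every later caller succeeds.

fail_once=1
maybe_fail() {                       # stands in for a one-shot fail_loc check
    if (( fail_once )); then
        fail_once=0
        echo "injected failure for: $1"
        return 1
    fi
    return 0
}

maybe_fail "stray blocking AST (e.g. from the earlier stat)"
first_rc=$?
maybe_fail "blocking AST for the append under test"
second_rc=$?
(( second_rc == 0 )) && echo "append proceeded: one-shot already consumed"
```

If any blocking AST fires between `set_param fail_err=71` and the append, it eats the injection and the append is no longer affected.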
Comment by Joseph Gmitter (Inactive) [ 31/Aug/15 ]

Hi Niu,
Can you have a look at this issue?
Thanks.
Joe

Comment by Niu Yawei (Inactive) [ 01/Sep/15 ]

I checked one of the failures: https://testing.hpdd.intel.com/test_sets/ef954b8a-4c26-11e5-b537-5254006e85c2 . From the client log, it seems the fail_loc was triggered even before the script tried to append ", world" to the file from mount_2:

00010000:00000001:0.0:1440610522.125729:0:27431:0:(ldlm_lockd.c:2174:ldlm_callback_handler()) Process leaving (rc=0 : 0 : 0)
00000100:00000040:0.0:1440610522.125732:0:27431:0:(lustre_net.h:2448:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff88007bfd2980 x1510587166102920/t0(0) o104->LOV_OSC_UUID@10.1.6.94@tcp:313/0 lens 296/192 e 0 to 0 dl 1440610528 ref 1 fl Interpret:/0/0 rc -71/-71
00000100:00100000:0.0:1440610522.125740:0:27431:0:(service.c:2126:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104 Request procesed in 75us (162us total) trans 0 rc -71/-71

The fail_loc was triggered.

00000020:00000001:0.0:1440610522.133279:0:409:0:(cl_lock.c:124:cl_lock_init()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:0.0:1440610522.133282:0:409:0:(lov_lock.c:115:lov_sublock_init()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:0.0:1440610522.133284:0:409:0:(lov_lock.c:207:lov_lock_sub_init()) Process leaving (rc=18446612133879364800 : -131939830186816 : ffff88005d4a3cc0)
00000020:00000001:0.0:1440610522.133287:0:409:0:(cl_lock.c:76:cl_lock_slice_add()) Process entered
00000020:00000001:0.0:1440610522.133289:0:409:0:(cl_lock.c:81:cl_lock_slice_add()) Process leaving
00020000:00000001:0.0:1440610522.133290:0:409:0:(lov_lock.c:333:lov_lock_init_raid0()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:0.0:1440610522.133293:0:409:0:(cl_lock.c:124:cl_lock_init()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:0.0:1440610522.133295:0:409:0:(cl_io.c:1433:cl_sync_io_init()) Process entered
00000020:00000001:0.0:1440610522.133297:0:409:0:(cl_io.c:1441:cl_sync_io_init()) Process leaving
00000020:00000001:0.0:1440610522.133298:0:409:0:(cl_lock.c:177:cl_lock_enqueue()) Process entered
00020000:00000001:0.0:1440610522.133300:0:409:0:(lov_lock.c:244:lov_lock_enqueue()) Process entered
00020000:00000001:0.0:1440610522.133302:0:409:0:(lov_io.c:236:lov_sub_get()) Process entered
00020000:00000001:0.0:1440610522.133303:0:409:0:(lov_io.c:247:lov_sub_get()) Process leaving (rc=18446612134369657600 : -131939339894016 : ffff88007a838300)
00000020:00000001:0.0:1440610522.133306:0:409:0:(cl_lock.c:177:cl_lock_enqueue()) Process entered
00000008:00000001:0.0:1440610522.133308:0:409:0:(osc_lock.c:933:osc_lock_enqueue()) Process entered
00000008:00000001:0.0:1440610522.133310:0:409:0:(osc_lock.c:947:osc_lock_enqueue()) Process leaving via enqueue_base (rc=0 : 0 : 0x0)
00000008:00000001:0.0:1440610522.133314:0:409:0:(osc_request.c:1988:osc_enqueue_base()) Process entered

The test script starts to acquire a lock on the file from mount_2.

I think we'd better drop the unused locks and use drop_bl_callback() instead of drop_bl_callback_once() to avoid unnecessary interference.
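A minimal sketch of that direction, assuming the helpers behave as their names suggest; the Lustre test-framework functions are stubbed out here for illustration only (the real change is the Gerrit patch below):

```shell
#!/bin/bash
# Sketch: flush cached client locks first so no stray blocking AST can
# consume the injected failure, then use the persistent drop_bl_callback()
# rather than the one-shot variant. Both helpers are stubs.

cancel_lru_locks() { echo "stub: cancel_lru_locks $1"; }
drop_bl_callback() { echo "stub: drop_bl_callback"; eval "$1"; }

DIR2=$(mktemp -d); tfile=f10d.recovery-small
printf 'Hello' > "$DIR2/$tfile"

cancel_lru_locks osc          # drop unused locks that could trigger an early BL AST
drop_bl_callback "printf ', world' >> $DIR2/$tfile"

content=$(cat "$DIR2/$tfile")
echo "$content"
rm -rf "$DIR2"
```

With the persistent variant, a stray earlier AST no longer disarms the injection, so the append under test still sees the intended failure path.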

Comment by Gerrit Updater [ 01/Sep/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/16155
Subject: LU-6359 test: fix test_10d of recovery-small
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c2b59f8f2e9fcac805cd324bc9cf7e543c8f5cbb

Comment by Gerrit Updater [ 14/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16155/
Subject: LU-6359 test: fix test_10d of recovery-small
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eac7b0ca8d2c915741bc6f9ca51562e83119a1ce

Comment by Joseph Gmitter (Inactive) [ 14/Sep/15 ]

Landed for 2.8.

Generated at Sat Feb 10 01:59:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.