[LU-6359] recovery-small test_10d: FAIL: wrong content found


    Description

      recovery-small test 10d failed as follows:

      CMD: shadow-42vm1,shadow-42vm2.shadow.whamcloud.com df /mnt/lustre; uname -n >> /mnt/lustre/recon
      Connected clients:
      shadow-42vm2.shadow.whamcloud.com
      shadow-42vm2.shadow.whamcloud.com
      shadow-42vm1.shadow.whamcloud.com
      /mnt/lustre/f10d.recovery-small /tmp/f10d.recovery-small differ: byte 1, line 1
       recovery-small test_10d: @@@@@@ FAIL: wrong content found
      

      Console log from OSS:

      22:56:53:Lustre: DEBUG MARKER: == recovery-small test 10d: test failed blocking ast == 22:56:44 (1425077804)
      22:56:53:LustreError: 7235:0:(ldlm_lockd.c:654:ldlm_handle_ast_error()) ### client (nid 10.1.5.247@tcp) returned error from blocking AST (req status -71 rc -71), evict it ns: filter-lustre-OST0001_UUID lock: ffff88006eac1180/0x2685d9635bea7737 lrc: 4/0,0 mode: PR/PR res: [0x3:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x60000000030020 nid: 10.1.5.247@tcp remote: 0x8c70d8c3c8479ef9 expref: 5 pid: 7235 timeout: 4295762243 lvb_type: 0
      22:56:53:LustreError: 7235:0:(ldlm_lockd.c:654:ldlm_handle_ast_error()) Skipped 2 previous similar messages
      22:58:04:LustreError: 138-a: lustre-OST0001: A client on nid 10.1.5.247@tcp was evicted due to a lock blocking callback time out: rc -71
      22:58:04:LustreError: Skipped 2 previous similar messages
      22:58:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_10d: @@@@@@ FAIL: wrong content found
      

      Maloo report: https://testing.hpdd.intel.com/test_sets/a712f6f2-c001-11e4-8be6-5254006e85c2
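      For context, test_10d ("test failed blocking ast") deliberately makes a client fail a blocking AST so that the OSS evicts it, then checks the file contents afterwards. A minimal sketch of the scenario, paraphrased from the test snippet quoted in the comments below (the initial "hello" content and the simplified quoting are assumptions, not the verbatim script):

          echo -n "hello" > $DIR1/$tfile     # write from the first mount (assumed content)
          stat $DIR2/$tfile >& /dev/null     # reference the file from the second mount
          $LCTL set_param fail_err=71        # a failed blocking AST will report -71 (-EPROTO)
          drop_bl_callback_once "echo -n \", world\" >> $DIR2/$tfile"
          # the OSS evicts the client whose blocking AST failed (console log above);
          # the test then compares the Lustre file against a local reference copy:
          cmp $DIR1/$tfile $TMP/$tfile || error "wrong content found"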

      Activity

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Landed for 2.8.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16155/
            Subject: LU-6359 test: fix test_10d of recovery-small
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: eac7b0ca8d2c915741bc6f9ca51562e83119a1ce


            gerrit Gerrit Updater added a comment -

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/16155
            Subject: LU-6359 test: fix test_10d of recovery-small
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c2b59f8f2e9fcac805cd324bc9cf7e543c8f5cbb


            niu Niu Yawei (Inactive) added a comment -

            I checked one of the failures: https://testing.hpdd.intel.com/test_sets/ef954b8a-4c26-11e5-b537-5254006e85c2 . From the client log, it seems the fail_loc had been triggered even before the script tried to append ", world" to the file from mount_2:

            00010000:00000001:0.0:1440610522.125729:0:27431:0:(ldlm_lockd.c:2174:ldlm_callback_handler()) Process leaving (rc=0 : 0 : 0)
            00000100:00000040:0.0:1440610522.125732:0:27431:0:(lustre_net.h:2448:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff88007bfd2980 x1510587166102920/t0(0) o104->LOV_OSC_UUID@10.1.6.94@tcp:313/0 lens 296/192 e 0 to 0 dl 1440610528 ref 1 fl Interpret:/0/0 rc -71/-71
            00000100:00100000:0.0:1440610522.125740:0:27431:0:(service.c:2126:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104 Request procesed in 75us (162us total) trans 0 rc -71/-71
            

            This is where the one-shot fail_loc was consumed.

            00000020:00000001:0.0:1440610522.133279:0:409:0:(cl_lock.c:124:cl_lock_init()) Process leaving (rc=0 : 0 : 0)
            00020000:00000001:0.0:1440610522.133282:0:409:0:(lov_lock.c:115:lov_sublock_init()) Process leaving (rc=0 : 0 : 0)
            00020000:00000001:0.0:1440610522.133284:0:409:0:(lov_lock.c:207:lov_lock_sub_init()) Process leaving (rc=18446612133879364800 : -131939830186816 : ffff88005d4a3cc0)
            00000020:00000001:0.0:1440610522.133287:0:409:0:(cl_lock.c:76:cl_lock_slice_add()) Process entered
            00000020:00000001:0.0:1440610522.133289:0:409:0:(cl_lock.c:81:cl_lock_slice_add()) Process leaving
            00020000:00000001:0.0:1440610522.133290:0:409:0:(lov_lock.c:333:lov_lock_init_raid0()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1440610522.133293:0:409:0:(cl_lock.c:124:cl_lock_init()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1440610522.133295:0:409:0:(cl_io.c:1433:cl_sync_io_init()) Process entered
            00000020:00000001:0.0:1440610522.133297:0:409:0:(cl_io.c:1441:cl_sync_io_init()) Process leaving
            00000020:00000001:0.0:1440610522.133298:0:409:0:(cl_lock.c:177:cl_lock_enqueue()) Process entered
            00020000:00000001:0.0:1440610522.133300:0:409:0:(lov_lock.c:244:lov_lock_enqueue()) Process entered
            00020000:00000001:0.0:1440610522.133302:0:409:0:(lov_io.c:236:lov_sub_get()) Process entered
            00020000:00000001:0.0:1440610522.133303:0:409:0:(lov_io.c:247:lov_sub_get()) Process leaving (rc=18446612134369657600 : -131939339894016 : ffff88007a838300)
            00000020:00000001:0.0:1440610522.133306:0:409:0:(cl_lock.c:177:cl_lock_enqueue()) Process entered
            00000008:00000001:0.0:1440610522.133308:0:409:0:(osc_lock.c:933:osc_lock_enqueue()) Process entered
            00000008:00000001:0.0:1440610522.133310:0:409:0:(osc_lock.c:947:osc_lock_enqueue()) Process leaving via enqueue_base (rc=0 : 0 : 0x0)
            00000008:00000001:0.0:1440610522.133314:0:409:0:(osc_request.c:1988:osc_enqueue_base()) Process entered
            

            Only here does the test script start to acquire the lock on the file from mount_2.

            I think we'd better drop unused locks first and use drop_bl_callback() instead of drop_bl_callback_once(), to avoid such unnecessary interference.
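
            A sketch of what that suggestion could look like in the test, assuming the standard test-framework.sh helpers (the actual fix landed via http://review.whamcloud.com/16155):

                cancel_lru_locks osc    # drop unused client locks up front, so no stale
                                        # lock produces a blocking AST that eats the
                                        # injected failure early
                $LCTL set_param fail_err=71
                # drop_bl_callback arms the fail_loc without the one-shot flag, so it
                # stays in effect for the whole command even if a stray AST arrives first
                drop_bl_callback "echo -n \", world\" >> $DIR2/$tfile"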


            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Niu,
            Can you have a look at this issue?
            Thanks.
            Joe

            green Oleg Drokin added a comment -

            So in fact the error comes from the test itself:

                    stat $DIR2/$tfile >& /dev/null
                    $LCTL set_param fail_err=71
                    drop_bl_callback_once "echo -n \\\", world\\\" >> $DIR2/$tfile"
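
            For reference, drop_bl_callback_once in test-framework.sh wraps the command with a one-shot fail_loc; roughly (a sketch, not the verbatim helper):

                drop_bl_callback_once() {
                    local rc=0
                    # 0x305 is OBD_FAIL_LDLM_BL_CALLBACK_NET;
                    # 0x80000000 is the one-shot flag
                    do_facet client lctl set_param fail_loc=0x80000305
                    do_facet client "$@" || rc=$?
                    do_facet client lctl set_param fail_loc=0
                    return $rc
                }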
            
            green Oleg Drokin added a comment -

            So the client does answer with EPROTO from somewhere in ptlrpc without going into ldlm or osc:

            00000100:00100000:0.0:1440610522.125668:0:27431:0:(service.c:2076:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104
            00000100:00000200:0.0:1440610522.125671:0:27431:0:(service.c:2081:ptlrpc_server_handle_request()) got req 1510587166102920
            00010000:00000001:0.0:1440610522.125673:0:27431:0:(ldlm_lockd.c:2146:ldlm_callback_handler()) Process entered
            00010000:02000000:0.0:1440610522.125675:0:27431:0:(libcfs_fail.h:96:cfs_fail_check_set()) *** cfs_fail_loc=305, val=0***
            00000100:00000001:0.0:1440610522.125678:0:27431:0:(pack_generic.c:331:lustre_pack_reply_v2()) Process entered
            02000000:00000001:0.0:1440610522.125679:0:27431:0:(sec.c:2127:sptlrpc_svc_alloc_rs()) Process entered
            02000000:00000010:0.0:1440610522.125681:0:27431:0:(sec_null.c:329:null_alloc_rs()) kmalloced 'rs': 488 at ffff88007cabc000.
            02000000:00000001:0.0:1440610522.125683:0:27431:0:(sec.c:2163:sptlrpc_svc_alloc_rs()) Process leaving (rc=0 : 0 : 0)
            00000100:00000001:0.0:1440610522.125684:0:27431:0:(pack_generic.c:365:lustre_pack_reply_v2()) Process leaving (rc=0 : 0 : 0)
            00010000:00000001:0.0:1440610522.125687:0:27431:0:(ldlm_lib.c:2735:target_pack_pool_reply()) Process entered
            00010000:00000001:0.0:1440610522.125689:0:27431:0:(ldlm_lib.c:2754:target_pack_pool_reply()) Process leaving (rc=0 : 0 : 0)
            00000100:00001000:0.0:1440610522.125691:0:27431:0:(import.c:1627:at_measured()) add 1 to ffff88007b872150 time=20 v=1 (1 1 0 0)
            00000100:00000001:0.0:1440610522.125694:0:27431:0:(connection.c:130:ptlrpc_connection_addref()) Process entered
            00000100:00000040:0.0:1440610522.125696:0:27431:0:(connection.c:135:ptlrpc_connection_addref()) conn=ffff880058022ec0 refcount 14 to 10.1.6.94@tcp
            00000100:00000001:0.0:1440610522.125698:0:27431:0:(connection.c:137:ptlrpc_connection_addref()) Process leaving (rc=18446612133790756544 : -131939918795072 : ffff880058022ec0)
            02000000:00000001:0.0:1440610522.125701:0:27431:0:(sec.c:2176:sptlrpc_svc_wrap_reply()) Process entered
            02000000:00000001:0.0:1440610522.125702:0:27431:0:(sec.c:2187:sptlrpc_svc_wrap_reply()) Process leaving (rc=0 : 0 : 0)
            00000100:00000001:0.0:1440610522.125704:0:27431:0:(niobuf.c:57:ptl_send_buf()) Process entered
            00000100:00000040:0.0:1440610522.125706:0:27431:0:(niobuf.c:61:ptl_send_buf()) conn=ffff880058022ec0 id 12345-10.1.6.94@tcp
            00000100:00000200:0.0:1440610522.125710:0:27431:0:(niobuf.c:83:ptl_send_buf()) Sending 192 bytes to portal 16, xid 1510587166102920, offset 192
            00000100:00000001:0.0:1440610522.125721:0:27431:0:(niobuf.c:98:ptl_send_buf()) Process leaving (rc=0 : 0 : 0)
            00000100:00000001:0.0:1440610522.125723:0:27431:0:(connection.c:94:ptlrpc_connection_put()) Process entered
            00000100:00000040:0.0:1440610522.125725:0:27431:0:(connection.c:122:ptlrpc_connection_put()) PUT conn=ffff880058022ec0 refcount 13 to 10.1.6.94@tcp
            00000100:00000001:0.0:1440610522.125727:0:27431:0:(connection.c:124:ptlrpc_connection_put()) Process leaving (rc=0 : 0 : 0)
            00010000:00000001:0.0:1440610522.125729:0:27431:0:(ldlm_lockd.c:2174:ldlm_callback_handler()) Process leaving (rc=0 : 0 : 0)
            00000100:00000040:0.0:1440610522.125732:0:27431:0:(lustre_net.h:2448:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff88007bfd2980 x1510587166102920/t0(0) o104->LOV_OSC_UUID@10.1.6.94@tcp:313/0 lens 296/192 e 0 to 0 dl 1440610528 ref 1 fl Interpret:/0/0 rc -71/-71
            00000100:00100000:0.0:1440610522.125740:0:27431:0:(service.c:2126:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:28342:x1510587166102920:12345-10.1.6.94@tcp:104 Request procesed in 75us (162us total) trans 0 rc -71/-71
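
            The cfs_fail_loc=305 line above is that injection firing inside ldlm_callback_handler(), and the rc -71/-71 is the configured fail_err being sent back as the AST reply. Note that the value drop_bl_callback_once sets carries the one-shot bit, so the first blocking AST to arrive consumes it; an illustration of the bit layout (using the libcfs CFS_FAIL_ONCE constant):

                # fail_loc used by drop_bl_callback_once:
                #   0x80000305 = 0x80000000 (CFS_FAIL_ONCE, auto-clears after one hit)
                #              | 0x00000305 (OBD_FAIL_LDLM_BL_CALLBACK_NET)
                printf '0x%08x\n' $(( 0x80000000 | 0x305 ))   # prints 0x80000305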
            
            jamesanunez James Nunez (Inactive) added a comment - Another on master (pre-2.8): 2015-08-26 17:30:19 - https://testing.hpdd.intel.com/test_sets/ef954b8a-4c26-11e5-b537-5254006e85c2
            di.wang Di Wang added a comment - Another failure https://testing.hpdd.intel.com/test_sets/7ddcc5de-3e6d-11e5-870e-5254006e85c2
            jamesanunez James Nunez (Inactive) added a comment - Another instance on master: https://testing.hpdd.intel.com/test_sets/28f2641a-0e71-11e5-828a-5254006e85c2

            People

              niu Niu Yawei (Inactive)
              yujian Jian Yu
              Votes: 0
              Watchers: 13