[LU-5371] Failure on test suite parallel-scale test_simul

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.6.0
    • Labels: None
    • Environment: client and server: lustre-b2_6-rc2 RHEL6 ldiskfs
    • Severity: 3
    • 14975

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c8156066-0dc7-11e4-af8b-5254006e85c2.

      The sub-test test_simul failed with the following error:

      simul failed! 1

      test log:

      22:35:15: FAILED in simul_rmdir: too many operations succeeded (2).
      


          Activity

            sarah Sarah Liu added a comment - edited

            Hit this error in an interop test between a 2.6.0 client and a master server; the patch needs to be back-ported to b2_6 to fix the error.

            https://testing.hpdd.intel.com/test_sets/336e7e88-5558-11e4-a570-5254006e85c2


            jlevi Jodi Levi (Inactive) added a comment -

            Patch landed to Master.

            di.wang Di Wang added a comment -

            http://review.whamcloud.com/11170

            Sorry, this was introduced by
            commit 3ea78dd02a57211ae9b55111323d14cfbc71bc43
            Author: Wang Di <di.wang@intel.com>
            Date: Wed Jun 25 22:35:52 2014 -0700

            LU-4921 lmv: try all stripes for unknown hash functions

            For unknown hash type, LMV should try all stripes to locate
            the name entry. But it will only for lookup and unlink, i.e.
            we can only list and unlink entries under striped dir with
            unknown hash type.

            Signed-off-by: wang di <di.wang@intel.com>
            Change-Id: Ifeed7131c24e48277a6cc8fd4c09b7534e31079f
            Reviewed-on: http://review.whamcloud.com/10041
            Tested-by: Jenkins
            Reviewed-by: John L. Hammond <john.hammond@intel.com>
            Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
            Tested-by: Maloo <hpdd-maloo@intel.com>
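
            To make the behavior that commit describes concrete: with an unknown hash type the client cannot compute which stripe owns a name, so it has to probe every stripe. The following is a hedged sketch of that loop only; struct lmv_stripe and stripe_lookup() are hypothetical stand-ins, not the real lmv API.

                #include <errno.h>

                struct lmv_stripe { int idx; };         /* hypothetical placeholder */

                /* Hypothetical per-stripe lookup, stubbed so the sketch compiles:
                 * returns 0 if this stripe holds the name, -ENOENT otherwise. */
                static int stripe_lookup(struct lmv_stripe *s, const char *name)
                {
                        (void)s; (void)name;
                        return -ENOENT;
                }

                /* For an unknown hash type, try each stripe in turn and stop at
                 * the first one that does not answer -ENOENT; return -ENOENT
                 * only if no stripe knows the entry. */
                static int lookup_unknown_hash(struct lmv_stripe *stripes, int count,
                                               const char *name)
                {
                        int i, rc = -ENOENT;

                        for (i = 0; i < count; i++) {
                                rc = stripe_lookup(&stripes[i], name);
                                if (rc != -ENOENT)
                                        break;  /* found it, or hit a real error */
                        }
                        return rc;
                }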

            di.wang Di Wang added a comment -

            Hmm, I checked the log; it seems the failures started to occur around July 3rd (build 2548).

            It seems the simul test tries to rmdir a single directory from multiple clients (multiple threads), and only one thread should succeed. But two threads succeeded in the test, which is why it fails.

            04:50:57: FAILED in simul_rmdir: too many operations succeeded (2).
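
            For reference, here is roughly what this sub-test does. This is a minimal sketch, not the actual simul source (the directory name is taken from the client logs below): every MPI rank races to rmdir() the same directory, and the test asserts that exactly one rank succeeds while the rest fail with ENOENT.

                #include <errno.h>
                #include <stdio.h>
                #include <sys/stat.h>
                #include <sys/types.h>
                #include <unistd.h>
                #include <mpi.h>

                int main(int argc, char **argv)
                {
                        int rank, ok, total_ok;

                        MPI_Init(&argc, &argv);
                        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

                        if (rank == 0)          /* one rank creates the victim directory */
                                mkdir("simul_rmdir.0", 0755);
                        MPI_Barrier(MPI_COMM_WORLD);

                        /* All ranks race to remove the same directory; POSIX allows
                         * exactly one rmdir() to succeed, the rest must see ENOENT. */
                        ok = (rmdir("simul_rmdir.0") == 0);

                        MPI_Reduce(&ok, &total_ok, 1, MPI_INT, MPI_SUM, 0,
                                   MPI_COMM_WORLD);
                        if (rank == 0 && total_ok != 1)
                                fprintf(stderr, "FAILED in simul_rmdir: "
                                        "too many operations succeeded (%d).\n",
                                        total_ok);

                        MPI_Finalize();
                        return 0;
                }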
            

            Hmm, I checked the debug logs on both clients and on the MDT side.

            Client1 (succeeds for rmdir)

            00000080:00200000:1.0:1405575315.343857:0:29632:0:(namei.c:1137:ll_rmdir_generic()) VFS Op:name=simul_rmdir.0, dir=[0x2000013a8:0x143db:0x0](ffff880060e42138)
            00010000:00010000:1.0:1405575315.343863:0:29632:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 2/0,0 mode: PR/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x8400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
            00010000:00010000:1.0:1405575315.343874:0:29632:0:(ldlm_request.c:1172:ldlm_cancel_pack()) ### packing ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 1/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
            00010000:00010000:1.0:1405575315.343877:0:29632:0:(ldlm_request.c:1176:ldlm_cancel_pack()) 1 locks packed
            00010000:00010000:1.0:1405575315.343877:0:29632:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-MDT0000-mdc-ffff88005dc3e000 lock: ffff88001f3b3500/0x4c7dc53d5dce65cc lrc: 0/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c04 expref: -99 pid: 29632 timeout: 0 lvb_type: 0
            00000100:00100000:1.0:1405575315.343885:0:29632:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc simul:49ab2d2b-0df9-c120-4bb6-1f0147383ab7:29632:1473843428301100:10.2.4.203@tcp:36
            00000100:00100000:1.0:1405575315.343909:0:29632:0:(client.c:2146:ptlrpc_set_wait()) set ffff88001f1b0440 going to sleep for 43 seconds
            00000100:00100000:1.0:1405575315.345000:0:29632:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc simul:49ab2d2b-0df9-c120-4bb6-1f0147383ab7:29632:1473843428301100:10.2.4.203@tcp:36
            00000080:00200000:1.0:1405575315.345007:0:29632:0:(llite_lib.c:1425:ll_clear_inode()) VFS Op:inode=[0x2000013a7:0x1c43f:0x0](ffff8800616f1178)
            

            Client2 (also succeeds with rmdir)

            00000080:00200000:0.0:1405575315.345956:0:18310:0:(namei.c:1137:ll_rmdir_generic()) VFS Op:name=simul_rmdir.0, dir=[0x2000013a8:0x143db:0x0](ffff88005d0bebb8)
            00010000:00010000:0.0:1405575315.345961:0:18310:0:(ldlm_request.c:1113:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 2/0,0 mode: PR/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x8400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
            00010000:00010000:0.0:1405575315.345970:0:18310:0:(ldlm_request.c:1172:ldlm_cancel_pack()) ### packing ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 1/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
            00010000:00010000:0.0:1405575315.345974:0:18310:0:(ldlm_request.c:1176:ldlm_cancel_pack()) 1 locks packed
            00010000:00010000:0.0:1405575315.345975:0:18310:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-MDT0000-mdc-ffff88007dbfd000 lock: ffff88002e7eed40/0x366918de5153be5b lrc: 0/0,0 mode: --/PR res: [0x2000013a7:0x1c43f:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x4809400000000 nid: local remote: 0xce57c426058b2c20 expref: -99 pid: 18310 timeout: 0 lvb_type: 0
            00000100:00100000:0.0:1405575315.345981:0:18310:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc simul:992018ad-6c7b-3f3b-15cf-42922a66aa9a:18310:1473838828202428:10.2.4.203@tcp:36
            00000100:00100000:0.0:1405575315.345989:0:18310:0:(client.c:2146:ptlrpc_set_wait()) set ffff88002c1cf200 going to sleep for 43 seconds
            00000100:00100000:0.0:1405575315.347314:0:18310:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc simul:992018ad-6c7b-3f3b-15cf-42922a66aa9a:18310:1473838828202428:10.2.4.203@tcp:36
            00000080:00200000:0.0:1405575315.347322:0:18310:0:(llite_lib.c:1425:ll_clear_inode()) VFS Op:inode=[0x2000013a7:0x1c43f:0x0](ffff88003ab0c1b8)
            

            It is interesting that when the MDT handled the request from client2, it already returned ENOENT, but it seems client2 ignored that:

            00000100:00100000:1.0:1405575315.353820:0:25427:0:(service.c:2144:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_003:992018ad-6c7b-3f3b-15cf-42922a66aa9a+675:18310:x1473838828202428:12345-10.2.4.205@tcp:36 Request procesed in 988us (1036us total) trans 0 rc -2/-2
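
            Assuming the problem is on the client side, one hedged guess at the failure mode (purely illustrative; mdc_send_unlink() and ll_rmdir_sketch() are made-up names, not the real llite/mdc paths): if the client treats -ENOENT from the MDT as "the entry is already gone, so report success", then a second racing client's rmdir also returns 0 and simul counts two successes.

                #include <errno.h>
                #include <stdio.h>

                /* Hypothetical stand-in for the unlink RPC; it replays what the MDT
                 * log above shows for client2: the server answered rc = -ENOENT (-2). */
                static int mdc_send_unlink(const char *name)
                {
                        (void)name;
                        return -ENOENT;
                }

                static int ll_rmdir_sketch(const char *name)
                {
                        int rc = mdc_send_unlink(name);

                        if (rc == -ENOENT)
                                rc = 0;         /* BUG (hypothetical): masking the
                                                 * error makes the racing loser look
                                                 * like a second winner */
                        return rc;
                }

                int main(void)
                {
                        /* Prints "rmdir rc = 0" even though the MDT said -ENOENT. */
                        printf("rmdir rc = %d\n", ll_rmdir_sketch("simul_rmdir.0"));
                        return 0;
                }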
            

            Sigh, there are not enough debug logs to tell me what happened there; I will keep digging.

            sarah Sarah Liu added a comment -

            Also hit this issue in an interop test between a 2.5.2 server and b2_6-rc2 clients

            https://testing.hpdd.intel.com/test_sets/3c3c2f74-0d82-11e4-b3f5-5254006e85c2


            jlevi Jodi Levi (Inactive) added a comment -

            Di,
            Could you please look into and comment on this one?
            Thank you!


            People

              Assignee: di.wang Di Wang
              Reporter: maloo Maloo