  Lustre / LU-4206

Sanity test_120e fails with 1 blocking RPC occured.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.6.0
    • Affects Version: Lustre 2.5.0
    • None
    • 3
    • 11434

    Description

      Sanity test_120e seems to be failing intermittently on some test runs.
      The following is seen on the MDT:

      Lustre: lustre-MDT0000: Not available for connect from 10.10.16.108@tcp (stopping)
      Lustre: lustre-MDT0000: Not available for connect from 10.10.16.108@tcp (stopping)
      Lustre: Skipped 5 previous similar messages
      Lustre: lustre-MDT0000: Not available for connect from 10.10.16.109@tcp (stopping)
      LustreError: 2966:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8800686f7c00 x1450597855266416/t0(0) o13->lustre-OST0002-osc-MDT0000@10.10.16.108@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      LustreError: 2966:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
      Lustre: lustre-MDT0000: Not available for connect from 10.10.16.110@tcp (stopping)
      Lustre: 10030:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1383398053/real 1383398053] req@ffff880068291c00 x1450597855266444/t0(0) o251->MGC10.10.16.107@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1383398059 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: server umount lustre-MDT0000 complete

      It might be that the MDT is being unmounted while clients are still communicating with it.

      On the client:
      LustreError: 11-0: lustre-MDT0000-mdc-ffff880037d5a400: Communicating with 10.10.16.107@tcp, operation obd_ping failed with -107.
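
      For context, the "blocking RPC occured." failure comes from sampling an ldlm
      blocking-callback counter before and after the operation and failing if it moved.
      A simplified sketch of that style of check (counter names assume the standard ldlm
      service stats, and $DIR/$tdir are the usual test-framework variables; this is not
      the verbatim sanity.sh code):

      # Sample the blocking-callback counter around the operation under test.
      blk1=$(lctl get_param -n ldlm.services.ldlm_cbd.stats |
              awk '/ldlm_bl_callback/ {print $2}')
      rm -f $DIR/$tdir/f1                     # the operation under test
      blk2=$(lctl get_param -n ldlm.services.ldlm_cbd.stats |
              awk '/ldlm_bl_callback/ {print $2}')
      # Any increase means a blocking AST was sent while the test ran.
      [ "${blk2:-0}" -eq "${blk1:-0}" ] ||
              echo "error: $(( ${blk2:-0} - ${blk1:-0} )) blocking RPC occured."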


          Activity

            yujian Jian Yu added a comment -
            The failure also occurred on the Lustre b2_5 branch: https://testing.hpdd.intel.com/test_sets/720ecb84-5e77-11e4-bd01-5254006e85c2

            jlevi Jodi Levi (Inactive) added a comment -
            Patch landed to Master.

            di.wang Di Wang (Inactive) added a comment -
            Andreas: I just talked to Jinshan; there is no proper interface for us to implement this (cancelling OST locks directly from llite). According to Jinshan, a proper interface may become available after the CLIO cleanup project, so perhaps we should fix this in the test script for now as a temporary measure?
            di.wang Di Wang (Inactive) added a comment -
            http://review.whamcloud.com/10250

            adilger Andreas Dilger added a comment -
            The correct solution is to have the client cancel the OST locks for the file's objects if it thinks this is the last unlink of the file and the file is not open on the client. That should avoid the blocking RPC from the OST, and also allow the page cleanup and OST RPCs to overlap with the MDT's processing of the unlink.
            jhammond John Hammond added a comment -
            Yes, from osc_destroy(), but this is not until after md_rename() has returned.

            di.wang Di Wang (Inactive) added a comment -
            As I understand it, we actually do ELC on the OSC as well. But after 2.4 it is the MDT that sends the destroy request to the OST; the OST then revokes the lock covering the client's cache and destroys the object. So the client has no chance to do ELC here, and we may see some blocking RPCs. IIRC, test_120 is only supposed to check ELC for the metadata object, so we probably need to fix the test script here.
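
            If the test script is relaxed rather than the client changed, one possible shape of
            such a tolerance is sketched below (this is only an illustration, not the actual
            change in http://review.whamcloud.com/10250; counter names assume the standard ldlm
            service stats and $DIR/$tdir come from the test framework):

            # Hypothetical helper: read the blocking-callback counter, defaulting to 0.
            get_blk_count() {
                    local cnt
                    cnt=$(lctl get_param -n ldlm.services.ldlm_cbd.stats 2>/dev/null |
                            awk '/ldlm_bl_callback/ {print $2}')
                    echo ${cnt:-0}
            }

            blk1=$(get_blk_count)
            mv -f $DIR/$tdir/f1 $DIR/$tdir/f2   # rename onto an existing victim
            blk2=$(get_blk_count)

            # Tolerate the single OST-side callback triggered by destroying the victim's
            # objects, but still flag anything beyond that.
            [ $((blk2 - blk1)) -le 1 ] ||
                    echo "error: $((blk2 - blk1)) blocking RPCs occured."
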
            jhammond John Hammond added a comment -

            I see this occasionally from autotest and also when running locally. IIUC, when I see it, the blocking callback is from the OST (while handling OST_DESTROY) for the rename-onto victim. AFAICT we don't do ELC for the objects of rename victims. Is that correct?

            00000100:00100000:0.0:1399475307.568943:0:21604:0:service.c:2090:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_004:lustre-MDT0000-mdtlov_UUID+4:11360:x1467451757057912:12345-0@lo:6
            ...
            00010000:00010000:0.0:1399475307.569026:0:21604:0:(ldlm_lock.c:715:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 2/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x40000000000000 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 0 lvb_type: 0
            ...
            00010000:00010000:0.0:1399475307.569074:0:21604:0:(ldlm_lockd.c:848:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 3/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x50000000010020 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 0 lvb_type: 0
            ...
            00010000:00010000:0.0:1399475307.569081:0:21604:0:(ldlm_lockd.c:459:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 150, AT: on) ns: filter-lustre-OST0000_UUID lock: ffff88016dacb158/0x2e4c6046c3721790 lrc: 4/0,0 mode: PR/PR res: [0x212:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x70000000010020 nid: 0@lo remote: 0x2e4c6046c3721789 expref: 4 pid: 13418 timeout: 4451652756 lvb_type: 0
            ...
            00000100:00000040:0.0:1399475307.569088:0:21604:0:(lustre_net.h:3296:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff88012c7032f0 x1467451757057916/t0(0) o104->lustre-OST0000@0@lo:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
            ...
            00000100:00100000:0.0:1399475307.569097:0:21604:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_ost00_004:lustre-OST0000_UUID:21604:1467451757057916:0@lo:104
            
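            A trace like the one above can be captured around a local run with the standard lctl
            debug commands (the way of invoking the single test is only illustrative):

            lctl set_param debug=+dlmtrace       # enable DLM tracing
            lctl clear                           # start with an empty debug buffer
            ONLY=120e bash sanity.sh             # illustrative: run just the failing test
            lctl dk /tmp/lustre-debug.log        # dump the kernel debug buffer to a file
            grep -E 'ldlm_server_blocking_ast|ldlm_add_waiting_lock' /tmp/lustre-debug.log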

            adilger Andreas Dilger added a comment -
            Only found 2 failures in the past few weeks.

            People

              Assignee: di.wang Di Wang (Inactive)
              Reporter: ashehata Amir Shehata (Inactive)
              Votes: 0
              Watchers: 8
