Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0
    • Severity: 3
    • Component: HSM
    • 12689

    Description

      Issuing too many HSM requests at once (120 or more, it seems) leads to LNet errors. The corresponding requests, as well as subsequent ones, are not delivered to the copytool.

      LNetError: 7307:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-0@lo, match 1460297122524484 length 6776 too big: 7600 left, 6144 allowed

      This can easily be reproduced by running lfs hsm_archive * against a hundred or so files.

      Attachments

        Activity

          [LU-4639] HSM requests not delivered

          jamesanunez James Nunez (Inactive) added a comment -

          Yes, I'm able to recreate this problem trying to archive ~111 files at once. There are several of the following errors on the agent:

          LNetError: 12269:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.2.109@o2ib, match 1460420866039584 length 8288 too big: 7896 left, 6144 allowed
          

          On the MDS, I see the following in dmesg:

          Lustre: scratch-MDD0000: changelog on
          LustreError: 21923:0:(mdt_coordinator.c:950:mdt_hsm_cdt_start()) scratch-MDT0000: Coordinator already started
          LustreError: 21923:0:(obd_config.c:1350:class_process_proc_param()) mdt.: error writing proc entry 'hsm_control': rc = -114
          Lustre: 1963:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1392918571/real 1392918571]  req@ffff8806bd2bd400 x1460420865979988/t0(0) o8->scratch-OST0000-osc-MDT0000@192.168.2.111@o2ib:28/4 lens 400/544 e 0 to 1 dl 1392918596 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          Lustre: 1963:0:(client.c:1907:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
          LNet: 1959:0:(o2iblnd_cb.c:381:kiblnd_handle_rx()) PUT_NACK from 192.168.2.110@o2ib
          Lustre: 21913:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1392931034/real 1392931034]  req@ffff8806bd181800 x1460420866039584/t0(0) o107->scratch-MDT0000@192.168.2.110@o2ib:15/16 lens 8288/224 e 0 to 1 dl 1392931041 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
          Lustre: 21913:0:(client.c:1907:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
          LNet: 1958:0:(o2iblnd_cb.c:381:kiblnd_handle_rx()) PUT_NACK from 192.168.2.110@o2ib
          

          jay Jinshan Xiong (Inactive) added a comment -

          It looks like we've already limited the maximum size of HSM requests to MDS_MAXREQSIZE, which is 5K bytes. Why does it add up to 6776 bytes at the end?

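          For illustration, a minimal C sketch of the arithmetic behind the question above. The header and per-item sizes below are assumptions, not the real sizes of hsm_action_list/hsm_action_item; the point is only that the request appears to pack one action item per file, so its total size grows linearly with the file count and crosses the 6144-byte receive buffer reported by LNet at roughly a hundred files.

          /*
           * Illustrative only: HAL_HEADER_BYTES and HAI_ITEM_BYTES are made-up
           * values, not the real Lustre structure sizes.  RECV_BUFFER_BYTES is
           * the "allowed" size from the LNet error above.
           */
          #include <stdio.h>

          #define HAL_HEADER_BYTES    128   /* assumed request header size */
          #define HAI_ITEM_BYTES       72   /* assumed per-file action item size */
          #define RECV_BUFFER_BYTES  6144   /* "allowed" size reported by LNet */

          int main(void)
          {
              for (int nfiles = 20; nfiles <= 140; nfiles += 20) {
                  size_t total = HAL_HEADER_BYTES + (size_t)nfiles * HAI_ITEM_BYTES;
                  printf("%3d files -> %5zu bytes (%s)\n", nfiles, total,
                         total > RECV_BUFFER_BYTES ? "too big, dropped" : "fits");
              }
              return 0;
          }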
          green Oleg Drokin added a comment -

          The packet drops hang everything because the client will keep retrying indefinitely, as the server will not provide any indication that it just dropped the request.
          So the client is stuck trying to resend this RPC, with who knows what locks held, blocking other RPCs from being sent.

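          A minimal, self-contained C sketch (not Lustre code) of the failure mode Oleg describes: the receiver silently drops the oversized message, and a sender whose only recovery is resend-on-timeout never makes progress. The function and constant names are hypothetical.

          #include <stdbool.h>
          #include <stdio.h>

          #define BUFFER_LIMIT 6144   /* receive buffer size, as in the log above */

          /* Hypothetical receiver: silently drops anything too big, never NACKs. */
          static bool receiver_accepts(size_t msg_len)
          {
              return msg_len <= BUFFER_LIMIT;
          }

          static void send_request(size_t msg_len, int max_tries)
          {
              for (int attempt = 1; attempt <= max_tries; attempt++) {
                  if (receiver_accepts(msg_len)) {
                      printf("attempt %d: delivered\n", attempt);
                      return;
                  }
                  /* Nothing comes back, so the sender just times out and resends. */
                  printf("attempt %d: timed out, resending\n", attempt);
              }
              printf("still undelivered after %d attempts; the request is stuck\n",
                     max_tries);
          }

          int main(void)
          {
              send_request(8288, 5);   /* 8288 bytes, as in the o107 request above */
              return 0;
          }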

          adegremont Aurelien Degremont (Inactive) added a comment -

          Actually, this was not known during the HSM design; it was found out later. The LNet message size and the KUC message size were both limitations for big requests. I think KUC was patched to detect this; nothing was done for LNet.
          Anyway, I do not really see why this packet drop would hang everything. We need to work this out.


          ashehata Amir Shehata (Inactive) added a comment -

          I wasn't involved in the HSM design. But if the limitation was known, what is the expected behavior when this scenario is hit? From the HSM feature perspective, is the message dropped silently at the ptl layer, and is that what's causing the problem?


          adegremont Aurelien Degremont (Inactive) added a comment -

          This limitation was known, but the resulting stale state was not.

          It is a little bit strange that an upper layer has to know a lower layer's constraints.


          ashehata Amir Shehata (Inactive) added a comment -

          As far as I understand, this error message is in lnet_try_match_md(). This is when a message is received and LNet tries to match a memory descriptor, which was set up to receive these messages. The MD is of a specific size. There are flags that can be set to define what should be done if the received message is larger than the memory descriptor: it can be truncated or dropped. This particular MD seems to be set up to drop messages.

          I believe either HSM needs to be aware of this limitation and make sure to send smaller messages, or the size of the MD needs to be increased.

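          A minimal C sketch of the decision described above, using made-up structure and field names rather than the real LNet ones: when the incoming length exceeds the space left in the matched MD, a per-MD truncate option decides between accepting a truncated message and dropping it.

          #include <stdbool.h>
          #include <stdio.h>

          struct fake_md {
              size_t space_left;      /* bytes still available in the buffer */
              bool   allow_truncate;  /* truncate oversize messages instead of dropping */
          };

          /* Returns true if accepted; *accepted_len is how many bytes are kept. */
          static bool try_match(const struct fake_md *md, size_t msg_len,
                                size_t *accepted_len)
          {
              if (msg_len <= md->space_left) {
                  *accepted_len = msg_len;
                  return true;
              }
              if (md->allow_truncate) {
                  *accepted_len = md->space_left;  /* keep what fits, drop the rest */
                  return true;
              }
              *accepted_len = 0;                   /* this MD drops oversize messages */
              return false;
          }

          int main(void)
          {
              struct fake_md md = { .space_left = 6144, .allow_truncate = false };
              size_t kept;

              if (!try_match(&md, 8288, &kept))
                  printf("length 8288 too big: %zu allowed -> dropped\n",
                         md.space_left);
              return 0;
          }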
          pjones Peter Jones added a comment -

          James

          Please could you set up this reproducer?

          Thanks

          Peter


          People

            Assignee: jamesanunez James Nunez (Inactive)
            Reporter: hdoreau Henri Doreau (Inactive)
            Votes: 0
            Watchers: 11
