Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0
    • Severity: 3
    • Component: HSM
    • 12689

    Description

      Issuing too many HSM requests (120 or more, it seems) leads to LNet errors. The corresponding requests, as well as subsequent ones, are not delivered to the copytool.

      LNetError: 7307:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-0@lo, match 1460297122524484 length 6776 too big: 7600 left, 6144 allowed

      This can easily be reproduced by running lfs hsm_archive * against a hundred or so files.

      Attachments

        Activity

          [LU-4639] HSM requests not delivered
          jamesanunez James Nunez (Inactive) added a comment - edited

          I tested John's fix on master and, as expected, it reduced the number of files I was allowed to request for archive. If I try to archive, say, 200 files, I get an "Argument list too long" (E2BIG) error as expected. Now the bulk archive request with the reduced number of files succeeds and does not hang the client.

          Patch for master at http://review.whamcloud.com/#/c/9393/
          Patch for b2_5 at http://review.whamcloud.com/#/c/9422/


          jamesanunez James Nunez (Inactive) added a comment -

          I tried the same thing on the latest build of b2_5, build #33, and got different error messages. The results seem the same; I can't archive any files from this client and the bulk archive command never completes, but I can write from this client to the file system.

          # ls u1* | wc
              111     111     543
          # lfs hsm_archive u1*
          # lfs hsm_state u1*
          u1: (0x00000009) exists archived, archive_id:1
          u10: (0x00000009) exists archived, archive_id:1
          u100: (0x00000009) exists archived, archive_id:1
          ...
          

          There are no errors on the client, agent, or MDS. So, with 111 files, archiving all of them worked.

          Then I tried 125 files:

          # ls a* | wc
              125     125     517
          # lfs hsm_archive a*
          # lfs hsm_state a*
          a1: (0x00000000)
          a10: (0x00000000)
          a100: (0x00000000)
          ...
          

          So, it looks like archiving these 125 files did not work.

          The difference between the b2_5 and the master results, posted previously in this ticket, is in the errors in the logs. After the failed bulk archive attempt, there are no errors on the client or the agent. The errors on the MDS differ from those on master in that they complain about not finding an agent for archive 3:

          Lustre: 13480:0:(mdd_device.c:1351:mdd_changelog_user_purge()) Could not determine changelog records to purge; rc=0
          Lustre: 13480:0:(mdd_device.c:1351:mdd_changelog_user_purge()) Skipped 14 previous similar messages
          LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) scratch-MDT0000: Cannot find agent for archive 3: rc = -11
          LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) Skipped 2 previous similar messages
          LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) scratch-MDT0000: Cannot find agent for archive 3: rc = -11
          

          I have a single archive with index 1. On the agent:

          # ps -ef | grep hsm | grep -v grep
          root     11891     1  0 13:29 ?        00:00:00 lhsmtool_posix --daemon --hsm_root /lustre/archive --archive 1 /lustre/scratch
          

          jay Jinshan Xiong (Inactive) added a comment -

          Good point John.

          jhammond John Hammond added a comment -

          Jinshan, we limit on the client side in ll_dir_ioctl(). There we require that hur_len(hur) < MDS_MAXREQSIZE (5K). Since hsm_user_item is 32 bytes, this means hur_count < 160. Then the MDT tries to send the same number of 72-byte hsm_action_items to the copytool. So with 120 items we have a KUC buffer of more than 72 * 120 = 8640 bytes. It looks like a cheap fix would be to use MDS_MAXREQSIZE / 4 in ll_dir_ioctl().
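
          To make the arithmetic above concrete, here is a small standalone sketch. The sizes are the ones quoted in this comment and the "6144 allowed" figure from the LNet errors, not values pulled from the Lustre headers, so treat them as assumptions:

          /*
           * Worked numbers from John's analysis above; constants are taken
           * from this ticket, not from the Lustre source.
           */
          #include <stdio.h>

          #define MDS_MAXREQSIZE        (5 * 1024)  /* 5K client-side request cap         */
          #define HSM_USER_ITEM_SIZE    32          /* sizeof(struct hsm_user_item)       */
          #define HSM_ACTION_ITEM_SIZE  72          /* per-item size sent to the copytool */
          #define KUC_ALLOWED           6144        /* "6144 allowed" in the LNet errors  */

          int main(void)
          {
                  int max_items_now   = MDS_MAXREQSIZE / HSM_USER_ITEM_SIZE;       /* 160  */
                  int items           = 120;
                  int kuc_bytes       = items * HSM_ACTION_ITEM_SIZE;              /* 8640 */
                  int max_items_fixed = (MDS_MAXREQSIZE / 4) / HSM_USER_ITEM_SIZE; /* 40   */

                  printf("client currently accepts up to ~%d items per request\n",
                         max_items_now);
                  printf("%d items expand to a %d-byte KUC buffer vs %d bytes allowed\n",
                         items, kuc_bytes, KUC_ALLOWED);
                  printf("with MDS_MAXREQSIZE / 4 the client would cap at ~%d items\n",
                         max_items_fixed);
                  return 0;
          }

          So a request that passes the current client check (under 160 items) can still expand past the receive buffer once each item grows from 32 to 72 bytes on the MDT side.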


          jamesanunez James Nunez (Inactive) added a comment -

          I tried to see if any of the files I tried to archive were actually archived or if I could archive a new file.

          For all files that I tried to archive in the bulk archive attempt above, I see:

          # lfs hsm_state u1*
          u1: (0x00000001) exists, archive_id:1
          u10: (0x00000001) exists, archive_id:1
          u100: (0x00000001) exists, archive_id:1
          ...
          

          I can't archive any new files and there is no error message in the logs or on the console for this new archive request. Maybe the request is queued up but can't be executed because of the previous bulk request:

          # lfs hsm_state u2
          u2: (0x00000000)
          # lfs hsm_archive --archive 1 u2
          # lfs hsm_state u2
          u2: (0x00000000)
          

          I rebooted the client node and tried the above again, but nothing changed. It looks like, as Oleg said, the client is stuck.

          I still see the following error message printed periodically in the agent logs:

          LNetError: 12270:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.2.109@o2ib, match 1460420866039584 length 8288 too big: 7896 left, 6144 allowed
          LNetError: 12270:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 85 previous similar messages
          

          jamesanunez James Nunez (Inactive) added a comment -

          Yes, I'm able to recreate this problem by trying to archive ~111 files at once. There are several of the following errors on the agent:

          LNetError: 12269:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.2.109@o2ib, match 1460420866039584 length 8288 too big: 7896 left, 6144 allowed
          

          On the MDS, I see the following in dmesg:

          Lustre: scratch-MDD0000: changelog on
          LustreError: 21923:0:(mdt_coordinator.c:950:mdt_hsm_cdt_start()) scratch-MDT0000: Coordinator already started
          LustreError: 21923:0:(obd_config.c:1350:class_process_proc_param()) mdt.: error writing proc entry 'hsm_control': rc = -114
          Lustre: 1963:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1392918571/real 1392918571]  req@ffff8806bd2bd400 x1460420865979988/t0(0) o8->scratch-OST0000-osc-MDT0000@192.168.2.111@o2ib:28/4 lens 400/544 e 0 to 1 dl 1392918596 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          Lustre: 1963:0:(client.c:1907:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
          LNet: 1959:0:(o2iblnd_cb.c:381:kiblnd_handle_rx()) PUT_NACK from 192.168.2.110@o2ib
          Lustre: 21913:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1392931034/real 1392931034]  req@ffff8806bd181800 x1460420866039584/t0(0) o107->scratch-MDT0000@192.168.2.110@o2ib:15/16 lens 8288/224 e 0 to 1 dl 1392931041 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
          Lustre: 21913:0:(client.c:1907:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
          LNet: 1958:0:(o2iblnd_cb.c:381:kiblnd_handle_rx()) PUT_NACK from 192.168.2.110@o2ib
          

          jay Jinshan Xiong (Inactive) added a comment -

          It looks like we've already limited the maximum size of HSM requests to MDS_MAXREQSIZE, which is 5K bytes. Why does it add up to 6776 bytes in the end?

          green Oleg Drokin added a comment -

          The packet drops hang everything because the client will keep retrying indefinitely, as the server does not provide any indication that it just dropped the request.
          So the client is stuck trying to send this RPC with who knows what locks held, blocking other RPCs from being sent.
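
          As a toy model of the failure mode Oleg describes (this is not ptlrpc or LNet code; the message and buffer sizes are simply the ones in the logs above): the receiver drops an oversized message with no NACK, so a sender that only reacts to timeouts resends the same RPC over and over.

          #include <stdbool.h>
          #include <stdio.h>

          #define RECV_ALLOWED 6144   /* "6144 allowed" from the LNet errors          */
          #define DEMO_RETRIES 5      /* artificial bound so this demo terminates;
                                         the real client has no such bound            */

          /* Receive path: anything over the buffer limit is silently dropped. */
          static bool deliver(int msg_len)
          {
                  return msg_len <= RECV_ALLOWED;
          }

          int main(void)
          {
                  int msg_len = 8288;  /* request length seen in the MDS logs */

                  for (int attempt = 1; attempt <= DEMO_RETRIES; attempt++) {
                          if (deliver(msg_len)) {
                                  printf("delivered on attempt %d\n", attempt);
                                  return 0;
                          }
                          printf("attempt %d: timed out, resending (no drop notice)\n",
                                 attempt);
                  }
                  /* Meanwhile any locks held and RPCs queued behind this one stay blocked. */
                  printf("still undelivered after %d attempts; client remains stuck\n",
                         DEMO_RETRIES);
                  return 1;
          }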


          adegremont Aurelien Degremont (Inactive) added a comment -

          Actually, this was not known during the HSM design; it was found out later. The LNet message size and the KUC message size were limitations for big requests. I think KUC was patched to detect this; nothing was done for LNet.
          Anyway, I do not really see why this packet drop would hang everything. We need to work this out.


          ashehata Amir Shehata (Inactive) added a comment -

          I wasn't involved in the HSM design. But if the limitation was known, what is the expected behavior when this scenario is hit? From the HSM feature perspective, is the message dropped silently at the ptl layer, and is that what's causing the problem?


          adegremont Aurelien Degremont (Inactive) added a comment -

          This limitation was known, but the resulting stale state was not.

          It is a little bit strange that the upper layer knows the lower layers' constraints.


          People

            Assignee: jamesanunez James Nunez (Inactive)
            Reporter: hdoreau Henri Doreau (Inactive)
            Votes: 0
            Watchers: 11
