Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0
    • Severity: 3
    • Components: HSM
    • Rank: 12689

    Description

      Issuing too many HSM requests at once (120 or more, it seems) leads to LNet errors. The corresponding requests, as well as subsequent ones, are not delivered to the copytool.

      LNetError: 7307:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-0@lo, match 1460297122524484 length 6776 too big: 7600 left, 6144 allowed

      This can be easily reproduced by running lfs hsm_archive * against roughly a hundred files.

      Attachments

        Activity

          [LU-4639] HSM requests not delivered
          rread Robert Read added a comment -

          I'm not sure I'd call this patch a fix; it's really just a workaround. A fix would have allowed arbitrarily sized requests, or would at least have changed the lfs command to submit a large request in multiple batches instead of failing.

          In particular, since there is no way for a user (or a tool like lfs) to determine the maximum request size, it's not possible to know that a request is too large until it is too late. This makes it difficult to build usable tools on top of this interface without resorting to arbitrarily small batch sizes or hard-coded magic values.
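
          As an illustration of the batching Robert suggests, here is a minimal user-space sketch using liblustreapi (llapi_hsm_user_request_alloc(), llapi_path2fid(), llapi_hsm_request()). The archive_in_batches() helper and the batch size of 40 are assumptions for illustration only, since, as he notes, the real limit is not discoverable:

          /*
           * Hedged sketch (not the patch in this ticket): archive 'count' paths
           * in fixed-size batches instead of one huge request.  BATCH is a
           * guessed safe value, not a limit exported by Lustre.
           */
          #include <errno.h>
          #include <stdlib.h>
          #include <lustre/lustreapi.h>

          #define BATCH 40

          static int archive_in_batches(const char *fspath, char *const paths[],
                                        int count, int archive_id)
          {
                  int i, rc = 0;

                  for (i = 0; i < count; i += BATCH) {
                          int n = count - i < BATCH ? count - i : BATCH;
                          struct hsm_user_request *hur;
                          int j;

                          hur = llapi_hsm_user_request_alloc(n, 0);
                          if (hur == NULL)
                                  return -ENOMEM;

                          hur->hur_request.hr_action = HUA_ARCHIVE;
                          hur->hur_request.hr_archive_id = archive_id;
                          hur->hur_request.hr_flags = 0;
                          hur->hur_request.hr_itemcount = 0;
                          hur->hur_request.hr_data_len = 0;

                          for (j = 0; j < n; j++) {
                                  struct hsm_user_item *hui = &hur->hur_user_item[j];

                                  /* resolve each path to a FID for the request */
                                  rc = llapi_path2fid(paths[i + j], &hui->hui_fid);
                                  if (rc < 0)
                                          break;
                                  hui->hui_extent.offset = 0;
                                  hui->hui_extent.length = -1; /* whole file */
                                  hur->hur_request.hr_itemcount++;
                          }

                          /* one modestly sized request per batch */
                          if (rc == 0)
                                  rc = llapi_hsm_request(fspath, hur);
                          free(hur);
                          if (rc < 0)
                                  break;
                  }
                  return rc;
          }

          Each llapi_hsm_request() call then carries at most BATCH items, so no single request can trip the size limit discussed here; the open question Robert raises is how a tool should pick BATCH without a published maximum.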

          adegremont Aurelien Degremont (Inactive) added a comment -

          The "/ 4" seems a little bit magic to me. Is it possible to add a comment to both patches explaining why we are doing this? If my math is correct, dividing by 3 should be enough to fit in MDS_MAXREQSIZE. This should give us a 33% increase in the number of files per batched HSM request. Ideally, we should also add a test for this.
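
          To make the 33% figure concrete, here is a quick back-of-the-envelope sketch using the sizes quoted elsewhere in this ticket (32-byte hsm_user_item, 5K MDS_MAXREQSIZE); it ignores the small hsm_user_request header, so the counts are approximate:

          #include <stdio.h>

          int main(void)
          {
                  const int maxreqsize = 5 * 1024; /* MDS_MAXREQSIZE as quoted: 5K */
                  const int item_size  = 32;       /* per hsm_user_item, as quoted */

                  /* items per request under each divisor */
                  printf("MDS_MAXREQSIZE / 4 -> %d items\n", maxreqsize / 4 / item_size);
                  printf("MDS_MAXREQSIZE / 3 -> %d items\n", maxreqsize / 3 / item_size);
                  return 0;
          }

          That is 53 versus 40 items per request, i.e. roughly the 33% gain mentioned above.
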
          jamesanunez James Nunez (Inactive) added a comment - edited

          I tested John's fix on master and, as expected, it reduced the number of files I am allowed to request for archive. If I try to archive, say, 200 files, I now get an "Argument list too long" (E2BIG) error. A bulk archive request with the reduced number of files succeeds and does not hang the client.

          Patch for master at http://review.whamcloud.com/#/c/9393/
          Patch for b2_5 at http://review.whamcloud.com/#/c/9422/

          jamesanunez James Nunez (Inactive) added a comment -

          I tried the same thing on the latest build of b2_5, build #33, and got different error messages. The results seem the same: I can't archive any files from this client and the bulk archive command never completes, but I can write from this client to the file system.

          # ls u1* | wc
              111     111     543
          # lfs hsm_archive u1*
          # lfs hsm_state u1*
          u1: (0x00000009) exists archived, archive_id:1
          u10: (0x00000009) exists archived, archive_id:1
          u100: (0x00000009) exists archived, archive_id:1
          ...
          

          There are no errors on the client, agent, or MDS. So, with 111 files, archiving all of them worked.

          Then I tried 125 files:

          # ls a* | wc
              125     125     517
          # lfs hsm_archive a*
          # lfs hsm_state a*
          a1: (0x00000000)
          a10: (0x00000000)
          a100: (0x00000000)
          ...
          

          So, it looks like the archive of these 125 files did not work.

          The difference between the b2_5 and the master results, posted previously in this ticket, is in the errors in the logs. After the failed bulk archive attempt, there are no errors on the client or on the agent. The errors on the MDS differ from master in that they complain about not finding an agent for archive 3:

          Lustre: 13480:0:(mdd_device.c:1351:mdd_changelog_user_purge()) Could not determine changelog records to purge; rc=0
          Lustre: 13480:0:(mdd_device.c:1351:mdd_changelog_user_purge()) Skipped 14 previous similar messages
          LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) scratch-MDT0000: Cannot find agent for archive 3: rc = -11
          LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) Skipped 2 previous similar messages
          LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) scratch-MDT0000: Cannot find agent for archive 3: rc = -11
          

          I have a single archive with index 1. On the agent:

          # ps -ef | grep hsm | grep -v grep
          root     11891     1  0 13:29 ?        00:00:00 lhsmtool_posix --daemon --hsm_root /lustre/archive --archive 1 /lustre/scratch
          
          jay Jinshan Xiong (Inactive) added a comment -

          Good point, John.

          jhammond John Hammond added a comment -

          Jinshan, we enforce the limit on the client side in ll_dir_ioctl(). There we require that hur_len(hur) < MDS_MAXREQSIZE (5K). Since an hsm_user_item is 32 bytes, this means hur_count < 160. The MDT then tries to send the same number of 72-byte hsm_action_items to the copytool, so with 120 items we have a KUC buffer of more than 72 * 120 = 8640 bytes. It looks like a cheap fix would be to use MDS_MAXREQSIZE / 4 in ll_dir_ioctl().

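          To make the mismatch concrete, here is a small sketch of the arithmetic John describes (the 6144-byte figure is taken from the "allowed" value in the LNet errors earlier in this ticket, not from a symbol looked up in the source):

          #include <stdio.h>

          int main(void)
          {
                  const int maxreqsize  = 5 * 1024; /* client-side cap, MDS_MAXREQSIZE (5K)    */
                  const int user_item   = 32;       /* per-item size checked in ll_dir_ioctl() */
                  const int action_item = 72;       /* per-item size sent on to the copytool   */
                  const int kuc_allowed = 6144;     /* "allowed" size from the LNet errors     */
                  const int items       = 120;      /* the failing request in this ticket      */

                  printf("client check:  %d * %d = %d <= %d -> accepted\n",
                         items, user_item, items * user_item, maxreqsize);
                  printf("copytool side: %d * %d = %d >  %d -> dropped\n",
                         items, action_item, items * action_item, kuc_allowed);
                  return 0;
          }

          A request that passes the 32-bytes-per-item client check can therefore still overflow the copytool's receive buffer once it is re-encoded as 72-byte action items, which is the gap the proposed MDS_MAXREQSIZE / 4 cap is meant to cover.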

          jamesanunez James Nunez (Inactive) added a comment -

          I tried to see whether any of the files from the bulk archive attempt above were actually archived, and whether I could archive a new file.

          For all files that I tried to archive in the bulk archive attempt above, I see:

          # lfs hsm_state u1*
          u1: (0x00000001) exists, archive_id:1
          u10: (0x00000001) exists, archive_id:1
          u100: (0x00000001) exists, archive_id:1
          ...
          

          I can't archive any new files, and there is no error message in the logs or on the console for this new archive request. Maybe the request is queued but can't be executed because of the previous bulk request:

          # lfs hsm_state u2
          u2: (0x00000000)
          # lfs hsm_archive --archive 1 u2
          # lfs hsm_state u2
          u2: (0x00000000)
          

          I rebooted the client node and tried the above again, but nothing changed. It looks like, as Oleg said, the client is stuck.

          I still see the following error message printed periodically in the agent logs:

          LNetError: 12270:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.2.109@o2ib, match 1460420866039584 length 8288 too big: 7896 left, 6144 allowed
          LNetError: 12270:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 85 previous similar messages
          

          jamesanunez James Nunez (Inactive) added a comment -

          Yes, I'm able to recreate this problem by trying to archive ~111 files at once. There are several of the following errors on the agent:

          LNetError: 12269:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.2.109@o2ib, match 1460420866039584 length 8288 too big: 7896 left, 6144 allowed
          

          On the MDS, I see the following in dmesg:

          Lustre: scratch-MDD0000: changelog on
          LustreError: 21923:0:(mdt_coordinator.c:950:mdt_hsm_cdt_start()) scratch-MDT0000: Coordinator already started
          LustreError: 21923:0:(obd_config.c:1350:class_process_proc_param()) mdt.: error writing proc entry 'hsm_control': rc = -114
          Lustre: 1963:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1392918571/real 1392918571]  req@ffff8806bd2bd400 x1460420865979988/t0(0) o8->scratch-OST0000-osc-MDT0000@192.168.2.111@o2ib:28/4 lens 400/544 e 0 to 1 dl 1392918596 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          Lustre: 1963:0:(client.c:1907:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
          LNet: 1959:0:(o2iblnd_cb.c:381:kiblnd_handle_rx()) PUT_NACK from 192.168.2.110@o2ib
          Lustre: 21913:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1392931034/real 1392931034]  req@ffff8806bd181800 x1460420866039584/t0(0) o107->scratch-MDT0000@192.168.2.110@o2ib:15/16 lens 8288/224 e 0 to 1 dl 1392931041 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
          Lustre: 21913:0:(client.c:1907:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
          LNet: 1958:0:(o2iblnd_cb.c:381:kiblnd_handle_rx()) PUT_NACK from 192.168.2.110@o2ib
          

          jay Jinshan Xiong (Inactive) added a comment -

          It looks like we've already limited the maximum size of HSM requests to MDS_MAXREQSIZE, which is 5K bytes. Why does it add up to 6776 bytes at the end?

          green Oleg Drokin added a comment -

          The packet drops hang everything because the client will keep retrying indefinitely, as the server will not provide any indication that it just dropped the request.
          So the client is stuck trying to send this RPC with who knows what locks held, blocking other RPCs from being sent.


          People

            Assignee:
            jamesanunez James Nunez (Inactive)
            Reporter:
            hdoreau Henri Doreau (Inactive)
            Votes:
            0
            Watchers:
            11
