Lustre / LU-17731

Lustre doesn't send CANCEL requests to the correct agents

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Repro steps:
      1. Modify a couple of large files.
      2. Mark them for archive.
      3. Observe that the archives are in progress.
      4. Issue cancels for all archives.
      5. Observe that they are still in progress.
      6. Repeat the cancels.
      7. Observe that they are still in progress.

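      As the output below shows, the cancel actions are sent to the two agents that are not running the archives (their epoch values advance after each cancel, while the agent with current:2 stays at epoch:8), and the archives eventually complete: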
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.actions
      mdt.lustrefs-MDT0000.hsm.actions=
      lrh=[type=10680000 len=136 idx=1/9] fid=[0x200000401:0xd:0x0] dfid=[0x200000401:0xd:0x0] compound/cookie=0x0/0x63ac8b76 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
      lrh=[type=10680000 len=136 idx=1/10] fid=[0x200000401:0xe:0x0] dfid=[0x200000401:0xe:0x0] compound/cookie=0x0/0x63ac8b77 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:7]
      uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
      uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:0]
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl set_param mdt.lustrefs-MDT0000.hsm_control=cancel_archives
      mdt.lustrefs-MDT0000.hsm_control=cancel_archives
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:10]
      uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
      uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:9]
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl set_param mdt.lustrefs-MDT0000.hsm_control=cancel_archives
      mdt.lustrefs-MDT0000.hsm_control=cancel_archives
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:12]
      uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
      uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:11]
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl set_param mdt.lustrefs-MDT0000.hsm_control=cancel_archives
      mdt.lustrefs-MDT0000.hsm_control=cancel_archives
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:14]
      uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
      uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:13]
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl set_param mdt.lustrefs-MDT0000.hsm_control=purge
      mdt.lustrefs-MDT0000.hsm_control=purge
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:16]
      uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
      uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:15]
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:16]
      uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
      uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:15]
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:16]
      uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:0 ok:2 errors:0 epoch:8]
      uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:15]
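
The before/after comparison in the output above can be automated. The following is a minimal sketch, assuming the `hsm.agents` line format is exactly as printed in this report; the helper names (`parse_agents`, `epoch_changed`, `busy`) are hypothetical. It treats a changed `epoch` as a sign that the agent was contacted, which is this report's reading of the output rather than a documented guarantee.

```python
import re

# Matches one agent line of `lctl get_param mdt.*.hsm.agents` as shown in this report.
AGENT_RE = re.compile(
    r"uuid=(?P<uuid>\S+) archive_id=(?P<archive_id>\S+) "
    r"requests=\[current:(?P<current>\d+) ok:(?P<ok>\d+) "
    r"errors:(?P<errors>\d+) epoch:(?P<epoch>\d+)\]"
)
COUNTERS = ("current", "ok", "errors", "epoch")


def parse_agents(text):
    """Return {uuid: {counter_name: int}} for every agent line in text."""
    return {
        m["uuid"]: {name: int(m[name]) for name in COUNTERS}
        for m in AGENT_RE.finditer(text)
    }


def epoch_changed(before, after):
    """UUIDs whose epoch differs between two snapshots."""
    return {u for u in before if u in after and before[u]["epoch"] != after[u]["epoch"]}


def busy(snapshot):
    """UUIDs that currently have at least one request in flight."""
    return {u for u, counters in snapshot.items() if counters["current"] > 0}


# Snapshot taken before the first cancel (copied from the report).
BEFORE = """
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:7]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:0]
"""

# Snapshot taken after the first cancel (copied from the report).
AFTER = """
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:10]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:9]
"""

before = parse_agents(BEFORE)
after = parse_agents(AFTER)
touched = epoch_changed(before, after)
running = busy(before)

print("epoch changed:   ", sorted(touched))
print("running archives:", sorted(running))
print("cancel reached a busy agent:", bool(touched & running))
```

With the two snapshots above, the set of agents whose epoch changed and the set of agents running archives do not overlap, which matches the symptom described.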

      This happens because mdt_hsm_agent_send does not take into account that a cancel must be sent to the specific agent handling the request being cancelled. It simply calls mdt_hsm_find_best_agent and sends the cancel to whichever agent that returns.
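
The routing problem can be illustrated with a small model. This is a sketch under stated assumptions, not Lustre code and not the actual patch (which is not shown in this report): `Agent`, `pick_agent_model`, `send_buggy`, and `send_by_owner` are hypothetical names, the least-loaded policy in `pick_agent_model` is only a stand-in for whatever mdt_hsm_find_best_agent really does, and matching a cancel to its agent by cookie is one plausible rule consistent with the description.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    uuid: str
    running: set = field(default_factory=set)            # cookies of in-flight requests
    received_cancels: list = field(default_factory=list)  # cookies of cancels delivered here


def deliver(agent, action, cookie):
    """Hand one action to an agent and update its bookkeeping."""
    if action == "CANCEL":
        agent.received_cancels.append(cookie)
        agent.running.discard(cookie)  # no effect unless this agent owns the cookie
    else:
        agent.running.add(cookie)


def pick_agent_model(agents):
    """Stand-in selection policy (assumption): the least-loaded agent wins."""
    return min(agents, key=lambda a: len(a.running))


def send_buggy(agents, action, cookie):
    """Models the reported behavior: every action, CANCEL included, uses the selection policy."""
    target = pick_agent_model(agents)
    deliver(target, action, cookie)
    return target


def send_by_owner(agents, action, cookie):
    """Models one possible fix: CANCEL goes to the agent that owns the cookie."""
    if action == "CANCEL":
        target = next((a for a in agents if cookie in a.running), None)
        if target is None:
            return None  # nothing in flight with this cookie
    else:
        target = pick_agent_model(agents)
    deliver(target, action, cookie)
    return target


def make_agents():
    """Three agents; one is running both archives (cookies taken from the report)."""
    return [
        Agent("idle-1"),
        Agent("busy", running={0x63AC8B76, 0x63AC8B77}),
        Agent("idle-2"),
    ]


if __name__ == "__main__":
    agents = make_agents()
    target = send_buggy(agents, "CANCEL", 0x63AC8B76)
    print("buggy routing sent cancel to:", target.uuid, "| busy still runs", len(agents[1].running))

    agents = make_agents()
    target = send_by_owner(agents, "CANCEL", 0x63AC8B76)
    print("owner routing sent cancel to:", target.uuid, "| busy still runs", len(agents[1].running))
```

In this model the first routing function delivers the cancel to an idle agent and leaves both archives running, while the owner-based routing removes the cancelled request from the agent that was actually processing it.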

      After fix (this has been repeated multiple times to confirm it's not just luck):

      # Start archive on priagt
      # Show 2 large files being processed by one agent (2 on one agent is expected: the entire 3-entry HAL, the HSM action list, is sent to a single agent)
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=d69d45e6-d6ca-413a-a3a4-9424eba97642 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:1]
      uuid=68fcb2e2-a6b3-48ae-aeac-78439cca7996 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:2]
      uuid=4d8c4cbb-1865-452b-a5d0-34df1243bca6 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:0]

      # Cancel archives
      # Show that jobs correctly transition to cancelled:
      root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
      mdt.lustrefs-MDT0000.hsm.agents=
      uuid=d69d45e6-d6ca-413a-a3a4-9424eba97642 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:1]
      uuid=68fcb2e2-a6b3-48ae-aeac-78439cca7996 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:2]
      uuid=4d8c4cbb-1865-452b-a5d0-34df1243bca6 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:3]

      Patch to be sent shortly.

          People

            elliswilson Ellis Wilson