Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Description
Repro steps:
1. Modify a couple of large files.
2. Mark them for archive.
3. Observe that the archives are in progress.
4. Issue a cancel for all archives (hsm_control=cancel_archives).
5. Observe that the archives are still in progress.
6. Issue the cancel again, several times.
7. Observe that the archives are still in progress.
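As the output below shows, the cancel actions are sent to the two agents that are not running the archives (their epoch counters advance on every cancel, while the agent with current:2 stays at epoch 8), and the archives eventually complete: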
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.actions
mdt.lustrefs-MDT0000.hsm.actions=
lrh=[type=10680000 len=136 idx=1/9] fid=[0x200000401:0xd:0x0] dfid=[0x200000401:0xd:0x0] compound/cookie=0x0/0x63ac8b76 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/10] fid=[0x200000401:0xe:0x0] dfid=[0x200000401:0xe:0x0] compound/cookie=0x0/0x63ac8b77 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:7]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:0]
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl set_param mdt.lustrefs-MDT0000.hsm_control=cancel_archives
mdt.lustrefs-MDT0000.hsm_control=cancel_archives
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:10]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:9]
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl set_param mdt.lustrefs-MDT0000.hsm_control=cancel_archives
mdt.lustrefs-MDT0000.hsm_control=cancel_archives
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:12]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:11]
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl set_param mdt.lustrefs-MDT0000.hsm_control=cancel_archives
mdt.lustrefs-MDT0000.hsm_control=cancel_archives
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:14]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:13]
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl set_param mdt.lustrefs-MDT0000.hsm_control=purge
mdt.lustrefs-MDT0000.hsm_control=purge
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:16]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:15]
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:16]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:15]
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=9d508026-91ef-40fc-9388-df7c24601d11 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:16]
uuid=ca376747-bb10-4ff6-b512-609a1a7bcf67 archive_id=ANY requests=[current:0 ok:2 errors:0 epoch:8]
uuid=5106d672-aaac-4ed1-992d-f976aad2fa87 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:15]
This is caused by the mdt_hsm_agent_send function not considering that a cancel must be sent to the specific agent that is running the request being cancelled. It simply runs mdt_hsm_find_best_agent and sends the cancel to whichever agent that returns.
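The routing problem can be illustrated with a small Python model. This is only a sketch of the idea, not the Lustre code or the pending patch: the Agent class, find_owner, route, and the "least loaded" selection rule are all assumptions made for illustration; only the names mdt_hsm_agent_send and mdt_hsm_find_best_agent come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    uuid: str
    current: int = 0                            # requests in flight
    cookies: set = field(default_factory=set)   # cookies of running requests


def find_best_agent(agents):
    """Stand-in for mdt_hsm_find_best_agent (assumed: least-loaded agent wins)."""
    return min(agents, key=lambda a: a.current)


def find_owner(agents, cookie):
    """Hypothetical lookup: the agent currently running request `cookie`."""
    for agent in agents:
        if cookie in agent.cookies:
            return agent
    return None


def route(agents, action, cookie, fixed=True):
    """Pick the agent a request is sent to.

    With fixed=False every action, including CANCEL, goes to the "best"
    agent, which models the behaviour reported in this ticket. With
    fixed=True a CANCEL goes to the agent that owns the request.
    """
    if action == "CANCEL" and fixed:
        return find_owner(agents, cookie)
    return find_best_agent(agents)


if __name__ == "__main__":
    agents = [
        Agent("agent-a"),
        Agent("agent-b", current=2, cookies={0x63AC8B76, 0x63AC8B77}),
        Agent("agent-c"),
    ]
    print("buggy:", route(agents, "CANCEL", 0x63AC8B76, fixed=False).uuid)
    print("fixed:", route(agents, "CANCEL", 0x63AC8B76, fixed=True).uuid)
```

In this model the busy agent is never the least loaded one, so without owner-based routing a cancel can never reach the agent that is actually running the archive.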
After fix (this has been repeated multiple times to confirm it's not just luck):
# Start archive on priagt
# Show 2 large files being processed by one agent (2 by one is expected; the entire 3-entry HAL is sent to a single agent)
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=d69d45e6-d6ca-413a-a3a4-9424eba97642 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:1]
uuid=68fcb2e2-a6b3-48ae-aeac-78439cca7996 archive_id=ANY requests=[current:2 ok:0 errors:0 epoch:2]
uuid=4d8c4cbb-1865-452b-a5d0-34df1243bca6 archive_id=ANY requests=[current:0 ok:0 errors:0 epoch:0]
# Cancel archives
# Show that jobs correctly transition to cancelled:
root@d24d1c86-21d7-4297-b8a3-f5ba9d2430ed-mdsmgs-a0-vm:~# lctl get_param mdt.lustrefs-MDT0000.hsm.agents
mdt.lustrefs-MDT0000.hsm.agents=
uuid=d69d45e6-d6ca-413a-a3a4-9424eba97642 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:1]
uuid=68fcb2e2-a6b3-48ae-aeac-78439cca7996 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:2]
uuid=4d8c4cbb-1865-452b-a5d0-34df1243bca6 archive_id=ANY requests=[current:0 ok:0 errors:2 epoch:3]
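The agent listings above can also be compared mechanically rather than by eye. The following Python sketch parses two hsm.agents snapshots and reports agents that still have requests running and whose epoch did not change between the snapshots. That rule is a heuristic taken from the pre-fix listings in this ticket, not documented Lustre behaviour, and the helper names parse_agents and missed_by_cancel are made up for this example.

```python
import re

# Matches one agent line as printed by `lctl get_param mdt.*.hsm.agents`
# in the listings in this ticket.
AGENT_RE = re.compile(
    r"uuid=(?P<uuid>\S+) archive_id=(?P<archive_id>\S+) "
    r"requests=\[current:(?P<current>\d+) ok:(?P<ok>\d+) "
    r"errors:(?P<errors>\d+) epoch:(?P<epoch>\d+)\]"
)


def parse_agents(text):
    """Return {uuid: {"current": int, "ok": int, "errors": int, "epoch": int}}."""
    agents = {}
    for m in AGENT_RE.finditer(text):
        agents[m["uuid"]] = {
            key: int(m[key]) for key in ("current", "ok", "errors", "epoch")
        }
    return agents


def missed_by_cancel(before, after):
    """UUIDs of agents still running requests whose epoch did not move."""
    return [
        uuid
        for uuid, stats in after.items()
        if stats["current"] > 0
        and uuid in before
        and before[uuid]["epoch"] == stats["epoch"]
    ]
```

Feeding it the listings taken before and after a cancel_archives call gives a quick check of whether the cancel reached the agent that is running the archives.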
Patch to be sent shortly.