Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version/s: Lustre 2.16.0
- Environment: EL9.3 Lustre 2.15.62
- Severity: 3
Description
We noticed that in some cases, when an HSM agent crashes, some MDTs retain a "ghost" agent, as shown here with 3 out of 4 MDTs still referencing the agent after it crashed:
# clush -w@mds -b 'grep "" /sys/kernel/debug/lustre/mdt/elm-MDT*/hsm/agents'
---------------
elm-rcf-md1-s1
---------------
/sys/kernel/debug/lustre/mdt/elm-MDT0000/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]
/sys/kernel/debug/lustre/mdt/elm-MDT0002/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]
---------------
elm-rcf-md1-s2
---------------
/sys/kernel/debug/lustre/mdt/elm-MDT0001/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]

# clush -w@mds -b 'grep e2fe5e4a4b9c /proc/fs/lustre/mdt/elm-MDT*/exports/*/uuid'
clush: elm-rcf-md1-s[1-2] (2): exited with exit code 1
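For reference, the same check the two clush commands above do by hand can be scripted on each MDS: cross-check every UUID in hsm/agents against the client exports of the same MDT and flag the ones with no matching export. This is only a rough, untested sketch built on the paths shown above:

# Rough sketch (untested): report HSM agent UUIDs with no matching client
# export on the same MDT, i.e. "ghost" agents. Run on each MDS.
for f in /sys/kernel/debug/lustre/mdt/*/hsm/agents; do
    [ -e "$f" ] || continue
    mdt=$(basename "$(dirname "$(dirname "$f")")")
    while read -r line; do
        uuid=${line#uuid=}
        uuid=${uuid%% *}
        [ -n "$uuid" ] || continue
        grep -qs "$uuid" /proc/fs/lustre/mdt/"$mdt"/exports/*/uuid ||
            echo "$mdt: ghost agent $uuid"
    done < "$f"
done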
Even after remounting/rebooting the Lustre client running the copytool (here, the coordinatool), the old agent remains and causes trouble with new archive requests. We wanted to investigate further, but are opening this ticket now so it can be referenced at LAD next week. Restarting the MDTs fixes it.
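If a full MDT restart turns out to be too heavy, one idea would be to bounce the HSM coordinator on the affected MDTs via hsm_control; this is untested, and it is only an assumption that the coordinator drops the stale agent registration when it is restarted:

# Untested assumption: restarting the coordinator might clear the stale
# agent entry without remounting the MDT. elm-MDT0000 is taken from the
# output above; pending HSM requests on that MDT could be disrupted.
lctl set_param mdt.elm-MDT0000.hsm_control=shutdown
lctl get_param mdt.elm-MDT0000.hsm_control   # check the coordinator state before re-enabling
lctl set_param mdt.elm-MDT0000.hsm_control=enabled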