Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.16.0
    • None
    • EL9.3 Lustre 2.15.62
    • 3

    Description

      We noticed that in some cases, when an HSM agent crashes, some MDTs retain a "ghost" agent, as shown here with 3 out of 4 MDTs still referencing the agent after it crashed:

      # clush -w@mds -b 'grep "" /sys/kernel/debug/lustre/mdt/elm-MDT*/hsm/agents'
      ---------------
      elm-rcf-md1-s1
      ---------------
      /sys/kernel/debug/lustre/mdt/elm-MDT0000/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]
      /sys/kernel/debug/lustre/mdt/elm-MDT0002/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]
      ---------------
      elm-rcf-md1-s2
      ---------------
      /sys/kernel/debug/lustre/mdt/elm-MDT0001/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]
      
      # clush -w@mds -b 'grep e2fe5e4a4b9c /proc/fs/lustre/mdt/elm-MDT*/exports/*/uuid'
      clush: elm-rcf-md1-s[1-2] (2): exited with exit code 1
      

      Even after remounting or rebooting the Lustre client running the copytool (here, the coordinatool), the old agent remains, which causes trouble with new archive requests. I wanted to investigate further, but I am opening this now so that it can be referenced at LAD next week. Restarting the MDTs fixes it.
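
      The cross-check done by hand above (agent UUIDs listed in hsm/agents versus client export UUIDs) could be scripted roughly as follows. This is only a sketch: the function name, the regular expression, and the input shapes are hypothetical, the line format is taken from the output quoted in this report, and it assumes that an agent's uuid is the same string that appears in the exports/*/uuid files.

```python
import re

# Hypothetical pattern, based only on the hsm/agents lines quoted in this report:
#   .../mdt/<MDT name>/hsm/agents:uuid=<uuid> archive_id=... requests=[...]
AGENT_RE = re.compile(r"/mdt/(?P<mdt>[^/]+)/hsm/agents:uuid=(?P<uuid>\S+)")


def find_ghost_agents(agents_output, export_uuids_by_mdt):
    """Return {mdt: [uuid, ...]} for registered agents with no matching export.

    agents_output       -- text from grepping the hsm/agents files (clush
                           separators and host names are ignored)
    export_uuids_by_mdt -- assumed mapping of MDT name to the set of client
                           export UUIDs present on that MDT
    """
    ghosts = {}
    for line in agents_output.splitlines():
        match = AGENT_RE.search(line)
        if match is None:
            continue  # separator line, host name, or unrelated output
        mdt, uuid = match.group("mdt"), match.group("uuid")
        if uuid not in export_uuids_by_mdt.get(mdt, set()):
            ghosts.setdefault(mdt, []).append(uuid)
    return ghosts


# Sample input copied from the output in this report.
SAMPLE_AGENTS = """\
---------------
elm-rcf-md1-s1
---------------
/sys/kernel/debug/lustre/mdt/elm-MDT0000/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]
/sys/kernel/debug/lustre/mdt/elm-MDT0002/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]
---------------
elm-rcf-md1-s2
---------------
/sys/kernel/debug/lustre/mdt/elm-MDT0001/hsm/agents:uuid=e27ffd42-fa46-4657-a582-e2fe5e4a4b9c archive_id=1 requests=[current:0 ok:0 errors:0]
"""

if __name__ == "__main__":
    # The export grep in this report matched nothing, so no MDT has an export.
    for mdt, uuids in sorted(find_ghost_agents(SAMPLE_AGENTS, {}).items()):
        print(mdt, ",".join(uuids))
```

      With an empty export mapping, as in the situation reported here, every registered agent is flagged. Passing the real per-MDT export UUIDs would limit the result to agents whose client is actually gone.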
       

          People

            giardi Sylwyn Giardi
            sthiell Stephane Thiell