LU-17185: After deactivating OSTs, some clients see them as active

Details

    • Type: Bug
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.12.9
    • Severity: 3

    Description

      After running lctl conf_param lustrefc-OST0018.osc.active=0 on the MGS for multiple OSTs, some clients see the OSTs as inactive and work just fine. Others see the OSTs as active and hang. The software stack is the same on the working and non-working clients.
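      The command was repeated for each OST being retired, along these lines (a sketch; only OST0018 is named above, the other indices are taken from the client output below and the full list may differ):

        mgs# lctl conf_param lustrefc-OST0007.osc.active=0
        mgs# lctl conf_param lustrefc-OST000e.osc.active=0
        mgs# lctl conf_param lustrefc-OST000f.osc.active=0
        mgs# lctl conf_param lustrefc-OST0018.osc.active=0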

      Here is some sample lctl dl output from a working client:
      4 IN osc lustrefc-OST0007-osc-ffff9d035c824000 c75633d5-3afe-6370-7e49-dcad475a6bc2 4
      5 IN osc lustrefc-OST000e-osc-ffff9d035c824000 c75633d5-3afe-6370-7e49-dcad475a6bc2 4
      6 IN osc lustrefc-OST000f-osc-ffff9d035c824000 c75633d5-3afe-6370-7e49-dcad475a6bc2 4

      1. Note that the disabled OSTs are not listed
      2. lfs df -h partial output
        lustrefc-MDT0000_UUID 229.3G 61.3G 131.4G 32% /data[MDT:0]
        lustrefc-OST0012_UUID 64.9T 20.3T 44.6T 32% /data[OST:18]

      lctl dl output from a non-working client:
      4 UP osc lustrefc-OST0007-osc-ffff958c1995b800 4910e4fd-accd-1685-c6d2-3418a29afbd1 3
      5 UP osc lustrefc-OST000e-osc-ffff958c1995b800 4910e4fd-accd-1685-c6d2-3418a29afbd1 3
      6 UP osc lustrefc-OST000f-osc-ffff958c1995b800 4910e4fd-accd-1685-c6d2-3418a29afbd1 3

      1. lfs df -h partial output
        lustrefc-MDT0000_UUID 229.3G 61.3G 131.4G 32% /data[MDT:0]
        OST0007 : Invalid argument
        OST000e : Invalid argument
        OST000f : Invalid argument

      This has rendered our cluster unusable.
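      To compare state across clients without reading lctl dl, the per-OSC active flag can also be queried directly on each client (a minimal check, assuming the default OSC device naming shown above; a value of 0 means the OSC is deactivated):

        client# lctl get_param osc.lustrefc-OST*.active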

      Attachments

        Activity


          I thought it fixed the problem. On the problem clients, I'm seeing this:

          [root@puppy83 ~]# lfs df -h
          UUID bytes Used Available Use% Mounted on
          lustrefc-MDT0000_UUID 229.3G 61.3G 131.4G 32% /data[MDT:0]
          OST0007 : Invalid argument
          OST000e : Invalid argument
          OST000f : Invalid argument
          OST0010 : Invalid argument
          OST0011 : Invalid argument
          lustrefc-OST0012_UUID 64.9T 20.1T 44.8T 31% /data[OST:18]
          lustrefc-OST0013_UUID 64.9T 19.4T 45.6T 30% /data[OST:19]
          lustrefc-OST0014_UUID 64.9T 49.4T 15.5T 77% /data[OST:20]
          lustrefc-OST0015_UUID 64.9T 48.8T 16.1T 76% /data[OST:21]
          OST0016 : Invalid argument
          OST0017 : Invalid argument
          OST0018 : Invalid argument
          [root@puppy83 ~]# lfs osts
          OBDS:
          7: lustrefc-OST0007_UUID INACTIVE
          14: lustrefc-OST000e_UUID INACTIVE
          15: lustrefc-OST000f_UUID INACTIVE
          16: lustrefc-OST0010_UUID INACTIVE
          17: lustrefc-OST0011_UUID INACTIVE
          18: lustrefc-OST0012_UUID ACTIVE

          [root@puppy83 ~]# lctl dl
          0 UP mgc MGC172.17.1.112@o2ib 8214fcc9-bf1d-148d-8c9e-8040314a4b34 4
          1 UP lov lustrefc-clilov-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
          2 UP lmv lustrefc-clilmv-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 4
          3 UP mdc lustrefc-MDT0000-mdc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 4
          4 UP osc lustrefc-OST0007-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
          5 UP osc lustrefc-OST000e-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
          6 UP osc lustrefc-OST000f-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
          7 UP osc lustrefc-OST0010-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
          8 UP osc lustrefc-OST0011-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
          9 UP osc lustrefc-OST0012-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 4

          rs1 Roger Sersted added a comment

          Thank you. The lctl ...active=0 command on the clients fixed the hang problem. I'll run the llog_print command later and attach the results.

          rs1 Roger Sersted added a comment

          It would be worthwhile to check that the conf_param command is present in the client config log:

          mgs# lctl --device MGS llog_print lustrefc-client | grep OST0007
          

          There would be initial commands to add the OST and then the last one should be the one to mark it inactive. In newer releases there is an "lctl del_ost" command that will remove the OST setup commands from the configuration log completely.
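
           For reference, a sketch of that newer-release usage (the --target syntax follows the lctl-del_ost documentation and is an assumption here; it is not available in the 2.12 release in use, so verify against the installed version):

           mgs# lctl del_ost --target lustrefc-OST0007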

          adilger Andreas Dilger added a comment

          I can't say for sure why this setting is not being applied to some of the clients. As a workaround, you could manually deactivate these OSTs on the affected clients like:

          client# lctl set_param osc.*OST{0007,000e,000f}*.active=0
          

          possibly using pdsh or another tool to execute it on multiple clients at once. It shouldn't be harmful if this is run on clients that already have the OSTs deactivated.
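
           For example, something along these lines (the host list is a placeholder for the affected client nodes):

           # apply the per-client workaround to many nodes at once; OST list as above
           pdsh -w client[001-100] 'lctl set_param osc.*OST{0007,000e,000f}*.active=0'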

          adilger Andreas Dilger added a comment

          I should add, I tried rebooting one of the problem nodes and that did not resolve this issue.

          rs1 Roger Sersted added a comment
          rs1 Roger Sersted created issue

          People

            Assignee: wc-triage WC Triage
            Reporter: rs1 Roger Sersted
            Votes: 0
            Watchers: 2
