[LU-17185] After deactivating OSTs, some clients see them as active Created: 11/Oct/23  Updated: 13/Oct/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.9
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Roger Sersted Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Client:
CentOS Linux release 7.9.2009 (Core)
3.10.0-1160.49.1.el7.x86_64
lustre-client-2.12.9-1.el7.x86_64
lustre-client-dkms-2.12.9-1.el7.noarch

Servers:
CentOS Linux release 7.9.2009 (Core)
3.10.0-1160.49.1.el7_lustre.x86_64
Lustre 2.12.9


Epic/Theme: client, mgs
Severity: 3
Epic: client, mgs
Rank (Obsolete): 9223372036854775807

 Description   

After running lctl conf_param lustrefc-OST0018.osc.active=0 on the MGS for multiple OSTs, some clients see those OSTs as inactive and work just fine, while other clients still see them as active and hang. The software stack is the same on the working and non-working clients.

Here is some sample lctl dl output from a working client:
4 IN osc lustrefc-OST0007-osc-ffff9d035c824000 c75633d5-3afe-6370-7e49-dcad475a6bc2 4
5 IN osc lustrefc-OST000e-osc-ffff9d035c824000 c75633d5-3afe-6370-7e49-dcad475a6bc2 4
6 IN osc lustrefc-OST000f-osc-ffff9d035c824000 c75633d5-3afe-6370-7e49-dcad475a6bc2 4

  1. Note that the disabled OSTs are not listed.
  2. lfs df -h partial output:
    lustrefc-MDT0000_UUID 229.3G 61.3G 131.4G 32% /data[MDT:0]
    lustrefc-OST0012_UUID 64.9T 20.3T 44.6T 32% /data[OST:18]

Output (lctl dl) from a non-working client:
4 UP osc lustrefc-OST0007-osc-ffff958c1995b800 4910e4fd-accd-1685-c6d2-3418a29afbd1 3
5 UP osc lustrefc-OST000e-osc-ffff958c1995b800 4910e4fd-accd-1685-c6d2-3418a29afbd1 3
6 UP osc lustrefc-OST000f-osc-ffff958c1995b800 4910e4fd-accd-1685-c6d2-3418a29afbd1 3

  1. lfs df -h partial output:
    lustrefc-MDT0000_UUID 229.3G 61.3G 131.4G 32% /data[MDT:0]
    OST0007 : Invalid argument
    OST000e : Invalid argument
    OST000f : Invalid argument

This has rendered our cluster unusable.
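
For reference, a sketch of how the state above can be compared on any client (the OST indices are just the ones from this report; deactivated OSCs show "IN" in lctl dl and active ones show "UP"):

client# lctl dl | grep osc                        # device list, as shown above
client# lctl get_param osc.lustrefc-OST*.active   # 1 = active, 0 = deactivated
client# lfs df -h                                 # deactivated OSTs are skipped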



 Comments   
Comment by Roger Sersted [ 11/Oct/23 ]

I should add that I tried rebooting one of the problem nodes, and that did not resolve the issue.

Comment by Andreas Dilger [ 11/Oct/23 ]

I can't say for sure why this setting is not being applied to some of the clients. As a workaround, you could manually deactivate these OSTs on the affected clients like:

client# lctl set_param osc.*OST{0007,000e,000f}*.active=0

possibly using pdsh or another tool to execute it on multiple clients at once. It shouldn't be harmful to run this on clients that already have the OSTs deactivated.
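
For example, a sketch with pdsh (the node list is a placeholder, adjust it for your cluster; single quotes keep the brace expansion on the remote shell):

admin# pdsh -w 'client[001-100]' 'lctl set_param osc.*OST{0007,000e,000f}*.active=0'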

Comment by Andreas Dilger [ 11/Oct/23 ]

It would be worthwhile to check that the conf_param command is present in the client config log:

mgs# lctl --device MGS llog_print lustrefc-client | grep OST0007

There should be the initial commands that add the OST, and the last record for that OST should be the one marking it inactive. In newer releases there is an "lctl del_ost" command that removes the OST setup commands from the configuration log completely.
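
A sketch of what that could look like (hypothetical example; del_ost only exists in newer releases and the exact options may differ, so check "lctl help del_ost" or lctl-del_ost(8) first):

mgs# lctl del_ost --dryrun --target lustrefc-OST0018   # preview the config records that would be cancelled
mgs# lctl del_ost --target lustrefc-OST0018            # actually cancel them in the config logs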

Comment by Roger Sersted [ 12/Oct/23 ]

Thank you. The lctl ...active=0 command on the clients fixed the hang problem. I'll run the llog_print command later and attach the results.

Comment by Roger Sersted [ 13/Oct/23 ]

I thought it fixed the problem. On the problem clients, I'm seeing this:

[root@puppy83 ~]# lfs df -h
UUID bytes Used Available Use% Mounted on
lustrefc-MDT0000_UUID 229.3G 61.3G 131.4G 32% /data[MDT:0]
OST0007 : Invalid argument
OST000e : Invalid argument
OST000f : Invalid argument
OST0010 : Invalid argument
OST0011 : Invalid argument
lustrefc-OST0012_UUID 64.9T 20.1T 44.8T 31% /data[OST:18]
lustrefc-OST0013_UUID 64.9T 19.4T 45.6T 30% /data[OST:19]
lustrefc-OST0014_UUID 64.9T 49.4T 15.5T 77% /data[OST:20]
lustrefc-OST0015_UUID 64.9T 48.8T 16.1T 76% /data[OST:21]
OST0016 : Invalid argument
OST0017 : Invalid argument
OST0018 : Invalid argument
[root@puppy83 ~]# lfs osts
OBDS:
7: lustrefc-OST0007_UUID INACTIVE
14: lustrefc-OST000e_UUID INACTIVE
15: lustrefc-OST000f_UUID INACTIVE
16: lustrefc-OST0010_UUID INACTIVE
17: lustrefc-OST0011_UUID INACTIVE
18: lustrefc-OST0012_UUID ACTIVE

[root@puppy83 ~]# lctl dl
0 UP mgc MGC172.17.1.112@o2ib 8214fcc9-bf1d-148d-8c9e-8040314a4b34 4
1 UP lov lustrefc-clilov-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
2 UP lmv lustrefc-clilmv-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 4
3 UP mdc lustrefc-MDT0000-mdc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 4
4 UP osc lustrefc-OST0007-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
5 UP osc lustrefc-OST000e-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
6 UP osc lustrefc-OST000f-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
7 UP osc lustrefc-OST0010-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
8 UP osc lustrefc-OST0011-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 3
9 UP osc lustrefc-OST0012-osc-ffff9cb9fd858000 9e88710e-fc2d-cda9-b054-cef11bc3aa18 4
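
If it helps narrow this down, the per-target state on this client can also be read directly (a sketch using the standard osc parameters):

[root@puppy83 ~]# lctl get_param osc.lustrefc-OST0007*.active
[root@puppy83 ~]# lctl get_param osc.lustrefc-OST0007*.import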
