[LU-14403] lctl dl UP and lfs df problem with conf_param osc.active=0 after client remount
Created: 08/Feb/21  Updated: 10/May/23  Resolved: 10/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.6
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Stephane Thiell
Assignee: Stephane Thiell
Resolution: Duplicate
Votes: 0
Labels: None
Environment: CentOS 7.6


Attachments: Text File oak-h01v10-client-dk.log    
Issue Links:
Related
is related to LU-7668 permanently remove deactivated OSTs f... Resolved
Severity: 3

 Description   

This is related to LU-7668. The Lustre Manual, in section 14.9.3, "Removing an OST from the File System", recommends using lctl conf_param ost_name.osc.active=0 to permanently disable OSTs.

We are trying to permanently disable 12 old empty OSTs on our Oak filesystem. We used commands like these:

lctl conf_param oak-OST0000.osc.active=0
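
For several OSTs, the same command can be generated per target instead of typed by hand. The following is a minimal sketch, not the procedure actually used on Oak: the helper name `conf_param_commands` and the index list are hypothetical, and it assumes OST target names carry a four-digit hexadecimal index (as in oak-OST0000). It only builds the command strings and does not run them.

```python
def conf_param_commands(fsname, ost_indices, active=0):
    """Build one `lctl conf_param` command line per OST index."""
    commands = []
    for index in ost_indices:
        # Assumption: target names use a four-digit hex index, e.g. oak-OST000a.
        target = f"{fsname}-OST{index:04x}"
        commands.append(f"lctl conf_param {target}.osc.active={active}")
    return commands


if __name__ == "__main__":
    # Hypothetical index list; substitute the OSTs actually being retired.
    for line in conf_param_commands("oak", range(12)):
        print(line)
```

Printing the commands first leaves room to review the list before anything is run on the MGS.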

Lustre logs seem to indicate it works OK:

20000000:02000400:16.0:1612760209.795689:0:334624:0:(mgs_llog.c:3964:mgs_write_log_param()) Permanently deactivating oak-OST0000

On already mounted clients, lctl dl shows the OBD status inactive:

[root@oak-rbh01 ~]# lctl dl | grep oak-OST0000
  9 IN osc oak-OST0000-osc-ffff9125e10c3800 f532ae1d-6c67-fa34-deaa-5a130b24844f 4

Also, lfs df works as expected for already mounted clients:

[root@oak-rbh01 ~]# lfs df -v  /oak | grep OST0000
OST0000             : inactive device

However, after a client remount with Lustre 2.12.6, we observed the following:

  • the OBD state as reported by lctl dl comes back to UP instead of IN
    [root@oak-h01v10 ~]# lctl dl | grep oak-OST0000
      9 UP osc oak-OST0000-osc-ffff9c5b6b90f800 523b8803-837d-acf8-a8e6-aae2d47585ac 3
  • the OSC state, however, is properly set to 0
    [root@oak-h01v10 ~]# cat /sys/fs/lustre/osc/oak-OST0000-osc-ffff9c5b6b90f800/active
    0
  • lfs check osts reports the following error:
    [root@oak-h01v10 ~]# lfs check osts
    lfs check: error: check 'oak-OST0000-osc-ffff9c5b6b90f800': Cannot allocate memory (12)
    ...
    
  • lfs df shows the following error for the permanently deactivated OST:
    OST0000             : Invalid argument
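
The inconsistency listed above (UP in lctl dl while the OSC active file reads 0) can be checked for across all OSC devices on a client. This is a rough sketch under stated assumptions: the function names are hypothetical, the line layout is inferred only from the lctl dl output quoted in this report, and the sysfs path is the one shown above.

```python
import re
from pathlib import Path

# Matches lines such as:
#   9 UP osc oak-OST0000-osc-ffff9c5b6b90f800 523b8803-... 3
DL_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$")


def parse_osc_devices(dl_output):
    """Return {device_name: state} for every osc device in `lctl dl` output."""
    devices = {}
    for line in dl_output.splitlines():
        match = DL_LINE.match(line)
        if match and match.group(3) == "osc":
            devices[match.group(4)] = match.group(2)
    return devices


def find_mismatches(dl_output, read_active):
    """List osc devices reported UP although their `active` value is 0.

    read_active is a callable mapping a device name to the text of its
    `active` file, so the comparison can be exercised without a Lustre client.
    """
    mismatches = []
    for name, state in parse_osc_devices(dl_output).items():
        if state == "UP" and read_active(name).strip() == "0":
            mismatches.append(name)
    return mismatches


def read_active_from_sysfs(name, root="/sys/fs/lustre/osc"):
    """Read the per-device `active` file at the path shown in this report."""
    return (Path(root) / name / "active").read_text()
```

On a client, the captured text of lctl dl would be passed as `dl_output` and `read_active_from_sysfs` as `read_active`. A non-empty result names the devices showing the behavior described here.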
    

I'm attaching client logs of a remounting client. We can see that the OST is disabled:

00020000:01000000:0.0:1612805236.515621:0:2155:0:(lov_obd.c:166:lov_connect_obd()) not connecting OSC oak-OST0000_UUID; administratively disabled
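
To see which OSTs a remounting client skipped, the debug log can be scanned for this message. A minimal sketch, assuming the message text is exactly as quoted above; the function name `disabled_osts` is hypothetical.

```python
import re

# Message text taken from the lov_connect_obd() line quoted above.
DISABLED = re.compile(r"not connecting OSC (\S+)_UUID; administratively disabled")


def disabled_osts(log_lines):
    """Return the OST names, in first-seen order, that the client skipped."""
    seen = []
    for line in log_lines:
        match = DISABLED.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Passing an open file object for the attached oak-h01v10-client-dk.log as `log_lines` would list each administratively disabled OST once.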

It looks like the OBD status is not updated properly at mount time, and this seems to be causing the confusion. Ideally, we would like to see the same behavior after a client remount (IN in lctl dl, and lfs df -v showing inactive device). Any ideas on how best to fix or improve this? Thanks!



 Comments   
Comment by Andreas Dilger [ 08/Feb/21 ]

Stephane,
it is possible to permanently remove/deactivate the configuration records for those OSTs from the config record itself, so that they are no longer even present in "lctl dl", rather than being present but inactive. Please see instructions in LU-7668, which I've just updated to have examples. If that process works for you, I can add this information into the manual, though it would be better in the long term to actually implement the logic for "lctl del_ost" as described in that ticket.

Comment by Stephane Thiell [ 08/Feb/21 ]

Thanks Andreas. I will try and report back!

Comment by Stephane Thiell [ 09/Feb/21 ]

Andreas,

Using lctl llog_cancel seems to work on my test system. We haven't tried it on Oak yet, though.

I've also pushed a patch with a proposal for lctl del_ost as described in LU-7668. I'm happy to improve it and add some tests if you think this makes sense.

Comment by Andreas Dilger [ 10/May/23 ]

The "lctl del_ost" command was included in Lustre 2.15 via LU-7668.
