[LU-12529] How to remove bad config values from MGS logs Created: 09/Jul/19  Updated: 07/Aug/19  Resolved: 07/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Mahmoud Hanafi Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre 2.10.


Issue Links:
Related
is related to LU-4939 Need to be able to sanely query and c... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

We have some bad config values.

Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'lov.nbp7-MDT0000-mdtlov.stripecount=1'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98110]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST0026-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98110]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST002a-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST0046-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST003e-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST003a-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST002e-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST0032-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST004e-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST0042-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST0052-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST004a-osc-MDT0000.active=0'' failed with exit code 2.
Jul  9 16:13:23 testpbs systemd-udevd[98107]: Process '/usr/sbin/lctl set_param 'osc.nbp7-OST0036-osc-MDT0000.active=0'' failed with exit code 2.
 

How do we remove these? I tried this:
lctl set_param -P -d "osc.nbp7-OST0036-osc-MDT0000.active"
but it didn't remove the entry from the config file.



 Comments   
Comment by Patrick Farrell (Inactive) [ 10/Jul/19 ]

Mahmoud,

Did you run that on the MGS?  If not, you must.

Additionally, I think you probably need something like:

lctl set_param -P -d osc.*.active 

As you probably didn't specify each OST individually when setting this up?

Comment by Mahmoud Hanafi [ 10/Jul/19 ]

That doesn't work. Here are the log entries for one of them:

nbp7-mds1 ~ # llog_reader /tmp/params | grep osc.nbp7-OST0026-osc-MDT0000.active
#68 (224)marker   2 (flags=0x01, v2.7.3.0) general         'osc.nbp7-OST0026-osc-MDT0000.active' Tue Jul 10 08:28:03 2018-
#69 (128)set_param 0:general  1:osc.nbp7-OST0026-osc-MDT0000.active=0  2:lctl  
#70 (224)END   marker   2 (flags=0x02, v2.7.3.0) general         'osc.nbp7-OST0026-osc-MDT0000.active' Tue Jul 10 08:28:03 2018-
#274 (224)SKIP START marker   5 (flags=0x05, v2.10.6.0) nbp7-OST-OST0026 'osc.nbp7-OST0026-osc-MDT0000.active' Tue Jul  9 15:48:34 2019-Tue Jul  9 16:02:23 2019
#275 (144)SKIP set_param 0:nbp7-OST-OST0026  1:osc.nbp7-OST0026-osc-MDT0000.active=0=  2:lctl  
#276 (224)SKIP END   marker   5 (flags=0x06, v2.10.6.0) nbp7-OST-OST0026 'osc.nbp7-OST0026-osc-MDT0000.active' Tue Jul  9 15:48:34 2019-Tue Jul  9 16:02:23 2019

Running this has no effect.

lctl set_param -P -d osc.nbp7-OST0026-osc-MDT0000.active
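
A minimal sketch of how the still-live record can be picked out of output like the above. It assumes that llog_reader marks cancelled records with SKIP (as in records #274 to #276) and that the leading #N field is the record index. The helper names sample_log and live_records are hypothetical, and the sample lines are copied from the output above.

```shell
# Sample llog_reader output, copied from this ticket.
sample_log() {
cat <<'EOF'
#68 (224)marker   2 (flags=0x01, v2.7.3.0) general         'osc.nbp7-OST0026-osc-MDT0000.active' Tue Jul 10 08:28:03 2018-
#69 (128)set_param 0:general  1:osc.nbp7-OST0026-osc-MDT0000.active=0  2:lctl
#70 (224)END   marker   2 (flags=0x02, v2.7.3.0) general         'osc.nbp7-OST0026-osc-MDT0000.active' Tue Jul 10 08:28:03 2018-
#274 (224)SKIP START marker   5 (flags=0x05, v2.10.6.0) nbp7-OST-OST0026 'osc.nbp7-OST0026-osc-MDT0000.active' Tue Jul  9 15:48:34 2019-Tue Jul  9 16:02:23 2019
#275 (144)SKIP set_param 0:nbp7-OST-OST0026  1:osc.nbp7-OST0026-osc-MDT0000.active=0=  2:lctl
#276 (224)SKIP END   marker   5 (flags=0x06, v2.10.6.0) nbp7-OST-OST0026 'osc.nbp7-OST0026-osc-MDT0000.active' Tue Jul  9 15:48:34 2019-Tue Jul  9 16:02:23 2019
EOF
}

# Print the index of every set_param record for parameter $1 that is
# not marked SKIP. Reads llog_reader output on stdin.
live_records() {
    awk -v param="$1" '
        /SKIP/ { next }                               # assumed: SKIP means cancelled
        /set_param/ && index($0, "1:" param "=") {
            idx = $1                                  # first field looks like "#69"
            sub(/^#/, "", idx)
            print idx
        }'
}

sample_log | live_records 'osc.nbp7-OST0026-osc-MDT0000.active'
# prints: 69
```

On the sample, only the 2018 record in the "general" section is reported; the 2019 record is already marked SKIP.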
Comment by Patrick Farrell (Inactive) [ 11/Jul/19 ]

Can you collect debug logs for this?

DEBUGMB=`lctl get_param -n debug_mb`
lctl set_param *debug=-1 debug_mb=10000
lctl clear
lctl mark "before"
lctl set_param -P -d osc.nbp7-OST0026-osc-MDT0000.active
#Write out the log
lctl dk > /tmp/log
#Set debug back to defaults
lctl set_param debug="super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck"
lctl set_param debug_mb=$DEBUGMB

And then also, separately, can you please collect an strace of the command?

strace -v lctl set_param -P -d osc.nbp7-OST0026-osc-MDT0000.active 
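
The capture steps above could be wrapped in a small function, sketched here under a few assumptions: the name capture_debug is hypothetical, it uses plain debug=-1 rather than the *debug pattern, and it restores the debug mask that was in effect beforehand instead of a hard-coded default list. It only uses lctl subcommands that already appear in this ticket (get_param, set_param, clear, mark, dk).

```shell
# capture_debug OUTFILE COMMAND [ARGS...]
# Run COMMAND with full Lustre debugging enabled, dump the kernel debug
# log to OUTFILE, then restore the previous debug settings.
capture_debug() {
    local out old_mb old_mask rc
    out=$1
    shift

    # Save the current settings so they can be restored afterwards.
    old_mb=$(lctl get_param -n debug_mb) || return 1
    old_mask=$(lctl get_param -n debug) || return 1

    lctl set_param debug=-1 debug_mb=10000
    lctl clear
    lctl mark "before"

    "$@"                      # the command under investigation
    rc=$?

    lctl dk > "$out"          # write out the log

    # Restore the saved mask and buffer size.
    lctl set_param debug="$old_mask"
    lctl set_param debug_mb="$old_mb"
    return $rc
}

# Example (run on the MGS):
#   capture_debug /tmp/log lctl set_param -P -d osc.nbp7-OST0026-osc-MDT0000.active
```

Restoring the saved values rather than a fixed list is a design choice of this sketch; it avoids overwriting a site-specific debug mask.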
Comment by Andreas Dilger [ 11/Jul/19 ]

You could also run on the MGS "lctl --device MGS llog_print nbp7-client" or "... params" to dump the client config llogs while the system is mounted instead of the "params" log. You can use "lctl --device MGS llog_cancel params 69" to cancel that record number.
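
As a cautious way to apply this to several records, the following sketch only prints the llog_cancel commands so they can be reviewed before being run on the MGS. The function name print_cancel_cmds is hypothetical; the command form is the one quoted above, and the log name ("params" or "nbp7-client") has to be the log that actually holds the record.

```shell
# print_cancel_cmds LOGNAME
# Read record indices on stdin (one per line) and print, but do not run,
# one "lctl llog_cancel" command per index.
print_cancel_cmds() {
    local log_name idx
    log_name=$1
    while read -r idx; do
        case "$idx" in
            ''|*[!0-9]*)
                # Refuse anything that is not a plain record number.
                echo "skipping non-numeric index: '$idx'" >&2
                continue
                ;;
        esac
        echo "lctl --device MGS llog_cancel $log_name $idx"
    done
}

printf '69\n' | print_cancel_cmds params
# prints: lctl --device MGS llog_cancel params 69
```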

Comment by Patrick Farrell (Inactive) [ 11/Jul/19 ]

Mahmoud,

Even if Andreas' suggestion works (you should definitely try it), could you please gather those logs I asked for?  They may be helpful in identifying a problem or problems.

Comment by Mahmoud Hanafi [ 11/Jul/19 ]

On nbp7 I already ran Andreas' command and it worked. But we have 2 other filesystems that also have bad records. On those I get:

nbp2-mds ~ # lctl --device MGS llog_print params
OBD_IOC_LLOG_PRINT failed: Inappropriate ioctl for device

They both are running server versions.
lctl --device MGS llog_print nbp2-client works.
By the way, with either the params or the client config log, if the number of records is large the output gets truncated and we get this error:

[3115652.139607] LustreError: 35937:0:(llog_ioctl.c:254:llog_print_cb()) not enough space for print log records
Comment by Patrick Farrell (Inactive) [ 11/Jul/19 ]

What do you mean by "server versions"?  These commands need to be run on the MGS.

Comment by Mahmoud Hanafi [ 11/Jul/19 ]

They are all running the same Lustre version.

 nbp7-mds1 ~ # modinfo lustre
filename:       /lib/modules/3.10.0-693.21.1.el7.20180508.x86_64.lustre2106/extra/lustre/fs/lustre.ko
license:        GPL
version:        2.10.6
description:    Lustre Client File System
author:         OpenSFS, Inc. <http://www.lustre.org/>
retpoline:      Y
rhelversion:    7.4
srcversion:     E459483EA54C83D0585ECA3
depends:        obdclass,ptlrpc,libcfs,lnet,lmv,mdc,lov
vermagic:       3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 SMP mod_unload modversions 


nbp2-mds ~ # modinfo lustre
filename:       /lib/modules/3.10.0-693.21.1.el7.20180508.x86_64.lustre2106/extra/lustre/fs/lustre.ko
license:        GPL
version:        2.10.6
description:    Lustre Client File System
author:         OpenSFS, Inc. <http://www.lustre.org/>
retpoline:      Y
rhelversion:    7.4
srcversion:     E459483EA54C83D0585ECA3
depends:        obdclass,ptlrpc,libcfs,lnet,lmv,mdc,lov
vermagic:       3.10.0-693.21.1.el7.20180508.x86_64.lustre2106 SMP mod_unload modversions 
Comment by Andreas Dilger [ 12/Jul/19 ]

mhanafi, the bad config records may appear in either the "params" log or in the "fsname-client" config log. The name of the logfile passed to "lctl llog_cancel" needs to match the log that holds the record.

If "llog_print" complains about the log file being too large, you can grab the lctl binary from 2.10.7 or later, or apply the patch https://review.whamcloud.com/3381 "LU-11566 utils: fix lctl llog_print for large configs". This is purely a userspace problem and can be solved with an updated lctl binary, no need to update the kernel modules or restart.

Comment by Mahmoud Hanafi [ 13/Jul/19 ]

What about the Inappropriate ioctl error?

 nbp2-mds ~ # lctl --device MGS llog_print params 
OBD_IOC_LLOG_PRINT failed: Inappropriate ioctl for device

 


 brk(0x67d000)                           = 0x67d000
brk(NULL)                               = 0x67d000
open("/dev/obd", O_RDWR)                = 3
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x66, 0x7f, 0x08), 0x7fffffffc9f0) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x66, 0xc0, 0x08), 0x7fffffffc9f0) = -1 ENOTTY (Inappropriate ioctl for device)
write(2, "OBD_IOC_LLOG_PRINT failed: Inapp"..., 58OBD_IOC_LLOG_PRINT failed: Inappropriate ioctl for device
) = 58
rt_sigaction(SIGINT, {0x405ef0, ~[RTMIN RT_1], SA_RESTORER|SA_RESTART, 0x7fffed6d65d0}, NULL, 8) = 0
exit_group(1)                           = ?
+++ exited with 1 +++
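
For reading the failing call in the trace, this sketch recomputes the request number from the _IOC(...) fields that strace printed. It assumes the generic Linux ioctl layout used on x86_64 (nr in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31, with _IOC_WRITE=1 and _IOC_READ=2); the function name ioc is hypothetical.

```shell
# ioc DIR TYPE NR SIZE: same argument order as strace's _IOC(...) output.
ioc() {
    printf '0x%X\n' $(( ($1 << 30) | ($4 << 16) | ($2 << 8) | $3 ))
}

IOC_WRITE=1
IOC_READ=2

# The call that returned ENOTTY:
#   _IOC(_IOC_READ|_IOC_WRITE, 0x66, 0xc0, 0x08)
ioc $((IOC_READ | IOC_WRITE)) 0x66 0xc0 0x08
# prints: 0xC00866C0

# The type byte as a character and the command number in decimal.
printf "type char: \\$(printf '%03o' 0x66)\n"
# prints: type char: f
echo "command nr: $((0xc0))"
# prints: command nr: 192
```

So the rejected request is read/write command 192 of ioctl type 'f' with an 8-byte argument, which lctl itself reports as OBD_IOC_LLOG_PRINT. ENOTTY is the conventional error for a request the target does not handle.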
Comment by Andreas Dilger [ 18/Jul/19 ]

What about the Inappropriate ioctl error?

nbp2-mds ~ # lctl --device MGS llog_print params 
OBD_IOC_LLOG_PRINT failed: Inappropriate ioctl for device

I tested this out locally on my 2.10 system and got the same error. I thought that llog_print was working for the params file, and there is even a regression test for this (conf-sanity.sh test_123ab()), so it isn't clear why this isn't working.

Comment by Andreas Dilger [ 18/Jul/19 ]

It looks like the MGS needs patch https://review.whamcloud.com/34250:

LU-4939 obdclass: llog_print params file
    
    Allow llog_print to handle the params file in yaml

which was landed to b2_10 as v2_10_6-45-gfb77f09ac8, so it would be included into 2.10.7 and later. My home/test server is running only 2.10.5.
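
A quick way to check whether a given server version is new enough for this patch, sketched under the assumption that GNU "sort -V" is available. The function name version_ge is hypothetical, and the 2.10.7 threshold is the one stated above. The installed version could be taken from the "version:" field of the modinfo output shown earlier in this ticket.

```shell
# version_ge A B: succeed if version A is greater than or equal to B.
# Relies on version-aware sorting: B must sort first (or be equal).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

for v in 2.10.5 2.10.6 2.10.7; do
    if version_ge "$v" 2.10.7; then
        echo "$v: llog_print of the params log should be supported"
    else
        echo "$v: older than 2.10.7, patch 34250 is not included"
    fi
done
```

A site build that carries the patch as a backport would be reported as too old by this check, so the result is only a hint.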

Comment by Andreas Dilger [ 07/Aug/19 ]

The various issues in this ticket have been resolved, or are fixed in a newer release.
