[LU-10756] Send Uevents for interesting Lustre changes Created: 02/Mar/18  Updated: 13/Jul/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Joe Grund Assignee: James A Simmons
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Blocker
is blocked by LU-12362 kernel warning 'do not call blocking ... Resolved
Related
is related to LU-9431 class_process_proc_param can't handle... Resolved
is related to LU-9667 LNet Kernel/Userspace Interface Open
is related to LU-9120 LNet Network Health Feature Resolved
is related to LU-8066 Move lustre procfs handling to sysfs ... Open
is related to LU-7004 fix "lctl set_param -P" to allow depr... Resolved
is related to LU-12564 ptlrpcd daemon sleeps while holding i... Resolved
is related to LU-12362 kernel warning 'do not call blocking ... Resolved
is related to LUDOC-420 Documentation on using uevents for Lu... Open
is related to LU-8609 connect client health_check file to c... Open
is related to LU-10599 Possible for ID_FS_TYPE to be lustre ... Open

 Description   

For applications that manage or monitor Lustre, it would be useful if Lustre sent uevents for interesting changes. A non-exhaustive list:

  • target mount / umount
  • Lustre health
  • nid / LNet changes
  • evictions
  • lbugs
  • network timeouts
  • recovery status changes

 Comments   
Comment by James A Simmons [ 02/Mar/18 ]

Hi!

I'm working on this right now. Please see LU-8066. Below are the sub-tickets:

LU-8066 : In order for this to work each Lustre subsystem needs a sysfs kobject.
This is true for every subsystem except sptlrpc. The sptlrpc sysfs port
will happen in 2.12. While we have a sysfs tree we don't have the actual
tunables migrated over yet. That will happen in the 2.12 time frame. When
that does happen you will be able to configure lustre using udev rules as well.
Here is an example of what can be done with 2.12:

     SUBSYSTEM=="lustre", ACTION=="add", DEVPATH=="*lov*", \
     ATTR{stripecount}="4"

Once LNet moves to sysfs in 2.12 you can, if done right, configure LNet the
same way.

LU-7004 : This is to make lctl set_param -P work. Combined with LU-9431, this
sends out udev events when tunable (i.e. procfs/sysfs/debugfs) changes
are requested en masse via the MGS.

LU-10756 - send udev events for client import state changes. I have a patch at
https://review.whamcloud.com/#/c/31407. This covers client evictions
plus other client state events.

LU-9431 - send udev events when changing tunables via sysfs/debugfs. This is
set to land for lustre 2.11 in the next batch. This only works with
lctl set_param -P ... on the MGS. You can see the udev rule:

     SUBSYSTEM=="lustre", ACTION=="change", ENV{PARAM}=="?*", \
     RUN+="/usr/sbin/lctl set_param $env{PARAM}=$env{SETTING}"

This is the default, but in reality you can run anything for RUN.
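
Since RUN can invoke any program, a custom handler could replace the direct lctl call. udev passes the event properties to RUN programs through the environment, so a handler can read PARAM and SETTING from there. The following is only a minimal Python sketch under that assumption; the function name build_set_param_command and the parameter-name check are hypothetical and not part of Lustre:

```python
import re

# Hypothetical allow-list for parameter names: dot-separated segments made
# of word characters, '*' and '-', for example "osc.*.max_dirty_mb".
_PARAM_RE = re.compile(r"[\w*-]+(\.[\w*-]+)*")


def build_set_param_command(env):
    """Build the argv for 'lctl set_param' from udev event properties.

    'env' is a mapping such as os.environ. Returns None when the event
    does not carry a usable PARAM/SETTING pair.
    """
    param = env.get("PARAM", "")
    setting = env.get("SETTING")
    if not param or setting is None:
        return None
    # Reject names that do not look like a Lustre parameter path.
    if not _PARAM_RE.fullmatch(param):
        return None
    # An argv list avoids shell interpretation of the values.
    return ["/usr/sbin/lctl", "set_param", f"{param}={setting}"]
```

A handler script would call build_set_param_command(os.environ) and, if the result is not None, pass it to subprocess.call(). Building an argument list rather than a shell string keeps unexpected characters in SETTING from being interpreted by a shell.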

LU-9667 - This covers the move of LNet to sysfs. I have already
discussed using uevents with the LNet developers.

This is what is on the table so far. I expect more things to be requested once this functionality
starts to show up. Feel free to try out my patches.

Comment by James A Simmons [ 21/May/18 ]

I have a working client state patch: https://review.whamcloud.com/#/c/31407. The question is what we want transmitted in the uevent. So far we have, for example:

     change@/fs/lustre/mdc
     ACTION=change
     DEVPATH=/fs/lustre/mdc
     SUBSYSTEM=lustre
     IMPORT=lustre-MDT0001_UUID
     STATE=REPLAY_WAIT
     SEQNUM=4622

Is there anything else to add? Perhaps the obd device, such as lustre-MDT0000-mdc-ffff88105dbc1000, should be transmitted as well.
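
For monitoring applications, a kernel uevent arrives on the netlink socket as an "action@devpath" header followed by NUL-separated KEY=VALUE fields. The following Python sketch shows how the sample above could be decoded. It assumes the raw kernel format (not the udev-rebroadcast format), and it assumes the IMPORT and STATE keys stay as shown in the sample, which may change before the patch lands. The function names are illustrative only:

```python
def parse_uevent(payload: bytes) -> dict:
    """Decode a raw kernel uevent message into a dict of its properties."""
    fields = [f for f in payload.split(b"\0") if f]
    if not fields or b"@" not in fields[0]:
        raise ValueError("not a kernel uevent message")
    action, _, devpath = fields[0].decode().partition("@")
    event = {}
    for field in fields[1:]:
        key, sep, value = field.decode().partition("=")
        if sep:  # skip anything that is not KEY=VALUE
            event[key] = value
    # Fall back to the header if ACTION/DEVPATH were not repeated as fields.
    event.setdefault("ACTION", action)
    event.setdefault("DEVPATH", devpath)
    return event


def is_import_state_change(event: dict) -> bool:
    """True for Lustre change events that carry an import name and state."""
    return (
        event.get("SUBSYSTEM") == "lustre"
        and event.get("ACTION") == "change"
        and "IMPORT" in event
        and "STATE" in event
    )


# The sample event from this comment, encoded as it would appear on the wire.
SAMPLE = (
    b"change@/fs/lustre/mdc\0ACTION=change\0DEVPATH=/fs/lustre/mdc\0"
    b"SUBSYSTEM=lustre\0IMPORT=lustre-MDT0001_UUID\0STATE=REPLAY_WAIT\0"
    b"SEQNUM=4622\0"
)
```

Calling parse_uevent(SAMPLE) yields a dict from which a monitor can read the IMPORT and STATE values. Any additional field, such as the obd device name, would appear as one more key without changing the parser.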

Comment by Nathan Rutman [ 26/Sep/18 ]

^^

timestamp?

Comment by James A Simmons [ 29/Sep/18 ]

Adding a timestamp for the import state is a good idea. Currently the only uevents sent are for lctl set_param -P, and they don't include a timestamp. Should they? Also, is a timestamp in seconds good enough?

Comment by Nathan Rutman [ 02/Oct/18 ]

timestamp for all records; can just be the time it shows up at the server.

Can all lctl's generate a uevent? Might be a good way to audit changes.

Comment by Gerrit Updater [ 11/Jul/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35463
Subject: LU-10756 ptlrpc: change IMPORT_SET_* macros into real functions
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f26400387d1150005486ad70762dc767d71a303e

Comment by Gerrit Updater [ 11/Jul/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35464
Subject: LU-10756 osp: properly order sysfs registeration
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ceebf58486dea07f9fe3e4de03f301a69c96cd84

Comment by James A Simmons [ 11/Jul/19 ]

Started this work back up. Sorry I didn't reply earlier, Nathan. I added a second-precision timestamp to the lctl conf_param events. Second-precision timestamps will also be available for the upcoming import state change events. If nanosecond timestamps are needed, let me know. To be honest, uevents are not designed to be sent by the thousands per second, so I doubt nanoseconds are needed. Which lctl commands are you thinking of? Also, at this point the sysfs LNet work under LU-9667 will provide the framework to send network events, especially now that LNet health has landed.

Comment by Gerrit Updater [ 17/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35463/
Subject: LU-10756 ptlrpc: change IMPORT_SET_* macros into real functions
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf78502e48d6dbbc0d6c113e573ba9c68c5c311e

Comment by Gerrit Updater [ 15/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35464/
Subject: LU-10756 osp: properly order sysfs registration
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2f16681d68650c0c834c9af3e05c8ed98f481d1d

Comment by Gerrit Updater [ 15/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35795
Subject: LU-10756 ptlrpc: change IMPORT_SET_* macros into real functions
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 62fb72ce1955d7086c3555ddc41d9cf75441f67b

Comment by Gerrit Updater [ 04/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35795/
Subject: LU-10756 ptlrpc: change IMPORT_SET_* macros into real functions
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 1cdc366dca7cf2a97e02de68f865f539dd58da85

Comment by Gerrit Updater [ 03/Feb/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37405
Subject: LU-10756 ptlrpc: fix IMP_CLOSED state is being never set
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: baca205ac7c8f4cf8a859af84d2b5862059d2e32

Comment by Mikhail Pershin [ 03/Feb/20 ]

Explanation of the new patch: the original code in IMPORT_SET_STATE_NOLOCK() checked imp->imp_state != LUSTRE_IMP_CLOSED before applying the new state, thereby preventing a closed import from changing its closed state. The new code instead checks the 'state' parameter, which is not the current import state but the new state to be set. So the new code does the opposite: instead of keeping the 'closed' state forever, it prevents the import state from becoming LUSTRE_IMP_CLOSED, so the import stays in FULL state until destroyed, I suppose. The patch above restores the original logic.
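
The difference between the two checks can be illustrated with a small Python model. This is a simplified sketch, not the Lustre code: the Import class, the ImpState enum and both function names are hypothetical, and only the states mentioned in this ticket are modeled.

```python
from enum import Enum


class ImpState(Enum):
    CLOSED = "CLOSED"
    REPLAY_WAIT = "REPLAY_WAIT"
    FULL = "FULL"


class Import:
    """Toy stand-in for an import; holds only its current state."""

    def __init__(self, state):
        self.state = state


def set_state_original(imp, new_state):
    # Original logic: test the CURRENT state, so a closed import stays closed.
    if imp.state != ImpState.CLOSED:
        imp.state = new_state


def set_state_regressed(imp, new_state):
    # Regressed logic: test the NEW state, so CLOSED can never be applied.
    if new_state != ImpState.CLOSED:
        imp.state = new_state


# With the regressed check, the import never leaves FULL.
stale = Import(ImpState.FULL)
set_state_regressed(stale, ImpState.CLOSED)

# With the original check, it closes and then stays closed.
fixed = Import(ImpState.FULL)
set_state_original(fixed, ImpState.CLOSED)
set_state_original(fixed, ImpState.FULL)
```

In this model, stale.state remains FULL after the close attempt, while fixed.state is CLOSED and ignores the later request to return to FULL.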

I found this by noticing the following errors shortly after a client remount:

[ 1139.774868] LustreError: 25570:0:(ldlm_lockd.c:716:ldlm_handle_ast_error()) ### client (nid 10.9.3.117@tcp) returned error from blocking AST (req@ffff960abad43180 x1657226243021504 status -107 rc -107), evict it ns: mdt-lustre-MDT0000_UUID lock: ffff960aba9a2240/0x60032f478e4387c8 lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.9.3.117@tcp remote: 0xb16e71cea23ecf65 expref: 18639 pid: 13354 timeout: 2039 lvb_type: 0
[ 1139.783656] LustreError: 138-a: lustre-MDT0000: A client on nid 10.9.3.117@tcp was evicted due to a lock blocking callback time out: rc -107
[ 1139.791100] LustreError: 13344:0:(ldlm_lockd.c:259:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 10.9.3.117@tcp  ns: mdt-lustre-MDT0000_UUID lock: ffff960aba9a2240/0x60032f478e4387c8 lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.9.3.117@tcp remote: 0xb16e71cea23ecf65 expref: 18601 pid: 13354 timeout: 0 lvb_type: 0

After the remount there is an old stale export on the server which has a lot of locks to be canceled in the background, and some of them can block new locks from the just-mounted client. Normally such locks shouldn't cause an AST to be sent to a client, but when the stale export was disconnected, its reverse import was not set to LUSTRE_IMP_CLOSED as needed and remained in FULL state, so the AST was sent, causing all these errors. I don't know about other possible side effects, but there may be some.

Comment by James A Simmons [ 03/Feb/20 ]

That was my bad. I was attempting to collect debug info even when the import entered a closed state.

Comment by Mikhail Pershin [ 04/Feb/20 ]

I see, but that will still be in the debug log: imp_state is not yet IMP_CLOSED at that point, so setting the 'CLOSED' state will be logged. As far as I can see, the only cases skipped are those where someone tries to change the state of an already closed import. That case can be added to the debug log by a separate message under the same check imp->imp_state == LUSTRE_IMP_CLOSED, e.g.:

if (imp->imp_state == LUSTRE_IMP_CLOSED) {
	CDEBUG(D_HA, "%p %s: attempt to change closed import state to %s\n",
	       imp, obd2cli_tgt(imp->imp_obd),
	       ptlrpc_import_state_name(state));
}

Comment by Gerrit Updater [ 20/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37405/
Subject: LU-10756 ptlrpc: fix IMP_CLOSED state is being never set
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 43dddbd0785d4da14714390d802bf6ec65567350

Comment by Gerrit Updater [ 15/May/20 ]

Sebastien Piechurski (sebastien.piechurski@atos.net) uploaded a new patch: https://review.whamcloud.com/38621
Subject: LU-10756 ptlrpc: fix IMP_CLOSED state is being never set
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 34581dc976dbfaef708287898da4b1fb2fb4b44b

Comment by Gerrit Updater [ 29/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38621/
Subject: LU-10756 ptlrpc: fix IMP_CLOSED state is being never set
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: d22ae9251fe04d717aa0e323312879ba7e2ae3ae

Generated at Sat Feb 10 02:37:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.