[LU-10756] Send Uevents for interesting Lustre changes Created: 02/Mar/18 Updated: 13/Jul/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Joe Grund | Assignee: | James A Simmons |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None |
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
For applications that manage / monitor Lustre, it would be useful if Lustre sent Uevents for interesting changes. An incomplete and non-exhaustive list:
|
| Comments |
| Comment by James A Simmons [ 02/Mar/18 ] |
|
Hi! I'm working on this right now. Please see LU-8066. Below are the sub-tickets:
LU-8066 : In order for this to work each Lustre subsystem needs a sysfs kobject.
SUBSYSTEM=="lustre", ACTION=="add", DEVPATH=="*lov*",
ATTR{stripecount}="4"
Once LNet moves to sysfs in 2.12 you can, if done right, configure LNet the
LU-10756 - send udev for client import state changes. I have a patch at
SUBSYSTEM=="lustre", ACTION=="change", ENV{PARAM}=="?*",
RUN+="/usr/sbin/lctl set_param $env{PARAM}=$env{SETTING}"
This is the default but in reality you can run anything for RUN
LU-9667 - This covers the move of LNet to sysfs. I have discussed
This is what is on the table so far. I expected more things to be requested once this functionality |
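As an illustration of "you can run anything for RUN", here is a minimal sketch of a handler that could stand in for the direct lctl call. It assumes udev passes the event's properties (PARAM and SETTING, as in the rule above) in the environment of the RUN program. The function name `build_set_param_command` is hypothetical, and the sketch only builds and prints the command rather than executing it.

```python
import os


def build_set_param_command(env):
    """Return the argv list for `lctl set_param`, or None if PARAM/SETTING are missing.

    `env` is any mapping of uevent properties; the udev rule quoted above
    only fires when PARAM is non-empty, so that is checked here as well.
    """
    param = env.get("PARAM")
    setting = env.get("SETTING")
    if not param or setting is None:
        return None
    return ["/usr/sbin/lctl", "set_param", f"{param}={setting}"]


if __name__ == "__main__":
    # When started from a udev RUN rule, the properties are assumed to be
    # available in the process environment.
    print(build_set_param_command(os.environ))
```

A monitoring tool could log or audit the command at this point before deciding whether to apply it.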
| Comment by James A Simmons [ 21/May/18 ] |
|
I have a working client state patch: https://review.whamcloud.com/#/c/31407. The question is what we want transmitted in the uevent. So far we have, for example:
change@/fs/lustre/mdc
ACTION=change
DEVPATH=/fs/lustre/mdc
SUBSYSTEM=lustre
IMPORT=lustre-MDT0001_UUID
STATE=REPLAY_WAIT
SEQNUM=4622
Anything else to add? Perhaps the obd device, such as lustre-MDT0000-mdc-ffff88105dbc1000, should be transmitted as well. |
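For a monitoring application, consuming the example event above could look roughly like the following sketch. It only handles the text form quoted in the comment: an `action@devpath` header followed by KEY=VALUE fields, separated by whitespace or NUL bytes. The function name `parse_uevent` is hypothetical, and values containing spaces are not handled.

```python
def parse_uevent(text):
    """Split a uevent into its 'action@devpath' header and KEY=VALUE properties."""
    header = None
    props = {}
    # Treat NUL separators like whitespace, then walk the fields.
    for token in text.replace("\0", "\n").split():
        if "=" in token:
            key, _, value = token.partition("=")
            props[key] = value
        elif header is None:
            header = token
    return header, props


sample = (
    "change@/fs/lustre/mdc ACTION=change DEVPATH=/fs/lustre/mdc "
    "SUBSYSTEM=lustre IMPORT=lustre-MDT0001_UUID STATE=REPLAY_WAIT SEQNUM=4622"
)
header, props = parse_uevent(sample)
print(header, props["IMPORT"], props["STATE"], int(props["SEQNUM"]))
```

Keeping the properties in a plain dictionary means fields added later, such as the obd device name or a timestamp, would be picked up without changing the parser.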
| Comment by Nathan Rutman [ 26/Sep/18 ] |
|
^^ timestamp? |
| Comment by James A Simmons [ 29/Sep/18 ] |
|
Adding a timestamp for the import state is a good idea. Currently the only uevents sent are for lctl set_param -P, and those don't include a timestamp. Should they? Also, is a timestamp in seconds good enough? |
| Comment by Nathan Rutman [ 02/Oct/18 ] |
|
Timestamp for all records; it can just be the time the record shows up at the server. Can all lctl commands generate a uevent? That might be a good way to audit changes. |
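The "time it shows up" option could be handled entirely on the receiving side. The sketch below stamps an already-parsed event with a second-precision receive time. The field name `RECEIVED` and the function name `stamp_event` are hypothetical and not part of any Lustre uevent.

```python
import time


def stamp_event(props, now=None):
    """Return a copy of props with a second-precision RECEIVED field added.

    `now` can be passed explicitly (seconds since the epoch) for testing;
    otherwise the current wall-clock time is used.
    """
    stamped = dict(props)
    stamped["RECEIVED"] = int(time.time() if now is None else now)
    return stamped


event = {"IMPORT": "lustre-MDT0001_UUID", "STATE": "REPLAY_WAIT"}
print(stamp_event(event))
```

A receive-side stamp records when the monitor saw the event, not when the state change happened, so it is only a substitute where that difference is acceptable.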
| Comment by Gerrit Updater [ 11/Jul/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35463 |
| Comment by Gerrit Updater [ 11/Jul/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35464 |
| Comment by James A Simmons [ 11/Jul/19 ] |
|
Started this work back up. Sorry I didn't reply earlier, Nathan. I did add a second-precision timestamp to the lctl conf_param events. For the upcoming import state change events, second-precision timestamps are also available. If nanosecond timestamps are needed, let me know. To be honest, uevents are not designed to be sent by the thousands per second, so I doubt nanoseconds are needed. Which lctl commands are you thinking of? Also, at this point the sysfs LNet work under LU-9667 will provide the framework to send network events, especially now that LNet health has landed. |
| Comment by Gerrit Updater [ 17/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35463/ |
| Comment by Gerrit Updater [ 15/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35464/ |
| Comment by Gerrit Updater [ 15/Aug/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35795 |
| Comment by Gerrit Updater [ 04/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35795/ |
| Comment by Gerrit Updater [ 03/Feb/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37405 |
| Comment by Mikhail Pershin [ 03/Feb/20 ] |
|
An explanation of the new patch: the original code in IMPORT_SET_STATE_NOLOCK() checked imp->imp_state != LUSTRE_IMP_CLOSED before applying the new state, which prevented a closed import from leaving the closed state. The new code instead checks the 'state' parameter, which is not the current import state but the new state to be set. So the new code does the opposite: instead of keeping the 'closed' state forever, it prevents the import state from ever becoming LUSTRE_IMP_CLOSED, so the import stays in the FULL state until it is destroyed, I suppose. The patch above restores the original logic. I found this by noticing the following errors shortly after a client remount:
[ 1139.774868] LustreError: 25570:0:(ldlm_lockd.c:716:ldlm_handle_ast_error()) ### client (nid 10.9.3.117@tcp) returned error from blocking AST (req@ffff960abad43180 x1657226243021504 status -107 rc -107), evict it ns: mdt-lustre-MDT0000_UUID lock: ffff960aba9a2240/0x60032f478e4387c8 lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.9.3.117@tcp remote: 0xb16e71cea23ecf65 expref: 18639 pid: 13354 timeout: 2039 lvb_type: 0
[ 1139.783656] LustreError: 138-a: lustre-MDT0000: A client on nid 10.9.3.117@tcp was evicted due to a lock blocking callback time out: rc -107
[ 1139.791100] LustreError: 13344:0:(ldlm_lockd.c:259:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 10.9.3.117@tcp ns: mdt-lustre-MDT0000_UUID lock: ffff960aba9a2240/0x60032f478e4387c8 lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.9.3.117@tcp remote: 0xb16e71cea23ecf65 expref: 18601 pid: 13354 timeout: 0 lvb_type: 0
After a remount there is an old stale export on the server with many locks to be canceled in the background, and some of them can block new locks from the just-mounted client.
Normally such locks shouldn't cause an AST to be sent to a client, but although the stale export is disconnected, its reverse import was not set to LUSTRE_IMP_CLOSED as needed and remained in the FULL state, so the AST was sent, causing all these errors. I don't know of other possible side effects, but there may be some. |
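The difference between the two checks described in this comment can be shown with a toy model. This is a Python sketch of the logic only, not the Lustre macro itself. The state names are plain strings standing in for the LUSTRE_IMP_* constants, and both function names are hypothetical.

```python
CLOSED, FULL = "CLOSED", "FULL"


def set_state_original(current, new):
    """Original check: test the current state, so a closed import stays closed."""
    return new if current != CLOSED else current


def set_state_regressed(current, new):
    """Regressed check: test the requested state, so CLOSED is never applied."""
    return new if new != CLOSED else current


# Closing a FULL import: the original logic closes it, the regressed logic does not.
print(set_state_original(FULL, CLOSED), set_state_regressed(FULL, CLOSED))
# Reopening a CLOSED import: the original logic refuses, the regressed logic allows it.
print(set_state_original(CLOSED, FULL), set_state_regressed(CLOSED, FULL))
```

The first case is the one behind the errors above: with the regressed check, the reverse import of a disconnected export never reaches the closed state.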
| Comment by James A Simmons [ 03/Feb/20 ] |
|
That was my bad. I was attempting to collect debug info even when the import entered a closed state. |
| Comment by Mikhail Pershin [ 04/Feb/20 ] |
|
I see, but that will still be in the debug log: imp_state is not yet IMP_CLOSED at that point, so setting the 'CLOSED' state is logged. As far as I can see, the only skipped cases are attempts to change the state of an already closed import. That case can be logged with a separate message under the same check imp->imp_state == LUSTRE_IMP_CLOSED, e.g.
if (imp->imp_state == LUSTRE_IMP_CLOSED) {
        CDEBUG(D_HA, "%p %s: attempt to change closed import state to %s\n",
               imp, obd2cli_tgt(imp->imp_obd), ptlrpc_import_state_name(state)); |
| Comment by Gerrit Updater [ 20/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37405/ |
| Comment by Gerrit Updater [ 15/May/20 ] |
|
Sebastien Piechurski (sebastien.piechurski@atos.net) uploaded a new patch: https://review.whamcloud.com/38621 |
| Comment by Gerrit Updater [ 29/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38621/ |