
Send Uevents for interesting Lustre changes

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

    Description

      For applications that manage or monitor Lustre, it would be useful if Lustre sent Uevents for interesting changes. A non-exhaustive list:

      • target mount / umount
      • Lustre health
      • nid / LNet changes
      • evictions
      • lbugs
      • network timeouts
      • recovery status changes
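
A monitoring application would receive such events as standard kernel uevent payloads: a sequence of NUL-terminated strings, the first being "action@devpath" and the rest "KEY=value" pairs. The following is only a minimal sketch of how a consumer might look up one key in such a payload. The helper name uevent_get and the LUSTRE_EVENT key in the usage note are hypothetical; this issue does not define which keys Lustre would emit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up `key` in a uevent payload of `len` bytes.
 * The payload is a run of NUL-terminated strings: "action@devpath"
 * followed by "KEY=value" fields. Returns a pointer to the value
 * inside `buf`, or NULL if the key is absent.
 * uevent_get is a hypothetical helper, not part of any Lustre API.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	assert(buf != NULL && key != NULL);

	size_t klen = strlen(key);
	size_t off = 0;

	while (off < len) {
		const char *field = buf + off;
		const char *end = memchr(field, '\0', len - off);

		if (end == NULL)
			break;	/* unterminated trailing field: ignore it */

		size_t flen = (size_t)(end - field);

		/* Match only "KEY=" at the very start of the field. */
		if (flen > klen && field[klen] == '=' &&
		    strncmp(field, key, klen) == 0)
			return field + klen + 1;

		off += flen + 1;
	}
	return NULL;
}
```

For a payload such as "change@/devices/virtual/lustre", "ACTION=change", "LUSTRE_EVENT=eviction" (device path and LUSTRE_EVENT are made up for illustration), uevent_get(buf, len, "LUSTRE_EVENT") would return a pointer to "eviction".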
         
         
         

          Activity

            [LU-10756] Send Uevents for interesting Lustre changes

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38621/
            Subject: LU-10756 ptlrpc: fix IMP_CLOSED state is being never set
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: d22ae9251fe04d717aa0e323312879ba7e2ae3ae

            gerrit Gerrit Updater added a comment -

            Sebastien Piechurski (sebastien.piechurski@atos.net) uploaded a new patch: https://review.whamcloud.com/38621
            Subject: LU-10756 ptlrpc: fix IMP_CLOSED state is being never set
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 34581dc976dbfaef708287898da4b1fb2fb4b44b

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37405/
            Subject: LU-10756 ptlrpc: fix IMP_CLOSED state is being never set
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 43dddbd0785d4da14714390d802bf6ec65567350

            tappro Mikhail Pershin added a comment - edited

            I see, but that transition will still be logged: imp_state is not yet IMP_CLOSED at that point, so setting the 'CLOSED' state will appear in the debug log. As far as I can see, the only cases skipped are those where someone tries to change the state of an already closed import. Those cases can be logged with a separate message under the same check imp->imp_state == LUSTRE_IMP_CLOSED, e.g.

            if (imp->imp_state == LUSTRE_IMP_CLOSED) {
                    CDEBUG(D_HA, "%p %s: attempt to change closed import state to %s\n",
            	       imp, obd2cli_tgt(imp->imp_obd),
            	       ptlrpc_import_state_name(state));
            

            simmonsja James A Simmons added a comment -

            That was my bad. I was attempting to collect debug info even when the import entered the closed state.

            tappro Mikhail Pershin added a comment -

            An explanation of the new patch: the original code in IMPORT_SET_STATE_NOLOCK() checked imp->imp_state != LUSTRE_IMP_CLOSED before applying the new state, which prevented a closed import from leaving its closed state. The new code instead checks the 'state' parameter, which is not the current import state but the new state to be set. So the new code does the opposite: instead of keeping the 'closed' state forever, it prevents the import state from ever becoming LUSTRE_IMP_CLOSED, so the import stays in FULL state until it is destroyed, I suppose. The patch above restores the original logic.

            I found this by noticing the following errors shortly after a client remount:

            [ 1139.774868] LustreError: 25570:0:(ldlm_lockd.c:716:ldlm_handle_ast_error()) ### client (nid 10.9.3.117@tcp) returned error from blocking AST (req@ffff960abad43180 x1657226243021504 status -107 rc -107), evict it ns: mdt-lustre-MDT0000_UUID lock: ffff960aba9a2240/0x60032f478e4387c8 lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.9.3.117@tcp remote: 0xb16e71cea23ecf65 expref: 18639 pid: 13354 timeout: 2039 lvb_type: 0
            [ 1139.783656] LustreError: 138-a: lustre-MDT0000: A client on nid 10.9.3.117@tcp was evicted due to a lock blocking callback time out: rc -107
            [ 1139.791100] LustreError: 13344:0:(ldlm_lockd.c:259:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 10.9.3.117@tcp  ns: mdt-lustre-MDT0000_UUID lock: ffff960aba9a2240/0x60032f478e4387c8 lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.9.3.117@tcp remote: 0xb16e71cea23ecf65 expref: 18601 pid: 13354 timeout: 0 lvb_type: 0

            After a remount there is an old stale export on the server with a lot of locks to be canceled in the background, and some of them can block new locks from the just-mounted client. Normally such locks shouldn't cause an AST to be sent to a client, but although the stale export was disconnected, its reverse import was not set to LUSTRE_IMP_CLOSED as needed and remained in FULL state, so the AST was sent, causing all these errors. I don't know of other possible side effects, but there may be some.
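
The inverted check described in the comment above can be modelled in a few lines. This is only a simplified sketch, not the actual IMPORT_SET_STATE_NOLOCK() macro: the struct, the enum values, and both function names are stand-ins invented for illustration, and only the two guard conditions are taken from the explanation.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the Lustre import states; the numeric values are arbitrary. */
enum model_imp_state {
	MODEL_IMP_CLOSED = 1,
	MODEL_IMP_FULL = 2,
};

/* Hypothetical, stripped-down import carrying only the state field. */
struct model_import {
	enum model_imp_state imp_state;
};

/* Original logic: guard on the CURRENT state, so a closed import stays closed. */
static void set_state_original(struct model_import *imp,
			       enum model_imp_state state)
{
	assert(imp != NULL);
	if (imp->imp_state != MODEL_IMP_CLOSED)
		imp->imp_state = state;
}

/* Regressed logic: guard on the NEW state, so CLOSED can never be set. */
static void set_state_regressed(struct model_import *imp,
				enum model_imp_state state)
{
	assert(imp != NULL);
	if (state != MODEL_IMP_CLOSED)
		imp->imp_state = state;
}
```

With the original guard, an import in FULL state can be closed and then ignores later state changes. With the regressed guard, a request to close leaves the import in FULL state, which matches the stale reverse import that still received ASTs.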

            People

              simmonsja James A Simmons
              joe.grund Joe Grund
              Votes: 1
              Watchers: 15