Lustre / LU-5787

ptlrpcd_rcv loop in osc_ldlm_weigh_ast


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Component/s: None
    • Environment: kernel 2.6.32-431.23.3 + bull fix,
      lustre 2.5.3 + bull fix
    • Severity: 3
    • Rank: 16240

    Description

      On the CEA T100 system we have an issue similar to LU-5781: after a server node failover, some client nodes have the ptlrpcd_rcv thread using 100% of one CPU. With perf we can see:

            0.55%      ptlrpcd_rcv  [obdclass]               [k] cl_page_at_trusted
                      |
                      --- cl_page_at_trusted
                         |
                         |--97.59%-- cl_page_gang_lookup
                         |          osc_ldlm_weigh_ast
                         |          osc_cancel_for_recovery
                         |          ldlm_cancel_no_wait_policy
                         |          ldlm_prepare_lru_list
                         |          ldlm_cancel_lru_local
                         |          ldlm_replay_locks
                         |          ptlrpc_import_recovery_state_machine
                         |          ptlrpc_connect_interpret
                         |          ptlrpc_check_set
                         |          ptlrpcd_check
                         |          ptlrpcd
                         |          kthread
                         |          child_rip
                         |
                          --2.41%-- osc_ldlm_weigh_ast
                                    osc_cancel_for_recovery
                                    ldlm_cancel_no_wait_policy
                                    ldlm_prepare_lru_list
                                    ldlm_cancel_lru_local
                                    ldlm_replay_locks
                                    ptlrpc_import_recovery_state_machine
                                    ptlrpc_connect_interpret
                                    ptlrpc_check_set
                                    ptlrpcd_check
                                    ptlrpcd
                                    kthread
                                    child_rip  
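
      For reference, a profile like the one above can be captured against the spinning thread with perf; the commands below are only a sketch (the thread id and sampling duration are placeholders, not the exact values used here):

          # find the ptlrpcd_rcv thread that is stuck at ~100% CPU
          top -b -H -n 1 | grep ptlrpcd_rcv
          # sample its call graph for a few seconds (<TID> is the thread id found above)
          perf record -g -t <TID> -- sleep 10
          perf report --stdio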
      

      We have some OSC imports stuck in the following states:

          /proc/fs/lustre/osc/ptmp2-OST0021-osc-ffff8801ff8b3800/state:current_state: CONNECTING
          /proc/fs/lustre/osc/ptmp2-OST0022-osc-ffff8801ff8b3800/state:current_state: REPLAY_LOCKS
          /proc/fs/lustre/osc/ptmp2-OST0023-osc-ffff8801ff8b3800/state:current_state: REPLAY_LOCKS
          /proc/fs/lustre/osc/ptmp2-OST0024-osc-ffff8801ff8b3800/state:current_state: CONNECTING
          /proc/fs/lustre/osc/ptmp2-OST0025-osc-ffff8801ff8b3800/state:current_state: REPLAY_LOCKS
          /proc/fs/lustre/osc/ptmp2-OST0026-osc-ffff8801ff8b3800/state:current_state: CONNECTING
          /proc/fs/lustre/osc/ptmp2-OST0027-osc-ffff8801ff8b3800/state:current_state: CONNECTING
          /proc/fs/lustre/osc/ptmp2-OST0028-osc-ffff8801ff8b3800/state:current_state: CONNECTING
          /proc/fs/lustre/osc/ptmp2-OST0029-osc-ffff8801ff8b3800/state:current_state: REPLAY_LOCKS
          /proc/fs/lustre/osc/ptmp2-OST002a-osc-ffff8801ff8b3800/state:current_state: CONNECTING
          /proc/fs/lustre/osc/ptmp2-OST002b-osc-ffff8801ff8b3800/state:current_state: REPLAY_WAIT
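
      These states can be collected in one step with lctl; a minimal sketch, equivalent to reading the /proc files shown above:

          # dump the import state of every OSC on this client
          lctl get_param osc.*.state | grep current_state
          # or read the proc files directly
          grep current_state /proc/fs/lustre/osc/*/state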
      

      In some cases it is also possible to get the import state back to FULL after running one of the following commands (a combined sketch follows the list):

      1. lctl set_param ldlm.namespaces.*.lru_size=clear
        or
      2. echo 1 > /proc/sys/vm/drop_caches
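
      A combined sketch of these workarounds, assuming they are run on an affected client; the final command just re-checks the import states:

          # 1. drop unused DLM locks from the client LRU
          lctl set_param ldlm.namespaces.*.lru_size=clear
          # 2. or drop the clean page cache
          echo 1 > /proc/sys/vm/drop_caches
          # verify whether the imports went back to FULL
          lctl get_param osc.*.state | grep current_state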

      After an NMI, the ptlrpcd_rcv stack was:

           crash> bt 15400
          PID: 15400  TASK: ffff880c7c5a0100  CPU: 14  COMMAND: "ptlrpcd_rcv"
           #0 [ffff88088e4c7e90] crash_nmi_callback at ffffffff81030096
           #1 [ffff88088e4c7ea0] notifier_call_chain at ffffffff8152f9d5
           #2 [ffff88088e4c7ee0] atomic_notifier_call_chain at ffffffff8152fa3a
           #3 [ffff88088e4c7ef0] notify_die at ffffffff810a056e
           #4 [ffff88088e4c7f20] do_nmi at ffffffff8152d69b
           #5 [ffff88088e4c7f50] nmi at ffffffff8152cf60
              [exception RIP: cl_page_gang_lookup+292]
              RIP: ffffffffa04f18b4  RSP: ffff880c7cabb990  RFLAGS: 00000206
              RAX: 000000000000000a  RBX: 000000000000000b  RCX: 0000000000000000
              RDX: ffff880660a63da8  RSI: ffffffffa0af8740  RDI: ffff88065b15ae00
              RBP: ffff880c7cabba30   R8: 000000000000000e   R9: ffff880c7cabb950
              R10: 0000000000002362  R11: ffff88087a09e5d0  R12: ffff88065b15a800
              R13: ffff880660a63df8  R14: 000000000000000b  R15: 000000000000000e
              ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
          --- <NMI exception stack> ---
           #6 [ffff880c7cabb990] cl_page_gang_lookup at ffffffffa04f18b4 [obdclass]
           #7 [ffff880c7cabba38] osc_ldlm_weigh_ast at ffffffffa095e9b7 [osc]
           #8 [ffff880c7cabbab8] osc_cancel_for_recovery at ffffffffa094305d [osc]
           #9 [ffff880c7cabbac8] ldlm_cancel_no_wait_policy at ffffffffa0637711 [ptlrpc]
          #10 [ffff880c7cabbae8] ldlm_prepare_lru_list at ffffffffa063b61b [ptlrpc]
          #11 [ffff880c7cabbb68] ldlm_cancel_lru_local at ffffffffa063ba34 [ptlrpc]
          #12 [ffff880c7cabbb88] ldlm_replay_locks at ffffffffa063bbbc [ptlrpc]
          #13 [ffff880c7cabbc08] ptlrpc_import_recovery_state_machine at ffffffffa06844f7 [ptlrpc]
          #14 [ffff880c7cabbc68] ptlrpc_connect_interpret at ffffffffa0685659 [ptlrpc]
          #15 [ffff880c7cabbd08] ptlrpc_check_set at ffffffffa065bbc1 [ptlrpc]
          #16 [ffff880c7cabbda8] ptlrpcd_check at ffffffffa0687f9b [ptlrpc]
          #17 [ffff880c7cabbe08] ptlrpcd at ffffffffa06884bb [ptlrpc]
          #18 [ffff880c7cabbee8] kthread at ffffffff81099f56
          #19 [ffff880c7cabbf48] kernel_thread at ffffffff8100c20a
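
      For completeness, the backtrace above was taken with the crash utility; a minimal sketch of obtaining it on a node (the vmlinux debuginfo path is an assumption, the PID is the one reported above):

          # run crash against the live kernel (or a saved vmcore)
          crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
          crash> ps | grep ptlrpcd_rcv     # locate the spinning thread
          crash> bt 15400                  # its backtrace, as shown above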
      

      Once the root cause is understood in LU-5781, we will need a patched version for Lustre 2.5.3.

            People

              Assignee: bobijam (Zhenyu Xu)
              Reporter: apercher (Antoine Percher)
              Votes: 1
              Watchers: 12
