Lustre / LU-2944

Client evictions - watchdog timeouts on MDT - iorfpp

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Cannot Reproduce
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: None
    • Labels:
    • Environment:
      Hyperion/LLNL RHEL6
    • Severity:
      3
    • Rank (Obsolete):
      7066

      Description

      Running the parallel-scale IOR file-per-process (fpp) test. At the end of the test, 60 clients report ENOTCONN and are then evicted from the MDT due to a lock callback timeout:

      Mar  8 15:47:22 hyperion-rst6 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 477s: evicting client at 192.168.117.65@o2ib1  ns: mdt-ffff8802f7d21000 lock: ffff8801193bfe00/0xb4fc8ee670e8bb9a lrc: 3/0,0 mode: CR/CR res: 8589935754/45677 bits 0x9 rrc: 2 type: IBT flags: 0x200000000020 nid: 192.168.117.65@o2ib1 remote: 0xc3b8b1de83cb4606 expref: 81 pid: 11951 timeout: 4379843546 lvb_type: 0
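
      With 60 clients evicted, it can help to pull the timeout and client NID out of each eviction message and tabulate them. The following is a minimal sketch, assuming the message format shown above; the regular expression and the helper name parse_eviction are illustrative and not part of Lustre.

```python
import re

# Eviction message copied from the report above (truncated after the namespace).
LINE = (
    "Mar  8 15:47:22 hyperion-rst6 kernel: LustreError: "
    "0:0:(ldlm_lockd.c:391:waiting_locks_callback()) "
    "### lock callback timer expired after 477s: "
    "evicting client at 192.168.117.65@o2ib1  ns: mdt-ffff8802f7d21000"
)

# Assumed pattern, derived only from the single message quoted in this report.
EVICT_RE = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)"
)


def parse_eviction(line):
    """Return (timeout_seconds, client_nid), or None if the line does not match."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    return int(m.group("secs")), m.group("nid")


print(parse_eviction(LINE))  # (477, '192.168.117.65@o2ib1')
```

      Running the same helper over the full MDS syslog would show whether all 60 evictions share a similar timeout and time window.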
      

      After a long delay, the system is idle and the MDT is now reporting watchdog timeouts:

      Mar  8 17:03:57 hyperion-rst6 kernel: LNet: Service thread pid 15804 was inactive for 304.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Mar  8 17:03:57 hyperion-rst6 kernel: Pid: 15804, comm: mdt03_035
      Mar  8 17:03:57 hyperion-rst6 kernel:
      Mar  8 17:03:57 hyperion-rst6 kernel: Call Trace:  
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff8107cb50>] ? process_timeout+0x0/0x10
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa06df6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa096b22d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0966950>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa096a968>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa096ad40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f3ac60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f3d92b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f3ac60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa096ad40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f3e1a4>] mdt_object_lock+0x14/0x20 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f5e5a9>] mdt_reint_unlink+0x5b9/0xdf0 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f59781>] mdt_reint_rec+0x41/0xe0 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f52de3>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f53114>] mdt_reint+0x44/0xe0 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f44008>] mdt_handle_common+0x628/0x1620 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa0f7c6e5>] mds_regular_handle+0x15/0x20 [mdt]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a404c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa06df5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa099b799>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff81052223>] ? __wake_up+0x53/0x70
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a5596>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffffa09a4a20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      Mar  8 17:03:57 hyperion-rst6 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
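
      In a Linux kernel stack dump, frames marked with "?" are addresses found on the stack that are not a reliable part of the call chain. Filtering them out makes the trace above easier to read. The sketch below does this for a subset of the frames quoted above; the regular expression and the helper name reliable_frames are illustrative assumptions, not an existing tool.

```python
import re

# A subset of the frames from the watchdog dump above, syslog prefix removed.
TRACE = """\
[<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
[<ffffffff8107cb50>] ? process_timeout+0x0/0x10
[<ffffffffa06df6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa096b22d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
[<ffffffffa0966950>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
[<ffffffffa096a968>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
[<ffffffffa0f3d92b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa0f5e5a9>] mdt_reint_unlink+0x5b9/0xdf0 [mdt]
"""

# address, optional "? " marker, function+offset/size, optional [module]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unreliable>\? )?(?P<func>\w+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def reliable_frames(text):
    """Return (function, module) pairs for frames not marked '?'.

    module is None for frames that belong to the core kernel.
    """
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group("unreliable"):
            frames.append((m.group("func"), m.group("module")))
    return frames


for func, module in reliable_frames(TRACE):
    print(f"{func} [{module or 'kernel'}]")
```

      Read bottom-up, the remaining frames suggest that the service thread was handling an unlink (mdt_reint_unlink) and was blocked in ldlm_completion_ast waiting for a local lock enqueue to be granted. This is a reading of the trace only, not a confirmed root cause.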
      

      After a period, the evicted clients are unable to reconnect; the MDT reports 'busy with 1 RPC'.

              People

              • Assignee:
                wc-triage WC Triage
              • Reporter:
                cliffw Cliff White (Inactive)
