After scheduled maintenance, Yale was attempting to fail back their OSTs from the failover server to the primary server, but the umounts hung on the failover server for over an hour until the machine was rebooted. Here is an example of the messages we saw:
Feb 28 09:31:12 oss9 kernel: Lustre: Service thread pid 2708 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 28 09:31:12 oss9 kernel: Pid: 2708, comm: ll_ost_11
Feb 28 09:31:12 oss9 kernel:
Feb 28 09:31:12 oss9 kernel: Call Trace:
Feb 28 09:31:12 oss9 kernel: [<ffffffff80063002>] thread_return+0x62/0xfe
Feb 28 09:31:12 oss9 kernel: [<ffffffff8002dee8>] __wake_up+0x38/0x4f
Feb 28 09:31:12 oss9 kernel: [<ffffffff88a12828>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
Feb 28 09:31:12 oss9 kernel: [<ffffffff800a34a7>] autoremove_wake_function+0x0/0x2e
Feb 28 09:31:12 oss9 kernel: [<ffffffff88a0d5ae>] jbd2_journal_stop+0x1e6/0x215 [jbd2]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88af0d05>] filter_sync+0xc5/0x5c0 [obdfilter]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887c30c1>] ldlm_pool_add+0x131/0x190 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887b39af>] ldlm_export_lock_put+0x6f/0xe0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887c40a5>] interval_next+0xf5/0x1d0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88a9edac>] ost_blocking_ast+0x79c/0x9b0 [ost]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88728cf0>] class_handle2object+0xe0/0x170 [obdclass]
Feb 28 09:31:12 oss9 kernel: [<ffffffff8879a270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff80064b09>] _spin_lock_bh+0x9/0x14
Feb 28 09:31:12 oss9 kernel: [<ffffffff887932fd>] ldlm_cancel_callback+0x6d/0xd0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88797580>] ldlm_lock_cancel+0xc0/0x170 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887b58e5>] ldlm_request_cancel+0x265/0x330 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887d94a1>] lustre_swab_buf+0x81/0x170 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887b6d50>] ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887bc290>] ldlm_server_completion_ast+0x0/0x5e0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88a9e610>] ost_blocking_ast+0x0/0x9b0 [ost]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887b9106>] ldlm_handle_enqueue+0x1d6/0x1210 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887d7ff5>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887d7f05>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887d80b8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88aa64e3>] ost_handle+0x4ff3/0x55c0 [ost]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887e76d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887e7e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff8008d299>] __wake_up_common+0x3e/0x68
Feb 28 09:31:12 oss9 kernel: [<ffffffff887e8dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 28 09:31:12 oss9 kernel: [<ffffffff887e7e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 28 09:31:12 oss9 kernel:
Feb 28 09:31:12 oss9 kernel: LustreError: dumping log to /tmp/lustre-log.1362061872.2708
Feb 28 09:32:25 oss9 kernel: Lustre: Service thread pid 2708 completed after 272.45s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Feb 28 09:35:11 oss9 kernel: Lustre: 3146:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
We are planning downtime to gather more information. Are there any debug flags we should enable, e.g. ldlm, quota, rpctrace? I was also thinking of trying 1.8.9, though I don't see any commits that really address this issue.
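For reference, this is roughly what we had in mind running on the OSS during the downtime, assuming the 1.8.x lctl debug mask names dlmtrace, rpctrace, and quota are the right ones (please correct the masks or buffer size if something else would be more useful):

# Enlarge the in-kernel debug buffer so the trace is not overwritten (256 MB is an assumed value)
lctl set_param debug_mb=256

# Add the lock, RPC and quota debug masks on top of the current defaults
lctl set_param debug="+dlmtrace +rpctrace +quota"

# ... reproduce the umount hang on the failover OSS ...

# Dump the kernel debug buffer to a file to attach here
lctl dk /tmp/lustre-debug-umount.txt

Would that be enough to see where the unmount is stuck, or would you want a full debug dump as well?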