[LU-2931] OST umount hangs for over 1 hour Created: 08/Mar/13  Updated: 11/Jul/14  Resolved: 11/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oz Rentas Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File oss9-2013-02-21     File oss9-2013-02-28    
Severity: 3
Rank (Obsolete): 7048

 Description   

After scheduled maintenance, Yale was attempting to fail back their OSTs from the failover server to the primary server, but the umounts hung on the failover server for over an hour until the machine was rebooted. Here is an example of the messages we have seen:

Feb 28 09:31:12 oss9 kernel: Lustre: Service thread pid 2708 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping
the stack trace for debugging purposes:
Feb 28 09:31:12 oss9 kernel: Pid: 2708, comm: ll_ost_11
Feb 28 09:31:12 oss9 kernel:
Feb 28 09:31:12 oss9 kernel: Call Trace:
Feb 28 09:31:12 oss9 kernel: [<ffffffff80063002>] thread_return+0x62/0xfe
Feb 28 09:31:12 oss9 kernel: [<ffffffff8002dee8>] __wake_up+0x38/0x4f
Feb 28 09:31:12 oss9 kernel: [<ffffffff88a12828>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
Feb 28 09:31:12 oss9 kernel: [<ffffffff800a34a7>] autoremove_wake_function+0x0/0x2e
Feb 28 09:31:12 oss9 kernel: [<ffffffff88a0d5ae>] jbd2_journal_stop+0x1e6/0x215 [jbd2]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88af0d05>] filter_sync+0xc5/0x5c0 [obdfilter]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887c30c1>] ldlm_pool_add+0x131/0x190 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887b39af>] ldlm_export_lock_put+0x6f/0xe0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887c40a5>] interval_next+0xf5/0x1d0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88a9edac>] ost_blocking_ast+0x79c/0x9b0 [ost]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88728cf0>] class_handle2object+0xe0/0x170 [obdclass]
Feb 28 09:31:12 oss9 kernel: [<ffffffff8879a270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff80064b09>] _spin_lock_bh+0x9/0x14
Feb 28 09:31:12 oss9 kernel: [<ffffffff887932fd>] ldlm_cancel_callback+0x6d/0xd0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88797580>] ldlm_lock_cancel+0xc0/0x170 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887b58e5>] ldlm_request_cancel+0x265/0x330 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887d94a1>] lustre_swab_buf+0x81/0x170 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887b6d50>] ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887bc290>] ldlm_server_completion_ast+0x0/0x5e0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88a9e610>] ost_blocking_ast+0x0/0x9b0 [ost]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887b9106>] ldlm_handle_enqueue+0x1d6/0x1210 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887d7ff5>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887d7f05>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887d80b8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff88aa64e3>] ost_handle+0x4ff3/0x55c0 [ost]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887e76d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff887e7e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff8008d299>] __wake_up_common+0x3e/0x68
Feb 28 09:31:12 oss9 kernel: [<ffffffff887e8dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 28 09:31:12 oss9 kernel: [<ffffffff887e7e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Feb 28 09:31:12 oss9 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 28 09:31:12 oss9 kernel:
Feb 28 09:31:12 oss9 kernel: LustreError: dumping log to /tmp/lustre-log.1362061872.2708
Feb 28 09:32:25 oss9 kernel: Lustre: Service thread pid 2708 completed after 272.45s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Feb 28 09:35:11 oss9 kernel: Lustre: 3146:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

We are planning a downtime to gather more information. Are there any debugging flags we should use? ldlm, quota, rpctrace? I was also considering whether upgrading to 1.8.9 might help, though I don't see any commits that really deal with this issue.
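For reference, a minimal sketch of how that extra debugging could be enabled on the OSS before reproducing the umount, assuming the standard lctl utility shipped with Lustre 1.8. The flag names (in particular "dlmtrace", which covers the ldlm layer), the buffer size, and the dump path are assumptions to be verified against the running version:

lctl set_param debug="+dlmtrace +rpctrace +quota"   # add lock, RPC, and quota tracing to the debug mask
lctl set_param debug_mb=256                         # enlarge the in-kernel debug buffer so traces are not overwritten
# ... reproduce the hung umount ...
lctl dk /tmp/lustre-debug-umount.log                # dump and clear the debug buffer to a file for analysis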



 Comments   
Comment by Cliff White (Inactive) [ 08/Mar/13 ]

Are you certain this is the first timeout? Are there any errors prior to the timeout? 1.8.9 might be a good idea.

Comment by Kit Westneat (Inactive) [ 26/Mar/13 ]

Hi Cliff, sorry for not getting back to you sooner; I missed your response. It was the first timeout that day, but there were some the previous day. It's actually kind of a strange log; the previous failover on the 21st makes more sense. I'll attach both kern.log files so you can check them out.

Comment by Cliff White (Inactive) [ 09/May/14 ]

Have you upgraded to 1.8.9?

Comment by Cliff White (Inactive) [ 11/Jul/14 ]

Do you have any updates on this situation?

Comment by Oz Rentas [ 11/Jul/14 ]

This is way old; it was opened by Kit.
The system has since been upgraded and is now running Lustre 2.4.
This is no longer a problem. Please close this.

Comment by Cliff White (Inactive) [ 11/Jul/14 ]

Great, will do
