Lustre / LU-2931

OST umount hangs for over 1 hour


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.8
    • Labels: None
    • Severity: 3
    • Rank: 7048

    Description

      After scheduled maintenance, Yale was attempting to fail back their OSTs from the failover server to the primary server, but the umounts hung on the failover server for over an hour until the machine was rebooted. Here is an example of the messages we have seen:

      Feb 28 09:31:12 oss9 kernel: Lustre: Service thread pid 2708 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping
      the stack trace for debugging purposes:
      Feb 28 09:31:12 oss9 kernel: Pid: 2708, comm: ll_ost_11
      Feb 28 09:31:12 oss9 kernel:
      Feb 28 09:31:12 oss9 kernel: Call Trace:
      Feb 28 09:31:12 oss9 kernel: [<ffffffff80063002>] thread_return+0x62/0xfe
      Feb 28 09:31:12 oss9 kernel: [<ffffffff8002dee8>] __wake_up+0x38/0x4f
      Feb 28 09:31:12 oss9 kernel: [<ffffffff88a12828>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff800a34a7>] autoremove_wake_function+0x0/0x2e
      Feb 28 09:31:12 oss9 kernel: [<ffffffff88a0d5ae>] jbd2_journal_stop+0x1e6/0x215 [jbd2]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff88af0d05>] filter_sync+0xc5/0x5c0 [obdfilter]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887c30c1>] ldlm_pool_add+0x131/0x190 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887b39af>] ldlm_export_lock_put+0x6f/0xe0 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887c40a5>] interval_next+0xf5/0x1d0 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff88a9edac>] ost_blocking_ast+0x79c/0x9b0 [ost]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff88728cf0>] class_handle2object+0xe0/0x170 [obdclass]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff8879a270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff80064b09>] _spin_lock_bh+0x9/0x14
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887932fd>] ldlm_cancel_callback+0x6d/0xd0 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff88797580>] ldlm_lock_cancel+0xc0/0x170 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887b58e5>] ldlm_request_cancel+0x265/0x330 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887d94a1>] lustre_swab_buf+0x81/0x170 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887b6d50>] ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887bc290>] ldlm_server_completion_ast+0x0/0x5e0 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff88a9e610>] ost_blocking_ast+0x0/0x9b0 [ost]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887b9106>] ldlm_handle_enqueue+0x1d6/0x1210 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887d7ff5>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887d7f05>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887d80b8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff88aa64e3>] ost_handle+0x4ff3/0x55c0 [ost]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887e76d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887e7e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff8008d299>] __wake_up_common+0x3e/0x68
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887e8dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Feb 28 09:31:12 oss9 kernel: [<ffffffff887e7e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Feb 28 09:31:12 oss9 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
      Feb 28 09:31:12 oss9 kernel:
      Feb 28 09:31:12 oss9 kernel: LustreError: dumping log to /tmp/lustre-log.1362061872.2708
      Feb 28 09:32:25 oss9 kernel: Lustre: Service thread pid 2708 completed after 272.45s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Feb 28 09:35:11 oss9 kernel: Lustre: 3146:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

      We are planning a downtime to gather more information. Are there any debugging flags we should use (ldlm, quota, rpctrace)? I was also thinking of trying 1.8.9, though I don't see any commits that directly address this issue.
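
      For reference, this is roughly what we were planning to run on the OSS during the downtime window. It is only a sketch, assuming the standard lctl debug interface in 1.8; the particular flag names (dlmtrace, quota, rpctrace) and the dump path are just our first guesses at what would be useful:

        # Enable extra Lustre debug logging on the OSS (adds to the current mask).
        lctl set_param debug="+dlmtrace +quota +rpctrace"

        # Enlarge the kernel debug buffer so the trace around the hang is not lost.
        lctl set_param debug_mb=256

        # After reproducing the hang, dump (and clear) the debug buffer to a file.
        lctl dk /tmp/lustre-debug.$(date +%s).log

      The dlmtrace flag should cover the LDLM cancel path in the stack trace above, and quota is included because of the quota_chk_acq_common() retry messages.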

      Attachments

        Activity

          People

            Assignee: Cliff White (Inactive)
            Reporter: Oz Rentas
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: