  1. Lustre
  2. LU-4242

mdt_open.c:1685:mdt_reint_open()) LBUG


    • Bug
    • Resolution: Cannot Reproduce
    • Blocker
    • None
    • Lustre 2.4.1
    • None
    • file system upgraded from at least 1.8 to 2.3 and now 2.4.1, clients all running 1.8.9 currently, mostly Red Hat clients, Red Hat 6 servers
    • 3
    • 11548

      After upgrading our test file system to 2.4.1 earlier (and, at the same time, moving the OSSes to a different network), the MDT crashes very frequently with an LBUG and reboots immediately. I managed to get the following stack trace from /var/crash.

      <0>LustreError: 8518:0:(mdt_open.c:1685:mdt_reint_open()) LBUG
      <6>Lustre: play01-MDT0000: Recovery over after 0:31, of 267 clients 267 recovered and 0 were evicted.
      <4>Pid: 8518, comm: mdt01_005
      <4>
      <4>Call Trace:
      <4> [<ffffffffa04ea895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa04eae97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0e5a6b9>] mdt_reint_open+0x1989/0x20c0 [mdt]
      <4> [<ffffffffa050782e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
      <4> [<ffffffffa07d5dcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
      <4> [<ffffffffa0669f50>] ? lu_ucred+0x20/0x30 [obdclass]
      <4> [<ffffffffa0e44911>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0e29ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0e2a06d>] mdt_intent_reint+0x1ed/0x520 [mdt]
      <4> [<ffffffffa0e27f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
      <4> [<ffffffffa078d831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      <4> [<ffffffffa07b41ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      <4> [<ffffffffa0e283a6>] mdt_enqueue+0x46/0xe0 [mdt]
      <4> [<ffffffffa0e2ea97>] mdt_handle_common+0x647/0x16d0 [mdt]
      <4> [<ffffffffa07d6bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
      <4> [<ffffffffa0e683f5>] mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa07e63c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa04eb5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa04fcd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07dd729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa07e775e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
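
The trace above mixes two kinds of frames: entries printed with a leading `?` are addresses the kernel found on the stack but could not confirm as part of the call chain, while the rest are the frames it considers reliable. As a reading aid only, here is a minimal Python sketch that separates the two. The names `FRAME_RE`, `parse_frames`, `reliable_call_chain` and `SAMPLE` are hypothetical helpers introduced for illustration, not part of Lustre or any crash tool, and the pattern assumes the `[<addr>] symbol+0xoff/0xsize [module]` layout seen in this log.

```python
import re

# Assumed frame layout, taken from the log above:
#   <4> [<ffffffffa0e5a6b9>] mdt_reint_open+0x1989/0x20c0 [mdt]
#   <4> [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return one dict per stack frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue  # not a frame line (headers, blank lines, messages)
        frames.append({
            "symbol": m.group("symbol"),
            "offset": int(m.group("offset"), 16),
            "size": int(m.group("size"), 16),
            # Frames without a [module] suffix belong to the core kernel.
            "module": m.group("module") or "kernel",
            # A leading '?' marks a frame the unwinder could not verify.
            "reliable": m.group("unreliable") is None,
        })
    return frames


def reliable_call_chain(frames):
    """Innermost-first list of 'module:symbol' for verified frames only."""
    return [f"{f['module']}:{f['symbol']}" for f in frames if f["reliable"]]


# A few lines copied from the trace in this report.
SAMPLE = """\
<4> [<ffffffffa04eae97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0e5a6b9>] mdt_reint_open+0x1989/0x20c0 [mdt]
<4> [<ffffffffa050782e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
<4> [<ffffffffa0e44911>] mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
"""

if __name__ == "__main__":
    for entry in reliable_call_chain(parse_frames(SAMPLE)):
        print(entry)
```

Filtering this way leaves the path from `ptlrpc_main` through `mdt_enqueue` and `mdt_intent_policy` down to `mdt_reint_open`, which is where the LBUG fires; the `?` entries are best treated as stale stack contents rather than callers.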
      

      I also have a vmcore file for this crash, but none of the files in /tmp that I remember from 1.8. I am not sure whether this is a 2.4 change or related to the reboots, which happen even though kernel.panic=0.

      It makes no difference whether I mount with or without --abort-recovery: the LBUG happens within a minute of the file system coming back, every time.

      This test file system was previously upgraded from 1.8 to 2.3 and ran 2.3 for a while. It may also have been upgraded from 1.6 originally, though I would have to check.

      It might be of note that, even though we moved the OSSes to a different network, we did not manage to shut down all clients before the migration, so quite a few clients are likely still trying to reach the OSSes at the old IPs and will fail.

            bogl Bob Glossman (Inactive)
            ferner Frederik Ferner (Inactive)