
LU-4242: mdt_open.c:1685:mdt_reint_open()) LBUG

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1
    • Labels: None
    • Environment: file system upgraded from at least 1.8 to 2.3 and now 2.4.1, clients all running 1.8.9 currently, mostly Red Hat clients, Red Hat 6 servers
    • Severity: 3
    • Rank (Obsolete): 11548

    Description

      After upgrading our test file system to 2.4.1 earlier (and at the same time moving the OSSes to a different network), the MDT crashes very frequently with an LBUG and then reboots straight away. I have managed to get the following stack trace from /var/crash.

      <0>LustreError: 8518:0:(mdt_open.c:1685:mdt_reint_open()) LBUG
      <6>Lustre: play01-MDT0000: Recovery over after 0:31, of 267 clients 267 recovered and 0 were evicted.
      <4>Pid: 8518, comm: mdt01_005
      <4>
      <4>Call Trace:
      <4> [<ffffffffa04ea895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa04eae97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0e5a6b9>] mdt_reint_open+0x1989/0x20c0 [mdt]
      <4> [<ffffffffa050782e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
      <4> [<ffffffffa07d5dcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
      <4> [<ffffffffa0669f50>] ? lu_ucred+0x20/0x30 [obdclass]
      <4> [<ffffffffa0e44911>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0e29ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0e2a06d>] mdt_intent_reint+0x1ed/0x520 [mdt]
      <4> [<ffffffffa0e27f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
      <4> [<ffffffffa078d831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      <4> [<ffffffffa07b41ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      <4> [<ffffffffa0e283a6>] mdt_enqueue+0x46/0xe0 [mdt]
      <4> [<ffffffffa0e2ea97>] mdt_handle_common+0x647/0x16d0 [mdt]
      <4> [<ffffffffa07d6bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
      <4> [<ffffffffa0e683f5>] mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa07e63c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa04eb5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa04fcd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07dd729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa07e775e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      
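      For anyone less familiar with the LBUG mechanism: a crash like this comes from a failed assertion in Lustre's libcfs layer. The assertion macros report the file, line and function (hence "mdt_open.c:1685:mdt_reint_open()" above) and then call lbug_with_loc(), which dumps the stack trace and panics the node. Purely for illustration, here is a minimal userspace analogue of that pattern; I don't know what condition is actually checked at mdt_open.c:1685 in 2.4.1, so the one below is made up:

      #include <stdio.h>
      #include <stdlib.h>

      /* Hypothetical userspace analogue of libcfs's LBUG: print the
       * file:line:function triple in the same style as the console
       * message above, then abort (the in-kernel version panics). */
      #define MY_LBUG()                                              \
              do {                                                   \
                      fprintf(stderr, "(%s:%d:%s()) LBUG\n",         \
                              __FILE__, __LINE__, __func__);         \
                      abort();                                       \
              } while (0)

      /* Hypothetical analogue of LASSERT: report the failed
       * condition, then take the LBUG path. */
      #define MY_LASSERT(cond)                                       \
              do {                                                   \
                      if (!(cond)) {                                 \
                              fprintf(stderr,                        \
                                      "ASSERTION( %s ) failed\n",    \
                                      #cond);                        \
                              MY_LBUG();                             \
                      }                                              \
              } while (0)

      int main(void)
      {
              int open_flags = 0;
              /* Made-up condition: fails and triggers the LBUG path. */
              MY_LASSERT(open_flags != 0);
              return 0;
      }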

      I also have a vmcore file for this crash, though none of the files in /tmp that I remember from the 1.8 days; I'm not sure whether that is a 2.4 change or related to the reboots, which happen even though kernel.panic=0 (possibly kdump rebooting the node after it has written the dump to /var/crash).

      It doesn't make any difference whether I mount with or without --abort-recovery; the LBUG happens within a minute of the file system coming back, every time.

      This test file system was upgraded from 1.8 to 2.3 previously and ran 2.3 for a while. It may also have been upgraded from 1.6 originally, though I'd have to check.

      It might be of note that even though we moved the OSSes to a different network, we did not manage to shut down all clients before the migration, so quite a few clients are likely still trying to communicate with the OSSes using the old IPs, which will fail.

People

    • Assignee: Bob Glossman (bogl, Inactive)
    • Reporter: Frederik Ferner (ferner, Inactive)
    • Votes: 0
    • Watchers: 5
