Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Blocker
- Fix Version/s: None
- Affects Version/s: Lustre 2.4.1
- Labels: None
- Environment: file system upgraded from at least 1.8 to 2.3 and now 2.4.1; clients all running 1.8.9 currently, mostly Red Hat clients; Red Hat 6 servers
- Severity: 3
- Rank: 11548
Description
After upgrading our test file system to 2.4.1 earlier (and at the same time moving the OSSes to a different network), the MDT crashes very frequently with an LBUG and reboots immediately. I have managed to get the following stack trace from /var/crash.
<0>LustreError: 8518:0:(mdt_open.c:1685:mdt_reint_open()) LBUG
<6>Lustre: play01-MDT0000: Recovery over after 0:31, of 267 clients 267 recovered and 0 were evicted.
<4>Pid: 8518, comm: mdt01_005
<4>Call Trace:
<4> [<ffffffffa04ea895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa04eae97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0e5a6b9>] mdt_reint_open+0x1989/0x20c0 [mdt]
<4> [<ffffffffa050782e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
<4> [<ffffffffa07d5dcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
<4> [<ffffffffa0669f50>] ? lu_ucred+0x20/0x30 [obdclass]
<4> [<ffffffffa0e44911>] mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffffa0e29ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
<4> [<ffffffffa0e2a06d>] mdt_intent_reint+0x1ed/0x520 [mdt]
<4> [<ffffffffa0e27f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
<4> [<ffffffffa078d831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
<4> [<ffffffffa07b41ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
<4> [<ffffffffa0e283a6>] mdt_enqueue+0x46/0xe0 [mdt]
<4> [<ffffffffa0e2ea97>] mdt_handle_common+0x647/0x16d0 [mdt]
<4> [<ffffffffa07d6bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa0e683f5>] mds_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa07e63c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
<4> [<ffffffffa04eb5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4> [<ffffffffa04fcd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4> [<ffffffffa07dd729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
<4> [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
<4> [<ffffffffa07e775e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
<4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<0>Kernel panic - not syncing: LBUG
I also have a vmcore file for this crash, though none of the files in /tmp that I remember from the 1.8 days; I am not sure whether that is a 2.4 change or a consequence of the reboots, which happen even though kernel.panic=0.
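For reference, this is the quick check I run on the MDS to confirm the panic-related sysctls (a minimal sketch using only the standard /proc/sys paths, nothing Lustre-specific):

#!/usr/bin/env python3
# Read panic-related sysctls via procfs, to confirm kernel.panic really is 0
# and to see whether kernel.panic_on_oops might explain the automatic reboot.
from pathlib import Path

def read_sysctl(name: str) -> str:
    # "kernel.panic" -> /proc/sys/kernel/panic
    return Path("/proc/sys", *name.split(".")).read_text().strip()

for key in ("kernel.panic", "kernel.panic_on_oops"):
    print(f"{key} = {read_sysctl(key)}")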
It makes no difference whether I mount with or without --abort-recovery; the LBUG happens within a minute of the file system coming back, every time.
This test file system had previously been upgraded from 1.8 to 2.3 and was running 2.3 for a while. It is also possible that it was originally upgraded from 1.6; I would have to check that.
It may be worth noting that even though we moved the OSSes to a different network, we did not manage to shut down all clients before the migration, so quite a few clients are likely still trying to communicate with the OSSes on the old IPs and failing.
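As a rough sanity check from a client (a sketch only: the hostnames below are placeholders for our OSS nodes, and it tests plain TCP reachability of the default socklnd acceptor port 988, not LNet itself), something like this shows whether the new OSS addresses are reachable at all:

#!/usr/bin/env python3
# Rough TCP reachability check from a client to the OSSes at their new addresses.
# Port 988 is the default Lustre socklnd acceptor port; hostnames are placeholders.
import socket

OSS_ADDRS = ["oss01-newnet", "oss02-newnet"]  # placeholder names for our OSS nodes
LNET_TCP_PORT = 988

for host in OSS_ADDRS:
    try:
        with socket.create_connection((host, LNET_TCP_PORT), timeout=5):
            print(f"{host}:{LNET_TCP_PORT} reachable")
    except OSError as exc:
        print(f"{host}:{LNET_TCP_PORT} NOT reachable ({exc})")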
Attachments
Issue Links
- is related to LU-4282: some OSTs reported as inactive in lfs df, UP with lctl dl, data not accessible (Resolved)