
LU-4242: mdt_open.c:1685:mdt_reint_open()) LBUG

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1
    • Labels: None
    • Environment: file system upgraded from at least 1.8 to 2.3 and now 2.4.1, clients all running 1.8.9 currently, mostly Red Hat clients, Red Hat 6 servers
    • Severity: 3
    • Rank (Obsolete): 11548

    Description

      After upgrading our test file system to 2.4.1 earlier (and at the same time moving the OSSes to a different network), the MDT crashes very frequently with an LBUG and then reboots straight away. I have managed to get the following stack trace from /var/crash.

      <0>LustreError: 8518:0:(mdt_open.c:1685:mdt_reint_open()) LBUG
      <6>Lustre: play01-MDT0000: Recovery over after 0:31, of 267 clients 267 recovered and 0 were evicted.
      <4>Pid: 8518, comm: mdt01_005
      <4>
      <4>Call Trace:
      <4> [<ffffffffa04ea895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa04eae97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0e5a6b9>] mdt_reint_open+0x1989/0x20c0 [mdt]
      <4> [<ffffffffa050782e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
      <4> [<ffffffffa07d5dcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
      <4> [<ffffffffa0669f50>] ? lu_ucred+0x20/0x30 [obdclass]
      <4> [<ffffffffa0e44911>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0e29ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0e2a06d>] mdt_intent_reint+0x1ed/0x520 [mdt]
      <4> [<ffffffffa0e27f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
      <4> [<ffffffffa078d831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      <4> [<ffffffffa07b41ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      <4> [<ffffffffa0e283a6>] mdt_enqueue+0x46/0xe0 [mdt]
      <4> [<ffffffffa0e2ea97>] mdt_handle_common+0x647/0x16d0 [mdt]
      <4> [<ffffffffa07d6bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
      <4> [<ffffffffa0e683f5>] mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa07e63c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa04eb5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa04fcd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07dd729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa07e775e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa07e6c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      
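      For anyone less familiar with the LBUG mechanism: a crash like this comes from a failed assertion in Lustre's libcfs layer. The assertion macros report the file, line and function (hence "mdt_open.c:1685:mdt_reint_open()" above) and then call lbug_with_loc(), which dumps the stack trace and panics the node. Purely for illustration, here is a minimal userspace analogue of that pattern; I don't know what condition is actually checked at mdt_open.c:1685 in 2.4.1, so the one below is made up:

      #include <stdio.h>
      #include <stdlib.h>

      /* Hypothetical userspace analogue of libcfs's LBUG: print the
       * file:line:function triple in the same style as the console
       * message above, then abort (the in-kernel version panics). */
      #define MY_LBUG()                                              \
              do {                                                   \
                      fprintf(stderr, "(%s:%d:%s()) LBUG\n",         \
                              __FILE__, __LINE__, __func__);         \
                      abort();                                       \
              } while (0)

      /* Hypothetical analogue of LASSERT: report the failed
       * condition, then take the LBUG path. */
      #define MY_LASSERT(cond)                                       \
              do {                                                   \
                      if (!(cond)) {                                 \
                              fprintf(stderr,                        \
                                      "ASSERTION( %s ) failed\n",    \
                                      #cond);                        \
                              MY_LBUG();                             \
                      }                                              \
              } while (0)

      int main(void)
      {
              int open_flags = 0;
              /* Made-up condition: fails and triggers the LBUG path. */
              MY_LASSERT(open_flags != 0);
              return 0;
      }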

      I also have a vmcore file for this crash, though none of the files in /tmp that I remember from the 1.8 days; I'm not sure whether that is a 2.4 change or related to the reboots, which happen even though kernel.panic=0 (possibly kdump rebooting the node after it has written the dump to /var/crash).

      It doesn't make any difference whether I mount with or without --abort-recovery; the LBUG happens within a minute of the file system coming back, every time.

      This test file system was upgraded from 1.8 to 2.3 previously and ran 2.3 for a while. It may also have been upgraded from 1.6 originally, though I'd have to check.

      It might be of note that even though we moved the OSSes to a different network, we did not manage to shut down all clients before the migration, so quite a few clients are likely still trying to communicate with the OSSes using the old IPs, which will fail.

People

    • Assignee: Bob Glossman (bogl, Inactive)
    • Reporter: Frederik Ferner (ferner, Inactive)
    • Votes: 0
    • Watchers: 5
