[LU-5529] LBUG when unmounting MDT Created: 21/Aug/14  Updated: 22/Aug/14  Resolved: 22/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Frederik Ferner (Inactive) Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None
Environment: RHEL6

Severity: 3
Rank (Obsolete): 15387

 Description   

After the recent upgrade to 2.5.2 on our servers, I tried to unmount the MDT (and MGS) in order to fail over to the second server (after applying the patches recommended in LU-5514). While waiting for the unmount to complete, we hit this LBUG:

kernel:LustreError: 11779:0:(osp_sync.c:878:osp_sync_thread()) ASSERTION( count < 10 ) failed: lustre03-OST0009-osc: 1 1 empty
kernel:LustreError: 11779:0:(osp_sync.c:878:osp_sync_thread()) LBUG

The machine then rebooted, so not much debugging information is available, but I managed to get the following from the Red Hat crash logs (vmcore-dmesg.txt):

<0>LustreError: 11779:0:(osp_sync.c:878:osp_sync_thread()) ASSERTION( count < 10 ) failed: lustre03-OST0009-osc: 1 1 empty
<0>LustreError: 11779:0:(osp_sync.c:878:osp_sync_thread()) LBUG
<4>Pid: 11779, comm: osp-syn-9-0
<4>
<4>Call Trace:
<4> [<ffffffffa0515895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0515e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa1033132>] osp_sync_thread+0x6c2/0x7d0 [osp]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa1032a70>] ? osp_sync_thread+0x0/0x7d0 [osp]
<4> [<ffffffff8109ab56>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109aac0>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 11779, comm: osp-syn-9-0 Not tainted 2.6.32-431.17.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8152795f>] ? panic+0xa7/0x16f
<4> [<ffffffffa0515eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa1033132>] ? osp_sync_thread+0x6c2/0x7d0 [osp]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa1032a70>] ? osp_sync_thread+0x0/0x7d0 [osp]
<4> [<ffffffff8109ab56>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109aac0>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
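
For anyone triaging similar crashes, here is a minimal sketch of how the assertion and LBUG lines could be pulled out of a vmcore-dmesg.txt. It assumes the entries follow the same `LustreError: pid:n:(file:line:func()) message` shape as the lines quoted above. The regular expression and the `find_lbug_context` helper are illustrative names, not part of Lustre or any Red Hat tooling.

```python
import re

# Pattern inferred from the LustreError lines in this ticket (an assumption,
# not an official format definition), e.g.
# "<0>LustreError: 11779:0:(osp_sync.c:878:osp_sync_thread()) LBUG"
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)


def find_lbug_context(lines):
    """Return parsed LustreError entries that mention ASSERTION or LBUG."""
    hits = []
    for raw in lines:
        m = LUSTRE_ERROR.search(raw)
        if m and ("ASSERTION" in m["msg"] or "LBUG" in m["msg"]):
            hits.append({
                "pid": int(m["pid"]),
                "file": m["file"],
                "line": int(m["line"]),
                "func": m["func"],
                "msg": m["msg"].strip(),
            })
    return hits


# Sample input copied from the crash log in this ticket.
sample = [
    "<0>LustreError: 11779:0:(osp_sync.c:878:osp_sync_thread()) "
    "ASSERTION( count < 10 ) failed: lustre03-OST0009-osc: 1 1 empty",
    "<0>LustreError: 11779:0:(osp_sync.c:878:osp_sync_thread()) LBUG",
    "<4>Pid: 11779, comm: osp-syn-9-0",
]

for hit in find_lbug_context(sample):
    print(hit["file"], hit["line"], hit["func"], hit["msg"])
```

To scan a real log, the list of lines could come from `open("vmcore-dmesg.txt")` instead of the inline sample. The source file, line number, and function name are the details most useful when matching a crash against an existing ticket such as LU-5244.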

This looks like it could be LU-5244. Can someone confirm this and provide a patch for 2.5.2?

As we've only seen it during the MDT unmount so far, it's not that urgent at the moment, but if this is going to hit us during normal operation, users won't be happy...



 Comments   
Comment by Peter Jones [ 21/Aug/14 ]

Bobijam is looking into this ticket

Comment by Zhenyu Xu [ 21/Aug/14 ]

Yes, it's similar to LU-5244. You can apply the patch at http://review.whamcloud.com/11543; it will land in 2.5.3.

Comment by Frederik Ferner (Inactive) [ 22/Aug/14 ]

Thanks for looking into this. I've applied this patch (on top of the three already added on top of 2.5.2) and we're running with it on our MDTs now. We have not seen any LBUG since then. Previously it happened every time we unmounted the MDT; since applying the patch, we've unmounted the MDT a number of times as part of our testing without hitting it.

Regards,
Frederik

Comment by Peter Jones [ 22/Aug/14 ]

That's great news, Frederik. I will close out this ticket as a duplicate, and you can rely on this fix being present when you move to 2.5.3 or newer releases.
