[LU-6471] Unexpected Lustre Client LBUG in llog_write() Created: 17/Apr/15  Updated: 14/Jun/18  Resolved: 25/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Bruno Faccini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5195 HSM: mdt_hsm_cdt_actions.c:104:cdt_ll... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Some IEEL2.2.x and Lustre 2.5 installations have triggered this unexpected LBUG on Lustre Clients consuming ChangeLogs :

<3>LustreError: 31536:0:(llog_cat.c:192:llog_cat_id2handle()) store2-MDT0000-mdc-ffff88206515bc00: error opening log id 0x2def:1:0: rc = -2
<3>LustreError: 31536:0:(llog_cat.c:565:llog_cat_process_cb()) store2-MDT0000-mdc-ffff88206515bc00: cannot find handle for llog 0x2def:1: -2
<0>LustreError: 31536:0:(llog.c:861:llog_write()) ASSERTION( loghandle->lgh_obj != ((void *)0) ) failed:
<0>LustreError: 31536:0:(llog.c:861:llog_write()) LBUG
<4>Pid: 31536, comm: mdc_clg_send_th
<4>
<4>Call Trace:
<4> [<ffffffffa0437895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0437e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0553c52>] llog_write+0x352/0x420 [obdclass]
<4> [<ffffffffa0553e0c>] llog_cancel_rec+0xbc/0x7c0 [obdclass]
<4> [<ffffffffa0558bcc>] llog_cat_cleanup+0x9c/0x260 [obdclass]
<4> [<ffffffffa0559fd8>] llog_cat_process_cb+0x588/0x630 [obdclass]
<4> [<ffffffffa0554e6b>] llog_process_thread+0x95b/0xd70 [obdclass]
<4> [<ffffffffa0559a50>] ? llog_cat_process_cb+0x0/0x630 [obdclass]
<4> [<ffffffffa0556a67>] llog_process_or_fork+0x127/0x550 [obdclass]
<4> [<ffffffffa0557f09>] llog_cat_process_or_fork+0x89/0x350 [obdclass]
<4> [<ffffffff8152aede>] ? mutex_lock+0x1e/0x50
<4> [<ffffffffa094fe10>] ? changelog_kkuc_cb+0x0/0x250 [mdc]
<4> [<ffffffffa05581e9>] llog_cat_process+0x19/0x20 [obdclass]
<4> [<ffffffffa09541eb>] mdc_changelog_send_thread+0x4cb/0x810 [mdc]
<4> [<ffffffffa0953d20>] ? mdc_changelog_send_thread+0x0/0x810 [mdc]
<4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 31536, comm: mdc_clg_send_th Not tainted 2.6.32-504.12.2.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8152933c>] ? panic+0xa7/0x16f
<4> [<ffffffffa0437eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0553c52>] ? llog_write+0x352/0x420 [obdclass]
<4> [<ffffffffa0553e0c>] ? llog_cancel_rec+0xbc/0x7c0 [obdclass]
<4> [<ffffffffa0558bcc>] ? llog_cat_cleanup+0x9c/0x260 [obdclass]
<4> [<ffffffffa0559fd8>] ? llog_cat_process_cb+0x588/0x630 [obdclass]
<4> [<ffffffffa0554e6b>] ? llog_process_thread+0x95b/0xd70 [obdclass]
<4> [<ffffffffa0559a50>] ? llog_cat_process_cb+0x0/0x630 [obdclass]
<4> [<ffffffffa0556a67>] ? llog_process_or_fork+0x127/0x550 [obdclass]
<4> [<ffffffffa0557f09>] ? llog_cat_process_or_fork+0x89/0x350 [obdclass]
<4> [<ffffffff8152aede>] ? mutex_lock+0x1e/0x50
<4> [<ffffffffa094fe10>] ? changelog_kkuc_cb+0x0/0x250 [mdc]
<4> [<ffffffffa05581e9>] ? llog_cat_process+0x19/0x20 [obdclass]
<4> [<ffffffffa09541eb>] ? mdc_changelog_send_thread+0x4cb/0x810 [mdc]
<4> [<ffffffffa0953d20>] ? mdc_changelog_send_thread+0x0/0x810 [mdc]
<4> [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

The problem seems to have been introduced by patch for LU-5195, which does not handle the case where code is running on a Lustre Client (ie, cathandle->lgh_obj == NULL) and is only a ChangeLog/LLOG consumer/replicator.

This problem had not been seen before since concerned change is only being executed when triggering some catalog inconsistency. Looks like master is also impacted, will push a patch soon.



 Comments   
Comment by Gerrit Updater [ 17/Apr/15 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14489
Subject: LU-6471 obdclass: fix llog_cat_cleanup() usage on Client
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9219f1ba884874f5f25df7499a13956c4eb65bea

Comment by Antoine Percher [ 17/Apr/15 ]

We meet the same issue on T100 system with lustre 2.5.3.90 on the robinhood node

Comment by Bruno Faccini (Inactive) [ 17/Apr/15 ]

Having a look to old (ie, running with Lustre version not including patch from LU-5195) ChangeLogs consumer Client's syslogs, I have found they already contained the 2 msgs preceding the LBUG now :

<3>LustreError: 31536:0:(llog_cat.c:192:llog_cat_id2handle()) store2-MDT0000-mdc-ffff88206515bc00: error opening log id 0x2def:1:0: rc = -2
<3>LustreError: 31536:0:(llog_cat.c:565:llog_cat_process_cb()) store2-MDT0000-mdc-ffff88206515bc00: cannot find handle for llog 0x2def:1: -2

This means that since Servers/MDS also did not run with LU-5195 patch in at this time, this fix is not the responsible of the LBUG by itself, but that some kind of a race possibly exists to allow a Client to work with an out-of-sync Catalog content ... This is confirmed by the lustre debug trace logs, extracted fro crash-dump upon LBUG, where it appears that the reported missing plan log has been available and used a few times before.
So finally, this may be a new consequence of the fact that the plain log destroy and catlog index record deletion are not atomic, already addressed for Server side (only!) in LU-5195 patch.

Comment by Gerrit Updater [ 23/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14489/
Subject: LU-6471 obdclass: fix llog_cat_cleanup() usage on Client
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 485f3ba8743365c7fb5823893a7eb45030cdffb5

Comment by Peter Jones [ 25/May/15 ]

Landed for 2.8

Generated at Sat Feb 10 02:00:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.