[LU-2568] MDT unable to start with corrupted llog files. Created: 03/Jan/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6000

 Description   

Here is the log from a failed MDT start:

Feb 12 22:04:14 tstmds0a01 kernel: LustreError: 5302:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x7a4801b:0x2790e5b9: rc -116
Feb 12 22:04:14 tstmds0a01 kernel: LustreError: 5302:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x7a4801b:2790e5b9: rc -116
Feb 12 22:04:14 tstmds0a01 kernel: LustreError: 5302:0:(llog_obd.c:318:cat_cancel_cb()) Cannot find handle for log 0x7a4801b
Feb 12 22:04:14 tstmds0a01 kernel: LustreError: 5248:0:(llog_obd.c:391:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -116
Feb 12 22:04:14 tstmds0a01 kernel: LustreError: 5248:0:(llog_obd.c:218:llog_setup_named()) obd mdd_obd-content-MDT0000 ctxt 12 lop_setup=ffffffff88625e70 failed -116
Feb 12 22:04:14 tstmds0a01 kernel: LustreError: 5248:0:(mds_log.c:182:mds_changelog_llog_init()) changelog llog setup failed -116
Feb 12 22:04:14 tstmds0a01 kernel: LustreError: 5248:0:(mdd_device.c:196:mdd_changelog_llog_init()) no changelog context
Feb 12 22:04:14 tstmds0a01 kernel: LustreError: 5248:0:(mdd_device.c:271:mdd_changelog_init()) Changelog setup during init failed -22
Feb 12 22:04:15 tstmds0a01 kernel: Lustre: content-MDT0000: used disk, loading
Feb 12 22:04:15 tstmds0a01 kernel: LustreError: 5248:0:(mdt_handler.c:1889:mdt_llog_ctxt_clone()) Can't get mdd ctxt -2
Feb 12 22:04:15 tstmds0a01 kernel: LustreError: 5248:0:(obd_config.c:522:class_setup()) setup content-MDT0000 failed (-2)
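
For reading the log above, it can help to translate the numeric rc values into errno names. The sketch below is only an illustration: rc_name() is a hypothetical helper, not a Lustre function, and it covers only the three codes that appear in this log. On Linux x86_64 these are ESTALE (116), EINVAL (22) and ENOENT (2).

```c
#include <errno.h>

/* Hypothetical helper (not part of Lustre): map a negative rc value
 * from the log to its symbolic errno name. Only the codes that occur
 * in this log are handled. */
static const char *rc_name(int rc)
{
        switch (-rc) {
        case ENOENT:
                return "ENOENT";   /* rc -2: "Can't get mdd ctxt" */
        case EINVAL:
                return "EINVAL";   /* rc -22: changelog setup during init */
        case ESTALE:
                return "ESTALE";   /* rc -116: llog lookup failure */
        default:
                return "unknown";
        }
}
```

Called as rc_name(-116) on such a system, this names the error reported by llog_lvfs_create() and propagated up to mds_changelog_llog_init().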



 Comments   
Comment by Alexander Zarochentsev [ 04/Jan/13 ]

Xyratex has a fix for this issue; I will upload it later.

Comment by Mikhail Pershin [ 07/Jan/13 ]

Zam, this doesn't look like a bug in master. Is it from a Lustre version older than 2.3?

Comment by Alexander Zarochentsev [ 07/Jan/13 ]

Yes, it is an older bug, but it looks like it is still present in master.

The issue was that llog files were missing and their inode numbers had been re-used for other objects.
llog_lvfs_create() returns -ESTALE (rc -116 in the log) in that case, and that is the error seen at the top level.

The key fix was:

diff --git a/lustre/obdclass/llog_lvfs.c b/lustre/obdclass/llog_lvfs.c
index 0987020..60bad4c 100644
--- a/lustre/obdclass/llog_lvfs.c
+++ b/lustre/obdclass/llog_lvfs.c
@@ -615,6 +615,10 @@ static int llog_lvfs_create(struct llog_ctxt *ctxt, struct llog_handle **res,
                         rc = PTR_ERR(dchild);
                         CERROR("error looking up logfile "LPX64":0x%x: rc %d\n",
                                logid->lgl_oid, logid->lgl_ogen, rc);
+                        if (rc == -ESTALE)
+                                /* handle reused inode same way as
+                                   non-existing one */
+                                GOTO(out, rc = -ENOENT);
                         GOTO(out, rc);
                 }
 

I still think it is relevant for the master branch, but I haven't tried to reproduce it on master.
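
The effect of the diff above can be read as a small error-normalization step. The following is a stand-alone sketch under that reading, not the actual patch: llog_lookup_rc_normalize() is a hypothetical name, and the real change lives inside llog_lvfs_create() and uses GOTO(out, ...).

```c
#include <errno.h>

/* Hypothetical stand-alone model of the patched error path.
 * The real code is inline in llog_lvfs_create(). */
static int llog_lookup_rc_normalize(int rc)
{
        /* A stale handle here means the inode was re-used by another
         * object, so report the llog as non-existing instead. */
        if (rc == -ESTALE)
                return -ENOENT;

        /* Every other result, including success, passes through. */
        return rc;
}
```

With this mapping, callers see the same -ENOENT for a re-used inode as for a log file that is simply missing, which is what the comment in the diff describes.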

Comment by Mikhail Pershin [ 08/Jan/13 ]

llog_lvfs_create() is not used in master anymore; llogs are OSD-based now. Do you have a reproducer for this? I suppose it shouldn't be a problem now if the cause was inode re-use, because the llog object is now FID-based, but we need to check that.
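
To make the inode re-use scenario concrete, here is a simplified model of a lookup keyed by inode number and generation, the two values printed as lgl_oid and lgl_ogen in the log. Everything in it (struct obj, lookup_by_ino_gen()) is hypothetical and only illustrates how a re-used inode can yield -ESTALE rather than -ENOENT; it does not describe the lvfs or OSD code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical model of an on-disk object. */
struct obj {
        uint64_t ino;   /* inode number, may be re-used after deletion */
        uint32_t gen;   /* generation, differs once the inode is re-used */
};

/* Look up an object by (ino, gen), as an inode-based llog id would.
 * Returns 0 on a match, -ESTALE if the inode exists but carries a
 * different generation, and -ENOENT if no such inode exists. */
static int lookup_by_ino_gen(const struct obj *tbl, size_t n,
                             uint64_t ino, uint32_t gen)
{
        for (size_t i = 0; i < n; i++) {
                if (tbl[i].ino != ino)
                        continue;
                return tbl[i].gen == gen ? 0 : -ESTALE;
        }
        return -ENOENT;
}
```

In this model, looking up 0x7a4801b:0x2790e5b9 in a table where inode 0x7a4801b now belongs to another object with a different generation returns -ESTALE, matching the rc -116 in the log.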

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old ticket.
