[LU-13061] osd_fid_lookup()) ASSERTION( fid_is_sane(fid) || fid_is_idif(fid) ) failed: [0x0:0x68:0x0] Created: 10/Dec/19  Updated: 22/Jan/20  Resolved: 10/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: Lustre 2.14.0, Lustre 2.12.4

Type: Bug Priority: Critical
Reporter: Andreas Dilger Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-5369 servers should reject invalid FIDs, r... Resolved
is related to LU-12593 update_log corruption Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

If a system has hit LU-12593 with a corrupt block in the llog file, it may trigger an LASSERT() because of a bad FID found in the unintialized part of the block. Applying the patch from that ticket is too late to fix the problem.

LustreError: 17438:0:(osd_handler.c:1077:osd_fid_lookup()) ASSERTION( fid_is_sane(fid) || fid_is_idif(fid) ) failed: [0x0:0x68:0x0]
LustreError: 17438:0:(osd_handler.c:1077:osd_fid_lookup()) LBUG
Pid: 17438, comm: llog_process_th 3.10.0-1062.1.1.el7_lustre.x86_64
Call Trace:
 libcfs_call_trace+0x8c/0xc0 [libcfs]
 lbug_with_loc+0x4c/0xa0 [libcfs]
 osd_fid_lookup+0xc8/0x1c60 [osd_ldiskfs]
 osd_object_init+0x61/0x110 [osd_ldiskfs]
 lu_object_start.isra.35+0x8b/0x120 [obdclass]  
 lu_object_find_at+0x1e1/0xa60 [obdclass]
 dt_locate_at+0x1d/0xb0 [obdclass]
 llog_osd_open+0x50e/0xf30 [obdclass]
 llog_open+0x15a/0x3e0 [obdclass]
 osp_sync_init+0x44a/0xe20 [osp]
 osp_init0.isra.19+0x1aed/0x1f60 [osp]
 osp_device_alloc+0x86/0x130 [osp]
 obd_setup+0x119/0x280 [obdclass]
 class_setup+0x2a8/0x840 [obdclass]
 class_process_config+0x1726/0x2830 [obdclass]
 class_config_llog_handler+0x819/0x1520 [obdclass]
 llog_process_thread+0x82f/0x18e0 [obdclass]
 llog_process_thread_daemonize+0x9f/0xe0 [obdclass]

Lustre should avoid ever triggering an LASSERT() on data read from disk or from the network. In this case, it probably makes sense to add a check in llog_osd_open() with fid_is_sane() before it uses the FID, and just return an error rather than crashing.



 Comments   
Comment by Andreas Dilger [ 10/Dec/19 ]

HongChao, can you please make a patch for this, preferably with a test case. You can add a CFS_FAIL_LOC() check to set fid_seq=0 or similar to trigger the LASSERT(), then verify your patch fixes it.

Comment by Andreas Dilger [ 11/Dec/19 ]

Based on further analysis of the problem, it seems we can't verify fid_is_sane() directly in osp_sync_llog_init() since the logid is not yet converted to a FID. If llog_osd_open() checks fid_is_sane() and returns -ENOENT if it isn't sane:

               if (logid != NULL) {
                        logid_to_fid(logid, &lgi->lgi_fid);
                        if (!fid_is_sane(&lgi->lgi_fid)) {
                                CERROR("%s: bad FID "DFID" in %s idx: rc = %d\n",
                                       ctxt->loc_exp->exp_obd->obd_name,
                                       PFID(&lgi->lgi_fid), -ENOENT);
                                RETURN(-ENOENT);
                        } else {
                                /* If logid == NULL, then it means the caller needs

then this return code is checked in osp_sync_llog_init() and the llog will be recreated:

                rc = llog_open(env, ctxt, &lgh, &osi->osi_cid.lci_logid, NULL,
                               LLOG_OPEN_EXISTS);
                /* re-create llog if it is missing */
                if (rc == -ENOENT)
                        logid_set_id(&osi->osi_cid.lci_logid, 0);
Comment by Hongchao Zhang [ 11/Dec/19 ]

Hi Andreas,
How about calling 'logid_to_fid" in osp_sync_llog_init after getting the logid from "CATALOGS" file, then check whether it is sane?

Comment by Andreas Dilger [ 11/Dec/19 ]

Sure, that would also work, and is also much more direct.

Comment by Andreas Dilger [ 11/Dec/19 ]

There is a bit of a question in my mind whether logid may not be a FID in some cases (e.g. upgrade from some very old system), but I don't think that is true anymore. This compatibility was needed for old filesystems when they were upgraded from pre-2.4 Lustre, but all of these old systems should be gone at this point, or are never upgrading.

After fixing the current LASSERT issue, it might make sense to look at the use of llog_logid in the code and update it to use lu_fid wherever possible, by converting the llog_logid into lu_fid immediately and only passing the FID around. The next step would be to add a new LLOG_LOGFID_MAGIC that only stores the lu_fid to the catalog, start using it after some delay (with OBD_VERSION_CHECK() and/or backport code to LTS to read LLOG_LOGFID_MAGIC records and convert into llog_logid to minimize code change but still allow upgrade/downgrade to work) and then eventually we can remove llog_logid.

Comment by Hongchao Zhang [ 11/Dec/19 ]

as per the dump of the "CATALOGS", the content of the "llog_catid" is

000020 0000000400000068 0000000000000000
000030 0000000000000000 0000000000000000

the OID field is "0x400000068", but is truncated to "0x68" during assigning it to lu_fid.f_oid(32bits).
the data seems to be a normal FID (f_seq=0x400000068, f_oid=0, f_ver=0)

Comment by Andreas Dilger [ 11/Dec/19 ]

the OID field is "0x400000068", but is truncated to "0x68" during assigning it to lu_fid.f_oid(32bits).
the data seems to be a normal FID (f_seq=0x400000068, f_oid=0, f_ver=0)

No, the 0x400000068 field is the oi_id field (mapped to f_oid), and the f_seq field is 0x0, which is what triggers the LASSERT. Valid records look like the following with oi_seq = f_seq = 0x1 = FID_SEQ_LLOG:

000080 00000000000004ef 0000000000000001
000090 0000000000000000 0000000000000000

In this case, the FID is [0x1:0x4ef:0x0] and would map to object O/1/d15/1263 on the MDT.

Comment by Gerrit Updater [ 12/Dec/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36998
Subject: LU-13061 osp: check catlog FID after reading in
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7a5df8c9c0fa2cab2135e68e305a9a850263d720

Comment by Gerrit Updater [ 10/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36998/
Subject: LU-13061 osp: check catlog FID after reading in
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4597fa7d884de0f1a1b030052d4d34983fed6109

Comment by Peter Jones [ 10/Jan/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 10/Jan/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37185
Subject: LU-13061 osp: check catlog FID after reading in
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 0d2e23757ef6663f8b06b935253835937d8ca1f3

Comment by Gerrit Updater [ 22/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37185/
Subject: LU-13061 osp: check catlog FID after reading in
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 055eab6bd4c29bc961a10824ffa44323cce7640c

Generated at Sat Feb 10 02:58:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.