[LU-13061] osd_fid_lookup()) ASSERTION( fid_is_sane(fid) || fid_is_idif(fid) ) failed: [0x0:0x68:0x0] Created: 10/Dec/19 Updated: 22/Jan/20 Resolved: 10/Jan/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andreas Dilger | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
If a system has hit LustreError: 17438:0:(osd_handler.c:1077:osd_fid_lookup()) ASSERTION( fid_is_sane(fid) || fid_is_idif(fid) ) failed: [0x0:0x68:0x0] LustreError: 17438:0:(osd_handler.c:1077:osd_fid_lookup()) LBUG Pid: 17438, comm: llog_process_th 3.10.0-1062.1.1.el7_lustre.x86_64 Call Trace: libcfs_call_trace+0x8c/0xc0 [libcfs] lbug_with_loc+0x4c/0xa0 [libcfs] osd_fid_lookup+0xc8/0x1c60 [osd_ldiskfs] osd_object_init+0x61/0x110 [osd_ldiskfs] lu_object_start.isra.35+0x8b/0x120 [obdclass] lu_object_find_at+0x1e1/0xa60 [obdclass] dt_locate_at+0x1d/0xb0 [obdclass] llog_osd_open+0x50e/0xf30 [obdclass] llog_open+0x15a/0x3e0 [obdclass] osp_sync_init+0x44a/0xe20 [osp] osp_init0.isra.19+0x1aed/0x1f60 [osp] osp_device_alloc+0x86/0x130 [osp] obd_setup+0x119/0x280 [obdclass] class_setup+0x2a8/0x840 [obdclass] class_process_config+0x1726/0x2830 [obdclass] class_config_llog_handler+0x819/0x1520 [obdclass] llog_process_thread+0x82f/0x18e0 [obdclass] llog_process_thread_daemonize+0x9f/0xe0 [obdclass] Lustre should avoid ever triggering an LASSERT() on data read from disk or from the network. In this case, it probably makes sense to add a check in llog_osd_open() with fid_is_sane() before it uses the FID, and just return an error rather than crashing. |
| Comments |
| Comment by Andreas Dilger [ 10/Dec/19 ] |
|
HongChao, can you please make a patch for this, preferably with a test case. You can add a CFS_FAIL_LOC() check to set fid_seq=0 or similar to trigger the LASSERT(), then verify your patch fixes it. |
| Comment by Andreas Dilger [ 11/Dec/19 ] |
|
Based on further analysis of the problem, it seems we can't verify fid_is_sane() directly in osp_sync_llog_init() since the logid is not yet converted to a FID. If llog_osd_open() checks fid_is_sane() and returns -ENOENT if it isn't sane:
if (logid != NULL) {
logid_to_fid(logid, &lgi->lgi_fid);
if (!fid_is_sane(&lgi->lgi_fid)) {
CERROR("%s: bad FID "DFID" in %s idx: rc = %d\n",
ctxt->loc_exp->exp_obd->obd_name,
PFID(&lgi->lgi_fid), -ENOENT);
RETURN(-ENOENT);
} else {
/* If logid == NULL, then it means the caller needs
then this return code is checked in osp_sync_llog_init() and the llog will be recreated:
rc = llog_open(env, ctxt, &lgh, &osi->osi_cid.lci_logid, NULL,
LLOG_OPEN_EXISTS);
/* re-create llog if it is missing */
if (rc == -ENOENT)
logid_set_id(&osi->osi_cid.lci_logid, 0);
|
| Comment by Hongchao Zhang [ 11/Dec/19 ] |
|
Hi Andreas, |
| Comment by Andreas Dilger [ 11/Dec/19 ] |
|
Sure, that would also work, and is also much more direct. |
| Comment by Andreas Dilger [ 11/Dec/19 ] |
|
There is a bit of a question in my mind whether logid may not be a FID in some cases (e.g. upgrade from some very old system), but I don't think that is true anymore. This compatibility was needed for old filesystems when they were upgraded from pre-2.4 After fixing the current LASSERT issue, it might make sense to look at the use of llog_logid in the code and update it to use lu_fid wherever possible, by converting the llog_logid into lu_fid immediately and only passing the FID around. The next step would be to add a new LLOG_LOGFID_MAGIC that only stores the lu_fid to the catalog, start using it after some delay (with OBD_VERSION_CHECK() and/or backport code to LTS to read LLOG_LOGFID_MAGIC records and convert into llog_logid to minimize code change but still allow upgrade/downgrade to work) and then eventually we can remove llog_logid. |
| Comment by Hongchao Zhang [ 11/Dec/19 ] |
|
as per the dump of the "CATALOGS", the content of the "llog_catid" is 000020 0000000400000068 0000000000000000 000030 0000000000000000 0000000000000000 the OID field is "0x400000068", but is truncated to "0x68" during assigning it to lu_fid.f_oid(32bits). |
| Comment by Andreas Dilger [ 11/Dec/19 ] |
No, the 0x400000068 field is the oi_id field (mapped to f_oid), and the f_seq field is 0x0, which is what triggers the LASSERT. Valid records look like the following with oi_seq = f_seq = 0x1 = FID_SEQ_LLOG: 000080 00000000000004ef 0000000000000001 000090 0000000000000000 0000000000000000 In this case, the FID is [0x1:0x4ef:0x0] and would map to object O/1/d15/1263 on the MDT. |
| Comment by Gerrit Updater [ 12/Dec/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36998 |
| Comment by Gerrit Updater [ 10/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36998/ |
| Comment by Peter Jones [ 10/Jan/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 10/Jan/20 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37185 |
| Comment by Gerrit Updater [ 22/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37185/ |