Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13061

osd_fid_lookup()) ASSERTION( fid_is_sane(fid) || fid_is_idif(fid) ) failed: [0x0:0x68:0x0]

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.14.0, Lustre 2.12.4
    • Lustre 2.12.3
    • None
    • 3
    • 9223372036854775807

    Description

      If a system has hit LU-12593 with a corrupt block in the llog file, it may trigger an LASSERT() because of a bad FID found in the unintialized part of the block. Applying the patch from that ticket is too late to fix the problem.

      LustreError: 17438:0:(osd_handler.c:1077:osd_fid_lookup()) ASSERTION( fid_is_sane(fid) || fid_is_idif(fid) ) failed: [0x0:0x68:0x0]
      LustreError: 17438:0:(osd_handler.c:1077:osd_fid_lookup()) LBUG
      Pid: 17438, comm: llog_process_th 3.10.0-1062.1.1.el7_lustre.x86_64
      Call Trace:
       libcfs_call_trace+0x8c/0xc0 [libcfs]
       lbug_with_loc+0x4c/0xa0 [libcfs]
       osd_fid_lookup+0xc8/0x1c60 [osd_ldiskfs]
       osd_object_init+0x61/0x110 [osd_ldiskfs]
       lu_object_start.isra.35+0x8b/0x120 [obdclass]  
       lu_object_find_at+0x1e1/0xa60 [obdclass]
       dt_locate_at+0x1d/0xb0 [obdclass]
       llog_osd_open+0x50e/0xf30 [obdclass]
       llog_open+0x15a/0x3e0 [obdclass]
       osp_sync_init+0x44a/0xe20 [osp]
       osp_init0.isra.19+0x1aed/0x1f60 [osp]
       osp_device_alloc+0x86/0x130 [osp]
       obd_setup+0x119/0x280 [obdclass]
       class_setup+0x2a8/0x840 [obdclass]
       class_process_config+0x1726/0x2830 [obdclass]
       class_config_llog_handler+0x819/0x1520 [obdclass]
       llog_process_thread+0x82f/0x18e0 [obdclass]
       llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
      

      Lustre should avoid ever triggering an LASSERT() on data read from disk or from the network. In this case, it probably makes sense to add a check in llog_osd_open() with fid_is_sane() before it uses the FID, and just return an error rather than crashing.

      Attachments

        Issue Links

          Activity

            [LU-13061] osd_fid_lookup()) ASSERTION( fid_is_sane(fid) || fid_is_idif(fid) ) failed: [0x0:0x68:0x0]

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37185/
            Subject: LU-13061 osp: check catlog FID after reading in
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 055eab6bd4c29bc961a10824ffa44323cce7640c

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37185/ Subject: LU-13061 osp: check catlog FID after reading in Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 055eab6bd4c29bc961a10824ffa44323cce7640c

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37185
            Subject: LU-13061 osp: check catlog FID after reading in
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 0d2e23757ef6663f8b06b935253835937d8ca1f3

            gerrit Gerrit Updater added a comment - Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37185 Subject: LU-13061 osp: check catlog FID after reading in Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 0d2e23757ef6663f8b06b935253835937d8ca1f3
            pjones Peter Jones added a comment -

            Landed for 2.14

            pjones Peter Jones added a comment - Landed for 2.14

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36998/
            Subject: LU-13061 osp: check catlog FID after reading in
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4597fa7d884de0f1a1b030052d4d34983fed6109

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36998/ Subject: LU-13061 osp: check catlog FID after reading in Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4597fa7d884de0f1a1b030052d4d34983fed6109

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36998
            Subject: LU-13061 osp: check catlog FID after reading in
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7a5df8c9c0fa2cab2135e68e305a9a850263d720

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36998 Subject: LU-13061 osp: check catlog FID after reading in Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 7a5df8c9c0fa2cab2135e68e305a9a850263d720

            the OID field is "0x400000068", but is truncated to "0x68" during assigning it to lu_fid.f_oid(32bits).
            the data seems to be a normal FID (f_seq=0x400000068, f_oid=0, f_ver=0)

            No, the 0x400000068 field is the oi_id field (mapped to f_oid), and the f_seq field is 0x0, which is what triggers the LASSERT. Valid records look like the following with oi_seq = f_seq = 0x1 = FID_SEQ_LLOG:

            000080 00000000000004ef 0000000000000001
            000090 0000000000000000 0000000000000000
            

            In this case, the FID is [0x1:0x4ef:0x0] and would map to object O/1/d15/1263 on the MDT.

            adilger Andreas Dilger added a comment - the OID field is "0x400000068", but is truncated to "0x68" during assigning it to lu_fid.f_oid(32bits). the data seems to be a normal FID (f_seq=0x400000068, f_oid=0, f_ver=0) No, the 0x400000068 field is the oi_id field (mapped to f_oid ), and the f_seq field is 0x0 , which is what triggers the LASSERT. Valid records look like the following with oi_seq = f_seq = 0x1 = FID_SEQ_LLOG : 000080 00000000000004ef 0000000000000001 000090 0000000000000000 0000000000000000 In this case, the FID is [0x1:0x4ef:0x0] and would map to object O/1/d15/1263 on the MDT.

            as per the dump of the "CATALOGS", the content of the "llog_catid" is

            000020 0000000400000068 0000000000000000
            000030 0000000000000000 0000000000000000
            

            the OID field is "0x400000068", but is truncated to "0x68" during assigning it to lu_fid.f_oid(32bits).
            the data seems to be a normal FID (f_seq=0x400000068, f_oid=0, f_ver=0)

            hongchao.zhang Hongchao Zhang added a comment - as per the dump of the "CATALOGS", the content of the "llog_catid" is 000020 0000000400000068 0000000000000000 000030 0000000000000000 0000000000000000 the OID field is "0x400000068", but is truncated to "0x68" during assigning it to lu_fid.f_oid(32bits). the data seems to be a normal FID (f_seq=0x400000068, f_oid=0, f_ver=0)

            There is a bit of a question in my mind whether logid may not be a FID in some cases (e.g. upgrade from some very old system), but I don't think that is true anymore. This compatibility was needed for old filesystems when they were upgraded from pre-2.4 Lustre, but all of these old systems should be gone at this point, or are never upgrading.

            After fixing the current LASSERT issue, it might make sense to look at the use of llog_logid in the code and update it to use lu_fid wherever possible, by converting the llog_logid into lu_fid immediately and only passing the FID around. The next step would be to add a new LLOG_LOGFID_MAGIC that only stores the lu_fid to the catalog, start using it after some delay (with OBD_VERSION_CHECK() and/or backport code to LTS to read LLOG_LOGFID_MAGIC records and convert into llog_logid to minimize code change but still allow upgrade/downgrade to work) and then eventually we can remove llog_logid.

            adilger Andreas Dilger added a comment - There is a bit of a question in my mind whether logid may not be a FID in some cases (e.g. upgrade from some very old system), but I don't think that is true anymore. This compatibility was needed for old filesystems when they were upgraded from pre-2.4 Lustre, but all of these old systems should be gone at this point, or are never upgrading. After fixing the current LASSERT issue, it might make sense to look at the use of llog_logid in the code and update it to use lu_fid wherever possible, by converting the llog_logid into lu_fid immediately and only passing the FID around. The next step would be to add a new LLOG_LOGFID_MAGIC that only stores the lu_fid to the catalog, start using it after some delay (with OBD_VERSION_CHECK() and/or backport code to LTS to read LLOG_LOGFID_MAGIC records and convert into llog_logid to minimize code change but still allow upgrade/downgrade to work) and then eventually we can remove llog_logid .

            Sure, that would also work, and is also much more direct.

            adilger Andreas Dilger added a comment - Sure, that would also work, and is also much more direct.
            hongchao.zhang Hongchao Zhang added a comment - - edited

            Hi Andreas,
            How about calling 'logid_to_fid" in osp_sync_llog_init after getting the logid from "CATALOGS" file, then check whether it is sane?

            hongchao.zhang Hongchao Zhang added a comment - - edited Hi Andreas, How about calling 'logid_to_fid" in osp_sync_llog_init after getting the logid from "CATALOGS" file, then check whether it is sane?

            People

              hongchao.zhang Hongchao Zhang
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: