Lustre / LU-4829

LBUG: ASSERTION( !fid_is_idif(fid) )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.3
    • 13287

    Description

      We have our TDS system set up in wide-stripe mode. Each OSS mounts over 100 OSTs. On mount the other day, we hit an assertion when scrub started.

      [12319.230157] LustreError: 54554:0:(osd_internal.h:752:osd_fid2oi()) ASSERTION( !fid_is_idif(fid) ) failed: [0x100000000:0x1:0x0]
      [12319.242502] LustreError: 54554:0:(osd_internal.h:752:osd_fid2oi()) LBUG
      [12319.249538] Pid: 54554, comm: OI_scrub
      [12319.253707] 
      [12319.253707] Call Trace:
      [12319.258529]  [<ffffffffa03dd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [12319.265837]  [<ffffffffa03dde97>] lbug_with_loc+0x47/0xb0 [libcfs]
      [12319.272395]  [<ffffffffa0be40f5>] __osd_oi_lookup+0x3a5/0x3b0 [osd_ldiskfs]
      [12319.279770]  [<ffffffff8119dfcd>] ? generic_drop_inode+0x1d/0x80
      [12319.286133]  [<ffffffffa0be4174>] osd_oi_lookup+0x74/0x140 [osd_ldiskfs]
      [12319.293197]  [<ffffffffa0bf8fbf>] osd_scrub_exec+0x1af/0xf30 [osd_ldiskfs]
      [12319.300553]  [<ffffffffa0bfa5f2>] ? osd_scrub_next+0x142/0x4b0 [osd_ldiskfs]
      [12319.308061]  [<ffffffffa0b71432>] ? ldiskfs_read_inode_bitmap+0x172/0x2c0 [ldiskfs]
      [12319.316454]  [<ffffffffa0bf4d4f>] osd_inode_iteration+0x1cf/0x570 [osd_ldiskfs]
      [12319.324461]  [<ffffffff810516b9>] ? __wake_up_common+0x59/0x90
      [12319.330764]  [<ffffffffa0bf8e10>] ? osd_scrub_exec+0x0/0xf30 [osd_ldiskfs]
      [12319.337941]  [<ffffffffa0bfa4b0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs]
      [12319.345300]  [<ffffffffa0bf732a>] osd_scrub_main+0x59a/0xd00 [osd_ldiskfs]
      [12319.352591]  [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
      [12319.358585]  [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
      [12319.365881]  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [12319.371174]  [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
      [12319.378509]  [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
      [12319.385799]  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      

      We had panic_on_lbug off, so we don't have a crash dump. But the system is still running, so if there's anything useful we can try to grab it. I tried to cat /proc/fs/lustre/osd-ldiskfs/atlastds-OST00f3/oi_scrub but it just hangs. That 'cat' process is stuck on the following:

      # cat /proc/83715/stack
      [<ffffffff81281f34>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffa0bf630d>] osd_scrub_dump+0x3d/0x320 [osd_ldiskfs]
      [<ffffffffa0be6055>] lprocfs_osd_rd_oi_scrub+0x75/0xb0 [osd_ldiskfs]
      [<ffffffffa054f563>] lprocfs_fops_read+0xf3/0x1f0 [obdclass]
      [<ffffffff811e9fee>] proc_reg_read+0x7e/0xc0
      [<ffffffff81181f05>] vfs_read+0xb5/0x1a0
      [<ffffffff81182041>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
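
      When comparing traces like the two above, it can help to reduce each frame to its symbol and module. The following is a minimal sketch, not part of the original report: the `Frame` type and `parse_frames` helper are hypothetical names, and the regular expression assumes the `[<addr>] symbol+0xoff/0xsize [module]` layout seen in this ticket. Frames prefixed with `?` are marked as unreliable, since the kernel uses that marker for addresses found on the stack that may not be part of the actual call chain.

      ```python
      import re
      from typing import List, NamedTuple, Optional


      class Frame(NamedTuple):
          """One parsed stack frame (hypothetical helper type)."""
          address: int
          symbol: str
          offset: int
          size: int
          module: Optional[str]
          reliable: bool


      # Matches "[<addr>] [? ]symbol+0xoff/0xsize [module]"; a leading dmesg
      # timestamp such as "[12319.258529]" is ignored because it lacks "<".
      _FRAME_RE = re.compile(
          r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<q>\?\s+)?"
          r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
          r"(?:\s+\[(?P<mod>\w+)\])?"
      )


      def parse_frames(text: str) -> List[Frame]:
          """Return the frames found in dmesg or /proc/<pid>/stack output."""
          frames = []
          for line in text.splitlines():
              m = _FRAME_RE.search(line)
              if m is None:
                  continue  # not a symbolised frame (e.g. the 0xffff... terminator)
              frames.append(
                  Frame(
                      address=int(m.group("addr"), 16),
                      symbol=m.group("sym"),
                      offset=int(m.group("off"), 16),
                      size=int(m.group("size"), 16),
                      module=m.group("mod"),
                      reliable=m.group("q") is None,
                  )
              )
          return frames
      ```

      Filtering the result on `reliable` and `module == "osd_ldiskfs"` leaves the scrub path (`osd_scrub_main`, `osd_inode_iteration`, `osd_scrub_exec`, `osd_oi_lookup`, `__osd_oi_lookup`) for the LBUG trace, and `osd_scrub_dump` waiting in `call_rwsem_down_read_failed` for the hung `cat`.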
      

      The FID it's complaining about, [0x100000000:0x1:0x0], looks suspect. The sequence is FID_SEQ_IDIF and the ObjID is 1. I know that on ext4 inode 1 stores the bad-blocks information, but I don't think that's what we're seeing here.
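
      To make the check behind the assertion concrete, here is a minimal sketch that parses the FID from the log line and tests whether its sequence falls in the IDIF range. It is an illustration only: `Fid`, `parse_fid`, and this Python `fid_is_idif` are hypothetical helpers, and the two constants are assumed to mirror the IDIF sequence bounds in the Lustre headers rather than being taken from this ticket.

      ```python
      import re
      from typing import NamedTuple

      # Assumed IDIF sequence range (believed to match the Lustre FID definitions).
      FID_SEQ_IDIF = 0x100000000
      FID_SEQ_IDIF_MAX = 0x1FFFFFFFF


      class Fid(NamedTuple):
          """A FID as printed in Lustre logs: [seq:oid:ver]."""
          seq: int
          oid: int
          ver: int


      _FID_RE = re.compile(
          r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]"
      )


      def parse_fid(text: str) -> Fid:
          """Extract the first [seq:oid:ver] triple from a log line."""
          m = _FID_RE.search(text)
          if m is None:
              raise ValueError("no FID found in: %r" % text)
          seq, oid, ver = (int(g, 16) for g in m.groups())
          return Fid(seq, oid, ver)


      def fid_is_idif(fid: Fid) -> bool:
          """True if the sequence lies in the (assumed) IDIF range."""
          return FID_SEQ_IDIF <= fid.seq <= FID_SEQ_IDIF_MAX


      if __name__ == "__main__":
          line = "ASSERTION( !fid_is_idif(fid) ) failed: [0x100000000:0x1:0x0]"
          fid = parse_fid(line)
          print(fid, "IDIF" if fid_is_idif(fid) else "not IDIF")
      ```

      Under these assumptions the FID from the log has sequence 0x100000000, the lowest value in the IDIF range, with object ID 1 and version 0, which is exactly the case the assertion in osd_fid2oi() rejects.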

      We haven't yet tried to re-mount to see if the issue is persistent, since there may be something on the running system that you want us to provide. But we can do that if it's helpful.

          Activity

            ezell Matt Ezell added a comment -

            Yes, we can close it. Thanks.

            jamesanunez James Nunez (Inactive) added a comment -

            Matt,

            Since the fix is in b2_5 and later releases and there is a patch for b2_4, should we close this ticket or is there something else you need from us?

            Thanks

            laisiyao Lai Siyao added a comment -

            The backport patch for 2.4.x is at http://review.whamcloud.com/#/c/10061/

            yong.fan nasf (Inactive) added a comment -

            Matt, how is your progress? What would you like us to do as the next step?

            yong.fan nasf (Inactive) added a comment -

            A reboot or re-mount alone, without the 6669 patch applied, can NOT resolve the issue. Even though it may work for a while, you will still hit it again some time later.

            ezell Matt Ezell added a comment -

            We will need to take several test shots between now and putting 2.5 into production, but I think it's reasonable for us to target that (and avoid the work of backporting). We haven't seen this in production yet.

            I guess my only question is:
            If we hit this in production, would a reboot and re-mount hit the problem again? Or is it intermittent?

            pjones Peter Jones added a comment -

            Matt

            This fix is included in all 2.5.x releases. It would be possible to backport it to 2.4.x, but there would be quite a few dependencies to pick up, so how we proceed will depend on the timeline for your move to 2.5.x.

            Regards

            Peter

            yong.fan nasf (Inactive) added a comment -

            With LMA enabled on OST objects in lustre-2.4.3, we need to backport the patch http://review.whamcloud.com/#/c/6669/

            pjones Peter Jones added a comment -

            Fan Yong

            Could you please advise on this one?

            Thanks

            Peter

            ezell Matt Ezell added a comment -

            Sorry, I should have included that in the original report. It was formatted with 2.4, so there shouldn't be any IGIF/IDIF files.


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: ezell Matt Ezell