[LU-4829] LBUG: ASSERTION( !fid_is_idif(fid) ) Created: 28/Mar/14  Updated: 10/Aug/14  Resolved: 07/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Minor
Reporter: Matt Ezell Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: mn4

Issue Links:
Related
is related to LU-3335 LFSCK II: MDT-OST OST local consisten... Resolved
Rank (Obsolete): 13287

 Description   

We have our TDS system set up in wide-stripe mode; each OSS mounts over 100 OSTs. During a mount the other day, we hit an assertion when the OI scrub started.

[12319.230157] LustreError: 54554:0:(osd_internal.h:752:osd_fid2oi()) ASSERTION( !fid_is_idif(fid) ) failed: [0x100000000:0x1:0x0]
[12319.242502] LustreError: 54554:0:(osd_internal.h:752:osd_fid2oi()) LBUG
[12319.249538] Pid: 54554, comm: OI_scrub
[12319.253707] 
[12319.253707] Call Trace:
[12319.258529]  [<ffffffffa03dd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[12319.265837]  [<ffffffffa03dde97>] lbug_with_loc+0x47/0xb0 [libcfs]
[12319.272395]  [<ffffffffa0be40f5>] __osd_oi_lookup+0x3a5/0x3b0 [osd_ldiskfs]
[12319.279770]  [<ffffffff8119dfcd>] ? generic_drop_inode+0x1d/0x80
[12319.286133]  [<ffffffffa0be4174>] osd_oi_lookup+0x74/0x140 [osd_ldiskfs]
[12319.293197]  [<ffffffffa0bf8fbf>] osd_scrub_exec+0x1af/0xf30 [osd_ldiskfs]
[12319.300553]  [<ffffffffa0bfa5f2>] ? osd_scrub_next+0x142/0x4b0 [osd_ldiskfs]
[12319.308061]  [<ffffffffa0b71432>] ? ldiskfs_read_inode_bitmap+0x172/0x2c0 [ldiskfs]
[12319.316454]  [<ffffffffa0bf4d4f>] osd_inode_iteration+0x1cf/0x570 [osd_ldiskfs]
[12319.324461]  [<ffffffff810516b9>] ? __wake_up_common+0x59/0x90
[12319.330764]  [<ffffffffa0bf8e10>] ? osd_scrub_exec+0x0/0xf30 [osd_ldiskfs]
[12319.337941]  [<ffffffffa0bfa4b0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs]
[12319.345300]  [<ffffffffa0bf732a>] osd_scrub_main+0x59a/0xd00 [osd_ldiskfs]
[12319.352591]  [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[12319.358585]  [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.365881]  [<ffffffff8100c0ca>] child_rip+0xa/0x20
[12319.371174]  [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.378509]  [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.385799]  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

We had panic_on_lbug off, so we don't have a crash dump, but the system is still running, so if there's anything useful on it we can try to grab it. I tried to cat /proc/fs/lustre/osd-ldiskfs/atlastds-OST00f3/oi_scrub, but it just hangs. That 'cat' process is stuck with the following stack:

# cat /proc/83715/stack
[<ffffffff81281f34>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0bf630d>] osd_scrub_dump+0x3d/0x320 [osd_ldiskfs]
[<ffffffffa0be6055>] lprocfs_osd_rd_oi_scrub+0x75/0xb0 [osd_ldiskfs]
[<ffffffffa054f563>] lprocfs_fops_read+0xf3/0x1f0 [obdclass]
[<ffffffff811e9fee>] proc_reg_read+0x7e/0xc0
[<ffffffff81181f05>] vfs_read+0xb5/0x1a0
[<ffffffff81182041>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
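
Presumably the LBUGged OI_scrub thread still holds the scrub rwsem for write: with panic_on_lbug off, lbug_with_loc() just loops in schedule() without releasing any locks, so the down_read() in osd_scrub_dump() blocks forever. A rough userspace analogue of that situation (a pthread rwlock standing in for the kernel rwsem; illustrative only, not Lustre code):

/* Build with: gcc -pthread hang.c. The "scrub" thread takes the lock for
 * write and then sleeps forever, as a thread stuck in lbug_with_loc()
 * would; the reader then blocks indefinitely, like the hung 'cat'. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t scrub_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *oi_scrub(void *arg)
{
	pthread_rwlock_wrlock(&scrub_lock);  /* scrub takes the lock... */
	pause();                             /* ...then "LBUGs" and never releases it */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, oi_scrub, NULL);
	sleep(1);                            /* let the scrub thread win the lock */
	printf("cat oi_scrub...\n");
	pthread_rwlock_rdlock(&scrub_lock);  /* the 'cat' is stuck here */
	printf("never reached\n");
	return 0;
}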

The FID it's complaining about, [0x100000000:0x1:0x0], looks suspect. The sequence is FID_SEQ_IDIF and the ObjID is 1. I know that on ext4 inode 1 stores the bad-blocks information, but I don't think that's what we're seeing here.
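
For reference, the IDIF ("ID in FID") mapping packs an old-style (objid, ost_idx) pair into a FID as seq = 0x100000000 | (ost_idx << 16) | ((objid >> 32) & 0xffff), oid = objid & 0xffffffff. Here is a small standalone sketch that decodes the FID from the LBUG; it is not Lustre source, but the helpers are modeled on fid_is_idif() and fid_idif_id() from the Lustre headers:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define FID_SEQ_IDIF     0x100000000ULL
#define FID_SEQ_IDIF_MAX 0x1ffffffffULL

struct lu_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* An IDIF FID is any FID whose sequence falls in the IDIF range. */
static bool fid_is_idif(const struct lu_fid *fid)
{
	return fid->f_seq >= FID_SEQ_IDIF && fid->f_seq <= FID_SEQ_IDIF_MAX;
}

int main(void)
{
	/* The FID from the assertion message: [0x100000000:0x1:0x0] */
	struct lu_fid fid = { 0x100000000ULL, 0x1, 0x0 };
	uint32_t ost_idx = (fid.f_seq >> 16) & 0xffff;
	uint64_t objid = ((uint64_t)fid.f_ver << 48) |
			 ((fid.f_seq & 0xffff) << 32) | fid.f_oid;

	printf("idif: %d ost_idx: %u objid: %llu\n",
	       fid_is_idif(&fid), ost_idx, (unsigned long long)objid);
	return 0;
}

This prints "idif: 1 ost_idx: 0 objid: 1", i.e. an IDIF FID for plain OST object 1. That is exactly the class of FID osd_fid2oi() asserts it will never be handed, presumably because IDIF OST-objects are mapped through the O/... directory tree rather than the OI tables.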

We haven't yet tried to re-mount to see if the issue is persistent, since there may be something on the running system that you want us to provide first. But we can do that if it's helpful.



 Comments   
Comment by Oleg Drokin [ 28/Mar/14 ]

Is this a 2.4-formatted filesystem, or was it created in the past with some other version and then upgraded?

Comment by Matt Ezell [ 28/Mar/14 ]

Sorry, I should have included that in the original report. It was formatted with 2.4, so there shouldn't be any IGIF/IDIF files.

Comment by Peter Jones [ 28/Mar/14 ]

Fan Yong

Could you please advise on this one?

Thanks

Peter

Comment by nasf (Inactive) [ 31/Mar/14 ]

With LMA enabled on OST-objects, for lustre-2.4.3 we need to backport the patch http://review.whamcloud.com/#/c/6669/
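
For context: the LMA ("trusted.lma") extended attribute stores an object's self-FID, which the scrub code can use to verify and rebuild OI mappings, and patch 6669 makes OST-objects carry it too. A hedged sketch of dumping that xattr from a direct ldiskfs mount of an OST; the struct mirrors struct lustre_mdt_attrs from lustre_idl.h, and the object path is hypothetical:

#include <stdio.h>
#include <stdint.h>
#include <sys/xattr.h>

struct lu_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Mirrors struct lustre_mdt_attrs; stored little-endian on disk, so no
 * byte swapping is needed on x86. */
struct lustre_mdt_attrs {
	uint32_t      lma_compat;
	uint32_t      lma_incompat;
	struct lu_fid lma_self_fid;   /* the FID the scrub verifies against */
};

int main(int argc, char **argv)
{
	/* Hypothetical OST object path under an ldiskfs mount of the OST. */
	const char *path = argc > 1 ? argv[1] : "/mnt/ost/O/0/d1/1";
	struct lustre_mdt_attrs lma;

	if (getxattr(path, "trusted.lma", &lma, sizeof(lma)) < 0) {
		perror("getxattr trusted.lma");
		return 1;
	}
	printf("self FID: [0x%llx:0x%x:0x%x]\n",
	       (unsigned long long)lma.lma_self_fid.f_seq,
	       lma.lma_self_fid.f_oid, lma.lma_self_fid.f_ver);
	return 0;
}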

Comment by Peter Jones [ 31/Mar/14 ]

Matt

This fix is included in all 2.5.x releases. It would be possible to backport it to 2.4.x, but there would be quite a few dependencies to pick up, so how we proceed will depend on the timeline for your move to 2.5.x.

Regards

Peter

Comment by Matt Ezell [ 31/Mar/14 ]

We will need to take several test shots between now and putting 2.5 into production, but I think it's reasonable for us to target that (and avoid the work of backporting). We haven't seen this in production yet.

I guess my only question is:
If we hit this in production, would a reboot and re-mount hit the problem again? Or is it intermittent?

Comment by nasf (Inactive) [ 01/Apr/14 ]

A reboot or re-mount alone, without the 6669 patch applied, can NOT resolve the issue; even if it works for a while, you will still hit it again some time later.

Comment by nasf (Inactive) [ 11/Apr/14 ]

Matt, how is your progress? What would you like us to do as the next step?

Comment by Lai Siyao [ 23/Apr/14 ]

The backport patch for 2.4.x is at http://review.whamcloud.com/#/c/10061/

Comment by James Nunez (Inactive) [ 06/Aug/14 ]

Matt,

Since the fix is in b2_5 and later releases and there is a patch for b2_4, should we close this ticket or is there something else you need from us?

Thanks

Comment by Matt Ezell [ 07/Aug/14 ]

Yes, we can close it. Thanks.
