[LU-4829] LBUG: ASSERTION( !fid_is_idif(fid) ) Created: 28/Mar/14 Updated: 10/Aug/14 Resolved: 07/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Matt Ezell | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4 |
| Issue Links: | |
| Rank (Obsolete): | 13287 |
| Description |
|
We have our TDS system set up in wide-stripe mode. Each OSS is mounting over 100 OSTs. On mount the other day, we hit an assertion when scrub started.

[12319.230157] LustreError: 54554:0:(osd_internal.h:752:osd_fid2oi()) ASSERTION( !fid_is_idif(fid) ) failed: [0x100000000:0x1:0x0]
[12319.242502] LustreError: 54554:0:(osd_internal.h:752:osd_fid2oi()) LBUG
[12319.249538] Pid: 54554, comm: OI_scrub
[12319.253707]
[12319.253707] Call Trace:
[12319.258529] [<ffffffffa03dd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[12319.265837] [<ffffffffa03dde97>] lbug_with_loc+0x47/0xb0 [libcfs]
[12319.272395] [<ffffffffa0be40f5>] __osd_oi_lookup+0x3a5/0x3b0 [osd_ldiskfs]
[12319.279770] [<ffffffff8119dfcd>] ? generic_drop_inode+0x1d/0x80
[12319.286133] [<ffffffffa0be4174>] osd_oi_lookup+0x74/0x140 [osd_ldiskfs]
[12319.293197] [<ffffffffa0bf8fbf>] osd_scrub_exec+0x1af/0xf30 [osd_ldiskfs]
[12319.300553] [<ffffffffa0bfa5f2>] ? osd_scrub_next+0x142/0x4b0 [osd_ldiskfs]
[12319.308061] [<ffffffffa0b71432>] ? ldiskfs_read_inode_bitmap+0x172/0x2c0 [ldiskfs]
[12319.316454] [<ffffffffa0bf4d4f>] osd_inode_iteration+0x1cf/0x570 [osd_ldiskfs]
[12319.324461] [<ffffffff810516b9>] ? __wake_up_common+0x59/0x90
[12319.330764] [<ffffffffa0bf8e10>] ? osd_scrub_exec+0x0/0xf30 [osd_ldiskfs]
[12319.337941] [<ffffffffa0bfa4b0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs]
[12319.345300] [<ffffffffa0bf732a>] osd_scrub_main+0x59a/0xd00 [osd_ldiskfs]
[12319.352591] [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[12319.358585] [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.365881] [<ffffffff8100c0ca>] child_rip+0xa/0x20
[12319.371174] [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.378509] [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.385799] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

We had panic_on_lbug off, so we don't have a crash dump. But the system is still running, so if there's anything useful we can try to grab it.

I tried to cat /proc/fs/lustre/osd-ldiskfs/atlastds-OST00f3/oi_scrub but it just hangs. That 'cat' process is stuck on the following:

# cat /proc/83715/stack
[<ffffffff81281f34>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0bf630d>] osd_scrub_dump+0x3d/0x320 [osd_ldiskfs]
[<ffffffffa0be6055>] lprocfs_osd_rd_oi_scrub+0x75/0xb0 [osd_ldiskfs]
[<ffffffffa054f563>] lprocfs_fops_read+0xf3/0x1f0 [obdclass]
[<ffffffff811e9fee>] proc_reg_read+0x7e/0xc0
[<ffffffff81181f05>] vfs_read+0xb5/0x1a0
[<ffffffff81182041>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

The FID it's complaining about, [0x100000000:0x1:0x0], looks suspect. The sequence is FID_SEQ_IDIF and the ObjID is 1. I know on ext4 inode 1 stores the bad-blocks information, but I don't think that's what we're seeing here.

We haven't yet tried to re-mount to see if the issue is persistent, since there may be something on the running system that you want us to provide. But we can do that if it's helpful. |
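For context, a minimal standalone sketch (not the actual Lustre source) of why this particular FID trips the assertion: assuming the upstream FID layout where the IDIF sequence range is [0x100000000, 0x1ffffffff], a sequence of exactly 0x100000000 is classified as IDIF, so a check equivalent to fid_is_idif() returns true and the LASSERT in osd_fid2oi() fires. The struct and constants below approximate lu_fid and FID_SEQ_IDIF/FID_SEQ_IDIF_MAX for illustration only.

```c
/* Hypothetical illustration of the IDIF range check; not Lustre code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct lu_fid {
	uint64_t f_seq;  /* sequence number */
	uint32_t f_oid;  /* object id within the sequence */
	uint32_t f_ver;  /* version */
};

/* Assumed IDIF sequence range per the upstream FID layout. */
#define FID_SEQ_IDIF     0x100000000ULL
#define FID_SEQ_IDIF_MAX 0x1ffffffffULL

/* Approximation of fid_is_idif(): true if the sequence lies in the IDIF range. */
static bool fid_is_idif(const struct lu_fid *fid)
{
	return fid->f_seq >= FID_SEQ_IDIF && fid->f_seq <= FID_SEQ_IDIF_MAX;
}

int main(void)
{
	/* The FID reported in the LBUG message above. */
	struct lu_fid fid = { .f_seq = 0x100000000ULL, .f_oid = 0x1, .f_ver = 0x0 };

	/* Prints is_idif=1, i.e. ASSERTION( !fid_is_idif(fid) ) would fail. */
	printf("[0x%llx:0x%x:0x%x] is_idif=%d\n",
	       (unsigned long long)fid.f_seq, fid.f_oid, fid.f_ver,
	       fid_is_idif(&fid));
	return 0;
}
```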
| Comments |
| Comment by Oleg Drokin [ 28/Mar/14 ] |
|
Is this a 2.4-formatted filesystem, or was it created in the past with some other version and then upgraded? |
| Comment by Matt Ezell [ 28/Mar/14 ] |
|
Sorry, I should have included that in the original report. It was formatted with 2.4, so there shouldn't be any IGIF/IDIF files. |
| Comment by Peter Jones [ 28/Mar/14 ] |
|
Fan Yong, could you please advise on this one? Thanks, Peter |
| Comment by nasf (Inactive) [ 31/Mar/14 ] |
|
Since LMA is enabled on OST-objects, for Lustre 2.4.3 we need to back-port the patch http://review.whamcloud.com/#/c/6669/ |
| Comment by Peter Jones [ 31/Mar/14 ] |
|
Matt, this fix is included in all 2.5.x releases. It would be possible to back-port it to 2.4.x, but there would be quite a few dependencies to pick up, so how we proceed will depend on the timeline for your move to 2.5.x. Regards, Peter |
| Comment by Matt Ezell [ 31/Mar/14 ] |
|
We will need to take several test shots between now and putting 2.5 into production, but I think it's reasonable for us to target that (and avoid the work of back-porting). We haven't seen this in production yet. I guess my only question is: will a reboot or re-mount clear the issue in the meantime? |
| Comment by nasf (Inactive) [ 01/Apr/14 ] |
|
A reboot or re-mount alone, without the 6669 patch applied, can NOT resolve the issue; even though it may work for a while, you will still hit it again some time later. |
| Comment by nasf (Inactive) [ 11/Apr/14 ] |
|
Matt, how is your progress? What would you like us to do as the next step? |
| Comment by Lai Siyao [ 23/Apr/14 ] |
|
The back-port patch for 2.4.x is at http://review.whamcloud.com/#/c/10061/ |
| Comment by James Nunez (Inactive) [ 06/Aug/14 ] |
|
Matt, since the fix is in b2_5 and later releases and there is a patch for b2_4, should we close this ticket, or is there something else you need from us? Thanks |
| Comment by Matt Ezell [ 07/Aug/14 ] |
|
Yes, we can close it. Thanks. |