Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.4.3
- 13287
Description
We have our TDS system set up in wide-stripe mode, with each OSS mounting over 100 OSTs. On mount the other day, we hit an assertion when scrub started.
[12319.230157] LustreError: 54554:0:(osd_internal.h:752:osd_fid2oi()) ASSERTION( !fid_is_idif(fid) ) failed: [0x100000000:0x1:0x0]
[12319.242502] LustreError: 54554:0:(osd_internal.h:752:osd_fid2oi()) LBUG
[12319.249538] Pid: 54554, comm: OI_scrub
[12319.253707]
[12319.253707] Call Trace:
[12319.258529] [<ffffffffa03dd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[12319.265837] [<ffffffffa03dde97>] lbug_with_loc+0x47/0xb0 [libcfs]
[12319.272395] [<ffffffffa0be40f5>] __osd_oi_lookup+0x3a5/0x3b0 [osd_ldiskfs]
[12319.279770] [<ffffffff8119dfcd>] ? generic_drop_inode+0x1d/0x80
[12319.286133] [<ffffffffa0be4174>] osd_oi_lookup+0x74/0x140 [osd_ldiskfs]
[12319.293197] [<ffffffffa0bf8fbf>] osd_scrub_exec+0x1af/0xf30 [osd_ldiskfs]
[12319.300553] [<ffffffffa0bfa5f2>] ? osd_scrub_next+0x142/0x4b0 [osd_ldiskfs]
[12319.308061] [<ffffffffa0b71432>] ? ldiskfs_read_inode_bitmap+0x172/0x2c0 [ldiskfs]
[12319.316454] [<ffffffffa0bf4d4f>] osd_inode_iteration+0x1cf/0x570 [osd_ldiskfs]
[12319.324461] [<ffffffff810516b9>] ? __wake_up_common+0x59/0x90
[12319.330764] [<ffffffffa0bf8e10>] ? osd_scrub_exec+0x0/0xf30 [osd_ldiskfs]
[12319.337941] [<ffffffffa0bfa4b0>] ? osd_scrub_next+0x0/0x4b0 [osd_ldiskfs]
[12319.345300] [<ffffffffa0bf732a>] osd_scrub_main+0x59a/0xd00 [osd_ldiskfs]
[12319.352591] [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[12319.358585] [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.365881] [<ffffffff8100c0ca>] child_rip+0xa/0x20
[12319.371174] [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.378509] [<ffffffffa0bf6d90>] ? osd_scrub_main+0x0/0xd00 [osd_ldiskfs]
[12319.385799] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
We had panic_on_lbug off, so we don't have a crash dump. The system is still running, though, so if there's anything useful on it, we can try to grab it. I tried to cat /proc/fs/lustre/osd-ldiskfs/atlastds-OST00f3/oi_scrub, but it just hangs. That 'cat' process is stuck on the following:
# cat /proc/83715/stack
[<ffffffff81281f34>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0bf630d>] osd_scrub_dump+0x3d/0x320 [osd_ldiskfs]
[<ffffffffa0be6055>] lprocfs_osd_rd_oi_scrub+0x75/0xb0 [osd_ldiskfs]
[<ffffffffa054f563>] lprocfs_fops_read+0xf3/0x1f0 [obdclass]
[<ffffffff811e9fee>] proc_reg_read+0x7e/0xc0
[<ffffffff81181f05>] vfs_read+0xb5/0x1a0
[<ffffffff81182041>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
The FID it's complaining about, [0x100000000:0x1:0x0], looks suspect. The sequence is FID_SEQ_IDIF and the ObjID is 1. I know that on ext4 inode 1 stores the bad blocks information, but I don't think that's what we're seeing here.
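For reference, here is a minimal sketch of how the FID from the assertion message breaks down. It is not the Lustre implementation: the constants and the bit layout (OST index in bits 16-31 of the sequence, upper object ID bits in the low 16 bits of the sequence) are written from memory of the Lustre headers and should be treated as assumptions, and the helper names parse_fid and decode_idif are made up for illustration.

```python
# Assumed IDIF sequence range; check against the Lustre headers before relying on it.
FID_SEQ_IDIF = 0x100000000
FID_SEQ_IDIF_MAX = 0x1FFFFFFFF


def parse_fid(text):
    """Parse a '[seq:oid:ver]' string (hex fields) into three integers."""
    seq, oid, ver = (int(part, 16) for part in text.strip("[]").split(":"))
    return seq, oid, ver


def fid_is_idif(seq):
    """Mirror of the check named in the assertion: is seq in the IDIF range?"""
    return FID_SEQ_IDIF <= seq <= FID_SEQ_IDIF_MAX


def decode_idif(seq, oid, ver):
    """Split an IDIF FID into (OST index, object ID) under the assumed layout."""
    ost_idx = (seq >> 16) & 0xFFFF
    objid = (ver << 48) | ((seq & 0xFFFF) << 32) | oid
    return ost_idx, objid


if __name__ == "__main__":
    seq, oid, ver = parse_fid("[0x100000000:0x1:0x0]")
    print("is IDIF:", fid_is_idif(seq))
    print("(ost_idx, objid):", decode_idif(seq, oid, ver))
```

Under these assumptions the FID falls in the IDIF range and decodes to OST index 0 with object ID 1, which matches the reading above that the sequence is FID_SEQ_IDIF and the ObjID is 1.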
We haven't yet tried to re-mount to see if the issue is persistent, since there may be something on the running system that you want us to provide. But we can do that if it's helpful.
Issue Links
- is related to LU-3335 LFSCK II: MDT-OST OST local consistency checking (Resolved)