Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: Lustre 2.10.5
Description
The server keeps crashing with the following error:
[ 981.957669] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
[ 981.989579] Lustre: Skipped 11 previous similar messages
[ 1045.404615] ------------[ cut here ]------------
[ 1045.418484] kernel BUG at /tmp/rpmbuild-lustre-jlan-ItUrr9b3/BUILD/lustre-2.10.5/ldiskfs/ldiskfs.h:1907!
[ 1045.446989] invalid opcode: 0000 [#1] SMP
[ 1045.459302] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) dm_service_time ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) lpfc ib_iser(OE) libiscsi scsi_transport_iscsi crct10dif_generic scsi_transport_fc scsi_tgt rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) bonding ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) sunrpc dm_mirror dm_region_hash dm_log mlx5_ib(OE) ib_core(OE) intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel i2c_algo_bit ttm dm_multipath aesni_intel drm_kms_helper lrw syscopyarea gf128mul sysfillrect sysimgblt glue_helper fb_sys_fops ablk_helper mlx5_core(OE) mlxfw(OE) tg3 ses cryptd mlx_compat(OE) drm ptp ipmi_si enclosure mei_me i2c_core pps_core hpwdt hpilo ipmi_devintf lpc_ich dm_mod mfd_core mei shpchp pcspkr wmi ipmi_msghandler acpi_power_meter binfmt_misc tcp_bic ip_tables virtio_scsi virtio_ring virtio xfs libcrc32c ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_common sg usb_storage smartpqi(E) crc32c_intel scsi_transport_sas [last unloaded: pps_core]
[ 1045.776428] CPU: 5 PID: 11348 Comm: lfsck Tainted: G OE ------------ 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1
[ 1045.811992] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/15/2018
[ 1045.837624] task: ffff882ddca23f40 ti: ffff882bd280c000 task.ti: ffff882bd280c000
[ 1045.860117] RIP: 0010:[<ffffffffa10fbd04>] [<ffffffffa10fbd04>] ldiskfs_rec_len_to_disk.part.9+0x4/0x10 [ldiskfs]
[ 1045.891259] RSP: 0018:ffff882bd280f980 EFLAGS: 00010207
[ 1045.907218] RAX: 0000000000000000 RBX: ffff882bd280fb58 RCX: ffff882bd280f994
[ 1045.928666] RDX: 00000000ffffffac RSI: ffffffffffffff81 RDI: 00000000ffffff81
[ 1045.950113] RBP: ffff882bd280f980 R08: 00000000ffffff81 R09: ffffffffa10fded0
[ 1045.971560] R10: ffff88303f803b00 R11: 0000000000ffffff R12: 000000000000003c
[ 1045.993006] R13: ffff881e2eae7708 R14: ffff881e2eae7690 R15: 0000000000000000
[ 1046.014452] FS: 0000000000000000(0000) GS:ffff882f7ef40000(0000) knlGS:0000000000000000
[ 1046.038775] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1046.056039] CR2: 00007ffff20df034 CR3: 0000002ef4268000 CR4: 00000000003607e0
[ 1046.077485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1046.098932] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1046.120378] Call Trace:
[ 1046.127717] [<ffffffffa10fe245>] htree_inlinedir_to_tree+0x445/0x450 [ldiskfs]
[ 1046.149690] [<ffffffff8123002e>] ? __generic_file_splice_read+0x4ee/0x5e0
[ 1046.170356] [<ffffffff81234cdd>] ? __getblk+0x2d/0x2e0
[ 1046.186052] [<ffffffff81234c4c>] ? __find_get_block+0xbc/0x120
[ 1046.203841] [<ffffffff81234cdd>] ? __getblk+0x2d/0x2e0
[ 1046.219541] [<ffffffffa10cdfa0>] ? __ldiskfs_get_inode_loc+0x110/0x3e0 [ldiskfs]
[ 1046.242039] [<ffffffffa10c89ef>] ? ldiskfs_xattr_find_entry+0x9f/0x130 [ldiskfs]
[ 1046.264536] [<ffffffffa10c0277>] ldiskfs_htree_fill_tree+0x137/0x2f0 [ldiskfs]
[ 1046.286507] [<ffffffff811df826>] ? kmem_cache_alloc_trace+0x1d6/0x200
[ 1046.306126] [<ffffffffa10ae5ec>] ldiskfs_readdir+0x61c/0x850 [ldiskfs]
[ 1046.326012] [<ffffffffa1147640>] ? osd_declare_ref_del+0x130/0x130 [osd_ldiskfs]
[ 1046.348507] [<ffffffff812256b2>] ? generic_getxattr+0x52/0x70
[ 1046.366036] [<ffffffffa1145cde>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[ 1046.387747] [<ffffffffa1145eb7>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[ 1046.408158] [<ffffffffa122808c>] lfsck_open_dir+0x11c/0x3a0 [lfsck]
[ 1046.427257] [<ffffffffa1228cb2>] lfsck_master_oit_engine+0x9a2/0x1190 [lfsck]
[ 1046.448969] [<ffffffff816946f7>] ? __schedule+0x477/0xa30
[ 1046.465453] [<ffffffffa1229d96>] lfsck_master_engine+0x8f6/0x1360 [lfsck]
[ 1046.486120] [<ffffffff810c4d40>] ? wake_up_state+0x20/0x20
[ 1046.502865] [<ffffffffa12294a0>] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
[ 1046.525360] [<ffffffff810b1131>] kthread+0xd1/0xe0
[ 1046.540011] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[ 1046.558323] [<ffffffff816a14dd>] ret_from_fork+0x5d/0xb0
[ 1046.574540] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[ 1046.592852] Code: 44 04 02 48 8d 44 03 c8 48 01 c7 e8 b7 f6 22 e0 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b 0f 1f 40 00 55 48 89 e5 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 85 f6 48
[ 1046.650192] RIP [<ffffffffa10fbd04>] ldiskfs_rec_len_to_disk.part.9+0x4/0x10 [ldiskfs]
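For readers unfamiliar with this BUG: the RIP points into ldiskfs_rec_len_to_disk, which converts a directory record length to its on-disk form. The sketch below is only an illustration of the kind of sanity check that function performs. It assumes ldiskfs carries the same condition as upstream ext4's ext4_rec_len_to_disk in kernels of this era, which has not been verified against the 2.10.5 tree; the Python names are hypothetical.

```python
# Assumption: mirrors the check in upstream ext4's ext4_rec_len_to_disk();
# not verified against the ldiskfs.h shipped with Lustre 2.10.5.
MAX_BLOCKSIZE = 1 << 18  # largest block size the check accepts


def rec_len_is_valid(length, blocksize):
    """Return True when the record length would pass the sanity check.

    The kernel calls BUG() when any of these holds:
      - the record is longer than the block,
      - the block size exceeds 256 KiB,
      - the length is not a multiple of 4.
    """
    return not (
        length > blocksize
        or blocksize > MAX_BLOCKSIZE
        or (length & 3) != 0
    )


if __name__ == "__main__":
    for length in (12, 13, 8192):
        print(length, rec_len_is_valid(length, 4096))
```

If that assumption holds, a corrupted inline directory that yields an out-of-range or unaligned length would trip the BUG while lfsck iterates the directory, which is consistent with the call trace above.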
I was able to find all the inodes with a bad LMA and delete them via ldiskfs. What remains are files that trigger OI scrub and report "?" for size, uid, etc. The user has been able to recover all the affected files, so we just need a way to delete them.
If we delete the files via ldiskfs, how can we make sure that the objects will be cleaned up?
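To build the list of objects still triggering OI scrub, the FIDs can be pulled from the console log. This is a minimal sketch that assumes the message format shown in the log above; the function name and regular expression are my own, not part of any Lustre tool.

```python
import re

# Matches console messages of the form seen in the log above:
#   Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x...:0x...:0x...] ...
SCRUB_RE = re.compile(
    r"Lustre: (?P<target>\S+): trigger OI scrub by RPC for the "
    r"\[(?P<seq>0x[0-9a-fA-F]+):(?P<oid>0x[0-9a-fA-F]+):(?P<ver>0x[0-9a-fA-F]+)\]"
)


def scrub_fids(lines):
    """Return unique (target, fid) pairs in first-seen order."""
    seen = []
    for line in lines:
        m = SCRUB_RE.search(line)
        if m is None:
            continue
        fid = "[{}:{}:{}]".format(m.group("seq"), m.group("oid"), m.group("ver"))
        pair = (m.group("target"), fid)
        if pair not in seen:
            seen.append(pair)
    return seen


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python scrub_fids.py
    for target, fid in scrub_fids(sys.stdin):
        print(target, fid)
```

Because the kernel rate-limits these messages ("Skipped 11 previous similar messages"), the resulting list is likely incomplete and should be treated as a starting point only.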