The debug patch triggered another error with metadata_csum:
https://testing.whamcloud.com/test_sessions/2985fd8c-c1cb-4c8a-887b-bbb051056eb3
and it crashed this time:
[ 3755.461295] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3755.793142] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3756.134861] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3756.474540] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3756.820665] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3757.166268] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3757.504059] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
[ 3757.826509] Lustre: DEBUG MARKER: /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -A -C -t namespace
[ 3758.000591] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm OI_scrub: checksum invalid
[ 3758.002430] Aborting journal on device dm-3-8.
[ 3758.003354] LDISKFS-fs (dm-3): Remounting filesystem read-only
[ 3758.003511] LustreError: 7527:0:(osd_handler.c:1727:osd_trans_commit_cb()) transaction @0xffff8e8c9c06c200 commit error: 2
[ 3758.015734] LustreError: 18534:0:(scrub.c:236:scrub_file_store()) lustre-MDT0000: store scrub file: rc = -30
[ 3758.175022] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace |
[ 3758.175022] awk '/^status/ { print $2 }'
[ 3758.210580] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm lfsck: checksum invalid
[ 3758.213788] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm lfsck_namespace: checksum invalid
[ 3758.216336] BUG: unable to handle kernel NULL pointer dereference at 000000000000001b
[ 3758.217805] IP: [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
[ 3758.218855] PGD 0
[ 3758.219258] Oops: 0000 [#1] SMP
[ 3758.219944] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel parport_pc lrw gf128mul glue_helper virtio_balloon ablk_helper parport cryptd joydev pcspkr i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too crct10dif_pclmul crct10dif_common ata_piix crc32c_intel serio_raw virtio_pci libata virtio_ring virtio 8139cp mii floppy [last unloaded: llog_test]
[ 3758.236589] CPU: 1 PID: 18535 Comm: lfsck_namespace Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.x86_64 #1
[ 3758.238786] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 3758.239778] task: ffff8e8c9d5be180 ti: ffff8e8c9ac64000 task.ti: ffff8e8c9ac64000
[ 3758.241087] RIP: 0010:[<ffffffffb6e61d48>] [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
[ 3758.242505] RSP: 0018:ffff8e8c9ac679e8 EFLAGS: 00010246
[ 3758.243450] RAX: 0000000000000000 RBX: fffffffffffffffb RCX: 0000000000000000
[ 3758.244686] RDX: 0000000000000002 RSI: ffff8e8ca04cdec0 RDI: fffffffffffffffb
[ 3758.245919] RBP: ffff8e8c9ac67a20 R08: ffff8e8c9accb898 R09: ffff8e8c9d5be180
[ 3758.247150] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e8ca066d1a8
[ 3758.248392] R13: 000000000000055a R14: ffff8e8c9ac67a48 R15: ffff8e8ca066d1a8
[ 3758.249626] FS: 0000000000000000(0000) GS:ffff8e8cbfd00000(0000) knlGS:0000000000000000
[ 3758.251037] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3758.252024] CR2: 000000000000001b CR3: 00000000780a8000 CR4: 00000000000606e0
[ 3758.253294] Call Trace:
[ 3758.253823] [<ffffffffc09a8ec3>] ? ldiskfs_xattr_inode_iget+0x143/0x180 [ldiskfs]
[ 3758.255145] [<ffffffffc09a8f35>] ldiskfs_xattr_inode_get+0x35/0x130 [ldiskfs]
[ 3758.256424] [<ffffffffc09a92a0>] ldiskfs_xattr_ibody_get+0x160/0x1a0 [ldiskfs]
[ 3758.257699] [<ffffffffc09a934d>] ldiskfs_xattr_get+0x6d/0x360 [ldiskfs]
[ 3758.258896] [<ffffffffb6ccfeb4>] ? __wake_up+0x44/0x50
[ 3758.259854] [<ffffffffc09e805a>] ldiskfs_xattr_trusted_get+0x2a/0x30 [ldiskfs]
[ 3758.261127] [<ffffffffb6e6a522>] generic_getxattr+0x52/0x70
[ 3758.262199] [<ffffffffc1075bae>] osd_xattr_get+0x17e/0x7e0 [osd_ldiskfs]
[ 3758.263410] [<ffffffffc097be00>] ? fld_server_lookup+0x20/0x320 [fld]
[ 3758.264562] [<ffffffffb6cddd9e>] ? account_entity_dequeue+0xae/0xd0
[ 3758.265695] [<ffffffffc116aa8f>] __lfsck_links_read+0x7f/0x2d0 [lfsck]
[ 3758.266859] [<ffffffffc117707b>] lfsck_namespace_assistant_handler_p1+0x3fb/0x2090 [lfsck]
[ 3758.268348] [<ffffffffb6c2a59e>] ? __switch_to+0xce/0x580
[ 3758.269326] [<ffffffffb6cd1a00>] ? rq_attach_root+0x90/0x110
[ 3758.270322] [<ffffffffc115a0bf>] lfsck_assistant_engine+0x3cf/0x20c0 [lfsck]
[ 3758.271576] [<ffffffffb6cd7c40>] ? wake_up_state+0x20/0x20
[ 3758.272572] [<ffffffffc1159cf0>] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
[ 3758.273818] [<ffffffffb6cc2e81>] kthread+0xd1/0xe0
[ 3758.274695] [<ffffffffb6cc2db0>] ? insert_kthread_work+0x40/0x40
[ 3758.275787] [<ffffffffb7377c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 3758.276929] [<ffffffffb6cc2db0>] ? insert_kthread_work+0x40/0x40
[ 3758.277994] Code: 00 00 00 66 66 66 66 90 55 b8 fb ff ff ff 48 89 e5 5d c3 66 66 66 66 90 55 b8 fb ff ff ff 48 89 e5 5d c3 66 66 66 66 90 55 31 c0 <48> 81 7f 20 40 e4 42 b7 48 89 e5 5d 0f 94 c0 c3 0f 1f 84 00 00
[ 3758.283591] RIP [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
[ 3758.284617] RSP <ffff8e8c9ac679e8>
[ 3758.285244] CR2: 000000000000001b
Looks like this time it's an xattr inode failed checksum verify, and it crashed as a result:
struct inode *ext4_xattr_inode_iget(struct inode *parent,
unsigned long ea_ino, int *err)
{
struct inode *ea_inode = NULL; ea_inode = ext4_iget(parent->i_sb, ea_ino);
if (IS_ERR(ea_inode) || is_bad_inode(ea_inode)) {
int rc = IS_ERR(ea_inode) ? PTR_ERR(ea_inode) : 0;
ext4_error(parent->i_sb, "error while reading EA inode %lu "
"/ %d %d", ea_ino, rc, is_bad_inode(ea_inode));
*err = rc != 0 ? rc : -EIO;
return NULL;
}
the ea_inode is an error value, the second call to is_bad_inode will crash.
Looks like we need upstream commit bab79b04999ccbbf59f1693d0783cd6ae27e4278 for rhel7 kernels to stop it crashing after the checksum verify failure from ext4_iget...
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/36484
Subject: LU-12827 ldiskfs: fix ldiskfs_xattr_inode_iget error checking
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 11667154eee1ea60c2179fa5ba152a23825cc2a1