[LU-12827] sanity: test 900 read-only filesystem during cleanup: ldiskfs_xattr_block_get() bad block Created: 01/Oct/19 Updated: 18/Oct/19 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Dongyang Li |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2161d6cc-de2e-11e9-add9-52540065bddc Seen in the test log a large number of errors during cleanup: rm: cannot remove '/mnt/lustre/d56w.sanity/dir1/link164': Read-only file system Seen on the MDS console log: LDISKFS-fs error (device dm-3): ldiskfs_xattr_block_get:405: inode #1267: comm mdt00_002: bad block 152837 Aborting journal on device dm-3-8. LustreError: 993:0:(osd_handler.c:1708:osd_trans_commit_cb()) transaction @0xffff8ab195065900 commit error: 2 LDISKFS-fs (dm-3): Remounting filesystem read-only LustreError: 1012:0:(mdt_handler.c:1224:mdt_getattr_internal()) lustre-MDT0000: getattr error for [0x2000013a1:0xd880:0x0]: rc = -5 LustreError: 1042:0:(llog_cat.c:762:llog_cat_cancel_records()) lustre-OST0005-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30 LustreError: 1042:0:(osp_sync.c:1065:osp_sync_process_committed()) lustre-OST0005-osc-MDT0000: can't cancel record: -30 |
| Comments |
| Comment by Andreas Dilger [ 03/Oct/19 ] |
|
This appears to have been related to the patch being tested. |
| Comment by Gerrit Updater [ 08/Oct/19 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/36402 |
| Comment by Dongyang Li [ 08/Oct/19 ] |
|
The debug patch triggered another error with metadata_csum: https://testing.whamcloud.com/test_sessions/2985fd8c-c1cb-4c8a-887b-bbb051056eb3 and it crashed this time:
[ 3755.461295] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3755.793142] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3756.134861] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3756.474540] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3756.820665] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3757.166268] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 3757.504059] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
[ 3757.826509] Lustre: DEBUG MARKER: /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -A -C -t namespace
[ 3758.000591] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm OI_scrub: checksum invalid
[ 3758.002430] Aborting journal on device dm-3-8.
[ 3758.003354] LDISKFS-fs (dm-3): Remounting filesystem read-only
[ 3758.003511] LustreError: 7527:0:(osd_handler.c:1727:osd_trans_commit_cb()) transaction @0xffff8e8c9c06c200 commit error: 2
[ 3758.015734] LustreError: 18534:0:(scrub.c:236:scrub_file_store()) lustre-MDT0000: store scrub file: rc = -30
[ 3758.175022] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace |
[ 3758.175022] awk '/^status/ { print $2 }'
[ 3758.210580] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm lfsck: checksum invalid
[ 3758.213788] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm lfsck_namespace: checksum invalid
[ 3758.216336] BUG: unable to handle kernel NULL pointer dereference at 000000000000001b
[ 3758.217805] IP: [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
[ 3758.218855] PGD 0
[ 3758.219258] Oops: 0000 [#1] SMP
[ 3758.219944] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel parport_pc lrw gf128mul glue_helper virtio_balloon ablk_helper parport cryptd joydev pcspkr i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too crct10dif_pclmul crct10dif_common ata_piix crc32c_intel serio_raw virtio_pci libata virtio_ring virtio 8139cp mii floppy [last unloaded: llog_test]
[ 3758.236589] CPU: 1 PID: 18535 Comm: lfsck_namespace Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.x86_64 #1
[ 3758.238786] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 3758.239778] task: ffff8e8c9d5be180 ti: ffff8e8c9ac64000 task.ti: ffff8e8c9ac64000
[ 3758.241087] RIP: 0010:[<ffffffffb6e61d48>] [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
[ 3758.242505] RSP: 0018:ffff8e8c9ac679e8 EFLAGS: 00010246
[ 3758.243450] RAX: 0000000000000000 RBX: fffffffffffffffb RCX: 0000000000000000
[ 3758.244686] RDX: 0000000000000002 RSI: ffff8e8ca04cdec0 RDI: fffffffffffffffb
[ 3758.245919] RBP: ffff8e8c9ac67a20 R08: ffff8e8c9accb898 R09: ffff8e8c9d5be180
[ 3758.247150] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e8ca066d1a8
[ 3758.248392] R13: 000000000000055a R14: ffff8e8c9ac67a48 R15: ffff8e8ca066d1a8
[ 3758.249626] FS: 0000000000000000(0000) GS:ffff8e8cbfd00000(0000) knlGS:0000000000000000
[ 3758.251037] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3758.252024] CR2: 000000000000001b CR3: 00000000780a8000 CR4: 00000000000606e0
[ 3758.253294] Call Trace:
[ 3758.253823] [<ffffffffc09a8ec3>] ? ldiskfs_xattr_inode_iget+0x143/0x180 [ldiskfs]
[ 3758.255145] [<ffffffffc09a8f35>] ldiskfs_xattr_inode_get+0x35/0x130 [ldiskfs]
[ 3758.256424] [<ffffffffc09a92a0>] ldiskfs_xattr_ibody_get+0x160/0x1a0 [ldiskfs]
[ 3758.257699] [<ffffffffc09a934d>] ldiskfs_xattr_get+0x6d/0x360 [ldiskfs]
[ 3758.258896] [<ffffffffb6ccfeb4>] ? __wake_up+0x44/0x50
[ 3758.259854] [<ffffffffc09e805a>] ldiskfs_xattr_trusted_get+0x2a/0x30 [ldiskfs]
[ 3758.261127] [<ffffffffb6e6a522>] generic_getxattr+0x52/0x70
[ 3758.262199] [<ffffffffc1075bae>] osd_xattr_get+0x17e/0x7e0 [osd_ldiskfs]
[ 3758.263410] [<ffffffffc097be00>] ? fld_server_lookup+0x20/0x320 [fld]
[ 3758.264562] [<ffffffffb6cddd9e>] ? account_entity_dequeue+0xae/0xd0
[ 3758.265695] [<ffffffffc116aa8f>] __lfsck_links_read+0x7f/0x2d0 [lfsck]
[ 3758.266859] [<ffffffffc117707b>] lfsck_namespace_assistant_handler_p1+0x3fb/0x2090 [lfsck]
[ 3758.268348] [<ffffffffb6c2a59e>] ? __switch_to+0xce/0x580
[ 3758.269326] [<ffffffffb6cd1a00>] ? rq_attach_root+0x90/0x110
[ 3758.270322] [<ffffffffc115a0bf>] lfsck_assistant_engine+0x3cf/0x20c0 [lfsck]
[ 3758.271576] [<ffffffffb6cd7c40>] ? wake_up_state+0x20/0x20
[ 3758.272572] [<ffffffffc1159cf0>] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
[ 3758.273818] [<ffffffffb6cc2e81>] kthread+0xd1/0xe0
[ 3758.274695] [<ffffffffb6cc2db0>] ? insert_kthread_work+0x40/0x40
[ 3758.275787] [<ffffffffb7377c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 3758.276929] [<ffffffffb6cc2db0>] ? insert_kthread_work+0x40/0x40
[ 3758.277994] Code: 00 00 00 66 66 66 66 90 55 b8 fb ff ff ff 48 89 e5 5d c3 66 66 66 66 90 55 b8 fb ff ff ff 48 89 e5 5d c3 66 66 66 66 90 55 31 c0 <48> 81 7f 20 40 e4 42 b7 48 89 e5 5d 0f 94 c0 c3 0f 1f 84 00 00
[ 3758.283591] RIP [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
[ 3758.284617] RSP <ffff8e8c9ac679e8>
[ 3758.285244] CR2: 000000000000001b
Looks like this time it's an xattr inode failed checksum verify, and it crashed as a result:
struct inode *ext4_xattr_inode_iget(struct inode *parent,
unsigned long ea_ino, int *err)
{
struct inode *ea_inode = NULL; ea_inode = ext4_iget(parent->i_sb, ea_ino);
if (IS_ERR(ea_inode) || is_bad_inode(ea_inode)) {
int rc = IS_ERR(ea_inode) ? PTR_ERR(ea_inode) : 0;
ext4_error(parent->i_sb, "error while reading EA inode %lu "
"/ %d %d", ea_ino, rc, is_bad_inode(ea_inode));
*err = rc != 0 ? rc : -EIO;
return NULL;
}
the ea_inode is an error value, the second call to is_bad_inode will crash. Looks like we need upstream commit bab79b04999ccbbf59f1693d0783cd6ae27e4278 for rhel7 kernels to stop it crashing after the checksum verify failure from ext4_iget... |
| Comment by Andreas Dilger [ 09/Oct/19 ] |
|
Dongyang, could you please add a backported patch for the problem you found. |
| Comment by Gerrit Updater [ 18/Oct/19 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/36484 |