[LU-9047] sanity-lfsck test_31a: test failed to respond and timed out Created: 25/Jan/17 Updated: 05/Aug/20 Resolved: 05/Aug/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
MDSCOUNT=4 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/819e44c6-e309-11e6-bac2-5254006e85c2. The sub-test test_31a failed with the following error: test failed to respond and timed out This looks a bit like Panic seen on MDS: 08:29:09:[14599.772616] ------------[ cut here ]------------ 08:29:09:[14599.775103] WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0() 08:29:09:[14599.777257] list_del corruption. next->prev should be ffff880039623660, but was (null) 08:29:09: 08:29:09:[14599.779597] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt nfsd ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common cirrus 8139too crc32c_intel drm_kms_helper serio_raw syscopyarea sysfillrect sysimgblt virtio_pci virtio_ring virtio fb_sys_fops ttm 8139cp mii drm i2c_core ata_piix libata floppy 08:29:09:[14599.802619] CPU: 1 PID: 10612 Comm: lfsck_namespace Tainted: G OE ------------ 3.10.0-514.2.2.el7_lustre.x86_64 #1 08:29:09:[14599.806954] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 08:29:09:[14599.809170] ffff88001f3ef678 000000009fda725f ffff88001f3ef630 ffffffff81686318 08:29:09:[14599.811564] ffff88001f3ef668 ffffffff81085940 ffff880039623660 ffff880039623660 08:29:09:[14599.813971] ffff88001ff2b000 00000000000fa540 0000000000000000 ffff88001f3ef6d0 08:29:09:[14599.816367] Call Trace: 08:29:09:[14599.818307] [<ffffffff81686318>] dump_stack+0x19/0x1b 08:29:09:[14599.820443] [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0 08:29:09:[14599.822596] [<ffffffff810859dc>] warn_slowpath_fmt+0x5c/0x80 08:29:09:[14599.824686] [<ffffffff813332e2>] __list_del_entry+0x82/0xd0 08:29:09:[14599.826767] [<ffffffff8133333d>] list_del+0xd/0x30 08:29:09:[14599.828797] [<ffffffffa07c66d4>] lnet_me_unlink+0x14/0xc0 [lnet] 08:29:09:[14599.830872] [<ffffffffa07caa48>] lnet_md_unlink+0x308/0x3f0 [lnet] 08:29:09:[14599.832960] [<ffffffffa07cb195>] lnet_try_match_md+0x1e5/0x330 [lnet] 08:29:09:[14599.835047] [<ffffffff811a2079>] ? zone_statistics+0x89/0xa0 08:29:09:[14599.837060] [<ffffffffa07cb99c>] lnet_mt_match_md+0x8c/0x1b0 [lnet] 08:29:09:[14599.839109] [<ffffffffa07cbbcd>] lnet_ptl_match_md+0x10d/0x820 [lnet] 08:29:09:[14599.841184] [<ffffffff8123312c>] ? __find_get_block+0xbc/0x120 08:29:09:[14599.843210] [<ffffffffa07d535a>] lnet_parse_local+0x51a/0xd30 [lnet] 08:29:09:[14599.845322] [<ffffffffa06d0c96>] ? ldiskfs_getblk+0xa6/0x200 [ldiskfs] 08:29:09:[14599.847426] [<ffffffffa07d61da>] lnet_parse+0x66a/0xe60 [lnet] 08:29:09:[14599.849483] [<ffffffff81322803>] ? number.isra.2+0x323/0x360 08:29:09:[14599.851530] [<ffffffffa07d76cb>] lolnd_send+0x2b/0xa0 [lnet] 08:29:09:[14599.853564] [<ffffffffa07cf5ef>] lnet_ni_send+0x3f/0xe0 [lnet] 08:29:09:[14599.855572] [<ffffffffa07d3ad8>] lnet_send+0x978/0xc90 [lnet] 08:29:09:[14599.857544] [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200 08:29:09:[14599.859540] [<ffffffffa07d4035>] LNetPut+0x245/0x7a0 [lnet] 08:29:09:[14599.861456] [<ffffffffa0a9ce33>] ptl_send_buf+0x183/0x500 [ptlrpc] 08:29:09:[14599.863368] [<ffffffff810ea9ba>] ? __getnstimeofday64+0x3a/0xd0 08:29:09:[14599.865234] [<ffffffffa0a9f531>] ptl_send_rpc+0x611/0xda0 [ptlrpc] 08:29:09:[14599.867137] [<ffffffffa0a942b0>] ptlrpc_send_new_req+0x460/0xa60 [ptlrpc] 08:29:09:[14599.869094] [<ffffffffa0a98f11>] ptlrpc_set_wait+0x3d1/0x900 [ptlrpc] 08:29:09:[14599.871037] [<ffffffffa0d922f0>] ? lfsck_create_lpf_local+0xca0/0xca0 [lfsck] 08:29:09:[14599.873018] [<ffffffffa0aa48d5>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc] 08:29:09:[14599.874967] [<ffffffffa0a94911>] ? ptlrpc_set_add_req+0x61/0xc0 [ptlrpc] 08:29:09:[14599.876868] [<ffffffffa0d9b2bb>] ? lfsck_async_request+0x16b/0x240 [lfsck] 08:29:09:[14599.878770] [<ffffffffa0da1ba4>] lfsck_assistant_notify_others+0x7d4/0x12c0 [lfsck] 08:29:09:[14599.880723] [<ffffffffa0da6626>] lfsck_assistant_engine+0x866/0x20c0 [lfsck] 08:29:09:[14599.882616] [<ffffffff810ce47c>] ? dequeue_entity+0x11c/0x5d0 08:29:09:[14599.884407] [<ffffffff8168b3e0>] ? __schedule+0x3b0/0x990 08:29:09:[14599.886088] [<ffffffff810c4fe0>] ? wake_up_state+0x20/0x20 08:29:09:[14599.887758] [<ffffffffa0da5dc0>] ? lfsck_master_engine+0x1330/0x1330 [lfsck] 08:29:09:[14599.889508] [<ffffffff810b064f>] kthread+0xcf/0xe0 08:29:09:[14599.891031] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140 08:29:09:[14599.892664] [<ffffffff81696898>] ret_from_fork+0x58/0x90 08:29:09:[14599.894168] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140 08:29:09:[14599.895732] ---[ end trace 4894ecda8f001c8c ]--- 08:29:09:[14599.898961] general protection fault: 0000 [#1] SMP 08:29:09:[14599.899008] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt nfsd ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common cirrus 8139too crc32c_intel drm_kms_helper serio_raw syscopyarea sysfillrect sysimgblt virtio_pci virtio_ring virtio fb_sys_fops ttm 8139cp mii drm i2c_core ata_piix libata floppy 08:29:09:[14599.903021] CPU: 0 PID: 7126 Comm: socknal_sd00_00 Tainted: G W OE ------------ 3.10.0-514.2.2.el7_lustre.x86_64 #1 08:29:09:[14599.903021] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 08:29:09:[14599.903021] task: ffff88007751af10 ti: ffff88007a020000 task.ti: ffff88007a020000 08:29:09:[14599.903021] RIP: 0010:[<ffffffffa07cb987>] [<ffffffffa07cb987>] lnet_mt_match_md+0x77/0x1b0 [lnet] 08:29:09:[14599.903021] RSP: 0018:ffff88007a023c00 EFLAGS: 00010206 08:29:09:[14599.903021] RAX: ffff880039623f60 RBX: ffff880077f21000 RCX: 0000000000000018 08:29:09:[14599.903021] RDX: 0000000000000000 RSI: ffff880077a84e40 RDI: 5a5a5a5a5a5a5a5a 08:29:09:[14599.903021] RBP: ffff88007a023c38 R08: 0000000000000005 R09: 0000282416e40000 08:29:09:[14599.903021] R10: 0001d7dbf32505b9 R11: 0000000000000000 R12: ffff88007a023cf0 08:29:09:[14599.903021] R13: 0000000000000000 R14: ffff88003f48ae00 R15: 0000000000000008 08:29:09:[14599.903021] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 08:29:09:[14599.903021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 08:29:09:[14599.903021] CR2: 00007fa69ade5050 CR3: 0000000079bb2000 CR4: 00000000000406f0 08:29:09:[14599.903021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 08:29:09:[14599.903021] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 08:29:09:[14599.903021] Stack: 08:29:09:[14599.903021] 0000014000000000 ffff880077a84e40 ffff88003f48ae00 ffff880077a84e40 08:29:09:[14599.903021] ffff88007963b5a0 ffff88007a023cf0 ffff88007963b5a0 ffff88007a023ca8 08:29:09:[14599.903021] ffffffffa07cbbcd 0000000000000000 0000000000000000 0000000000000000 08:29:09:[14599.903021] Call Trace: 08:29:09:[14599.903021] [<ffffffffa07cbbcd>] lnet_ptl_match_md+0x10d/0x820 [lnet] 08:29:09:[14599.903021] [<ffffffffa07d535a>] lnet_parse_local+0x51a/0xd30 [lnet] 08:29:09:[14599.903021] [<ffffffffa07d61da>] lnet_parse+0x66a/0xe60 [lnet] 08:29:09:[14599.903021] [<ffffffffa0a0bda0>] ksocknal_process_receive+0x480/0xda0 [ksocklnd] 08:29:09:[14599.903021] [<ffffffffa0a0ca4e>] ksocknal_scheduler+0xee/0x670 [ksocklnd] 08:29:09:[14599.903021] [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30 08:29:09:[14599.903021] [<ffffffffa0a0c960>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd] 08:29:09:[14599.903021] [<ffffffff810b064f>] kthread+0xcf/0xe0 08:29:09:[14599.903021] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140 08:29:09:[14599.903021] [<ffffffff81696898>] ret_from_fork+0x58/0x90 08:29:09:[14599.903021] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140 08:29:09:[14599.903021] Code: 48 8b 14 ca 44 8b 42 08 47 8d 3c 00 41 83 e7 08 48 39 d8 75 0d eb 41 0f 1f 44 00 00 4c 89 e8 49 89 d5 48 8b 78 58 48 85 ff 74 24 <48> 3b 47 28 0f 85 fc 00 00 00 4c 89 f2 4c 89 e6 e8 14 f6 ff ff 08:29:09:[14599.903021] RIP [<ffffffffa07cb987>] lnet_mt_match_md+0x77/0x1b0 [lnet] 08:29:09:[14599.903021] RSP <ffff88007a023c00> Info required for matching: sanity-lfsck 31a |
| Comments |
| Comment by Jian Yu [ 26/Jan/17 ] |
|
The failure also occurred while running sanity-hsm test 500 in review-dne-part-2 test session on master branch: Lustre: DEBUG MARKER: == sanity-hsm test 500: various LLAPI HSM tests ====================================================== 21:23:36 (1485379416) Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffffa07c39ae>] lnet_mt_match_md+0x9e/0x1b0 [lnet] PGD 1f7ac067 PUD 7b86e067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic nfsd ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ppdev pcspkr i2c_piix4 virtio_balloon parport_pc parport ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt crc32c_intel fb_sys_fops 8139too ttm serio_raw virtio_pci virtio_ring virtio 8139cp mii drm ata_piix i2c_core libata floppy CPU: 1 PID: 7130 Comm: socknal_sd00_00 Tainted: G W OE ------------ 3.10.0-514.2.2.el7_lustre.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 task: ffff880078b64e70 ti: ffff880079454000 task.ti: ffff880079454000 RIP: 0010:[<ffffffffa07c39ae>] [<ffffffffa07c39ae>] lnet_mt_match_md+0x9e/0x1b0 [lnet] RSP: 0018:ffff880079457c00 EFLAGS: 00010207 RAX: ffff88003b94acc0 RBX: ffff880077d2b000 RCX: 0000000000000018 RDX: ffff88007903b0c0 RSI: ffff8800799bd5c0 RDI: 0000000000000000 RBP: ffff880079457c38 R08: 0000000000000005 R09: 0000282410d40000 R10: 0001d7dbf9350435 R11: 0000000000000000 R12: ffff880079457cf0 R13: 0000000000000000 R14: ffff8800786f5200 R15: 0000000000000008 FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000428b0000 CR4: 00000000000406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: 000001207b452888 ffff8800799bd5c0 ffff8800786f5200 ffff8800799bd5c0 ffff88007903b0c0 ffff880079457cf0 ffff88007903b0c0 ffff880079457ca8 ffffffffa07c3bcd 0000000000000000 0000000000000000 0000000000000000 Call Trace: [<ffffffffa07c3bcd>] lnet_ptl_match_md+0x10d/0x820 [lnet] [<ffffffffa07cd35a>] lnet_parse_local+0x51a/0xd30 [lnet] [<ffffffffa07ce1da>] lnet_parse+0x66a/0xe60 [lnet] [<ffffffffa0a03da0>] ksocknal_process_receive+0x480/0xda0 [ksocklnd] [<ffffffffa0a04a4e>] ksocknal_scheduler+0xee/0x670 [ksocklnd] [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30 [<ffffffffa0a04960>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd] [<ffffffff810b064f>] kthread+0xcf/0xe0 [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140 [<ffffffff81696898>] ret_from_fork+0x58/0x90 [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140 Code: 85 ff 74 24 48 3b 47 28 0f 85 fc 00 00 00 4c 89 f2 4c 89 e6 e8 14 f6 ff ff b9 00 00 00 00 a8 08 44 0f 44 f9 a8 06 75 55 49 39 dd <49> 8b 55 00 75 c4 41 83 ff 08 74 58 45 85 ff 0f 85 7c 00 00 00 RIP [<ffffffffa07c39ae>] lnet_mt_match_md+0x9e/0x1b0 [lnet] RSP <ffff880079457c00> Maloo reports: |
| Comment by Bob Glossman (Inactive) [ 26/Jan/17 ] |
|
another on master: |
| Comment by Jian Yu [ 29/Jan/17 ] |
|
another on master: |
| Comment by Andreas Dilger [ 05/Aug/20 ] |
|
Closing old issue that has not been seen in a long time. |