[LU-9047] sanity-lfsck test_31a: test failed to respond and timed out Created: 25/Jan/17  Updated: 05/Aug/20  Resolved: 05/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

MDSCOUNT=4


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/819e44c6-e309-11e6-bac2-5254006e85c2.

The sub-test test_31a failed with the following error:

test failed to respond and timed out

This looks a bit like LU-8362 (I think) but since that is marked Resolved I'm creating a new ticket. I'll let an expert decide if this is an old issue or a new one.

Panic seen on MDS:

08:29:09:[14599.772616] ------------[ cut here ]------------
08:29:09:[14599.775103] WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
08:29:09:[14599.777257] list_del corruption. next->prev should be ffff880039623660, but was           (null)
08:29:09:
08:29:09:[14599.779597] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt nfsd ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common cirrus 8139too crc32c_intel drm_kms_helper serio_raw syscopyarea sysfillrect sysimgblt virtio_pci virtio_ring virtio fb_sys_fops ttm 8139cp mii drm i2c_core ata_piix libata floppy
08:29:09:[14599.802619] CPU: 1 PID: 10612 Comm: lfsck_namespace Tainted: G           OE  ------------   3.10.0-514.2.2.el7_lustre.x86_64 #1
08:29:09:[14599.806954] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
08:29:09:[14599.809170]  ffff88001f3ef678 000000009fda725f ffff88001f3ef630 ffffffff81686318
08:29:09:[14599.811564]  ffff88001f3ef668 ffffffff81085940 ffff880039623660 ffff880039623660
08:29:09:[14599.813971]  ffff88001ff2b000 00000000000fa540 0000000000000000 ffff88001f3ef6d0
08:29:09:[14599.816367] Call Trace:
08:29:09:[14599.818307]  [<ffffffff81686318>] dump_stack+0x19/0x1b
08:29:09:[14599.820443]  [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
08:29:09:[14599.822596]  [<ffffffff810859dc>] warn_slowpath_fmt+0x5c/0x80
08:29:09:[14599.824686]  [<ffffffff813332e2>] __list_del_entry+0x82/0xd0
08:29:09:[14599.826767]  [<ffffffff8133333d>] list_del+0xd/0x30
08:29:09:[14599.828797]  [<ffffffffa07c66d4>] lnet_me_unlink+0x14/0xc0 [lnet]
08:29:09:[14599.830872]  [<ffffffffa07caa48>] lnet_md_unlink+0x308/0x3f0 [lnet]
08:29:09:[14599.832960]  [<ffffffffa07cb195>] lnet_try_match_md+0x1e5/0x330 [lnet]
08:29:09:[14599.835047]  [<ffffffff811a2079>] ? zone_statistics+0x89/0xa0
08:29:09:[14599.837060]  [<ffffffffa07cb99c>] lnet_mt_match_md+0x8c/0x1b0 [lnet]
08:29:09:[14599.839109]  [<ffffffffa07cbbcd>] lnet_ptl_match_md+0x10d/0x820 [lnet]
08:29:09:[14599.841184]  [<ffffffff8123312c>] ? __find_get_block+0xbc/0x120
08:29:09:[14599.843210]  [<ffffffffa07d535a>] lnet_parse_local+0x51a/0xd30 [lnet]
08:29:09:[14599.845322]  [<ffffffffa06d0c96>] ? ldiskfs_getblk+0xa6/0x200 [ldiskfs]
08:29:09:[14599.847426]  [<ffffffffa07d61da>] lnet_parse+0x66a/0xe60 [lnet]
08:29:09:[14599.849483]  [<ffffffff81322803>] ? number.isra.2+0x323/0x360
08:29:09:[14599.851530]  [<ffffffffa07d76cb>] lolnd_send+0x2b/0xa0 [lnet]
08:29:09:[14599.853564]  [<ffffffffa07cf5ef>] lnet_ni_send+0x3f/0xe0 [lnet]
08:29:09:[14599.855572]  [<ffffffffa07d3ad8>] lnet_send+0x978/0xc90 [lnet]
08:29:09:[14599.857544]  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
08:29:09:[14599.859540]  [<ffffffffa07d4035>] LNetPut+0x245/0x7a0 [lnet]
08:29:09:[14599.861456]  [<ffffffffa0a9ce33>] ptl_send_buf+0x183/0x500 [ptlrpc]
08:29:09:[14599.863368]  [<ffffffff810ea9ba>] ? __getnstimeofday64+0x3a/0xd0
08:29:09:[14599.865234]  [<ffffffffa0a9f531>] ptl_send_rpc+0x611/0xda0 [ptlrpc]
08:29:09:[14599.867137]  [<ffffffffa0a942b0>] ptlrpc_send_new_req+0x460/0xa60 [ptlrpc]
08:29:09:[14599.869094]  [<ffffffffa0a98f11>] ptlrpc_set_wait+0x3d1/0x900 [ptlrpc]
08:29:09:[14599.871037]  [<ffffffffa0d922f0>] ? lfsck_create_lpf_local+0xca0/0xca0 [lfsck]
08:29:09:[14599.873018]  [<ffffffffa0aa48d5>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
08:29:09:[14599.874967]  [<ffffffffa0a94911>] ? ptlrpc_set_add_req+0x61/0xc0 [ptlrpc]
08:29:09:[14599.876868]  [<ffffffffa0d9b2bb>] ? lfsck_async_request+0x16b/0x240 [lfsck]
08:29:09:[14599.878770]  [<ffffffffa0da1ba4>] lfsck_assistant_notify_others+0x7d4/0x12c0 [lfsck]
08:29:09:[14599.880723]  [<ffffffffa0da6626>] lfsck_assistant_engine+0x866/0x20c0 [lfsck]
08:29:09:[14599.882616]  [<ffffffff810ce47c>] ? dequeue_entity+0x11c/0x5d0
08:29:09:[14599.884407]  [<ffffffff8168b3e0>] ? __schedule+0x3b0/0x990
08:29:09:[14599.886088]  [<ffffffff810c4fe0>] ? wake_up_state+0x20/0x20
08:29:09:[14599.887758]  [<ffffffffa0da5dc0>] ? lfsck_master_engine+0x1330/0x1330 [lfsck]
08:29:09:[14599.889508]  [<ffffffff810b064f>] kthread+0xcf/0xe0
08:29:09:[14599.891031]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
08:29:09:[14599.892664]  [<ffffffff81696898>] ret_from_fork+0x58/0x90
08:29:09:[14599.894168]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
08:29:09:[14599.895732] ---[ end trace 4894ecda8f001c8c ]---
08:29:09:[14599.898961] general protection fault: 0000 [#1] SMP 
08:29:09:[14599.899008] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt nfsd ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common cirrus 8139too crc32c_intel drm_kms_helper serio_raw syscopyarea sysfillrect sysimgblt virtio_pci virtio_ring virtio fb_sys_fops ttm 8139cp mii drm i2c_core ata_piix libata floppy
08:29:09:[14599.903021] CPU: 0 PID: 7126 Comm: socknal_sd00_00 Tainted: G        W  OE  ------------   3.10.0-514.2.2.el7_lustre.x86_64 #1
08:29:09:[14599.903021] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
08:29:09:[14599.903021] task: ffff88007751af10 ti: ffff88007a020000 task.ti: ffff88007a020000
08:29:09:[14599.903021] RIP: 0010:[<ffffffffa07cb987>]  [<ffffffffa07cb987>] lnet_mt_match_md+0x77/0x1b0 [lnet]
08:29:09:[14599.903021] RSP: 0018:ffff88007a023c00  EFLAGS: 00010206
08:29:09:[14599.903021] RAX: ffff880039623f60 RBX: ffff880077f21000 RCX: 0000000000000018
08:29:09:[14599.903021] RDX: 0000000000000000 RSI: ffff880077a84e40 RDI: 5a5a5a5a5a5a5a5a
08:29:09:[14599.903021] RBP: ffff88007a023c38 R08: 0000000000000005 R09: 0000282416e40000
08:29:09:[14599.903021] R10: 0001d7dbf32505b9 R11: 0000000000000000 R12: ffff88007a023cf0
08:29:09:[14599.903021] R13: 0000000000000000 R14: ffff88003f48ae00 R15: 0000000000000008
08:29:09:[14599.903021] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
08:29:09:[14599.903021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
08:29:09:[14599.903021] CR2: 00007fa69ade5050 CR3: 0000000079bb2000 CR4: 00000000000406f0
08:29:09:[14599.903021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
08:29:09:[14599.903021] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
08:29:09:[14599.903021] Stack:
08:29:09:[14599.903021]  0000014000000000 ffff880077a84e40 ffff88003f48ae00 ffff880077a84e40
08:29:09:[14599.903021]  ffff88007963b5a0 ffff88007a023cf0 ffff88007963b5a0 ffff88007a023ca8
08:29:09:[14599.903021]  ffffffffa07cbbcd 0000000000000000 0000000000000000 0000000000000000
08:29:09:[14599.903021] Call Trace:
08:29:09:[14599.903021]  [<ffffffffa07cbbcd>] lnet_ptl_match_md+0x10d/0x820 [lnet]
08:29:09:[14599.903021]  [<ffffffffa07d535a>] lnet_parse_local+0x51a/0xd30 [lnet]
08:29:09:[14599.903021]  [<ffffffffa07d61da>] lnet_parse+0x66a/0xe60 [lnet]
08:29:09:[14599.903021]  [<ffffffffa0a0bda0>] ksocknal_process_receive+0x480/0xda0 [ksocklnd]
08:29:09:[14599.903021]  [<ffffffffa0a0ca4e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
08:29:09:[14599.903021]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
08:29:09:[14599.903021]  [<ffffffffa0a0c960>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
08:29:09:[14599.903021]  [<ffffffff810b064f>] kthread+0xcf/0xe0
08:29:09:[14599.903021]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
08:29:09:[14599.903021]  [<ffffffff81696898>] ret_from_fork+0x58/0x90
08:29:09:[14599.903021]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
08:29:09:[14599.903021] Code: 48 8b 14 ca 44 8b 42 08 47 8d 3c 00 41 83 e7 08 48 39 d8 75 0d eb 41 0f 1f 44 00 00 4c 89 e8 49 89 d5 48 8b 78 58 48 85 ff 74 24 <48> 3b 47 28 0f 85 fc 00 00 00 4c 89 f2 4c 89 e6 e8 14 f6 ff ff 
08:29:09:[14599.903021] RIP  [<ffffffffa07cb987>] lnet_mt_match_md+0x77/0x1b0 [lnet]
08:29:09:[14599.903021]  RSP <ffff88007a023c00>

Info required for matching: sanity-lfsck 31a



 Comments   
Comment by Jian Yu [ 26/Jan/17 ]

The failure also occurred while running sanity-hsm test 500 in review-dne-part-2 test session on master branch:

 Lustre: DEBUG MARKER: == sanity-hsm test 500: various LLAPI HSM tests ====================================================== 21:23:36 (1485379416)
 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version
 Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffffa07c39ae>] lnet_mt_match_md+0x9e/0x1b0 [lnet]
 PGD 1f7ac067 PUD 7b86e067 PMD 0 
 Oops: 0000 [#1] SMP 
 Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic nfsd ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ppdev pcspkr i2c_piix4 virtio_balloon parport_pc parport ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt crc32c_intel fb_sys_fops 8139too ttm serio_raw virtio_pci virtio_ring virtio 8139cp mii drm ata_piix i2c_core libata floppy 
 CPU: 1 PID: 7130 Comm: socknal_sd00_00 Tainted: G        W  OE  ------------   3.10.0-514.2.2.el7_lustre.x86_64 #1
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
 task: ffff880078b64e70 ti: ffff880079454000 task.ti: ffff880079454000
 RIP: 0010:[<ffffffffa07c39ae>]  [<ffffffffa07c39ae>] lnet_mt_match_md+0x9e/0x1b0 [lnet]
 RSP: 0018:ffff880079457c00  EFLAGS: 00010207
 RAX: ffff88003b94acc0 RBX: ffff880077d2b000 RCX: 0000000000000018
 RDX: ffff88007903b0c0 RSI: ffff8800799bd5c0 RDI: 0000000000000000
 RBP: ffff880079457c38 R08: 0000000000000005 R09: 0000282410d40000
 R10: 0001d7dbf9350435 R11: 0000000000000000 R12: ffff880079457cf0
 R13: 0000000000000000 R14: ffff8800786f5200 R15: 0000000000000008
 FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000000428b0000 CR4: 00000000000406e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Stack:
  000001207b452888 ffff8800799bd5c0 ffff8800786f5200 ffff8800799bd5c0
  ffff88007903b0c0 ffff880079457cf0 ffff88007903b0c0 ffff880079457ca8
  ffffffffa07c3bcd 0000000000000000 0000000000000000 0000000000000000
 Call Trace:
  [<ffffffffa07c3bcd>] lnet_ptl_match_md+0x10d/0x820 [lnet]
  [<ffffffffa07cd35a>] lnet_parse_local+0x51a/0xd30 [lnet]
  [<ffffffffa07ce1da>] lnet_parse+0x66a/0xe60 [lnet]
  [<ffffffffa0a03da0>] ksocknal_process_receive+0x480/0xda0 [ksocklnd]
  [<ffffffffa0a04a4e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
  [<ffffffffa0a04960>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
  [<ffffffff810b064f>] kthread+0xcf/0xe0
  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
  [<ffffffff81696898>] ret_from_fork+0x58/0x90
  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140 Code: 85 ff 74 24 48 3b 47 28 0f 85 fc 00 00 00 4c 89 f2 4c 89 e6 e8 14 f6 ff ff b9 00 00 00 00 a8 08 44 0f 44 f9 a8 06 75 55 49 39 dd <49> 8b 55 00 75 c4 41 83 ff 08 74 58 45 85 ff 0f 85 7c 00 00 00 
 RIP  [<ffffffffa07c39ae>] lnet_mt_match_md+0x9e/0x1b0 [lnet]
  RSP <ffff880079457c00> 

Maloo reports:
https://testing.hpdd.intel.com/test_sets/95977c1e-e380-11e6-bac2-5254006e85c2
https://testing.hpdd.intel.com/test_sets/680aeda8-e380-11e6-8845-5254006e85c2

Comment by Bob Glossman (Inactive) [ 26/Jan/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/1fdbb7da-e379-11e6-84f4-5254006e85c2

Comment by Jian Yu [ 29/Jan/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/21f83a1e-e5ce-11e6-978f-5254006e85c2

Comment by Andreas Dilger [ 05/Aug/20 ]

Closing old issue that has not been seen in a long time.

Generated at Sat Feb 10 02:22:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.