[LU-9203] parallel-scale-nfsv3 test_compilebench: MDS hit BUG: unable to handle kernel paging request Created: 10/Mar/17  Updated: 10/Aug/17  Resolved: 01/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.1, Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Sonia Sharma (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/1a11ae8c-f8f9-11e6-aac4-5254006e85c2.

The sub-test test_compilebench failed with the following error:

test failed to respond and timed out

Not sure whether this is a duplicate of LU-8584.
Server/client: lustre-master tag-2.9.53, el7, zfs

MDS console

09:10:55:[  327.535441] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test compilebench: compilebench ============================================== 09:04:15 (1487754255)
09:10:55:[  327.867477] Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2         -r 2 --makej
09:10:55:[  328.161392] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
09:10:55:
09:10:55:[  721.605127] BUG: unable to handle kernel paging request at ffffeb040013bd80
09:10:55:[  721.605127] IP: [<ffffffffa0aa330f>] lnet_cpt_of_md+0xdf/0x120 [lnet]
09:10:55:[  721.605127] PGD 0 
09:10:55:[  721.605127] Oops: 0000 [#1] SMP 
09:10:55:[  721.605127] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common 8139too crc32c_intel cirrus drm_kms_helper serio_raw syscopyarea sysfillrect sysimgblt fb_sys_fops ttm 8139cp mii virtio_pci virtio_ring virtio drm ata_piix libata i2c_core floppy
09:10:55:[  721.605127] CPU: 1 PID: 8060 Comm: mdt00_001 Tainted: P           OE  ------------   3.10.0-514.6.1.el7_lustre.x86_64 #1
09:10:55:[  721.605127] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
09:10:55:[  721.605127] task: ffff880045b43ec0 ti: ffff8800454c8000 task.ti: ffff8800454c8000
09:10:55:[  721.605127] RIP: 0010:[<ffffffffa0aa330f>]  [<ffffffffa0aa330f>] lnet_cpt_of_md+0xdf/0x120 [lnet]
09:10:55:[  721.605127] RSP: 0018:ffff8800454cba18  EFLAGS: 00010202
09:10:55:[  721.605127] RAX: 000001040013bd80 RBX: 0009000000000000 RCX: 000077ff80000000
09:10:55:[  721.605127] RDX: ffffea0000000000 RSI: 0000000000000000 RDI: ffff880079de2280
09:10:55:[  721.605127] RBP: ffff8800454cba18 R08: 0000000000000009 R09: 00000000000003f8
09:10:56:[  721.605127] R10: ffff88003b92a200 R11: ffffc90004ef6100 R12: ffff880013cca380
09:10:56:[  721.605127] R13: 0009000000000000 R14: ffff88003b92a200 R15: 0000000000000000
09:10:56:[  721.605127] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
09:10:56:[  721.605127] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
09:10:56:[  721.605127] CR2: ffffeb040013bd80 CR3: 00000000019ba000 CR4: 00000000000406e0
09:10:56:[  721.605127] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
09:10:56:[  721.605127] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
09:10:56:[  721.605127] Stack:
09:10:56:[  721.605127]  ffff8800454cbab0 ffffffffa0aaa7ba ffff88003df83df0 ffffffffffffffff
09:10:56:[  721.605127]  0000000016636b10 0000000000000246 ffff88007d001900 0000000000008050
09:10:56:[  721.605127]  00000000ffffffff ffffffffa0aa200c 0009000000000000 ffff88003b92a200
09:10:56:[  721.605127] Call Trace:
09:10:56:[  721.605127]  [<ffffffffa0aaa7ba>] lnet_select_pathway+0x5a/0x1010 [lnet]
09:10:56:[  721.605127]  [<ffffffffa0aa200c>] ? LNetMDBind+0x7c/0x5e0 [lnet]
09:10:56:[  721.605127]  [<ffffffffa0aace71>] lnet_send+0x51/0x180 [lnet]
09:10:56:[  721.605127]  [<ffffffffa0aad1e5>] LNetPut+0x245/0x7a0 [lnet]
09:10:56:[  721.605127]  [<ffffffffa0d79d76>] ptl_send_buf+0x146/0x530 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0a12cce>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
09:10:56:[  721.605127]  [<ffffffffa0d9c637>] ? at_measured+0x1c7/0x380 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0d7cffb>] ptlrpc_send_reply+0x29b/0x830 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0d3b24e>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0d45fc6>] target_send_reply+0x306/0x730 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0d83657>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0de1f37>] tgt_request_handle+0x587/0x1320 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0d8d7ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0d8b368>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffff810c4fe2>] ? default_wake_function+0x12/0x20
09:10:56:[  721.605127]  [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
09:10:56:[  721.605127]  [<ffffffffa0d917b0>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffffa0d90d10>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
09:10:56:[  721.605127]  [<ffffffff810b064f>] kthread+0xcf/0xe0
09:10:56:[  721.605127]  [<ffffffff810bf9f3>] ? finish_task_switch+0x53/0x180
09:10:56:[  721.605127]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
09:10:56:[  721.605127]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
09:10:56:[  721.605127]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
09:10:56:[  721.605127] Code: ff 77 00 00 48 8b 3d 91 53 03 00 48 01 d0 48 0f 42 0d 16 dd f1 e0 48 ba 00 00 00 00 00 ea ff ff 48 01 c8 48 c1 e8 0c 48 c1 e0 06 <48> 8b 34 10 48 c1 ee 36 e8 c4 fc f6 ff 5d c3 66 90 b8 ff ff ff 
09:10:56:[  721.605127] RIP  [<ffffffffa0aa330f>] lnet_cpt_of_md+0xdf/0x120 [lnet]
09:10:56:[  721.605127]  RSP <ffff8800454cba18>
09:10:56:[  721.605127] CR2: ffffeb040013bd80
09:10:56:[    0.000000] Initializing cgroup subsys cpuset
09:10:56:[    0.000000] Initializing cgroup subsys cpu
09:10:56:[    0.000000] Initializing cgroup subsys cpuacct
09:10:56:[    0.000000] Linux version 3.10.0-514.6.1.el7_lustre.x86_64 (jenkins@trevis-308.trevis.hpdd.intel.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Tue Feb 14 04:06:44 UTC 2017
09:10:56:[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-514.6.1.el7_lustre.x86_64 root=UUID=70563313-e0a3-4c81-b456-70fdcd7f6e9f ro console=tty0 LANG=en_US.UTF-8 console=ttyS0,115200 net.ifnames=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never disable_cpu_apicid=0 elfcorehdr=867708K
09:10:56:[    0.000000] Disabled fast string operations
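
Note on the oops above (a reading of the register dump, not something stated elsewhere in this ticket): the bytes in the Code: line just before the faulting instruction decode to shr $12, shl $6, then a load from RAX+RDX followed by shr $54. That looks like a virtual-to-struct-page conversion followed by a read of the NUMA node bits from the page flags. The sketch below redoes that arithmetic from the logged registers. It assumes the usual x86_64 layout for a 3.10 (el7) kernel without KASLR and a 64-byte struct page; the layout constants and the helper names are assumptions for illustration, not values taken from the log.

```python
# Values copied from the register dump in the MDS console log.
RAX = 0x000001040013bd80
RCX = 0x000077ff80000000
RDX = 0xffffea0000000000
R11 = 0xffffc90004ef6100
CR2 = 0xffffeb040013bd80

# ASSUMED x86_64 layout for a 3.10 (el7) kernel without KASLR.
PAGE_OFFSET = 0xffff880000000000        # start of the direct mapping
START_KERNEL_MAP = 0xffffffff80000000   # start of the kernel image mapping
VMEMMAP_START = 0xffffea0000000000      # start of the struct page array
VMALLOC_START = 0xffffc90000000000
VMALLOC_END = 0xffffe8ffffffffff
PAGE_SHIFT = 12
SIZEOF_STRUCT_PAGE = 64                 # matches the shl $6 in the Code: bytes
MASK64 = (1 << 64) - 1


def direct_map_virt_to_page(vaddr):
    """Hypothetical helper: treat vaddr as a direct-map address and
    compute the struct page pointer the way the decoded code does."""
    phys = (vaddr - PAGE_OFFSET) & MASK64
    pfn = phys >> PAGE_SHIFT
    return (VMEMMAP_START + pfn * SIZEOF_STRUCT_PAGE) & MASK64


def in_vmalloc_range(vaddr):
    """Hypothetical helper: True if vaddr lies in the assumed vmalloc area."""
    return VMALLOC_START <= vaddr < VMALLOC_END


# RAX holds pfn * 64 at the fault, so the input page can be recovered from it.
pfn = RAX // SIZEOF_STRUCT_PAGE
input_page = ((pfn << PAGE_SHIFT) + PAGE_OFFSET) & MASK64

print("fault address matches RAX + RDX:", (RAX + RDX) & MASK64 == CR2)
print("RCX is the direct-map offset:", RCX == START_KERNEL_MAP - PAGE_OFFSET)
print("recovered input page:", hex(input_page))
print("input page is in vmalloc range:", in_vmalloc_range(input_page))
print("R11 lies in the same page:", R11 >> PAGE_SHIFT == input_page >> PAGE_SHIFT)
```

Under those assumptions the address being converted falls in the vmalloc area rather than the direct mapping, so the computed struct page pointer lands in an unmapped part of the vmemmap and the load faults. That would be consistent with the crash being inside lnet_cpt_of_md(), but the ticket itself does not spell out the root cause.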

Info required for matching: parallel-scale-nfsv3 compilebench



Comments
Comment by Sarah Liu [ 10/Mar/17 ]

DNE with ZFS also failed on the same tag: https://testing.hpdd.intel.com/test_sets/68dd9f2c-f92a-11e6-aa39-5254006e85c2

Comment by James Casper [ 24/May/17 ]

2.9.57, b3575:
https://testing.hpdd.intel.com/test_sessions/df55763f-2960-40d5-b78d-bd088d00e6e3
(el7+el7, zfs)

Comment by nasf (Inactive) [ 15/Jun/17 ]

This can also be reproduced on RHEL7 + ZFS during sanity-quota test_38.

Comment by Peter Jones [ 19/Jul/17 ]

Sonia

Could you please investigate?

Thanks

Peter

Comment by Sarah Liu [ 20/Jul/17 ]

Please refer to https://wiki.hpdd.intel.com/display/TEI/Core+dump+location+for+autotest+nodes for the core dump location.

Comment by Gerrit Updater [ 21/Jul/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/28165
Subject: LU-9203 lnet: fix lnet_cpt_of_md()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3e945177186e4cf16445467fac9c4ee5b4ed060e

Comment by Gerrit Updater [ 01/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28165/
Subject: LU-9203 lnet: fix lnet_cpt_of_md()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 43b0e6328b113d9ee64e0b8a0cc35bff28eb3383

Comment by Peter Jones [ 01/Aug/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 07/Aug/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28400
Subject: LU-9203 lnet: fix lnet_cpt_of_md()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 3d98886a31f8ee5cb683eb0686a822d2cc5ea878

Comment by Gerrit Updater [ 10/Aug/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28400/
Subject: LU-9203 lnet: fix lnet_cpt_of_md()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: fe993a1fc4d8681112cb0f452ad569233692e4c9
