[LU-8542] Soft lockup, eventually ending in a Kernel Panic Created: 25/Aug/16 Updated: 28/Aug/16 Resolved: 28/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Adam Roe (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.2, NVMe devices, DNE2, LDISKFS MDTs, OPA with IFS 10.1.1.0.9 and Lustre-master Build #3419 |
||
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A soft lockup that eventually ends in a kernel panic. I have seen this issue once when running ZFS as the backend, but I see it very frequently on LDISKFS. The workload that triggers it is mdtest on DNE2 striped directories; this instance failed with 7x MDSs at 1x MDT per MDS, but I have seen it fail with various other combinations.

Message from syslogd@zlfs2-oss7 at Aug 25 12:27:14 ...
kernel:BUG: soft lockup - CPU#18 stuck for 23s! [mdt02_034:4962]

Aug 25 12:27:14 zlfs2-oss7 kernel: BUG: soft lockup - CPU#18 stuck for 23s! [mdt02_034:4962]
Aug 25 12:27:14 zlfs2-oss7 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) mbcache jbd2 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) xprtrdma ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp coretemp intel_rapl kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper mxm_wmi cryptd iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich sg mfd_core ipmi_devintf pcspkr mei_me mei ioatdma hfi1 ipmi_si ipmi_msghandler sb_edac edac_core wmi shpchp acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl
Aug 25 12:27:14 zlfs2-oss7 kernel: lockd grace sunrpc ip_tables xfs libcrc32c mlx4_ib ib_sa ib_mad mlx4_en vxlan ip6_udp_tunnel udp_tunnel ib_core ib_addr raid1 sd_mod crc_t10dif crct10dif_generic mgag200 crct10dif_pclmul syscopyarea crct10dif_common sysfillrect sysimgblt crc32c_intel i2c_algo_bit drm_kms_helper ttm nvme drm ixgbe ahci libahci mlx4_core mdio i2c_core libata ptp pps_core dca dm_mirror dm_region_hash dm_log dm_mod zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate
Aug 25 12:27:14 zlfs2-oss7 kernel: CPU: 18 PID: 4962 Comm: mdt02_034 Tainted: P OEL ------------ 3.10.0-327.22.2.el7_lustre.x86_64 #1
Aug 25 12:27:14 zlfs2-oss7 kernel: Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/2016
Aug 25 12:27:14 zlfs2-oss7 kernel: task: ffff88202258b980 ti: ffff881f8d9b4000 task.ti: ffff881f8d9b4000
Aug 25 12:27:14 zlfs2-oss7 kernel: RIP: 0010:[<ffffffff8163dcd7>] [<ffffffff8163dcd7>] _raw_spin_lock+0x37/0x50
Aug 25 12:27:14 zlfs2-oss7 kernel: RSP: 0018:ffff881f8d9b74d0 EFLAGS: 00000206
Aug 25 12:27:14 zlfs2-oss7 kernel: RAX: 000000000000544e RBX: ffff88102533f270 RCX: 00000000000034ec
Aug 25 12:27:14 zlfs2-oss7 kernel: RDX: 0000000000000e3a RSI: 0000000000000e3a RDI: ffff881fe27893a0
Aug 25 12:27:14 zlfs2-oss7 kernel: RBP: ffff881f8d9b74d0 R08: 7010000000000000 R09: 10009e4e38080000
Aug 25 12:27:14 zlfs2-oss7 kernel: R10: efe165b1ef8b8e02 R11: 0000000000000000 R12: ffff882000fc4dd0
Aug 25 12:27:14 zlfs2-oss7 kernel: R13: ffff8810009e4e38 R14: ffffffff8121298b R15: ffff881f8d9b74e0
Aug 25 12:27:14 zlfs2-oss7 kernel: FS: 0000000000000000(0000) GS:ffff88203ea80000(0000) knlGS:0000000000000000
Aug 25 12:27:14 zlfs2-oss7 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 25 12:27:14 zlfs2-oss7 kernel: CR2: 00000000006dde20 CR3: 000000000194a000 CR4: 00000000001407e0
Aug 25 12:27:14 zlfs2-oss7 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 25 12:27:14 zlfs2-oss7 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 25 12:27:14 zlfs2-oss7 kernel: Stack:
Aug 25 12:27:14 zlfs2-oss7 kernel: ffff881f8d9b7558 ffffffffa0d2dbfc ffff881f8d9b7500 ffffffff8121329c
Aug 25 12:27:14 zlfs2-oss7 kernel: 00000000e278a000 ffff881fe2789000 0000000000000000 ffffffff8121332d
Aug 25 12:27:14 zlfs2-oss7 kernel: 0000000100000020 ffff881f8d9b7528 ffffffff812404dc 00000000b5ead192
Aug 25 12:27:14 zlfs2-oss7 kernel: Call Trace:
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0d2dbfc>] do_get_write_access+0x32c/0x4e0 [jbd2]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff8121329c>] ? __find_get_block+0xbc/0x120
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff8121332d>] ? __getblk+0x2d/0x2e0
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff812404dc>] ? inode_reserved_space+0x1c/0x20
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0d2ddd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa136b37b>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa1372197>] __ldiskfs_new_inode+0x447/0x1300 [ldiskfs]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa13394c7>] ldiskfs_create_inode+0x37/0xa0 [ldiskfs]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa14636e9>] osd_mkfile.isra.80+0x119/0x230 [osd_ldiskfs]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa146c2f5>] ? osd_trans_exec_op+0x25/0x310 [osd_ldiskfs]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa1463873>] osd_mkreg+0x33/0x70 [osd_ldiskfs]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa1475b65>] osd_object_ea_create+0x1f5/0xc60 [osd_ldiskfs]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa169a7d2>] lod_sub_object_create+0x1f2/0x480 [lod]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff811c153a>] ? kmem_cache_alloc+0x1ba/0x1d0
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa1691c4f>] lod_object_create+0xaf/0x200 [lod]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa16f4f35>] mdd_object_create_internal+0xb5/0x280 [mdd]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa16e0086>] mdd_object_create+0x76/0xa30 [mdd]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa16ec1a0>] ? mdd_declare_create+0x490/0xc60 [mdd]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa16ed637>] mdd_create+0xcc7/0x12b0 [mdd]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa15d1c1b>] mdt_reint_open+0x223b/0x31a0 [mdt]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0db4009>] ? upcall_cache_get_entry+0x3e9/0x8e0 [obdclass]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa15b7ab3>] ? ucred_set_jobid+0x53/0x70 [mdt]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa15c7080>] mdt_reint_rec+0x80/0x210 [mdt]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa15a9d62>] mdt_reint_internal+0x5b2/0x9b0 [mdt]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa15aa2c2>] mdt_intent_reint+0x162/0x430 [mdt]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa15b493c>] mdt_intent_policy+0x5bc/0xbb0 [mdt]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0f30f02>] ? ldlm_resource_get+0x5e2/0xa30 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0f2a1e7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0f52ce2>] ldlm_handle_enqueue0+0x772/0x16b0 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0f7ac30>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0fd36b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0fd7b15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0f83ccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0be6568>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0f81888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0f87d80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffffa0f872e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff816469d8>] ret_from_fork+0x58/0x90
Aug 25 12:27:14 zlfs2-oss7 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Aug 25 12:27:14 zlfs2-oss7 kernel: Code: 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0c 0f 1f 44 00 00 f3 90 83 e8 01 74 0a <0f> b7 0f 66 39 ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb da 66 0f |
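
The workload described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the reporter's exact invocation: the mount point, stripe count, rank count, and mdtest parameters are all assumptions.

```shell
# Create a DNE2 striped directory spanning all 7 MDTs
# (stripe count and path are assumptions for illustration):
lfs setdirstripe -c 7 /mnt/lustre/testdir

# Drive concurrent metadata operations (create/stat/remove) into it;
# rank count and per-rank file counts are placeholders:
mpirun -np 64 mdtest -n 10000 -i 3 -d /mnt/lustre/testdir
```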
| Comments |
| Comment by nasf (Inactive) [ 28/Aug/16 ] |
|
Adam, if you also hit similar trouble on a ZFS-backed system, the stack trace there should be different, because ZFS has its own transaction mechanism. According to the current stack trace, this is another failure instance of |