Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.8.0
- Component/s: None
- Environment: 3.10.0-327.3.1.el7_lustre.x86_64 / lustre-2.8.0-3.10.0_327.3.1.el7_lustre.x86_64.x86_64
- Severity: 3
- Rank: 9223372036854775807
Description
Hi,
I've been trying to test DNE across two MDS servers, but I'm failing at the first hurdle. After adding a second MDT to a test filesystem and running some copy & unlink tests on the client, the root MDS falls over pretty badly, reporting soft lockups until it dies completely.
My inexperienced eyes couldn't make much sense of the dmesg errors (jbd2 journal?), and a quick search of JIRA didn't turn up any matching tickets. The log messages before the final server hang looked like:
Jan 27 13:03:30 fmds1 kernel: BUG: soft lockup - CPU#48 stuck for 22s! [jbd2/md0-8:11087] Jan 27 13:03:30 fmds1 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ldiskfs(OE) mbcache jbd2 vfat fat mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu xt_multiport iptable_filter nfsv3 nfs fscache bonding dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm sg raid1 ipmi_devintf sb_edac ipmi_si edac_core ipmi_msghandler mei_me lpc_ich shpchp mei mfd_core acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic mxm_wmi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 syscopyarea sysfillrect ghash_clmulni_intel Jan 27 13:03:30 fmds1 kernel: sysimgblt i2c_algo_bit drm_kms_helper aesni_intel lrw gf128mul ttm glue_helper ablk_helper cryptd bnx2x drm ahci libahci ptp i2c_core libata pps_core mdio megaraid_sas nvme libcrc32c wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi dm_multipath dm_mod [last unloaded: usb_storage] Jan 27 13:03:30 fmds1 kernel: CPU: 48 PID: 11087 Comm: jbd2/md0-8 Tainted: G OE ------------ 3.10.0-327.3.1.el7_lustre.x86_64 #1 Jan 27 13:03:30 fmds1 kernel: Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016 Jan 27 13:03:30 fmds1 kernel: task: ffff883fce5e7300 ti: ffff883fcaebc000 task.ti: ffff883fcaebc000 Jan 27 13:03:30 fmds1 kernel: RIP: 0010:[<ffffffffa0dccfdc>] [<ffffffffa0dccfdc>] ptlrpc_commit_replies+0xbc/0x380 [ptlrpc] Jan 27 13:03:30 fmds1 kernel: RSP: 0018:ffff883fcaebfb90 EFLAGS: 00000212 Jan 27 13:03:30 fmds1 kernel: RAX: ffff881e068c3030 RBX: 0000000000000246 RCX: ffff881ef1baa010 Jan 27 13:03:30 fmds1 kernel: RDX: ffff881e068dc000 RSI: ffff883fcaebfba0 RDI: ffff881ef1baa040 Jan 27 13:03:30 fmds1 kernel: RBP: ffff883fcaebfc00 R08: ffff883fcaebfba0 R09: 0000000180400034 Jan 27 13:03:30 fmds1 kernel: R10: ffffea007c591900 R11: ffffffffa09cf975 R12: ffff883fcaebfbb0 Jan 27 13:03:30 fmds1 kernel: R13: ffff881ffec07600 R14: ffff881e20887200 R15: ffffea0078822000 Jan 27 13:03:30 fmds1 kernel: FS: 0000000000000000(0000) GS:ffff881fff500000(0000) knlGS:0000000000000000 Jan 27 13:03:30 fmds1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jan 27 13:03:30 fmds1 kernel: CR2: 00007fc20e5e5a08 CR3: 000000000194a000 CR4: 00000000003407e0 Jan 27 13:03:30 fmds1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 27 13:03:30 fmds1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jan 27 13:03:30 fmds1 kernel: Stack: Jan 27 13:03:30 fmds1 kernel: ffff881fd10e5938 ffff881ef1baa010 ffff881ef1baa010 ffff881ef1baa010 Jan 27 13:03:30 fmds1 kernel: 0000000000000001 ffff881f57287800 0000000000000028 000000009ca909b7 Jan 27 13:03:30 fmds1 kernel: ffff881f16466400 ffff881e9b5da180 ffff881e490ae800 0000000000000000 Jan 27 13:03:30 fmds1 kernel: Call Trace: Jan 27 13:03:30 fmds1 kernel: [<ffffffffa0e1371c>] tgt_cb_last_committed+0x16c/0x3d0 [ptlrpc] Jan 27 13:03:30 fmds1 kernel: [<ffffffffa09cf8cb>] osd_trans_commit_cb+0xbb/0x380 [osd_ldiskfs] Jan 27 13:03:30 fmds1 kernel: [<ffffffffa07998e4>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs] Jan 27 13:03:30 fmds1 kernel: [<ffffffffa06ae364>] jbd2_journal_commit_transaction+0x1494/0x19a0 [jbd2] Jan 27 13:03:30 fmds1 kernel: [<ffffffff81013588>] ? 
__switch_to+0xf8/0x4b0 Jan 27 13:03:30 fmds1 kernel: [<ffffffffa06b2d79>] kjournald2+0xc9/0x260 [jbd2] Jan 27 13:03:30 fmds1 kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 Jan 27 13:03:30 fmds1 kernel: [<ffffffffa06b2cb0>] ? commit_timeout+0x10/0x10 [jbd2] Jan 27 13:03:30 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0 Jan 27 13:03:30 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 Jan 27 13:03:30 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90 Jan 27 13:03:30 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 Jan 27 13:03:30 fmds1 kernel: Code: 7d d0 48 8d 58 d0 0f 84 43 01 00 00 41 f6 45 14 01 0f 84 67 02 00 00 49 83 7d 38 00 75 39 e9 8d 02 00 00 0f 1f 40 00 48 8b 43 30 <4c> 8d 6b 30 4d 39 f5 48 8d 50 d0 0f 84 13 01 00 00 f6 43 44 01 Jan 27 13:04:02 fmds1 kernel: BUG: soft lockup - CPU#16 stuck for 22s! [jbd2/md0-8:11087] Jan 27 13:04:02 fmds1 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ldiskfs(OE) mbcache jbd2 vfat fat mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu xt_multiport iptable_filter nfsv3 nfs fscache bonding dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm sg raid1 ipmi_devintf sb_edac ipmi_si edac_core ipmi_msghandler mei_me lpc_ich shpchp mei mfd_core acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic mxm_wmi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 syscopyarea sysfillrect ghash_clmulni_intel Jan 27 13:04:02 fmds1 kernel: sysimgblt i2c_algo_bit drm_kms_helper aesni_intel lrw gf128mul ttm glue_helper ablk_helper cryptd bnx2x drm ahci libahci ptp i2c_core libata pps_core mdio megaraid_sas nvme libcrc32c wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi dm_multipath dm_mod [last unloaded: usb_storage] Jan 27 13:04:02 fmds1 kernel: CPU: 16 PID: 11087 Comm: jbd2/md0-8 Tainted: G OEL ------------ 3.10.0-327.3.1.el7_lustre.x86_64 #1 Jan 27 13:04:02 fmds1 kernel: Hardware name: Dell Inc. 
PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016 Jan 27 13:04:02 fmds1 kernel: task: ffff883fce5e7300 ti: ffff883fcaebc000 task.ti: ffff883fcaebc000 Jan 27 13:04:02 fmds1 kernel: RIP: 0010:[<ffffffffa0dccfdc>] [<ffffffffa0dccfdc>] ptlrpc_commit_replies+0xbc/0x380 [ptlrpc] Jan 27 13:04:02 fmds1 kernel: RSP: 0018:ffff883fcaebfb90 EFLAGS: 00000206 Jan 27 13:04:02 fmds1 kernel: RAX: ffff881e126c8030 RBX: 0000000000000246 RCX: 0000000000004d7c Jan 27 13:04:02 fmds1 kernel: RDX: ffff881e12693000 RSI: 0000000000004d7c RDI: ffff881fd10e5938 Jan 27 13:04:02 fmds1 kernel: RBP: ffff883fcaebfc00 R08: ffff881f16a4b900 R09: 0000000180400038 Jan 27 13:04:02 fmds1 kernel: R10: ffffea007c595900 R11: ffffffffa09cf975 R12: ffff883fcaebfbb0 Jan 27 13:04:02 fmds1 kernel: R13: ffff881ffec07600 R14: ffff881e1f891a00 R15: ffffea00787e2400 Jan 27 13:04:02 fmds1 kernel: FS: 0000000000000000(0000) GS:ffff881fff300000(0000) knlGS:0000000000000000 Jan 27 13:04:02 fmds1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jan 27 13:04:02 fmds1 kernel: CR2: 00007f693d28b000 CR3: 000000000194a000 CR4: 00000000003407e0 Jan 27 13:04:02 fmds1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 27 13:04:02 fmds1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jan 27 13:04:02 fmds1 kernel: Stack: Jan 27 13:04:02 fmds1 kernel: ffff881fd10e5938 ffffffffa0bc48e1 ffff883fcaebfba0 ffff883fcaebfba0 Jan 27 13:04:02 fmds1 kernel: 0000000000000000 0000000000000000 0000000000000028 000000009ca909b7 Jan 27 13:04:02 fmds1 kernel: ffff881f16567100 ffff881f16a4b900 ffff881f2a7d0e00 0000000000000000 Jan 27 13:04:02 fmds1 kernel: Call Trace: Jan 27 13:04:02 fmds1 kernel: [<ffffffffa0bc48e1>] ? keys_fini+0xb1/0x1d0 [obdclass] Jan 27 13:04:02 fmds1 kernel: [<ffffffffa0e1371c>] tgt_cb_last_committed+0x16c/0x3d0 [ptlrpc] Jan 27 13:04:02 fmds1 kernel: [<ffffffffa09cf8cb>] osd_trans_commit_cb+0xbb/0x380 [osd_ldiskfs] Jan 27 13:04:02 fmds1 kernel: [<ffffffffa07998e4>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs] Jan 27 13:04:02 fmds1 kernel: [<ffffffffa06ae364>] jbd2_journal_commit_transaction+0x1494/0x19a0 [jbd2] Jan 27 13:04:02 fmds1 kernel: [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 Jan 27 13:04:02 fmds1 kernel: [<ffffffffa06b2d79>] kjournald2+0xc9/0x260 [jbd2] Jan 27 13:04:02 fmds1 kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 Jan 27 13:04:02 fmds1 kernel: [<ffffffffa06b2cb0>] ? commit_timeout+0x10/0x10 [jbd2] Jan 27 13:04:02 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0 Jan 27 13:04:02 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 Jan 27 13:04:02 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90 Jan 27 13:04:02 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 Jan 27 13:04:02 fmds1 kernel: Code: 7d d0 48 8d 58 d0 0f 84 43 01 00 00 41 f6 45 14 01 0f 84 67 02 00 00 49 83 7d 38 00 75 39 e9 8d 02 00 00 0f 1f 40 00 48 8b 43 30 <4c> 8d 6b 30 4d 39 f5 48 8d 50 d0 0f 84 13 01 00 00 f6 43 44 01 Jan 27 13:24:12 fmds1 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="3956" x-info="http://www.rsyslog.com"] start Jan 27 13:24:12 fmds1 rsyslogd-2184: action '*' treated as ':omusrmsg:*' - please change syntax, '*' will not be supported in the future [try http://www.rsyslog.com/e/2184 ] Jan 27 13:23:33 fmds1 journal: Runtime journal is using 8.0M (max allowed 4.0G, trying to leave 4.0G free of 125.8G available → current limit 4.0G). 
Jan 27 13:23:33 fmds1 kernel: Initializing cgroup subsys cpuset Jan 27 13:23:33 fmds1 kernel: Initializing cgroup subsys cpu Jan 27 13:23:33 fmds1 kernel: Initializing cgroup subsys cpuacct Jan 27 13:23:33 fmds1 kernel: Linux version 3.10.0-327.3.1.el7_lustre.x86_64 (jenkins@onyx-1-sdg1-el7-x8664.onyx.hpdd.intel.com) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Feb 18 10:53:23 PST 2016 Jan 27 13:23:33 fmds1 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-327.3.1.el7_lustre.x86_64 root=UUID=c5ba4ad5-a0e7-4e5a-b2d4-f2defd3aa0fa ro nouveau.modeset=0 crashkernel=auto selinux=0 biosdevname=0 net.ifnames=0 rhgb quiet LANG=en_GB.UTF-8 Jan 27 13:23:33 fmds1 kernel: e820: BIOS-provided physical RAM map:
My test setup goes a little something like this:
# fmds1
mkfs.lustre --reformat --fsname=lustre --mdt --mgs --index 0 /dev/md0
mount -t lustre /dev/md0 /mnt/mdt

# fmds2
mkfs.lustre --reformat --fsname=lustre --mdt --mgsnode=fmds1 --index 1 /dev/md0
mount -t lustre /dev/md0 /mnt/mdt

# foss1
touch /dev/shm/lustre-ost1; mkfs.lustre --reformat --fsname=lustre --mgsnode=fmds1 --ost --index=0 --device-size=10000000 /dev/shm/lustre-ost1
touch /dev/shm/lustre-ost2; mkfs.lustre --reformat --fsname=lustre --mgsnode=fmds1 --ost --index=1 --device-size=10000000 /dev/shm/lustre-ost2
mount -t lustre -o loop /dev/shm/lustre-ost1 /mnt/ost1
mount -t lustre -o loop /dev/shm/lustre-ost2 /mnt/ost2

# client
mount -t lustre fmds1:/lustre /mnt/lustre
for x in `seq 1 10`; do cp -a /mnt/lustre /mnt/lustre.$x & done; wait
for x in `seq 1 10`; do rm -rf /mnt/lustre.$x & done; wait
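For completeness, this is roughly how I check that both MDTs are registered and how a directory would eventually be placed on the second MDT (standard lfs commands; the directory name is just an example, and I have not actually run the lfs mkdir step in the failing tests above):

# client: confirm both MDTs are visible and check inode usage
lfs df -i /mnt/lustre
lfs mdts /mnt/lustre

# client: place a directory on MDT0001 (DNE remote directory) - not yet used in these tests
lfs mkdir -i 1 /mnt/lustre/on_mdt1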
But yesterday, when I was running the loopback OSTs on the same server as MDT0000 (fmds1), I got some more descriptive logs for whatever reason:
Jan 26 16:11:25 fmds1 kernel: LNet: Service thread pid 1085 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jan 26 16:11:25 fmds1 kernel: Pid: 1085, comm: mdt06_003 Jan 26 16:11:25 fmds1 kernel: #012Call Trace: Jan 26 16:11:25 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70 Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2] Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40 Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa1037aca>] ? ldiskfs_xattr_trusted_get+0x2a/0x30 [ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0 Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa117bf15>] ? mdd_declare_links_add+0xb5/0x340 [mdd] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11858fa>] mdd_unlink+0x63a/0xd80 [mdd] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10af600>] ? osd_xattr_get+0x50/0x260 [osd_ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12193fb>] mdo_unlink+0x43/0x45 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11eaf32>] mdt_reint_unlink+0xa62/0x10c0 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11ee450>] mdt_reint_rec+0x80/0x210 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11d1971>] mdt_reint_internal+0x5e1/0x990 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11db0d7>] mdt_reint+0x67/0x140 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0be20d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0be1640>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0 Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0 Jan 26 16:11:25 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90 Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0 Jan 26 16:11:25 fmds1 kernel: Jan 26 16:11:25 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447085.1085 Jan 26 16:11:25 fmds1 kernel: Pid: 5678, comm: mdt04_015 Jan 26 16:11:25 fmds1 kernel: #012Call Trace: Jan 26 16:11:25 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70 Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2] Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a6ae0>] ? 
autoremove_wake_function+0x0/0x40 Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa1037aca>] ? ldiskfs_xattr_trusted_get+0x2a/0x30 [ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0 Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa117bf15>] ? mdd_declare_links_add+0xb5/0x340 [mdd] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11858fa>] mdd_unlink+0x63a/0xd80 [mdd] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10af600>] ? osd_xattr_get+0x50/0x260 [osd_ldiskfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12193fb>] mdo_unlink+0x43/0x45 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11eaf32>] mdt_reint_unlink+0xa62/0x10c0 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11ee450>] mdt_reint_rec+0x80/0x210 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11d1971>] mdt_reint_internal+0x5e1/0x990 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11db0d7>] mdt_reint+0x67/0x140 [mdt] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0be20d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0be1640>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc] Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0 Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0 Jan 26 16:11:25 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90 Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0 Jan 26 16:11:25 fmds1 kernel: Jan 26 16:12:08 fmds1 kernel: LNet: Skipped 1 previous similar message Jan 26 16:12:08 fmds1 kernel: Pid: 15733, comm: mdt_rdpg02_012 Jan 26 16:12:08 fmds1 kernel: #012Call Trace: Jan 26 16:12:08 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70 Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2] Jan 26 16:12:08 fmds1 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40 Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod] Jan 26 16:12:08 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0 Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa10b4374>] ? 
osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa12861ab>] ? lod_load_striping+0x7b/0x180 [lod] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa118b961>] mdd_attr_set+0x541/0xb60 [mdd] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0bd2eb7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa11f622e>] mdt_mfd_close+0x19e/0x620 [mdt] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa11fb963>] mdt_close_internal+0x103/0x470 [mdt] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa11fbef0>] mdt_close+0x220/0x780 [mdt] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] Jan 26 16:12:08 fmds1 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 Jan 26 16:12:08 fmds1 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0be20d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0be1640>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc] Jan 26 16:12:08 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0 Jan 26 16:12:08 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0 Jan 26 16:12:08 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90 Jan 26 16:12:08 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0 Jan 26 16:12:08 fmds1 kernel: Jan 26 16:12:08 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447128.15733 Jan 26 16:12:18 fmds1 kernel: LNet: Service thread pid 196057 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jan 26 16:12:18 fmds1 kernel: Pid: 196057, comm: mdt_rdpg03_001 Jan 26 16:12:18 fmds1 kernel: #012Call Trace: Jan 26 16:12:18 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70 Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2] Jan 26 16:12:18 fmds1 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40 Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod] Jan 26 16:12:18 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0 Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa12861ab>] ? 
lod_load_striping+0x7b/0x180 [lod] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa118b961>] mdd_attr_set+0x541/0xb60 [mdd] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0bd2eb7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa11f622e>] mdt_mfd_close+0x19e/0x620 [mdt] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa11fb963>] mdt_close_internal+0x103/0x470 [mdt] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa11fbef0>] mdt_close+0x220/0x780 [mdt] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] Jan 26 16:12:18 fmds1 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 Jan 26 16:12:18 fmds1 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 .. .. Jan 26 16:12:28 fmds1 kernel: Jan 26 16:12:28 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447148.16988 Jan 26 16:12:38 fmds1 kernel: LNet: Service thread pid 15732 was inactive for 40.09s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Jan 26 16:12:38 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447158.15732 Jan 26 16:12:48 fmds1 kernel: LNet: Service thread pid 157913 was inactive for 40.08s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Jan 26 16:12:48 fmds1 kernel: LNet: Skipped 1 previous similar message Jan 26 16:12:48 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447168.157913 Jan 26 16:12:58 fmds1 kernel: LNet: Service thread pid 88689 was inactive for 40.06s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Jan 26 16:12:58 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447178.88689 Jan 26 16:13:08 fmds1 kernel: LNet: Service thread pid 6200 was inactive for 40.04s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Jan 26 16:13:08 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447188.6200 Jan 26 16:13:10 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447190.196065 Jan 26 16:13:40 fmds1 kernel: LNet: Service thread pid 13050 was inactive for 62.17s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Jan 26 16:13:40 fmds1 kernel: LNet: Skipped 2 previous similar messages Jan 26 16:13:40 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447220.13050 Jan 26 16:13:40 fmds1 kernel: INFO: task mdt_rdpg03_001:196057 blocked for more than 120 seconds. Jan 26 16:13:40 fmds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Jan 26 16:13:40 fmds1 kernel: mdt_rdpg03_001 D ffff881f5bd4f6f0 0 196057 2 0x00000000 Jan 26 16:13:40 fmds1 kernel: ffff883fa725b950 0000000000000046 ffff883f7ca7f300 ffff883fa725bfd8 Jan 26 16:13:40 fmds1 kernel: ffff883fa725bfd8 ffff883fa725bfd8 ffff883f7ca7f300 ffff883afbb4d000 Jan 26 16:13:40 fmds1 kernel: ffff883afbb4d078 0000000000000894 ffff883f9a037800 ffff881f5bd4f6f0 Jan 26 16:13:40 fmds1 kernel: Call Trace: Jan 26 16:13:40 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70 Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2] Jan 26 16:13:40 fmds1 kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod] Jan 26 16:13:40 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0 Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa12861ab>] ? lod_load_striping+0x7b/0x180 [lod] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa118b961>] mdd_attr_set+0x541/0xb60 [mdd] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0bd2eb7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa11f622e>] mdt_mfd_close+0x19e/0x620 [mdt] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa11fb963>] mdt_close_internal+0x103/0x470 [mdt] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa11fbef0>] mdt_close+0x220/0x780 [mdt] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] Jan 26 16:13:40 fmds1 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 Jan 26 16:13:40 fmds1 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0be20d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0be1640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] Jan 26 16:13:40 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0 Jan 26 16:13:40 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 Jan 26 16:13:40 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90 Jan 26 16:13:40 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 Jan 26 16:13:40 fmds1 kernel: INFO: task mdt06_003:1085 blocked for more than 120 seconds. Jan 26 16:13:40 fmds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Jan 26 16:13:40 fmds1 kernel: mdt06_003 D ffff883fb9c0bf00 0 1085 2 0x00000000 Jan 26 16:13:40 fmds1 kernel: ffff883953fd3870 0000000000000046 ffff883f6cfb3980 ffff883953fd3fd8 Jan 26 16:13:40 fmds1 kernel: ffff883953fd3fd8 ffff883953fd3fd8 ffff883f6cfb3980 ffff883afbb4d000 Jan 26 16:13:40 fmds1 kernel: ffff883afbb4d078 0000000000000894 ffff883f9a037800 ffff883fb9c0bf00 Jan 26 16:13:40 fmds1 kernel: Call Trace: .. ..
The copies seem to work fine, but the rm runs cause the MDT0000 server (fmds1) to fail pretty hard. I have not even created any directories on the second MDT (MDT0001, fmds2) yet.
If I do exactly the same thing but don't add the second MDT (MDT0001) to the cluster, all the tests complete successfully.
The /dev/md0 arrays on fmds1 and fmds2 are each made up of two NVMe devices in RAID1.
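For reference, this is how I'd double-check the array state on the MDS nodes (standard md tools, nothing Lustre-specific; output not included here):

# on fmds1 and fmds2: verify md0 is an active raid1 over the two NVMe members
cat /proc/mdstat
mdadm --detail /dev/md0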