[LU-9058] DNE testing - unlinks causing MDS to hang Created: 27/Jan/17  Updated: 05/Aug/20  Resolved: 31/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Daire Byrne (Inactive) Assignee: Peter Jones
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

3.10.0-327.3.1.el7_lustre.x86_64
lustre-2.8.0-3.10.0_327.3.1.el7_lustre.x86_64.x86_64


Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

I've been trying to test DNE across two MDS servers but I'm failing at the first hurdle. After adding a second MDT to a test filesystem and running some copy & unlink tests on the client, the root MDS falls over pretty badly, reporting soft lockups until it dies completely.

My inexperienced eyes could not make much sense of the dmesg errors (jbd2 journal?), and a quick search of JIRA did not turn up any matching tickets. The log messages before the last server hang looked like this:

Jan 27 13:03:30 fmds1 kernel: BUG: soft lockup - CPU#48 stuck for 22s! [jbd2/md0-8:11087]
Jan 27 13:03:30 fmds1 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ldiskfs(OE) mbcache jbd2 vfat fat mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu xt_multiport iptable_filter nfsv3 nfs fscache bonding dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm sg raid1 ipmi_devintf sb_edac ipmi_si edac_core ipmi_msghandler mei_me lpc_ich shpchp mei mfd_core acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic mxm_wmi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 syscopyarea sysfillrect ghash_clmulni_intel
Jan 27 13:03:30 fmds1 kernel: sysimgblt i2c_algo_bit drm_kms_helper aesni_intel lrw gf128mul ttm glue_helper ablk_helper cryptd bnx2x drm ahci libahci ptp i2c_core libata pps_core mdio megaraid_sas nvme libcrc32c wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi dm_multipath dm_mod [last unloaded: usb_storage]
Jan 27 13:03:30 fmds1 kernel: CPU: 48 PID: 11087 Comm: jbd2/md0-8 Tainted: G           OE  ------------   3.10.0-327.3.1.el7_lustre.x86_64 #1
Jan 27 13:03:30 fmds1 kernel: Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
Jan 27 13:03:30 fmds1 kernel: task: ffff883fce5e7300 ti: ffff883fcaebc000 task.ti: ffff883fcaebc000
Jan 27 13:03:30 fmds1 kernel: RIP: 0010:[<ffffffffa0dccfdc>]  [<ffffffffa0dccfdc>] ptlrpc_commit_replies+0xbc/0x380 [ptlrpc]
Jan 27 13:03:30 fmds1 kernel: RSP: 0018:ffff883fcaebfb90  EFLAGS: 00000212
Jan 27 13:03:30 fmds1 kernel: RAX: ffff881e068c3030 RBX: 0000000000000246 RCX: ffff881ef1baa010
Jan 27 13:03:30 fmds1 kernel: RDX: ffff881e068dc000 RSI: ffff883fcaebfba0 RDI: ffff881ef1baa040
Jan 27 13:03:30 fmds1 kernel: RBP: ffff883fcaebfc00 R08: ffff883fcaebfba0 R09: 0000000180400034
Jan 27 13:03:30 fmds1 kernel: R10: ffffea007c591900 R11: ffffffffa09cf975 R12: ffff883fcaebfbb0
Jan 27 13:03:30 fmds1 kernel: R13: ffff881ffec07600 R14: ffff881e20887200 R15: ffffea0078822000
Jan 27 13:03:30 fmds1 kernel: FS:  0000000000000000(0000) GS:ffff881fff500000(0000) knlGS:0000000000000000
Jan 27 13:03:30 fmds1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 27 13:03:30 fmds1 kernel: CR2: 00007fc20e5e5a08 CR3: 000000000194a000 CR4: 00000000003407e0
Jan 27 13:03:30 fmds1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 27 13:03:30 fmds1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 27 13:03:30 fmds1 kernel: Stack:
Jan 27 13:03:30 fmds1 kernel: ffff881fd10e5938 ffff881ef1baa010 ffff881ef1baa010 ffff881ef1baa010
Jan 27 13:03:30 fmds1 kernel: 0000000000000001 ffff881f57287800 0000000000000028 000000009ca909b7
Jan 27 13:03:30 fmds1 kernel: ffff881f16466400 ffff881e9b5da180 ffff881e490ae800 0000000000000000
Jan 27 13:03:30 fmds1 kernel: Call Trace:
Jan 27 13:03:30 fmds1 kernel: [<ffffffffa0e1371c>] tgt_cb_last_committed+0x16c/0x3d0 [ptlrpc]
Jan 27 13:03:30 fmds1 kernel: [<ffffffffa09cf8cb>] osd_trans_commit_cb+0xbb/0x380 [osd_ldiskfs]
Jan 27 13:03:30 fmds1 kernel: [<ffffffffa07998e4>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
Jan 27 13:03:30 fmds1 kernel: [<ffffffffa06ae364>] jbd2_journal_commit_transaction+0x1494/0x19a0 [jbd2]
Jan 27 13:03:30 fmds1 kernel: [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
Jan 27 13:03:30 fmds1 kernel: [<ffffffffa06b2d79>] kjournald2+0xc9/0x260 [jbd2]
Jan 27 13:03:30 fmds1 kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
Jan 27 13:03:30 fmds1 kernel: [<ffffffffa06b2cb0>] ? commit_timeout+0x10/0x10 [jbd2]
Jan 27 13:03:30 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan 27 13:03:30 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 27 13:03:30 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90
Jan 27 13:03:30 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 27 13:03:30 fmds1 kernel: Code: 7d d0 48 8d 58 d0 0f 84 43 01 00 00 41 f6 45 14 01 0f 84 67 02 00 00 49 83 7d 38 00 75 39 e9 8d 02 00 00 0f 1f 40 00 48 8b 43 30 <4c> 8d 6b 30 4d 39 f5 48 8d 50 d0 0f 84 13 01 00 00 f6 43 44 01 
Jan 27 13:04:02 fmds1 kernel: BUG: soft lockup - CPU#16 stuck for 22s! [jbd2/md0-8:11087]
Jan 27 13:04:02 fmds1 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ldiskfs(OE) mbcache jbd2 vfat fat mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu xt_multiport iptable_filter nfsv3 nfs fscache bonding dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel iTCO_wdt iTCO_vendor_support dcdbas kvm sg raid1 ipmi_devintf sb_edac ipmi_si edac_core ipmi_msghandler mei_me lpc_ich shpchp mei mfd_core acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs sd_mod crc_t10dif crct10dif_generic mxm_wmi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel mgag200 syscopyarea sysfillrect ghash_clmulni_intel
Jan 27 13:04:02 fmds1 kernel: sysimgblt i2c_algo_bit drm_kms_helper aesni_intel lrw gf128mul ttm glue_helper ablk_helper cryptd bnx2x drm ahci libahci ptp i2c_core libata pps_core mdio megaraid_sas nvme libcrc32c wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi dm_multipath dm_mod [last unloaded: usb_storage]
Jan 27 13:04:02 fmds1 kernel: CPU: 16 PID: 11087 Comm: jbd2/md0-8 Tainted: G           OEL ------------   3.10.0-327.3.1.el7_lustre.x86_64 #1
Jan 27 13:04:02 fmds1 kernel: Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
Jan 27 13:04:02 fmds1 kernel: task: ffff883fce5e7300 ti: ffff883fcaebc000 task.ti: ffff883fcaebc000
Jan 27 13:04:02 fmds1 kernel: RIP: 0010:[<ffffffffa0dccfdc>]  [<ffffffffa0dccfdc>] ptlrpc_commit_replies+0xbc/0x380 [ptlrpc]
Jan 27 13:04:02 fmds1 kernel: RSP: 0018:ffff883fcaebfb90  EFLAGS: 00000206
Jan 27 13:04:02 fmds1 kernel: RAX: ffff881e126c8030 RBX: 0000000000000246 RCX: 0000000000004d7c
Jan 27 13:04:02 fmds1 kernel: RDX: ffff881e12693000 RSI: 0000000000004d7c RDI: ffff881fd10e5938
Jan 27 13:04:02 fmds1 kernel: RBP: ffff883fcaebfc00 R08: ffff881f16a4b900 R09: 0000000180400038
Jan 27 13:04:02 fmds1 kernel: R10: ffffea007c595900 R11: ffffffffa09cf975 R12: ffff883fcaebfbb0
Jan 27 13:04:02 fmds1 kernel: R13: ffff881ffec07600 R14: ffff881e1f891a00 R15: ffffea00787e2400
Jan 27 13:04:02 fmds1 kernel: FS:  0000000000000000(0000) GS:ffff881fff300000(0000) knlGS:0000000000000000
Jan 27 13:04:02 fmds1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 27 13:04:02 fmds1 kernel: CR2: 00007f693d28b000 CR3: 000000000194a000 CR4: 00000000003407e0
Jan 27 13:04:02 fmds1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 27 13:04:02 fmds1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 27 13:04:02 fmds1 kernel: Stack:
Jan 27 13:04:02 fmds1 kernel: ffff881fd10e5938 ffffffffa0bc48e1 ffff883fcaebfba0 ffff883fcaebfba0
Jan 27 13:04:02 fmds1 kernel: 0000000000000000 0000000000000000 0000000000000028 000000009ca909b7
Jan 27 13:04:02 fmds1 kernel: ffff881f16567100 ffff881f16a4b900 ffff881f2a7d0e00 0000000000000000
Jan 27 13:04:02 fmds1 kernel: Call Trace:
Jan 27 13:04:02 fmds1 kernel: [<ffffffffa0bc48e1>] ? keys_fini+0xb1/0x1d0 [obdclass]
Jan 27 13:04:02 fmds1 kernel: [<ffffffffa0e1371c>] tgt_cb_last_committed+0x16c/0x3d0 [ptlrpc]
Jan 27 13:04:02 fmds1 kernel: [<ffffffffa09cf8cb>] osd_trans_commit_cb+0xbb/0x380 [osd_ldiskfs]
Jan 27 13:04:02 fmds1 kernel: [<ffffffffa07998e4>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
Jan 27 13:04:02 fmds1 kernel: [<ffffffffa06ae364>] jbd2_journal_commit_transaction+0x1494/0x19a0 [jbd2]
Jan 27 13:04:02 fmds1 kernel: [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
Jan 27 13:04:02 fmds1 kernel: [<ffffffffa06b2d79>] kjournald2+0xc9/0x260 [jbd2]
Jan 27 13:04:02 fmds1 kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
Jan 27 13:04:02 fmds1 kernel: [<ffffffffa06b2cb0>] ? commit_timeout+0x10/0x10 [jbd2]
Jan 27 13:04:02 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan 27 13:04:02 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 27 13:04:02 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90
Jan 27 13:04:02 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 27 13:04:02 fmds1 kernel: Code: 7d d0 48 8d 58 d0 0f 84 43 01 00 00 41 f6 45 14 01 0f 84 67 02 00 00 49 83 7d 38 00 75 39 e9 8d 02 00 00 0f 1f 40 00 48 8b 43 30 <4c> 8d 6b 30 4d 39 f5 48 8d 50 d0 0f 84 13 01 00 00 f6 43 44 01 
Jan 27 13:24:12 fmds1 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="3956" x-info="http://www.rsyslog.com"] start
Jan 27 13:24:12 fmds1 rsyslogd-2184: action '*' treated as ':omusrmsg:*' - please change syntax, '*' will not be supported in the future [try http://www.rsyslog.com/e/2184 ]
Jan 27 13:23:33 fmds1 journal: Runtime journal is using 8.0M (max allowed 4.0G, trying to leave 4.0G free of 125.8G available → current limit 4.0G).
Jan 27 13:23:33 fmds1 kernel: Initializing cgroup subsys cpuset
Jan 27 13:23:33 fmds1 kernel: Initializing cgroup subsys cpu
Jan 27 13:23:33 fmds1 kernel: Initializing cgroup subsys cpuacct
Jan 27 13:23:33 fmds1 kernel: Linux version 3.10.0-327.3.1.el7_lustre.x86_64 (jenkins@onyx-1-sdg1-el7-x8664.onyx.hpdd.intel.com) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Feb 18 10:53:23 PST 2016
Jan 27 13:23:33 fmds1 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-327.3.1.el7_lustre.x86_64 root=UUID=c5ba4ad5-a0e7-4e5a-b2d4-f2defd3aa0fa ro nouveau.modeset=0 crashkernel=auto selinux=0 biosdevname=0 net.ifnames=0 rhgb quiet LANG=en_GB.UTF-8
Jan 27 13:23:33 fmds1 kernel: e820: BIOS-provided physical RAM map:

My test setup goes a little something like this:

# fmds1
mkfs.lustre --reformat --fsname=lustre --mdt --mgs --index 0 /dev/md0
mount -t lustre /dev/md0 /mnt/mdt

# fmds2
mkfs.lustre --reformat --fsname=lustre --mdt --mgsnode=fmds1 --index 1 /dev/md0
mount -t lustre /dev/md0 /mnt/mdt

# foss1
touch /dev/shm/lustre-ost1; mkfs.lustre --reformat --fsname=lustre --mgsnode=fmds1 --ost --index=0 --device-size=10000000 /dev/shm/lustre-ost1
touch /dev/shm/lustre-ost2; mkfs.lustre --reformat --fsname=lustre --mgsnode=fmds1 --ost --index=1 --device-size=10000000 /dev/shm/lustre-ost2
mount -t lustre -o loop /dev/shm/lustre-ost1 /mnt/ost1
mount -t lustre -o loop /dev/shm/lustre-ost2 /mnt/ost2

# client
mount -t lustre fmds1:/lustre /mnt/lustre

for x in `seq 1 10`; do cp -a /mnt/lustre /mnt/lustre.$x & done; wait
for x in `seq 1 10`; do rm -rf /mnt/lustre.$x & done; wait
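
(For completeness, a quick sanity check that both MDTs are visible from the client before running the loops would be something like the following; this was not part of my original test.)

# client: confirm both MDTs are mounted and visible (assumed check, not from the original run)
lfs df -i /mnt/lustre      # should list lustre-MDT0000_UUID and lustre-MDT0001_UUID
lctl dl | grep mdc         # expect one MDC device per MDT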

But yesterday, I was running the loopback OSTs on the same server as MDT0000 (fmds1) and got some more descriptive logs for some reason:

Jan 26 16:11:25 fmds1 kernel: LNet: Service thread pid 1085 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 26 16:11:25 fmds1 kernel: Pid: 1085, comm: mdt06_003
Jan 26 16:11:25 fmds1 kernel: #012Call Trace:
Jan 26 16:11:25 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa1037aca>] ? ldiskfs_xattr_trusted_get+0x2a/0x30 [ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa117bf15>] ? mdd_declare_links_add+0xb5/0x340 [mdd]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11858fa>] mdd_unlink+0x63a/0xd80 [mdd]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10af600>] ? osd_xattr_get+0x50/0x260 [osd_ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12193fb>] mdo_unlink+0x43/0x45 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11eaf32>] mdt_reint_unlink+0xa62/0x10c0 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11ee450>] mdt_reint_rec+0x80/0x210 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11d1971>] mdt_reint_internal+0x5e1/0x990 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11db0d7>] mdt_reint+0x67/0x140 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0be20d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0be1640>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Jan 26 16:11:25 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90
Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Jan 26 16:11:25 fmds1 kernel: 
Jan 26 16:11:25 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447085.1085
Jan 26 16:11:25 fmds1 kernel: Pid: 5678, comm: mdt04_015
Jan 26 16:11:25 fmds1 kernel: #012Call Trace:
Jan 26 16:11:25 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa1037aca>] ? ldiskfs_xattr_trusted_get+0x2a/0x30 [ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa117bf15>] ? mdd_declare_links_add+0xb5/0x340 [mdd]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11858fa>] mdd_unlink+0x63a/0xd80 [mdd]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa10af600>] ? osd_xattr_get+0x50/0x260 [osd_ldiskfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa12193fb>] mdo_unlink+0x43/0x45 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11eaf32>] mdt_reint_unlink+0xa62/0x10c0 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11ee450>] mdt_reint_rec+0x80/0x210 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11d1971>] mdt_reint_internal+0x5e1/0x990 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa11db0d7>] mdt_reint+0x67/0x140 [mdt]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0be20d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffffa0be1640>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Jan 26 16:11:25 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90
Jan 26 16:11:25 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Jan 26 16:11:25 fmds1 kernel: 
Jan 26 16:12:08 fmds1 kernel: LNet: Skipped 1 previous similar message
Jan 26 16:12:08 fmds1 kernel: Pid: 15733, comm: mdt_rdpg02_012
Jan 26 16:12:08 fmds1 kernel: #012Call Trace:
Jan 26 16:12:08 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Jan 26 16:12:08 fmds1 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod]
Jan 26 16:12:08 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa12861ab>] ? lod_load_striping+0x7b/0x180 [lod]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa118b961>] mdd_attr_set+0x541/0xb60 [mdd]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0bd2eb7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa11f622e>] mdt_mfd_close+0x19e/0x620 [mdt]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa11fb963>] mdt_close_internal+0x103/0x470 [mdt]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa11fbef0>] mdt_close+0x220/0x780 [mdt]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jan 26 16:12:08 fmds1 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
Jan 26 16:12:08 fmds1 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0be20d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
Jan 26 16:12:08 fmds1 kernel: [<ffffffffa0be1640>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
Jan 26 16:12:08 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan 26 16:12:08 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Jan 26 16:12:08 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90
Jan 26 16:12:08 fmds1 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Jan 26 16:12:08 fmds1 kernel: 
Jan 26 16:12:08 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447128.15733
Jan 26 16:12:18 fmds1 kernel: LNet: Service thread pid 196057 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 26 16:12:18 fmds1 kernel: Pid: 196057, comm: mdt_rdpg03_001
Jan 26 16:12:18 fmds1 kernel: #012Call Trace:
Jan 26 16:12:18 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Jan 26 16:12:18 fmds1 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod]
Jan 26 16:12:18 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa12861ab>] ? lod_load_striping+0x7b/0x180 [lod]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa118b961>] mdd_attr_set+0x541/0xb60 [mdd]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0bd2eb7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa11f622e>] mdt_mfd_close+0x19e/0x620 [mdt]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa11fb963>] mdt_close_internal+0x103/0x470 [mdt]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa11fbef0>] mdt_close+0x220/0x780 [mdt]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Jan 26 16:12:18 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jan 26 16:12:18 fmds1 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
Jan 26 16:12:18 fmds1 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
..
..
Jan 26 16:12:28 fmds1 kernel: 
Jan 26 16:12:28 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447148.16988
Jan 26 16:12:38 fmds1 kernel: LNet: Service thread pid 15732 was inactive for 40.09s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Jan 26 16:12:38 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447158.15732
Jan 26 16:12:48 fmds1 kernel: LNet: Service thread pid 157913 was inactive for 40.08s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Jan 26 16:12:48 fmds1 kernel: LNet: Skipped 1 previous similar message
Jan 26 16:12:48 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447168.157913
Jan 26 16:12:58 fmds1 kernel: LNet: Service thread pid 88689 was inactive for 40.06s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Jan 26 16:12:58 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447178.88689
Jan 26 16:13:08 fmds1 kernel: LNet: Service thread pid 6200 was inactive for 40.04s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Jan 26 16:13:08 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447188.6200
Jan 26 16:13:10 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447190.196065
Jan 26 16:13:40 fmds1 kernel: LNet: Service thread pid 13050 was inactive for 62.17s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Jan 26 16:13:40 fmds1 kernel: LNet: Skipped 2 previous similar messages
Jan 26 16:13:40 fmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1485447220.13050
Jan 26 16:13:40 fmds1 kernel: INFO: task mdt_rdpg03_001:196057 blocked for more than 120 seconds.
Jan 26 16:13:40 fmds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 16:13:40 fmds1 kernel: mdt_rdpg03_001  D ffff881f5bd4f6f0     0 196057      2 0x00000000
Jan 26 16:13:40 fmds1 kernel: ffff883fa725b950 0000000000000046 ffff883f7ca7f300 ffff883fa725bfd8
Jan 26 16:13:40 fmds1 kernel: ffff883fa725bfd8 ffff883fa725bfd8 ffff883f7ca7f300 ffff883afbb4d000
Jan 26 16:13:40 fmds1 kernel: ffff883afbb4d078 0000000000000894 ffff883f9a037800 ffff881f5bd4f6f0
Jan 26 16:13:40 fmds1 kernel: Call Trace:
Jan 26 16:13:40 fmds1 kernel: [<ffffffff8163aab9>] schedule+0x29/0x70
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0785085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Jan 26 16:13:40 fmds1 kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0785400>] start_this_handle+0x2b0/0x5d0 [jbd2]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa12a558e>] ? lod_sub_object_declare_write+0xde/0x2a0 [lod]
Jan 26 16:13:40 fmds1 kernel: [<ffffffff811c0f9a>] ? kmem_cache_alloc+0x1ba/0x1d0
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0785933>] jbd2__journal_start+0xf3/0x1e0 [jbd2]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa10b4374>] ? osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0fe84b9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa10b4374>] osd_trans_start+0x174/0x410 [osd_ldiskfs]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0c4507f>] top_trans_start+0x70f/0x950 [ptlrpc]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa12861ab>] ? lod_load_striping+0x7b/0x180 [lod]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa12773e1>] lod_trans_start+0x31/0x40 [lod]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa11958a4>] mdd_trans_start+0x14/0x20 [mdd]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa118b961>] mdd_attr_set+0x541/0xb60 [mdd]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0bd2eb7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa11f622e>] mdt_mfd_close+0x19e/0x620 [mdt]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa11fb963>] mdt_close_internal+0x103/0x470 [mdt]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa11fbef0>] mdt_close+0x220/0x780 [mdt]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0c31385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0bde02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa081c608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0bdc8d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jan 26 16:13:40 fmds1 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
Jan 26 16:13:40 fmds1 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0be20d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
Jan 26 16:13:40 fmds1 kernel: [<ffffffffa0be1640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
Jan 26 16:13:40 fmds1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan 26 16:13:40 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 26 16:13:40 fmds1 kernel: [<ffffffff81645a58>] ret_from_fork+0x58/0x90
Jan 26 16:13:40 fmds1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 26 16:13:40 fmds1 kernel: INFO: task mdt06_003:1085 blocked for more than 120 seconds.
Jan 26 16:13:40 fmds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 16:13:40 fmds1 kernel: mdt06_003       D ffff883fb9c0bf00     0  1085      2 0x00000000
Jan 26 16:13:40 fmds1 kernel: ffff883953fd3870 0000000000000046 ffff883f6cfb3980 ffff883953fd3fd8
Jan 26 16:13:40 fmds1 kernel: ffff883953fd3fd8 ffff883953fd3fd8 ffff883f6cfb3980 ffff883afbb4d000
Jan 26 16:13:40 fmds1 kernel: ffff883afbb4d078 0000000000000894 ffff883f9a037800 ffff883fb9c0bf00
Jan 26 16:13:40 fmds1 kernel: Call Trace:
..
..

The copies seem to work fine, but the rms cause the MDT0000 server (fmds1) to fail pretty hard. I have not even created any directories on the second MDT0001 (fmds2) yet (normally that would be done explicitly, as sketched below).
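
For reference, a directory would normally be placed on the second MDT explicitly before it sees any metadata traffic; something like the following (hypothetical here, I have not run this yet):

# hypothetical: create a directory on MDT0001 from the client and confirm where it lives
lfs mkdir -i 1 /mnt/lustre/dir-on-mdt1
lfs getdirstripe /mnt/lustre/dir-on-mdt1   # should report the directory as residing on MDT index 1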

If I do the exact same thing but don't add the second MDT0001 to the cluster, all the tests complete successfully.

The /dev/md0 arrays are each made up of two NVMe devices in RAID1 on fmds1 and fmds2.
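
An equivalent RAID1 setup would look roughly like this (device names are illustrative, not the actual ones used here):

# illustrative only: two-device NVMe RAID1 mirror backing /dev/md0
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1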



 Comments   
Comment by Joseph Gmitter (Inactive) [ 27/Jan/17 ]

Hi Daire,

You stated that you were testing 2.8.0. Is there a reason you have chosen to test 2.8 versus using the 2.9.0 release? There were multiple DNE stabilization fixes that went into 2.9. Would it be possible for you to test with 2.9.0?

Thanks.
Joe

Comment by Daire Byrne (Inactive) [ 27/Jan/17 ]

No particular reason other than our production filesystems are already running v2.8 and I thought DNE would be considered "mature" in that release. I can't seem to get past a very simple test, so I'm not discounting that I'm doing something wrong!

I will test v2.9 on Monday if you think that will help. So which is the current stable "production" version of Lustre that is being recommended?

Comment by Joseph Gmitter (Inactive) [ 27/Jan/17 ]

For the community tree, version 2.9.0 is the latest release and contains DNE fixes. We would recommend using 2.9 over 2.8 for DNE evaluations.

Comment by Daire Byrne (Inactive) [ 31/Jan/17 ]

Hmm. Either I didn't add the comment yesterday or I commented on the wrong ticket...

It looks like v2.9 is indeed working better and I am not getting the MDS crashes anymore. I'm still seeing the odd sporadic input/output error on clients when copying a directory tree, but that needs further testing to verify. You can probably close this ticket now. I'll re-open if I get stuck again. Cheers.

Comment by Peter Jones [ 31/Jan/17 ]

ok - thanks Daire
