[LU-5242] Test hang sanity test_132, test_133: umount ost Created: 23/Jun/14 Updated: 23/Nov/17 Resolved: 18/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prz |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14622 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/e5783778-f887-11e3-b13a-52540035b04c. The sub-test test_132 failed with the following error:
Info required for matching: sanity 132 |
| Comments |
| Comment by Oleg Drokin [ 23/Jun/14 ] |
|
MDS2 traces have a very characteristic signature of "waiting for obd_unlinked_exports more than XX" - there are a bunch of bugs about this; they hit from time to time on unmount. |
| Comment by Andreas Dilger [ 26/Jun/14 ] |
|
It looks like there is some kind of problem in cleaning up the OSP device, and this has been causing a significant number of failures recently (about 10% for both review-zfs and review-dne-part-1).
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff88005b849740[0x0, 1, [0x1:0x0:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88005b849790
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff88005b849800osd-ldiskfs-object@ffff88005b849800(i:ffff88007b5fe5a8:144/3610117262)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff88005b849740
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff880046ba3600[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff880046ba3650
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff88007954f680osd-ldiskfs-object@ffff88007954f680(i:ffff88007b60c628:110/3610117227)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff880046ba3600
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff88006abf6800[0x0, 1, [0x200000003:0x2:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88006abf6850
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff88006abf6c80osd-ldiskfs-object@ffff88006abf6c80(i:ffff880057c7c1d0:32771/3610117228)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff88006abf6800
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff8800572f4680[0x0, 1, [0xa:0x0:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff8800572f46d0
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff8800684f6500osd-ldiskfs-object@ffff8800684f6500(i:ffff880060b77110:178/3610117296)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff8800572f4680
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff88006abf62c0[0x1, 1, [0x200000001:0x1017:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88006abf6310
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff88006abf6d40osd-ldiskfs-object@ffff88006abf6d40(i:ffff880057c706e8:524289/986286545)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff88006abf62c0
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff880067b38e00[0x0, 1, [0x4c0000402:0x2f:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdt@ffff880067b38e50mdt-object@ffff880067b38e00(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdd@ffff880057491cd0mdd-object@ffff880057491cd0(open_count=0, valid=0, cltime=0, flags=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....lod@ffff8800717d0bf0lod-object@ffff8800717d0bf0
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osp@ffff8800717d3d00osp-object@ffff8800717d3cb0
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff880067b38e00
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff88007a048928[0x0, 1, [0x440000bd1:0x2:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdt@ffff88007a048978mdt-object@ffff88007a048928(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdd@ffff880061a52e10mdd-object@ffff880061a52e10(open_count=0, valid=0, cltime=0, flags=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....lod@ffff880061a40830lod-object@ffff880061a40830
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osp@ffff88005e25e1a0osp-object@ffff88005e25e150
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff88007a048928
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff880052e8ad08[0x0, 2, [0x440000400:0xd5:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdt@ffff880052e8ad58mdt-object@ffff880052e8ad08(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdd@ffff8800717fc3c0mdd-object@ffff8800717fc3c0(open_count=0, valid=0, cltime=0, flags=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....lod@ffff880052e88c68lod-object@ffff880052e88c68
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osp@ffff88005ba3f5b0osp-object@ffff88005ba3f560
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff880052e8ad08
It may be that this is a new type of failure caused by a recently landed patch, or it could be the same failure that Oleg describes that has been made worse by some other patch. |
| Comment by Peter Jones [ 27/Jun/14 ] |
|
Emoly Could you please look into this one? Thanks Peter |
| Comment by Andreas Dilger [ 30/Jun/14 ] |
|
Emoly, |
| Comment by Emoly Liu [ 01/Jul/14 ] |
|
I can hit this "hang" issue on zfs in my local test, but not every time. I will investigate it. |
| Comment by Emoly Liu [ 01/Jul/14 ] |
|
Yesterday I happened to hit this problem, and its call trace is the same as the one shown in
Jun 30 16:16:28 centos6-x kernel: umount D 0000000000000000 0 31062 31061 0x00000080
Jun 30 16:16:28 centos6-x kernel: ffff8800176b7aa8 0000000000000086 ffff8800176b7a08 ffff88001f4c3c00
Jun 30 16:16:28 centos6-x kernel: ffffffffa12fbd1d 0000000000000000 ffff88000e00c184 ffffffffa12fbd1d
Jun 30 16:16:28 centos6-x kernel: ffff88001f405098 ffff8800176b7fd8 000000000000fb88 ffff88001f405098
Jun 30 16:16:28 centos6-x kernel: Call Trace:
Jun 30 16:16:28 centos6-x kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
Jun 30 16:16:28 centos6-x kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa1281acb>] obd_exports_barrier+0xab/0x180 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa0ba776f>] ofd_device_fini+0x5f/0x260 [ofd]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12a8633>] class_cleanup+0x573/0xd30 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12838a6>] ? class_name2dev+0x56/0xe0 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12aa35a>] class_process_config+0x156a/0x1ad0 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12a26fb>] ? lustre_cfg_new+0x2cb/0x680 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12aaa39>] class_manual_cleanup+0x179/0x6f0 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12838a6>] ? class_name2dev+0x56/0xe0 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12e8269>] server_put_super+0x8f9/0xe50 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffff8118363b>] generic_shutdown_super+0x5b/0xe0
Jun 30 16:16:28 centos6-x kernel: [<ffffffff81183726>] kill_anon_super+0x16/0x60
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12ac916>] lustre_kill_super+0x36/0x60 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffff81183ec7>] deactivate_super+0x57/0x80
Jun 30 16:16:28 centos6-x kernel: [<ffffffff811a21bf>] mntput_no_expire+0xbf/0x110
Jun 30 16:16:28 centos6-x kernel: [<ffffffff811a2c2b>] sys_umount+0x7b/0x3a0
Jun 30 16:16:28 centos6-x kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
I am checking the recent maloo test reports to see when the failure started. |
| Comment by Emoly Liu [ 01/Jul/14 ] |
|
I checked and vetted the last two months of sanity.sh test_132 failure logs, and they show three kinds of timeout failure.
1)
2)
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) } header@ffff8800607cee00
12:34:53:LustreError: 7776:0:(lu_object.c:1198:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
12:34:53:LustreError: 11252:0:(mdt_handler.c:4376:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed:
12:34:53:LustreError: 11252:0:(mdt_handler.c:4376:mdt_fini()) LBUG
12:34:53:Pid: 11252, comm: umount
12:34:53:
12:34:53:Call Trace:
12:34:53: [<ffffffffa048e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
12:34:53: [<ffffffffa048ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
12:34:53: [<ffffffffa0ebccaf>] mdt_device_fini+0xc8f/0xcd0 [mdt]
12:34:53: [<ffffffffa05c3f5d>] ? class_disconnect_exports+0x17d/0x2f0 [obdclass]
12:34:53: [<ffffffffa05e6443>] class_cleanup+0x573/0xd30 [obdclass]
12:34:53: [<ffffffffa05c1836>] ? class_name2dev+0x56/0xe0 [obdclass]
12:34:53: [<ffffffffa05e816a>] class_process_config+0x156a/0x1ad0 [obdclass]
12:34:53: [<ffffffffa0499478>] ? libcfs_log_return+0x28/0x40 [libcfs]
12:34:53: [<ffffffffa05e0549>] ? lustre_cfg_new+0x309/0x680 [obdclass]
12:34:53: [<ffffffffa05e8849>] class_manual_cleanup+0x179/0x6f0 [obdclass]
12:34:53: [<ffffffffa05c1836>] ? class_name2dev+0x56/0xe0 [obdclass]
12:34:53: [<ffffffffa0625f09>] server_put_super+0x8f9/0xe50 [obdclass]
12:34:53: [<ffffffff8118af0b>] generic_shutdown_super+0x5b/0xe0
12:34:53: [<ffffffff8118aff6>] kill_anon_super+0x16/0x60
12:34:53: [<ffffffffa05ea726>] lustre_kill_super+0x36/0x60 [obdclass]
12:34:53: [<ffffffff8118b797>] deactivate_super+0x57/0x80
12:34:53: [<ffffffff811aa79f>] mntput_no_expire+0xbf/0x110
12:34:53: [<ffffffff811ab2eb>] sys_umount+0x7b/0x3a0
12:34:53: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
This problem seems to have gone away since June 15.
3)
03:45:20:BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
03:45:20:IP: [<ffffffffa108b30b>] mdt_build_target_list+0xfb/0x6a0 [mdt]
03:45:20:PGD 0
03:45:20:Oops: 0000 [#1] SMP
03:45:20:last sysfs file: /sys/devices/system/cpu/online
03:45:20:CPU 1
03:45:20:Modules linked in: osp(U) lod(U) mdt(U) mdd(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) scollective(U) gossip(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: llog_test]
03:45:20:
03:45:20:Pid: 16913, comm: mdt00_001 Tainted: P --------------- 2.6.32-358.6.2.el6_lustre.g1624f5c.x86_64 #1 Red Hat KVM
03:45:20:RIP: 0010:[<ffffffffa108b30b>] [<ffffffffa108b30b>] mdt_build_target_list+0xfb/0x6a0 [mdt]
03:45:20:RSP: 0018:ffff88006e32bba0 EFLAGS: 00010246
03:45:20:RAX: ffff88006c779ba0 RBX: ffff88002ea7a5c0 RCX: fffffffffffffffe
03:45:20:RDX: 0000000000000000 RSI: ffff88006c779b18 RDI: ffff88002ea7a5c0
03:45:20:RBP: ffff88006e32bc00 R08: ffff880048a91720 R09: 0000000000000000
03:45:20:R10: ffff88002d578800 R11: 0000000000000418 R12: 0000000000000000
03:45:20:R13: d4f7928dd528ca79 R14: ffff88006c779b18 R15: ffff88002dd6b000
03:45:20:FS: 00007f7d60754700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
03:45:20:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
03:45:20:CR2: 0000000000000030 CR3: 0000000001a85000 CR4: 00000000000006e0
03:45:20:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
03:45:20:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
03:45:20:Process mdt00_001 (pid: 16913, threadinfo ffff88006e32a000, task ffff88007005f500)
03:45:20:Stack:
03:45:20: ffff88006e32bbd0 0000000000000000 ffff88006e32bc18 ffffffffa1089e50
03:45:20:<d> ffffffffffffffff ffff88002dd6b000 ffff88006e32bc00 ffff88004b9d17d8
03:45:20:<d> ffff88002dd6b378 d4f7928dd528ca79 ffff88002dd6b000 0000000000000000
03:45:20:Call Trace:
03:45:20: [<ffffffffa1089e50>] ? mdt_corpc_add_shard+0x0/0x5d0 [mdt]
03:45:20: [<ffffffffa108ef76>] mdt_corpc_release_epochs+0x136/0x6c0 [mdt]
03:45:20: [<ffffffffa108f5fb>] mdt_container_close+0xfb/0x200 [mdt]
03:45:20: [<ffffffffa106e0de>] mdt_mfd_close+0x2de/0x700 [mdt]
03:45:20: [<ffffffffa106e9e2>] mdt_done_writing+0x4e2/0xd50 [mdt]
03:45:20: [<ffffffffa103df0d>] ? mdt_unpack_req_pack_rep+0x4d/0x4d0 [mdt]
03:45:20: [<ffffffffa0983d9c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
03:45:20: [<ffffffffa1047b98>] mdt_handle_common+0x648/0x1690 [mdt]
03:45:20: [<ffffffffa10818f5>] mds_regular_handle+0x15/0x20 [mdt]
03:45:20: [<ffffffffa0993558>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
03:45:20: [<ffffffffa05e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
03:45:20: [<ffffffffa05f8c5f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
03:45:20: [<ffffffffa098a8b9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
03:45:20: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
03:45:20: [<ffffffffa09948ee>] ptlrpc_main+0xace/0x1710 [ptlrpc]
03:45:20: [<ffffffffa0993e20>] ? ptlrpc_main+0x0/0x1710 [ptlrpc]
03:45:20: [<ffffffff8100c0ca>] child_rip+0xa/0x20
03:45:20: [<ffffffffa0993e20>] ? ptlrpc_main+0x0/0x1710 [ptlrpc]
03:45:20: [<ffffffffa0993e20>] ? ptlrpc_main+0x0/0x1710 [ptlrpc]
03:45:20: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
This problem also seems to have gone away since June 17. The frequency of these issues is 1) > 3) > 2). BTW, unfortunately, many tests have no test logs, so I can't tell which situation those belong to. |
| Comment by Andreas Dilger [ 02/Jul/14 ] |
|
Emoly, thank you for your detailed investigation. When you write that problems #2 and #3 "have gone since June NN", do you mean they have "gone away" == no longer happening, or "have gone on" == started on that date? Could you check whether patches were landed on those days (or a day before) that might have caused these failures? Even better would be to find whether the very first failures happened on a patch before it was landed to master. You can check https://old-testing.hpdd.intel.com/ for the old failure logs. |
| Comment by Emoly Liu [ 02/Jul/14 ] |
|
Andreas, sorry for my unclear expression, I mean problem #2 and #3 have gone away since the middle of June. I will check those old failure logs and see if the failure is related to some patches landed on those days, and then give an update. |
| Comment by Emoly Liu [ 03/Jul/14 ] |
|
Since problem #3 |
| Comment by Emoly Liu [ 03/Jul/14 ] |
|
I checked the test logs one by one and only found two instances with problem #2 (that is, where the console logs clearly show "(osp_dev.c:858:osp_device_free()) } header@"). If I am right, problem #1 is the only remaining problem. |
| Comment by Di Wang [ 04/Jul/14 ] |
|
Hmm, problem #2 (LBUG during umount of the MDT) only happened twice and it seems it has not happened since June 15th, so it was probably already fixed. I checked all of the test_132 timeout results; most of them happened on ZFS during "umount -f ost", so maybe someone familiar with the OST stack should have a look. Thanks.
05:18:16:Lustre: 3931:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213483/real 1404213483] req@ffff88006d539c00 x1472420846307548/t0(0) o400->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 224/224 e 0 to 1 dl 1404213490 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:18:16:LustreError: 166-1: MGC10.1.6.21@tcp: Connection to MGS (at 10.1.6.21@tcp) was lost; in progress operations using this service will fail
05:18:16:Lustre: DEBUG MARKER: ! zpool list -H lustre-ost1 >/dev/null 2>&1 ||
05:25:51: grep -q ^lustre-ost1/ /proc/mounts ||
05:25:51: zpool export lustre-ost1
05:25:51:Lustre: DEBUG MARKER: grep -c /mnt/ost2' ' /proc/mounts
05:25:52:Lustre: DEBUG MARKER: umount -d -f /mnt/ost2
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213490/real 1404213490] req@ffff8800683eb400 x1472420846307568/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213496 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213500/real 1404213500] req@ffff8800683eb000 x1472420846307572/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213511 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213515/real 1404213515] req@ffff88002eb48c00 x1472420846307576/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213531 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213535/real 1404213535] req@ffff88002e895c00 x1472420846307580/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213556 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213560/real 1404213560] req@ffff88006d82a000 x1472420846307584/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213585 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213590/real 1404213590] req@ffff88006a6d6c00 x1472420846307588/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213615 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213650/real 1404213650] req@ffff88002e895c00 x1472420846307596/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213675 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 1 previous similar message
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213740/real 1404213740] req@ffff88002eb48c00 x1472420846307608/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213765 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213880/real 1404213880] req@ffff88002e895c00 x1472420846307628/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213905 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
05:25:52:INFO: task umount:8756 blocked for more than 120 seconds.
05:25:52: Tainted: P --------------- 2.6.32-431.17.1.el6_lustre.gdc86085.x86_64 #1
05:25:52:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
05:25:52:umount D 0000000000000000 0 8756 8755 0x00000080
05:25:52: ffff88006f099aa8 0000000000000082 ffff88006f099a08 ffff880071fcec00
05:31:55: ffffffffa079de77 0000000000000000 ffff88007142c184 ffffffffa079de77
05:31:55: ffff8800792a1ab8 ffff88006f099fd8 000000000000fbc8 ffff8800792a1ab8
05:31:55:Call Trace:
05:31:55: [<ffffffff81528e82>] schedule_timeout+0x192/0x2e0
05:31:55: [<ffffffff81083e90>] ? process_timeout+0x0/0x10
05:31:55: [<ffffffffa0720a6b>] obd_exports_barrier+0xab/0x180 [obdclass]
05:31:55: [<ffffffffa0f7476f>] ofd_device_fini+0x5f/0x260 [ofd]
05:31:55: [<ffffffffa07475d3>] class_cleanup+0x573/0xd30 [obdclass]
05:31:55: [<ffffffffa0722846>] ? class_name2dev+0x56/0xe0 [obdclass]
05:31:55: [<ffffffffa07492fa>] class_process_config+0x156a/0x1ad0 [obdclass]
05:31:55: [<ffffffffa05fa488>] ? libcfs_log_return+0x28/0x40 [libcfs]
05:31:55: [<ffffffffa07416d9>] ? lustre_cfg_new+0x309/0x680 [obdclass]
05:31:55: [<ffffffffa07499d9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
05:31:55: [<ffffffffa0722846>] ? class_name2dev+0x56/0xe0 [obdclass]
05:31:55: [<ffffffffa0787209>] server_put_super+0x8f9/0xe50 [obdclass]
05:31:55: [<ffffffff8118af0b>] generic_shutdown_super+0x5b/0xe0
05:31:55: [<ffffffff8118aff6>] kill_anon_super+0x16/0x60
05:31:55: [<ffffffffa074b8b6>] lustre_kill_super+0x36/0x60 [obdclass]
05:31:55: [<ffffffff8118b797>] deactivate_super+0x57/0x80
05:31:55: [<ffffffff811aa79f>] mntput_no_expire+0xbf/0x110
05:31:55: [<ffffffff811ab2eb>] sys_umount+0x7b/0x3a0
05:31:55: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
05:31:55:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck? |
| Comment by Andreas Dilger [ 04/Jul/14 ] |
|
Di, the reason this failure has not been hit since June 15 is that that is the last time your patch http://review.whamcloud.com/10673 for |
| Comment by Di Wang [ 05/Jul/14 ] |
|
Ah, ok. I will fix the patch 10673 in |
| Comment by Alex Zhuravlev [ 07/Jul/14 ] |
|
I checked one log - https://testing.hpdd.intel.com/test_logs/846317de-f434-11e3-9db1-5254006e85c2/show_text:
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) header@ffff88006a3c3980[0x0, 1, [0x1:0x0:0x0] hash exist] {
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88006a3c39d0
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff880061e9dc80osd-ldiskfs-object@ffff880061e9dc80(i:ffff8800578a97e8:144/3738986019)[plain]
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) }header@ffff88006a3c3980
This doesn't seem to be specific to the ZFS backend? |
| Comment by Di Wang [ 07/Jul/14 ] |
Oh, this is caused by my patch http://review.whamcloud.com/#/c/10673/ , which has not landed yet; I will update the patch anyway. But this ticket is about umount on ZFS: https://testing.hpdd.intel.com/sub_tests/query?utf8=✓&test_set%5Btest_set_script_id%5D=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test%5Bsub_test_script_id%5D=12fcb374-32be-11e0-b685-52540025f9ae&sub_test%5Bstatus%5D=TIMEOUT&sub_test%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results - please check. |
| Comment by Alex Zhuravlev [ 08/Jul/14 ] |
|
this is "known" issue with no resolution yet ll_ost00_004 D 0000000000000001 0 26767 2 0x00000080 ffff88006fe55ba0 0000000000000046 00000000ffffffff 000015e4f4d9c106 there are many duplicates, actually.. for exaple, we discussed this few times with Brian B., but the root cause is still unclear.. |
| Comment by Di Wang [ 10/Jul/14 ] |
|
[Andreas Dilger] Deleted this comment because this problem is actually |
| Comment by Di Wang [ 10/Jul/14 ] |
|
Fortunately, these two failures have only happened once since July, so maybe this should be remarked as Major? |
| Comment by Jian Yu [ 14/Jul/14 ] |
|
One more instance on master branch: |
| Comment by Nathaniel Clark [ 17/Jul/14 ] |
|
It does happen more frequently on ZFS, e.g. the failure rate for review-zfs is 22% over the last week. |
| Comment by Di Wang [ 28/Jul/14 ] |
|
Hmm, the failure on DNE is mostly about umount of the MGS:
Jul 13 10:08:02 onyx-51vm3 kernel: Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
Jul 13 10:09:08 onyx-51vm3 kernel: Lustre: 3023:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1405271323/real 1405271323] req@ffff88005f6fe400 x1473528013665456/t0(0) o250->MGC10.2.5.30@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1405271348 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 13 10:09:08 onyx-51vm3 kernel: Lustre: 3023:0:(client.c:1926:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jul 13 10:09:16 onyx-51vm3 kernel: LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.5.31@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jul 13 10:09:16 onyx-51vm3 kernel: LustreError: Skipped 349 previous similar messages
Jul 13 10:10:05 onyx-51vm3 kernel: INFO: task umount:7320 blocked for more than 120 seconds.
Jul 13 10:10:05 onyx-51vm3 kernel: Not tainted 2.6.32-431.20.3.el6_lustre.x86_64 #1
Jul 13 10:10:05 onyx-51vm3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 13 10:10:05 onyx-51vm3 kernel: umount D 0000000000000001 0 7320 7319 0x00000080
Jul 13 10:10:05 onyx-51vm3 kernel: ffff8800627c1aa8 0000000000000086 0000000000000000 ffff88007bd0d400
Jul 13 10:10:05 onyx-51vm3 kernel: ffffffffa063cf3f 0000000000000000 ffff880055446144 ffffffffa063cf3f
Jul 13 10:10:05 onyx-51vm3 kernel: ffff8800627c7af8 ffff8800627c1fd8 000000000000fbc8 ffff8800627c7af8
Jul 13 10:10:05 onyx-51vm3 kernel: Call Trace:
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff81529ac2>] schedule_timeout+0x192/0x2e0
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff81083f30>] ? process_timeout+0x0/0x10
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05bfa8b>] obd_exports_barrier+0xab/0x180 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa0dea57e>] mgs_device_fini+0xfe/0x590 [mgs]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05e6683>] class_cleanup+0x573/0xd30 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05c1866>] ? class_name2dev+0x56/0xe0 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05e83aa>] class_process_config+0x156a/0x1ad0 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05e074b>] ? lustre_cfg_new+0x2cb/0x680 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05e8a89>] class_manual_cleanup+0x179/0x6f0 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05c1866>] ? class_name2dev+0x56/0xe0 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa062632b>] server_put_super+0x96b/0xe50 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff8118b0cb>] generic_shutdown_super+0x5b/0xe0
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff8118b1b6>] kill_anon_super+0x16/0x60
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05ea966>] lustre_kill_super+0x36/0x60 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff8118b957>] deactivate_super+0x57/0x80
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff811ab35f>] mntput_no_expire+0xbf/0x110
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff811abeab>] sys_umount+0x7b/0x3a0
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jul 13 10:10:10 onyx-51vm3 kernel: Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
Jul 13 10:11:38 onyx-51vm3 kernel: Lustre: 3023:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1405271473/real 1405271473] req@fff
This probably deserves a new ticket, but it only happened once in July, so it is probably not serious enough to be a critical ticket here? |
| Comment by Di Wang [ 28/Jul/14 ] |
|
Hmm, this umount MGS failure only happens for http://review.whamcloud.com/#/c/10249/, so I am not sure whether it is related to that patch or not. I will create a new ticket if I see more failures. |
| Comment by Peter Jones [ 29/Jul/14 ] |
|
Bobijam Could you please look into this issue? Thanks Peter |
| Comment by Zhenyu Xu [ 30/Jul/14 ] |
|
The failure with the signature of "waiting for obd_unlinked_exports more than XX" is a dup of |
| Comment by Nathaniel Clark [ 31/Jul/14 ] |
|
I agree the symptom is the same, but fixing obdfilter-survey ( |
| Comment by Zhenyu Xu [ 07/Aug/14 ] |
|
From the 1st report http://maloo.whamcloud.com/test_sets/e5783778-f887-11e3-b13a-52540035b04c:
02:17:13:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252206/real 1403252206] req@ffff880045c20800 x1471410848609332/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403252212 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
02:17:13:LustreError: 5280:0:(mgc_request.c:516:do_requeue()) failed processing log: -5
02:17:13:LustreError: 5280:0:(mgc_request.c:516:do_requeue()) Skipped 15 previous similar messages
02:17:13:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252206/real 1403252206] req@ffff880065b40400 x1471410848609204/t0(0) o38->lustre-MDT0000-lwp-OST0007@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252217 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
02:17:13:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252206/real 1403252206] req@ffff88006a05d000 x1471410848609284/t0(0) o38->lustre-MDT0000-lwp-OST0004@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252222 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
02:17:13:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252216/real 1403252216] req@ffff880065ed6400 x1471410848609596/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403252227 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252221/real 1403252221] req@ffff880074a52400 x1471410848609744/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252237 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252241/real 1403252241] req@ffff880042557800 x1471410848610120/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252262 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252276/real 1403252276] req@ffff880073b55400 x1471410848610676/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403252301 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252341/real 1403252341] req@ffff880074e19400 x1471410848612092/t0(0) o38->lustre-MDT0000-lwp-OST0006@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252366 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 18 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252476/real 1403252476] req@ffff8800672e3800 x1471410848614808/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252501 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252746/real 1403252746] req@ffff88006633c800 x1471410848620304/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252771 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 80 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403253266/real 1403253266] req@ffff880027636c00 x1471410848630652/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403253291 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 157 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403253866/real 1403253866] req@ffff880074dd0000 x1471410848642692/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403253891 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 179 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403254471/real 1403254471] req@ffff880074dc6c00 x1471410848654508/t0(0) o38->lustre-MDT0000-lwp-OST0004@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403254496 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 181 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403255086/real 1403255086] req@ffff880064c9e000 x1471410848666436/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403255111 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
There are MGS_CONNECT and MDS_CONNECT RPCs being issued again and again (not resent); it seems that the reconnect engine does not honor that the obd device (mgc obd and lwp obd) has been cleaned up and keeps trying to reconnect the device. |
| Comment by Oleg Drokin [ 14/Aug/14 ] |
|
I think the root issue comes from this message:
02:17:17:Lustre: lustre-MDT0002 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 11. Is it stuck?
...
This basically means the obd has some extra references, and it's those references that don't let it go away, so it keeps trying to reconnect? This sort of message is seen in all of the bugs of this class. |
| Comment by Jian Yu [ 14/Aug/14 ] |
|
One more instance on master branch: https://testing.hpdd.intel.com/test_sets/a0ca0910-23fe-11e4-84ee-5254006e85c2 |
| Comment by Zhenyu Xu [ 15/Aug/14 ] |
|
Is it possible that this is the result of the un-dying connect requests being regenerated again and again, and that they hold the obd refcount? |
| Comment by Jian Yu [ 22/Aug/14 ] |
While testing patch http://review.whamcloud.com/11539 on Lustre b2_5 branch, sanity-lfsck test 0 hit the unmounting mgs failure. I created a new ticket |
| Comment by Jian Yu [ 28/Aug/14 ] |
|
While testing patch http://review.whamcloud.com/11574 on Lustre b2_5 branch with FSTYPE=zfs, sanity test 132 hit the same failure in this ticket. |
| Comment by Isaac Huang (Inactive) [ 05/Sep/14 ] |
|
I checked the Maloo report above and found a deadlock on the OSS very similar to the latest ones in
I haven't figured out why ZFS is stuck doing WRITE_FLUSH_FUA, but there are two things to move forward with at this point: |
| Comment by Alex Zhuravlev [ 15/Sep/14 ] |
|
Isaac, can you explain what signs point to WRITE_FLUSH_FUA, please? |
| Comment by Isaac Huang (Inactive) [ 16/Sep/14 ] |
|
Alex, in the OSS stack dump: the z_ioctl_iss thread handles ZIO_TYPE_IOCTL, and the only command is DKIOCFLUSHWRITECACHE, which is handled by vdev_disk_io_flush(). I suspected that vdev_disk_io_flush() was blocking in: bio = bio_alloc(GFP_NOIO, 0) - which blocks until there's memory to return. Also, the txg_sync thread was waiting in spa_sync()->zio_wait() (probably on the IOCTL zio), and the ll_ost00_008 thread was blocking in txg_wait_open(). That's why I suspected the DKIOCFLUSHWRITECACHE ioctl was blocking and preventing the txgs from moving forward - the WRITE_FLUSH_FUA bio was probably not submitted yet. |
| Comment by Alex Zhuravlev [ 16/Sep/14 ] |
|
Isaac, have a look here - https://testing.hpdd.intel.com/test_logs/c2a5b9e6-2d5c-11e4-b550-5254006e85c2/show_text:
Sep 12 19:57:07 shadow-11vm3 kernel: ll_ost00_001 D 0000000000000000 0 9264 2 0x00000080
Sep 12 19:57:07 shadow-11vm3 kernel: txg_quiesce S 0000000000000000 0 307 2 0x00000080
Sep 12 19:57:07 shadow-11vm3 kernel: txg_sync D 0000000000000001 0 308 2 0x00000080 |
| Comment by Isaac Huang (Inactive) [ 17/Sep/14 ] |
|
My previous analysis might be inaccurate, as I neglected the '?' in the stack entry:
z_ioctl_iss/0 S 0000000000000001 0 10013 2 0x00000080
Call Trace:
[<ffffffffa024aa46>] ? vdev_disk_io_start+0x156/0x1b0 [zfs]
This meant that the vdev_disk_io_start() function likely returned and the WRITE_FLUSH_FUA bio was already submitted - the z_ioctl_iss thread was simply waiting for more work to do. I checked the OSS stack dumps, and the z_* threads were all idle, so there was no more IO left to do. These led me to believe that:
Suggestions to troubleshoot stay the same: on the OSS, set zfs_nocacheflush, and gather "zpool events -v" outputs. |
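A minimal sketch of how those two suggestions could be gathered on the OSS (the pool name lustre-ost1 is taken from the logs above; the paths assume a stock ZoL install, so treat this as an assumption rather than the exact procedure used in autotest):
# Diagnostic only: skip the DKIOCFLUSHWRITECACHE path entirely by disabling
# ZFS cache flushes, so we can see whether the hang follows the flush zio.
echo 1 > /sys/module/zfs/parameters/zfs_nocacheflush
# Record the verbose zpool event history for the OST pool; a stuck or
# errored zio should show up here with its delay/error details.
zpool events -v lustre-ost1 > /tmp/lustre-ost1-zpool-events.txt
# Dump stacks of all blocked (D state) tasks for cross-reference with the
# "waiting for obd_unlinked_exports" console messages.
echo w > /proc/sysrq-trigger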
| Comment by Alex Zhuravlev [ 17/Sep/14 ] |
|
Why do you think vdev_disk_io_start() was already used in this txg? That could be a leftover from the previous txg? |
| Comment by Isaac Huang (Inactive) [ 17/Sep/14 ] |
|
Alex, it was only a guess. If we had "zpool events -v" outputs, they'd provide a lot more information on the stuck zio. Also, I had a hunch that the stuck zio had something to do with the way the zpool was configured. There seemed to be too many levels of indirection - the use of /dev/lvm-Role_OSS/ indicated LVM, which seemed unnecessary since ZFS can manage its own volumes. |
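For comparison, building the OST pool directly on the raw disk would look roughly like this (a sketch only; the device name /dev/vdb and the index/NID values are placeholders, not taken from the actual test configuration):
# Current setup goes through LVM (/dev/lvm-Role_OSS/...); letting ZFS own
# the raw device removes one layer of indirection from the IO path.
mkfs.lustre --ost --backfstype=zfs --index=1 --fsname=lustre \
    --mgsnode=10.1.6.21@tcp lustre-ost1/ost1 /dev/vdb
# Mount as usual; for ZFS the "device" is the pool/dataset, not a block device.
mount -t lustre lustre-ost1/ost1 /mnt/ost1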
| Comment by Andreas Dilger [ 18/Sep/14 ] |
|
Seems this is the major cause of review-zfs test failures. |
| Comment by Peter Jones [ 23/Sep/14 ] |
|
Isaac This issue continues to disrupt zfs testing. What do you advise? Thanks Peter |
| Comment by Isaac Huang (Inactive) [ 24/Sep/14 ] |
|
Peter, The auto-test results unfortunately didn't include sufficient ZFS debug information, see |
| Comment by Isaac Huang (Inactive) [ 26/Sep/14 ] |
|
It seemed like LVM+deadline scheduler could be the cause of IO timeouts, see |
| Comment by Jian Yu [ 02/Oct/14 ] |
|
While verifying http://review.whamcloud.com/12001 on Lustre b2_5 branch with FSTYPE=zfs, sanity test 900 hung at unmounting MDS. Console log on MDS: 06:27:21:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
06:27:21:Lustre: lustre-MDT0000: Not available for connect from 10.1.4.26@tcp (stopping)
06:27:21:LustreError: 3264:0:(client.c:1079:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880036038800 x1480755389930956/t0(0) o13->lustre-OST0000-osc-MDT0000@10.1.4.26@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
06:27:21:LustreError: 3264:0:(client.c:1079:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88005bfda240[0x0, 1, [0x1:0x0:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff88005bfda298
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff880059957ac0osd-zfs-object@ffff880059957ac0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88005bfda240
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88006891de08[0x0, 0, [0x100020000:0x1d03:0x0] hash lru]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osp@ffff88006891de60osp-object@ffff88006891de08
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88006891de08
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88005bfda9c0[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff88005bfdaa18
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff880059957ee0osd-zfs-object@ffff880059957ee0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88005bfda9c0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88005bfda3c0[0x1, 1, [0x200000003:0x2:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff88005bfda418
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff880059957cd0osd-zfs-object@ffff880059957cd0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88005bfda3c0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88005bfda300[0x0, 1, [0x200000003:0x3:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff88005bfda358
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff880059957bc8osd-zfs-object@ffff880059957bc8
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88005bfda300
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff880027ebfc80[0x0, 1, [0xa:0x0:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff880027ebfcd8
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff8800599579b8osd-zfs-object@ffff8800599579b8
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff880027ebfc80
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88006891ded0[0x0, 0, [0x100000000:0x1e0e:0x0] hash lru]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osp@ffff88006891df28osp-object@ffff88006891ded0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88006891ded0
06:27:21:LustreError: 3249:0:(lu_object.c:1252:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
06:27:21:LustreError: 3249:0:(lu_object.c:1252:lu_device_fini()) LBUG
06:27:21:Pid: 3249, comm: obd_zombid
06:27:22:
06:27:22:Call Trace:
06:27:22: [<ffffffffa05ef895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
06:27:22: [<ffffffffa05efe97>] lbug_with_loc+0x47/0xb0 [libcfs]
06:27:22: [<ffffffffa075be88>] lu_device_fini+0xb8/0xc0 [obdclass]
06:27:22: [<ffffffffa076287e>] dt_device_fini+0xe/0x10 [obdclass]
06:27:22: [<ffffffffa104e93f>] osp_device_free+0xff/0x220 [osp]
06:27:22: [<ffffffffa0746e5d>] class_decref+0x46d/0x550 [obdclass]
06:27:22: [<ffffffffa07257df>] obd_zombie_impexp_cull+0x30f/0x5d0 [obdclass]
06:27:22: [<ffffffffa0725b05>] obd_zombie_impexp_thread+0x65/0x190 [obdclass]
06:27:22: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
06:27:22: [<ffffffffa0725aa0>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
06:27:22: [<ffffffff8109abf6>] kthread+0x96/0xa0
06:27:22: [<ffffffff8100c20a>] child_rip+0xa/0x20
06:27:22: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
06:27:22: [<ffffffff8100c200>] ? child_rip+0x0/0x20
06:27:22:
06:27:22:Kernel panic - not syncing: LBUG
Maloo report: https://testing.hpdd.intel.com/test_sets/6f56a29c-4a07-11e4-95b1-5254006e85c2 |
| Comment by Isaac Huang (Inactive) [ 02/Oct/14 ] |
|
The failure above was different from the zio timeout here in this ticket. The OSS zpools were idle, and on the MDS the zpool seemed to have been exported already (because the txg_sync thread already quit). It might have something to do with the patch being tested, but I want to focus on the zio timeout issue here. |
| Comment by Jian Yu [ 02/Oct/14 ] |
|
Hi Isaac, According to problem 2) in the earlier comments https://jira.hpdd.intel.com/browse/LU-5242?focusedCommentId=87897&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-87897 in this ticket, I reported the above failure here. I also saw the failure was reported in |
| Comment by Jian Yu [ 07/Oct/14 ] |
|
One more instance on master branch: |
| Comment by Doug Oucharek (Inactive) [ 15/Oct/14 ] |
|
Same failure has occurred on master branch for test_23a: https://testing.hpdd.intel.com/test_sets/90cd3a7e-53d7-11e4-9c8a-5254006e85c2 |
| Comment by Henri Doreau (Inactive) [ 23/Oct/14 ] |
|
Looks like another instance, on master: https://testing.hpdd.intel.com/test_sets/a40e6a14-5a2f-11e4-8dbb-5254006e85c2 |
| Comment by Jian Yu [ 25/Oct/14 ] |
|
One more instance on master branch: https://testing.hpdd.intel.com/test_sets/b8f4813c-5c25-11e4-b9ce-5254006e85c2 |
| Comment by Isaac Huang (Inactive) [ 29/Oct/14 ] |
|
I was able to reproduce it today and kept the system live for debugging. The txg_sync thread was stuck in zio_wait(), but the weird thing was that there was no IO pending/running at either the ZFS vdev queue or the Linux block dev queue. At this point I tend to think it's a ZFS issue rather than a Lustre issue. |
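For the record, the sort of live checks that support that conclusion might look like the following (a sketch; /dev/vdb as the OST vdev is an assumption, and /proc/<pid>/stack needs a kernel with stack tracing enabled):
# Confirm where the txg_sync thread is parked (expected: zio_wait()).
cat /proc/$(pgrep -x txg_sync | head -n 1)/stack
# In-flight request counts ("reads writes") at the Linux block layer for
# the OST vdev; all zeros means nothing is actually queued to the disk.
cat /sys/block/vdb/inflight
# Any zio delay/error events recorded by ZFS for the pool.
zpool events -v lustre-ost1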
| Comment by Jian Yu [ 29/Oct/14 ] |
|
One more instance on Lustre b2_5 branch: |
| Comment by Isaac Huang (Inactive) [ 30/Oct/14 ] |
|
Likely it can be fixed by this ZFS patch: I'll test it soon. |
| Comment by Dmitry Eremin (Inactive) [ 10/Nov/14 ] |
|
One more failure on master: https://testing.hpdd.intel.com/sub_tests/bee79924-6861-11e4-a1b6-5254006e85c2 |
| Comment by Andreas Dilger [ 13/Nov/14 ] |
|
Isaac, any update on testing this patch? While we prefer to stick with the upstream releases, we are also able to land a patch into our own ZFS Git repo so that our own testing passes. If this fixes the problem we are seeing, it is also worthwhile to update the issue on GitHub with this information so that it will speed up the patch landing into a release. |
| Comment by Isaac Huang (Inactive) [ 19/Nov/14 ] |
|
I'm still troubleshooting some local ZFS test failures after merging the patch. I'll apply it to Jenkins ZFS build once all tests pass. |
| Comment by Jian Yu [ 03/Dec/14 ] |
|
One more instance on Lustre b2_5 branch: |
| Comment by Isaac Huang (Inactive) [ 04/Dec/14 ] |
|
I've been testing with a patched ZFS build and so far haven't been able to reproduce it. But the strange thing is, I can't reproduce it either with a previous build where I could easily reproduce it with sanity test_132. So now I'm not sure whether the patch fixed it, or some other change made it much harder to reproduce. I'm now looking into why I can't reproduce it any more with the same build/test hardware where I was able to reproduce it easily before. |
| Comment by Andreas Dilger [ 08/Dec/14 ] |
|
Is it possible to push the upstream GitHub patch into our local ZFS git repository, so that we are always testing with that patch? That would allow us to get better testing, to see if it fixes this problem, and to see if it introduces some other problems or not. This is one of the last issues that is preventing review-zfs to be enforced for all patches. |
| Comment by Alex Zhuravlev [ 09/Dec/14 ] |
|
here is an example how to add a patch: http://git.whamcloud.com/ff/daos_lustre.git/commit/21f403380e365b083cf4e48003e863a1661a5a60 |
| Comment by Gerrit Updater [ 12/Dec/14 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13050 |
| Comment by Gerrit Updater [ 16/Dec/14 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13093 |
| Comment by Andreas Dilger [ 16/Dec/14 ] |
|
Based on the test results in https://testing.hpdd.intel.com/test_sessions/f8d48678-84d7-11e4-985f-5254006e85c2 it appears that the 13050 patch does not resolve the problem being seen here. Is there anything new that can be learned from the logs in these failed tests? This problem and
For now, I've pushed a patch to disable this test, because SOM is not a supported feature, and it will continue to be tested with ldiskfs anyway. |
| Comment by Gerrit Updater [ 17/Dec/14 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/13093/ |
| Comment by Andreas Dilger [ 18/Dec/14 ] |
|
Sadly, it seems that now that test_132() was added to the ALWAYS_EXCEPT list, test_133() has started causing timeouts at OST unmount time:
This makes me wonder if there is something happening in an earlier test that causes a problem (e.g. a refcount leak), so that any later unmount will hit it? If we could hit this reliably in a shorter test (e.g. just sanity in a loop, even if it only hit 1/5 times), then we might be able to bisect which subtest is causing the problem to isolate the root cause. |
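One way to run that kind of loop locally might be the following (a sketch only, assuming a node already configured for local ZFS testing; the path and the ONLY= list are illustrative, not a known-good reproducer):
#!/bin/bash
# Loop a window of sanity sub-tests followed by a full remount until the
# OST umount hang reproduces, then narrow ONLY= by bisection.
cd /usr/lib64/lustre/tests || exit 1

for pass in $(seq 1 10); do
    echo "=== sanity pass $pass ==="
    FSTYPE=zfs ONLY="103 116a 118 120 133" bash sanity.sh || break
    # Tear down and remount between passes so a reference leaked in one
    # pass shows up as an unmount hang right away, not at the end of a run.
    bash llmountcleanup.sh || break
    bash llmount.sh || break
done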
| Comment by Alex Zhuravlev [ 18/Dec/14 ] |
|
We could change t-f to remount the OST after every sub-test (or randomly)? |
| Comment by Andreas Dilger [ 18/Dec/14 ] |
|
There used to be the ability to do this in sanity.sh, by defining the "setup" and "cleanup" macros, but I suspect that is long gone. |
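If those macros are gone, a rough equivalent could be hooked into test-framework.sh; the sketch below reuses the existing stop/start/ostdevname helpers, but the helper name and where it gets called from are assumptions:
# Hypothetical helper: remount every OST between sub-tests so a leaked
# export or lu_object reference surfaces right after the sub-test that
# created it, instead of only at the final unmount of the whole run.
remount_osts_between_tests() {
    local num
    for num in $(seq $OSTCOUNT); do
        stop ost$num || error "umount of ost$num hung"
        start ost$num $(ostdevname $num) $OST_MOUNT_OPTS ||
            error "restart of ost$num failed"
    done
}
# e.g. invoked from run_one() after each test, or only every Nth sub-test
# to keep the overall runtime reasonable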
| Comment by Gerrit Updater [ 18/Dec/14 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13130 |
| Comment by nasf (Inactive) [ 24/Dec/14 ] |
|
Another failure instance: |
| Comment by Oleg Drokin [ 31/Dec/14 ] |
|
If diagnosing this problem requires some extra data that the autotest framework does not collect, then a patch needs to be created to collect this data, if this is something not easily reproducible on other systems. |
| Comment by Andreas Dilger [ 06/Jan/15 ] |
|
Looking into the OST syslog from one of the recent
Previous suggestions in this bug and
Can any of these options be tested/implemented in some reasonable manner? I'd like to get some forward progress on these issues. |
| Comment by Andreas Dilger [ 07/Jan/15 ] |
|
I've noticed in the test_133g hangs that there is a watchdog timeout in test_116a which indicates the source of the deadlock is at that point or earlier (this is easily seen in the OST dmesg log, but doesn't have any timestamps):
== sanity test 116a: stripe QOS: free space balance ===================== 18:41:05 (1420224065)
INFO: task ll_ost00_017:10306 blocked for more than 120 seconds.
ll_ost00_017 D 0000000000000001 0 10306 2 0x00000080
Call Trace:
[<ffffffff8152bd65>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff8152bec3>] rwsem_down_write_failed+0x23/0x30
[<ffffffff8128fbe3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff8152b3c2>] ? down_write+0x32/0x40
[<ffffffffa0e79d81>] osd_object_write_lock+0x61/0x70 [osd_zfs]
[<ffffffffa0fbf076>] ofd_object_destroy+0x66/0x8e0 [ofd]
[<ffffffffa0fb964d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
[<ffffffffa0fb2f6a>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
[<ffffffffa09df97e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
[<ffffffffa098f711>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
Unfortunately the timestamps on the console logs are not very accurate (they appear to only update once a minute or so), and this is a long-running test (about 200s), so test_116a may in fact be the source of the problem itself. I looked at all 20 of the test_133g hang OST dmesg logs from the past two weeks, and in every case the first test that reports this watchdog is test_116a, which itself takes about 200s to run. That makes it very likely that test_116a (or a test run just before it) is the source of the problem, since the previous tests can vary by a few seconds; if the hang had happened a long time before the start of test_116a, then in some cases the watchdog message would have appeared during an earlier test. |
| Comment by Gerrit Updater [ 07/Jan/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13264 |
| Comment by Isaac Huang (Inactive) [ 09/Jan/15 ] |
|
I just managed to reproduce it and the symptoms matched ZoL bug 2523, which was recently fixed by SPL patch:
Nathaniel is working on updating our ZFS/SPL version to 0.6.3-1.2 in |
| Comment by Andreas Dilger [ 14/Jan/15 ] |
|
Unfortunately, the upgrade to ZFS 0.6.3-1.2 doesn't seem to have solved this problem: It might be that there are multiple issues at work here? I haven't checked the latest failures to see if they have different symptoms or not. |
| Comment by Isaac Huang (Inactive) [ 14/Jan/15 ] |
|
The failures looked similar - the txg_sync thread stuck in zio_wait(). Either there were multiple causes of this same symptom, or the upstream fix didn't completely close the race. I'm looking into that. Meanwhile, I think we should go ahead and land it and see if it reduces the occurrences, since quite a few ZoL users reported that the fix did seem to work for them. |
| Comment by Gerrit Updater [ 15/Jan/15 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13416 |
| Comment by Isaac Huang (Inactive) [ 15/Jan/15 ] |
|
I just had a close look at:
Although the txg_sync thread was stuck in D state in zio_wait(), it probably wasn't the cause of the test hang. Here's what happened:
It was unlikely that the two threads hung because of the txg_sync hang in zio_wait, as we've seen previously, because if so:
So I'd tend to say that the test hang was a result of the hung ll_ost threads, rather than the txg_sync zio_wait deadlock. The txg_sync thread did show up in D state in zio_wait in the stack dump after sanity test 133g wouldn't complete. But very likely it just happened to be there waiting for some IOs to complete on the slow VM disks. It's quite common to see the txg_sync thread blocked on IO completion for more than 20 seconds on the test VMs. On the same OSS I saw ll_ost00_011 blocked for 40 seconds in txg_wait_open() at 20:09:38 and then recover - so the txg_sync thread must have been blocked on IO for roughly 40 seconds and then moved on. That's a long enough window for the txg_sync thread to just happen to be in D state in zio_wait() when the stacks are dumped. So, I'd still suggest merging the patch to upgrade to ZFS 0.6.3-1.2. So far there's been no new report of a zio_wait hang on ZoL since the SPL fix was merged. |
| Comment by Andreas Dilger [ 16/Jan/15 ] |
|
I tried commenting out test_116a in http://review.whamcloud.com/13264, but that also hit stuck threads with the same stack trace in test_120g (twice, and that test runs for about 55s), and in test_120b (once, and it only runs for 1s - https://testing.hpdd.intel.com/test_sets/6903d29a-983e-11e4-8f7b-5254006e85c2). Using the test_120b stack dump time, it puts the hang back in test_103c or earlier. Using the test_120g stack dump time (start/end window) it puts the failure in test_118c-test_118k or earlier, so there is no overlap. It seems like there is a specific test or behaviour that is triggering this, since the stacks are always dumped in test_116a (except when I had excluded it). If it was just a random race condition, it would appear in other tests as well. |
| Comment by Isaac Huang (Inactive) [ 16/Jan/15 ] |
|
I just noticed that the ORI-616 fix was missing on master. I've asked for confirmation whether it'd be needed still. If yes, then that might explain some of the symptoms here as a refcount was missing. |
| Comment by Gerrit Updater [ 16/Jan/15 ] |
|
Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13431 |
| Comment by Andreas Dilger [ 16/Jan/15 ] |
|
Isaac, given how often this bug is being hit (about 50% of all review-zfs test runs), it is worthwhile to just try porting the patch to master and running a bunch of sanity tests via Test-Parameters: to see if it passes or fails. |
| Comment by Andreas Dilger [ 19/Jan/15 ] |
|
Unfortunately, I don't think that the ORI-616 patch is relevant. It looks like the ORI-616 patch http://review.whamcloud.com/2560 that adds the lu_object_get() call in osd_bufs_get() was later dropped by the ORI-645 patch http://review.whamcloud.com/2607 (presumably intentionally). |
| Comment by Nathaniel Clark [ 21/Jan/15 ] |
|
I've just hit this on my local VM setup, and txg_sync is what seems to be holding up the umount of the OST. Attached is a dump of `zpool events -v` from the OSS. Below is a dump of the blocked processes I captured using sysrq:

SysRq : Show Blocked State
  task                        PC stack   pid father
ll_ost01_004  D 0000000000000001     0  9350      2 0x00000080
 ffff8800052e5960 0000000000000046 0000000000016880 0000000000000001
 ffffc90008afb350 0000000000000082 ffff880024e73588 ffff880001d1fd80
 ffff880005337098 ffff8800052e5fd8 000000000000fbc8 ffff880005337098
Call Trace:
 [<ffffffff8109b1ee>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa02d847d>] cv_wait_common+0xed/0x100 [spl]
 [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02d84e5>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa03d7f7b>] txg_wait_open+0x7b/0xa0 [zfs]
 [<ffffffffa039f95e>] dmu_tx_wait+0x29e/0x2b0 [zfs]
 [<ffffffff8152a29e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa039fa01>] dmu_tx_assign+0x91/0x490 [zfs]
 [<ffffffffa070f5ad>] osd_trans_start+0xed/0x430 [osd_zfs]
 [<ffffffffa0d9504c>] ofd_trans_start+0x7c/0x100 [ofd]
 [<ffffffffa0d97500>] ofd_object_destroy+0x3d0/0x8e0 [ofd]
 [<ffffffffa0d9176d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
 [<ffffffffa07a4ee0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa07a6780>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa0d8b0ba>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
 [<ffffffffa083396e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa07e3701>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa07e28c0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
ll_ost01_014  D 0000000000000001     0 23470      2 0x00000080
 ffff88000e1e9a70 0000000000000046 ffff8800186e4090 ffff88000e1e9a50
 ffffffffa08617d3 0000000000000000 ffff88000e1e9fd8 ffff8800056d8040
 ffff8800056d85f8 ffff88000e1e9fd8 000000000000fbc8 ffff8800056d85f8
Call Trace:
 [<ffffffff8152b315>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152b473>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff8128f683>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152a972>] ? down_write+0x32/0x40
 [<ffffffffa07171a1>] osd_object_write_lock+0x61/0x70 [osd_zfs]
 [<ffffffffa0d97196>] ofd_object_destroy+0x66/0x8e0 [ofd]
 [<ffffffffa0d9176d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
 [<ffffffffa07a4ee0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa07a6780>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa0d8b0ba>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
 [<ffffffffa083396e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa07e3701>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa07e28c0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
ll_ost01_016  D 0000000000000001     0 27741      2 0x00000080
 ffff88000d2c7a70 0000000000000046 ffff8800186e4090 ffff88000d2c7a50
 ffffffffa08617d3 0000000000000000 ffff88000d2c7fd8 ffff8800054f6080
 ffff8800054f6638 ffff88000d2c7fd8 000000000000fbc8 ffff8800054f6638
Call Trace:
 [<ffffffff8152b315>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152b473>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff8128f683>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152a972>] ? down_write+0x32/0x40
 [<ffffffffa07171a1>] osd_object_write_lock+0x61/0x70 [osd_zfs]
 [<ffffffffa0d97196>] ofd_object_destroy+0x66/0x8e0 [ofd]
 [<ffffffffa0d9176d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
 [<ffffffffa07a4ee0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa07a6780>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa0d8b0ba>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
 [<ffffffffa083396e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa07e3701>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa07e28c0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
txg_sync      D 0000000000000000     0 48138      2 0x00000080
 ffff88000a53dba0 0000000000000046 0000000000000000 0000000000000000
 ffff88000a53db10 ffffffff81061d12 ffff88000a53db60 ffffffff810546b9
 ffff880024e73af8 ffff88000a53dfd8 000000000000fbc8 ffff880024e73af8
Call Trace:
 [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
 [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
 [<ffffffff810a6d21>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff81529433>] io_schedule+0x73/0xc0
 [<ffffffffa02d841c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02d84a8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa041e0ab>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa03b3be3>] dsl_pool_sync+0x2b3/0x3f0 [zfs]
 [<ffffffffa03cbe4b>] spa_sync+0x40b/0xa60 [zfs]
 [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa03d8916>] txg_sync_thread+0x2e6/0x510 [zfs]
 [<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa03d8630>] ? txg_sync_thread+0x0/0x510 [zfs]
 [<ffffffffa02d3c2f>] thread_generic_wrapper+0x5f/0x70 [spl]
 [<ffffffffa02d3bd0>] ? thread_generic_wrapper+0x0/0x70 [spl]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
umount        D 0000000000000001     0 60033  60032 0x00000080
 ffff88001d44da98 0000000000000082 ffff88001d44d9f8 ffff8800182ec000
 ffffffffa151f501 0000000000000000 ffff880001cdb14c ffffffffa151f501
 ffff8800136525f8 ffff88001d44dfd8 000000000000fbc8 ffff8800136525f8
Call Trace:
 [<ffffffff81529ac2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81083f30>] ? process_timeout+0x0/0x10
 [<ffffffffa14aeba6>] obd_exports_barrier+0xb6/0x190 [obdclass]
 [<ffffffffa0d8281f>] ofd_device_fini+0x5f/0x260 [ofd]
 [<ffffffffa14d0b62>] class_cleanup+0x552/0xd10 [obdclass]
 [<ffffffffa14b0b26>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa14d330a>] class_process_config+0x1fea/0x27c0 [obdclass]
 [<ffffffffa0ef71b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa14cc315>] ? lustre_cfg_new+0x435/0x630 [obdclass]
 [<ffffffffa14d3c01>] class_manual_cleanup+0x121/0x870 [obdclass]
 [<ffffffffa14b0b26>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa150c457>] server_put_super+0xb37/0xe50 [obdclass]
 [<ffffffff8118b0cb>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8118b1b6>] kill_anon_super+0x16/0x60
 [<ffffffffa14d5e56>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff8118b957>] deactivate_super+0x57/0x80
 [<ffffffff811ab35f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811abeab>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

iotop shows that txg_sync is issuing writes to disk, and top shows this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   COMMAND
48138 root       0 -20     0    0    0 R 64.4  0.0 102:13.51  txg_sync
48060 root      39  19     0    0    0 S 41.2  0.0  64:16.60  z_null_iss/0
48137 root       0 -20     0    0    0 S 23.9  0.0  37:43.30  txg_quiesce
 9350 root      20   0     0    0    0 R  5.3  0.0   8:33.88  ll_ost01_004
   22 root      20   0     0    0    0 S  0.3  0.0  36:24.79  kblockd/0 |
| Comment by Alex Zhuravlev [ 22/Jan/15 ] |
|
Call Trace:

so, it's still in the DMU doing I/O... |
| Comment by Isaac Huang (Inactive) [ 22/Jan/15 ] |
|
Nathan, if iotop showed txg_sync issuing writes, it could just be waiting for IO on the slow VM. It's quite common to see txg_sync block for tens of seconds waiting for IO to complete on VMs. The important thing is how long it stays blocked there. If it blocks there for hours with no active IO going on, then it's an instance of the zio race we've seen previously; otherwise it could just be a red herring. Also, it's not clear whether ll_ost01_014 and ll_ost01_016 blocked before the txg_sync thread or after. In the case of https://testing.hpdd.intel.com/test_sets/deca9712-9bc1-11e4-857a-5254006e85c2 the two OST threads blocked in similar places but well before the txg_sync thread was blocked. |
| Comment by Isaac Huang (Inactive) [ 22/Jan/15 ] |
|
Alex, yes, it's still in the DMU doing I/O. But it's quite common to see txg_sync block for tens of seconds waiting for IO to complete on VMs. In the case of https://testing.hpdd.intel.com/test_sets/deca9712-9bc1-11e4-857a-5254006e85c2 the two OST threads blocked in similar places but well before the txg_sync thread was blocked. |
| Comment by Alex Zhuravlev [ 22/Jan/15 ] |
|
True, but then I'd expect only those threads to be left at some point, whereas in all the cases we get that txg_sync thread plus another thread in zio_wait()? |
| Comment by Alex Zhuravlev [ 22/Jan/15 ] |
|
would it be possible to get a crash dump? |
| Comment by Isaac Huang (Inactive) [ 25/Jan/15 ] |
|
I just reproduced it. ll_ost00_017 was stuck in D state (umount of ost1 was also stuck):

[<ffffffffa011b415>] cv_wait_common+0x125/0x150 [spl]
[<ffffffffa011b495>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa02215e3>] txg_wait_open+0xa3/0xf0 [zfs]
[<ffffffffa01e4939>] dmu_tx_wait+0x389/0x390 [zfs]
[<ffffffffa01e49e9>] dmu_tx_assign+0xa9/0x520 [zfs]
[<ffffffffa0e6d5ad>] osd_trans_start+0xed/0x430 [osd_zfs]
[<ffffffffa0fb907c>] ofd_trans_start+0x7c/0x100 [ofd]
[<ffffffffa0fbb530>] ofd_object_destroy+0x3d0/0x8e0 [ofd]
[<ffffffffa0fb579d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
[<ffffffffa0faf0ea>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
[<ffffffffa09dbd9e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
[<ffffffffa098b891>] ptlrpc_main+0xe41/0x1960 [ptlrpc]

However, it was actually running in D state rather than sleeping, and using lots of CPU time:

So it looked like cv_wait_common() was actually spinning on a damaged lock. Also, the TXGs of both OST pools were moving forward, so it's not the txg_sync hang we've seen previously. I've never seen any report like this on ZoL, so I'd tend to think it's related to the way we use ZoL. I wonder whether it could be some shutdown race: the umount process had freed some resource that ll_ost00_017 was still trying to use, and then cv_wait_common() got stuck on a damaged lock. The umount stack:

[<ffffffffa0724bd6>] obd_exports_barrier+0xb6/0x190 [obdclass]
[<ffffffffa0fa684f>] ofd_device_fini+0x5f/0x260 [ofd]
[<ffffffffa0746072>] class_cleanup+0x552/0xd10 [obdclass]
[<ffffffffa074881a>] class_process_config+0x1fea/0x27c0 [obdclass]
[<ffffffffa0749111>] class_manual_cleanup+0x121/0x870 [obdclass]
[<ffffffffa07819d7>] server_put_super+0xb37/0xe50 [obdclass]
[<ffffffff8118b61b>] generic_shutdown_super+0x5b/0xe0
[<ffffffff8118b706>] kill_anon_super+0x16/0x60
[<ffffffffa074b366>] lustre_kill_super+0x36/0x60 [obdclass]
[<ffffffff8118bea7>] deactivate_super+0x57/0x80
[<ffffffff811ab8af>] mntput_no_expire+0xbf/0x110
[<ffffffff811ac3fb>] sys_umount+0x7b/0x3a0

It appears that class_cleanup()=>obd_precleanup() had been called. If ldto_device_fini() was called as well, i.e. osd_device_fini() ran and dmu_objset_disown(o->od_os) was called, then osd_trans_start() could still have been accessing od_os from dmu_tx_assign(oh->ot_tx, TXG_WAIT). A race between osd_device_fini() and osd_trans_start() would explain what I see here. Comments very welcome. |
| Comment by Isaac Huang (Inactive) [ 25/Jan/15 ] |
|
Also, the pool that umount was trying to umount seemed to be messed up:

# cat lustre-ost1/txgs; sleep 1; cat lustre-ost1/txgs
469 0 0x01 3 336 31269437119932 94328219768999
txg        birth           state ndirty nread nwritten reads writes otime  qtime wtime stime
504280372  94327859132340  S     0      0     0        0     0      52962  5155  34018 0
504280373  94327859185302  W     0      0     0        0     0      42311  5308  0     0
504280374  94327859227613  O     0      0     0        0     0      0      0     0     0
469 0 0x01 3 336 31269437119932 94329224682995
txg        birth           state ndirty nread nwritten reads writes otime  qtime wtime stime
504282558  94329223042758  S     0      0     0        0     0      819957 7673  68244 0
504282559  94329223862715  W     0      0     0        0     0      119631 5697  0     0
504282560  94329223982346  O     0      0     0        0     0      0      0     0     0

The TXG # increased by 2186 in just 1 second - at the same time, LNet sent/got no message and all OST threads were idle - very weird. Again, no similar report was found on ZoL. Likely we're doing something wrong here. |
| Comment by Alex Zhuravlev [ 25/Jan/15 ] |
|
class_cleanup() calls class_disconnect_exports() before going into the precleanup process, so all the exports should be disconnected; then obd_exports_barrier() is called to ensure all exports are gone, and only after that, in ofd_stack_fini(), do we disconnect from the OSD so the OSD can start to unmount. I think there should be no racing dmu_objset_disown() here. Also notice that dsl_dataset_disown() doesn't really do anything like "umount"; it just marks the objset as "not used", while all the txg logic is driven a level higher, for the pool as a whole. |
| Comment by Andreas Dilger [ 26/Jan/15 ] |
|
Brian and Olaf were looking at this bug on Friday and were going to post, but since they didn't get a chance to do so yet I'll recap their findings as best I can. When ZFS is compiled with ASSERT checks enabled there is a test failure around sanity subtest 100 that shows a bonus buffer being modified without a lock held on it. Judging by the stack traces in test_116, this points to a possible root cause for the stuck threads and the later unmount failure. |
| Comment by Andreas Dilger [ 26/Jan/15 ] |
|
Brian filed that as |
| Comment by Isaac Huang (Inactive) [ 26/Jan/15 ] |
|
I had debugging enabled for SPL/ZFS, but was able to hit the hang without hitting any ZFS assertion. I checked the kernel hung-task watcher code, and it looks like it would not warn about tasks running in D state. The ll_ost00_017 thread likely hung in txg_wait_open() long before the umount started, judging from the CPU time it had used. That would also explain why there were two other OST threads hanging in osd_object_write_lock() but no error message about any thread blocked while holding the lock. I'll try to narrow it down to subtests earlier than 133. Would it be possible to force an OST umount/remount after each test? That would pinpoint exactly when SHTF. |
| Comment by Andreas Dilger [ 27/Jan/15 ] |
|
Isaac, how recent was your testing? I wonder if we were hitting multiple different problems here; maybe one was fixed with 0.6.3-1.2 and there is a second problem causing the threads to be stuck and dump in test_116? Unfortunately, Nathaniel already ran tests with an unmount after every subtest (see his debug patch earlier in this bug) but AFAIK this didn't find the problem. That said, I didn't check the results of this patch myself. It may be that unmounting after every subtest didn't allow memory pressure to build up and hit the problem. I also had a patch on this bug to disable test_116 to see if it was the source of the problem, since it runs for more than 120 seconds itself and the watchdogs could be triggered by hangs after the start of the test. However, the stack traces moved to test_118, which at least gave me a somewhat better idea of where the stuck threads first got stuck. It looks like it was around test_103 or so, but it is hard to know because the console messages do not have any timestamps. It might be more productive to try a few patches to disable groups of subtests around 100 to 103, each with Test-Parameters that include 6-10 sanity runs to have high confidence that the problem is caught. |
| Comment by Isaac Huang (Inactive) [ 27/Jan/15 ] |
|
The one where txg_sync hung in D state while there was no pending IO was fixed - I was able to hit it often before 0.6.3-1.2 but not once after. Now I'm seeing one OST thread blocked in D state in txg_wait_open() forever, while txg_sync was busy doing IO and the TXGs of the pool were moving forward. I'm now testing with a temporary workaround for |
| Comment by Isaac Huang (Inactive) [ 27/Jan/15 ] |
|
The |
| Comment by Nathaniel Clark [ 27/Jan/15 ] |
|
Attempting to bisect the cause: umount/mount of the OSTs works prior to test 50 and fails prior to test 100: |
| Comment by Isaac Huang (Inactive) [ 30/Jan/15 ] |
|
I wrote a script to watch for stuck OSS threads and found in one failed test:

Lustre: DEBUG MARKER: == sanity test 103b: MDS mount option 'noacl' == 03:40:06 (1422618006)
LustreError: 11-0: lustre-MDT0000-lwp-OST0001: operation obd_ping to node 10.100.4.147@tcp failed: rc = -107
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.100.4.147@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: 9624:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1422618009/real 1422618009] req@ffff88006f83ccc0 x1491657489867716/t0(0) o400->MGC10.100.4.147@tcp@10.100.4.147@tcp:26/25 lens 224/224 e 0 to 1 dl 1422618016 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 9624:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 166-1: MGC10.100.4.147@tcp: Connection to MGS (at 10.100.4.147@tcp) was lost; in progress operations using this service will fail
Lustre: 9623:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1422618016/real 1422618016] req@ffff88003e42e680 x1491657489867740/t0(0) o250->MGC10.100.4.147@tcp@10.100.4.147@tcp:26/25 lens 400/544 e 0 to 1 dl 1422618022 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: Evicted from MGS (at 10.100.4.147@tcp) after server handle changed from 0x9f6377cf69c91a9b to 0x9f6377cf6a97ffe3
Lustre: MGC10.100.4.147@tcp: Connection restored to MGS (at 10.100.4.147@tcp)
Lustre: Skipped 1 previous similar message
Lustre: 9623:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1422618016/real 1422618016] req@ffff88005febe9c0 x1491657489867744/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.100.4.147@tcp:12/10 lens 400/544 e 0 to 1 dl 1422618027 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-OST0001: deleting orphan objects from 0x0:162984 to 0x0:163009
Lustre: lustre-OST0000: deleting orphan objects from 0x0:155843 to 0x0:155873
LustreError: 11-0: MGC10.100.4.147@tcp: operation obd_ping to node 10.100.4.147@tcp failed: rc = -107
LustreError: Skipped 1 previous similar message
LustreError: 166-1: MGC10.100.4.147@tcp: Connection to MGS (at 10.100.4.147@tcp) was lost; in progress operations using this service will fail
LustreError: 14122:0:(mgc_request.c:526:do_requeue()) failed processing log: -5
LustreError: 14122:0:(mgc_request.c:526:do_requeue()) Skipped 7 previous similar messages
Lustre: Evicted from MGS (at 10.100.4.147@tcp) after server handle changed from 0x9f6377cf6a97ffe3 to 0x9f6377cf6a980275

4. Later, at 133g, umount would block forever.

I tend to think the thread hung in txg_wait_open() was the real cause, and the threads that later hung in osd_object_write_lock() a consequence of it:
|
| Comment by Isaac Huang (Inactive) [ 01/Feb/15 ] |
|
Now I can reproduce the txg_wait_open() hang with "sanity --only 78,79,80". It seems to be the smallest subset of tests that triggers it - I haven't been able to reproduce it if any one of those tests is removed. Once a thread gets stuck in txg_wait_open(), umount on the OSS blocks forever, just like at the end of 133g. |
| Comment by Isaac Huang (Inactive) [ 02/Feb/15 ] |
|
I found a possible bug |
| Comment by Brian Behlendorf [ 02/Feb/15 ] |
|
If you're able to reproduce this bug it might be useful to check the following proc variables. First I'd look at dmu_tx; it shows a list of counters which indicate why a TX can't be assigned to a TXG. From the stacks which have been posted it looks like dmu_tx_assign() may just be failing repeatedly (maybe an incorrectly constructed TX?) and thus blocking in dmu_tx_wait() forever. Knowing why would help narrow things down. On a healthy system you should see almost everything in dmu_tx_assigned (success); the other common cases are dmu_tx_dirty_throttle and dmu_tx_dirty_delay. These indicate that the txg_sync thread can't write data as fast as the processes are creating it, so it injects a delay to slow them down and keep things in balance. If you see more than a handful of anything else it should be investigated. The other thing I'd look at is /proc/spl/kstat/zfs/<pool>/txgs. This contains a history of the last N TXGs which were committed: how long they took, how much IO was performed, how many IOPs were issued, etc. To enable this you'll need to set the history size using the zfs_txg_history module parameter; by default it's disabled, but if you want to log the last, say, 10 TXGs, set it to 10. In particular, I'm curious whether TXGs continue to roll forward at the usual rate while this thread is stuck (it sounds like they do). If so then the TXG engine itself would seem to be working fine and there's just something wrong with the specific TX which is blocked. |
| Comment by Alex Zhuravlev [ 02/Feb/15 ] |
|
well, we're supposed to get a message in that case: rc = |
| Comment by Brian Behlendorf [ 02/Feb/15 ] |
|
Well, not exactly. When calling dmu_tx_assign() with TXG_WAIT it should only fail in the case where we're over quota. For all other errors it will retry and we should be assigned to the next TXG (assuming the TX is sane). So checking the dmu_tx proc file would show whether we're hitting any of the ERESTART cases.

/*
 * Assign tx to a transaction group.  txg_how can be one of:
 *
 * (1)  TXG_WAIT.  If the current open txg is full, waits until there's
 *      a new one.  This should be used when you're not holding locks.
 *      It will only fail if we're truly out of space (or over quota).
 *
 * (2)  TXG_NOWAIT.  If we can't assign into the current open txg without
 *      blocking, returns immediately with ERESTART.  This should be used
 *      whenever you're holding locks.  On an ERESTART error, the caller
 *      should drop locks, do a dmu_tx_wait(tx), and try again. |
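For illustration, the caller-side half of that protocol looks roughly like the sketch below. It is a generic example, not a specific Lustre or ZFS function; destroy_object_nowait() and its arguments are made up for this sketch, and it assumes the in-kernel DMU headers are available.

#include <sys/dmu.h>

/* Sketch of the TXG_NOWAIT protocol described in the comment above: on
 * ERESTART the caller drops its locks, waits for the next open TXG, and
 * retries.  TXG_WAIT performs the equivalent retry internally, which is
 * why osd_trans_start() never returns while the reservation keeps failing. */
static int
destroy_object_nowait(objset_t *os, uint64_t oid)
{
	dmu_tx_t *tx;
	int	  err;

again:
	tx = dmu_tx_create(os);
	dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);

	err = dmu_tx_assign(tx, TXG_NOWAIT);
	if (err == ERESTART) {
		/* the open TXG is full or the reservation failed:
		 * drop locks, wait for the next TXG, and retry */
		dmu_tx_wait(tx);
		dmu_tx_abort(tx);
		goto again;
	} else if (err != 0) {
		dmu_tx_abort(tx);	/* e.g. ENOSPC or EDQUOT */
		return (err);
	}

	err = dmu_object_free(os, oid, tx);
	dmu_tx_commit(tx);
	return (err);
}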
| Comment by Isaac Huang (Inactive) [ 02/Feb/15 ] |
|
Brian, you nailed it! Thanks a lot! Every time the hang happened, the TXGs would move forward, but unusually fast:

# cat lustre-ost1/txgs; sleep 1; cat lustre-ost1/txgs
25 0 0x01 3 336 244675310220 123698699004892
txg         birth            state ndirty nread nwritten reads writes otime  qtime wtime stime
1010319525  123698585210993  S     0      0     0        0     0      84246  3380  33414 0
1010319526  123698585295239  W     0      0     0        0     0      36794  2496  0     0
1010319527  123698585332033  O     0      0     0        0     0      0      0     0     0
25 0 0x01 3 336 244675310220 123699705972680
txg         birth            state ndirty nread nwritten reads writes otime  qtime wtime stime
1010324666  123699503872529  S     0      0     0        0     0      111328 5851  53035 0
1010324667  123699503983857  W     0      0     0        0     0      59840  6406  0     0
1010324668  123699504043697  O     0      0     0        0     0      0      0     0     0

In this case, 5141 in one second. Here's dmu_tx:

# cat dmu_tx
4 1 0x01 11 528 3293147069 124044054813482
name                    type data
dmu_tx_assigned         4    32644
dmu_tx_delay            4    0
dmu_tx_error            4    0
dmu_tx_suspended        4    0
dmu_tx_group            4    0
dmu_tx_memory_reserve   4    1012288401
dmu_tx_memory_reclaim   4    0
dmu_tx_dirty_throttle   4    0
dmu_tx_dirty_delay      4    0
dmu_tx_dirty_over_max   4    0
dmu_tx_quota            4    0

And dmu_tx_memory_reserve is still growing rapidly even though there's now no IO from Lustre at all:

# grep dmu_tx_memory_reserve dmu_tx; sleep 1; grep dmu_tx_memory_reserve dmu_tx
dmu_tx_memory_reserve   4    1012794812
dmu_tx_memory_reserve   4    1012800331

So the counter grew by 5519 in a second, roughly the same as the increase in the TXG #. That looks like why the thread is still hanging in txg_wait_open(). The only place where that counter gets bumped is dmu_tx_try_assign()=>dsl_dir_tempreserve_space()=>arc_tempreserve_space(). So I increased zfs_arc_max by about 500M, waited for the arc_adapt thread to pick up the change, and a few seconds later the thread recovered from the hang and everything looked normal! Even umount worked! Before I increased zfs_arc_max, there was about 800M available between arc_sz and arc_c, and the hung TX was only trying to remove one object (and update the OI and accounting ZAPs), so clearly something was wrong with that TX (maybe |
| Comment by Brian Behlendorf [ 02/Feb/15 ] |
|
Whoa, that's definitely going to be a problem! It sounds like Lustre constructed a massive TX relative to the target ARC size and therefore could never assign it to a TXG. Something like |
| Comment by Alex Zhuravlev [ 02/Feb/15 ] |
|
IIRC, we got -ENOSPC from dmu_tx_assign() if the TX was too big, for example when we tried to destroy a file with lots of stripes. |
| Comment by Brian Behlendorf [ 02/Feb/15 ] |
|
Alex, that can happen (along with EDQUOT) if the pool is low on space, but it's not guaranteed to happen even for very large TXs. It's entirely possible that you'll fall into this ERESTART retry loop first. The DMU expects that individual TXs will be relatively small (~10MB at most) but that a TXG may contain many TXs. If Lustre is creating TXs much larger than this I can see how this deadlock could happen. It would be best if we could put an upper bound on the size of a Lustre TX. |
| Comment by Andreas Dilger [ 02/Feb/15 ] |
|
Kudos to everyone for the excellent work in tracking down this tricky problem. Nathaniel, while a solution is being worked on, could you please submit a patch to sanity to re-enable subtest 132 and exclude subtests 78, 79, and 80 for ZFS OSTs, with a Test-Parameters: line that runs sanity enough times to be confident of the result (maybe 6-10 times)? That would allow us to start enforcing the review-zfs test results while the fix for this new issue is prepared, and ensure we don't regress in other areas in the meantime. Excluding the sanity test_133g failures, we would have passed 27 of 30 recent review-zfs test sessions (one failed due to a problem in the patch, one failed due to timeouts in sanity-quota and ost-pools, and one had an additional failure in sanity-quota). There is a similar check in ext4 to limit the size of a single transaction handle to less than 1/4 of the total journal size. I don't think it unreasonable to put some upper bound on the size of a single ZFS TX, but what that size should be is unclear. I recall the 10MB TX limit also, but I think it was just recently increased? Related to this, llog updates (appends) may be quite large in theory but in practice only affect a small number of blocks, and since an unlink of a widely-striped file may update a lot of llog files (up to 2000 stripes) it could cause very large theoretical transaction sizes that aren't seen in practice, so we don't want them to gratuitously fail due to this check. That is what LU-2160 "Implement ZFS dmu_tx_hold_append() declarations for llog" is about, but that hasn't been implemented yet. |
| Comment by Isaac Huang (Inactive) [ 02/Feb/15 ] |
|
Working on the patch to disable 78, 79, and 80... |
| Comment by Gerrit Updater [ 02/Feb/15 ] |
|
Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13600 |
| Comment by Isaac Huang (Inactive) [ 03/Feb/15 ] |
|
I just reproduced it again (with a

# zdb -e -ddddd lustre-ost2/ost2 195
Object lvl iblk dblk dsize lsize %full type
195 5 16K 128K 1.01M 1.00T 0.00 ZFS plain file
Indirect blocks:
0 L4 0:54a00:400 4000L/400P F=8 B=513/513
10000000000 L3 0:54600:400 4000L/400P F=8 B=513/513
10000000000 L2 0:54200:400 4000L/400P F=8 B=513/513
10000000000 L1 0:4fe00:400 4000L/400P F=8 B=513/513
10000000000 L0 0:76aa00:20000 20000L/20000P F=1 B=513/513
10000020000 L0 0:78aa00:20000 20000L/20000P F=1 B=513/513
10000040000 L0 0:7aaa00:20000 20000L/20000P F=1 B=513/513
10000060000 L0 0:7caa00:20000 20000L/20000P F=1 B=513/513
10000080000 L0 0:7eaa00:20000 20000L/20000P F=1 B=513/513
100000a0000 L0 0:80aa00:20000 20000L/20000P F=1 B=513/513
100000c0000 L0 0:82aa00:20000 20000L/20000P F=1 B=513/513
100000e0000 L0 0:84aa00:20000 20000L/20000P F=1 B=513/513
segment [0000010000000000, 0000010000100000) size 1M
An object with only 4 indirect blocks, despite its lsize of 1T. I guessed it was the object the hung OST thread was trying to remove. arc_c was 938M when it hung; I then gradually increased it, and the hang disappeared when arc_c reached 1.0G. Then that object also disappeared:

# zdb -e -ddddd lustre-ost2/ost2 195
Object lvl iblk dblk dsize lsize %full type
zdb: dmu_bonus_hold(195) failed, errno 2
So my guess was right: the thread tried to remove the object with a 1T lsize, dmu_tx_hold_free() estimated the memory overhead of freeing the object to be about 1G, and then dmu_tx_assign() hung in dmu_tx_try_assign()=>dsl_dir_tempreserve_space()=>arc_tempreserve_space(). I also kept watching arcsz, and it grew by only a few MBs (it never grew over 60M). So it looks like dmu_tx_count_free() severely over-estimated txh_memory_tohold for dmu_tx_hold_free(). |
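As a sanity check on that estimate, here is my own back-of-envelope arithmetic (an assumption about what the worst-case estimate charges, not taken from the DMU source): costing the full 1 TiB logical range as if it were dense gives almost exactly the ~1 GiB reservation observed.

/*
 * Rough worst-case arithmetic, assuming the 1 TiB range is charged as if
 * dense, with 128 KiB data blocks and 16 KiB indirect blocks
 * (128 block pointers of 128 bytes each per indirect block):
 *
 *   L0 block pointers to free:   1 TiB / 128 KiB       = 8,388,608
 *   L1 indirect blocks:          8,388,608 / 128       = 65,536
 *   L1 bytes to hold in memory:  65,536 * 16 KiB       = 1 GiB
 *
 * i.e. ~1 GiB of txh_memory_tohold for an object that actually contains
 * only 8 L0 blocks (~1 MiB of data).
 */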
| Comment by Alex Zhuravlev [ 03/Feb/15 ] |
|
Cool catch! I guess it should be possible to reproduce this with a regular truncate to, say, 2-4TB? |
| Comment by Alex Zhuravlev [ 03/Feb/15 ] |
|
Also, would it make sense to have our own copy of dmu_tx_assign() with some checks, e.g. if we've been retrying for too long, print an error? |
| Comment by Andreas Dilger [ 03/Feb/15 ] |
|
This looks like it is caused by sanity.sh test_80 "Page eviction is equally fast at high offsets too", which writes 1MB at a 1TB offset in a sparse file to test client page cache truncation performance. I guess it is also testing OSD-side block truncation... I don't think making our own version of dmu_tx_assign() is the right solution, but rather fixing the upstream ZFS code. I'm not sure if this should fail outright (leaving a user with an unremovable object?) or only print an error that the arc_c limit should be increased (preferably with some advice on how large it needs to be). As for the proper solution, I'd expect that the DMU code shouldn't be so pessimistic about TX size when deleting this sparse file. I'd imagine it only really needs a few hundred KB of dirty blocks, so maybe it needs a bit better accounting? Is it accounting for a dense file and not realizing the file is sparse? Is it possible to short-circuit all of the calculations in dmu_tx_count_free() by checking the allocated block count from the dnode (dn_used?), or is that a layering violation? |
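To make the last suggestion concrete, a short-circuit along those lines might look something like the following. This is a hypothetical sketch only: tx_hold_free_estimate() is a made-up name, the slack term is arbitrary, and it is not a proposed ZFS patch.

#include <sys/dnode.h>
#include <sys/spa.h>

/* Hypothetical cap on the dmu_tx_count_free() estimate: never charge more
 * than the space the dnode actually has allocated (plus some slack for
 * indirect blocks that must be read and dirtied), so destroying a sparse
 * object is not costed as if the whole logical range were dense. */
static uint64_t
tx_hold_free_estimate(dnode_t *dn, uint64_t dense_estimate)
{
	uint64_t allocated = DN_USED_BYTES(dn->dn_phys);

	return (MIN(dense_estimate,
		    allocated + (allocated >> 3) + SPA_MAXBLOCKSIZE));
}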
| Comment by Alex Zhuravlev [ 03/Feb/15 ] |
|
I agree that the proper fix can only be done in the DMU. But there might be other similar overestimations leading to a livelock in some corner case, which would be hard to recognize given that it's a silent loop within dmu_tx_assign(). |
| Comment by Isaac Huang (Inactive) [ 03/Feb/15 ] |
|
Last night I changed the seek=1M to seek=5K in test_80, and so far I've run 5 iterations of sanity and 64 iterations of "sanity --only 78,79,80" without reproducing it. So it looks like we can work around it by reducing seek in test_80 (probably test_34g as well) without skipping any sub-test, while working on a proper fix. As to the fix:
|
| Comment by Alex Zhuravlev [ 03/Feb/15 ] |
|
Take into account that with wide striping we may need to modify up to ~2K objects (and each llog is modified twice: the 8K header and the record itself). |
| Comment by Jodi Levi (Inactive) [ 03/Feb/15 ] |
|
Patch to disable test(s): http://review.whamcloud.com/#/c/13600 |
| Comment by Andreas Dilger [ 03/Feb/15 ] |
|
Isaac, we cannot change the seek value in test_80, because that is testing VM cache invalidation specific to the client, so changing the file size will break this test on the client. Oleg's suggestion is to only exclude subtest 80 if the ARC size is too small, so that it still runs on other systems that have enough RAM to pass this subtest. It might be useful to check pass/fail rates against VM clusters to see if some VMs are failing more often because they have less RAM? It doesn't seem that we need to exclude subtest 78 or 79 at all, but we're landing this patch anyway to allow review-ZFS to begin passing. |
| Comment by Gerrit Updater [ 03/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13600/ |
| Comment by Jodi Levi (Inactive) [ 03/Feb/15 ] |
|
Failing tests disabled. Fix will be included in 2.8. |
| Comment by Isaac Huang (Inactive) [ 03/Feb/15 ] |
|
Andreas, the DMU wanted about 1G of ARC for the object delete. Most of our VMs are configured with 1.85G of memory, so the ARC size defaults to 938M, meaning they'll all run into this problem. We can increase the ARC size a bit to let the object delete pass, because only a couple of MBs were really needed, but that would likely run into other issues, e.g. |
| Comment by Gerrit Updater [ 03/Feb/15 ] |
|
Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13620 |
| Comment by Gerrit Updater [ 04/Feb/15 ] |
|
Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13630 |
| Comment by Isaac Huang (Inactive) [ 04/Feb/15 ] |
|
It looks like the preferred DMU API for removing large/sparse objects is dmu_free_long_range(). The ZPL was able to remove any large/sparse object I threw at it, where the same objects would just choke osd-zfs:

zfs_rmnode(znode_t *zp)
......
/*
* Free up all the data in the file.
*/
error = dmu_free_long_range(os, zp->z_id, 0, DMU_OBJECT_END);
......
/*
* Set up the final transaction.
*/
tx = dmu_tx_create(os);
dmu_tx_hold_free(tx, zp->z_id, 0, DMU_OBJECT_END);
......
error = dmu_tx_assign(tx, TXG_WAIT);
......
zfs_znode_delete(zp, tx); ==> VERIFY(0 == dmu_object_free(os, obj, tx));
I created a simple patch and my tests showed it worked:

@@ -444,6 +444,9 @@ static void __osd_declare_object_destroy(const struct lu_env *env,
zap_cursor_t *zc;
int rc = 0;
+ rc = dmu_free_long_range(osd->od_os, oid, 0, DMU_OBJECT_END);
+ LASSERTF(rc == 0, "dmu_free_long_range "LPU64" failed: %d\n", oid, rc);
+
dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);
But two things I'm not sure about:
We can work on dmu_tx_count_free() to correctly estimate memory overhead for the particular sparse file created by test_80(). But we'd have to deal with any large/sparse object anyway, so I think in the long run it'd be better to go with dmu_free_long_range(). |
| Comment by Brian Behlendorf [ 04/Feb/15 ] |
|
Nice find, that neatly explains everything. Thanks Isaac for opening issue 3064 at zfsonlinux so we can decide what should be done on the ZFS side. We should certainly investigate improving the quality of the worst-case estimate. But it really does need to be a worst-case estimate, or we risk potentially stalling the txg engine in the worst (but very unlikely) case. Anyway, I'm open to ideas! |
| Comment by Alex Zhuravlev [ 04/Feb/15 ] |
|
Isaac, this is usually done using an additional index/list containing the objects to be freed. So the OSD's destroy puts an object on the list (plus, say, the OI removal), and then a separate thread (or osd_trans_stop()) truncates the object, frees it and removes it from the list/index using as many transactions as needed. |
| Comment by Brian Behlendorf [ 04/Feb/15 ] |
|
Yes, what Alex said. This is exactly how the POSIX layer works as well: the object to be removed is placed on an unlinked list for handling later in as many transactions as needed. Only once the blocks are all freed is the object removed from the list. See zfs_unlinked_drain(). |
| Comment by Andreas Dilger [ 04/Feb/15 ] |
|
Note that by the time the OST object is being destroyed it has already been unlinked from the MDS namespace, and that unlink is committed on the MDS. The OFD has also revoked client locks and discarded any cached dirty pages on the clients, so there is no danger in destroying the object in stages. That is what ext4 does internally (truncating blocks in small chunks from the end of the file). The MDT is already logging the object destroy locally and will resend it if the OSS crashes, and the OST object hasn't been deleted from the OST namespace yet, so I don't see a requirement for more infrastructure to handle this at the OSD level. It is fine to truncate the object before destroy FOR THE OST ONLY, and higher layers will handle it. That said, since the OSD code is common, we shouldn't be exposing truncated-but-not-unlinked objects on the MDT, so it may be easiest to truncate to zero explicitly from the OFD before the object destroy? |
| Comment by Alex Zhuravlev [ 05/Feb/15 ] |
|
Well, that would mean we can't atomically destroy huge objects using OUT, which I'd like to use to batch OST object destroys from the MDT. |
| Comment by Isaac Huang (Inactive) [ 10/Feb/15 ] |
|
How about:
All implemented at the osd-zfs layer. |
| Comment by Alex Zhuravlev [ 10/Feb/15 ] |
|
From the Lustre point of view, the object is destroyed when it can't be found with lu_object_find(), IOW when the corresponding FID is removed from the OI - this should be a part of the object destroy. Space accounting shouldn't be an issue, as space isn't released immediately in any case: grants/quota can release reserved space upon commit, but a new reservation can be made only with a new statfs. Truncate/dnode destroy in osd_trans_stop() looks OK. |
| Comment by Andreas Dilger [ 11/Feb/15 ] |
|
It makes sense to keep this compatible with ZFS ZPL if at all possible, so that if large files are unlinked under ZPL and then mounted as Lustre, or vice versa, we don't defer deleting them forever. I see this is handled in zfs_unlinked_add() and zfs_unlinked_drain() with a ZAP named "DELETE_QUEUE" (ZFS_UNLINKED_SET) in the MASTER_NODE_OBJ. Even if we need to implement our own routines to handle this, it makes sense to use the same ZAP and zap format (zap_add_int()). |
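A very rough sketch of how that could look at the osd-zfs level, mirroring zfs_unlinked_add()/zfs_unlinked_drain(): the osd_-prefixed names below are hypothetical, the unlinked_zap object id is assumed to have been looked up from MASTER_NODE_OBJ beforehand, and this is not the patch that eventually landed.

#include <sys/dmu.h>
#include <sys/zap.h>

/* In the (now small) destroy transaction: record the dnode in the same
 * unlinked-set ZAP the ZPL uses, instead of freeing all its data at once. */
static int
osd_unlinked_add(objset_t *os, uint64_t unlinked_zap, uint64_t oid,
		 dmu_tx_t *tx)
{
	/* same on-disk format as zfs_unlinked_add(): keyed by object id */
	return (zap_add_int(os, unlinked_zap, oid, tx));
}

/* Later, outside request processing: free the data in as many TXGs as the
 * DMU needs, then free the dnode and drop the ZAP entry. */
static void
osd_unlinked_drain(objset_t *os, uint64_t unlinked_zap)
{
	zap_cursor_t	zc;
	zap_attribute_t za;

	zap_cursor_init(&zc, os, unlinked_zap);
	while (zap_cursor_retrieve(&zc, &za) == 0) {
		uint64_t oid = za.za_first_integer;

		if (dmu_free_long_range(os, oid, 0, DMU_OBJECT_END) == 0) {
			dmu_tx_t *tx = dmu_tx_create(os);

			dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);
			dmu_tx_hold_zap(tx, unlinked_zap, FALSE, NULL);
			if (dmu_tx_assign(tx, TXG_WAIT) == 0) {
				dmu_object_free(os, oid, tx);
				zap_remove_int(os, unlinked_zap, oid, tx);
				dmu_tx_commit(tx);
			} else {
				dmu_tx_abort(tx);
			}
		}
		zap_cursor_advance(&zc);
	}
	zap_cursor_fini(&zc);
}

With this split, the destroy transaction itself only needs to hold the OI and ZAP updates, so the large dmu_tx_hold_free() declaration moves out of the request-handling path entirely.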
| Comment by Brian Behlendorf [ 11/Feb/15 ] |
|
I like Andreas's idea of keeping the unlink behavior compatible with the ZPL. It would be ideal if you could reuse the existing ZPL functions but those functions are tied quite closely to ZPL specific data structures so that's probably not workable. But the ZFS_UNLINKED_SET object itself is just a ZAP containing a list of object ids. And since objects on disk are already constructed to be compatible with the ZPL we should be able to safely use it. Isaac's design is nice, but let me suggest a few minor tweaks:
|
| Comment by Alex Zhuravlev [ 11/Feb/15 ] |
|
I'd think that the optimization for small objects (notice most of the MDT's objects are literally empty) makes sense, as we don't need to modify yet another ZAP twice. I guess there is no strong requirement to implement this right away, but still. |
| Comment by Brian Behlendorf [ 11/Feb/15 ] |
|
I could see a case for zero-length files as a possible optimization. But I suspect that even for the MDT it would be more efficient to handle the freeing asynchronously, outside of any request processing. Even if the file is zero length you're still going to be freeing a spill block for the xattrs and updating a dbuf for the dnode object. Personally I'd keep it as simple and concise as possible until it's clear something more is required. But that's just my preference. Keep in mind that none of this free space will be available for a couple of TXGs anyway. |
| Comment by Alex Zhuravlev [ 11/Feb/15 ] |
|
I have no objection to doing this as simply as possible; my point was that the MDT is known to be CPU-bound and ZAP updates (even for a micro ZAP) aren't free. |
| Comment by Isaac Huang (Inactive) [ 12/Feb/15 ] |
|
Thanks all. I'll work on a patch first without the small object optimization to get this bug fixed; then will benchmark to figure out whether to optimize the small object path or not. |
| Comment by Andreas Dilger [ 12/Feb/15 ] |
|
I would agree with Alex on this. Deferring the unlink of small files will probably double or triple the total IO that the MDT does, because in addition to the actual dnode deletion it also needs to insert the dnode into the deathrow ZAP in one TXG and then delete it from the same ZAP in a different TXG. If a large number of objects are being deleted at once (easily possible on the MDT), the deathrow ZAP may get quite large (and never shrink), and updates would become less efficient than if it were kept small. |
| Comment by Bob Glossman (Inactive) [ 19/Feb/15 ] |
|
Another one seen on b2_5 with zfs: |
| Comment by Gerrit Updater [ 19/Feb/15 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13805 |
| Comment by Gerrit Updater [ 17/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13630/ |
| Comment by Peter Jones [ 18/May/15 ] |
|
Landed for 2.8 |