[LU-5242] Test hang sanity test_132, test_133: umount ost Created: 23/Jun/14  Updated: 23/Nov/17  Resolved: 18/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Isaac Huang (Inactive)
Resolution: Fixed Votes: 0
Labels: prz

Attachments: Text File zpool-events.txt    
Issue Links:
Duplicate
duplicates LU-5163 (lu_object.h:852:lu_object_attr()) AS... Resolved
duplicates LU-6105 Update ZFS/SPL version to 0.6.3-1.2 Resolved
is duplicated by LU-4716 replay-ost-single test_5: stuck in db... Resolved
Related
is related to LU-4968 Test failure sanity test_132: umount ... Resolved
is related to LU-3665 obdfilter-survey test_3a: unmount stu... Resolved
is related to LU-6089 qsd_handler.c:1139:qsd_op_adjust()) A... Resolved
is related to LU-6155 osd_count_not_mapped() calls dbuf_hol... Resolved
is related to LU-5277 sanity test_132: mdt_build_target_lis... Resolved
is related to LU-5737 sanity test_132: client NULL req->rq_... Resolved
is related to LU-6008 sanity test_102b: setstripe failed Resolved
is related to LU-6195 osd-zfs: osd_declare_object_destroy()... Resolved
is related to LU-4950 sanity-benchmark test fsx hung: txg_s... Closed
is related to LU-7020 OST_DESTROY message times out on MDS ... Closed
is related to LU-2160 Implement ZFS dmu_tx_hold_append() de... Open
Severity: 3
Rank (Obsolete): 14622

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:

http://maloo.whamcloud.com/test_sets/e5783778-f887-11e3-b13a-52540035b04c.

The sub-test test_132 failed with the following error:

test failed to respond and timed out

Info required for matching: sanity 132



 Comments   
Comment by Oleg Drokin [ 23/Jun/14 ]

MDS2 traces have a very characteristic signature of "waiting for obd_unlinked_exports more than XX" - there are a bunch of bugs about this; they hit from time to time on unmount.

Comment by Andreas Dilger [ 26/Jun/14 ]

It looks like there is some kind of problem in cleaning up the OSP device, and this has been causing a significant number of failures recently (about 10% for both review-zfs and review-dne-part-1).

03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff88005b849740[0x0, 1, [0x1:0x0:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88005b849790
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff88005b849800osd-ldiskfs-object@ffff88005b849800(i:ffff88007b5fe5a8:144/3610117262)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff88005b849740
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff880046ba3600[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff880046ba3650
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff88007954f680osd-ldiskfs-object@ffff88007954f680(i:ffff88007b60c628:110/3610117227)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff880046ba3600
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff88006abf6800[0x0, 1, [0x200000003:0x2:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88006abf6850
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff88006abf6c80osd-ldiskfs-object@ffff88006abf6c80(i:ffff880057c7c1d0:32771/3610117228)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff88006abf6800
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff8800572f4680[0x0, 1, [0xa:0x0:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff8800572f46d0
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff8800684f6500osd-ldiskfs-object@ffff8800684f6500(i:ffff880060b77110:178/3610117296)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff8800572f4680
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff88006abf62c0[0x1, 1, [0x200000001:0x1017:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88006abf6310
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff88006abf6d40osd-ldiskfs-object@ffff88006abf6d40(i:ffff880057c706e8:524289/986286545)[plain]
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff88006abf62c0
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff880067b38e00[0x0, 1, [0x4c0000402:0x2f:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdt@ffff880067b38e50mdt-object@ffff880067b38e00(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdd@ffff880057491cd0mdd-object@ffff880057491cd0(open_count=0, valid=0, cltime=0, flags=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....lod@ffff8800717d0bf0lod-object@ffff8800717d0bf0
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osp@ffff8800717d3d00osp-object@ffff8800717d3cb0
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff880067b38e00
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff88007a048928[0x0, 1, [0x440000bd1:0x2:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdt@ffff88007a048978mdt-object@ffff88007a048928(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdd@ffff880061a52e10mdd-object@ffff880061a52e10(open_count=0, valid=0, cltime=0, flags=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....lod@ffff880061a40830lod-object@ffff880061a40830
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osp@ffff88005e25e1a0osp-object@ffff88005e25e150
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff88007a048928
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) header@ffff880052e8ad08[0x0, 2, [0x440000400:0xd5:0x0] hash exist]{
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdt@ffff880052e8ad58mdt-object@ffff880052e8ad08(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....mdd@ffff8800717fc3c0mdd-object@ffff8800717fc3c0(open_count=0, valid=0, cltime=0, flags=0)
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....lod@ffff880052e88c68lod-object@ffff880052e88c68
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) ....osp@ffff88005ba3f5b0osp-object@ffff88005ba3f560
03:21:03:LustreError: 7751:0:(osp_dev.c:858:osp_device_free()) } header@ffff880052e8ad08

It may be that this is a new type of failure caused by a recently landed patch, or it could be the same failure that Oleg describes that has been made worse by some other patch.

Comment by Peter Jones [ 27/Jun/14 ]

Emoly

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 30/Jun/14 ]

Emoly,
could you please look into the recent sanity test_132 failures to see when the above osp_device_free() failures started? I think this particular failure mode started fairly recently, and it is different from the older failures reported in LU-4062/LU-3665. If we can track down when the more recent failures started, we could see whether they relate to a particular patch landing.

Comment by Emoly Liu [ 01/Jul/14 ]

I can hit this "hang" issue on zfs in my local test, but not every time. I will investigate it.

Comment by Emoly Liu [ 01/Jul/14 ]

Yesterday I happened to hit this problem, and its call trace is the same as the one shown in LU-4062. Unfortunately, I can't reproduce it today.

Jun 30 16:16:28 centos6-x kernel: umount        D 0000000000000000     0 31062  31061 0x00000080
Jun 30 16:16:28 centos6-x kernel: ffff8800176b7aa8 0000000000000086 ffff8800176b7a08 ffff88001f4c3c00
Jun 30 16:16:28 centos6-x kernel: ffffffffa12fbd1d 0000000000000000 ffff88000e00c184 ffffffffa12fbd1d
Jun 30 16:16:28 centos6-x kernel: ffff88001f405098 ffff8800176b7fd8 000000000000fb88 ffff88001f405098
Jun 30 16:16:28 centos6-x kernel: Call Trace:
Jun 30 16:16:28 centos6-x kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
Jun 30 16:16:28 centos6-x kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa1281acb>] obd_exports_barrier+0xab/0x180 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa0ba776f>] ofd_device_fini+0x5f/0x260 [ofd]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12a8633>] class_cleanup+0x573/0xd30 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12838a6>] ? class_name2dev+0x56/0xe0 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12aa35a>] class_process_config+0x156a/0x1ad0 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12a26fb>] ? lustre_cfg_new+0x2cb/0x680 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12aaa39>] class_manual_cleanup+0x179/0x6f0 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12838a6>] ? class_name2dev+0x56/0xe0 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12e8269>] server_put_super+0x8f9/0xe50 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffff8118363b>] generic_shutdown_super+0x5b/0xe0
Jun 30 16:16:28 centos6-x kernel: [<ffffffff81183726>] kill_anon_super+0x16/0x60
Jun 30 16:16:28 centos6-x kernel: [<ffffffffa12ac916>] lustre_kill_super+0x36/0x60 [obdclass]
Jun 30 16:16:28 centos6-x kernel: [<ffffffff81183ec7>] deactivate_super+0x57/0x80
Jun 30 16:16:28 centos6-x kernel: [<ffffffff811a21bf>] mntput_no_expire+0xbf/0x110
Jun 30 16:16:28 centos6-x kernel: [<ffffffff811a2c2b>] sys_umount+0x7b/0x3a0
Jun 30 16:16:28 centos6-x kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

I am checking the recent maloo test reports to see when the failure started.

Comment by Emoly Liu [ 01/Jul/14 ]

I checked the last two months of sanity.sh test_132 failure logs, and they show three kinds of timeout failure.

1) LU-4062/LU-3665, as Oleg said, its signature is "XXX is waiting for obd_unlinked_exports more than xxx seconds". It really does happen from time to time, even in yesterday's test on the latest master branch https://testing.hpdd.intel.com/test_logs/1a2b5a56-00e2-11e4-b331-5254006e85c2/show_text .
This issue will probably be fixed by the patch http://review.whamcloud.com/#/c/9350 .

2) LU-5242, as Andreas said, its signature is “(osp_dev.c:858:osp_device_free()) } header@” and

12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) } header@ffff8800607cee00
12:34:53:LustreError: 7776:0:(lu_object.c:1198:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
12:34:53:LustreError: 11252:0:(mdt_handler.c:4376:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
12:34:53:LustreError: 11252:0:(mdt_handler.c:4376:mdt_fini()) LBUG
12:34:53:Pid: 11252, comm: umount
12:34:53:
12:34:53:Call Trace:
12:34:53: [<ffffffffa048e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
12:34:53: [<ffffffffa048ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
12:34:53: [<ffffffffa0ebccaf>] mdt_device_fini+0xc8f/0xcd0 [mdt]
12:34:53: [<ffffffffa05c3f5d>] ? class_disconnect_exports+0x17d/0x2f0 [obdclass]
12:34:53: [<ffffffffa05e6443>] class_cleanup+0x573/0xd30 [obdclass]
12:34:53: [<ffffffffa05c1836>] ? class_name2dev+0x56/0xe0 [obdclass]
12:34:53: [<ffffffffa05e816a>] class_process_config+0x156a/0x1ad0 [obdclass]
12:34:53: [<ffffffffa0499478>] ? libcfs_log_return+0x28/0x40 [libcfs]
12:34:53: [<ffffffffa05e0549>] ? lustre_cfg_new+0x309/0x680 [obdclass]
12:34:53: [<ffffffffa05e8849>] class_manual_cleanup+0x179/0x6f0 [obdclass]
12:34:53: [<ffffffffa05c1836>] ? class_name2dev+0x56/0xe0 [obdclass]
12:34:53: [<ffffffffa0625f09>] server_put_super+0x8f9/0xe50 [obdclass]
12:34:53: [<ffffffff8118af0b>] generic_shutdown_super+0x5b/0xe0
12:34:53: [<ffffffff8118aff6>] kill_anon_super+0x16/0x60
12:34:53: [<ffffffffa05ea726>] lustre_kill_super+0x36/0x60 [obdclass]
12:34:53: [<ffffffff8118b797>] deactivate_super+0x57/0x80
12:34:53: [<ffffffff811aa79f>] mntput_no_expire+0xbf/0x110
12:34:53: [<ffffffff811ab2eb>] sys_umount+0x7b/0x3a0
12:34:53: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Seems this problem has gone since June-15.

3) LU-5277, which I have just filed as a new ticket. Its signature is

03:45:20:BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
03:45:20:IP: [<ffffffffa108b30b>] mdt_build_target_list+0xfb/0x6a0 [mdt]
03:45:20:PGD 0 
03:45:20:Oops: 0000 [#1] SMP 
03:45:20:last sysfs file: /sys/devices/system/cpu/online
03:45:20:CPU 1 
03:45:20:Modules linked in: osp(U) lod(U) mdt(U) mdd(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) scollective(U) gossip(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: llog_test]
03:45:20:
03:45:20:Pid: 16913, comm: mdt00_001 Tainted: P           ---------------    2.6.32-358.6.2.el6_lustre.g1624f5c.x86_64 #1 Red Hat KVM
03:45:20:RIP: 0010:[<ffffffffa108b30b>]  [<ffffffffa108b30b>] mdt_build_target_list+0xfb/0x6a0 [mdt]
03:45:20:RSP: 0018:ffff88006e32bba0  EFLAGS: 00010246
03:45:20:RAX: ffff88006c779ba0 RBX: ffff88002ea7a5c0 RCX: fffffffffffffffe
03:45:20:RDX: 0000000000000000 RSI: ffff88006c779b18 RDI: ffff88002ea7a5c0
03:45:20:RBP: ffff88006e32bc00 R08: ffff880048a91720 R09: 0000000000000000
03:45:20:R10: ffff88002d578800 R11: 0000000000000418 R12: 0000000000000000
03:45:20:R13: d4f7928dd528ca79 R14: ffff88006c779b18 R15: ffff88002dd6b000
03:45:20:FS:  00007f7d60754700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
03:45:20:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
03:45:20:CR2: 0000000000000030 CR3: 0000000001a85000 CR4: 00000000000006e0
03:45:20:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
03:45:20:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
03:45:20:Process mdt00_001 (pid: 16913, threadinfo ffff88006e32a000, task ffff88007005f500)
03:45:20:Stack:
03:45:20: ffff88006e32bbd0 0000000000000000 ffff88006e32bc18 ffffffffa1089e50
03:45:20:<d> ffffffffffffffff ffff88002dd6b000 ffff88006e32bc00 ffff88004b9d17d8
03:45:20:<d> ffff88002dd6b378 d4f7928dd528ca79 ffff88002dd6b000 0000000000000000
03:45:20:Call Trace:
03:45:20: [<ffffffffa1089e50>] ? mdt_corpc_add_shard+0x0/0x5d0 [mdt]
03:45:20: [<ffffffffa108ef76>] mdt_corpc_release_epochs+0x136/0x6c0 [mdt]
03:45:20: [<ffffffffa108f5fb>] mdt_container_close+0xfb/0x200 [mdt]
03:45:20: [<ffffffffa106e0de>] mdt_mfd_close+0x2de/0x700 [mdt]
03:45:20: [<ffffffffa106e9e2>] mdt_done_writing+0x4e2/0xd50 [mdt]
03:45:20: [<ffffffffa103df0d>] ? mdt_unpack_req_pack_rep+0x4d/0x4d0 [mdt]
03:45:20: [<ffffffffa0983d9c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
03:45:20: [<ffffffffa1047b98>] mdt_handle_common+0x648/0x1690 [mdt]
03:45:20: [<ffffffffa10818f5>] mds_regular_handle+0x15/0x20 [mdt]
03:45:20: [<ffffffffa0993558>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
03:45:20: [<ffffffffa05e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
03:45:20: [<ffffffffa05f8c5f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
03:45:20: [<ffffffffa098a8b9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
03:45:20: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
03:45:20: [<ffffffffa09948ee>] ptlrpc_main+0xace/0x1710 [ptlrpc]
03:45:20: [<ffffffffa0993e20>] ? ptlrpc_main+0x0/0x1710 [ptlrpc]
03:45:20: [<ffffffff8100c0ca>] child_rip+0xa/0x20
03:45:20: [<ffffffffa0993e20>] ? ptlrpc_main+0x0/0x1710 [ptlrpc]
03:45:20: [<ffffffffa0993e20>] ? ptlrpc_main+0x0/0x1710 [ptlrpc]
03:45:20: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Seems this problem also has gone since June-17.

The frequency of these issues is 1) > 3) > 2). BTW, unfortunately, many tests have no test logs, so I can't tell which situation they belong to.

Comment by Andreas Dilger [ 02/Jul/14 ]

Emoly, thank you for your detailed investigation. When you write that problems #2 and #3 "have gone since June NN", do you mean they have "gone away" == no longer happening, or "have gone on" == started then?

Could you check if patches were landed on those days (or a day before) that might have caused these failures? Even better is to find if the very first failures were happening on a patch before it was landed to master. You can check https://old-testing.hpdd.intel.com/ for the old failure logs.

Comment by Emoly Liu [ 02/Jul/14 ]

Andreas, sorry for my unclear wording; I mean problems #2 and #3 have gone away since the middle of June.

I will check those old failure logs and see if the failure is related to some patches landed on those days, and then give an update.

Comment by Emoly Liu [ 03/Jul/14 ]

Since problem #3 LU-5277 only happened in a daos-regression group, I re-filed it as FF-273. So we only need to focus on problem #2.

Comment by Emoly Liu [ 03/Jul/14 ]

I checked the test logs one by one and only found two instances of problem #2 (that is, the console logs clearly show “(osp_dev.c:858:osp_device_free()) } header@”).
They are https://testing.hpdd.intel.com/test_sets/f19d7516-f433-11e3-9db1-5254006e85c2 and https://testing.hpdd.intel.com/test_sets/1a7d3012-f4c9-11e3-b233-5254006e85c2 , which are from the same build https://build.hpdd.intel.com/job/lustre-reviews/24535 triggered by wangdi.

If I am right, problem #1 is the only remaining problem.

Comment by Di Wang [ 04/Jul/14 ]

Hmm, problem #2 (LBUG during umount of the MDT) only happened twice, and it seems it has not happened since June 15th, so it was probably already fixed.

I checked all of the test_132 timeout failures; most of them happened on ZFS while running "umount -f" on an OST. Maybe someone familiar with the OST stack should have a look. Thanks.

05:18:16:Lustre: 3931:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213483/real 1404213483]  req@ffff88006d539c00 x1472420846307548/t0(0) o400->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 224/224 e 0 to 1 dl 1404213490 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:18:16:LustreError: 166-1: MGC10.1.6.21@tcp: Connection to MGS (at 10.1.6.21@tcp) was lost; in progress operations using this service will fail
05:18:16:Lustre: DEBUG MARKER: ! zpool list -H lustre-ost1 >/dev/null 2>&1 ||
05:25:51:			grep -q ^lustre-ost1/ /proc/mounts ||
05:25:51:			zpool export  lustre-ost1
05:25:51:Lustre: DEBUG MARKER: grep -c /mnt/ost2' ' /proc/mounts
05:25:52:Lustre: DEBUG MARKER: umount -d -f /mnt/ost2
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213490/real 1404213490]  req@ffff8800683eb400 x1472420846307568/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213496 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213500/real 1404213500]  req@ffff8800683eb000 x1472420846307572/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213511 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213515/real 1404213515]  req@ffff88002eb48c00 x1472420846307576/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213531 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213535/real 1404213535]  req@ffff88002e895c00 x1472420846307580/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213556 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213560/real 1404213560]  req@ffff88006d82a000 x1472420846307584/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213585 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213590/real 1404213590]  req@ffff88006a6d6c00 x1472420846307588/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213615 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213650/real 1404213650]  req@ffff88002e895c00 x1472420846307596/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213675 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 1 previous similar message
05:25:52:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213740/real 1404213740]  req@ffff88002eb48c00 x1472420846307608/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213765 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404213880/real 1404213880]  req@ffff88002e895c00 x1472420846307628/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1404213905 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
05:25:52:Lustre: 3930:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
05:25:52:INFO: task umount:8756 blocked for more than 120 seconds.
05:25:52:      Tainted: P           ---------------    2.6.32-431.17.1.el6_lustre.gdc86085.x86_64 #1
05:25:52:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
05:25:52:umount        D 0000000000000000     0  8756   8755 0x00000080
05:25:52: ffff88006f099aa8 0000000000000082 ffff88006f099a08 ffff880071fcec00
05:31:55: ffffffffa079de77 0000000000000000 ffff88007142c184 ffffffffa079de77
05:31:55: ffff8800792a1ab8 ffff88006f099fd8 000000000000fbc8 ffff8800792a1ab8
05:31:55:Call Trace:
05:31:55: [<ffffffff81528e82>] schedule_timeout+0x192/0x2e0
05:31:55: [<ffffffff81083e90>] ? process_timeout+0x0/0x10
05:31:55: [<ffffffffa0720a6b>] obd_exports_barrier+0xab/0x180 [obdclass]
05:31:55: [<ffffffffa0f7476f>] ofd_device_fini+0x5f/0x260 [ofd]
05:31:55: [<ffffffffa07475d3>] class_cleanup+0x573/0xd30 [obdclass]
05:31:55: [<ffffffffa0722846>] ? class_name2dev+0x56/0xe0 [obdclass]
05:31:55: [<ffffffffa07492fa>] class_process_config+0x156a/0x1ad0 [obdclass]
05:31:55: [<ffffffffa05fa488>] ? libcfs_log_return+0x28/0x40 [libcfs]
05:31:55: [<ffffffffa07416d9>] ? lustre_cfg_new+0x309/0x680 [obdclass]
05:31:55: [<ffffffffa07499d9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
05:31:55: [<ffffffffa0722846>] ? class_name2dev+0x56/0xe0 [obdclass]
05:31:55: [<ffffffffa0787209>] server_put_super+0x8f9/0xe50 [obdclass]
05:31:55: [<ffffffff8118af0b>] generic_shutdown_super+0x5b/0xe0
05:31:55: [<ffffffff8118aff6>] kill_anon_super+0x16/0x60
05:31:55: [<ffffffffa074b8b6>] lustre_kill_super+0x36/0x60 [obdclass]
05:31:55: [<ffffffff8118b797>] deactivate_super+0x57/0x80
05:31:55: [<ffffffff811aa79f>] mntput_no_expire+0xbf/0x110
05:31:55: [<ffffffff811ab2eb>] sys_umount+0x7b/0x3a0
05:31:55: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
05:31:55:Lustre: lustre-OST0001 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck?
Comment by Andreas Dilger [ 04/Jul/14 ]

Di, this failure has not been hit since June 15 because that is the last time your patch http://review.whamcloud.com/10673 for LU-5163 was tested. This problem has only been hit when testing that patch.

Comment by Di Wang [ 05/Jul/14 ]

Ah, OK. I will fix patch 10673 in LU-5163. Alex, could you please comment on this one? I am not so familiar with ZFS.

Comment by Alex Zhuravlev [ 07/Jul/14 ]

I checked the log - https://testing.hpdd.intel.com/test_logs/846317de-f434-11e3-9db1-5254006e85c2/show_text:

12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) header@ffff88006a3c3980[0x0, 1, [0x1:0x0:0x0] hash exist]{
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88006a3c39d0
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff880061e9dc80osd-ldiskfs-object@ffff880061e9dc80(i:ffff8800578a97e8:144/3738986019)[plain]
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) } header@ffff88006a3c3980

This doesn't seem to be specific to the ZFS backend?

Comment by Di Wang [ 07/Jul/14 ]

12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) header@ffff88006a3c3980[0x0, 1, [0x1:0x0:0x0] hash exist]{
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) ....local_storage@ffff88006a3c39d0
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) ....osd-ldiskfs@ffff880061e9dc80osd-ldiskfs-object@ffff880061e9dc80(i:ffff8800578a97e8:144/3738986019)[plain]
12:34:53:LustreError: 7776:0:(osp_dev.c:858:osp_device_free()) } header@ffff88006a3c3980

Oh, this is caused by my patch http://review.whamcloud.com/#/c/10673/ , which has not landed yet; I will update the patch anyway.

But this ticket is about umount on ZFS, https://testing.hpdd.intel.com/sub_tests/query?utf8=✓&test_set%5Btest_set_script_id%5D=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test%5Bsub_test_script_id%5D=12fcb374-32be-11e0-b685-52540025f9ae&sub_test%5Bstatus%5D=TIMEOUT&sub_test%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

please check.

Comment by Alex Zhuravlev [ 08/Jul/14 ]

this is "known" issue with no resolution yet

ll_ost00_004 D 0000000000000001 0 26767 2 0x00000080
ffff88007946f970 0000000000000046 000000016fef8aa0 0000000000000001
ffff88006dcc01f0 0000000000000082 ffff88007946f950 ffff88007607bd80
ffff88006fef9058 ffff88007946ffd8 000000000000fbc8 ffff88006fef9058
Call Trace:
[<ffffffff8109b14e>] ? prepare_to_wait_exclusive+0x4e/0x80
[<ffffffffa014347d>] cv_wait_common+0xed/0x100 [spl]
[<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa01434e5>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa0242f7b>] txg_wait_open+0x7b/0xa0 [zfs]
[<ffffffffa020a95e>] dmu_tx_wait+0x29e/0x2b0 [zfs]
[<ffffffff8152965e>] ? mutex_lock+0x1e/0x50
[<ffffffffa020aa01>] dmu_tx_assign+0x91/0x490 [zfs]
[<ffffffffa0e9656c>] osd_trans_start+0x9c/0x410 [osd_zfs]
[<ffffffffa0f9e79c>] ofd_trans_start+0x7c/0x100 [ofd]
[<ffffffffa0f9f4c3>] ofd_object_destroy+0x203/0x680 [ofd]
[<ffffffffa0f9afed>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
[<ffffffffa0973e80>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[<ffffffffa09754e0>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
[<ffffffffa0f948d2>] ofd_destroy_hdl+0x2e2/0xb80 [ofd]
[<ffffffffa0a01a7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
[<ffffffffa09b129a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
[<ffffffffa09b0580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]

ffff88006fe55ba0 0000000000000046 00000000ffffffff 000015e4f4d9c106
ffff88006fe55b10 ffff88007034b910 000000000045358e ffffffffac457b12
ffff8800797a2638 ffff88006fe55fd8 000000000000fbc8 ffff8800797a2638
Call Trace:
[<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff815287f3>] io_schedule+0x73/0xc0
[<ffffffffa014341c>] cv_wait_common+0x8c/0x100 [spl]
[<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa01434a8>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa02890ab>] zio_wait+0xfb/0x1b0 [zfs]
[<ffffffffa021ebe3>] dsl_pool_sync+0x2b3/0x3f0 [zfs]
[<ffffffffa0236e4b>] spa_sync+0x40b/0xa60 [zfs]
[<ffffffffa0243916>] txg_sync_thread+0x2e6/0x510 [zfs]
[<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
[<ffffffffa0243630>] ? txg_sync_thread+0x0/0x510 [zfs]

there are many duplicates, actually; for example, LU-4716.

we discussed this a few times with Brian B., but the root cause is still unclear.

Comment by Di Wang [ 10/Jul/14 ]

[Andreas Dilger] Deleted this comment because this problem is actually LU-3665 and didn't want this incorrectly appearing in search results.

Comment by Di Wang [ 10/Jul/14 ]

Fortunately, these two failures have only happened once since July, so maybe this ticket should be re-marked as Major?

Comment by Jian Yu [ 14/Jul/14 ]

More instance on master branch:
https://testing.hpdd.intel.com/test_sets/33f21012-098c-11e4-a512-5254006e85c2

Comment by Nathaniel Clark [ 17/Jul/14 ]

It does happen more frequently on ZFS, e.g.:
https://testing.hpdd.intel.com/test_sets/b575d130-0dc1-11e4-b3f5-5254006e85c2
but it does also happen on DNE:
https://testing.hpdd.intel.com/test_sets/f019ab4a-0ae2-11e4-8ed2-5254006e85c2

Failure rate for review-zfs is 22% over the last week

Comment by Di Wang [ 28/Jul/14 ]

Hmm, the failure on DNE is mostly about unmounting the MGS:

Jul 13 10:08:02 onyx-51vm3 kernel: Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
Jul 13 10:09:08 onyx-51vm3 kernel: Lustre: 3023:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1405271323/real 1405271323]  req@ffff88005f6fe400 x1473528013665456/t0(0) o250->MGC10.2.5.30@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1405271348 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 13 10:09:08 onyx-51vm3 kernel: Lustre: 3023:0:(client.c:1926:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jul 13 10:09:16 onyx-51vm3 kernel: LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.5.31@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jul 13 10:09:16 onyx-51vm3 kernel: LustreError: Skipped 349 previous similar messages
Jul 13 10:10:05 onyx-51vm3 kernel: INFO: task umount:7320 blocked for more than 120 seconds.
Jul 13 10:10:05 onyx-51vm3 kernel:      Not tainted 2.6.32-431.20.3.el6_lustre.x86_64 #1
Jul 13 10:10:05 onyx-51vm3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 13 10:10:05 onyx-51vm3 kernel: umount        D 0000000000000001     0  7320   7319 0x00000080
Jul 13 10:10:05 onyx-51vm3 kernel: ffff8800627c1aa8 0000000000000086 0000000000000000 ffff88007bd0d400
Jul 13 10:10:05 onyx-51vm3 kernel: ffffffffa063cf3f 0000000000000000 ffff880055446144 ffffffffa063cf3f
Jul 13 10:10:05 onyx-51vm3 kernel: ffff8800627c7af8 ffff8800627c1fd8 000000000000fbc8 ffff8800627c7af8
Jul 13 10:10:05 onyx-51vm3 kernel: Call Trace:
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff81529ac2>] schedule_timeout+0x192/0x2e0
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff81083f30>] ? process_timeout+0x0/0x10
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05bfa8b>] obd_exports_barrier+0xab/0x180 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa0dea57e>] mgs_device_fini+0xfe/0x590 [mgs]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05e6683>] class_cleanup+0x573/0xd30 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05c1866>] ? class_name2dev+0x56/0xe0 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05e83aa>] class_process_config+0x156a/0x1ad0 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05e074b>] ? lustre_cfg_new+0x2cb/0x680 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05e8a89>] class_manual_cleanup+0x179/0x6f0 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05c1866>] ? class_name2dev+0x56/0xe0 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa062632b>] server_put_super+0x96b/0xe50 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff8118b0cb>] generic_shutdown_super+0x5b/0xe0
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff8118b1b6>] kill_anon_super+0x16/0x60
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffffa05ea966>] lustre_kill_super+0x36/0x60 [obdclass]
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff8118b957>] deactivate_super+0x57/0x80
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff811ab35f>] mntput_no_expire+0xbf/0x110
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff811abeab>] sys_umount+0x7b/0x3a0
Jul 13 10:10:05 onyx-51vm3 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jul 13 10:10:10 onyx-51vm3 kernel: Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
Jul 13 10:11:38 onyx-51vm3 kernel: Lustre: 3023:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1405271473/real 1405271473]  req@fff

This probably deserves a new ticket, but it only happened once in July, so it is probably not serious enough to be a Critical ticket here?

Comment by Di Wang [ 28/Jul/14 ]

Hmm, this umount MGS failure only happens for http://review.whamcloud.com/#/c/10249/, so I am not sure whether it is related to that patch. I will create a new ticket if I see more failures.

Comment by Peter Jones [ 29/Jul/14 ]

Bobijam

Could you please look into this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 30/Jul/14 ]

The failure with the "waiting for obd_unlinked_exports more than XX" signature is a duplicate of LU-3665, and its fix is tracked at http://review.whamcloud.com/#/c/9350.

Comment by Nathaniel Clark [ 31/Jul/14 ]

I agree the symptom is the same, but fixing obdfilter-survey (LU-3665) doesn't seem like it will fix the failure in sanity/132. LU-3665 is caused by extra objects that are created via "lctl create" and not cleaned up.

Comment by Zhenyu Xu [ 07/Aug/14 ]

From the first report http://maloo.whamcloud.com/test_sets/e5783778-f887-11e3-b13a-52540035b04c.

02:17:13:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252206/real 1403252206]  req@ffff880045c20800 x1471410848609332/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403252212 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
02:17:13:LustreError: 5280:0:(mgc_request.c:516:do_requeue()) failed processing log: -5
02:17:13:LustreError: 5280:0:(mgc_request.c:516:do_requeue()) Skipped 15 previous similar messages
02:17:13:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252206/real 1403252206]  req@ffff880065b40400 x1471410848609204/t0(0) o38->lustre-MDT0000-lwp-OST0007@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252217 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
02:17:13:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252206/real 1403252206]  req@ffff88006a05d000 x1471410848609284/t0(0) o38->lustre-MDT0000-lwp-OST0004@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252222 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
02:17:13:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252216/real 1403252216]  req@ffff880065ed6400 x1471410848609596/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403252227 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252221/real 1403252221]  req@ffff880074a52400 x1471410848609744/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252237 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252241/real 1403252241]  req@ffff880042557800 x1471410848610120/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252262 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252276/real 1403252276]  req@ffff880073b55400 x1471410848610676/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403252301 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252341/real 1403252341]  req@ffff880074e19400 x1471410848612092/t0(0) o38->lustre-MDT0000-lwp-OST0006@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252366 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 18 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252476/real 1403252476]  req@ffff8800672e3800 x1471410848614808/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252501 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403252746/real 1403252746]  req@ffff88006633c800 x1471410848620304/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403252771 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 80 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403253266/real 1403253266]  req@ffff880027636c00 x1471410848630652/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403253291 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 157 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403253866/real 1403253866]  req@ffff880074dd0000 x1471410848642692/t0(0) o250->MGC10.1.6.21@tcp@10.1.6.21@tcp:26/25 lens 400/544 e 0 to 1 dl 1403253891 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 179 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403254471/real 1403254471]  req@ffff880074dc6c00 x1471410848654508/t0(0) o38->lustre-MDT0000-lwp-OST0004@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403254496 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 181 previous similar messages
03:05:16:Lustre: 5258:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403255086/real 1403255086]  req@ffff880064c9e000 x1471410848666436/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1403255111 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

There are MGS_CONNECT and MDS_CONNECT RPCs being issued again and again (not resent); it seems that the reconnect engine does not honor the fact that the obd devices (the mgc obd and the lwp obd) have been cleaned up and keeps trying to reconnect them.

Comment by Oleg Drokin [ 14/Aug/14 ]

I think the root issue comes from this message:

02:17:17:Lustre: lustre-MDT0002 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 11. Is it stuck?
...

Which basically means the obd has some extra references, and it is those references that prevent it from going away, so it still tries to reconnect?

This sort of message is what is seen in all bugs of this class.

Comment by Jian Yu [ 14/Aug/14 ]

One more instance on master branch: https://testing.hpdd.intel.com/test_sets/a0ca0910-23fe-11e4-84ee-5254006e85c2

Comment by Zhenyu Xu [ 15/Aug/14 ]

Is it possible that this is the result of the never-dying connect requests being regenerated over and over, and that they hold the obd refcount?

Comment by Jian Yu [ 22/Aug/14 ]

Hmm, this umount MGS failure only happens for http://review.whamcloud.com/#/c/10249/, so I am not sure whether it is related to that patch. I will create a new ticket if I see more failures.

While testing patch http://review.whamcloud.com/11539 on the Lustre b2_5 branch, sanity-lfsck test 0 hit the MGS unmount failure. I created a new ticket, LU-5539.

Comment by Jian Yu [ 28/Aug/14 ]

While testing patch http://review.whamcloud.com/11574 on the Lustre b2_5 branch with FSTYPE=zfs, sanity test 132 hit the same failure as in this ticket.
Maloo report: https://testing.hpdd.intel.com/test_sets/271e44e8-2d5c-11e4-b550-5254006e85c2

Comment by Isaac Huang (Inactive) [ 05/Sep/14 ]

I checked the Maloo report above and found a deadlock on the OSS very similar to the latest ones in LU-4950:

  • ZFS had completed writing the new blocks in the syncing txg and had updated the new uberblocks, but got stuck somewhere in the middle of a WRITE_FLUSH_FUA, presumably to make sure the uberblocks hit persistent storage.
  • So txg_sync couldn't complete; the syncing thread got stuck there.
  • As the transaction groups couldn't move forward, the ll_ost00_008 thread also got stuck in txg_wait_open().

I haven't figured out why ZFS is stuck doing the WRITE_FLUSH_FUA, but there are two things to move forward at this point (see the sketch after this list):
1. Set zfs_nocacheflush:
options zfs zfs_nocacheflush=1
This makes DKIOCFLUSHWRITECACHE a no-op. The deadlock may then move elsewhere or simply disappear.
2. Whether zfs_nocacheflush is set or not, once the timeout happens it would be useful to gather "zpool events -v" output on the OSS, which would give more details on the state of the stuck zio.
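
A minimal shell sketch of the two suggestions above, run on the OSS (the pool name and output path are illustrative, not taken from the test config; the parameter path is the standard ZFS module parameter location):

# 1. disable cache flushes persistently (takes effect on the next module load)...
echo "options zfs zfs_nocacheflush=1" >> /etc/modprobe.d/zfs.conf
# ...or toggle it at runtime if the zfs module is already loaded
echo 1 > /sys/module/zfs/parameters/zfs_nocacheflush

# 2. once the umount hangs, capture the zio/pool state for later analysis
zpool events -v > /tmp/zpool-events.txt
zpool status -v lustre-ost1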

Comment by Alex Zhuravlev [ 15/Sep/14 ]

Isaac, can you explain what signs point to WRITE_FLUSH_FUA, please?

Comment by Isaac Huang (Inactive) [ 16/Sep/14 ]

Alex, in the OSS stack dump:
z_ioctl_iss/0 S 0000000000000001 0 10013 2 0x00000080
Call Trace:
[<ffffffffa024aa46>] ? vdev_disk_io_start+0x156/0x1b0 [zfs]

The z_ioctl_iss thread handles ZIO_TYPE_IOCTL, and the only command is DKIOCFLUSHWRITECACHE, which is handled by vdev_disk_io_flush(). I suspected that vdev_disk_io_flush() was blocking in bio = bio_alloc(GFP_NOIO, 0), which blocks until memory is available. Also, the txg_sync thread was waiting in spa_sync()->zio_wait() (probably on the IOCTL zio), and the ll_ost00_008 thread was blocking in txg_wait_open(). That's why I suspected the DKIOCFLUSHWRITECACHE ioctl was blocking and preventing the txgs from moving forward - the WRITE_FLUSH_FUA bio had probably not been submitted yet.

Comment by Alex Zhuravlev [ 16/Sep/14 ]

Isaac, have a look here - https://testing.hpdd.intel.com/test_logs/c2a5b9e6-2d5c-11e4-b550-5254006e85c2/show_text
I didn't notice anything except these:

Sep 12 19:57:07 shadow-11vm3 kernel: ll_ost00_001 D 0000000000000000 0 9264 2 0x00000080
Sep 12 19:57:07 shadow-11vm3 kernel: ffff88007c5c7960 0000000000000046 000000007223d500 0000000000000001
Sep 12 19:57:07 shadow-11vm3 kernel: ffff88006e41a1f0 0000000000000082 ffff88006fd1cae8 ffff8800736e5d80
Sep 12 19:57:07 shadow-11vm3 kernel: ffff88007223dab8 ffff88007c5c7fd8 000000000000fbc8 ffff88007223dab8
Sep 12 19:57:07 shadow-11vm3 kernel: Call Trace:
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff8109b1ee>] ? prepare_to_wait_exclusive+0x4e/0x80
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa014347d>] cv_wait_common+0xed/0x100 [spl]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa01434e5>] __cv_wait+0x15/0x20 [spl]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa0242f7b>] txg_wait_open+0x7b/0xa0 [zfs]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa020a95e>] dmu_tx_wait+0x29e/0x2b0 [zfs]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff8152ad0e>] ? mutex_lock+0x1e/0x50
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa020aa01>] dmu_tx_assign+0x91/0x490 [zfs]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa0e8154d>] osd_trans_start+0xed/0x430 [osd_zfs]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa0f81fac>] ofd_trans_start+0x7c/0x100 [ofd]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa0f835c0>] ofd_object_destroy+0x3b0/0x850 [ofd]

Sep 12 19:57:07 shadow-11vm3 kernel: txg_quiesce S 0000000000000000 0 307 2 0x00000080
Sep 12 19:57:07 shadow-11vm3 kernel: ffff8800736e5d70 0000000000000046 000007989b5d2dfd 0000000000000000
Sep 12 19:57:07 shadow-11vm3 kernel: 0000000000000000 0000000000000001 ffff8800736e5d50 0000000000000086
Sep 12 19:57:07 shadow-11vm3 kernel: ffff88006fd1d058 ffff8800736e5fd8 000000000000fbc8 ffff88006fd1d058
Sep 12 19:57:07 shadow-11vm3 kernel: Call Trace:
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff8109b1ee>] ? prepare_to_wait_exclusive+0x4e/0x80
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa014347d>] cv_wait_common+0xed/0x100 [spl]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff81058bd3>] ? __wake_up+0x53/0x70
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa01434c5>] __cv_wait_interruptible+0x15/0x20 [spl]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa02432fd>] txg_thread_wait+0x1d/0x40 [zfs]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa0243629>] txg_quiesce_thread+0x299/0x2a0 [zfs]

Sep 12 19:57:07 shadow-11vm3 kernel: txg_sync D 0000000000000001 0 308 2 0x00000080
Sep 12 19:57:07 shadow-11vm3 kernel: ffff8800736e7ba0 0000000000000046 00000000ffffffff 000019aa6a2a17f0
Sep 12 19:57:07 shadow-11vm3 kernel: ffff8800736e7b10 ffff8800717ffc20 000000000038aee4 ffffffffabecd08f
Sep 12 19:57:07 shadow-11vm3 kernel: ffff88006fd1c5f8 ffff8800736e7fd8 000000000000fbc8 ffff88006fd1c5f8
Sep 12 19:57:07 shadow-11vm3 kernel: Call Trace:
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff81529ea3>] io_schedule+0x73/0xc0
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa014341c>] cv_wait_common+0x8c/0x100 [spl]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa01434a8>] __cv_wait_io+0x18/0x20 [spl]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa02890ab>] zio_wait+0xfb/0x1b0 [zfs]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa021ebe3>] dsl_pool_sync+0x2b3/0x3f0 [zfs]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa0236e4b>] spa_sync+0x40b/0xa60 [zfs]
Sep 12 19:57:07 shadow-11vm3 kernel: [<ffffffffa0243916>] txg_sync_thread+0x2e6/0x510 [zfs]

Comment by Isaac Huang (Inactive) [ 17/Sep/14 ]

My previous analysis might be inaccurate as I neglected the '?' in the stack entry:

z_ioctl_iss/0 S 0000000000000001 0 10013 2 0x00000080
Call Trace:
[<ffffffffa024aa46>] ? vdev_disk_io_start+0x156/0x1b0 [zfs]

Which means that the vdev_disk_io_start() function likely returned and the WRITE_FLUSH_FUA bio had already been submitted - the z_ioctl_iss thread was just waiting for more work to do. I checked the OSS stack dumps, and the z_* threads were all idle, so there was no more IO left to do. This led me to believe that:
1. The IO for syncing the txg was all done, and WRITE_FLUSH_FUA was submitted as an IO barrier, i.e. to make sure all previous writes hit persistent storage.
2. But the WRITE_FLUSH_FUA somehow didn't complete, causing txg_sync to wait and hang.

The troubleshooting suggestions stay the same: on the OSS, set zfs_nocacheflush and gather "zpool events -v" output.

Comment by Alex Zhuravlev [ 17/Sep/14 ]

Why do you think vdev_disk_io_start() was already used in this txg? Could that be a leftover from the previous txg?

Comment by Isaac Huang (Inactive) [ 17/Sep/14 ]

Alex, it was only a guess. If we had "zpool events -v" outputs, that'd provide a lot more information on the stuck zio.

Also, I had a hunch that the stuck zio had something to do with the way the zpool was configured. There seemed to be too many levels of indirection - the use of /dev/lvm-Role_OSS/ indicated LVM, which seems unnecessary since ZFS can manage its own volumes (see the sketch below).
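
For comparison, a rough sketch of the two layouts; the device, volume group, and logical volume names are made up for illustration and are not taken from the actual test configuration:

# current autotest layout: the zpool sits on top of an LVM logical volume
pvcreate /dev/sdb
vgcreate lvm-Role_OSS /dev/sdb
lvcreate -n ost1 -L 8G lvm-Role_OSS
zpool create lustre-ost1 /dev/lvm-Role_OSS/ost1

# simpler alternative: let ZFS manage the disk directly, one less block layer
zpool create lustre-ost1 /dev/sdb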

Comment by Andreas Dilger [ 18/Sep/14 ]

This seems to be the major cause of review-zfs test failures.

Comment by Peter Jones [ 23/Sep/14 ]

Isaac

This issue continues to disrupt zfs testing. What do you advise?

Thanks

Peter

Comment by Isaac Huang (Inactive) [ 24/Sep/14 ]

Peter,

The auto-test results unfortunately didn't include sufficient ZFS debug information, see TEI-1729. I'm working with Jian to manually start a test session and reproduce the failure so I'd be able to collect the debug information needed.

Comment by Isaac Huang (Inactive) [ 26/Sep/14 ]

It seems that LVM plus the deadline scheduler could be the cause of the IO timeouts; see LU-4950 for the latest results.

Comment by Jian Yu [ 02/Oct/14 ]

While verifying http://review.whamcloud.com/12001 on the Lustre b2_5 branch with FSTYPE=zfs, sanity test 900 hung while unmounting the MDS.

Console log on MDS:

06:27:21:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
06:27:21:Lustre: lustre-MDT0000: Not available for connect from 10.1.4.26@tcp (stopping)
06:27:21:LustreError: 3264:0:(client.c:1079:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880036038800 x1480755389930956/t0(0) o13->lustre-OST0000-osc-MDT0000@10.1.4.26@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
06:27:21:LustreError: 3264:0:(client.c:1079:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88005bfda240[0x0, 1, [0x1:0x0:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff88005bfda298
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff880059957ac0osd-zfs-object@ffff880059957ac0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88005bfda240
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88006891de08[0x0, 0, [0x100020000:0x1d03:0x0] hash lru]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osp@ffff88006891de60osp-object@ffff88006891de08
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88006891de08
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88005bfda9c0[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff88005bfdaa18
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff880059957ee0osd-zfs-object@ffff880059957ee0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88005bfda9c0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88005bfda3c0[0x1, 1, [0x200000003:0x2:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff88005bfda418
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff880059957cd0osd-zfs-object@ffff880059957cd0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88005bfda3c0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88005bfda300[0x0, 1, [0x200000003:0x3:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff88005bfda358
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff880059957bc8osd-zfs-object@ffff880059957bc8
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88005bfda300
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff880027ebfc80[0x0, 1, [0xa:0x0:0x0] hash exist]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....local_storage@ffff880027ebfcd8
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osd-zfs@ffff8800599579b8osd-zfs-object@ffff8800599579b8
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff880027ebfc80
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) header@ffff88006891ded0[0x0, 0, [0x100000000:0x1e0e:0x0] hash lru]{
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) ....osp@ffff88006891df28osp-object@ffff88006891ded0
06:27:21:LustreError: 3249:0:(osp_dev.c:777:osp_device_free()) } header@ffff88006891ded0
06:27:21:LustreError: 3249:0:(lu_object.c:1252:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
06:27:21:LustreError: 3249:0:(lu_object.c:1252:lu_device_fini()) LBUG
06:27:21:Pid: 3249, comm: obd_zombid
06:27:22:
06:27:22:Call Trace:
06:27:22: [<ffffffffa05ef895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
06:27:22: [<ffffffffa05efe97>] lbug_with_loc+0x47/0xb0 [libcfs]
06:27:22: [<ffffffffa075be88>] lu_device_fini+0xb8/0xc0 [obdclass]
06:27:22: [<ffffffffa076287e>] dt_device_fini+0xe/0x10 [obdclass]
06:27:22: [<ffffffffa104e93f>] osp_device_free+0xff/0x220 [osp]
06:27:22: [<ffffffffa0746e5d>] class_decref+0x46d/0x550 [obdclass]
06:27:22: [<ffffffffa07257df>] obd_zombie_impexp_cull+0x30f/0x5d0 [obdclass]
06:27:22: [<ffffffffa0725b05>] obd_zombie_impexp_thread+0x65/0x190 [obdclass]
06:27:22: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
06:27:22: [<ffffffffa0725aa0>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
06:27:22: [<ffffffff8109abf6>] kthread+0x96/0xa0
06:27:22: [<ffffffff8100c20a>] child_rip+0xa/0x20
06:27:22: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
06:27:22: [<ffffffff8100c200>] ? child_rip+0x0/0x20
06:27:22:
06:27:22:Kernel panic - not syncing: LBUG

Maloo report: https://testing.hpdd.intel.com/test_sets/6f56a29c-4a07-11e4-95b1-5254006e85c2

Comment by Isaac Huang (Inactive) [ 02/Oct/14 ]

The failure above was different from the zio timeout here in this ticket. The OSS zpools were idle, and on the MDS the zpool seemed to have been exported already (because the txg_sync thread already quit). It might have something to do with the patch being tested, but I want to focus on the zio timeout issue here.

Comment by Jian Yu [ 02/Oct/14 ]

Hi Isaac,

According to problem 2) in the earlier comments https://jira.hpdd.intel.com/browse/LU-5242?focusedCommentId=87897&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-87897 in this ticket, I reported the above failure here.

I also saw that the failure was reported in LU-3623, so let me track it over there.

Comment by Jian Yu [ 07/Oct/14 ]

One more instance on master branch:
https://testing.hpdd.intel.com/test_sets/b2697762-4e12-11e4-9581-5254006e85c2

Comment by Doug Oucharek (Inactive) [ 15/Oct/14 ]

Same failure has occurred on master branch for test_23a: https://testing.hpdd.intel.com/test_sets/90cd3a7e-53d7-11e4-9c8a-5254006e85c2

Comment by Henri Doreau (Inactive) [ 23/Oct/14 ]

Looks like another instance, on master: https://testing.hpdd.intel.com/test_sets/a40e6a14-5a2f-11e4-8dbb-5254006e85c2

Comment by Jian Yu [ 25/Oct/14 ]

One more instance on master branch: https://testing.hpdd.intel.com/test_sets/b8f4813c-5c25-11e4-b9ce-5254006e85c2

Comment by Isaac Huang (Inactive) [ 29/Oct/14 ]

I was able to reproduce it today and kept the system live for debugging. The txg_sync thread is stuck in zio_wait(), but the weird thing is that there is no IO pending or running at either the ZFS vdev queue or the Linux block device queue. At this point I tend to think this is a ZFS issue rather than a Lustre issue.

Comment by Jian Yu [ 29/Oct/14 ]

One more instance on Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/91627324-5fa7-11e4-895a-5254006e85c2

Comment by Isaac Huang (Inactive) [ 30/Oct/14 ]

Likely it can be fixed by this ZFS patch:
https://github.com/zfsonlinux/zfs/pull/2828

I'll test it soon.

Comment by Dmitry Eremin (Inactive) [ 10/Nov/14 ]

One more failure on master: https://testing.hpdd.intel.com/sub_tests/bee79924-6861-11e4-a1b6-5254006e85c2

Comment by Andreas Dilger [ 13/Nov/14 ]

Isaac, any update on testing this patch? While we prefer to stick with the upstream releases, we are also able to land a patch into our own ZFS Git repo so that our own testing passes. If this fixes the problem we are seeing, it is also worthwhile to update the issue on GitHub with this information so that it will speed up the patch landing into a release.

Comment by Isaac Huang (Inactive) [ 19/Nov/14 ]

I'm still troubleshooting some local ZFS test failures after merging the patch. I'll apply it to the Jenkins ZFS build once all tests pass.

Comment by Jian Yu [ 03/Dec/14 ]

One more instance on Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/e3f3287e-7b0c-11e4-8c6d-5254006e85c2

Comment by Isaac Huang (Inactive) [ 04/Dec/14 ]

I've been testing with a patched ZFS build and so far haven't been able to reproduce it. But the strange thing is, I can't reproduce it either with a previous build where I could easily reproduce it with sanity test_132. So now I'm not sure whether the patch fixed it, or whether some other change made it much harder to reproduce. I'm looking into why I can no longer reproduce it with the same build and test hardware where I was able to reproduce it easily before.

Comment by Andreas Dilger [ 08/Dec/14 ]

Is it possible to push the upstream GitHub patch into our local ZFS git repository, so that we are always testing with that patch? That would allow us to get better testing, to see if it fixes this problem, and to see if it introduces some other problems or not. This is one of the last issues that is preventing review-zfs to be enforced for all patches.

Comment by Alex Zhuravlev [ 09/Dec/14 ]

Here is an example of how to add a patch: http://git.whamcloud.com/ff/daos_lustre.git/commit/21f403380e365b083cf4e48003e863a1661a5a60

Comment by Gerrit Updater [ 12/Dec/14 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13050
Subject: LU-5242 build: Add zfs patch for Illumos 5244
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e5b4e4305c9cdf7375ad5542a7118f3deb4cf7e4

Comment by Gerrit Updater [ 16/Dec/14 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13093
Subject: LU-5242 tests: except sanity test_132 for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5c867565cdc8d4b1da41ef17e9bc821b63809151

Comment by Andreas Dilger [ 16/Dec/14 ]

Based on the test results in https://testing.hpdd.intel.com/test_sessions/f8d48678-84d7-11e4-985f-5254006e85c2 it appears that the 13050 patch does not resolve the problem being seen here. Is there anything new that can be learned from the logs in these failed tests?

This problem and LU-4536 are really blocking the ability to enable review-zfs in enforcing mode. It is passing a large fraction of tests, but these are causing the most failures. Normally I'd say this test should be disabled, but then I'd expect the hang to happen again at some later unmount, though I don't see that. Is there something special that this test does with the backing storage that is causing ZFS grief, or is it possible there is a bug in the SOM codepath that is causing a reference leak and hanging the unmount?

For now, I've pushed a patch to disable this test, because SOM is not a supported feature, and it will continue to be tested with ldiskfs anyway.

Comment by Gerrit Updater [ 17/Dec/14 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/13093/
Subject: LU-5242 tests: except sanity test_132 for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 12bfc80b08d2b70c0052b271bfa2d34995b5892e

Comment by Andreas Dilger [ 18/Dec/14 ]

Sadly, it seems that now that test_132() has been added to the ALWAYS_EXCEPT list, test_133() has started causing timeouts at OST unmount time:
https://testing.hpdd.intel.com/test_sets/7e564d44-85dd-11e4-b909-5254006e85c2

This makes me wonder whether something happening in an earlier test is causing a problem (e.g. a refcount leak) so that any later unmount will hit it. If we could hit this reliably in a shorter test (e.g. just sanity in a loop, even if it only hit 1 in 5 times), then we might be able to bisect which subtest is causing the problem to isolate the root cause.

Comment by Alex Zhuravlev [ 18/Dec/14 ]

We could change t-f to remount the OST after every sub-test (or randomly)?

Comment by Andreas Dilger [ 18/Dec/14 ]

There used to be the ability to do this in sanity.sh, by defining the "setup" and "cleanup" macros, but I suspect that is long gone.

Comment by Gerrit Updater [ 18/Dec/14 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13130
Subject: LU-5242 tests: DEBUG ONLY unmount/remount every test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 48c4b3ea65ae7cf69c84df61c3b76f15bb336f5f

Comment by nasf (Inactive) [ 24/Dec/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/c687c060-8bb5-11e4-8220-5254006e85c2

Comment by Oleg Drokin [ 31/Dec/14 ]

If diagnosing this problem requires some extra data that the autotest framework does not collect, a patch needs to be created to collect this data, if this is something not easily reproducible on other systems.

Comment by Andreas Dilger [ 06/Jan/15 ]

Looking into the OST syslog from one of the recent LU-5242 failures, I see that sanity test_133g is hanging at OST unmount because of increased refcounts on the export structure, and again this can be traced back to OST threads blocked on starting a ZFS TXG.

Previous suggestions in this bug and LU-4716 included:

1. Set zfs_nocacheflush:
options zfs zfs_nocacheflush=1
This makes DKIOCFLUSHWRITECACHE a noop. Then the deadlock may move elsewhere or simply disappear.
2. Whether zfs_nocacheflush is set or not, once the timeout happens it would be useful to gather "zpool events -v" output on the OSS, which would give more details on the state of the stuck zio.

I think it makes sense to make sure that our test systems:
1. Use whole disks for zfs pools on guest VMs.
2. Use noop IO scheduler for corresponding disks on host OS.

Can any of these options be tested/implemented in some reasonable manner?
Minh, is there some way to check what the IO scheduler is for the VM host system?

I'd like to get some forward progress on these issues.
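
For reference, checking and switching the scheduler for a block device on the host is a one-liner (a sketch only; /dev/sdb is just an example device name, and the setting does not persist across reboots):

cat /sys/block/sdb/queue/scheduler          # the active scheduler is shown in square brackets
echo noop > /sys/block/sdb/queue/scheduler  # select the noop elevator for this device
cat /sys/block/sdb/queue/scheduler          # verify the change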

Comment by Andreas Dilger [ 07/Jan/15 ]

I've noticed in the test_133g hangs that there is a watchdog timeout in test_116a that indicates the source of the deadlock is at that point or earlier (this is easily seen in the OST dmesg log, but doesn't have any timestamps):

== sanity test 116a: stripe QOS: free space balance ===================== 18:41:05 (1420224065)
INFO: task ll_ost00_017:10306 blocked for more than 120 seconds.
ll_ost00_017  D 0000000000000001     0 10306      2 0x00000080
Call Trace:
 [<ffffffff8152bd65>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152bec3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff8128fbe3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152b3c2>] ? down_write+0x32/0x40
 [<ffffffffa0e79d81>] osd_object_write_lock+0x61/0x70 [osd_zfs]
 [<ffffffffa0fbf076>] ofd_object_destroy+0x66/0x8e0 [ofd]
 [<ffffffffa0fb964d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
 [<ffffffffa0fb2f6a>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
 [<ffffffffa09df97e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa098f711>] ptlrpc_main+0xe41/0x1960 [ptlrpc]

Unfortunately the timestamps on the console logs are not very accurate (they appear to only update once a minute or so), and this is a long-running test (about 200s), so test_116a may in fact be the source of the problem itself.

I looked at all 20 of the test_133g hang OST dmesg logs from the past two weeks, and in every case the first test reporting this watchdog is test_116a, which itself takes about 200s to run. That makes it very likely that test_116a (or a test run just before it) is the source of the problem: the runtimes of the previous tests vary by a few seconds, so if the hang had started long before test_116a, the watchdog message would have appeared during an earlier test in at least some cases.

Comment by Gerrit Updater [ 07/Jan/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13264
Subject: LU-5242 tests: skip sanity test_116a for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 691009c5c0e38317724cdac764080fa45ad96667

Comment by Isaac Huang (Inactive) [ 09/Jan/15 ]

I just managed to reproduce it and the symptoms matched ZoL bug 2523, which was recently fixed by this SPL patch:
https://github.com/zfsonlinux/spl/commit/a3c1eb77721a0d511b4fe7111bb2314686570c4b

Nathaniel is working on updating our ZFS/SPL version to 0.6.3-1.2 in LU-6105, which includes the fix above.

Comment by Andreas Dilger [ 14/Jan/15 ]

Unfortunately, the upgrade to ZFS 0.6.3-1.2 doesn't seem to have solved this problem:
https://testing.hpdd.intel.com/test_sets/19c8b610-9bc1-11e4-857a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/deca9712-9bc1-11e4-857a-5254006e85c2

It might be that there are multiple issues at work here? I haven't checked the latest failures to see if they have different symptoms or not.

Comment by Isaac Huang (Inactive) [ 14/Jan/15 ]

The failures looked similar - the txg_sync thread stuck in zio_wait(). Either there are multiple causes of this same symptom or the upstream fix didn't completely close the race. I'm looking into that. Meanwhile, I think we should go ahead and land it and see whether it reduces the occurrences, since quite a few ZoL users reported that the fix did seem to work for them.

Comment by Gerrit Updater [ 15/Jan/15 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13416
Subject: LU-5242 tests: DEBUG ONLY obd exports after every test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b19ed84cad9fe49f79bfc1096ec15141f70d16f0

Comment by Isaac Huang (Inactive) [ 15/Jan/15 ]

I just had a close look at:
https://testing.hpdd.intel.com/test_sets/deca9712-9bc1-11e4-857a-5254006e85c2

Although the txg_sync thread was stuck in D state in zio_wait(), it probably wasn't the cause of the test hang. Here's what happened:
1. At the start of sanity test 116a, 20:16:13, both ll_ost00_013 and ll_ost00_015 were blocked in D state in osd_object_write_lock().
2. They stayed blocked there and never recovered. These two threads probably held references that prevented the OST from being unmounted.

It is unlikely that the two threads hung because of the txg_sync hang in zio_wait() that we've seen previously, because if that were the case:

  • There would have been similar messages about the txg_sync thread being blocked for more than 120 seconds, but there were none.
  • With the txg_sync thread blocked, no write to the OST could complete; in fact, no transaction could be opened. The OST would simply get stuck there. However, the tests went on OK until sanity test 133g at 20:27.

So I'd tend to say that the test hang was a result of the hung ll_ost threads, rather than of the txg_sync zio_wait deadlock.

The txg_sync thread did show up in D state in zio_wait() in the stack dump taken after sanity test 133g wouldn't complete. But very likely it just happened to be there waiting for some IO to complete on the slow VM disks. It's quite common to see the txg_sync thread blocked waiting for IO completion for more than 20 seconds on the test VMs. On the same OSS I saw ll_ost00_011 blocked for 40 seconds in txg_wait_open() at 20:09:38 and then recover - so the txg_sync thread must have been blocked on IO for roughly 40 seconds and then moved on. That's a long enough window for the txg_sync thread to just happen to be in D state in zio_wait() when the stacks are dumped.

So, I'd still suggest merging the patch to upgrade to ZFS 0.6.3-1.2. So far there have been no new reports of the zio_wait hang on ZoL since the SPL fix was merged.

Comment by Andreas Dilger [ 16/Jan/15 ]

I tried commenting out test_116a in http://review.whamcloud.com/13264, but that also hit stuck threads with the same stack trace in test_120g (twice, and that test runs for about 55s), and in test_120b (once, and it only runs for 1s - https://testing.hpdd.intel.com/test_sets/6903d29a-983e-11e4-8f7b-5254006e85c2). Using the test_120b stack dump time, it puts the hang back in test_103c or earlier. Using the test_120g stack dump time (start/end window) it puts the failure in test_118c-test_118k or earlier, so there is no overlap. It seems like there is a specific test or behaviour that is triggering this, since the stacks are always dumped in test_116a (except when I had excluded it). If it was just a random race condition, it would appear in other tests as well.

Comment by Isaac Huang (Inactive) [ 16/Jan/15 ]

I just noticed that the ORI-616 fix was missing on master. I've asked for confirmation whether it'd be needed still. If yes, then that might explain some of the symptoms here as a refcount was missing.

Comment by Gerrit Updater [ 16/Jan/15 ]

Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13431
Subject: LU-5242 osd-zfs: verify ref counting
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 22dc2ec61c815d532c507c68e871f4aa0471c423

Comment by Andreas Dilger [ 16/Jan/15 ]

Isaac, given how often this bug is being hit (about 50% of all review-zfs test runs), it is worthwhile to just try porting the patch to master and running a bunch of sanity tests via Test-Parameters: to see if it passes or fails.

Comment by Andreas Dilger [ 19/Jan/15 ]

Unfortunately, I don't think that the ORI-616 patch is relevant. It looks like the ORI-616 patch http://review.whamcloud.com/2560 that adds the lu_object_get() call in osd_bufs_get() is later dropped by ORI-645 patch http://review.whamcloud.com/2607 (presumably intentionally).

Comment by Nathaniel Clark [ 21/Jan/15 ]

I've just hit this on my local VM setup and txg_sync is what seems to be holding up the umount of the OST. Attached is a dump of `zpool events -v` from the OSS.

Below is a dump of blocked processes I created using sysrq:

SysRq : Show Blocked State
  task                        PC stack   pid father
ll_ost01_004  D 0000000000000001     0  9350      2 0x00000080
 ffff8800052e5960 0000000000000046 0000000000016880 0000000000000001
 ffffc90008afb350 0000000000000082 ffff880024e73588 ffff880001d1fd80
 ffff880005337098 ffff8800052e5fd8 000000000000fbc8 ffff880005337098
Call Trace:
 [<ffffffff8109b1ee>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa02d847d>] cv_wait_common+0xed/0x100 [spl]
 [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02d84e5>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa03d7f7b>] txg_wait_open+0x7b/0xa0 [zfs]
 [<ffffffffa039f95e>] dmu_tx_wait+0x29e/0x2b0 [zfs]
 [<ffffffff8152a29e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa039fa01>] dmu_tx_assign+0x91/0x490 [zfs]
 [<ffffffffa070f5ad>] osd_trans_start+0xed/0x430 [osd_zfs]
 [<ffffffffa0d9504c>] ofd_trans_start+0x7c/0x100 [ofd]
 [<ffffffffa0d97500>] ofd_object_destroy+0x3d0/0x8e0 [ofd]
 [<ffffffffa0d9176d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
 [<ffffffffa07a4ee0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa07a6780>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa0d8b0ba>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
 [<ffffffffa083396e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa07e3701>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa07e28c0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
ll_ost01_014  D 0000000000000001     0 23470      2 0x00000080
 ffff88000e1e9a70 0000000000000046 ffff8800186e4090 ffff88000e1e9a50
 ffffffffa08617d3 0000000000000000 ffff88000e1e9fd8 ffff8800056d8040
 ffff8800056d85f8 ffff88000e1e9fd8 000000000000fbc8 ffff8800056d85f8
Call Trace:
 [<ffffffff8152b315>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152b473>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff8128f683>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152a972>] ? down_write+0x32/0x40
 [<ffffffffa07171a1>] osd_object_write_lock+0x61/0x70 [osd_zfs]
 [<ffffffffa0d97196>] ofd_object_destroy+0x66/0x8e0 [ofd]
 [<ffffffffa0d9176d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
 [<ffffffffa07a4ee0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa07a6780>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa0d8b0ba>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
 [<ffffffffa083396e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa07e3701>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa07e28c0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
ll_ost01_016  D 0000000000000001     0 27741      2 0x00000080
 ffff88000d2c7a70 0000000000000046 ffff8800186e4090 ffff88000d2c7a50
 ffffffffa08617d3 0000000000000000 ffff88000d2c7fd8 ffff8800054f6080
 ffff8800054f6638 ffff88000d2c7fd8 000000000000fbc8 ffff8800054f6638
Call Trace:
 [<ffffffff8152b315>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152b473>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff8128f683>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152a972>] ? down_write+0x32/0x40
 [<ffffffffa07171a1>] osd_object_write_lock+0x61/0x70 [osd_zfs]
 [<ffffffffa0d97196>] ofd_object_destroy+0x66/0x8e0 [ofd]
 [<ffffffffa0d9176d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
 [<ffffffffa07a4ee0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa07a6780>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa0d8b0ba>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
 [<ffffffffa083396e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa07e3701>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
 [<ffffffffa07e28c0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
txg_sync      D 0000000000000000     0 48138      2 0x00000080
 ffff88000a53dba0 0000000000000046 0000000000000000 0000000000000000
 ffff88000a53db10 ffffffff81061d12 ffff88000a53db60 ffffffff810546b9
 ffff880024e73af8 ffff88000a53dfd8 000000000000fbc8 ffff880024e73af8
Call Trace:
 [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
 [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
 [<ffffffff810a6d21>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff81529433>] io_schedule+0x73/0xc0
 [<ffffffffa02d841c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02d84a8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa041e0ab>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa03b3be3>] dsl_pool_sync+0x2b3/0x3f0 [zfs]
 [<ffffffffa03cbe4b>] spa_sync+0x40b/0xa60 [zfs]
 [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa03d8916>] txg_sync_thread+0x2e6/0x510 [zfs]
 [<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa03d8630>] ? txg_sync_thread+0x0/0x510 [zfs]
 [<ffffffffa02d3c2f>] thread_generic_wrapper+0x5f/0x70 [spl]
 [<ffffffffa02d3bd0>] ? thread_generic_wrapper+0x0/0x70 [spl]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
umount        D 0000000000000001     0 60033  60032 0x00000080
 ffff88001d44da98 0000000000000082 ffff88001d44d9f8 ffff8800182ec000
 ffffffffa151f501 0000000000000000 ffff880001cdb14c ffffffffa151f501
 ffff8800136525f8 ffff88001d44dfd8 000000000000fbc8 ffff8800136525f8
Call Trace:
 [<ffffffff81529ac2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81083f30>] ? process_timeout+0x0/0x10
 [<ffffffffa14aeba6>] obd_exports_barrier+0xb6/0x190 [obdclass]
 [<ffffffffa0d8281f>] ofd_device_fini+0x5f/0x260 [ofd]
 [<ffffffffa14d0b62>] class_cleanup+0x552/0xd10 [obdclass]
 [<ffffffffa14b0b26>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa14d330a>] class_process_config+0x1fea/0x27c0 [obdclass]
 [<ffffffffa0ef71b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa14cc315>] ? lustre_cfg_new+0x435/0x630 [obdclass]
 [<ffffffffa14d3c01>] class_manual_cleanup+0x121/0x870 [obdclass]
 [<ffffffffa14b0b26>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa150c457>] server_put_super+0xb37/0xe50 [obdclass]
 [<ffffffff8118b0cb>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8118b1b6>] kill_anon_super+0x16/0x60
 [<ffffffffa14d5e56>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff8118b957>] deactivate_super+0x57/0x80
 [<ffffffff811ab35f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811abeab>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

iotop shows that txg_sync is writing to disk, and top shows this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                         
48138 root       0 -20     0    0    0 R 64.4  0.0 102:13.51 txg_sync                                         
48060 root      39  19     0    0    0 S 41.2  0.0  64:16.60 z_null_iss/0                                     
48137 root       0 -20     0    0    0 S 23.9  0.0  37:43.30 txg_quiesce                                      
 9350 root      20   0     0    0    0 R  5.3  0.0   8:33.88 ll_ost01_004                                     
   22 root      20   0     0    0    0 S  0.3  0.0  36:24.79 kblockd/0                                        
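
For reference, a blocked-task dump like the one above can be triggered on demand via the kernel sysrq interface (a minimal sketch; run as root on the OSS):

echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions, if not already enabled
echo w > /proc/sysrq-trigger      # dump blocked (D state) tasks to the kernel log
dmesg | tail -n 200               # the "Show Blocked State" output lands here
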
Comment by Alex Zhuravlev [ 22/Jan/15 ]

Call Trace:
[<ffffffff81061d12>] ? default_wake_function+0x12/0x20
[<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
[<ffffffff810a6d21>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff81529433>] io_schedule+0x73/0xc0
[<ffffffffa02d841c>] cv_wait_common+0x8c/0x100 [spl]
[<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa02d84a8>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa041e0ab>] zio_wait+0xfb/0x1b0 [zfs]
[<ffffffffa03b3be3>] dsl_pool_sync+0x2b3/0x3f0 [zfs]
[<ffffffffa03cbe4b>] spa_sync+0x40b/0xa60 [zfs]
[<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
[<ffffffffa03d8916>] txg_sync_thread+0x2e6/0x510 [zfs]

so, it's still in DMU doing I/O ..

Comment by Isaac Huang (Inactive) [ 22/Jan/15 ]

Nathan, if iotop showed txg_sync issuing writes, it could just be waiting for IO on the slow VM. It's quite common to see txg_sync block for tens of seconds waiting for IO to complete on VMs. The important thing is how long it stays blocked there. If it blocks there for hours with no active IO going on, then it's an instance of the zio race we've seen previously. Otherwise it could just be a red herring.

Also, it's not clear whether ll_ost01_014 and ll_ost01_016 blocked before txg_sync or after. In the case of https://testing.hpdd.intel.com/test_sets/deca9712-9bc1-11e4-857a-5254006e85c2 the two OST threads blocked in similar places, but well before the txg_sync thread was blocked.

Comment by Isaac Huang (Inactive) [ 22/Jan/15 ]

Alex, yes, it's still in the DMU doing I/O. But it's quite common to see txg_sync block for tens of seconds waiting for IO to complete on VMs. In the case of https://testing.hpdd.intel.com/test_sets/deca9712-9bc1-11e4-857a-5254006e85c2 the two OST threads blocked in similar places, but well before the txg_sync thread was blocked.

Comment by Alex Zhuravlev [ 22/Jan/15 ]

True, but then I'd expect only those threads to be left at some point, while in all the cases we get txg_sync and another thread in zio_wait()?

Comment by Alex Zhuravlev [ 22/Jan/15 ]

would it be possible to get a crash dump?

Comment by Isaac Huang (Inactive) [ 25/Jan/15 ]

I just reproduced it and ll_ost00_017 was stuck in D state (umount ost1 also stuck):

[<ffffffffa011b415>] cv_wait_common+0x125/0x150 [spl]
[<ffffffffa011b495>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa02215e3>] txg_wait_open+0xa3/0xf0 [zfs]
[<ffffffffa01e4939>] dmu_tx_wait+0x389/0x390 [zfs]
[<ffffffffa01e49e9>] dmu_tx_assign+0xa9/0x520 [zfs]
[<ffffffffa0e6d5ad>] osd_trans_start+0xed/0x430 [osd_zfs]
[<ffffffffa0fb907c>] ofd_trans_start+0x7c/0x100 [ofd]
[<ffffffffa0fbb530>] ofd_object_destroy+0x3d0/0x8e0 [ofd]
[<ffffffffa0fb579d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
[<ffffffffa0faf0ea>] ofd_destroy_hdl+0x2fa/0xb60 [ofd]
[<ffffffffa09dbd9e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
[<ffffffffa098b891>] ptlrpc_main+0xe41/0x1960 [ptlrpc]

However, it was actually running in D rather than sleeping, using lots of CPU time:
8058 root 20 0 0 0 0 D 8.3 0.0 68:01.97 ll_ost00_017

So it looked like cv_wait_common() was actually spinning on a damaged lock. Also, the TXGs of both OST pools were moving forward, so this is not the txg_sync hang we've seen previously.

I've never seen any report like this on ZoL, so I'd tend to think it's related to the way we use ZoL. I wonder whether it could be some shutdown race where the umount process had freed some resource that ll_ost00_017 was still trying to use, so that cv_wait_common() just got stuck on a damaged lock. The umount stack:

[<ffffffffa0724bd6>] obd_exports_barrier+0xb6/0x190 [obdclass]
[<ffffffffa0fa684f>] ofd_device_fini+0x5f/0x260 [ofd]
[<ffffffffa0746072>] class_cleanup+0x552/0xd10 [obdclass]
[<ffffffffa074881a>] class_process_config+0x1fea/0x27c0 [obdclass]
[<ffffffffa0749111>] class_manual_cleanup+0x121/0x870 [obdclass]
[<ffffffffa07819d7>] server_put_super+0xb37/0xe50 [obdclass]
[<ffffffff8118b61b>] generic_shutdown_super+0x5b/0xe0
[<ffffffff8118b706>] kill_anon_super+0x16/0x60
[<ffffffffa074b366>] lustre_kill_super+0x36/0x60 [obdclass]
[<ffffffff8118bea7>] deactivate_super+0x57/0x80
[<ffffffff811ab8af>] mntput_no_expire+0xbf/0x110
[<ffffffff811ac3fb>] sys_umount+0x7b/0x3a0

It appeared that class_cleanup()=>obd_precleanup() had been called, and possibly ldto_device_fini() as well, i.e. osd_device_fini() and therefore dmu_objset_disown(o->od_os). At the same time, osd_trans_start() could still be accessing od_os from dmu_tx_assign(oh->ot_tx, TXG_WAIT).

If there can be a race between osd_device_fini() and osd_trans_start(), that would explain what I see here. Comments are very welcome.

Comment by Isaac Huang (Inactive) [ 25/Jan/15 ]

Also, the pool that umount was trying to unmount seemed to be messed up:

cat lustre-ost1/txgs; sleep 1; cat lustre-ost1/txgs
469 0 0x01 3 336 31269437119932 94328219768999
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
504280372 94327859132340   S     0            0            0            0        0        52962        5155         34018        0           
504280373 94327859185302   W     0            0            0            0        0        42311        5308         0            0           
504280374 94327859227613   O     0            0            0            0        0        0            0            0            0           
469 0 0x01 3 336 31269437119932 94329224682995
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
504282558 94329223042758   S     0            0            0            0        0        819957       7673         68244        0           
504282559 94329223862715   W     0            0            0            0        0        119631       5697         0            0           
504282560 94329223982346   O     0            0            0            0        0        0            0            0            0

The TXG number increased by 2186 in just 1 second - at the same time, LNet sent/received no messages and all OST threads were idle - very weird. Again, no similar report was found on ZoL. Likely we're doing something wrong here.

Comment by Alex Zhuravlev [ 25/Jan/15 ]

class_cleanup() calls class_disconnect_exports() before going into the precleanup process, so all the exports should be disconnected; then obd_exports_barrier() is called to ensure all exports are gone, and only after that, in ofd_stack_fini(), do we disconnect from the OSD so the OSD can start to unmount. I think there should be no racing dmu_objset_disown() here. Also notice that dsl_dataset_disown() doesn't really do anything like "umount"; it just marks the objset as "not used", while all the txg logic is driven a level higher - for the whole pool.

Comment by Andreas Dilger [ 26/Jan/15 ]

Brian and Olaf were looking at this bug on Friday and were going to post, but since they didn't get a chance to do so yet I'll recap their findings as best I can. When ZFS is compiled with ASSERT checks enabled, there is a test failure around sanity subtest 100 that shows a bonus buffer being modified without a lock held on it. Judging by the stack traces in test_116, this is a possible root cause for the stuck threads and the later unmount failure.

Comment by Andreas Dilger [ 26/Jan/15 ]

Brian filed that as LU-6155 but it is worthwhile investigating as the root cause of this bug also, because this systematically hits only in sanity and not randomly in other tests.

Comment by Isaac Huang (Inactive) [ 26/Jan/15 ]

I had debugging on for SPL/ZFS, but was able to hit the hang without hitting any ZFS assertion.

I checked the kernel hung task watcher code and it looks like it does not warn about tasks that are in D state but still being scheduled. The ll_ost00_017 thread likely hung in txg_wait_open() long before the umount started, judging from the CPU time it had used. That would also explain why there were two other OST threads hanging in osd_object_write_lock() but no error message about any thread blocked holding the lock.

I'll try to narrow it down to subtests earlier than 133. Would it be possible to force an OST umount/remount after each test? That would pinpoint exactly when things first go wrong.

Comment by Andreas Dilger [ 27/Jan/15 ]

Isaac, how recent was your testing? I wonder if we were hitting multiple different problems here, maybe one was fixed with 0.6.3-1.2 and there is a second problem causing the threads to be stuck and dump in test_116?

Unfortunately, Nathaniel already ran tests with an unmount after every subtest (see his debug patch earlier in this bug), but AFAIK this didn't find the problem. That said, I didn't check the results of this patch myself. It may be that unmounting after every subtest didn't allow memory pressure to build up and hit the problem.

I also had a patch on this bug to disable test_116 to see if it was the source of the problem, since it runs for more than 120 seconds itself and the watchdogs could be triggered by hangs after the start of the test. However, the stack traces moved to test_118, which at least gave me a somewhat better idea of where the threads first got stuck. It looks like it was around test_103 or so, but it is hard to know because the console messages do not have any timestamps.

It might be more productive to try a few patches to disable groups of subtests around 100 to 103, each with Test-Parameters that have 6-10 sanity runs to have a high confidence that the problem is caught.

Comment by Isaac Huang (Inactive) [ 27/Jan/15 ]

The one where txg_sync hung in D state while there was no pending IO was fixed - I was able to hit it often before 0.6.3-1.2 but not once after. Now I'm seeing one OST thread blocked in D state in txg_wait_open() forever, while txg_sync is busy doing IO and the TXGs of the pool are moving forward.

I'm now testing with a temporary workaround for LU-6155. I was able to reproduce it roughly 1 out of 4 sanity runs, so we'll soon know whether LU-6155 is to blame or not.

Comment by Isaac Huang (Inactive) [ 27/Jan/15 ]

The LU-6155 workaround didn't help; I hit the same hang in 1 out of 6 sanity runs.

Comment by Nathaniel Clark [ 27/Jan/15 ]

Attempting to bisect the cause: umount/mount of OSTs works prior to test 50 and fails prior to test 100:
http://review.whamcloud.com/#/c/13130/7
https://testing.hpdd.intel.com/test_sets/e8dac624-9ce2-11e4-b50c-5254006e85c2

Comment by Isaac Huang (Inactive) [ 30/Jan/15 ]

I wrote a script to watch for stuck OSS threads and found in one failed test:
1. At test 56w about 02:52, the OSS began to get busy, with threads stuck waiting for IO in txg_wait_synced(). But threads would still recover and go on.
2. At test 79 about 03:25, one OSS thread (21277) got stuck in txg_wait_open() and it never recovered. It was in D state but actually running, which was why the kernel hung task watcher never warned about it.
3. At test 103b about 03:40, OST thread 21285 got stuck in ofd_object_destroy()=>osd_object_write_lock()=>call_rwsem_down_write_failed() and never recovered, at the same time OSS dmesg showed:

Lustre: DEBUG MARKER: == sanity test 103b: MDS mount option 'noacl' == 03:40:06 (1422618006)                                   
LustreError: 11-0: lustre-MDT0000-lwp-OST0001: operation obd_ping to node 10.100.4.147@tcp failed: rc = -107                   
Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.100.4.147@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: 9624:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1422618009/real 1422618009]  req@ffff88006f83ccc0 x1491657489867716/t0(0) o400->MGC10.100.4.147@tcp@10.100.4.147@tcp:26/25 lens 224/224 e 0 to 1 dl 1422618016 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 9624:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 1 previous similar message                                  
LustreError: 166-1: MGC10.100.4.147@tcp: Connection to MGS (at 10.100.4.147@tcp) was lost; in progress operations using this service will fail
Lustre: 9623:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1422618016/real 1422618016]  req@ffff88003e42e680 x1491657489867740/t0(0) o250->MGC10.100.4.147@tcp@10.100.4.147@tcp:26/25 lens 400/544 e 0 to 1 dl 1422618022 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1                                                                             
Lustre: Evicted from MGS (at 10.100.4.147@tcp) after server handle changed from 0x9f6377cf69c91a9b to 0x9f6377cf6a97ffe3       
Lustre: MGC10.100.4.147@tcp: Connection restored to MGS (at 10.100.4.147@tcp)                                                  
Lustre: Skipped 1 previous similar message                                                                                     
Lustre: 9623:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1422618016/real 1422618016]  req@ffff88005febe9c0 x1491657489867744/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.100.4.147@tcp:12/10 lens 400/544 e 0 to 1 dl 1422618027 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1                                                                       
Lustre: lustre-OST0001: deleting orphan objects from 0x0:162984 to 0x0:163009                                                  
Lustre: lustre-OST0000: deleting orphan objects from 0x0:155843 to 0x0:155873                                                  
LustreError: 11-0: MGC10.100.4.147@tcp: operation obd_ping to node 10.100.4.147@tcp failed: rc = -107                          
LustreError: Skipped 1 previous similar message
LustreError: 166-1: MGC10.100.4.147@tcp: Connection to MGS (at 10.100.4.147@tcp) was lost; in progress operations using this service will fail
LustreError: 14122:0:(mgc_request.c:526:do_requeue()) failed processing log: -5                                                
LustreError: 14122:0:(mgc_request.c:526:do_requeue()) Skipped 7 previous similar messages                                      
Lustre: Evicted from MGS (at 10.100.4.147@tcp) after server handle changed from 0x9f6377cf6a97ffe3 to 0x9f6377cf6a980275

4. Later at 133g umount would block forever

I tend to think the thread hung in txg_wait_open() was the real cause, and the threads later hung in osd_object_write_lock() a consequence of it:

  • The txg_wait_open() hang always preceded the osd_object_write_lock() hang.
  • Stack trace showed that the txg_wait_open() thread might hold a lock which the later threads tried to get:
    PID: 21277  TASK: ffff88006130c040  CPU: 0   COMMAND: "ll_ost00_007"
     #0 [ffff88003ed5b880] schedule at ffffffff815296a0
     #1 [ffff88003ed5b948] cv_wait_common at ffffffffa011b415 [spl]
     #2 [ffff88003ed5b9c8] __cv_wait at ffffffffa011b495 [spl]
     #3 [ffff88003ed5b9d8] txg_wait_open at ffffffffa02215e3 [zfs]
     #4 [ffff88003ed5ba18] dmu_tx_wait at ffffffffa01e4939 [zfs]
     #5 [ffff88003ed5ba78] dmu_tx_assign at ffffffffa01e49e9 [zfs]
     #6 [ffff88003ed5bb28] osd_trans_start at ffffffffa0dd45ad [osd_zfs]
     #7 [ffff88003ed5bb58] ofd_trans_start at ffffffffa0f1f07c [ofd]
     #8 [ffff88003ed5bb88] ofd_object_destroy at ffffffffa0f21530 [ofd]
     #9 [ffff88003ed5bbd8] ofd_destroy_by_fid at ffffffffa0f1b79d [ofd]
    #10 [ffff88003ed5bcd8] ofd_destroy_hdl at ffffffffa0f150ea [ofd]
    
    PID: 21278  TASK: ffff88003e0e7500  CPU: 1   COMMAND: "ll_ost00_008"
     #0 [ffff880062b979b0] schedule at ffffffff815296a0
     #1 [ffff880062b97a78] rwsem_down_failed_common at ffffffff8152bd65
     #2 [ffff880062b97ad8] rwsem_down_write_failed at ffffffff8152bec3
     #3 [ffff880062b97b18] call_rwsem_down_write_failed at ffffffff8128fbe3
     #4 [ffff880062b97b78] osd_object_write_lock at ffffffffa0ddc2fb [osd_zfs]
     #5 [ffff880062b97b88] ofd_object_destroy at ffffffffa0f211c6 [ofd]
     #6 [ffff880062b97bd8] ofd_destroy_by_fid at ffffffffa0f1b79d [ofd]
     #7 [ffff880062b97cd8] ofd_destroy_hdl at ffffffffa0f150ea [ofd]
    
  • I ran "sanity --start-at 100 --stop-at 140" for more than 60 times without reproducing the hang, so the cause should precede test 100.
  • This is also in line with Nathaniel's observation that "umount/mount of osts works prior to test 50 and fails prior to test 100".

Comment by Isaac Huang (Inactive) [ 01/Feb/15 ]

Now I can reproduce the txg_wait_open() hang with "sanity --only 78,79,80". It seemed to be the smallest subset of tests to trigger it - I haven't been able to reproduce it if any one test is removed. Once a thread got stuck in txg_wait_open(), umount on OSS would block forever just like at the end of 133g.
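
For anyone else trying this, the reproduction loop is roughly the following (a sketch only; it assumes the in-tree Lustre test scripts with a ZFS-backed setup, and passes the sub-test list through the ONLY variable rather than the "--only" syntax quoted above):

cd lustre/tests
iter=0
while ONLY="78 79 80" bash sanity.sh; do
        iter=$((iter + 1))
        echo "iteration $iter passed, no hang yet"
done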

Comment by Isaac Huang (Inactive) [ 02/Feb/15 ]

I found a possible bug LU-6195 in osd_declare_object_destroy() while looking at the code today, although I haven't fully understood its consequences to determine whether it could have something to do with this bug.

Comment by Brian Behlendorf [ 02/Feb/15 ]

If you're able to reproduce this bug it might be useful to check the following proc variables.

First I'd look at dmu_tx; it shows a list of counters which indicate why a TX can't be assigned to a TXG. From the stacks which have been posted, it looks like dmu_tx_assign() may just be failing repeatedly (maybe because of an incorrectly constructed TX?) and thus blocking in dmu_tx_wait() forever. Knowing why would help narrow things down. On a healthy system you should see almost everything in dmu_tx_assigned (success); the other common cases are dmu_tx_dirty_throttle and dmu_tx_dirty_delay. These indicate that the txg_sync thread can't write data as fast as the processes are creating it, so it injects a delay to slow them down and keep things in balance. If you see more than a handful of anything else, it should be investigated.

The other thing I'd look at is /proc/spl/kstat/zfs/<pool>/txgs. This contains a history of the last N TXGs which were committed: how long they took, how much IO was performed, how many IOPS were issued, etc. To enable this you'll need to set the history size using the zfs_txg_history module parameter; by default it's disabled, but if you want to log, say, the last 10 TXGs, set it to 10. In particular, I'm curious whether TXGs continue to roll forward at the usual rate while this thread is stuck (it sounds like they do). If so, then the TXG engine itself would seem to be working fine and there's just something wrong with the specific TX which is blocked.
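
Concretely, on the OSS that would look something like this (a sketch; the pool name lustre-ost1 and a history depth of 10 are only examples):

echo 10 > /sys/module/zfs/parameters/zfs_txg_history   # keep stats for the last 10 TXGs
cat /proc/spl/kstat/zfs/dmu_tx                         # counters showing why TX assignment fails
cat /proc/spl/kstat/zfs/lustre-ost1/txgs               # per-TXG timing and IO history
watch -n 1 -d cat /proc/spl/kstat/zfs/dmu_tx           # see which counter keeps climbing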

Comment by Alex Zhuravlev [ 02/Feb/15 ]

well, we're supposed to get a message in that case:

rc = dmu_tx_assign(oh->ot_tx, TXG_WAIT);
if (unlikely(rc != 0)) {
        struct osd_device *osd = osd_dt_dev(d);
        /* dmu will call commit callback with error code during abort */
        if (!lu_device_is_md(&d->dd_lu_dev) && rc == -ENOSPC)
                CERROR("%s: failed to start transaction due to ENOSPC. "
                       "Metadata overhead is underestimated or "
                       "grant_ratio is too low.\n", osd->od_svname);
        else
                CERROR("%s: can't assign tx: rc = %d\n",
                       osd->od_svname, rc);
}
Comment by Brian Behlendorf [ 02/Feb/15 ]

Well not exactly. When calling dmu_tx_assign() with TXG_WAIT it should only fail in the case where we're over quota. For all other errors it will retry and we should be assigned to the next TXG (assuming the TX is sane). So checking the dmu_tx proc file would show if we're hitting any of the ERESTART cases.

/*
 * Assign tx to a transaction group.  txg_how can be one of:
 *
 * (1)  TXG_WAIT.  If the current open txg is full, waits until there's
 *      a new one.  This should be used when you're not holding locks.
 *      It will only fail if we're truly out of space (or over quota).
 *
 * (2)  TXG_NOWAIT.  If we can't assign into the current open txg without
 *      blocking, returns immediately with ERESTART.  This should be used
 *      whenever you're holding locks.  On an ERESTART error, the caller
 *      should drop locks, do a dmu_tx_wait(tx), and try again.
Comment by Isaac Huang (Inactive) [ 02/Feb/15 ]

Brian, you nailed it! Thanks a lot!

Every time the hang happened, the TXGs would move forward, but unusually fast:

# cat lustre-ost1/txgs; sleep 1; cat lustre-ost1/txgs 
25 0 0x01 3 336 244675310220 123698699004892
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
1010319525 123698585210993  S     0            0            0            0        0        84246        3380         33414        0           
1010319526 123698585295239  W     0            0            0            0        0        36794        2496         0            0           
1010319527 123698585332033  O     0            0            0            0        0        0            0            0            0           
25 0 0x01 3 336 244675310220 123699705972680
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
1010324666 123699503872529  S     0            0            0            0        0        111328       5851         53035        0           
1010324667 123699503983857  W     0            0            0            0        0        59840        6406         0            0           
1010324668 123699504043697  O     0            0            0            0        0        0            0            0            0

In this case, 5141 in one second.

Here's dmu_tx:

# cat dmu_tx 
4 1 0x01 11 528 3293147069 124044054813482
name                            type data
dmu_tx_assigned                 4    32644
dmu_tx_delay                    4    0
dmu_tx_error                    4    0
dmu_tx_suspended                4    0
dmu_tx_group                    4    0
dmu_tx_memory_reserve           4    1012288401
dmu_tx_memory_reclaim           4    0
dmu_tx_dirty_throttle           4    0
dmu_tx_dirty_delay              4    0
dmu_tx_dirty_over_max           4    0
dmu_tx_quota                    4    0

And dmu_tx_memory_reserve is still rapidly growing even though there's now no IO from Lustre at all:

# grep dmu_tx_memory_reserve dmu_tx; sleep 1; grep dmu_tx_memory_reserve dmu_tx;
dmu_tx_memory_reserve           4    1012794812
dmu_tx_memory_reserve           4    1012800331

So the counter grew by 5519 in a second, roughly the same as the increase in TXG #. Looks like that's why the thread is still hanging in txg_wait_open().

The only place where that counter gets bumped is dmu_tx_try_assign()=>dsl_dir_tempreserve_space()=>arc_tempreserve_space(). So I increased zfs_arc_max by about 500M, waited for the arc_adapt thread to pick up the change, and then a few seconds later the thread recovered from the hang and everything looked normal! Even umount worked!

Before I increased zfs_arc_max, there was about 800M available between arc_sz and arc_c, and the hung TX was only trying to remove one object (and update the OI and accounting ZAPs), so clearly something was wrong with that TX (maybe LU-6195). I'll find out.
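
For the record, the manual bump was along these lines (a sketch with an example size, not the exact value I used):

grep -wE '^(size|c|c_max)' /proc/spl/kstat/zfs/arcstats   # current ARC size, arc_c and arc_c_max
echo $((2 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
sleep 10                                                  # give arc_adapt time to pick up the new limit
grep -wE '^(size|c|c_max)' /proc/spl/kstat/zfs/arcstats   # arc_c should now be allowed to grow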

Comment by Brian Behlendorf [ 02/Feb/15 ]

Whoa, that's definitely going to be a problem! It sounds like Lustre constructed a massive TX relative to the target ARC size and therefore could never assign it to a TXG. Something like LU-6195 could cause this with an incorrect hold. You might want to build ZFS with --enable-debug-dmu-tx; this will enable some additional statistics about the TX and some more sanity checking.

Comment by Alex Zhuravlev [ 02/Feb/15 ]

IIRC, we got -ENOSPC from dmu_tx_assign() if the TX is too big, for example when we tried to destroy a file with lots of stripes.

Comment by Brian Behlendorf [ 02/Feb/15 ]

Alex, that can happen (along with EDQUOT) if the pool is low on space, but it's not guaranteed to happen even for very large TXs. It's entirely possible that you'll fall into this ERESTART retry loop first. The DMU expects that individual TXs will be relatively small (~10MB at most) but that a TXG may contain many TXs. If Lustre is creating TXs much larger than this, I can see how this deadlock could happen. It would be best if we could put an upper bound on the size of a Lustre TX.

Comment by Andreas Dilger [ 02/Feb/15 ]

Kudos to everyone for the excellent work in tracking down this tricky problem. Nathaniel, while a solution is being worked on, could you please submit a patch to sanity to re-enable subtest 132 and exclude subtests 78, 79, and 80 for ZFS OSTs, with a Test-Parameters: line that runs sanity enough times to be confident of the result (maybe 6-10 times)? That would allow us to start enforcing the review-zfs test results while the fix for this new issue is prepared, and ensure we don't regress in other areas in the meantime. Excluding sanity test_133g failures, we would have passed 27 of 30 recent review-zfs test sessions (one failed due to a problem in the patch, one failed due to timeouts in sanity-quota and ost-pools, and one had an additional failure in sanity-quota).

There is a similar check in ext4 to limit the size of a single transaction handle to < 1/4 of the total journal size. I don't think it is unreasonable to put some upper bound on the size of a single ZFS tx, but what that size should be is unclear. I recall the 10MB tx limit also, but I think it was just recently increased? Related to this, llog updates (appends) may be quite large in theory but in practice will only affect a small number of blocks, and since an unlink of a widely-striped file may update a lot of llog files (up to 2000 stripes) it could produce very large theoretical transaction sizes that aren't seen in practice, so we don't want to cause them to fail gratuitously due to this check. That is what LU-2160 "Implement ZFS dmu_tx_hold_append() declarations for llog" is about, but that hasn't been implemented yet.

Comment by Isaac Huang (Inactive) [ 02/Feb/15 ]

Working on the patch to disable 78, 79, and 80...

Comment by Gerrit Updater [ 02/Feb/15 ]

Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13600
Subject: LU-5242 osd-zfs: umount hang in sanity 133g
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f45c320eeeeaff849ba2e97da0f6fe5185d1e149

Comment by Isaac Huang (Inactive) [ 03/Feb/15 ]

I just reproduced it again (with a LU-6195 workaround, so that's ruled out), and I found one fishy object on the OST:

# zdb -e -ddddd lustre-ost2/ost2 195
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
       195    5    16K   128K  1.01M  1.00T    0.00  ZFS plain file
Indirect blocks:
               0 L4     0:54a00:400 4000L/400P F=8 B=513/513
     10000000000  L3    0:54600:400 4000L/400P F=8 B=513/513
     10000000000   L2   0:54200:400 4000L/400P F=8 B=513/513
     10000000000    L1  0:4fe00:400 4000L/400P F=8 B=513/513
     10000000000     L0 0:76aa00:20000 20000L/20000P F=1 B=513/513
     10000020000     L0 0:78aa00:20000 20000L/20000P F=1 B=513/513
     10000040000     L0 0:7aaa00:20000 20000L/20000P F=1 B=513/513
     10000060000     L0 0:7caa00:20000 20000L/20000P F=1 B=513/513
     10000080000     L0 0:7eaa00:20000 20000L/20000P F=1 B=513/513
     100000a0000     L0 0:80aa00:20000 20000L/20000P F=1 B=513/513
     100000c0000     L0 0:82aa00:20000 20000L/20000P F=1 B=513/513
     100000e0000     L0 0:84aa00:20000 20000L/20000P F=1 B=513/513

                segment [0000010000000000, 0000010000100000) size    1M

An object with only 4 indirect blocks, despite its lsize of 1T. I guessed it was the object the hung OST thread was trying to remove. arc_c was 938M when it hung; I gradually increased it, and the hang disappeared when arc_c reached 1.0G. Then that object also disappeared:

# zdb -e -ddddd lustre-ost2/ost2 195
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
zdb: dmu_bonus_hold(195) failed, errno 2

So my guess was right: the thread tried to remove the object with the 1T lsize, dmu_tx_hold_free() estimated the memory overhead of freeing the object to be about 1G, and then dmu_tx_assign() hung in dmu_tx_try_assign()=>dsl_dir_tempreserve_space()=>arc_tempreserve_space(). I also kept watching arcsz, and it grew by only a few MBs (it never grew over 60M).

So it looked like dmu_tx_count_free() severely over-estimated txh_memory_tohold for dmu_tx_hold_free().

Comment by Alex Zhuravlev [ 03/Feb/15 ]

Cool catch! I guess it should be possible to reproduce this with a regular truncate to, say, 2-4TB?

Comment by Alex Zhuravlev [ 03/Feb/15 ]

Also, would it make sense to have our own copy of dmu_tx_assign() with some checks, e.g. printing an error if we have been retrying for too long?

Comment by Andreas Dilger [ 03/Feb/15 ]

This looks like it is caused by sanity.sh test_80 "Page eviction is equally fast at high offsets too", which writes 1MB at a 1TB offset in a sparse file to test client page cache truncation performance. I guess it is also testing OSD-side block truncation...
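
(For illustration, the trigger is essentially a sparse write followed by an unlink, something like the following sketch; /mnt/lustre is an assumed client mount point, not the literal test_80 code:)

dd if=/dev/zero of=/mnt/lustre/sparse80 bs=1M count=1 seek=$((1024 * 1024))  # 1MB written at a 1TB offset
rm /mnt/lustre/sparse80   # the resulting OST object destroy is what gets stuck in dmu_tx_assign()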

I don't think making our own version of dmu_tx_assign() is the right solution, but rather fixing the upstream ZFS code. I'm not sure if this should fail outright (leaving a user with an unremovable object?) or only print an error that the arc_c limit should be increased (preferably with some advice on how large it needs to be).

As for the proper solution, I'd expect that the DMU code shouldn't be so pessimistic about TX size when deleting this sparse file. I'd imagine it only really needs a few hundred KB of dirty blocks, so maybe it needs a bit better accounting? Is it accounting for a dense file and not realizing the file is sparse? Is it possible to short-circuit all of the calculations in dmu_tx_count_free() by checking the allocated block count from the dnode (dn_used?), or is that a layering violation?

Comment by Alex Zhuravlev [ 03/Feb/15 ]

I agree that the proper fix can be done only in the DMU. But there might be another similar overestimation leading to a livelock in some corner case, which will be hard to recognize given that it's a silent loop within dmu_tx_assign().

Comment by Isaac Huang (Inactive) [ 03/Feb/15 ]

Last night I changed the seek=1M to seek=5K in test_80, and so far I've run 5 iterations of sanity and 64 iterations of "sanity --only 78,79,80" without reproducing it. So it looks like we can work around it by reducing seek in test_80 (probably test_34g as well) without skipping any sub-test, while working on a proper fix.

As to the fix:

  • In arc_tempreserve_space():
            if (reserve > arc_c/4 && !arc_no_grow)
                    arc_c = MIN(arc_c_max, reserve * 4);
    
            /*
             * Throttle when the calculated memory footprint for the TXG
             * exceeds the target ARC size.
             */
            if (reserve > arc_c) {                            
                    DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);
                    return (SET_ERROR(ERESTART));
            }
    
    • An error/debug message here would help.
    • Also I think the code should be changed to:
              if (reserve > arc_c) {                                                              
                      return (SET_ERROR(ENOMEM));
              }
      

      The code just tried to grow ARC to the max possible in "arc_c = MIN(arc_c_max, reserve * 4);", and if that's still not enough ERESTART doesn't seem to be the right choice - unless the admin would show up and bump zfs_arc_max, which in my opinion isn't a case for ERESTART.

  • The dmu_tx_count_free() caused problems in the past with sparse files and was improved to over-estimate less:
    https://github.com/zfsonlinux/zfs/commit/ff80d9b142826c15fa84e3c4b9ef1795fd6e9485
    Looks like there's more work there.
  • Investigate whether to limit the size of a single TX. The ZPL does exactly this in zfs_write():
    int             max_blksz = zsb->z_max_blksz;
    ......
                    dmu_tx_hold_write(tx, zp->z_id, woff, MIN(n, max_blksz));
    ......
                    /*        
                     * XXX - should we really limit each write to z_max_blksz?
                     * Perhaps we should use SPA_MAXBLOCKSIZE chunks?
                     */
                    nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz));
    
Comment by Alex Zhuravlev [ 03/Feb/15 ]

Take into account that with wide striping we may need to modify up to ~2K objects (and each llog is modified twice: the 8K header and the record itself).

Comment by Jodi Levi (Inactive) [ 03/Feb/15 ]

Patch to disable test(s): http://review.whamcloud.com/#/c/13600

Comment by Andreas Dilger [ 03/Feb/15 ]

Isaac, we cannot change the seek value in test_80, because that is testing VM cache invalidation specific to the client, so changing the file size will break this test on the client.

Oleg's suggestion is to only exclude subtest 80 if the ARC size is too small, so that it still runs on other systems that have enough RAM to pass this subtest. It might be useful to check pass/fail rates against VM clusters to see if some VMs are failing more often because they have less RAM? It doesn't seem that we need to exclude subtest 78 or 79 at all, but we're landing this patch anyway to allow review-ZFS to begin passing.

Comment by Gerrit Updater [ 03/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13600/
Subject: LU-5242 osd-zfs: umount hang in sanity 133g
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 298916a319e1c5da4be0df3609e4773fe70a4026

Comment by Jodi Levi (Inactive) [ 03/Feb/15 ]

Failing tests disabled. Fix will be included in 2.8.

Comment by Isaac Huang (Inactive) [ 03/Feb/15 ]

Andreas, the DMU wanted about 1G of ARC for the object delete. Most of our VMs are configured with 1.85G of memory, so the ARC size defaults to 938M, and they'll all run into this problem. We could increase the ARC size a bit to let the object delete pass, because only a couple of MBs are really needed, but that would likely run into other issues, e.g. LU-5809. So I think our best workaround is to exclude only 80. I'm now testing that on Eagle - will push a patch shortly.

Comment by Gerrit Updater [ 03/Feb/15 ]

Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13620
Subject: LU-5242 osd-zfs: umount hang in sanity 133g
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 39acc43a443a23f3e92dab63155ad658fdeb8cf2

Comment by Gerrit Updater [ 04/Feb/15 ]

Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13630
Subject: LU-5242 osd-zfs: umount hang in sanity 133g
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 13cc94af090e0b1279c2987840a0e6de685e03ee

Comment by Isaac Huang (Inactive) [ 04/Feb/15 ]

It looks like the preferred DMU API for removing large/sparse objects is dmu_free_long_range(). The ZPL was able to remove any large/sparse object I threw at it, including ones that would just choke osd-zfs:

zfs_rmnode(znode_t *zp)
......
        /*
         * Free up all the data in the file.
         */
        error = dmu_free_long_range(os, zp->z_id, 0, DMU_OBJECT_END);
......
        /*
         * Set up the final transaction.
         */
        tx = dmu_tx_create(os);
        dmu_tx_hold_free(tx, zp->z_id, 0, DMU_OBJECT_END);
......
        error = dmu_tx_assign(tx, TXG_WAIT);
......
        zfs_znode_delete(zp, tx); ==> VERIFY(0 == dmu_object_free(os, obj, tx));

I created a simple patch and my tests showed it worked:

@@ -444,6 +444,9 @@ static void __osd_declare_object_destroy(const struct lu_env *env,
        zap_cursor_t            *zc;
        int                      rc = 0;
 
+       rc = dmu_free_long_range(osd->od_os, oid, 0, DMU_OBJECT_END);
+       LASSERTF(rc == 0, "dmu_free_long_range "LPU64" failed: %d\n", oid, rc);
+
        dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);

But two things I'm not sure about:

  • I had to call it in do_declare_destroy() because it must run before dmu_tx_hold_free(), which means that by the time do_declare_destroy() returns, the data in the object has already been freed. Is that OK for Lustre/OSD?
  • The dmu_free_long_range() frees the data in the object in several TXs, so it might fail in the middle, leaving the data in the object partially freed. If do_declare_destroy() fails for this reason, what would the OSD layer do? The object itself and the OI and accounting ZAPs are still freed/updated in a single TX, so that part is consistent, but the object data may already be partially freed.

We can work on dmu_tx_count_free() to correctly estimate memory overhead for the particular sparse file created by test_80(). But we'd have to deal with any large/sparse object anyway, so I think in the long run it'd be better to go with dmu_free_long_range().

Comment by Brian Behlendorf [ 04/Feb/15 ]

Nice find, that neatly explains everything. Thanks Isaac for opening issue 3064 at zfsonlinux so we can decide what should be done on the ZFS side. We should certainly investigate improving the quality of the worst-case estimate. But it really does need to be a worst-case estimate, or we risk potentially stalling the txg engine in the worst (but very unlikely) case. Anyway, I'm open to ideas!

https://github.com/zfsonlinux/zfs/issues/3064

Comment by Alex Zhuravlev [ 04/Feb/15 ]

Isaac, this is usually done using an additional index/list containing the objects to be freed. So the OSD's destroy puts an object on the list (plus, say, does the OI removal), then a separate thread (or osd_trans_stop()) truncates the object, frees it, and removes it from the list/index using as many transactions as needed.

Comment by Brian Behlendorf [ 04/Feb/15 ]

Yes, what Alex said. This is exactly how the POSIX layer works too: the object to be removed is placed on an unlinked list for handling later in as many transactions as needed. Only once the blocks are all freed is the object removed from the list. See zfs_unlinked_drain().

Comment by Andreas Dilger [ 04/Feb/15 ]

Note that by the time the OST object is being destroyed it has already been unlinked from the MDS namespace, and that unlink is committed on the MDS. The OFD has also revoked client locks and discarded any cached dirty pages on the clients, so there is no danger in destroying the object in stages. That is what ext4 does internally (truncating blocks in small chunks from the end of the file).

Since the MDT is already logging the object destroy locally and will resend it if the OSS crashes, and the OST object hasn't been deleted from the OST namespace yet, I don't see a requirement for more infrastructure to handle this at the OSD level. It is fine to truncate the object before destroy FOR THE OST ONLY, and higher layers will handle it.

That said, since the OSD code is common, we shouldn't be exposing truncated-but-not-unlinked objects on the MDT, so it may be easiest to truncate to zero explicitly from the OFD before the object destroy?

Comment by Alex Zhuravlev [ 05/Feb/15 ]

Well, that would mean we can't atomically destroy huge objects using OUT, which I'd like to use to batch OST object destroys from the MDT.

Comment by Isaac Huang (Inactive) [ 10/Feb/15 ]

How about:

  • An object (maybe a ZAP) is created per OST/MDT, to hold objects to be freed. Let's call it the deathrow object.
  • In declare_object_destroy() and object_destroy(), within a single TX, move the object to the deathrow and update the OI and accounting ZAPs as well, so the object disappears from the OST/MDT namespace atomically (rough sketch below).
    • If an object is small enough, it can be destroyed in the old way.
  • A separate thread works on the deathrow object, truncating and freeing every object in it.
    • This may break some tests that wait for free space to increase after object removal. We may also do at least the truncating in osd_trans_stop().

All implemented at the osd-zfs layer.
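
A minimal sketch of the destroy-side bookkeeping for this proposal (deathrow_obj, the helper names, and the elided OI/accounting updates are illustrative assumptions, not actual osd-zfs code):

#include <sys/dmu.h>
#include <sys/zap.h>

/* Declare-time: reserve room for one new entry in the deathrow ZAP. */
static void osd_deathrow_declare(dmu_tx_t *tx, uint64_t deathrow_obj)
{
        dmu_tx_hold_zap(tx, deathrow_obj, TRUE, NULL);
}

/*
 * Execute-time, inside the same TX that removes the FID from the OI and
 * updates the accounting ZAPs: record the victim dnode number so a
 * background thread can free its blocks and dnode later.  The deathrow
 * ZAP is assumed to have been created/looked up at mount time.
 */
static int osd_deathrow_add(objset_t *os, uint64_t deathrow_obj,
                            uint64_t oid, dmu_tx_t *tx)
{
        return zap_add_int(os, deathrow_obj, oid, tx);
}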

Comment by Alex Zhuravlev [ 10/Feb/15 ]

From Lustre's point of view, the object is destroyed when it can't be found with lu_object_find(), IOW when the corresponding FID is removed from the OI - this should be part of the object destroy. Space accounting shouldn't be an issue as it's not released immediately in any case: grants/quota can release reserved space upon commit, but a new reservation can be made only with a new statfs. Truncate/dnode destroy in osd_trans_stop() looks OK.

Comment by Andreas Dilger [ 11/Feb/15 ]

It makes sense to keep this compatible with ZFS ZPL if at all possible, so that if large files are unlinked under ZPL and then mounted as Lustre, or vice versa, we don't defer deleting them forever. I see this is handled in zfs_unlinked_add() and zfs_unlinked_drain() with a ZAP named "DELETE_QUEUE" (ZFS_UNLINKED_SET) in the MASTER_NODE_OBJ. Even if we need to implement our own routines to handle this, it makes sense to use the same ZAP and zap format (zap_add_int()).

Comment by Brian Behlendorf [ 11/Feb/15 ]

I like Andreas's idea of keeping the unlink behavior compatible with the ZPL. It would be ideal if you could reuse the existing ZPL functions, but those functions are tied quite closely to ZPL-specific data structures, so that's probably not workable. The ZFS_UNLINKED_SET object itself, however, is just a ZAP containing a list of object ids, and since objects on disk are already constructed to be compatible with the ZPL we should be able to use it safely. Isaac's design is nice, but let me suggest a few minor tweaks:

  • Use the existing ZFS_UNLINKED_SET object linked to the MASTER_NODE_OBJ as the deathrow object.
  • In declare_object_destroy() and object_destroy() just handle moving the object to the ZFS_UNLINKED_SET in a single TX.
  • In a dedicated thread, taskq, or generic linux worker thread regularly walk the ZFS_UNLINKED_SET and rely on dmu_free_long_range() to split the free over as many TXGs as required (a rough drain sketch follows below).
  • I don't think there's any advantage in handling small object destruction synchronously in object_destroy(). It's simpler and probably more efficient to always do this asynchronously.
  • Start draining the ZFS_UNLINKED_SET right away when remounting the OSD (this happens during mount for the ZPL).
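
As a rough sketch of what the drain side could look like under these tweaks (the function name is made up, error handling/throttling/shutdown checks are omitted, and unlinked_obj is assumed to have been looked up from the MASTER_NODE_OBJ the same way the ZPL does):

#include <sys/dmu.h>
#include <sys/zap.h>

static void osd_unlinked_drain(objset_t *os, uint64_t unlinked_obj)
{
        zap_cursor_t    zc;
        zap_attribute_t za;

        /*
         * unlinked_obj is assumed to have been found at mount time via
         * zap_lookup(os, MASTER_NODE_OBJ, ZFS_UNLINKED_SET, 8, 1,
         * &unlinked_obj), i.e. the same "DELETE_QUEUE" ZAP the ZPL
         * drains in zfs_unlinked_drain().
         */
        for (zap_cursor_init(&zc, os, unlinked_obj);
             zap_cursor_retrieve(&zc, &za) == 0;
             zap_cursor_advance(&zc)) {
                uint64_t oid = za.za_first_integer;
                dmu_tx_t *tx;

                /* Free the object's blocks over as many TXGs as needed. */
                if (dmu_free_long_range(os, oid, 0, DMU_OBJECT_END) != 0)
                        continue;       /* leave the entry for a later pass */

                /* One small TX: free the dnode and drop the ZAP entry. */
                tx = dmu_tx_create(os);
                dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);
                dmu_tx_hold_zap(tx, unlinked_obj, FALSE, NULL);
                if (dmu_tx_assign(tx, TXG_WAIT) != 0) {
                        dmu_tx_abort(tx);
                        continue;
                }
                (void) dmu_object_free(os, oid, tx);
                (void) zap_remove_int(os, unlinked_obj, oid, tx);
                dmu_tx_commit(tx);
        }
        zap_cursor_fini(&zc);
}
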
Comment by Alex Zhuravlev [ 11/Feb/15 ]

I'd think that the optimization for small objects (notice most of the MDT's objects are literally empty) makes sense, as we don't need to modify yet another ZAP twice. I guess there is no strong requirement to implement this right away, but still.

Comment by Brian Behlendorf [ 11/Feb/15 ]

I could see a case for zero-length files as a possible optimization. But I suspect that even for the MDT it would be more efficient to handle the freeing asynchronously, outside of any request processing. Even if the file is zero length you're still going to be freeing a spill block for the xattrs and updating a dbuf for the dnode object. Personally I'd keep it as simple and concise as possible until it's clear something more is required. But that's just my preference.

Keep in mind that none of this free space will be available for a couple TXGs anyway.

Comment by Alex Zhuravlev [ 11/Feb/15 ]

I have no objection to doing this as simply as possible. My point was that the MDT is known to be CPU-bound and ZAP updates (even for a micro ZAP) aren't free.

Comment by Isaac Huang (Inactive) [ 12/Feb/15 ]

Thanks all. I'll work on a patch first without the small object optimization to get this bug fixed; then will benchmark to figure out whether to optimize the small object path or not.

Comment by Andreas Dilger [ 12/Feb/15 ]

I would agree with Alex on this. Deferring the unlink of small files will probably double or triple the total IO that the MDT is doing, because in addition to the actual dnode deletion it also needs to insert the dnode into the deathrow ZAP in one TXG and then delete it from the same ZAP in a different TXG. If a large number of objects are being deleted at once (easily possible on the MDT), then the deathrow ZAP may get quite large (and never shrink) and updates would become less efficient than if it were kept small.

Comment by Bob Glossman (Inactive) [ 19/Feb/15 ]

Another instance seen on b2_5 with ZFS:
https://testing.hpdd.intel.com/test_sessions/435c3152-b816-11e4-9ecb-5254006e85c2

Comment by Gerrit Updater [ 19/Feb/15 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13805
Subject: LU-5242 osd-zfs: umount hang in sanity 133g
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 817cf8a2e781d546508929a9f58b44561ae3361c

Comment by Gerrit Updater [ 17/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13630/
Subject: LU-5242 osd-zfs: umount hang in sanity 133g
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9b704e4088d867851cdb011f0a2560b1e622555c

Comment by Peter Jones [ 18/May/15 ]

Landed for 2.8
