this test passed a couple days before: https://maloo.whamcloud.com/test_sets/506f3352-fa9b-11e1-887d-52540035b04c
start failing on 9/9 https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c
client show that mdsrate was hung
20:46:53:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
20:47:24:INFO: task mdsrate:21390 blocked for more than 120 seconds.
20:47:24:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
20:47:24:mdsrate D 0000000000000000 0 21390 21387 0x00000080
20:47:24: ffff88004c1d7be8 0000000000000082 ffff88000000012d ffffffff0000012d
20:47:24: ffff880000000065 ffffffffa0b404a0 000000000000012d ffffffffa0715cc5
20:47:24: ffff880048d5c638 ffff88004c1d7fd8 000000000000fb88 ffff880048d5c638
20:47:24:Call Trace:
20:47:24: [<ffffffffa06c042d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
20:47:24: [<ffffffffa06ed4a6>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
20:47:24: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
20:47:24: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
20:47:24: [<ffffffffa06609b3>] mdc_close+0x193/0x9a0 [mdc]
20:47:24: [<ffffffffa061b8c6>] lmv_close+0x2d6/0x5a0 [lmv]
20:47:24: [<ffffffffa099c7af>] ll_close_inode_openhandle+0x30f/0x1050 [lustre]
20:47:24: [<ffffffffa099d69a>] ll_md_real_close+0x1aa/0x220 [lustre]
20:47:24: [<ffffffffa099d96b>] ll_md_close+0x25b/0x760 [lustre]
20:47:24: [<ffffffff814ff546>] ? down_read+0x16/0x30
20:47:24: [<ffffffffa099df8b>] ll_file_release+0x11b/0x3e0 [lustre]
20:47:24: [<ffffffff8117ca65>] __fput+0xf5/0x210
20:47:24: [<ffffffff8117cba5>] fput+0x25/0x30
20:47:24: [<ffffffff811785cd>] filp_close+0x5d/0x90
20:47:24: [<ffffffff811786a5>] sys_close+0xa5/0x100
20:47:24: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347162469/real 1347162469] req@ffff88007a52dc00 x1412601028282197/t0(0) o250->MGC10.10.4.170@tcp@10.10.4.170@tcp:26/25 lens 400/544 e 0 to 1 dl 1347162494 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
20:49:26:INFO: task mdsrate:21390 blocked for more than 120 seconds.
20:49:26:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
perhaps it related to this Oops on the mds
https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c
https://maloo.whamcloud.com/test_logs/6a1db8f4-fa51-11e1-887d-52540035b04c/show_text
23:02:04:Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh
23:02:04:BUG: unable to handle kernel paging request at 0000000781bbc060
23:02:04:IP: [<ffffffff8104f1c7>] resched_task+0x17/0x80
23:02:04:PGD 374e5067 PUD 0
23:02:04:Thread overran stack, or stack corrupted
23:02:04:Oops: 0000 1 SMP
23:02:04:last sysfs file: /sys/devices/system/cpu/possible
23:02:04:CPU 0
23:02:04:Modules linked in: osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) lustre(U) obdfilter(U) ost(U) cmm(U) mdt(U) mdd(U) mds(U) mgs(U) obdecho(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
23:02:04:
23:02:04:Pid: 21654, comm: mdt00_001 Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1 Red Hat KVM
23:02:04:RIP: 0010:[<ffffffff8104f1c7>] [<ffffffff8104f1c7>] resched_task+0x17/0x80
23:02:04:RSP: 0018:ffff880002203de8 EFLAGS: 00010087
23:02:04:RAX: 00000000000166c0 RBX: ffff880068462b18 RCX: 00000000ffff8800
23:02:04:RDX: ffff880031c2a000 RSI: 0000000000000400 RDI: ffff880068462ae0
23:02:04:RBP: ffff880002203de8 R08: 0000000000989680 R09: 0000000000000000
23:02:04:R10: 0000000000000010 R11: 0000000000000000 R12: ffff880002216728
23:02:04:R13: 0000000000000000 R14: 0000000000000000 R15: ffff880068462ae0
23:02:04:FS: 00007f0480a26700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
23:02:04:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
23:02:04:CR2: 0000000781bbc060 CR3: 000000007089b000 CR4: 00000000000006f0
23:02:04:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
23:02:04:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
23:02:04:Process mdt00_001 (pid: 21654, threadinfo ffff880031c2a000, task ffff880068462ae0)
23:02:04:Stack:
23:02:04: ffff880002203e18 ffffffff8105484c ffff8800022166c0 0000000000000000
23:02:04:<d> 00000000000166c0 0000000000000000 ffff880002203e58 ffffffff81057fa1
23:02:04:<d> ffff880002203e58 ffff880068462ae0 0000000000000000 0000000000000000
23:02:04:Call Trace:
23:02:04: <IRQ>
This is a duplicate of
LU-1881, and this test can pass with the patch for 1881.