[LU-2015] Test failure on test suite obdfilter-survey, subtest test_3a Created: 23/Sep/12  Updated: 02/Aug/13  Resolved: 02/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Minh Diep
Resolution: Duplicate Votes: 0
Labels: zfs
Environment:

lustre-master build #1560 zfs


Issue Links:
Duplicate
is duplicated by LU-2939 Lustre: MGS is waiting for obd_unlink... Resolved
Related
is related to LU-3230 conf-sanity fails to start run: umoun... Resolved
Severity: 3
Rank (Obsolete): 4116

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/6dfc3fae-049c-11e2-bfd4-52540035b04c.

The sub-test test_3a failed with the following error:

test failed to respond and timed out

From the OSS console:

06:39:39:Lustre: DEBUG MARKER: == obdfilter-survey test 3a: Network survey ========================================================== 06:39:38 (1348234778)
06:40:00:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
06:40:00:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
06:40:01:Lustre: 3533:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1348234793/real 1348234793]  req@ffff88003d984800 x1413717737622294/t0(0) o400->MGC10.10.4.186@tcp@10.10.4.186@tcp:26/25 lens 224/224 e 0 to 1 dl 1348234800 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
06:40:01:Lustre: 3533:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 1 previous similar message
06:40:01:LustreError: 166-1: MGC10.10.4.186@tcp: Connection to MGS (at 10.10.4.186@tcp) was lost; in progress operations using this service will fail
06:40:24:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
06:40:36:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
06:41:07:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 4. Is it stuck?
06:41:07:Lustre: 3532:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1348234845/real 1348234845]  req@ffff88003d984800 x1413717737622299/t0(0) o250->MGC10.10.4.186@tcp@10.10.4.186@tcp:26/25 lens 400/544 e 0 to 1 dl 1348234866 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
06:41:07:Lustre: 3532:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
06:42:09:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 4. Is it stuck?
06:43:42:Lustre: 3532:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1348234990/real 1348234990]  req@ffff88003d984800 x1413717737622304/t0(0) o250->MGC10.10.4.186@tcp@10.10.4.186@tcp:26/25 lens 400/544 e 0 to 1 dl 1348235015 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
06:43:42:Lustre: 3532:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
06:44:24:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 4. Is it stuck?
06:47:58:INFO: task umount:13712 blocked for more than 120 seconds.
06:47:58:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
06:47:58:umount        D 0000000000000000     0 13712  13711 0x00000080
06:47:58: ffff880038249ac8 0000000000000086 ffffffff00000010 ffff880038249a78
06:47:58: ffff880038249a38 0000000000000286 ffffffffa05f0ff0 ffff880052692d54
06:47:58: ffff8800383245f8 ffff880038249fd8 000000000000fb88 ffff8800383245f8
06:47:58:Call Trace:
06:47:58: [<ffffffff814fea92>] schedule_timeout+0x192/0x2e0
06:47:58: [<ffffffff8107e120>] ? process_timeout+0x0/0x10
06:47:58: [<ffffffffa03a373d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
06:47:58: [<ffffffffa0538828>] obd_exports_barrier+0x98/0x180 [obdclass]
06:47:59: [<ffffffffa0bc6fb2>] ofd_device_fini+0x42/0x230 [ofd]
06:47:59: [<ffffffffa055ddc7>] class_cleanup+0x577/0xdc0 [obdclass]
06:47:59: [<ffffffffa053aa36>] ? class_name2dev+0x56/0xe0 [obdclass]
06:47:59: [<ffffffffa055f6b5>] class_process_config+0x10a5/0x1ca0 [obdclass]
06:47:59: [<ffffffffa03a3be0>] ? cfs_alloc+0x30/0x60 [libcfs]
06:47:59: [<ffffffffa0559043>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
06:47:59: [<ffffffffa0560427>] class_manual_cleanup+0x177/0x6f0 [obdclass]
06:47:59: [<ffffffffa053aa36>] ? class_name2dev+0x56/0xe0 [obdclass]
06:47:59: [<ffffffffa0569997>] server_put_super+0x5a7/0xcb0 [obdclass]
06:47:59: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
06:48:00: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
06:48:00: [<ffffffffa0562066>] lustre_kill_super+0x36/0x60 [obdclass]
06:48:00: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
06:48:00: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
06:48:00: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
06:48:00: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
06:48:00: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

This was master with OFD and LDiskFS OSTs.

Info required for matching: obdfilter-survey 3a
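
For anyone trying to reproduce this outside of autotest: test 3a drives obdfilter-survey in its network case against the OSS. A minimal sketch of that invocation, with parameter values that are illustrative rather than the exact ones test_3a uses (the OSS NID is taken from the console log above):

# Run only subtest 3a through the test framework
# (assumes a configured Lustre test environment):
ONLY=3a sh lustre/tests/obdfilter-survey.sh

# Or drive the network survey directly; sizes and thread counts here
# are illustrative, not what the test script sets:
nobjhi=2 thrhi=2 size=1024 case=network \
    targets="10.10.4.186@tcp" sh obdfilter-survey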



 Comments   
Comment by Andreas Dilger [ 15/Oct/12 ]

I hit this same problem while shutting down the MGS during sanity-scrub.sh test_10b() on 2.3.53-27-g1a660e4 (one local patch beyond 773c66 "LU-2142 scrub: reset completed scrub position if retrigger"). I don't know whether it has anything to do with that patch, even though the patch touches the scrub code:

Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
INFO: task umount:23147 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount        D 0000000000000000     0 23147  23146 0x00000000
 ffff8800b22cbab8 0000000000000086 0000000000000000 ffff8800b22cba68
 ffff8800b22cba28 ffff88005ed8d000 ffffffffa13a73d1 0000000000000000
 ffff880029f685f8 ffff8800b22cbfd8 000000000000fb88 ffff880029f685f8
Call Trace:
 [<ffffffff814fea92>] schedule_timeout+0x192/0x2e0
 [<ffffffff8107e120>] ? process_timeout+0x0/0x10
 [<ffffffffa120673d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
 [<ffffffffa132a308>] obd_exports_barrier+0x98/0x180 [obdclass]
 [<ffffffffa0bde8fe>] mgs_device_fini+0xfe/0x5d0 [mgs]
 [<ffffffffa13583c7>] class_cleanup+0x577/0xdc0 [obdclass]
 [<ffffffffa132c516>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa1359cb5>] class_process_config+0x10a5/0x1ca0 [obdclass]
 [<ffffffffa1206be0>] ? cfs_alloc+0x30/0x60 [libcfs]
 [<ffffffffa1353473>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
 [<ffffffffa135aa29>] class_manual_cleanup+0x179/0x6f0 [obdclass]
 [<ffffffffa132c516>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa1367e5d>] server_put_super+0x4cd/0x12f0 [obdclass]
 [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8117d436>] kill_anon_super+0x16/0x60
 [<ffffffffa135c8a6>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
 [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
 [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
Comment by Sarah Liu [ 30/Nov/12 ]

Hit a similar issue during interop testing with a 2.3 server and a lustre-master client:
https://maloo.whamcloud.com/test_sets/2b02c4e6-3987-11e2-9fda-52540035b04c

01:52:18:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
01:52:18:umount        D 0000000000000000     0 18490  18489 0x00000080
01:52:18: ffff88002f81fac8 0000000000000086 ffffffff00000010 ffff88002f81fa78
01:52:18: ffff88002f81fa38 0000000000000286 ffffffffa067ba50 ffff880076f95348
01:52:18: ffff88003b093058 ffff88002f81ffd8 000000000000fb88 ffff88003b093058
01:52:18:Call Trace:
01:52:18: [<ffffffff814fea92>] schedule_timeout+0x192/0x2e0
01:52:18: [<ffffffff8107e120>] ? process_timeout+0x0/0x10
01:52:18: [<ffffffffa043673d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
01:52:18: [<ffffffffa05ca168>] obd_exports_barrier+0x98/0x180 [obdclass]
01:52:18: [<ffffffffa0c49692>] ofd_device_fini+0x42/0x240 [ofd]
01:52:18: [<ffffffffa05eb4b7>] class_cleanup+0x577/0xdc0 [obdclass]
01:52:18: [<ffffffffa05cc376>] ? class_name2dev+0x56/0xe0 [obdclass]
01:52:18: [<ffffffffa05ed4ab>] class_process_config+0x102b/0x1c30 [obdclass]
01:52:18: [<ffffffffa0436be0>] ? cfs_alloc+0x30/0x60 [libcfs]
01:52:18: [<ffffffffa05e6eb3>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
01:52:18: [<ffffffffa05ee229>] class_manual_cleanup+0x179/0x6f0 [obdclass]
01:52:18: [<ffffffffa05cc376>] ? class_name2dev+0x56/0xe0 [obdclass]
01:52:18: [<ffffffffa05f8439>] server_put_super+0x6f9/0xcf0 [obdclass]
01:52:18: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
01:52:18: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
01:52:18: [<ffffffffa05efe66>] lustre_kill_super+0x36/0x60 [obdclass]
01:52:18: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
01:52:18: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
01:52:18: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
01:52:18: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
01:52:18: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
01:53:40:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 4. Is it stuck?
01:55:42:Lustre: 3436:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1354096509/real 1354096509]  req@ffff880070cb7c00 x1419868288058186/t0(0) o250->MGC10.10.4.154@tcp@10.10.4.154@tcp:26/25 lens 400/544 e 0 to 1 dl 1354096534 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
01:55:42:Lustre: 3436:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
01:56:24:INFO: task umount:18490 blocked for more than 120 seconds.
Comment by Andreas Dilger [ 08/Mar/13 ]

Hit this again in recovery-small.sh test_107
https://maloo.whamcloud.com/test_sets/229fc768-82ed-11e2-8172-52540035b04c

15:16:28:Lustre: 18053:0:(client.c:1866:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1362179756/real 1362179756] req@ffff88007c931c00 x1428344221543872/t0(0) o250->MGC10.10.4.150@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1362179781 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
15:16:30:Lustre: 18053:0:(client.c:1866:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
15:17:13:LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
15:17:13:LustreError: Skipped 1081 previous similar messages
15:17:37:INFO: task umount:31658 blocked for more than 120 seconds.
15:17:37:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
15:17:37:umount D 0000000000000000 0 31658 31657 0x00000080
15:17:37: ffff8800545cfa98 0000000000000086 ffffffff00000010 ffff8800545cfa48
15:17:37: ffff8800545cfa08 ffff8800534ba000 ffffffffa106173e 0000000000000000
15:17:37: ffff88006483dab8 ffff8800545cffd8 000000000000fb88 ffff88006483dab8
15:17:37:Call Trace:
15:17:37: [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
15:17:37: [<ffffffff8107cb50>] ? process_timeout+0x0/0x10
15:17:37: [<ffffffffa0ef96bd>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
15:17:38: [<ffffffffa0fe3a98>] obd_exports_barrier+0x98/0x170 [obdclass]
15:17:38: [<ffffffffa058d946>] mgs_device_fini+0xf6/0x5a0 [mgs]
15:17:38: [<ffffffffa10109d7>] class_cleanup+0x577/0xda0 [obdclass]
15:17:38: [<ffffffffa0fe5d36>] ? class_name2dev+0x56/0xe0 [obdclass]
15:17:38: [<ffffffffa10122bc>] class_process_config+0x10bc/0x1c80 [obdclass]
15:17:38: [<ffffffffa0f03df8>] ? libcfs_log_return+0x28/0x40 [libcfs]
15:17:39: [<ffffffffa100bb21>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
15:17:39: [<ffffffffa1012ff9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
15:17:41: [<ffffffffa0fe5d36>] ? class_name2dev+0x56/0xe0 [obdclass]
15:17:42: [<ffffffffa101fa4d>] server_put_super+0x46d/0xf00 [obdclass]
15:17:42: [<ffffffff811785ab>] generic_shutdown_super+0x5b/0xe0
15:17:42: [<ffffffff81178696>] kill_anon_super+0x16/0x60
15:17:42: [<ffffffffa1014e56>] lustre_kill_super+0x36/0x60 [obdclass]
15:17:42: [<ffffffff81179670>] deactivate_super+0x70/0x90
15:17:43: [<ffffffff811955cf>] mntput_no_expire+0xbf/0x110
15:17:43: [<ffffffff81195f2b>] sys_umount+0x7b/0x3a0
15:17:43: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
15:22:13:Lustre: MGS is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck?

Comment by Oleg Drokin [ 10/Mar/13 ]

I am hitting this frequently on my test cluster (see LU-2939), I have crashdumps for anybody interested to dig into it.

Comment by Nathaniel Clark [ 13/Mar/13 ]

Hit this on ldiskfs in recovery-small/29a:
https://maloo.whamcloud.com/test_sets/9a6769ca-8ac9-11e2-a80f-52540035b04c

17:22:19:Lustre: DEBUG MARKER: == recovery-small test 29a: error adding new clients doesn't cause LBUG (bug 22273) == 17:22:11 (1363047731)
17:22:19:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000711
17:22:19:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
17:22:19:Lustre: DEBUG MARKER: umount -d /mnt/mds1
17:22:19:Lustre: Failing over lustre-MDT0000
17:22:32:LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
17:22:32:LustreError: Skipped 8 previous similar messages
17:22:34:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
17:22:34:LustreError: 166-1: MGC10.10.16.89@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
17:22:45:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
17:23:20:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
17:23:41:LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
17:23:42:LustreError: Skipped 116 previous similar messages
17:24:25:Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
17:25:50:LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
17:25:50:LustreError: Skipped 233 previous similar messages
17:26:33:Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
17:28:43:Lustre: 11992:0:(client.c:1866:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363048086/real 1363048086]  req@ffff880079f23c00 x1429257112860604/t0(0) o250->MGC10.10.16.89@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1363048111 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:28:43:Lustre: 11992:0:(client.c:1866:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
17:29:37:INFO: task umount:17156 blocked for more than 120 seconds.
17:29:37:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:29:37:umount        D 0000000000000000     0 17156  17155 0x00000080
17:29:37: ffff880071d23a98 0000000000000086 ffffffff00000010 ffff880071d23a48
17:29:38: ffff880071d23a08 ffff88005c2a4400 ffffffffa0d2a73e 0000000000000000
17:29:38: ffff880056f2a5f8 ffff880071d23fd8 000000000000fb88 ffff880056f2a5f8
17:29:38:Call Trace:
17:29:38: [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
17:29:38: [<ffffffff8107cb50>] ? process_timeout+0x0/0x10
17:29:38: [<ffffffffa0bc26bd>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
17:29:38: [<ffffffffa0caca98>] obd_exports_barrier+0x98/0x170 [obdclass]
17:29:38: [<ffffffffa1725996>] mgs_device_fini+0xf6/0x5a0 [mgs]
17:29:38: [<ffffffffa0cd99c7>] class_cleanup+0x577/0xda0 [obdclass]
17:29:38: [<ffffffffa0caed36>] ? class_name2dev+0x56/0xe0 [obdclass]
17:29:38: [<ffffffffa0cdb2ac>] class_process_config+0x10bc/0x1c80 [obdclass]
17:29:38: [<ffffffffa0bccdf8>] ? libcfs_log_return+0x28/0x40 [libcfs]
17:29:40: [<ffffffffa0cd4b11>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
17:29:40: [<ffffffffa0cdbfe9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
17:29:40: [<ffffffffa0caed36>] ? class_name2dev+0x56/0xe0 [obdclass]
17:29:40: [<ffffffffa0ce8a3d>] server_put_super+0x46d/0xf00 [obdclass]
17:29:40: [<ffffffff811785ab>] generic_shutdown_super+0x5b/0xe0
17:29:40: [<ffffffff81178696>] kill_anon_super+0x16/0x60
17:29:40: [<ffffffffa0cdde46>] lustre_kill_super+0x36/0x60 [obdclass]
17:29:40: [<ffffffff81179670>] deactivate_super+0x70/0x90
17:29:40: [<ffffffff811955cf>] mntput_no_expire+0xbf/0x110
17:29:40: [<ffffffff81195f2b>] sys_umount+0x7b/0x3a0
17:29:40: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Comment by Sarah Liu [ 16/Apr/13 ]

Hit this error in the tag-2.3.64 testing:

https://maloo.whamcloud.com/test_sets/a2ff5494-a66c-11e2-90ad-52540035b04c

Comment by Nathaniel Clark [ 24/Apr/13 ]

Failure on replay-single/90 in review-zfs:
https://maloo.whamcloud.com/test_sets/6128f8ce-ad04-11e2-bd7c-52540035b04c

Comment by Jian Yu [ 02/May/13 ]

Since 2013-04-12, obdfilter-survey has been failing consistently with this issue on the master branch. The latest master build that passed the obdfilter-survey test in a "full" test session is #1381.

Comment by Peter Jones [ 02/May/13 ]

Minh

Could you please see if you can isolate which landing introduced this problem?

Peter

Comment by Andreas Dilger [ 02/May/13 ]

What is the git commit hash for build #1381? It probably makes sense to just look through the patches that were landed to master between build #1381 and #1382, and if necessary build & run the obdfilter-survey test with git-bisect to isolate it to a specific patch.
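
A sketch of the bisect suggested above, assuming builds #1381 and #1382 can be mapped back to the commits they were built from; run-obdfilter-3a.sh is a hypothetical local wrapper (not part of the tree) that rebuilds, reinstalls, runs ONLY=3a obdfilter-survey.sh, and exits non-zero when 3a fails or hangs:

git bisect start
git bisect bad  <sha-of-build-1382>   # first build believed to fail 3a
git bisect good <sha-of-build-1381>   # last build known to pass 3a
git bisect run ./run-obdfilter-3a.sh  # hypothetical wrapper, see above
git bisect reset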

Comment by Minh Diep [ 02/May/13 ]

34 UP lwp lustre-MDT0000-lwp-OST0005 lustre-MDT0000-lwp-OST0005_UUID 5
35 UP osd-ldiskfs lustre-OST0006-osd lustre-OST0006-osd_UUID 5
36 UP obdfilter lustre-OST0006 lustre-OST0006_UUID 10
37 UP lwp lustre-MDT0000-lwp-OST0006 lustre-MDT0000-lwp-OST0006_UUID 5
38 UP echo_client lustre-OST0000_ecc lustre-OST0000_ecc_UUID 3
39 UP echo_client lustre-OST0001_ecc lustre-OST0001_ecc_UUID 3
40 UP echo_client lustre-OST0002_ecc lustre-OST0002_ecc_UUID 3
41 UP echo_client lustre-OST0003_ecc lustre-OST0003_ecc_UUID 3
42 UP echo_client lustre-OST0004_ecc lustre-OST0004_ecc_UUID 3
43 UP echo_client lustre-OST0005_ecc lustre-OST0005_ecc_UUID 3
44 UP echo_client lustre-OST0006_ecc lustre-OST0006_ecc_UUID 3

One of the reasons is that obdfilter-survey did not clean up the echo_client devices after finishing. I am looking into it further.
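
If those leftover echo_client devices are what keep the export refcount pinned, they can in principle be torn down by hand before the umount. A rough sketch, using the device numbers from the lctl dl listing above and the same cleanup/detach steps obdfilter-survey's own teardown issues:

# list the stale echo_client devices
lctl dl | grep echo_client
# tear each one down manually (devices 38-44 in the listing above)
for dev in 38 39 40 41 42 43 44; do
    lctl --device $dev cleanup
    lctl --device $dev detach
done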

Comment by Minh Diep [ 03/May/13 ]

obdfilter-survey exited unexpectedly because of this command:

+ echo '=============> Destroy 1 on localhost:lustre-OST0000_ecc'
+ destroy_objects localhost 38 2 1 /tmp/obdfilter_survey_2013-05-02@20:45_fat-amd-4.lab.whamcloud.com.detail_tmp
+ local host=localhost
+ local devno=38
+ local obj0=2
+ local nobj=1
+ local rfile=/tmp/obdfilter_survey_2013-05-02@20:45_fat-amd-4.lab.whamcloud.com.detail_tmp
+ remote_shell localhost lctl --device 38 destroy 2 1

[root@fat-amd-4 ~]# cat /tmp/obdfilter_survey_2013-05-02@20\:45_fat-amd-4.lab.whamcloud.com.detail_tmp
+ host=localhost
+ shift
+ cmds='lctl --device 38 destroy 2 1'
++ uname -n
+ '[' localhost = localhost -o localhost = fat-amd-4.lab.whamcloud.com ']'
+ eval 'lctl --device 38 destroy 2 1'
++ lctl --device 38 destroy 2 1
destroy: 1 objects
error: destroy: objid 0x2: No such file or directory
[root@fat-amd-4 ~]#

But we did create objid 0x2:

=============> Create 1 on localhost:lustre-OST0000_ecc
++ host=localhost
++ shift
++ cmds='lctl --device 38 create 1'
+++ uname -n
++ '[' localhost = localhost -o localhost = fat-amd-4.lab.whamcloud.com ']'
++ eval 'lctl --device 38 create 1'
+++ lctl --device 38 create 1
create: 1 objects
create: #1 is object id 0x2
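
Stripped down, the failing sequence is just a create immediately followed by a destroy of the returned object through the echo_client; a minimal by-hand version of what the script does (device number 38 comes from the trace above):

# create one object through the echo_client attached to lustre-OST0000;
# lctl prints the object id it allocated (0x2 in the trace above)
lctl --device 38 create 1
# immediately destroy that object; in the failing runs this returns
# "No such file or directory" even though the create just succeeded
lctl --device 38 destroy 2 1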

Comment by Minh Diep [ 06/May/13 ]

The fix is a duplicate of LU-3267.

Comment by Sarah Liu [ 16/Jul/13 ]

Hit this issue again in zfs testing

https://maloo.whamcloud.com/test_sets/14e07cdc-ed44-11e2-99b4-52540035b04c

OST console:

00:47:41:LustreError: 19583:0:(qsd_reint.c:54:qsd_reint_completion()) Skipped 1 previous similar message
00:48:42:INFO: task umount:19559 blocked for more than 120 seconds.
00:48:42:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00:48:42:umount        D 0000000000000001     0 19559  19558 0x00000080
00:48:42: ffff8802d22c7aa8 0000000000000082 ffffffff00000010 ffff8802d22c7a58
00:48:42: ffff8802d22c7a18 0000000000000286 ffffffffa07db450 ffff8802fcc06f2a
00:48:42: ffff8803003725f8 ffff8802d22c7fd8 000000000000fb88 ffff8803003725f8
00:48:42:Call Trace:
00:48:42: [<ffffffff8150ee42>] schedule_timeout+0x192/0x2e0
00:48:42: [<ffffffff810810e0>] ? process_timeout+0x0/0x10
00:48:42: [<ffffffffa05f462d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
00:48:42: [<ffffffffa07175f8>] obd_exports_barrier+0x98/0x170 [obdclass]
00:48:42: [<ffffffffa0e5a962>] ofd_device_fini+0x42/0x230 [ofd]
00:48:42: [<ffffffffa0742f17>] class_cleanup+0x577/0xda0 [obdclass]
00:48:42: [<ffffffffa07197a6>] ? class_name2dev+0x56/0xe0 [obdclass]
00:48:42: [<ffffffffa07447fc>] class_process_config+0x10bc/0x1c80 [obdclass]
00:48:42: [<ffffffffa073e1e3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
00:48:42: [<ffffffffa0745539>] class_manual_cleanup+0x179/0x6f0 [obdclass]
00:48:42: [<ffffffffa07197a6>] ? class_name2dev+0x56/0xe0 [obdclass]
00:48:42: [<ffffffffa07809ec>] server_put_super+0x5ec/0xf60 [obdclass]
00:48:42: [<ffffffff811833ab>] generic_shutdown_super+0x5b/0xe0
00:48:42: [<ffffffff81183496>] kill_anon_super+0x16/0x60
00:48:42: [<ffffffffa07473e6>] lustre_kill_super+0x36/0x60 [obdclass]
00:48:42: [<ffffffff81183c37>] deactivate_super+0x57/0x80
00:48:42: [<ffffffff811a1c8f>] mntput_no_expire+0xbf/0x110
00:48:42: [<ffffffff811a26fb>] sys_umount+0x7b/0x3a0
00:48:42: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Comment by Minh Diep [ 25/Jul/13 ]

Hi Sarah,

This seems to be a different issue. Could you file a new bug?

Comment by Sarah Liu [ 30/Jul/13 ]

Thanks Minh, LU-3665 is for the new issue.
