[LU-2015] Test failure on test suite obdfilter-survey, subtest test_3a Created: 23/Sep/12 Updated: 02/Aug/13 Resolved: 02/Aug/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Minh Diep |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | zfs |
| Environment: | lustre-master build #1560 zfs |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 4116 |
| Description |
|
This issue was created by maloo for Li Wei <liwei@whamcloud.com>.

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/6dfc3fae-049c-11e2-bfd4-52540035b04c.

The sub-test test_3a failed with the following error:
From the OSS console:

06:39:39:Lustre: DEBUG MARKER: == obdfilter-survey test 3a: Network survey ========================================================== 06:39:38 (1348234778)
06:40:00:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
06:40:00:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
06:40:01:Lustre: 3533:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1348234793/real 1348234793] req@ffff88003d984800 x1413717737622294/t0(0) o400->MGC10.10.4.186@tcp@10.10.4.186@tcp:26/25 lens 224/224 e 0 to 1 dl 1348234800 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
06:40:01:Lustre: 3533:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 1 previous similar message
06:40:01:LustreError: 166-1: MGC10.10.4.186@tcp: Connection to MGS (at 10.10.4.186@tcp) was lost; in progress operations using this service will fail
06:40:24:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
06:40:36:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
06:41:07:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 4. Is it stuck?
06:41:07:Lustre: 3532:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1348234845/real 1348234845] req@ffff88003d984800 x1413717737622299/t0(0) o250->MGC10.10.4.186@tcp@10.10.4.186@tcp:26/25 lens 400/544 e 0 to 1 dl 1348234866 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
06:41:07:Lustre: 3532:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
06:42:09:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 4. Is it stuck?
06:43:42:Lustre: 3532:0:(client.c:1905:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1348234990/real 1348234990] req@ffff88003d984800 x1413717737622304/t0(0) o250->MGC10.10.4.186@tcp@10.10.4.186@tcp:26/25 lens 400/544 e 0 to 1 dl 1348235015 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
06:43:42:Lustre: 3532:0:(client.c:1905:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
06:44:24:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 4. Is it stuck?
06:47:58:INFO: task umount:13712 blocked for more than 120 seconds.
06:47:58:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
06:47:58:umount D 0000000000000000 0 13712 13711 0x00000080
06:47:58: ffff880038249ac8 0000000000000086 ffffffff00000010 ffff880038249a78
06:47:58: ffff880038249a38 0000000000000286 ffffffffa05f0ff0 ffff880052692d54
06:47:58: ffff8800383245f8 ffff880038249fd8 000000000000fb88 ffff8800383245f8
06:47:58:Call Trace:
06:47:58: [<ffffffff814fea92>] schedule_timeout+0x192/0x2e0
06:47:58: [<ffffffff8107e120>] ? process_timeout+0x0/0x10
06:47:58: [<ffffffffa03a373d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
06:47:58: [<ffffffffa0538828>] obd_exports_barrier+0x98/0x180 [obdclass]
06:47:59: [<ffffffffa0bc6fb2>] ofd_device_fini+0x42/0x230 [ofd]
06:47:59: [<ffffffffa055ddc7>] class_cleanup+0x577/0xdc0 [obdclass]
06:47:59: [<ffffffffa053aa36>] ? class_name2dev+0x56/0xe0 [obdclass]
06:47:59: [<ffffffffa055f6b5>] class_process_config+0x10a5/0x1ca0 [obdclass]
06:47:59: [<ffffffffa03a3be0>] ? cfs_alloc+0x30/0x60 [libcfs]
06:47:59: [<ffffffffa0559043>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
06:47:59: [<ffffffffa0560427>] class_manual_cleanup+0x177/0x6f0 [obdclass]
06:47:59: [<ffffffffa053aa36>] ? class_name2dev+0x56/0xe0 [obdclass]
06:47:59: [<ffffffffa0569997>] server_put_super+0x5a7/0xcb0 [obdclass]
06:47:59: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
06:48:00: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
06:48:00: [<ffffffffa0562066>] lustre_kill_super+0x36/0x60 [obdclass]
06:48:00: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
06:48:00: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
06:48:00: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
06:48:00: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
06:48:00: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

This was master with OFD and LDiskFS OSTs.

Info required for matching: obdfilter-survey 3a |
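The stack trace shows umount parked in obd_exports_barrier() while the "waiting for obd_unlinked_exports" messages keep doubling. As a triage aid (not part of the original report), here is a minimal diagnostic sketch to run on the stuck OSS; the obdfilter.*.num_exports parameter name is an assumption and may differ between Lustre versions:

# Hedged diagnostic sketch -- run on the OSS while umount is stuck.
# Dump the kernel stack of the hung umount (should show obd_exports_barrier):
pid=$(pidof umount)
cat /proc/${pid}/stack

# Dump all task stacks to the console via sysrq (same data the hung-task
# watchdog prints after 120 seconds):
echo t > /proc/sysrq-trigger

# Check how many exports the OST still holds (parameter name is an
# assumption; adjust for the Lustre version under test):
lctl get_param obdfilter.lustre-OST0000.num_exports

# Save the Lustre debug log for attachment to the ticket:
lctl dk > /tmp/lustre-debug.$(hostname).log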
| Comments |
| Comment by Andreas Dilger [ 15/Oct/12 ] |
|
I hit this same problem when shutting down the MGS during sanity-scrub.sh test_10b() on 2.3.53-27-g1a660e4 (one local patch beyond 773c66):

Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
INFO: task umount:23147 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount D 0000000000000000 0 23147 23146 0x00000000
 ffff8800b22cbab8 0000000000000086 0000000000000000 ffff8800b22cba68
 ffff8800b22cba28 ffff88005ed8d000 ffffffffa13a73d1 0000000000000000
 ffff880029f685f8 ffff8800b22cbfd8 000000000000fb88 ffff880029f685f8
Call Trace:
 [<ffffffff814fea92>] schedule_timeout+0x192/0x2e0
 [<ffffffff8107e120>] ? process_timeout+0x0/0x10
 [<ffffffffa120673d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
 [<ffffffffa132a308>] obd_exports_barrier+0x98/0x180 [obdclass]
 [<ffffffffa0bde8fe>] mgs_device_fini+0xfe/0x5d0 [mgs]
 [<ffffffffa13583c7>] class_cleanup+0x577/0xdc0 [obdclass]
 [<ffffffffa132c516>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa1359cb5>] class_process_config+0x10a5/0x1ca0 [obdclass]
 [<ffffffffa1206be0>] ? cfs_alloc+0x30/0x60 [libcfs]
 [<ffffffffa1353473>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
 [<ffffffffa135aa29>] class_manual_cleanup+0x179/0x6f0 [obdclass]
 [<ffffffffa132c516>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa1367e5d>] server_put_super+0x4cd/0x12f0 [obdclass]
 [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8117d436>] kill_anon_super+0x16/0x60
 [<ffffffffa135c8a6>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
 [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
 [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0 |
| Comment by Sarah Liu [ 30/Nov/12 ] |
|
Hit a similar issue during interop testing with a 2.3 server and a lustre-master client:

01:52:18:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
01:52:18:umount D 0000000000000000 0 18490 18489 0x00000080
01:52:18: ffff88002f81fac8 0000000000000086 ffffffff00000010 ffff88002f81fa78
01:52:18: ffff88002f81fa38 0000000000000286 ffffffffa067ba50 ffff880076f95348
01:52:18: ffff88003b093058 ffff88002f81ffd8 000000000000fb88 ffff88003b093058
01:52:18:Call Trace:
01:52:18: [<ffffffff814fea92>] schedule_timeout+0x192/0x2e0
01:52:18: [<ffffffff8107e120>] ? process_timeout+0x0/0x10
01:52:18: [<ffffffffa043673d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
01:52:18: [<ffffffffa05ca168>] obd_exports_barrier+0x98/0x180 [obdclass]
01:52:18: [<ffffffffa0c49692>] ofd_device_fini+0x42/0x240 [ofd]
01:52:18: [<ffffffffa05eb4b7>] class_cleanup+0x577/0xdc0 [obdclass]
01:52:18: [<ffffffffa05cc376>] ? class_name2dev+0x56/0xe0 [obdclass]
01:52:18: [<ffffffffa05ed4ab>] class_process_config+0x102b/0x1c30 [obdclass]
01:52:18: [<ffffffffa0436be0>] ? cfs_alloc+0x30/0x60 [libcfs]
01:52:18: [<ffffffffa05e6eb3>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
01:52:18: [<ffffffffa05ee229>] class_manual_cleanup+0x179/0x6f0 [obdclass]
01:52:18: [<ffffffffa05cc376>] ? class_name2dev+0x56/0xe0 [obdclass]
01:52:18: [<ffffffffa05f8439>] server_put_super+0x6f9/0xcf0 [obdclass]
01:52:18: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
01:52:18: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
01:52:18: [<ffffffffa05efe66>] lustre_kill_super+0x36/0x60 [obdclass]
01:52:18: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
01:52:18: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
01:52:18: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
01:52:18: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
01:52:18: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
01:53:40:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 4. Is it stuck?
01:55:42:Lustre: 3436:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1354096509/real 1354096509] req@ffff880070cb7c00 x1419868288058186/t0(0) o250->MGC10.10.4.154@tcp@10.10.4.154@tcp:26/25 lens 400/544 e 0 to 1 dl 1354096534 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
01:55:42:Lustre: 3436:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
01:56:24:INFO: task umount:18490 blocked for more than 120 seconds. |
| Comment by Andreas Dilger [ 08/Mar/13 ] |
|
Hit this again in recovery-small.sh test_107
|
| Comment by Oleg Drokin [ 10/Mar/13 ] |
|
I am hitting this frequently on my test cluster (see |
| Comment by Nathaniel Clark [ 13/Mar/13 ] |
|
Hit this with ldiskfs in recovery-small/29a:

17:22:19:Lustre: DEBUG MARKER: == recovery-small test 29a: error adding new clients doesn't cause LBUG (bug 22273) == 17:22:11 (1363047731)
17:22:19:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000711
17:22:19:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
17:22:19:Lustre: DEBUG MARKER: umount -d /mnt/mds1
17:22:19:Lustre: Failing over lustre-MDT0000
17:22:32:LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
17:22:32:LustreError: Skipped 8 previous similar messages
17:22:34:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
17:22:34:LustreError: 166-1: MGC10.10.16.89@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
17:22:45:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
17:23:20:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
17:23:41:LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
17:23:42:LustreError: Skipped 116 previous similar messages
17:24:25:Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
17:25:50:LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
17:25:50:LustreError: Skipped 233 previous similar messages
17:26:33:Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
17:28:43:Lustre: 11992:0:(client.c:1866:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363048086/real 1363048086] req@ffff880079f23c00 x1429257112860604/t0(0) o250->MGC10.10.16.89@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1363048111 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
17:28:43:Lustre: 11992:0:(client.c:1866:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
17:29:37:INFO: task umount:17156 blocked for more than 120 seconds.
17:29:37:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:29:37:umount D 0000000000000000 0 17156 17155 0x00000080
17:29:37: ffff880071d23a98 0000000000000086 ffffffff00000010 ffff880071d23a48
17:29:38: ffff880071d23a08 ffff88005c2a4400 ffffffffa0d2a73e 0000000000000000
17:29:38: ffff880056f2a5f8 ffff880071d23fd8 000000000000fb88 ffff880056f2a5f8
17:29:38:Call Trace:
17:29:38: [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
17:29:38: [<ffffffff8107cb50>] ? process_timeout+0x0/0x10
17:29:38: [<ffffffffa0bc26bd>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
17:29:38: [<ffffffffa0caca98>] obd_exports_barrier+0x98/0x170 [obdclass]
17:29:38: [<ffffffffa1725996>] mgs_device_fini+0xf6/0x5a0 [mgs]
17:29:38: [<ffffffffa0cd99c7>] class_cleanup+0x577/0xda0 [obdclass]
17:29:38: [<ffffffffa0caed36>] ? class_name2dev+0x56/0xe0 [obdclass]
17:29:38: [<ffffffffa0cdb2ac>] class_process_config+0x10bc/0x1c80 [obdclass]
17:29:38: [<ffffffffa0bccdf8>] ? libcfs_log_return+0x28/0x40 [libcfs]
17:29:40: [<ffffffffa0cd4b11>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
17:29:40: [<ffffffffa0cdbfe9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
17:29:40: [<ffffffffa0caed36>] ? class_name2dev+0x56/0xe0 [obdclass]
17:29:40: [<ffffffffa0ce8a3d>] server_put_super+0x46d/0xf00 [obdclass]
17:29:40: [<ffffffff811785ab>] generic_shutdown_super+0x5b/0xe0
17:29:40: [<ffffffff81178696>] kill_anon_super+0x16/0x60
17:29:40: [<ffffffffa0cdde46>] lustre_kill_super+0x36/0x60 [obdclass]
17:29:40: [<ffffffff81179670>] deactivate_super+0x70/0x90
17:29:40: [<ffffffff811955cf>] mntput_no_expire+0xbf/0x110
17:29:40: [<ffffffff81195f2b>] sys_umount+0x7b/0x3a0
17:29:40: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Sarah Liu [ 16/Apr/13 ] |
|
Hit this error in the tag-2.3.64 testing: https://maloo.whamcloud.com/test_sets/a2ff5494-a66c-11e2-90ad-52540035b04c |
| Comment by Nathaniel Clark [ 24/Apr/13 ] |
|
Failure on replay-single/90 review-zfs |
| Comment by Jian Yu [ 02/May/13 ] |
|
I found that since 2013-04-12, obdfilter-survey has failed with this issue consistently on the master branch. The latest master build that passed the obdfilter-survey test in a "full" test session is #1381. |
| Comment by Peter Jones [ 02/May/13 ] |
|
Minh, could you please see if you can isolate which landing introduced this problem? Peter |
| Comment by Andreas Dilger [ 02/May/13 ] |
|
What is the git commit hash for build #1381? It probably makes sense to just look through the patches that were landed to master between build #1381 and #1382, and if necessary build & run the obdfilter-survey test with git-bisect to isolate it to a specific patch. |
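If the two builds can be mapped back to commits, the bisection described above is mechanical. A hedged sketch follows; the <good-hash>/<bad-hash> values are placeholders (the build-to-commit mapping is not recorded in this ticket) and the test invocation assumes the standard lustre/tests scripts on an already-configured test cluster:

# Placeholders: <good-hash> = commit for build #1381, <bad-hash> = first failing build.
git bisect start
git bisect bad  <bad-hash>
git bisect good <good-hash>

# At each bisection step: rebuild, reinstall on the test nodes, then run only
# the failing subtest. Exact invocation depends on the local setup; one common
# form with the in-tree test framework is:
cd lustre/tests
ONLY=3a sh obdfilter-survey.sh

# Record the result and continue until the first bad commit is identified:
git bisect good   # or: git bisect bad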
| Comment by Minh Diep [ 02/May/13 ] |
|
34 UP lwp lustre-MDT0000-lwp-OST0005 lustre-MDT0000-lwp-OST0005_UUID 5

One of the reasons is that obdfilter-survey did not clean up the echo_client after finishing. I am looking further. |
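If the echo_client devices really are left behind, they should be visible in the device list after the script exits and can be torn down by hand. A hedged sketch, not a verified cleanup procedure; the "_ecc" device name follows the pattern seen in the next comment:

# List obd devices; echo_client instances created by obdfilter-survey are
# named after the target with an "_ecc" suffix:
lctl dl | grep echo_client

# Manually tear down one leftover device (name taken from the lctl dl output):
lctl --device lustre-OST0000_ecc cleanup
lctl --device lustre-OST0000_ecc detach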
| Comment by Minh Diep [ 03/May/13 ] |
|
obdfilter-survey exited unexpectedly due to this command:

+ echo '=============> Destroy 1 on localhost:lustre-OST0000_ecc'

[root@fat-amd-4 ~]# cat /tmp/obdfilter_survey_2013-05-02@20\:45_fat-amd-4.lab.whamcloud.com.detail_tmp

but we create objid 0x2!!!!

=============> Create 1 on localhost:lustre-OST0000_ecc |
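To compare the object ids used on the create and destroy sides, the marker lines can be pulled straight out of the survey detail file. A trivial sketch, using the detail file path quoted in the comment above:

# Show the Create/Destroy markers recorded in the survey detail file:
grep -E '^=+> (Create|Destroy)' \
    /tmp/obdfilter_survey_2013-05-02@20\:45_fat-amd-4.lab.whamcloud.com.detail_tmp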
| Comment by Minh Diep [ 06/May/13 ] |
|
The fix is dup of |
| Comment by Sarah Liu [ 16/Jul/13 ] |
|
Hit this issue again in zfs testing: https://maloo.whamcloud.com/test_sets/14e07cdc-ed44-11e2-99b4-52540035b04c

OST console:

00:47:41:LustreError: 19583:0:(qsd_reint.c:54:qsd_reint_completion()) Skipped 1 previous similar message
00:48:42:INFO: task umount:19559 blocked for more than 120 seconds.
00:48:42:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00:48:42:umount D 0000000000000001 0 19559 19558 0x00000080
00:48:42: ffff8802d22c7aa8 0000000000000082 ffffffff00000010 ffff8802d22c7a58
00:48:42: ffff8802d22c7a18 0000000000000286 ffffffffa07db450 ffff8802fcc06f2a
00:48:42: ffff8803003725f8 ffff8802d22c7fd8 000000000000fb88 ffff8803003725f8
00:48:42:Call Trace:
00:48:42: [<ffffffff8150ee42>] schedule_timeout+0x192/0x2e0
00:48:42: [<ffffffff810810e0>] ? process_timeout+0x0/0x10
00:48:42: [<ffffffffa05f462d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs]
00:48:42: [<ffffffffa07175f8>] obd_exports_barrier+0x98/0x170 [obdclass]
00:48:42: [<ffffffffa0e5a962>] ofd_device_fini+0x42/0x230 [ofd]
00:48:42: [<ffffffffa0742f17>] class_cleanup+0x577/0xda0 [obdclass]
00:48:42: [<ffffffffa07197a6>] ? class_name2dev+0x56/0xe0 [obdclass]
00:48:42: [<ffffffffa07447fc>] class_process_config+0x10bc/0x1c80 [obdclass]
00:48:42: [<ffffffffa073e1e3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
00:48:42: [<ffffffffa0745539>] class_manual_cleanup+0x179/0x6f0 [obdclass]
00:48:42: [<ffffffffa07197a6>] ? class_name2dev+0x56/0xe0 [obdclass]
00:48:42: [<ffffffffa07809ec>] server_put_super+0x5ec/0xf60 [obdclass]
00:48:42: [<ffffffff811833ab>] generic_shutdown_super+0x5b/0xe0
00:48:42: [<ffffffff81183496>] kill_anon_super+0x16/0x60
00:48:42: [<ffffffffa07473e6>] lustre_kill_super+0x36/0x60 [obdclass]
00:48:42: [<ffffffff81183c37>] deactivate_super+0x57/0x80
00:48:42: [<ffffffff811a1c8f>] mntput_no_expire+0xbf/0x110
00:48:42: [<ffffffff811a26fb>] sys_umount+0x7b/0x3a0
00:48:42: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Minh Diep [ 25/Jul/13 ] |
|
Hi Sarah,

This seems to be a different issue. Could you file a new bug? |
| Comment by Sarah Liu [ 30/Jul/13 ] |
|
Thanks Minh, |