Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.5.1, Lustre 2.7.0, Lustre 2.8.0
-
None
-
OpenSFS cluster with one MGS/MDS (MDT is a partition of /dev/sda), one OSS with two OSTs, a node running robinhood, one client and agent node and one client
-
3
-
13125
Description
Running acceptance-small for HSM testing on 2.5.1-RC4, replay-single hangs in test 53g. The test results are at https://maloo.whamcloud.com/test_sets/23724c76-abad-11e3-a696-52540035b04c , but there are no logs for test 53g, except at the end of the suite_log:
== replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074) fail_loc=0x119 fail_loc=0x80000115 fail_loc=0 Replay barrier on lscratch-MDT0000 Failing mds1 on c16 Stopping /lustre/lscratch/mdt0 (opts:) on c16
Looking at dmesg on the MDT:
Lustre: DEBUG MARKER: == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074) Lustre: *** cfs_fail_loc=119, val=2147483648*** Lustre: Skipped 1 previous similar message LustreError: 18010:0:(ldlm_lib.c:2415:target_send_reply_msg()) @@@ dropping reply req@ffff88081e682000 x1462504360037684/t274877906958(0) o36->97aaf730-d78d-08d9-43ce-9e768c9c685f@192.168.2.120@o2ib:0/0 lens 488/448 e 0 to 0 dl 1394761114 ref 1 fl Interpret:/0/0 rc 0/0 LustreError: 18010:0:(ldlm_lib.c:2415:target_send_reply_msg()) Skipped 1 previous similar message Lustre: *** cfs_fail_loc=115, val=2147483648*** Turning device sda (0x800003) read-only Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lscratch-MDT0000 Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck? Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck? LustreError: 166-1: MGC192.168.2.116@o2ib: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck? Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck? Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck? Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761310/real 1394761310] req@ffff8808227c9400 x1462503945016504/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394761346 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 14 previous similar messages LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 66 previous similar messages INFO: task umount:18492 blocked for more than 120 seconds. Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 18492 18491 0x00000080 ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000 ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985 ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8 Call Trace: [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0 [<ffffffff81084220>] ? process_timeout+0x0/0x10 [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass] [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs] [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass] [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass] [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass] [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff8118b326>] kill_anon_super+0x16/0x60 [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff8118bac7>] deactivate_super+0x57/0x80 [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110 [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Lustre: MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck? INFO: task umount:18492 blocked for more than 120 seconds. Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 18492 18491 0x00000080 ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000 ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985 ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8 Call Trace: [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0 [<ffffffff81084220>] ? process_timeout+0x0/0x10 [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass] [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs] [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass] [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass] [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass] [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff8118b326>] kill_anon_super+0x16/0x60 [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff8118bac7>] deactivate_super+0x57/0x80 [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110 [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b INFO: task umount:18492 blocked for more than 120 seconds. Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 18492 18491 0x00000080 ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000 ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985 ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8 Call Trace: [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0 [<ffffffff81084220>] ? process_timeout+0x0/0x10 [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass] [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs] [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass] [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass] [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass] [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff8118b326>] kill_anon_super+0x16/0x60 [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff8118bac7>] deactivate_super+0x57/0x80 [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110 [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761910/real 1394761910] req@ffff8808049d6000 x1462503945016540/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394761965 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 120 previous similar messages INFO: task umount:18492 blocked for more than 120 seconds. Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 18492 18491 0x00000080 ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000 ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985 ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8 Call Trace: [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0 [<ffffffff81084220>] ? process_timeout+0x0/0x10 [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass] [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs] [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass] [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass] [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass] [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff8118b326>] kill_anon_super+0x16/0x60 [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff8118bac7>] deactivate_super+0x57/0x80 [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110 [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Lustre: MGS is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck? INFO: task umount:18492 blocked for more than 120 seconds. Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 18492 18491 0x00000080 ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000 ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985 ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8 Call Trace: [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0 [<ffffffff81084220>] ? process_timeout+0x0/0x10 [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass] [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs] [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass] [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass] [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass] [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff8118b326>] kill_anon_super+0x16/0x60 [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff8118bac7>] deactivate_super+0x57/0x80 [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110 [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b INFO: task umount:18492 blocked for more than 120 seconds. Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 18492 18491 0x00000080 ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000 ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985 ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8 Call Trace: [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0 [<ffffffff81084220>] ? process_timeout+0x0/0x10 [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass] [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs] [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass] [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass] [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass] [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass] [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff8118b326>] kill_anon_super+0x16/0x60 [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff8118bac7>] deactivate_super+0x57/0x80 [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110 [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.120@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 113 previous similar messages Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394762585/real 1394762585] req@ffff880819252800 x1462503945016576/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394762640 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages Lustre: MGS is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck? LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.117@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 120 previous similar messages Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394763260/real 1394763260] req@ffff8808202e3800 x1462503945016612/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394763315 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 116 previous similar messages Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394763935/real 1394763935] req@ffff8808040e9800 x1462503945016648/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394763990 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. LustreError: Skipped 113 previous similar messages
On the client, we see
Lustre: DEBUG MARKER: == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074) Lustre: DEBUG MARKER: cancel_lru_locks mdc start Lustre: DEBUG MARKER: cancel_lru_locks mdc stop Lustre: DEBUG MARKER: cancel_lru_locks mdc start Lustre: DEBUG MARKER: cancel_lru_locks mdc stop Lustre: DEBUG MARKER: local REPLAY BARRIER on lscratch-MDT0000 Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761618/real 1394761618] req@ffff88080f538800 x1462504360038020/t0(0) o250->MGC192.168.2.116@o2ib@192.168.2.116@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394761673 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 15 previous similar messages Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394762293/real 1394762293] req@ffff880820549400 x1462504360038380/t0(0) o250->MGC192.168.2.116@o2ib@192.168.2.116@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394762348 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Attachments
Issue Links
- duplicates
-
LU-5539 MGS is waiting for obd_unlinked_exports more than 1024 seconds
- Resolved
-
LU-4062 sanity test_132: MGS is waiting for obd_unlinked_exports more than 512 seconds
- Closed
- is duplicated by
-
LU-5161 replay-single test_42: OBD refcount is 5
- Resolved
-
LU-6103 replay-single test_53f: test failed to respond and timed out
- Resolved
- is related to
-
LU-6616 recovery-small test_27: (mgs_handler.c:1379:mgs_fsc_debug()) ASSERTION( list_empty(&fsdb->fsdb_clients) ) failed
- Closed