[LU-4565] recovery-mds-scale test_failover_ost: failed mounting ost after reboot Created: 30/Jan/14  Updated: 24/Jun/14  Resolved: 24/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: zfs
Environment:

lustre-master build # 1837 RHEL6 zfs


Issue Links:
Related
is related to LU-2059 mgc to backup configuration on osd-ba... Resolved
Severity: 3
Rank (Obsolete): 12460

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ed6584da-847f-11e3-9133-52540035b04c.

The sub-test test_failover_ost failed with the following error:

test failed to respond and timed out
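
For reference, the failover remount that the test framework runs on the rebooted OSS (visible as DEBUG MARKER lines in the dmesg below) amounts to roughly the following sketch; the pool, device, and mountpoint names are taken from the log, everything else is a reconstruction, not the exact test script:

    # Re-import the OST zpool after the reboot, then remount the target:
    zpool list -H lustre-ost2 >/dev/null 2>&1 ||
        zpool import -f -o cachefile=none -d /dev/lvm-Role_OSS lustre-ost2
    mkdir -p /mnt/ost2
    mount -t lustre lustre-ost2/ost2 /mnt/ost2   # hangs; see hung-task trace below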

OST dmesg:

Lustre: DEBUG MARKER: zpool list -H lustre-ost2 >/dev/null 2>&1 ||
			zpool import -f -o cachefile=none -d /dev/lvm-Role_OSS lustre-ost2
Lustre: DEBUG MARKER: mkdir -p /mnt/ost2; mount -t lustre   		                   lustre-ost2/ost2 /mnt/ost2
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291785/real 1390291785]  req@ffff88006c21d800 x1457826572533816/t0(0) o38->lustre-MDT0000-lwp-OST0000@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291790 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 13a-8: Failed to get MGS log params and no local copy.
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291791/real 1390291791]  req@ffff88006b122c00 x1457826572533912/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291796 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 3 clients reconnect
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291801/real 1390291801]  req@ffff88006c21dc00 x1457826572533940/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291811 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-OST0000: Recovery over after 0:07, of 3 clients 3 recovered and 0 were evicted.
Lustre: lustre-OST0000: deleting orphan objects from 0x0:833 to 0x0:833
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291816/real 1390291816]  req@ffff880037b0ec00 x1457826572533968/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291831 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 18 previous similar messages
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291836/real 1390291836]  req@ffff8800680a7000 x1457826572534004/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291856 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1390291866/real 1390291866]  req@ffff880067d01000 x1457826572534060/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291891 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 14 previous similar messages
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291871/real 1390291871]  req@ffff88006addb800 x1457826572534072/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291896 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 14 previous similar messages
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1390291931/real 1390291931]  req@ffff88006addbc00 x1457826572534196/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291956 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 35 previous similar messages
INFO: task mount.lustre:3972 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre  D 0000000000000001     0  3972   3971 0x00000080
 ffff88006af93718 0000000000000086 ffff88006af936a8 ffffffff81065c75
 000000106af936b8 ffff88007e4b2ad8 ffff8800022167a8 ffff880002216740
 ffff88006a935098 ffff88006af93fd8 000000000000fb88 ffff88006a935098
Call Trace:
 [<ffffffff81065c75>] ? enqueue_entity+0x125/0x410
 [<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8150f475>] schedule_timeout+0x215/0x2e0
 [<ffffffff81065c75>] ? enqueue_entity+0x125/0x410
 [<ffffffff810572f4>] ? check_preempt_wakeup+0x1a4/0x260
 [<ffffffff8106605b>] ? enqueue_task_fair+0xfb/0x100
 [<ffffffff8150f0f3>] wait_for_common+0x123/0x180
 [<ffffffff81063990>] ? default_wake_function+0x0/0x20
 [<ffffffff8150f20d>] wait_for_completion+0x1d/0x20
 [<ffffffffa0858c03>] llog_process_or_fork+0x353/0x5f0 [obdclass]
 [<ffffffffa0858eb4>] llog_process+0x14/0x20 [obdclass]
 [<ffffffffa088d444>] class_config_parse_llog+0x1e4/0x330 [obdclass]
 [<ffffffffa105c2d2>] mgc_process_log+0xd22/0x18e0 [mgc]
 [<ffffffffa1056360>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0aaf2a0>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
 [<ffffffffa105e4f5>] mgc_process_config+0x645/0x11d0 [mgc]
 [<ffffffffa089d476>] lustre_process_log+0x256/0xa70 [obdclass]
 [<ffffffffa086f832>] ? class_name2dev+0x42/0xe0 [obdclass]
 [<ffffffff81168043>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
 [<ffffffffa086f8de>] ? class_name2obd+0xe/0x30 [obdclass]
 [<ffffffffa08ce71c>] server_start_targets+0x1c4c/0x1e00 [obdclass]
 [<ffffffffa08a0a7b>] ? lustre_start_mgc+0x48b/0x1e60 [obdclass]
 [<ffffffffa08989e0>] ? class_config_llog_handler+0x0/0x1880 [obdclass]
 [<ffffffffa08d3708>] server_fill_super+0xb98/0x1a64 [obdclass]
 [<ffffffffa08a2628>] lustre_fill_super+0x1d8/0x530 [obdclass]
 [<ffffffffa08a2450>] ? lustre_fill_super+0x0/0x530 [obdclass]
 [<ffffffff811845df>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa089a365>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff81183c1b>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff81183dc2>] do_kern_mount+0x52/0x130
 [<ffffffff81195382>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff811a3f82>] do_mount+0x2d2/0x8d0
 [<ffffffff811a4610>] sys_mount+0x90/0xe0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291996/real 1390291996]  req@ffff88006adf3400 x1457826572534332/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390292021 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
LustreError: 137-5: lustre-OST0003_UUID: not available for connect from 10.10.4.202@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 124 previous similar messages
INFO: task mount.lustre:3972 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre  D 0000000000000001     0  3972   3971 0x00000080


 Comments   
Comment by Andreas Dilger [ 31/Jan/14 ]

It looks like the OST is stuck waiting to connect to the MGS, but the MGS has not been started yet. I think the OST shouldn't block on the MGS, since mounting the MGS first was never a requirement. This is doubly true for 2.4+, since the OST already has its index assigned.
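
A minimal sketch of the blocking behavior described above (the hostnames and the MDT mountpoint are assumptions for illustration, not from this run):

    # With the combined MGS/MDT not running, try to mount an OST:
    ssh mds-node 'umount /mnt/mds'                             # MGS is down
    ssh oss-node 'mount -t lustre lustre-ost2/ost2 /mnt/ost2'  # never returns
    # mount.lustre blocks in mgc_process_log() -> llog_process() waiting for
    # the MGS configuration llog, until the kernel hung-task watchdog prints
    # "task mount.lustre ... blocked for more than 120 seconds".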

Comment by Andreas Dilger [ 25/Apr/14 ]

It may be that fixing LU-2059 "mgc to backup configuration on osd-based llogs" would also fix this problem: the OST could then start with its local logs and avoid waiting for the MGS to start.
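
If that works as hoped, the same mount with the MGS down would complete instead of hanging; a sketch of the expected behavior, assuming the LU-2059 local-copy path:

    # After LU-2059, the MGC falls back to the configuration llog backed up
    # on the local OSD when the MGS is unreachable:
    ssh oss-node 'mount -t lustre lustre-ost2/ost2 /mnt/ost2'  # returns promptly
    # dmesg should no longer show "Failed to get MGS log params and no local
    # copy"; the OST starts from the local copy and resyncs with the MGS later.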

Comment by Jodi Levi (Inactive) [ 23/May/14 ]

Now that LU-2059 is fixed, can someone confirm Andreas's expectation that this issue is fixed as well?
Thanks!

Comment by Jodi Levi (Inactive) [ 02/Jun/14 ]

Please reopen this ticket if the issue reoccurs.

Comment by Sarah Liu [ 19/Jun/14 ]

Hit this error again in lustre-master tag-2.5.60 zfs failover testing:

https://maloo.whamcloud.com/test_sets/1173f18a-f62d-11e3-8491-52540035b04c
OST dmesg

Lustre: 26711:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
INFO: task mount.lustre:27292 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.17.1.el6_lustre.g8d5344f.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre  D 0000000000000000     0 27292  27291 0x00000080
 ffff88005e7d9718 0000000000000086 ffff88005e7d96a8 ffffffff81069f75
 000000105e7d96b8 ffff88007e4c2ad8 ffff8800023168e8 ffff880002316880
 ffff88005f2685f8 ffff88005e7d9fd8 000000000000fbc8 ffff88005f2685f8
Call Trace:
 [<ffffffff81069f75>] ? enqueue_entity+0x125/0x450
 [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff81528f05>] schedule_timeout+0x215/0x2e0
 [<ffffffff81069f75>] ? enqueue_entity+0x125/0x450
 [<ffffffff8105ad54>] ? check_preempt_wakeup+0x1a4/0x260
 [<ffffffff8106a39b>] ? enqueue_task_fair+0xfb/0x100
 [<ffffffff81528b83>] wait_for_common+0x123/0x180
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa08eaa80>] ? client_lwp_config_process+0x0/0x199a [obdclass]
 [<ffffffff81528c9d>] wait_for_completion+0x1d/0x20
 [<ffffffffa086ea34>] llog_process_or_fork+0x344/0x550 [obdclass]
 [<ffffffffa086ec54>] llog_process+0x14/0x30 [obdclass]
 [<ffffffffa08a0524>] class_config_parse_llog+0x1e4/0x330 [obdclass]
 [<ffffffffa103d252>] mgc_process_log+0xdc2/0x1970 [mgc]
 [<ffffffffa1037290>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0acb250>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
 [<ffffffffa103f485>] mgc_process_config+0x645/0x11d0 [mgc]
 [<ffffffffa08b07af>] lustre_process_log+0x20f/0xac0 [obdclass]
 [<ffffffffa08dfa3c>] ? server_find_mount+0xbc/0x160 [obdclass]
 [<ffffffff8116f303>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
 [<ffffffffa08ad8af>] ? server_name2fsname+0x6f/0x90 [obdclass]
 [<ffffffffa08e5a16>] server_start_targets+0x12b6/0x1ae0 [obdclass]
 [<ffffffffa08b3eeb>] ? lustre_start_mgc+0x48b/0x1df0 [obdclass]
 [<ffffffffa08abdc0>] ? class_config_llog_handler+0x0/0x18b0 [obdclass]
 [<ffffffffa08e9ff8>] server_fill_super+0xb98/0x1620 [obdclass]
 [<ffffffffa08b5a28>] lustre_fill_super+0x1d8/0x530 [obdclass]
 [<ffffffffa08b5850>] ? lustre_fill_super+0x0/0x530 [obdclass]
 [<ffffffff8118be5f>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa08ad775>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8118b4bb>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8118b662>] do_kern_mount+0x52/0x130
 [<ffffffff8119d862>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff811ac63b>] do_mount+0x2fb/0x930
 [<ffffffff811acd00>] sys_mount+0x90/0xe0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 26711:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1402948490/real 1402948490]  req@ffff88005d6a5c00 x1471097868387440/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1402948515 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 26711:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.1.6.23@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 135 previous similar messages
INFO: task mount.lustre:27292 blocked for more than 120 seconds.
Comment by Andreas Dilger [ 20/Jun/14 ]

Mike, could you please take a look at this again?

Comment by Peter Jones [ 21/Jun/14 ]

I note that the fix tracked under LU-2059 was not in 2.5.60: http://git.whamcloud.com/fs/lustre-release.git/commit/8c21986e79f50131b0f381e5fe0311294328d660

Comment by Mikhail Pershin [ 23/Jun/14 ]

Peter, in that case we just need LU-2059 ported to b2_5, right?
