[LU-4565] recovery-mds-scale test_failover_ost: failed mounting ost after reboot Created: 30/Jan/14 Updated: 24/Jun/14 Resolved: 24/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | zfs | ||
| Environment: |
lustre-master build # 1837 RHEL6 zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12460 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ed6584da-847f-11e3-9133-52540035b04c.

The sub-test test_failover_ost failed with the following error:

OST dmesg:
Lustre: DEBUG MARKER: zpool list -H lustre-ost2 >/dev/null 2>&1 || zpool import -f -o cachefile=none -d /dev/lvm-Role_OSS lustre-ost2
Lustre: DEBUG MARKER: mkdir -p /mnt/ost2; mount -t lustre lustre-ost2/ost2 /mnt/ost2
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291785/real 1390291785] req@ffff88006c21d800 x1457826572533816/t0(0) o38->lustre-MDT0000-lwp-OST0000@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291790 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 13a-8: Failed to get MGS log params and no local copy.
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291791/real 1390291791] req@ffff88006b122c00 x1457826572533912/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291796 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 3 clients reconnect
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291801/real 1390291801] req@ffff88006c21dc00 x1457826572533940/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291811 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-OST0000: Recovery over after 0:07, of 3 clients 3 recovered and 0 were evicted.
Lustre: lustre-OST0000: deleting orphan objects from 0x0:833 to 0x0:833
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291816/real 1390291816] req@ffff880037b0ec00 x1457826572533968/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291831 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 18 previous similar messages
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291836/real 1390291836] req@ffff8800680a7000 x1457826572534004/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291856 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1390291866/real 1390291866] req@ffff880067d01000 x1457826572534060/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291891 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 14 previous similar messages
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291871/real 1390291871] req@ffff88006addb800 x1457826572534072/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291896 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 14 previous similar messages
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1390291931/real 1390291931] req@ffff88006addbc00 x1457826572534196/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291956 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 35 previous similar messages
INFO: task mount.lustre:3972 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre D 0000000000000001 0 3972 3971 0x00000080
ffff88006af93718 0000000000000086 ffff88006af936a8 ffffffff81065c75
000000106af936b8 ffff88007e4b2ad8 ffff8800022167a8 ffff880002216740
ffff88006a935098 ffff88006af93fd8 000000000000fb88 ffff88006a935098
Call Trace:
[<ffffffff81065c75>] ? enqueue_entity+0x125/0x410
[<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8150f475>] schedule_timeout+0x215/0x2e0
[<ffffffff81065c75>] ? enqueue_entity+0x125/0x410
[<ffffffff810572f4>] ? check_preempt_wakeup+0x1a4/0x260
[<ffffffff8106605b>] ? enqueue_task_fair+0xfb/0x100
[<ffffffff8150f0f3>] wait_for_common+0x123/0x180
[<ffffffff81063990>] ? default_wake_function+0x0/0x20
[<ffffffff8150f20d>] wait_for_completion+0x1d/0x20
[<ffffffffa0858c03>] llog_process_or_fork+0x353/0x5f0 [obdclass]
[<ffffffffa0858eb4>] llog_process+0x14/0x20 [obdclass]
[<ffffffffa088d444>] class_config_parse_llog+0x1e4/0x330 [obdclass]
[<ffffffffa105c2d2>] mgc_process_log+0xd22/0x18e0 [mgc]
[<ffffffffa1056360>] ? mgc_blocking_ast+0x0/0x810 [mgc]
[<ffffffffa0aaf2a0>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
[<ffffffffa105e4f5>] mgc_process_config+0x645/0x11d0 [mgc]
[<ffffffffa089d476>] lustre_process_log+0x256/0xa70 [obdclass]
[<ffffffffa086f832>] ? class_name2dev+0x42/0xe0 [obdclass]
[<ffffffff81168043>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
[<ffffffffa086f8de>] ? class_name2obd+0xe/0x30 [obdclass]
[<ffffffffa08ce71c>] server_start_targets+0x1c4c/0x1e00 [obdclass]
[<ffffffffa08a0a7b>] ? lustre_start_mgc+0x48b/0x1e60 [obdclass]
[<ffffffffa08989e0>] ? class_config_llog_handler+0x0/0x1880 [obdclass]
[<ffffffffa08d3708>] server_fill_super+0xb98/0x1a64 [obdclass]
[<ffffffffa08a2628>] lustre_fill_super+0x1d8/0x530 [obdclass]
[<ffffffffa08a2450>] ? lustre_fill_super+0x0/0x530 [obdclass]
[<ffffffff811845df>] get_sb_nodev+0x5f/0xa0
[<ffffffffa089a365>] lustre_get_sb+0x25/0x30 [obdclass]
[<ffffffff81183c1b>] vfs_kern_mount+0x7b/0x1b0
[<ffffffff81183dc2>] do_kern_mount+0x52/0x130
[<ffffffff81195382>] ? vfs_ioctl+0x22/0xa0
[<ffffffff811a3f82>] do_mount+0x2d2/0x8d0
[<ffffffff811a4610>] sys_mount+0x90/0xe0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291996/real 1390291996] req@ffff88006adf3400 x1457826572534332/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390292021 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
LustreError: 137-5: lustre-OST0003_UUID: not available for connect from 10.10.4.202@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 124 previous similar messages
INFO: task mount.lustre:3972 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre D 0000000000000001 0 3972 3971 0x00000080 |
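(A minimal shell sketch of the failing sequence, restating the two DEBUG MARKER commands from the dmesg above. The pool, dataset, and mount point names are the ones from this run; the comments only describe where the hung-task backtrace shows mount.lustre blocking. This is illustration, not part of the test framework.)

# Import the OST pool on the failover OSS if it is not already visible there.
zpool list -H lustre-ost2 >/dev/null 2>&1 || \
    zpool import -f -o cachefile=none -d /dev/lvm-Role_OSS lustre-ost2

# Mount the OST. In this failure mount.lustre never returns: it blocks in
# llog_process_or_fork() -> mgc_process_log() waiting for the MGS
# configuration log, which is what the hung-task backtrace above shows.
mkdir -p /mnt/ost2
mount -t lustre lustre-ost2/ost2 /mnt/ost2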
| Comments |
| Comment by Andreas Dilger [ 31/Jan/14 ] |
|
It looks like the OST is stuck waiting to connect to the MGS, but the MGS has not been started yet. I think the OST shouldn't block on the MGS, since mounting the MGS first was never a requirement. This is doubly true for 2.4+, since the OST already has its index assigned. |
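(A hedged illustration of the ordering dependency described in the comment above. The MGS NID, 10.10.4.198@tcp, is taken from the o38 request lines in the dmesg; the MDT dataset and mount point names are hypothetical and used only for the sketch. This is a manual diagnostic, not a fix.)

# If the MGS node is not reachable over LNet, the OST mount blocks in
# mgc_process_log() exactly as in the backtrace above.
lctl ping 10.10.4.198@tcp || echo "MGS at 10.10.4.198@tcp is not reachable"

# Mounting the MGS/MDT first (hypothetical names) avoids the hang, even
# though, as noted above, that ordering was never meant to be required.
mount -t lustre lustre-mdt1/mdt1 /mnt/mds1
mount -t lustre lustre-ost2/ost2 /mnt/ost2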
| Comment by Andreas Dilger [ 25/Apr/14 ] |
|
It may be that fixing |
| Comment by Jodi Levi (Inactive) [ 23/May/14 ] |
|
Now that |
| Comment by Jodi Levi (Inactive) [ 02/Jun/14 ] |
|
Please reopen this ticket if the issue reoccurs. |
| Comment by Sarah Liu [ 19/Jun/14 ] |
|
Hit this error again in lustre-master tag-2.5.60 zfs failover testing: https://maloo.whamcloud.com/test_sets/1173f18a-f62d-11e3-8491-52540035b04c

Lustre: 26711:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
INFO: task mount.lustre:27292 blocked for more than 120 seconds.
Tainted: P --------------- 2.6.32-431.17.1.el6_lustre.g8d5344f.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre D 0000000000000000 0 27292 27291 0x00000080
ffff88005e7d9718 0000000000000086 ffff88005e7d96a8 ffffffff81069f75
000000105e7d96b8 ffff88007e4c2ad8 ffff8800023168e8 ffff880002316880
ffff88005f2685f8 ffff88005e7d9fd8 000000000000fbc8 ffff88005f2685f8
Call Trace:
[<ffffffff81069f75>] ? enqueue_entity+0x125/0x450
[<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff81528f05>] schedule_timeout+0x215/0x2e0
[<ffffffff81069f75>] ? enqueue_entity+0x125/0x450
[<ffffffff8105ad54>] ? check_preempt_wakeup+0x1a4/0x260
[<ffffffff8106a39b>] ? enqueue_task_fair+0xfb/0x100
[<ffffffff81528b83>] wait_for_common+0x123/0x180
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffffa08eaa80>] ? client_lwp_config_process+0x0/0x199a [obdclass]
[<ffffffff81528c9d>] wait_for_completion+0x1d/0x20
[<ffffffffa086ea34>] llog_process_or_fork+0x344/0x550 [obdclass]
[<ffffffffa086ec54>] llog_process+0x14/0x30 [obdclass]
[<ffffffffa08a0524>] class_config_parse_llog+0x1e4/0x330 [obdclass]
[<ffffffffa103d252>] mgc_process_log+0xdc2/0x1970 [mgc]
[<ffffffffa1037290>] ? mgc_blocking_ast+0x0/0x810 [mgc]
[<ffffffffa0acb250>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
[<ffffffffa103f485>] mgc_process_config+0x645/0x11d0 [mgc]
[<ffffffffa08b07af>] lustre_process_log+0x20f/0xac0 [obdclass]
[<ffffffffa08dfa3c>] ? server_find_mount+0xbc/0x160 [obdclass]
[<ffffffff8116f303>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
[<ffffffffa08ad8af>] ? server_name2fsname+0x6f/0x90 [obdclass]
[<ffffffffa08e5a16>] server_start_targets+0x12b6/0x1ae0 [obdclass]
[<ffffffffa08b3eeb>] ? lustre_start_mgc+0x48b/0x1df0 [obdclass]
[<ffffffffa08abdc0>] ? class_config_llog_handler+0x0/0x18b0 [obdclass]
[<ffffffffa08e9ff8>] server_fill_super+0xb98/0x1620 [obdclass]
[<ffffffffa08b5a28>] lustre_fill_super+0x1d8/0x530 [obdclass]
[<ffffffffa08b5850>] ? lustre_fill_super+0x0/0x530 [obdclass]
[<ffffffff8118be5f>] get_sb_nodev+0x5f/0xa0
[<ffffffffa08ad775>] lustre_get_sb+0x25/0x30 [obdclass]
[<ffffffff8118b4bb>] vfs_kern_mount+0x7b/0x1b0
[<ffffffff8118b662>] do_kern_mount+0x52/0x130
[<ffffffff8119d862>] ? vfs_ioctl+0x22/0xa0
[<ffffffff811ac63b>] do_mount+0x2fb/0x930
[<ffffffff811acd00>] sys_mount+0x90/0xe0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 26711:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1402948490/real 1402948490] req@ffff88005d6a5c00 x1471097868387440/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.1.6.21@tcp:12/10 lens 400/544 e 0 to 1 dl 1402948515 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 26711:0:(client.c:1924:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.1.6.23@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 135 previous similar messages
INFO: task mount.lustre:27292 blocked for more than 120 seconds.
|
| Comment by Andreas Dilger [ 20/Jun/14 ] |
|
Mike, could you please take a look at this again? |
| Comment by Peter Jones [ 21/Jun/14 ] |
|
I note that this fix tracked under |
| Comment by Mikhail Pershin [ 23/Jun/14 ] |
|
Peter, in that case we need just |