[LU-6273] Hard Failover replay-dual test_17: Failover OST mount hang Created: 24/Feb/15  Updated: 23/Sep/15  Resolved: 22/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: zfs
Environment:

client and server: lustre-master build # 2856
zfs


Issue Links:
Related
is related to LU-4214 Hyperion - OST never recovers on fail... Resolved
Severity: 3
Rank (Obsolete): 17590

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/0429703c-ba58-11e4-8053-5254006e85c2.

The sub-test test_17 failed with the following error:

test failed to respond and timed out

ost dmesg

Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.2.4.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 120 previous similar messages
INFO: task mount.lustre:3630 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre  D 0000000000000000     0  3630   3629 0x00000080
 ffff88006edf9718 0000000000000082 0000000000000000 ffff88006ef82040
 ffff88006edf9698 ffffffff81055783 ffff88007e4c2ad8 ffff880002216880
 ffff88006ef825f8 ffff88006edf9fd8 000000000000fbc8 ffff88006ef825f8
Call Trace:
 [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
 [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
 [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
 [<ffffffff8152a213>] wait_for_common+0x123/0x180
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
 [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
 [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
 [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
 [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
 [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
 [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
 [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
 [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
 [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
 [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
 [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
 [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
 [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
 [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
 [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
 [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff811ad74b>] do_mount+0x2fb/0x930
 [<ffffffff811ade10>] sys_mount+0x90/0xe0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565647/real 1424565647]  req@ffff880070449080 x1493765169611180/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565672 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565712/real 1424565712]  req@ffff880070449680 x1493765169611316/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565737 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
INFO: task mount.lustre:3630 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre  D 0000000000000000     0  3630   3629 0x00000080
 ffff88006edf9718 0000000000000082 0000000000000000 ffff88006ef82040
 ffff88006edf9698 ffffffff81055783 ffff88007e4c2ad8 ffff880002216880
 ffff88006ef825f8 ffff88006edf9fd8 000000000000fbc8 ffff88006ef825f8
Call Trace:
 [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
 [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
 [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
 [<ffffffff8152a213>] wait_for_common+0x123/0x180
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
 [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
 [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
 [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
 [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
 [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
 [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
 [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
 [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
 [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
 [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
 [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
 [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
 [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
 [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
 [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
 [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff811ad74b>] do_mount+0x2fb/0x930
 [<ffffffff811ade10>] sys_mount+0x90/0xe0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.2.4.156@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 304 previous similar messages
Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565842/real 1424565842]  req@ffff880070449c80 x1493765169611592/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565867 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
INFO: task mount.lustre:3630 blocked for more than 120 seconds.


 Comments   
Comment by Oleg Drokin [ 24/Feb/15 ]

So the real problem here is that the MDS has crashed and now the OSTs cannot reconnect to it anymore.
We need to find the MDS crashdump and get a log out of it to see why it crashed.

Comment by Jodi Levi (Inactive) [ 24/Feb/15 ]

Mike,
Could you please have a look and comment on this one?
Thank you!

Comment by Oleg Drokin [ 24/Feb/15 ]

Additionally - even with the MDS/MGS node still down, the OSTs absolutely must be able to start as long as they have a cached copy of the config.
And yet in the OST logs we see OST0000 started and OST0002 hung - so we also need to see why that one got stuck - I imagine Mike would want to look at this one.

Comment by Sarah Liu [ 04/Mar/15 ]

another instance seen in recovery-double-scale

https://testing.hpdd.intel.com/test_sets/85116226-c170-11e4-bef2-5254006e85c2

OST

Lustre: 27409:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
INFO: task mount.lustre:27983 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-504.8.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre  D 0000000000000000     0 27983  27982 0x00000080
 ffff88005ca1f718 0000000000000082 000000000000ebc8 ffff88005ed70640
 ffff88007e4c2aa0 ffff88005ed70080 ffff88005ca1f698 ffffffff810588b3
 ffff88007e4c2ad8 ffff880002215900 ffff88005ed70638 ffff88005ca1ffd8
Call Trace:
 [<ffffffff810588b3>] ? set_next_buddy+0x43/0x50
 [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8152b185>] schedule_timeout+0x215/0x2e0
 [<ffffffff8106d175>] ? enqueue_entity+0x125/0x450
 [<ffffffff8105dea4>] ? check_preempt_wakeup+0x1a4/0x260
 [<ffffffff8106d59b>] ? enqueue_task_fair+0xfb/0x100
 [<ffffffff8152ae03>] wait_for_common+0x123/0x180
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa09127c0>] ? client_lwp_config_process+0x0/0x1912 [obdclass]
 [<ffffffff8152af1d>] wait_for_completion+0x1d/0x20
 [<ffffffffa089ecd4>] llog_process_or_fork+0x354/0x540 [obdclass]
 [<ffffffffa089eed4>] llog_process+0x14/0x30 [obdclass]
 [<ffffffffa08cddf4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
 [<ffffffffa10344f2>] mgc_process_log+0xeb2/0x1970 [mgc]
 [<ffffffffa102e1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0ad5800>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa1035ef8>] mgc_process_config+0x658/0x1210 [mgc]
 [<ffffffff81174d4c>] ? __kmalloc+0x21c/0x230
 [<ffffffffa08defa3>] lustre_process_log+0x7e3/0x1130 [obdclass]
 [<ffffffffa0906e9c>] ? server_find_mount+0xbc/0x160 [obdclass]
 [<ffffffff811749f3>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
 [<ffffffffa08dad6f>] ? server_name2fsname+0x6f/0x90 [obdclass]
 [<ffffffffa090cf56>] server_start_targets+0x12b6/0x1af0 [obdclass]
 [<ffffffffa08e1bdb>] ? lustre_start_mgc+0x48b/0x1e00 [obdclass]
 [<ffffffffa08d8fb0>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
 [<ffffffffa0911d15>] server_fill_super+0xbe5/0x1690 [obdclass]
 [<ffffffffa08e3ab0>] lustre_fill_super+0x560/0xa80 [obdclass]
 [<ffffffffa08e3550>] ? lustre_fill_super+0x0/0xa80 [obdclass]
 [<ffffffff811917af>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa08dab05>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff81190deb>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff81190f92>] do_kern_mount+0x52/0x130
 [<ffffffff811a3c12>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff811b2b9b>] do_mount+0x2fb/0x930
 [<ffffffff811b3260>] sys_mount+0x90/0xe0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 27409:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1425299784/real 1425299784]  req@ffff88005ee57380 x1494534964904508/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.1.4.205@tcp:12/10 lens 400/544 e 0 to 1 dl 1425299809 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 27409:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.1.4.198@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 40 previous similar messages
Lustre: 27409:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425299849/real 1425299849]  req@ffff88005c0a69c0 x1494534964904636/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.1.4.205@tcp:12/10 lens 400/544 e 0 to 1 dl 1425299874 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 27409:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
INFO: task mount.lustre:27983 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-504.8.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.lustre  D 0000000000000000     0 27983  27982 0x00000080

Comment by Jian Yu [ 11/Mar/15 ]

This is blocking ZFS hard failover testing on master branch:
https://testing.hpdd.intel.com/test_sessions/dbcec19e-c81e-11e4-be50-5254006e85c2

Comment by Joseph Gmitter (Inactive) [ 18/Aug/15 ]

Hi Mike,
Can you take a look at this issue? It is a blocker for 2.8.
Thanks.
Joe

Comment by Mikhail Pershin [ 18/Aug/15 ]

yes, I am working on this

Comment by Mikhail Pershin [ 19/Aug/15 ]

This looks like a bug with lw_client detection in target_handle_connect(). I think this patch should do the job:
http://review.whamcloud.com/#/c/13726/3
Meanwhile, it is fixed in master by one of the DNE2 patches; I will combine both patches into one for 2.7.

Comment by Joseph Gmitter (Inactive) [ 20/Aug/15 ]

Looks like that patch needs to be rebased to move forward.

Comment by Mikhail Pershin [ 20/Aug/15 ]

In fact, I think this is already fixed by commit f1d81db9376965e302ecc05e10be220d72b2f04a from LU-6629; I've just added the fix for the initial LWP connection logic from my old patch for LU-4214.

Comment by Jian Yu [ 24/Aug/15 ]

On master branch:

RHEL 7.1 with ZFS:
https://testing.hpdd.intel.com/test_sessions/3d62dc24-48bc-11e5-a657-5254006e85c2

RHEL 6.6 with ZFS:
https://testing.hpdd.intel.com/test_sessions/6a2783ea-4853-11e5-a4ad-5254006e85c2

RHEL 7.1 with ldiskfs:
https://testing.hpdd.intel.com/test_sessions/e8b22732-498a-11e5-8ada-5254006e85c2

RHEL 6.6 with ldiskfs:
https://testing.hpdd.intel.com/test_sessions/1220e9f8-470d-11e5-bfb6-5254006e85c2

The failure also occurred in the ZFS test sessions on the master branch, but not in the ldiskfs test sessions.

Comment by Mikhail Pershin [ 25/Aug/15 ]

It looks like the OST is unable to switch to the failover MDS after all and keeps trying to reach the failed MDS. That means there is a problem in its import connection logic, or the local config doesn't provide information about the failover MDS.

Comment by Jian Yu [ 04/Sep/15 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_6/2 (2.6.0)
Distro/Arch: RHEL6.5/x86_64
FSTYPE=zfs

The failure also occurred on Lustre 2.6.0 release:
https://testing.hpdd.intel.com/test_sessions/05c8dc02-135f-11e4-92ae-5254006e85c2

Comment by Mikhail Pershin [ 06/Sep/15 ]

The problem still looks related to the LWP device; it is not able to switch to the failover server for some reason. Meanwhile, the local config llog exists on the OST and contains the failover NID, so the problem might be inside the LWP code. It is also not clear why the issue occurs only on ZFS setups.

Comment by Niu Yawei (Inactive) [ 06/Sep/15 ]
Call Trace:
 [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
 [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
 [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
 [<ffffffff8152a213>] wait_for_common+0x123/0x180
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
 [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
 [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
 [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
 [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
 [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
 [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
 [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
 [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
 [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
 [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
 [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
 [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
 [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
 [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
 [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
 [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff811ad74b>] do_mount+0x2fb/0x930
 [<ffffffff811ade10>] sys_mount+0x90/0xe0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

The mount thread is waiting for llog record processing to complete; let's see what the llog processing thread is doing:

llog_process_ S 0000000000000001     0 28768      2 0x00000080
 ffff880058535970 0000000000000046 ffff880058535960 ffffffffa0c04f4c
 ffff8800585359d0 000000007f823240 0000000000000000 ffff8800590f4511
 0000004958535910 ffffffffa0cccece ffff880058f285f8 ffff880058535fd8
Call Trace:
 [<ffffffffa0c04f4c>] ? ptlrpc_unregister_reply+0x6c/0x810 [ptlrpc]
 [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0
 [<ffffffff81087540>] ? process_timeout+0x0/0x10
 [<ffffffffa0c07ba9>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]
 [<ffffffffa0bfd2c0>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0c13da5>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0c08331>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa0e7455b>] fld_client_rpc+0x15b/0x510 [fld]
 [<ffffffff81297869>] ? simple_strtoul+0x9/0x10
 [<ffffffffa0e7b024>] fld_update_from_controller+0x1c4/0x570 [fld]
 [<ffffffffa124f8e8>] ofd_register_lwp_callback+0xa8/0x5a0 [ofd]
 [<ffffffffa0a0c84f>] lustre_lwp_connect+0xacf/0xd10 [obdclass]
 [<ffffffffa0a0db25>] lustre_lwp_setup+0x8c5/0xc60 [obdclass]
 [<ffffffffa09dbc08>] ? target_name2index+0x78/0xc0 [obdclass]
 [<ffffffffa0a0fa07>] client_lwp_config_process+0x1357/0x1da0 [obdclass]
 [<ffffffffa099c35a>] llog_process_thread+0x94a/0xfc0 [obdclass]
 [<ffffffffa099d515>] llog_process_thread_daemonize+0x45/0x70 [obdclass]
 [<ffffffffa099d4d0>] ? llog_process_thread_daemonize+0x0/0x70 [obdclass]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

It's trying to query the FLDB from MDT0, but unfortunately the FLD RPC timed out somehow. Though I'm not sure of the reason for the timeout, I think we shouldn't do that in ofd_register_lwp_callback() in the first place, because it will block the OST mount process when MDT0 isn't started.

Di, could you take a look? Is there any way to avoid that problem?

Comment by Niu Yawei (Inactive) [ 07/Sep/15 ]

I see now why the LWP on the OST cannot connect to the failover MDT:

On OST mount, we process the client config log to set up the LWP as follows:
1> Process the LCFG_ADD_UUID record to set up the LWP device, then connect to the MDT (see lustre_lwp_setup());
2> Process the LCFG_ADD_CONN record to add the failover connection.

We can see that if the mount process is blocked at step 1, it never gets a chance to add the failover connection, and the LWP will never be able to switch to the failover node.

Unfortunately, the process can indeed be blocked at step 1; see lustre_lwp_setup() -> lustre_lwp_connect():

        rc = obd_connect(&env, &exp, lwp, uuid, data, NULL);
        if (rc != 0) {
                CERROR("%s: connect failed: rc = %d\n", lwp->obd_name, rc);
        } else {
                if (unlikely(lwp->obd_lwp_export != NULL))
                        class_export_put(lwp->obd_lwp_export);
                lwp->obd_lwp_export = class_export_get(exp);
                lustre_notify_lwp_list(exp);
        }

lustre_notify_lwp_list() is called after obd_connect() returns successfully (which means the CONNECT request could be sent, not that the connection is established). lustre_notify_lwp_list() calls ofd_register_lwp_callback() to send an FLD RPC; please note that at this point it tries to send the RPC to the primary MDT0 (the failover MDT0 connection has not been added yet).

Let's look at fld_client_rpc():

        if (rc != 0) {
                if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
                        /* Since LWP is not replayable, so it will keep
                         * trying unless umount happens, otherwise it would
                         * cause unecessary failure of the application. */
                        ptlrpc_req_finished(req);
                        rc = 0;
                        goto again;
                }
                GOTO(out_req, rc);
        }

It keeps trying to send the RPC to the primary MDT0 again and again in a loop... In the end, the mount process is blocked here forever (the failover connection never gets added).
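
To make the hang concrete, here is a minimal user-space model (not Lustre code; process_add_uuid(), process_add_conn() and fld_rpc_to_primary_mdt() are invented stand-ins for the config-record handlers and fld_client_rpc()). The thread that processes the config log blocks inside the handler for the first record, retrying an RPC that can never succeed, so the second record is never processed and the completion the mount thread waits on is never signalled:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static bool llog_processed;

/* Stand-in for the FLD RPC to the primary MDT0; it never succeeds here
 * because MDT0 is down and no failover NID has been added yet. */
static int fld_rpc_to_primary_mdt(void)
{
        return -1;
}

/* Record 1: LCFG_ADD_UUID - set up the LWP, connect, notify (sends the FLD RPC). */
static void process_add_uuid(void)
{
        /* mirrors the "goto again" retry loop in fld_client_rpc() */
        while (fld_rpc_to_primary_mdt() != 0) {
                fprintf(stderr, "FLD RPC to primary MDT0 failed, retrying\n");
                sleep(1);
        }
}

/* Record 2: LCFG_ADD_CONN - add the failover NID. Never reached. */
static void process_add_conn(void)
{
        fprintf(stderr, "failover connection added\n");
}

static void *llog_process_thread(void *arg)
{
        (void)arg;
        process_add_uuid();             /* blocks forever in the retry loop */
        process_add_conn();
        pthread_mutex_lock(&lock);
        llog_processed = true;
        pthread_cond_signal(&done);
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, llog_process_thread, NULL);
        /* The mount path: wait_for_completion() on llog processing. */
        pthread_mutex_lock(&lock);
        while (!llog_processed)
                pthread_cond_wait(&done, &lock);   /* hangs, as in the trace */
        pthread_mutex_unlock(&lock);
        printf("mount finished\n");                /* never printed */
        return 0;
}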

I'll think further about how to resolve this. Di, any thoughts? Thanks.

Comment by Di Wang [ 07/Sep/15 ]

Hmm, I actually hit this problem while testing one of my patches:
http://review.whamcloud.com/#/c/15275/ (patch set 7)

See this failure: https://testing.hpdd.intel.com/test_sets/e2a60dca-3ff5-11e5-afd4-5254006e85c2

We probably should set the no_delay and allow_replay flags for the FLD update RPC. So I think this patch should fix the problem:

diff --git a/lustre/fld/fld_request.c b/lustre/fld/fld_request.c
index a59ab95..c82b098 100644
--- a/lustre/fld/fld_request.c
+++ b/lustre/fld/fld_request.c
@@ -399,6 +399,14 @@ again:
 
                req_capsule_set_size(&req->rq_pill, &RMF_GENERIC_DATA,
                                     RCL_SERVER, PAGE_CACHE_SIZE);
+
+               /* This might happen before the import state becomes to FULL,
+                * let's set allow_replay for this request to avoid deadlock
+                * see LU-6273 */
+               req->rq_allow_replay = 1;
+               /* This will always use LWP connection, let's send the req
+                * with no_delay flags, see above */
+               req->rq_no_delay = 1;
                break;
        default:
                rc = -EINVAL;

I will push the fix to http://review.whamcloud.com/#/c/15275/

Comment by Mikhail Pershin [ 07/Sep/15 ]

Yes, that looks like the reason, but I still wonder why ldiskfs setups are not affected. Di, IIRC, you implemented the local FLDB in the past; shouldn't the OST have a local copy of the FLDB?

Comment by Di Wang [ 07/Sep/15 ]

I discussed this with Niu a bit, and this patch is probably not enough according to Niu's comment above. We probably should move this FLDB update (fld_update_from_controller) to a separate thread; otherwise the LWP config log processing will not continue, and the LWP will never be able to connect. Yes, there is a local FLDB on each target, but this FLDB update only happens during first initialization (checking the lsf_new flag) or on upgrade from 2.4 (in a DNE environment) to the current DNE version (2.8). And I do not know why it only happens on ZFS. It may be because fld_index_init is slow on ZFS, so the OST is restarted before fld_index_init finishes.
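
As an illustration only, here is a sketch of that separate-thread idea in the same kind of toy user-space model (register_lwp_callback() and fld_update_worker() are invented names, not the actual Lustre implementation): the notify callback hands the potentially long-running FLD update to a detached worker and returns immediately, so config-log processing can go on to add the failover connection:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the FLD update RPC; it may fail and retry for a long
 * time while MDT0 is unreachable. */
static int fld_rpc(void)
{
        return -1;
}

static void *fld_update_worker(void *arg)
{
        (void)arg;
        while (fld_rpc() != 0)
                sleep(1);               /* retries no longer block the mount */
        fprintf(stderr, "FLDB updated from controller\n");
        return NULL;
}

/* Called from the LWP notify path instead of sending the RPC inline. */
static int register_lwp_callback(void)
{
        pthread_t tid;

        if (pthread_create(&tid, NULL, fld_update_worker, NULL) != 0)
                return -1;
        pthread_detach(tid);
        return 0;                       /* llog processing continues at once */
}

int main(void)
{
        register_lwp_callback();
        /* Config-log processing continues: the failover NID gets added
         * even though the FLDB update is still retrying in the background. */
        fprintf(stderr, "LCFG_ADD_CONN processed, failover connection added\n");
        sleep(2);
        return 0;
}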

Comment by Mikhail Pershin [ 08/Sep/15 ]

I think we can just move lustre_notify_lwp_list() to lustre_start_lwp(), right after lsi_lwp_started is set to 1. At that point the llog has been processed and the LWP should be fully set up.

Comment by Gerrit Updater [ 08/Sep/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16303
Subject: LU-6273 fld: update local FLDB in a separate thread
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 92feb9161586f00cc03c481c81598f3b77847fb3

Comment by Di Wang [ 08/Sep/15 ]
I think we can just move lustre_notify_lwp_list() to lustre_start_lwp(), right after lsi_lwp_started is set to 1. At that point the llog has been processed and the LWP should be fully set up.

No, lustre_start_lwp() does not guarantee the LWP will be set up by then, because MDT0 might not have been added to the client config log yet; for example, the OST might be started before MDT0.

Comment by Gerrit Updater [ 08/Sep/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16304
Subject: LU-6273 lwp: notify LWP is ready after llog processing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1be148d882a209137e92e525bc69b601e114646c

Comment by Gerrit Updater [ 22/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16303/
Subject: LU-6273 lwp: notify LWP users in dedicated thread
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b1848aa5b23fd332362e9ae3d5aab31d8dd9d920

Comment by Peter Jones [ 22/Sep/15 ]

Landed for 2.8
