Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.4.1, Lustre 2.4.2, Lustre 2.5.1
-
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/
Distro/Arch: RHEL6.4/x86_64
TEST_GROUP=failover
-
3
-
11929
Description
recovery-double-scale test_pairwise_fail hung as follows:
==== START === test 5: failover OST, then 2 clients ==== <~snip~> Starting client: superfat-intel-1vm5: -o user_xattr,acl,flock superfat-intel-1vm7:superfat-intel-1vm3:/lustre /mnt/lustre CMD: superfat-intel-1vm5 mkdir -p /mnt/lustre CMD: superfat-intel-1vm5 mount -t lustre -o user_xattr,acl,flock superfat-intel-1vm7:superfat-intel-1vm3:/lustre /mnt/lustre CMD: superfat-intel-1vm5 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"0x33f0404\" \" 0xffb7e3ff\" 32
Console log on client superfat-intel-1vm5 showed that:
19:25:31:Lustre: lustre-OST0004-osc-ffff8800370c7c00: Connection restored to lustre-OST0004 (at 10.10.4.95@tcp) 19:25:31:Lustre: Skipped 3 previous similar messages 19:25:31:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ... 19:25:31:Lustre: DEBUG MARKER: Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ... 19:25:31:Lustre: DEBUG MARKER: test -f /tmp/client-load.pid && 19:25:31: { kill -s TERM $(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; } 19:25:31: 19:25:31:<ConMan> Console [superfat-intel-1vm5] disconnected from <superfat-intel-1:6004> at 12-04 19:24. 19:25:31: 19:25:31:<ConMan> Console [superfat-intel-1vm5] connected to <superfat-intel-1:6004> at 12-04 19:25. 19:25:31: Press any key to continue. <~snip~> 19:30:33:LNet: Added LNI 10.10.4.84@tcp [8/256/0/180] 19:30:33:LNet: Accept all, port 7988 19:30:33:LustreError: 152-6: Ignoring deprecated mount option 'acl'. 19:30:33:Lustre: Layout lock feature supported. 19:30:33:LustreError: 11-0: lustre-OST0006-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16. 19:30:33:LustreError: 11-0: lustre-OST0005-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16. 19:30:33:Lustre: Mounted lustre-client 19:30:33:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u 19:30:33:LNet: 2176:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release. 19:30:33:LNet: 2177:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release. 19:30:33:LustreError: 11-0: lustre-OST0006-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16. 19:30:33:LustreError: 11-0: lustre-OST0005-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
Console log on OSS superfat-intel-1vm8 showed that:
19:25:47:Lustre: DEBUG MARKER: Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ... 19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:15 19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs 19:25:47:Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 3 clients reconnect 19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:05 19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs 19:25:47:Lustre: lustre-OST0005: recovery is timed out, evict stale exports 19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) lustre-OST0005: lvbo_init failed for resource 0x402:0x0: rc = -2 19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) Skipped 597 previous similar messages 19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs 19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) lustre-OST0005: lvbo_init failed for resource 0x405:0x0: rc = -2 19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) Skipped 3452 previous similar messages 19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:36 19:25:47:Lustre: Skipped 1044 previous similar messages 19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs 19:25:47:Lustre: Skipped 2 previous similar messages 19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:16 19:25:47:Lustre: Skipped 3 previous similar messages 19:27:28:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs 19:27:28:Lustre: Skipped 3 previous similar messages 19:27:28:Lustre: lustre-OST0002: haven't heard from client f23f578c-ed7b-9fa4-df6c-fa1ce4040fbf (at 10.10.4.84@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007af59c00, cur 1386213947 expire 1386213917 last 1386213897 19:27:28:Lustre: lustre-OST0004: haven't heard from client f23f578c-ed7b-9fa4-df6c-fa1ce4040fbf (at 10.10.4.84@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007ef97400, cur 1386213947 expire 1386213917 last 1386213897 19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:13 19:27:28:Lustre: lustre-OST0006: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (1 recovered, 1 in progress, and 0 evicted) to recover in 0:11 19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:08 19:27:28:Lustre: Skipped 1 previous similar message 19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:03 19:27:28:Lustre: Skipped 1 previous similar message 19:27:28:Lustre: lustre-OST0006: recovery is timed out, evict stale exports 19:27:28:Lustre: lustre-OST0006: disconnecting 1 stale clients 19:27:28:Lustre: lustre-OST0006: Recovery over after 1:00, of 3 clients 2 recovered and 1 was evicted. 19:27:28:Lustre: lustre-OST0006: deleting orphan objects from 0x0:1078 to 0x0:1107
Maloo report: https://maloo.whamcloud.com/test_sets/b748b488-5ddf-11e3-aed2-52540035b04c
IP addresses:
superfat-intel-1vm1: 10.10.4.80
superfat-intel-1vm2: 10.10.4.81
superfat-intel-1vm3: 10.10.4.82
superfat-intel-1vm4: 10.10.4.83
superfat-intel-1vm5: 10.10.4.84
superfat-intel-1vm6: 10.10.4.85
superfat-intel-1vm7: 10.10.4.94
superfat-intel-1vm8: 10.10.4.95
From the above logs, we can see:
After oss superfat-intel-1vm4 failed over to oss superfat-intel-1vm8, before client osc connections were restored to lustre-OST0005 and lustre-OST0006, client superfat-intel-1vm5 was powered off and on. After the client node was up and tried to connect to lustre-OST0005, the ost kept denying the connection:
19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:13