[LU-4355] recovery-double-scale test_pairwise_fail hung: operation ost_connect failed with -16 Created: 06/Dec/13  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.4.2, Lustre 2.5.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: yuc2
Environment:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/
Distro/Arch: RHEL6.4/x86_64
TEST_GROUP=failover


Severity: 3
Rank (Obsolete): 11929

 Description   

recovery-double-scale test_pairwise_fail hung as follows:

==== START === test 5: failover OST, then 2 clients ====
<~snip~>
Starting client: superfat-intel-1vm5: -o user_xattr,acl,flock superfat-intel-1vm7:superfat-intel-1vm3:/lustre /mnt/lustre
CMD: superfat-intel-1vm5 mkdir -p /mnt/lustre
CMD: superfat-intel-1vm5 mount -t lustre -o user_xattr,acl,flock superfat-intel-1vm7:superfat-intel-1vm3:/lustre /mnt/lustre
CMD: superfat-intel-1vm5 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"0x33f0404\" \" 0xffb7e3ff\" 32 

The console log on client superfat-intel-1vm5 showed:

19:25:31:Lustre: lustre-OST0004-osc-ffff8800370c7c00: Connection restored to lustre-OST0004 (at 10.10.4.95@tcp)
19:25:31:Lustre: Skipped 3 previous similar messages
19:25:31:Lustre: DEBUG MARKER: /usr/sbin/lctl mark                             Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ... 
19:25:31:Lustre: DEBUG MARKER: Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ...
19:25:31:Lustre: DEBUG MARKER: test -f /tmp/client-load.pid &&
19:25:31:        { kill -s TERM $(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
19:25:31:
19:25:31:<ConMan> Console [superfat-intel-1vm5] disconnected from <superfat-intel-1:6004> at 12-04 19:24.
19:25:31:
19:25:31:<ConMan> Console [superfat-intel-1vm5] connected to <superfat-intel-1:6004> at 12-04 19:25.
19:25:31:
Press any key to continue.

<~snip~>

19:30:33:LNet: Added LNI 10.10.4.84@tcp [8/256/0/180]
19:30:33:LNet: Accept all, port 7988
19:30:33:LustreError: 152-6: Ignoring deprecated mount option 'acl'.
19:30:33:Lustre: Layout lock feature supported.
19:30:33:LustreError: 11-0: lustre-OST0006-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
19:30:33:LustreError: 11-0: lustre-OST0005-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
19:30:33:Lustre: Mounted lustre-client
19:30:33:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
19:30:33:LNet: 2176:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
19:30:33:LNet: 2177:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
19:30:33:LustreError: 11-0: lustre-OST0006-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
19:30:33:LustreError: 11-0: lustre-OST0005-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.

The console log on OSS superfat-intel-1vm8 showed:

19:25:47:Lustre: DEBUG MARKER: Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ...
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:15
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:25:47:Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 3 clients reconnect
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:05
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:25:47:Lustre: lustre-OST0005: recovery is timed out, evict stale exports
19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) lustre-OST0005: lvbo_init failed for resource 0x402:0x0: rc = -2
19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) Skipped 597 previous similar messages
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) lustre-OST0005: lvbo_init failed for resource 0x405:0x0: rc = -2
19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) Skipped 3452 previous similar messages
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:36
19:25:47:Lustre: Skipped 1044 previous similar messages
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:25:47:Lustre: Skipped 2 previous similar messages
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:16
19:25:47:Lustre: Skipped 3 previous similar messages
19:27:28:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:27:28:Lustre: Skipped 3 previous similar messages
19:27:28:Lustre: lustre-OST0002: haven't heard from client f23f578c-ed7b-9fa4-df6c-fa1ce4040fbf (at 10.10.4.84@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007af59c00, cur 1386213947 expire 1386213917 last 1386213897
19:27:28:Lustre: lustre-OST0004: haven't heard from client f23f578c-ed7b-9fa4-df6c-fa1ce4040fbf (at 10.10.4.84@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007ef97400, cur 1386213947 expire 1386213917 last 1386213897
19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:13
19:27:28:Lustre: lustre-OST0006: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (1 recovered, 1 in progress, and 0 evicted) to recover in 0:11
19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:08
19:27:28:Lustre: Skipped 1 previous similar message
19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:03
19:27:28:Lustre: Skipped 1 previous similar message
19:27:28:Lustre: lustre-OST0006: recovery is timed out, evict stale exports
19:27:28:Lustre: lustre-OST0006: disconnecting 1 stale clients
19:27:28:Lustre: lustre-OST0006: Recovery over after 1:00, of 3 clients 2 recovered and 1 was evicted.
19:27:28:Lustre: lustre-OST0006: deleting orphan objects from 0x0:1078 to 0x0:1107

Maloo report: https://maloo.whamcloud.com/test_sets/b748b488-5ddf-11e3-aed2-52540035b04c

IP addresses:
superfat-intel-1vm1: 10.10.4.80
superfat-intel-1vm2: 10.10.4.81
superfat-intel-1vm3: 10.10.4.82
superfat-intel-1vm4: 10.10.4.83
superfat-intel-1vm5: 10.10.4.84
superfat-intel-1vm6: 10.10.4.85
superfat-intel-1vm7: 10.10.4.94
superfat-intel-1vm8: 10.10.4.95

From the above logs, we can see the following:
After OSS superfat-intel-1vm4 failed over to OSS superfat-intel-1vm8, and before the client OSC connections to lustre-OST0005 and lustre-OST0006 had been restored, client superfat-intel-1vm5 was powered off and back on. Once the client node came back up and tried to reconnect to lustre-OST0005, the OST kept denying the connection:

19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:13
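For reference, recovery progress on the failover OSS can be checked with lctl. This is a minimal sketch, not taken from this test run; the target name comes from the messages above, and exact parameter paths may vary between Lustre versions:

# On the OSS currently serving lustre-OST0005 (superfat-intel-1vm8 in this run):
lctl get_param obdfilter.lustre-OST0005.recovery_status
#   shows status (RECOVERING/COMPLETE), connected_clients, completed_clients, time_remaining
lctl get_param obdfilter.lustre-OST0005.recovery_time_soft obdfilter.lustre-OST0005.recovery_time_hard

Because the client node was rebooted, it reconnects with a new UUID and is treated as a new client; the OST returns -EBUSY (-16) to such connection attempts until all previously known clients have recovered or the recovery window expires, which matches the "Denying connection for new client" messages above.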


 Comments   
Comment by Jian Yu [ 06/Dec/13 ]

The same test passed on Lustre 2.4.1 RC2 (build #45):
https://maloo.whamcloud.com/test_sets/f0cc9f6c-194c-11e3-bb73-52540035b04c

After searching Maloo, I found that the same failure had occurred earlier on Lustre b2_4 build #40:
https://maloo.whamcloud.com/test_sets/fd792344-0f4b-11e3-af0c-52540035b04c

So, this does not look like a regression.

Comment by Jian Yu [ 21/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
Distro/Arch: RHEL6.4/x86_64
TEST_GROUP=failover
FSTYPE=zfs

The same failure occurred:
https://maloo.whamcloud.com/test_sets/ed36352a-6a41-11e3-8e21-52540035b04c

Comment by Jian Yu [ 04/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
TEST_GROUP=failover

The same failure occurred:
https://maloo.whamcloud.com/test_sets/e2af0f04-7505-11e3-95ae-52540035b04c

Comment by Jian Yu [ 07/Feb/14 ]

Another instance on the Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/e70577b8-8f32-11e3-b8e1-52540035b04c

Comment by Jian Yu [ 09/Mar/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1)
Distro/Arch: RHEL6.5/x86_64
FSTYPE=zfs
Test Group: failover

The same failure occurred:
https://maloo.whamcloud.com/test_sets/50529cf6-a657-11e3-a191-52540035b04c
