[LU-4355] recovery-double-scale test_pairwise_fail hung: operation ost_connect failed with -16 Created: 06/Dec/13 Updated: 14/Dec/21 Resolved: 14/Dec/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.4.2, Lustre 2.5.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jian Yu | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | yuc2 | ||
| Environment: |
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/ |
||
| Severity: | 3 |
| Rank (Obsolete): | 11929 |
| Description |
|
recovery-double-scale test_pairwise_fail hung as follows:

==== START === test 5: failover OST, then 2 clients ====
<~snip~>
Starting client: superfat-intel-1vm5: -o user_xattr,acl,flock superfat-intel-1vm7:superfat-intel-1vm3:/lustre /mnt/lustre
CMD: superfat-intel-1vm5 mkdir -p /mnt/lustre
CMD: superfat-intel-1vm5 mount -t lustre -o user_xattr,acl,flock superfat-intel-1vm7:superfat-intel-1vm3:/lustre /mnt/lustre
CMD: superfat-intel-1vm5 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"0x33f0404\" \" 0xffb7e3ff\" 32

Console log on client superfat-intel-1vm5 showed that:

19:25:31:Lustre: lustre-OST0004-osc-ffff8800370c7c00: Connection restored to lustre-OST0004 (at 10.10.4.95@tcp)
19:25:31:Lustre: Skipped 3 previous similar messages
19:25:31:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ...
19:25:31:Lustre: DEBUG MARKER: Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ...
19:25:31:Lustre: DEBUG MARKER: test -f /tmp/client-load.pid &&
19:25:31: { kill -s TERM $(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
19:25:31:
19:25:31:<ConMan> Console [superfat-intel-1vm5] disconnected from <superfat-intel-1:6004> at 12-04 19:24.
19:25:31:
19:25:31:<ConMan> Console [superfat-intel-1vm5] connected to <superfat-intel-1:6004> at 12-04 19:25.
19:25:31:
Press any key to continue.
<~snip~>
19:30:33:LNet: Added LNI 10.10.4.84@tcp [8/256/0/180]
19:30:33:LNet: Accept all, port 7988
19:30:33:LustreError: 152-6: Ignoring deprecated mount option 'acl'.
19:30:33:Lustre: Layout lock feature supported.
19:30:33:LustreError: 11-0: lustre-OST0006-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
19:30:33:LustreError: 11-0: lustre-OST0005-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
19:30:33:Lustre: Mounted lustre-client
19:30:33:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
19:30:33:LNet: 2176:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
19:30:33:LNet: 2177:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
19:30:33:LustreError: 11-0: lustre-OST0006-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
19:30:33:LustreError: 11-0: lustre-OST0005-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
Console log on OSS superfat-intel-1vm8 showed that:

19:25:47:Lustre: DEBUG MARKER: Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ...
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:15
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:25:47:Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 3 clients reconnect
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:05
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:25:47:Lustre: lustre-OST0005: recovery is timed out, evict stale exports
19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) lustre-OST0005: lvbo_init failed for resource 0x402:0x0: rc = -2
19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) Skipped 597 previous similar messages
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) lustre-OST0005: lvbo_init failed for resource 0x405:0x0: rc = -2
19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) Skipped 3452 previous similar messages
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:36
19:25:47:Lustre: Skipped 1044 previous similar messages
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:25:47:Lustre: Skipped 2 previous similar messages
19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:16
19:25:47:Lustre: Skipped 3 previous similar messages
19:27:28:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
19:27:28:Lustre: Skipped 3 previous similar messages
19:27:28:Lustre: lustre-OST0002: haven't heard from client f23f578c-ed7b-9fa4-df6c-fa1ce4040fbf (at 10.10.4.84@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007af59c00, cur 1386213947 expire 1386213917 last 1386213897
19:27:28:Lustre: lustre-OST0004: haven't heard from client f23f578c-ed7b-9fa4-df6c-fa1ce4040fbf (at 10.10.4.84@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007ef97400, cur 1386213947 expire 1386213917 last 1386213897
19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:13
19:27:28:Lustre: lustre-OST0006: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (1 recovered, 1 in progress, and 0 evicted) to recover in 0:11
19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:08
19:27:28:Lustre: Skipped 1 previous similar message
19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:03
19:27:28:Lustre: Skipped 1 previous similar message
19:27:28:Lustre: lustre-OST0006: recovery is timed out, evict stale exports
19:27:28:Lustre: lustre-OST0006: disconnecting 1 stale clients
19:27:28:Lustre: lustre-OST0006: Recovery over after 1:00, of 3 clients 2 recovered and 1 was evicted.
19:27:28:Lustre: lustre-OST0006: deleting orphan objects from 0x0:1078 to 0x0:1107

Maloo report: https://maloo.whamcloud.com/test_sets/b748b488-5ddf-11e3-aed2-52540035b04c

IP addresses:

From the above logs, we can see:

19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:13 |
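For reference, Lustre logs report connect failures as negative kernel errno values, so "ost_connect failed with -16" decodes to -EBUSY, which matches the OSS messages above ("refused reconnection, still busy with 1 active RPCs" and "Denying connection ... waiting for all 3 known clients ... to recover"). A minimal sketch of decoding such a return code (the variable names here are illustrative, not from Lustre):

```python
import errno
import os

# Lustre console logs print failures as negative kernel errno values.
# "operation ost_connect failed with -16" therefore means -EBUSY: the
# OST refuses new/reconnecting clients while recovery is in progress
# or an RPC is still active.
rc = -16
name = errno.errorcode[-rc]   # symbolic errno name for the positive value
message = os.strerror(-rc)    # human-readable description
print(f"ost_connect rc = {rc} -> -{name} ({message})")
```

On Linux this prints the -EBUSY ("Device or resource busy") interpretation, consistent with the recovery-window behavior shown in the OSS log.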
| Comments |
| Comment by Jian Yu [ 06/Dec/13 ] |
|
The same test passed on Lustre 2.4.1 RC2 (build #45). After searching on Maloo, I found that the same failure had occurred before on Lustre b2_4 build #40. So this does not look like a regression. |
| Comment by Jian Yu [ 21/Dec/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1) The same failure occurred: |
| Comment by Jian Yu [ 04/Jan/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/ The same failure occurred: |
| Comment by Jian Yu [ 07/Feb/14 ] |
|
More instances on the Lustre b2_5 branch: |
| Comment by Jian Yu [ 09/Mar/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1) The same failure occurred: |