Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4355

recovery-double-scale test_pairwise_fail hung: operation ost_connect failed with -16

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.1, Lustre 2.4.2, Lustre 2.5.1

    • Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/
      Distro/Arch: RHEL6.4/x86_64
      TEST_GROUP=failover
    • 3
    • 11929

    Description

      recovery-double-scale test_pairwise_fail hung as follows:

      ==== START === test 5: failover OST, then 2 clients ====
      <~snip~>
      Starting client: superfat-intel-1vm5: -o user_xattr,acl,flock superfat-intel-1vm7:superfat-intel-1vm3:/lustre /mnt/lustre
      CMD: superfat-intel-1vm5 mkdir -p /mnt/lustre
      CMD: superfat-intel-1vm5 mount -t lustre -o user_xattr,acl,flock superfat-intel-1vm7:superfat-intel-1vm3:/lustre /mnt/lustre
      CMD: superfat-intel-1vm5 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"0x33f0404\" \" 0xffb7e3ff\" 32 
      

      Console log on client superfat-intel-1vm5 showed that:

      19:25:31:Lustre: lustre-OST0004-osc-ffff8800370c7c00: Connection restored to lustre-OST0004 (at 10.10.4.95@tcp)
      19:25:31:Lustre: Skipped 3 previous similar messages
      19:25:31:Lustre: DEBUG MARKER: /usr/sbin/lctl mark                             Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ... 
      19:25:31:Lustre: DEBUG MARKER: Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ...
      19:25:31:Lustre: DEBUG MARKER: test -f /tmp/client-load.pid &&
      19:25:31:        { kill -s TERM $(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
      19:25:31:
      19:25:31:<ConMan> Console [superfat-intel-1vm5] disconnected from <superfat-intel-1:6004> at 12-04 19:24.
      19:25:31:
      19:25:31:<ConMan> Console [superfat-intel-1vm5] connected to <superfat-intel-1:6004> at 12-04 19:25.
      19:25:31:
      Press any key to continue.
      
      <~snip~>
      
      19:30:33:LNet: Added LNI 10.10.4.84@tcp [8/256/0/180]
      19:30:33:LNet: Accept all, port 7988
      19:30:33:LustreError: 152-6: Ignoring deprecated mount option 'acl'.
      19:30:33:Lustre: Layout lock feature supported.
      19:30:33:LustreError: 11-0: lustre-OST0006-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
      19:30:33:LustreError: 11-0: lustre-OST0005-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
      19:30:33:Lustre: Mounted lustre-client
      19:30:33:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      19:30:33:LNet: 2176:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
      19:30:33:LNet: 2177:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
      19:30:33:LustreError: 11-0: lustre-OST0006-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
      19:30:33:LustreError: 11-0: lustre-OST0005-osc-ffff880037b0a400: Communicating with 10.10.4.95@tcp, operation ost_connect failed with -16.
      

      Console log on OSS superfat-intel-1vm8 showed that:

      19:25:47:Lustre: DEBUG MARKER: Failing type2=clients item2=superfat-intel-1vm5,superfat-intel-1vm6 ...
      19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:15
      19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
      19:25:47:Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 3 clients reconnect
      19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:05
      19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
      19:25:47:Lustre: lustre-OST0005: recovery is timed out, evict stale exports
      19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) lustre-OST0005: lvbo_init failed for resource 0x402:0x0: rc = -2
      19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) Skipped 597 previous similar messages
      19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
      19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) lustre-OST0005: lvbo_init failed for resource 0x405:0x0: rc = -2
      19:25:47:LustreError: 3880:0:(ldlm_resource.c:1165:ldlm_resource_get()) Skipped 3452 previous similar messages
      19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:36
      19:25:47:Lustre: Skipped 1044 previous similar messages
      19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
      19:25:47:Lustre: Skipped 2 previous similar messages
      19:25:47:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) reconnecting, waiting for 3 clients in recovery for 0:16
      19:25:47:Lustre: Skipped 3 previous similar messages
      19:27:28:Lustre: lustre-OST0005: Client 44f4f2de-f2b9-e381-1912-31a482522e3d (at 10.10.4.85@tcp) refused reconnection, still busy with 1 active RPCs
      19:27:28:Lustre: Skipped 3 previous similar messages
      19:27:28:Lustre: lustre-OST0002: haven't heard from client f23f578c-ed7b-9fa4-df6c-fa1ce4040fbf (at 10.10.4.84@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007af59c00, cur 1386213947 expire 1386213917 last 1386213897
      19:27:28:Lustre: lustre-OST0004: haven't heard from client f23f578c-ed7b-9fa4-df6c-fa1ce4040fbf (at 10.10.4.84@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007ef97400, cur 1386213947 expire 1386213917 last 1386213897
      19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:13
      19:27:28:Lustre: lustre-OST0006: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (1 recovered, 1 in progress, and 0 evicted) to recover in 0:11
      19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:08
      19:27:28:Lustre: Skipped 1 previous similar message
      19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:03
      19:27:28:Lustre: Skipped 1 previous similar message
      19:27:28:Lustre: lustre-OST0006: recovery is timed out, evict stale exports
      19:27:28:Lustre: lustre-OST0006: disconnecting 1 stale clients
      19:27:28:Lustre: lustre-OST0006: Recovery over after 1:00, of 3 clients 2 recovered and 1 was evicted.
      19:27:28:Lustre: lustre-OST0006: deleting orphan objects from 0x0:1078 to 0x0:1107
      

      Maloo report: https://maloo.whamcloud.com/test_sets/b748b488-5ddf-11e3-aed2-52540035b04c

      IP addresses:
      superfat-intel-1vm1: 10.10.4.80
      superfat-intel-1vm2: 10.10.4.81
      superfat-intel-1vm3: 10.10.4.82
      superfat-intel-1vm4: 10.10.4.83
      superfat-intel-1vm5: 10.10.4.84
      superfat-intel-1vm6: 10.10.4.85
      superfat-intel-1vm7: 10.10.4.94
      superfat-intel-1vm8: 10.10.4.95

      From the above logs, we can see:
      After oss superfat-intel-1vm4 failed over to oss superfat-intel-1vm8, before client osc connections were restored to lustre-OST0005 and lustre-OST0006, client superfat-intel-1vm5 was powered off and on. After the client node was up and tried to connect to lustre-OST0005, the ost kept denying the connection:

      19:27:28:Lustre: lustre-OST0005: Denying connection for new client c95a9301-62de-7453-fcdc-472b6eddb32c (at 10.10.4.84@tcp), waiting for all 3 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 0:13
      

      Attachments

        Activity

          People

            wc-triage WC Triage
            yujian Jian Yu
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: