[LU-4359] 1.8.9 clients have endless bulk IO timeouts with 2.4.1 servers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.4.1
    • Environment: RHEL6.4 / distro OFED / 1.8.9 clients / 2.4.1 servers

    Description

      We have 2 separate 1.8.9 clients with processes hung in the D state; the clients endlessly cycle from FULL to DISCONN and then re-establish connectivity. One appears to be looping on a bulk IO write to a stale file handle (10.36.202.142@o2ib), and the other appears to be looping on a bulk IO read after a kiblnd failure (10.36.202.138@o2ib). The timeouts affect filesystem availability, but other activity proceeds in between these disconnections.

      Just yesterday we identified 2 bad IB cables with high symbol error rates in our fabric that have since been disconnected. They were likely the cause for at least one of the issues.

      server logs relevant to 10.36.202.138@o2ib issue:
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.863173] Lustre: atlas2-OST009b: Client 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.919690] LustreError: 11590:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff880761cd2800 x1453425689933409/t0(0) o3->3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386523292 ref 1 fl Interpret:/2/0 rc 0/0
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032388.030434] Lustre: atlas2-OST009b: Bulk IO read error with 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib), client will retry: rc -110

      client log on 10.36.202.138@o2ib:
      Dec 8 12:11:51 dtn04.ccs.ornl.gov kernel: LustreError: 24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689933409 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 19s ago has failed due to network error (567s prior to deadline).
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: atlas2-OST009b-osc-ffff880c392f9400: Connection to service atlas2-OST009b via nid 10.36.225.185@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: Skipped 7 previous similar messages
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: LustreError: 11-0: an error occurred while communicating with 10.36.225.185@o2ib. The ost_connect operation failed with -16

      lctl dk output:
      00000100:00080000:9:1386522391.428708:0:24622:0:(client.c:1392:ptlrpc_check_set()) resend bulk old x1453425689698153 new x1453425689814678
      00000100:02000400:4:1386522391.428708:0:24623:0:(import.c:1016:ptlrpc_connect_interpret()) Server atlas2-OST009b_UUID version (2.4.1.0) is much newer than client version (1.8.9)
      00000800:00000100:8:1386522411.140346:0:1644:0:(o2iblnd_cb.c:1813:kiblnd_close_conn_locked()) Closing conn to 10.36.225.185@o2ib: error 0(waiting)
      00000100:00020000:6:1386522411.140744:0:24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
      00000100:00000400:3:1386522411.151303:0:24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689814678 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 20s ago has failed due to network error (567s prior to deadline).

      server logs relevant to 10.36.202.142@o2ib issue:
      Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123761] LustreError: 113676:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880ff253d000 x1453430541439222/t0(0) o4->10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386513410 ref 1 fl Interpret:/2/0 rc 0/0
      Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123900] Lustre: atlas2-OST03e0: Bulk IO write error with 10c57078-ce10-72c6-b97d-e7c9a32a240c (at 10.36.202.142@o2ib), client will retry: rc -110

      client logs from 10.36.202.142@o2ib:
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection to service atlas2-OST03e0 via nid 10.36.226.48@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 39 previous similar messages
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection restored to service atlas2-OST03e0 using nid 10.36.226.48@o2ib.
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 41 previous similar messages

      The lctl dk output is less revealing for this case, but I will attach it. What flags are desirable for getting more relevant information?

      From the OS, if I try to kill the process or stat the inode, I see:
      00000080:00020000:3:1386519720.291103:0:16284:0:(file.c:3348:ll_inode_revalidate_fini()) failure -116 inode 144117425485470173

      [root@dtn-sch01 ~]# fuser -k -m /lustre/atlas2
      Cannot stat file /proc/975/fd/25: Stale file handle
      [root@dtn-sch01 ~]# ls -l /proc/975/fd/25
      l-wx------ 1 cfuson ccsstaff 64 Dec 7 22:46 /proc/975/fd/25 -> /lustre/atlas1/stf007/scratch/cfuson/TestDir/SubDir2/13G-3.tar
      [root@dtn-sch01 ~]#

      Attachments

        1. kiblnd_timeout_client_dtn04.log
          45 kB
        2. lst-wrapper.sh
          1 kB
        3. lustre_dtn04_0.log.gz
          1.46 MB
        4. lustre_dtn04_20131209_fail.log.gz
          544 kB
        5. lustre_dtn-sch01_0.log.gz
          2.99 MB
        6. lustre_logs_atlas-oss3b4_flush.log.gz
          0.3 kB
        7. stale_fh_loop_client_dtn-sch01.log
          25 kB
        8. stale_fh_loop_server_logs
          153 kB


          Activity

            [LU-4359] 1.8.9 clients have endless bulk IO timeouts with 2.4.1 servers

            niu Niu Yawei (Inactive) added a comment -

            Could you run LNET selftest to check the network between the problematic OST and the 1.8 clients?

            I have posted instructions for how to use LNET selftest below, and the wrapper script is attached:

            = Preparation =

            The LNET Selftest kernel module must be installed and loaded on all targets in the test before the application is started. Identify the set of all systems that will participate in a session and ensure that the kernel module has been loaded. To load the kernel module:

            modprobe lnet_selftest

            Dependencies are automatically resolved and loaded by modprobe. This ensures that all the necessary modules are loaded: libcfs, lnet, lnet_selftest, and any one of the klnds (kernel Lustre network drivers, e.g. ksocklnd, ko2iblnd, etc.).
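            A quick way to confirm the module is actually present on a node after the modprobe is to check lsmod (a minimal sketch; the pdsh host list is a placeholder for your own node names):

            modprobe lnet_selftest
            lsmod | grep lnet_selftest
            # or, across many nodes at once if pdsh is available:
            pdsh -w 'oss[1-4],dtn[1-4]' 'modprobe lnet_selftest && lsmod | grep -c lnet_selftest'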

            Identify a "console" node from which to conduct the tests. This is the single system from which all LNET selftest commands will be executed. The console node owns the LNET selftest session, and there should be only one active session on the network at any given time. (Strictly speaking, several LNET selftest sessions can run in parallel across a network, but this is generally discouraged unless the sessions are carefully isolated.)

            It is strongly recommended that a survey and analysis of raw network performance between the target systems is carried out prior to running the LNET selftest benchmark. This will help to identify and measure any performance overhead introduced by LNET. The HPDD SE team has recently been evaluating Netperf for this purpose on TCP/IP-based networks with good results. Refer to the HPDD SE Netperf page for details on how to manage this exercise.
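            For example, a minimal Netperf run between two nodes might look like the following (a sketch only; the hostname is a placeholder, and on an IB fabric an RDMA-level tool such as ib_write_bw from the perftest package may be more appropriate than TCP-level Netperf):

            # on the receiving node:
            netserver
            # on the sending node, stream TCP for 60 seconds:
            netperf -H <receiver-hostname> -t TCP_STREAM -l 60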

            = Using the Wrapper Script =

            Use the LNET Selftest wrapper to execute the test cases referenced in this document. The header of the script has some variables that need to be set in accordance with the target environment. Without changes, the script is very unlikely to operate correctly, if at all. Here is a listing of the header:

            Single Client Throughput – LNET Selftest Read (2 Nodes, 1:1)
            Used to establish point-to-point unidirectional read performance between two nodes.
            Set the wrapper up as follows:

            #Output file
            ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)
             
            # Concurrency
            CN=32
            #Size
            SZ=1M
            # Length of time to run test (secs)
            TM=60
            # Which BRW test to run (read or write)
            BRW=read
            # Checksum calculation (simple or full)
            CKSUM=simple
            # The LST "from" list -- e.g. Lustre clients. Space separated list of NIDs.
            LFROM="10.73.2.21@tcp"
            # The LST "to" list -- e.g. Lustre servers. Space separated list of NIDs.
            LTO="10.73.2.22@tcp"
            

            CN: the concurrency setting simulates the number of threads performing communication. The LNET Selftest default is 1, which is not enough to properly exercise the connection. Set to at least 16, but experiment with higher values (32 or 64 being reasonable choices).
            SZ: the size setting determines the size of the IO transaction. For bandwidth (throughput) measurements, use 1M.
            TM: test time in seconds: how long to run the benchmark. Set to a reasonable number in order to ensure collection of sufficient data to extrapolate a meaningful average (at least 60 seconds).
            BRW: The Bulk Read/Write test to use. There are only two choices "read" or "write".
            CKSUM: The checksum checking method. Choose either "simple" or "full".
            LFROM: a space-separated list of NIDs that represent the "from" list (or source) in LNET Selftest. This is often a set of clients.
            LTO: a space-separated list of NIDs that represent the "to" list (or destination) in LNET Selftest. This is often a set of servers.

            Change the LFROM and LTO lists as required.
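            For reference, a wrapper of this kind typically boils down to an lst command sequence along the following lines (a sketch only; the attached lst-wrapper.sh is authoritative, and the group names and NIDs are just the example values from the header above):

            export LST_SESSION=$$
            lst new_session read_test
            lst add_group lfrom 10.73.2.21@tcp      # the "from" list (clients)
            lst add_group lto 10.73.2.22@tcp        # the "to" list (servers)
            lst add_batch bulk_read
            lst add_test --batch bulk_read --concurrency 32 --from lfrom --to lto brw read check=simple size=1M
            lst run bulk_read
            sleep 60        # let the batch run; watch "lst stat lfrom lto" from another shell
            lst stop bulk_read
            lst end_session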

            Run the script several times, changing the concurrency setting at the start of every new run. Use the sequence 1, 2, 4, 8, 16, 32, 64, 128. Modify the output filename for each run so that it is clear which results were captured in each file.
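            One way to automate that sweep, assuming the wrapper picks up CN and ST from the environment rather than only from its edited header (check the attached lst-wrapper.sh first; this invocation is hypothetical):

            for CN in 1 2 4 8 16 32 64 128; do
                export CN
                export ST=lst-output-cn${CN}-$(date +%Y-%m-%d-%H:%M:%S)
                sh ./lst-wrapper.sh
            done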


            blakecaldwell Blake Caldwell added a comment -

            The test cases in this LU are all with clients on the same NI as the servers; no routers.


            niu Niu Yawei (Inactive) added a comment -

            "Strangely this is only on our 1.8.9 systems. 2.4 clients can access these files without issue and a test writing to files striped on each OST in the filesystem runs to completion. The same test on a 1.8.9 client will hang on OST0024 as noted above."

            Do you use different routers for the 1.8 and 2.4 clients?

            Liang, is there any simple way to check whether the network between client and server is sane? Could you advise?


            blakecaldwell Blake Caldwell added a comment -

            We tried many different 1.8.9/RHEL6 clients after the bad links were removed. They all fall into the same loop when writing to various OSTs. It does not always appear to be OST009b, but it happens frequently enough that the hangs are very common. When sweeping through all of the OSTs, another client hung on OST0024, for example.

            The example shown here was repeated after the cables were replaced. After the case had been opened, the system was rebooted and a particular user tried again to run the same rsync command that triggered the OST009b issues. The most recent logs from the Lustre server are from the recurrence of the issue.

            Strangely this is only on our 1.8.9 systems. 2.4 clients can access these files without issue and a test writing to files striped on each OST in the filesystem runs to completion. The same test on a 1.8.9 client will hang on OST0024 as noted above.


            niu Niu Yawei (Inactive) added a comment -

            "Just yesterday we identified 2 bad IB cables with high symbol error rates in our fabric that have since been disconnected. They were likely the cause for at least one of the issues."

            So it looks like these two bad cables caused the following series of problems? And the problems were not resolved even after you replaced the cables?

            Dec 9 06:12:43 atlas-oss4h4 kernel: [510636.815385] LustreError: 61456:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880fe2a0e400 x1453430544373389/t0(0) o4->10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386588141 ref 1 fl Interpret:/2/0 rc 0/0

            The bulk reads always fail for some reason; I think that is why the client fell into the loop of bulk read timeout -> reconnect -> bulk read timeout -> reconnect ... Could you check whether the connection between the client and OST009b is healthy?
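            One quick check is lctl ping against the NIDs involved, which exercises the LNET/o2iblnd path without involving the OST itself (a suggestion, using the NIDs from the logs above):

            # from the affected client (dtn04), ping the OSS NID serving OST009b:
            lctl ping 10.36.225.185@o2ib
            # and the reverse direction, from the OSS, against the client NID:
            lctl ping 10.36.202.138@o2ib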


            blakecaldwell Blake Caldwell added a comment -

            Still servicing client BULK reads:
            Dec 13 15:44:50 atlas-oss3b4 kernel: [1476205.395677] LustreError: 11576:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 1 previous similar message
            Dec 13 15:44:50 atlas-oss3b4 kernel: [1476205.425212] Lustre: atlas2-OST009b: Bulk IO read error with 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib), client will retry: rc -110
            Dec 13 15:44:50 atlas-oss3b4 kernel: [1476205.466052] Lustre: Skipped 1 previous similar message
            Dec 13 15:49:51 atlas-oss3b4 kernel: [1476505.278112] LustreError: 25256:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff88070f632000
            Dec 13 15:54:32 atlas-oss3b4 kernel: [1476785.683451] Lustre: atlas2-OST009b: Client 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib) reconnecting
            Dec 13 15:54:32 atlas-oss3b4 kernel: [1476785.709484] Lustre: Skipped 9 previous similar messages
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.464059] LustreError: 25256:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff880717a48000
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.585964] Lustre: atlas2-OST009b: Client 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.626704] Lustre: Skipped 1 previous similar message
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.704087] LustreError: 11607:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff8807613a7c00 x1453958693735860/t0(0) o3->3064e3d3-408f-7c76-94af-ce52cde93078@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386968389 ref 1 fl Interpret:/2/0 rc 0/0
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.777371] LustreError: 11607:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 1 previous similar message
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.815244] Lustre: atlas2-OST009b: Bulk IO read error with 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib), client will retry: rc -110

            bobijam Zhenyu Xu added a comment -

            Can we have logs from atlas-oss3b4, where atlas2-OST009b is located? It is still busy servicing a client bulk read that keeps failing, and it is rejecting that same client's reconnection.


            jamesanunez James Nunez (Inactive) added a comment -

            Increased priority to Major due to customer request/comments:

            "There are files on the filesystem that are inaccessible and I would like to rule out lingering issues from LU-4309 as a cause. At this point, our first priority is to recover the files that seem to be causing the hang."


            blakecaldwell Blake Caldwell added a comment -

            This seems very repeatable. If I try reading the file that the user process (rsync) is stuck on, I see the same errors. It appears to be on OSS 10.36.225.185@o2ib, specifically OST009b (obdidx 155). I found this from the client dmesg. I'm attaching a debug log from command 3) below, which hung, captured with the flags:
            ioctl neterror warning dlmtrace error emerg ha rpctrace config console
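            (For reference, a debug capture with those flags is typically taken along these lines; this is a sketch, the exact invocation may have differed, and the output path is a placeholder:)

            lctl set_param debug="ioctl neterror warning dlmtrace error emerg ha rpctrace config console"
            lctl clear      # empty the debug buffer
            # ... reproduce the hang (the "file" command in 3) below) ...
            lctl dk > /tmp/lustre_dtn04_debug.log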

            [root@dtn04 eq]# dmesg|tail
            LustreError: 2398:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880838928000
            Lustre: 2405:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453958660133786 sent from atlas2-OST009b-osc-ffff880c37da6800 to NID 10.36.225.185@o2ib 20s ago has failed due to network error (567s prior to deadline).
            req@ffff880821b5a000 x1453958660133786/t0 o3->atlas2-OST009b_UUID@10.36.225.185@o2ib:6/4 lens 448/592 e 0 to 1 dl 1386624465 ref 2 fl Rpc:/2/0 rc 0/0
            Lustre: 2405:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
            Lustre: atlas2-OST009b-osc-ffff880c37da6800: Connection to service atlas2-OST009b via nid 10.36.225.185@o2ib was lost; in progress operations using this service will wait for recovery to complete.
            Lustre: Skipped 7 previous similar messages
            LustreError: 11-0: an error occurred while communicating with 10.36.225.185@o2ib. The ost_connect operation failed with -16
            LustreError: Skipped 1 previous similar message

            1)
            [root@dtn04 ~]# lfs getstripe /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor
            /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor
            lmm_stripe_count: 4
            lmm_stripe_size: 1048576
            lmm_stripe_offset: 155
            obdidx objid objid group
            155 16095 0x3edf 0
            156 16094 0x3ede 0
            157 16165 0x3f25 0
            158 16102 0x3ee6 0

            [root@dtn04 ~]# stat /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor
            File: `/lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor'
            Size: 20807164 Blocks: 40640 IO Block: 2097152 regular file
            Device: 2586730ah/629568266d Inode: 144117763932180829 Links: 1
            Access: (0600/-rw-------) Uid: ( 9348/ dermaid) Gid: (18645/ dermaid)
            Access: 2013-12-09 13:12:46.000000000 -0500
            Modify: 2013-11-30 13:45:37.000000000 -0500
            Change: 2013-11-30 13:45:37.000000000 -0500
            [root@dtn04 ~]# ls -l /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor
            -rw------- 1 dermaid dermaid 20807164 Nov 30 13:45 /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor
            [root@dtn04 ~]# file /lustre/atlas2/chm045/scratch/dermaid/eq370_50pct/eq/cer1_cer2_bena_chol_4x.eq370.0001.coor
            ^C^C

            2)
            Now with a different file that works (no objects on obdindex 155):
            [root@dtn04 eq]# stat cer1_cer2_bena_chol_4x.eq370.0000.coor
            File: `cer1_cer2_bena_chol_4x.eq370.0000.coor'
            Size: 20807164 Blocks: 40640 IO Block: 2097152 regular file
            Device: 2586730ah/629568266d Inode: 144117777421110925 Links: 1
            Access: (0640/-rw-r-----) Uid: ( 9348/ dermaid) Gid: (18645/ dermaid)
            Access: 2013-12-09 15:49:01.000000000 -0500
            Modify: 2013-11-28 14:15:42.000000000 -0500
            Change: 2013-11-28 15:26:26.000000000 -0500
            [root@dtn04 eq]# file cer1_cer2_bena_chol_4x.eq370.0000.coor
            cer1_cer2_bena_chol_4x.eq370.0000.coor: data

            3)
            Now with another file that fails (object on obdindex 155)
            cer1_cer2_bena_chol_4x.eq370.0001.xst
            lmm_stripe_count: 4
            lmm_stripe_size: 1048576
            lmm_stripe_offset: 154
            obdidx objid objid group
            154 16163 0x3f23 0
            155 16094 0x3ede 0
            156 16093 0x3edd 0
            157 16164 0x3f24 0

            [root@dtn04 eq]# stat cer1_cer2_bena_chol_4x.eq370.0001.xst
            File: `cer1_cer2_bena_chol_4x.eq370.0001.xst'
            Size: 15610 Blocks: 32 IO Block: 2097152 regular file
            Device: 2586730ah/629568266d Inode: 144117763932180577 Links: 1
            Access: (0600/-rw-------) Uid: ( 9348/ dermaid) Gid: (18645/ dermaid)
            Access: 2013-12-09 15:52:45.000000000 -0500
            Modify: 2013-11-30 13:45:37.000000000 -0500
            Change: 2013-11-30 13:45:37.000000000 -0500
            [root@dtn04 eq]# file cer1_cer2_bena_chol_4x.eq370.0001.xst


            blakecaldwell Blake Caldwell added a comment -

            We found messages, referring to LU-617, that occur during recovery when one of the misbehaving clients reconnects. Is there any relation here? What would cause the smaller transaction to be written to disk, or why is the client trying to replace a larger transaction?

            (mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 47302528652, new: 47302528651 replay: 0. see LU-617.

            Context below; note that a "Trying to overwrite bigger transno" message does not appear for every reconnection.
            Dec 9 06:09:57 atlas-oss3b4 kernel: [1096933.917564] Lustre: atlas2-OST009b: Client 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib) reconnecting
            Dec 9 06:09:57 atlas-oss3b4 kernel: [1096933.942120] Lustre: Skipped 9 previous similar messages
            Dec 9 06:10:17 atlas-oss3b4 kernel: [1096953.629963] LustreError: 25257:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff88072b512000
            Dec 9 06:10:38 atlas-oss4g2 kernel: [1096977.515250] LustreError: 137-5: atlas2-OST007d_UUID: not available for connect from 10.36.202.142@o2ib (no target)
            Dec 9 06:10:38 atlas-oss4g4 kernel: [1096978.820263] LustreError: 137-5: atlas2-OST007f_UUID: not available for connect from 10.36.202.142@o2ib (no target)
            Dec 9 06:12:35 atlas-mds3 kernel: [1096356.518178] LustreError: 32369:0:(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 47302528652, new: 47302528651 replay: 0. see LU-617.
            Dec 9 06:12:42 atlas-oss4h4 kernel: [510636.716395] LustreError: 16128:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff880fa8d40000
            Dec 9 06:12:42 atlas-oss4h4 kernel: [510636.745560] LustreError: 16127:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff880fc2d3a000
            Dec 9 06:12:43 atlas-oss4h4 kernel: [510636.784482] LustreError: 16128:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff880fa8d40000
            Dec 9 06:12:43 atlas-oss4h4 kernel: [510636.815353] LustreError: 16127:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff880fc2d3a000
            Dec 9 06:12:43 atlas-oss4h4 kernel: [510636.815385] LustreError: 61456:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880fe2a0e400 x1453430544373389/t0(0) o4->10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386588141 ref 1 fl Interpret:/2/0 rc 0/0

            pjones Peter Jones added a comment -

            Niu

            Could you please comment on this one?

            Thanks

            Peter


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: blakecaldwell Blake Caldwell
              Votes: 0
              Watchers: 9
