
1.8.9 clients have endless bulk IO timeouts with 2.4.1 servers

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.1
    • None
    • RHEL6.4/distro OFED/1.8.9 clients/2.4.1 servers
    • 3
    • 11939

    Description

      We have 2 separate 1.8.9 clients with processes hung in the D state; the clients endlessly loop from FULL to DISCONN and then re-establish connectivity. One appears to be looping on a bulk IO write to a stale file handle (10.36.202.142@o2ib), and the other appears to be looping on a bulk IO read after a kiblnd failure (10.36.202.138@o2ib). The timeouts affect filesystem availability, but other activity proceeds in between these disconnections.

      Just yesterday we identified 2 bad IB cables with high symbol error rates in our fabric that have since been disconnected. They were likely the cause for at least one of the issues.

      server logs relevant to 10.36.202.138@o2ib issue:
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.863173] Lustre: atlas2-OST009b: Client 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.919690] LustreError: 11590:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff880761cd2800 x1453425689933409/t0(0) o3->3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386523292 ref 1 fl Interpret:/2/0 rc 0/0
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032388.030434] Lustre: atlas2-OST009b: Bulk IO read error with 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib), client will retry: rc -110

      client log on 10.36.202.138@o2ib:
      Dec 8 12:11:51 dtn04.ccs.ornl.gov kernel: LustreError: 24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689933409 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 19s ago has failed due to network error (567s prior to deadline).
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: atlas2-OST009b-osc-ffff880c392f9400: Connection to service atlas2-OST009b via nid 10.36.225.185@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: Skipped 7 previous similar messages
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: LustreError: 11-0: an error occurred while communicating with 10.36.225.185@o2ib. The ost_connect operation failed with -16

      lctl dk output:
      00000100:00080000:9:1386522391.428708:0:24622:0:(client.c:1392:ptlrpc_check_set()) resend bulk old x1453425689698153 new x1453425689814678
      00000100:02000400:4:1386522391.428708:0:24623:0:(import.c:1016:ptlrpc_connect_interpret()) Server atlas2-OST009b_UUID version (2.4.1.0) is much newer than client version (1.8.9)
      00000800:00000100:8:1386522411.140346:0:1644:0:(o2iblnd_cb.c:1813:kiblnd_close_conn_locked()) Closing conn to 10.36.225.185@o2ib: error 0(waiting)
      00000100:00020000:6:1386522411.140744:0:24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
      00000100:00000400:3:1386522411.151303:0:24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689814678 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 20s ago has failed due to network error (567s prior to deadline).

      server logs relevant to 10.36.202.142@o2ib issue:
      Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123761] LustreError: 113676:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880ff253d000 x1453430541439222/t0(0) o4->10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386513410 ref 1 fl Interpret:/2/0 rc 0/0
      Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123900] Lustre: atlas2-OST03e0: Bulk IO write error with 10c57078-ce10-72c6-b97d-e7c9a32a240c (at 10.36.202.142@o2ib), client will retry: rc -110

      client logs from 10.36.202.142@o2ib:
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection to service atlas2-OST03e0 via nid 10.36.226.48@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 39 previous similar messages
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection restored to service atlas2-OST03e0 using nid 10.36.226.48@o2ib.
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 41 previous similar messages

      The lctl dk output is less revealing for this case, but I will attach it. What flags are desirable for getting more relevant information?
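
      To widen the debug coverage before the next reproduction, something along these lines may help (a minimal sketch, assuming the 1.8.x lctl set_param interface; flag availability can differ between versions):

      # Enable net/RPC/ldlm tracing in addition to the default mask; these are the
      # flags most likely to show the bulk resend and reconnect path.
      lctl set_param debug="+net +rpctrace +dlmtrace"
      # Enlarge the debug buffer so the window of interest is not overwritten.
      lctl set_param debug_mb=256
      # Clear the old buffer, reproduce the timeout, then dump the kernel debug log.
      lctl clear
      # ... reproduce the bulk IO timeout ...
      lctl dk /tmp/lustre-debug-$(hostname).log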

      From the OS, if I try to kill the process or stat the inode, I see:
      00000080:00020000:3:1386519720.291103:0:16284:0:(file.c:3348:ll_inode_revalidate_fini()) failure -116 inode 144117425485470173

      [root@dtn-sch01 ~]# fuser -k -m /lustre/atlas2
      Cannot stat file /proc/975/fd/25: Stale file handle
      [root@dtn-sch01 ~]# ls -l /proc/975/fd/25
      l-wx------ 1 cfuson ccsstaff 64 Dec 7 22:46 /proc/975/fd/25 -> /lustre/atlas1/stf007/scratch/cfuson/TestDir/SubDir2/13G-3.tar
      [root@dtn-sch01 ~]#

      Attachments

        1. kiblnd_timeout_client_dtn04.log
          45 kB
        2. lst-wrapper.sh
          1 kB
        3. lustre_dtn04_0.log.gz
          1.46 MB
        4. lustre_dtn04_20131209_fail.log.gz
          544 kB
        5. lustre_dtn-sch01_0.log.gz
          2.99 MB
        6. lustre_logs_atlas-oss3b4_flush.log.gz
          0.3 kB
        7. stale_fh_loop_client_dtn-sch01.log
          25 kB
        8. stale_fh_loop_server_logs
          153 kB

        Issue Links

          Activity

            [LU-4359] 1.8.9 clients have endless bulk IO timeouts with 2.4.1 servers

            I'm afraid it might not be very helpful to try selftest if there is other activity on the client; it's better to have a "neat" run to diagnose the network issue.
            Niu, could it relate to LU-793? The patch is already in master but not in the 2.4 releases.

            liang Liang Zhen (Inactive) added a comment

            Thanks for the suggestion. We are open to trying lnet selftest, but while the system is in production, we would expect other client activity to the servers mentioned here to interfere with the lnet selftest and vice versa. Do you expect that we'd see an issue at a low concurrency setting such that we wouldn't be subject to contention from other clients? If so, then the results will be meaningful. If not, we will need to wait 2 weeks before we have an interval where we can perform isolated testing. Please advise.

            Was there heavy load on the servers when the bulk IO timeout was seen? If the problem always happens regardless of server load, I would think low-concurrency testing is meaningful.

            Another thought to address whether there is an issue with the network:
            verify the issue is still present with a selected client running the 1.8.9 client
            run an rsync job across a large directory so that objects from every OST are read/written
            upgrade the selected client to 2.4.1
            repeat the rsync test and look for any signs of network loss

            What to look for:
            When the issue with the 1.8.9 clients occurs, it is very obvious from the number of Lustre error messages in our server logs. I'd consider the lack of these messages to be a sign of a successful test.

            Yes, I think this approach can help us see whether it's a network problem or a Lustre version problem.

            niu Niu Yawei (Inactive) added a comment

            Hi Niu,

            Thanks for the suggestion. We are open to trying lnet selftest, but while the system is in production, we would expect other client activity to the servers mentioned here to interfere with the lnet selftest and vice versa. Do you expect that we'd see an issue at a low concurrency setting such that we wouldn't be subject to contention from other clients? If so, then the results will be meaningful. If not, we will need to wait 2 weeks before we have an interval where we can perform isolated testing. Please advise.

            Another thought to address whether there is an issue with the network:

            • verify the issue is still present with a selected client running the 1.8.9 client
            • run an rsync job across a large directory so that objects from every OST are read/written (see the sketch below)
            • upgrade the selected client to 2.4.1
            • repeat the rsync test and look for any signs of network loss

            What to look for:
            When the issue with the 1.8.9 clients occurs, it is very obvious from the number of Lustre error messages in our server logs. I'd consider the lack of these messages to be a sign of a successful test.

            Would this be useful or is it too high level?
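
            A rough sketch of such a run (hypothetical paths; assumes /lustre/atlas2 is the client mount point and the OSS syslogs are readable):

            # On the test client (first at 1.8.9, then after upgrading to 2.4.1): copy a
            # large directory tree so that objects on many OSTs are read and written.
            rsync -a --stats /lustre/atlas2/<large_source_dir>/ /lustre/atlas2/<scratch_dest_dir>/

            # On the OSS nodes: watch for the error signatures seen in this ticket.
            grep -E 'Bulk IO (read|write) error|refused reconnection|target_bulk_io' /var/log/messages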

            blakecaldwell Blake Caldwell added a comment

            Could you run lnet-selftest to check the network between the problematic OST and the 1.8 clients?

            I just posted instructions on how to use lnet-selftest here, and the wrapper script is attached:

            = Preparation =

            The LNET Selftest kernel module must be installed and loaded on all targets in the test before the application is started. Identify the set of all systems that will participate in a session and ensure that the kernel module has been loaded. To load the kernel module:

            modprobe lnet_selftest

            Dependencies are automatically resolved and loaded by modprobe. This will make sure all the necessary modules are loaded: libcfs, lnet, lnet_selftest and one of the klnds (kernel Lustre network drivers, e.g. ksocklnd, ko2iblnd, etc.).

            Identify a "console" node from which to conduct the tests. This is the single system from which all LNET selftest commands will be executed. The console node owns the LNET selftest session and there should be only one active session on the network at any given time (strictly speaking one can run several LNET selftest sessions in parallel across a network but this is generally discouraged unless the sessions are carefully isolated).

            It is strongly recommended that a survey and analysis of raw network performance between the target systems is carried out prior to running the LNET selftest benchmark. This will help to identify and measure any performance overhead introduced by LNET. The HPDD SE team has recently been evaluating Netperf for this purpose on TCP/IP-based networks with good results. Refer to the HPDD SE Netperf page for details on how to manage this exercise.
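
            For orientation, the wrapper drives an lst session roughly like the following (a sketch only; the NIDs are the example values from the header below and would be replaced with the o2ib NIDs of the client and OSS under test):

            # All lst commands must share a session ID.
            export LST_SESSION=$$
            lst new_session bulk_read_test

            # Group the participating NIDs: clients as the source, servers as the sink.
            lst add_group clients 10.73.2.21@tcp
            lst add_group servers 10.73.2.22@tcp

            # A 1M bulk read test from clients to servers at concurrency 32.
            lst add_batch bulk_rw
            lst add_test --batch bulk_rw --concurrency 32 --from clients --to servers brw read check=simple size=1M

            # Run the batch, sample the bandwidth counters, then tear everything down.
            lst run bulk_rw
            lst stat servers          # interrupt with Ctrl-C once enough samples are collected
            lst stop bulk_rw
            lst end_session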

            = Using the Wrapper Script =

            Use the LNET Selftest wrapper to execute the test cases referenced in this document. The header of the script has some variables that need to be set in accordance with the target environment. Without changes, the script is very unlikely to operate correctly, if at all. Here is a listing of the header:

            Single Client Throughput – LNET Selftest Read (2 Nodes, 1:1)
            Used to establish point-to-point unidirectional read performance between two nodes.
            Set the wrapper up as follows:

            #Output file
            ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)
             
            # Concurrency
            CN=32
            #Size
            SZ=1M
            # Length of time to run test (secs)
            TM=60
            # Which BRW test to run (read or write)
            BRW=read
            # Checksum calculation (simple or full)
            CKSUM=simple
            # The LST "from" list -- e.g. Lustre clients. Space separated list of NIDs.
            LFROM="10.73.2.21@tcp"
            # The LST "to" list -- e.g. Lustre servers. Space separated list of NIDs.
            LTO="10.73.2.22@tcp"
            

            CN: the concurrency setting simulates the number of threads performing communication. The LNET Selftest default is 1, which is not enough to properly exercise the connection. Set to at least 16, but experiment with higher values (32 or 64 being reasonable choices).
            SZ: the size setting determines the size of the IO transaction. For bandwidth (throughput) measurements, use 1M.
            TM: test time in seconds, i.e. how long to run the benchmark. Set to a reasonable number in order to ensure collection of sufficient data to extrapolate a meaningful average (at least 60 seconds).
            BRW: The Bulk Read/Write test to use. There are only two choices "read" or "write".
            CKSUM: The checksum checking method. Choose either "simple" or "full".
            LFROM: a space-separated list of NIDs that represent the "from" list (or source) in LNET Selftest. This is often a set of clients.
            LTO: a space-separated list of NIDs that represent the "to" list (or destination) in LNET Selftest. This is often a set of servers.

            Change the LFROM and LTO lists as required.

            Run the script several times, changing the concurrency setting at the start of every new run. Use the sequence 1, 2, 4, 8, 16, 32, 64, 128. Modify the output filename for each run so that it is clear what results have been captured into each file.
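
            If the header assignments are changed to honor the environment (e.g. CN=${CN:-32}), such a sweep could be scripted roughly as follows (a sketch; the attached lst-wrapper.sh may instead simply be edited by hand before each run):

            # Sweep the concurrency setting, writing one output file per run.
            for CN in 1 2 4 8 16 32 64 128; do
                export CN
                export ST="lst-output-cn${CN}-$(date +%Y-%m-%d-%H:%M:%S)"
                ./lst-wrapper.sh
            done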

            niu Niu Yawei (Inactive) added a comment

            The test cases in this LU are all with clients on the same NI as the servers – no routers.

            blakecaldwell Blake Caldwell added a comment

            Strangely this is only on our 1.8.9 systems. 2.4 clients can access these files without issue, and a test writing to files striped on each OST in the filesystem runs to completion. The same test on a 1.8.9 client will hang on OST0024 as noted above.

            Do you use different routers for the 1.8 and 2.4 clients?

            Liang, is there any simple way to check if the network between client and server is sane? Could you advise?

            niu Niu Yawei (Inactive) added a comment

            We tried many different 1.8.9/RHEL6 clients after the bad links were removed. They all fall into the same loop when writing to various OSTs. It does not always appear to be OST009b, but it happens frequently enough that the hangs are very common. When sweeping through all of the OSTs, another client hung on OST0024, for example.

            The example shown here was repeated after the cables were replaced. After the case had been opened, the system was rebooted and a particular user tried again to run the same rsync command that triggered the OST009b issues. The most recent logs from the Lustre server are from the recurrence of the issue.

            Strangely this is only on our 1.8.9 systems. 2.4 clients can access these files without issue, and a test writing to files striped on each OST in the filesystem runs to completion. The same test on a 1.8.9 client will hang on OST0024 as noted above.
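
            For reference, a per-OST write sweep like the one described above can be scripted roughly as follows (a sketch; the test directory is hypothetical, and the OST count could equally be taken from lctl dl on a server):

            # Write one small file per OST by pinning the stripe index with lfs setstripe.
            TESTDIR=/lustre/atlas2/<scratch_dir>/ost-sweep
            mkdir -p "$TESTDIR"
            NUM_OSTS=$(lfs osts /lustre/atlas2 | grep -c OST)
            for idx in $(seq 0 $((NUM_OSTS - 1))); do
                f="$TESTDIR/ost-$idx"
                lfs setstripe -c 1 -i "$idx" "$f"
                dd if=/dev/zero of="$f" bs=1M count=64 oflag=direct || echo "write to OST index $idx failed"
            done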

            blakecaldwell Blake Caldwell added a comment

            Just yesterday we identified 2 bad IB cables with high symbol error rates in our fabric that have since been disconnected. They were likely the cause for at least one of the issues.

            It looks like these two bad cables caused the following series of problems? And the problems were not resolved even after you replaced the cables?

            Dec 9 06:12:43 atlas-oss4h4 kernel: [510636.815385] LustreError: 61456:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880fe2a0e400 x1453430544373389/t0(0) o4->10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386588141 ref 1 fl Interpret:/2/0 rc 0/0
            

            Bulk reads always fail for some reason; I think that's why the client fell into the loop of bulk read timeout -> reconnect -> bulk read timeout -> reconnect... Could you check whether the connection between the client and OST009b is healthy?
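
            A few quick client-side checks that may help here (a sketch; the exact parameter names can differ between 1.8 and 2.x):

            # LNET-level ping of the OSS NID serving OST009b.
            lctl ping 10.36.225.185@o2ib

            # Import state history of the problematic OSC, if this proc file exists on 1.8.9.
            lctl get_param osc.*OST009b*.state

            # LNET message and drop counters.
            cat /proc/sys/lnet/stats

            # IB port error counters on the client HCA (from infiniband-diags), given the
            # symbol errors already seen on the fabric.
            perfquery -x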

            niu Niu Yawei (Inactive) added a comment

            Still servicing client BULK reads:
            Dec 13 15:44:50 atlas-oss3b4 kernel: [1476205.395677] LustreError: 11576:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 1 previous similar message
            Dec 13 15:44:50 atlas-oss3b4 kernel: [1476205.425212] Lustre: atlas2-OST009b: Bulk IO read error with 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib), client will retry: rc -110
            Dec 13 15:44:50 atlas-oss3b4 kernel: [1476205.466052] Lustre: Skipped 1 previous similar message
            Dec 13 15:49:51 atlas-oss3b4 kernel: [1476505.278112] LustreError: 25256:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff88070f632000
            Dec 13 15:54:32 atlas-oss3b4 kernel: [1476785.683451] Lustre: atlas2-OST009b: Client 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib) reconnecting
            Dec 13 15:54:32 atlas-oss3b4 kernel: [1476785.709484] Lustre: Skipped 9 previous similar messages
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.464059] LustreError: 25256:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff880717a48000
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.585964] Lustre: atlas2-OST009b: Client 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.626704] Lustre: Skipped 1 previous similar message
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.704087] LustreError: 11607:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff8807613a7c00 x1453958693735860/t0(0) o3->3064e3d3-408f-7c76-94af-ce52cde93078@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386968389 ref 1 fl Interpret:/2/0 rc 0/0
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.777371] LustreError: 11607:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 1 previous similar message
            Dec 13 15:54:52 atlas-oss3b4 kernel: [1476805.815244] Lustre: atlas2-OST009b: Bulk IO read error with 3064e3d3-408f-7c76-94af-ce52cde93078 (at 10.36.202.138@o2ib), client will retry: rc -110

            blakecaldwell Blake Caldwell added a comment

            Can we have the logs from atlas-oss3b4, where atlas2-OST009b is located? It is still busy servicing a client BULK read that fails, and it is rejecting the same client's reconnection.

            bobijam Zhenyu Xu added a comment

            Increased priority to Major due to customer request/comments:

            "There are files on the filesystem that are inaccessible and I would like to rule out lingering issues from LU-4309 as a cause. At this point, our first priority is to recover the files that seem to be causing the hang."

            jamesanunez James Nunez (Inactive) added a comment

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: blakecaldwell Blake Caldwell
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: