Lustre / LU-6748

Excessive client reconnects to OSS servers under heavy I/O workload


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      While testing the last pre-2.8 code, I noticed heavy client reconnects to the OSS servers. The messages on the client side were:

      Lustre: sultan-OST0008-osc-ffff8803ea302800: Connection to sultan-OST0008 (at 10.37.248.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: Skipped 55 previous similar messages
      Lustre: 5355:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 61 previous similar messages
      Lustre: 5350:0:(client.c:2009:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434742560/real 1434742560] req@ffff8803c23fb6c0 x1504421695570504/t0(0) o8->sultan-OST0023-osc-ffff8803ea302800@10.37.248.72@o2ib1:28/4 lens 520/544 e 0 to 1 dl 1434742568 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 5350:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection restored to sultan-OST0000 (at 10.37.248.69@o2ib1)
      Lustre: Skipped 27 previous similar messages
      Lustre: 5356:0:(client.c:2009:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434742782/real 1434742782] req@ffff8803c1b639c0 x1504421695572244/t0(0) o400->sultan-OST0034-osc-ffff8803ea302800@10.37.248.69@o2ib1:28/4 lens 224/224 e 0 to 1 dl 1434742789 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection to sultan-OST0000 (at 10.37.248.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: Skipped 41 previous similar messages
      Lustre: 5356:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 73 previous similar messages
      Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection restored to sultan-OST0000 (at 10.37.248.69@o2ib1)
      Lustre: Skipped 41 previous similar messages
      Lustre: sultan-OST0003-osc-ffff8803ea302800: Connection restored to sultan-OST0003 (at 10.37.248.72@o2ib1)
      Lustre: Skipped 27 previous similar messages
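The timed-out requests above can be summarized with a short script. The following is only a sketch: it pulls the send time, deadline (`dl`), opcode, and target out of each `ptlrpc_expire_one_request()` line so the timeout window is visible at a glance. The regular expression is fitted to the lines quoted in this report, and the opcode names (8 as OST_CONNECT, 400 as OBD_PING) are an assumption based on the Lustre wire protocol definitions, not something stated in the log itself. The helper name `parse_expired` is hypothetical.

```python
import re

# Fields of interest in a ptlrpc_expire_one_request() message:
# send time, opcode ("o8", "o400"), target import name, and deadline ("dl").
EXPIRE_RE = re.compile(
    r"\[sent (?P<sent>\d+)/real \d+\].*? o(?P<opc>\d+)->(?P<target>\S+?)@"
    r".*? dl (?P<deadline>\d+) "
)

# Assumption: opcode names as defined in the Lustre protocol headers.
OPCODES = {8: "OST_CONNECT", 400: "OBD_PING"}


def parse_expired(line):
    """Return a dict describing one expired request, or None if no match."""
    m = EXPIRE_RE.search(line)
    if m is None:
        return None
    sent = int(m["sent"])
    deadline = int(m["deadline"])
    opc = int(m["opc"])
    return {
        "target": m["target"],
        "opcode": opc,
        "opcode_name": OPCODES.get(opc, "unknown"),
        "sent": sent,
        "deadline": deadline,
        # Seconds the client allowed before declaring the request expired.
        "window_s": deadline - sent,
    }


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 parse_expired.py
    for raw in sys.stdin:
        info = parse_expired(raw)
        if info:
            print(info["target"], info["opcode_name"], info["window_s"])
```

If the opcode assumption holds, the two requests quoted above are a connect attempt and a ping, which would suggest that even small control RPCs were not being answered within their deadlines.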

      The messages seen on the OSS side were:

      [20639.820176] Lustre: sultan-OST0008: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
      [20639.829910] Lustre: Skipped 20 previous similar messages
      [20676.881745] Lustre: sultan-OST000c: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
      [20676.891462] Lustre: Skipped 29 previous similar messages
      [20868.910972] Lustre: sultan-OST0004: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
      [20868.920682] Lustre: Skipped 23 previous similar messages
      [20906.993360] Lustre: sultan-OST0000: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
      [20906.993364] Lustre: sultan-OST0004: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
      [20906.993368] Lustre: Skipped 17 previous similar messages
      [20907.018191] Lustre: Skipped 11 previous similar messages
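To get a rough sense of how widespread the reconnects are, the OSS messages above can be tallied per OST. This is a minimal sketch with a hypothetical helper, `tally_reconnects`, whose patterns are fitted to the lines quoted in this report. It also sums the "Skipped N previous similar messages" counts, on the assumption that each such line refers to suppressed reconnect messages; in a mixed log that assumption would not be safe.

```python
import re
from collections import Counter

# "Lustre: <ost>: Client <uuid> (at <nid>) reconnecting"
RECONNECT_RE = re.compile(
    r"Lustre: (?P<ost>\S+): Client (?P<uuid>\S+) \(at (?P<nid>[^)]+)\) reconnecting"
)
# Rate-limiter summary line emitted by the kernel log throttling.
SKIPPED_RE = re.compile(r"Lustre: Skipped (?P<n>\d+) previous similar messages")


def tally_reconnects(lines):
    """Count logged reconnects per OST and sum the suppressed-message counts."""
    per_ost = Counter()
    suppressed = 0
    for line in lines:
        m = RECONNECT_RE.search(line)
        if m:
            per_ost[m["ost"]] += 1
            continue
        m = SKIPPED_RE.search(line)
        if m:
            # Assumption: the skipped messages were also reconnect messages.
            suppressed += int(m["n"])
    return per_ost, suppressed


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 tally_reconnects.py
    per_ost, suppressed = tally_reconnects(sys.stdin)
    for ost, count in sorted(per_ost.items()):
        print(ost, count)
    print("suppressed:", suppressed)
```

Because most messages are rate-limited, the per-OST counts undercount badly; the suppressed total gives a better sense of volume but cannot be attributed to individual OSTs.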

      This occurred when I ran a file-per-process IOR job across 20 nodes with 32 threads per client.

            People

              jay Jinshan Xiong (Inactive)
              simmonsja James A Simmons
              Votes: 0
              Watchers: 3
