
req wrong generation leading to I/O errors on 2.12 clients

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version: Lustre 2.12.0
    • Environment: Clients: 2.12.0; Servers (Oak): 2.10.5

    Description

      Since we upgraded our clients to 2.12.0, our users have been reporting more I/O errors on Oak (2.10 servers). These seem to be related to the following LustreError messages:

      Example:

      Feb 18 19:25:59 sh-106-64.int kernel: LustreError: 397481:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c67bf646000 x1624748797937520/t0(0) o101->oak-OST005f-osc-ffff8c809690880
      

      NAMD job failing with I/O error:

      Info: Working in the current directory /oak/....../aqueous2opls/run2
      ...
      FATAL ERROR: Error on write to binary file step6.6_equilibration.restart.vel: Input/output error
      

      Timestamp of NAMD file: Feb 18 19:25 cluster6.out

      The Lustre client shows many of these error messages, on different OSTs. These are all the Oak-related log entries from a client (sh-106-64) that has generated I/O errors:

      Feb 06 11:23:22 sh-106-64.int kernel: Lustre: Mounted oak-client
      Feb 07 12:49:28 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1549572562/real 1549572562]  req@ffff8c6591f3e300 x16247483-
      Feb 10 08:43:24 sh-106-64.int kernel: LustreError: 1287:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c66aaa35400 x1624748322417600/t0(0) o101->oak-OST006d-osc-ffff8c8096908800@
      Feb 13 10:55:06 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550084100/real 1550084100]  req@ffff8c7204ebe900 x16247484-
      Feb 15 07:46:03 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550245557/real 1550245557]  req@ffff8c7fc423a400 x16247484-
      Feb 15 09:17:12 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550251026/real 1550251026]  req@ffff8c7fc423b600 x16247484-
      Feb 15 09:21:23 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550251277/real 1550251277]  req@ffff8c7fc4238f00 x16247484-
      Feb 15 10:09:28 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550254162/real 1550254162]  req@ffff8c7fc423b000 x16247484-
      Feb 15 10:22:01 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550254915/real 1550254915]  req@ffff8c7fc423bc00 x16247484-
      Feb 15 13:28:05 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550266079/real 1550266079]  req@ffff8c7fc4238c00 x16247484-
      Feb 15 13:31:26 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550266280/real 1550266280]  req@ffff8c7fc423ce00 x16247484-
      Feb 15 13:38:07 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550266681/real 1550266681]  req@ffff8c7fc423e300 x16247484-
      Feb 15 13:44:24 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550267058/real 1550267058]  req@ffff8c7fc423ad00 x16247484-
      Feb 16 18:12:47 sh-106-64.int kernel: LustreError: 11-0: oak-OST0071-osc-ffff8c8096908800: operation ost_connect to node 10.0.2.109@o2ib5 failed: rc = -19
      Feb 17 00:40:54 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550392848/real 1550392848]  req@ffff8c8020528300 x16247484-
      Feb 18 19:25:59 sh-106-64.int kernel: LustreError: 397481:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c67bf646000 x1624748797937520/t0(0) o101->oak-OST005f-osc-ffff8c809690880
      Feb 19 10:55:50 sh-106-64.int kernel: LustreError: 39073:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c77a92f2400 x1624748825028640/t0(0) o101->oak-OST004f-osc-ffff8c8096908800
      Feb 19 11:46:27 sh-106-64.int kernel: LustreError: 39926:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c8041fada00 x1624748825812992/t0(0) o101->oak-OST0033-osc-ffff8c8096908800
      Feb 19 14:14:03 sh-106-64.int kernel: LustreError: 48889:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c69c2e1bc00 x1624748827483728/t0(0) o101->oak-OST0066-osc-ffff8c8096908800
      Feb 19 14:36:38 sh-106-64.int kernel: LustreError: 56018:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c67d265e900 x1624748827749392/t0(0) o101->oak-OST0071-osc-ffff8c8096908800
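
To see how the "req wrong generation" errors spread across OSTs, the client log can be tallied per target. The following is only a minimal sketch: the function name `count_wrong_generation` and the regular expression are illustrative assumptions derived from the log excerpts above, not part of any Lustre tool.

```python
import re
from collections import Counter

# Assumed pattern, based on the log excerpts above: the target name
# (for example "oak-OST005f") follows "->" in a "req wrong generation" line.
WRONG_GEN = re.compile(
    r"req wrong generation:.*?->(?P<target>[\w-]+?-OST[0-9a-fA-F]{4})"
)


def count_wrong_generation(lines):
    """Return a Counter mapping OST target name to number of matching errors."""
    counts = Counter()
    for line in lines:
        match = WRONG_GEN.search(line)
        if match:
            counts[match.group("target")] += 1
    return counts


# Three lines copied from the log above; the ost_connect line should not match.
sample = [
    "Feb 16 18:12:47 sh-106-64.int kernel: LustreError: 11-0: oak-OST0071-osc-ffff8c8096908800: operation ost_connect to node 10.0.2.109@o2ib5 failed: rc = -19",
    "Feb 18 19:25:59 sh-106-64.int kernel: LustreError: 397481:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c67bf646000 x1624748797937520/t0(0) o101->oak-OST005f-osc-ffff8c809690880",
    "Feb 19 10:55:50 sh-106-64.int kernel: LustreError: 39073:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation:  req@ffff8c77a92f2400 x1624748825028640/t0(0) o101->oak-OST004f-osc-ffff8c8096908800",
]

for target, n in sorted(count_wrong_generation(sample).items()):
    print(target, n)
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines.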
      

      Any idea how to troubleshoot this issue? Perhaps this is a 2.10/2.12 compatibility issue?
      Thanks,
      Stephane

          Activity

            [LU-11976] req wrong generation leading to I/O errors on 2.12 clients

            adilger Andreas Dilger added a comment:

            Close as a duplicate of LU-11951, which has a patch. I've cherry-picked it to b2_12 as https://review.whamcloud.com/34435

            sthiell Stephane Thiell added a comment:

            Hi Alex,

            I confirm that since we disabled the idling connection feature two days ago, we are no longer seeing any "req wrong generation" errors on Sherlock. Thanks!

            sthiell Stephane Thiell added a comment:

            Actually, we also found many occurrences of "req wrong generation" errors with fir, so idle_timeout is now set to 0 on all filesystems.

            sthiell Stephane Thiell added a comment:

            Thanks Alex! We have disabled the idling connection timeout feature on all clients for our pre-2.12 filesystems (Oak, based on Lustre 2.10 servers, and Regal, based on Lustre 2.8 servers). We decided to keep it enabled on our new 2.12 filesystem (fir) until we see the same "req wrong generation" issues there.

            $ lctl get_param osc.*.idle_timeout
            osc.fir-OST0000-osc-ffff92c50baf0800.idle_timeout=20
            ...
            osc.fir-OST002f-osc-ffff92c50baf0800.idle_timeout=20
            osc.oak-OST0000-osc-ffff92c50b2d3000.idle_timeout=0
            ...
            osc.oak-OST0071-osc-ffff92c50b2d3000.idle_timeout=0
            osc.regal-OST0000-osc-ffff92c50b2d1800.idle_timeout=0
            ...
            osc.regal-OST006b-osc-ffff92c50b2d1800.idle_timeout=0
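
With many OSC devices per client, it can help to summarize the `lctl get_param osc.*.idle_timeout` output per filesystem and confirm which ones still have the idle timeout enabled. The sketch below assumes the output format shown in the comment above; the function names `idle_timeouts_by_fs` and `filesystems_with_idle_enabled` are hypothetical helpers, not Lustre utilities.

```python
from collections import defaultdict


def idle_timeouts_by_fs(output):
    """Group idle_timeout values by filesystem name.

    Expects lines such as
    'osc.oak-OST0000-osc-ffff92c50b2d3000.idle_timeout=0'.
    Lines that do not match (for example '...') are skipped.
    """
    result = defaultdict(set)
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("osc.") or ".idle_timeout=" not in line:
            continue
        device, _, value = line[len("osc."):].partition(".idle_timeout=")
        # Assumption: the filesystem name is the part before "-OST".
        fsname = device.split("-OST", 1)[0]
        result[fsname].add(int(value))
    return dict(result)


def filesystems_with_idle_enabled(output):
    """Return sorted filesystem names with at least one non-zero idle_timeout."""
    grouped = idle_timeouts_by_fs(output)
    return sorted(fs for fs, values in grouped.items() if any(v != 0 for v in values))


# Output copied from the comment above, including the elided '...' lines.
sample_output = """\
osc.fir-OST0000-osc-ffff92c50baf0800.idle_timeout=20
...
osc.fir-OST002f-osc-ffff92c50baf0800.idle_timeout=20
osc.oak-OST0000-osc-ffff92c50b2d3000.idle_timeout=0
...
osc.oak-OST0071-osc-ffff92c50b2d3000.idle_timeout=0
osc.regal-OST0000-osc-ffff92c50b2d1800.idle_timeout=0
...
osc.regal-OST006b-osc-ffff92c50b2d1800.idle_timeout=0
"""

print(filesystems_with_idle_enabled(sample_output))
```

Values are collected into a set per filesystem so that a mix of settings on one filesystem would show up rather than being silently overwritten.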

            bzzz Alex Zhuravlev added a comment:

            Please try disabling the idling connection feature on the clients:

            lctl set_param osc.*.idle_timeout=0

            bzzz Alex Zhuravlev added a comment:

            Looks like a duplicate of LU-11951.

            People

              bzzz Alex Zhuravlev
              sthiell Stephane Thiell
              Votes: 0
              Watchers: 8
