[LU-11976] req wrong generation leading to I/O errors on 2.12 clients Created: 19/Feb/19 Updated: 19/Dec/23 Resolved: 16/Mar/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Clients:2.12.0 Servers (Oak): 2.10.5 |
||
| Issue Links: |
|
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Since we upgraded our clients to 2.12.0, our users are reporting more I/O errors on Oak (2.10 servers) that seem to be related to the following LustreError messages:

Example:
Feb 18 19:25:59 sh-106-64.int kernel: LustreError: 397481:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c67bf646000 x1624748797937520/t0(0) o101->oak-OST005f-osc-ffff8c809690880

NAMD job failing with I/O error:
Info: Working in the current directory /oak/....../aqueous2opls/run2
...
FATAL ERROR: Error on write to binary file step6.6_equilibration.restart.vel: Input/output error

Timestamp of NAMD file:
Feb 18 19:25 cluster6.out

The Lustre client shows a lot of these error messages, on different OSTs. These are all the Oak-related logs from a client (sh-106-64) that has generated I/O errors:
Feb 06 11:23:22 sh-106-64.int kernel: Lustre: Mounted oak-client
Feb 07 12:49:28 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1549572562/real 1549572562] req@ffff8c6591f3e300 x16247483-
Feb 10 08:43:24 sh-106-64.int kernel: LustreError: 1287:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c66aaa35400 x1624748322417600/t0(0) o101->oak-OST006d-osc-ffff8c8096908800@
Feb 13 10:55:06 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550084100/real 1550084100] req@ffff8c7204ebe900 x16247484-
Feb 15 07:46:03 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550245557/real 1550245557] req@ffff8c7fc423a400 x16247484-
Feb 15 09:17:12 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550251026/real 1550251026] req@ffff8c7fc423b600 x16247484-
Feb 15 09:21:23 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550251277/real 1550251277] req@ffff8c7fc4238f00 x16247484-
Feb 15 10:09:28 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550254162/real 1550254162] req@ffff8c7fc423b000 x16247484-
Feb 15 10:22:01 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550254915/real 1550254915] req@ffff8c7fc423bc00 x16247484-
Feb 15 13:28:05 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550266079/real 1550266079] req@ffff8c7fc4238c00 x16247484-
Feb 15 13:31:26 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550266280/real 1550266280] req@ffff8c7fc423ce00 x16247484-
Feb 15 13:38:07 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550266681/real 1550266681] req@ffff8c7fc423e300 x16247484-
Feb 15 13:44:24 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550267058/real 1550267058] req@ffff8c7fc423ad00 x16247484-
Feb 16 18:12:47 sh-106-64.int kernel: LustreError: 11-0: oak-OST0071-osc-ffff8c8096908800: operation ost_connect to node 10.0.2.109@o2ib5 failed: rc = -19
Feb 17 00:40:54 sh-106-64.int kernel: Lustre: 98431:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550392848/real 1550392848] req@ffff8c8020528300 x16247484-
Feb 18 19:25:59 sh-106-64.int kernel: LustreError: 397481:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c67bf646000 x1624748797937520/t0(0) o101->oak-OST005f-osc-ffff8c809690880
Feb 19 10:55:50 sh-106-64.int kernel: LustreError: 39073:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c77a92f2400 x1624748825028640/t0(0) o101->oak-OST004f-osc-ffff8c8096908800
Feb 19 11:46:27 sh-106-64.int kernel: LustreError: 39926:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c8041fada00 x1624748825812992/t0(0) o101->oak-OST0033-osc-ffff8c8096908800
Feb 19 14:14:03 sh-106-64.int kernel: LustreError: 48889:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c69c2e1bc00 x1624748827483728/t0(0) o101->oak-OST0066-osc-ffff8c8096908800
Feb 19 14:36:38 sh-106-64.int kernel: LustreError: 56018:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c67d265e900 x1624748827749392/t0(0) o101->oak-OST0071-osc-ffff8c8096908800

Any idea of how to troubleshoot this issue? Perhaps this is a 2.10/2.12 compat issue? |
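To see how the "req wrong generation" messages are spread across OSTs, the client's kernel log can be tallied per target. The following is only a rough sketch, not a Lustre tool: the function name `count_wrong_generation`, the `SAMPLE_LOG` variable and the regular expression are illustrative assumptions, and the pattern is inferred solely from the log lines quoted in this ticket.

```python
import re
from collections import Counter

# Assumed shape of the target field, inferred from the log lines in this
# ticket, e.g. "o101->oak-OST005f-osc-ffff8c8096908800".
TARGET_RE = re.compile(r"->([\w-]+?-OST[0-9a-fA-F]{4})-osc-")


def count_wrong_generation(log_text):
    """Count "req wrong generation" messages per OST target (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        if "req wrong generation" not in line:
            continue  # ignore timeouts, mount messages, connect failures
        match = TARGET_RE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# A few lines copied from the description above.
SAMPLE_LOG = """\
Feb 10 08:43:24 sh-106-64.int kernel: LustreError: 1287:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c66aaa35400 x1624748322417600/t0(0) o101->oak-OST006d-osc-ffff8c8096908800@
Feb 16 18:12:47 sh-106-64.int kernel: LustreError: 11-0: oak-OST0071-osc-ffff8c8096908800: operation ost_connect to node 10.0.2.109@o2ib5 failed: rc = -19
Feb 18 19:25:59 sh-106-64.int kernel: LustreError: 397481:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c67bf646000 x1624748797937520/t0(0) o101->oak-OST005f-osc-ffff8c809690880
Feb 19 14:36:38 sh-106-64.int kernel: LustreError: 56018:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8c67d265e900 x1624748827749392/t0(0) o101->oak-OST0071-osc-ffff8c8096908800
"""

if __name__ == "__main__":
    for target, n in count_wrong_generation(SAMPLE_LOG).most_common():
        print(target, n)
```

In practice the text would come from the client's journal or syslog rather than an inline string. Lines that mention the error but do not match the assumed target pattern are counted under "unknown" instead of being dropped.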
| Comments |
| Comment by Alex Zhuravlev [ 20/Feb/19 ] |
|
looks like a duplicate of |
| Comment by Alex Zhuravlev [ 20/Feb/19 ] |
|
Please try disabling the idle connection feature on the clients:
lctl set_param osc.*.idle_timeout=0 |
| Comment by Stephane Thiell [ 20/Feb/19 ] |
|
Thanks Alex! We have disabled the idling connection timeout feature on all clients on our pre-2.12 filesystems (Oak based on lustre 2.10 servers and Regal based on lustre 2.8 servers). We decided to keep it enabled with our new 2.12 filesystem (fir) until we see the same "req wrong generation" issues.

$ lctl get_param osc.*.idle_timeout
osc.fir-OST0000-osc-ffff92c50baf0800.idle_timeout=20
...
osc.fir-OST002f-osc-ffff92c50baf0800.idle_timeout=20
osc.oak-OST0000-osc-ffff92c50b2d3000.idle_timeout=0
...
osc.oak-OST0071-osc-ffff92c50b2d3000.idle_timeout=0
osc.regal-OST0000-osc-ffff92c50b2d1800.idle_timeout=0
...
osc.regal-OST006b-osc-ffff92c50b2d1800.idle_timeout=0 |
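With many OSC devices per client, it can help to check the `lctl get_param osc.*.idle_timeout` output mechanically for devices where the idle timeout is still enabled. This is a minimal sketch that assumes the `osc.<device>.idle_timeout=<seconds>` line format shown in the comment above; the helper names `parse_idle_timeouts` and `still_enabled` are hypothetical and not part of Lustre.

```python
def parse_idle_timeouts(output):
    """Map each OSC device name to its idle_timeout value (hypothetical helper).

    Assumes lines of the form "osc.<device>.idle_timeout=<integer>".
    """
    timeouts = {}
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("osc.") or ".idle_timeout=" not in line:
            continue  # skips "..." elisions and unrelated lines
        name, _, value = line.partition(".idle_timeout=")
        timeouts[name[len("osc."):]] = int(value)
    return timeouts


def still_enabled(timeouts):
    """Return the sorted device names whose idle_timeout is not 0."""
    return sorted(name for name, value in timeouts.items() if value != 0)


# Lines copied from the get_param output quoted above.
SAMPLE_OUTPUT = """\
osc.fir-OST0000-osc-ffff92c50baf0800.idle_timeout=20
...
osc.fir-OST002f-osc-ffff92c50baf0800.idle_timeout=20
osc.oak-OST0000-osc-ffff92c50b2d3000.idle_timeout=0
osc.regal-OST0000-osc-ffff92c50b2d1800.idle_timeout=0
"""

if __name__ == "__main__":
    for device in still_enabled(parse_idle_timeouts(SAMPLE_OUTPUT)):
        print(device)
```

For the sample above, only the two fir devices are reported, which matches the state described in this comment, where idle_timeout was set to 0 for Oak and Regal but left at 20 for fir.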
| Comment by Stephane Thiell [ 20/Feb/19 ] |
|
Actually we also found many occurrences of "req wrong generation" errors with fir, so idle_timeout is now set to 0 on all filesystems. |
| Comment by Stephane Thiell [ 22/Feb/19 ] |
|
Hi Alex, I confirm that since we disabled the idling connection feature 2 days ago, we are no longer seeing any occurrences of these "req wrong generation" errors on Sherlock. Thanks! |
| Comment by Andreas Dilger [ 16/Mar/19 ] |
|
Close as a duplicate of |