[LU-9242] Applications are failing to complete due to connection loss with OSS servers Created: 22/Mar/17 Updated: 26/Apr/17 Resolved: 25/Apr/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James A Simmons | Assignee: | Jian Yu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre 2.9.54 running on RHEL7 servers using ldiskfs. The client side is Cray SLES11SP4, also running Lustre 2.9.54 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
With my testing of the latest master branch I see jobs failing due to the loss of communication with the OSS servers. I see a reconnect storm, but in the end the applications error out. |
| Comments |
| Comment by Peter Jones [ 22/Mar/17 ] |
|
Jian, could you please advise on this one? Thanks, Peter |
| Comment by Jian Yu [ 22/Mar/17 ] |
|
Hi James, From the attached debug log, I saw a lot of the following error messages:
00000100:00000040:20.0:1490199768.961594:0:20430:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff880775b17800 x1562585786929952/t0(0) o104->sultan-OST000c@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:20.0:1490199768.961736:0:20430:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff880775b11800 x1562585786930480/t0(0) o104->sultan-OST000c@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:20.0:1490199768.961887:0:20430:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff880775b14200 x1562585786930880/t0(0) o104->sultan-OST000c@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:19.0:1490199768.961965:0:20412:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff88077b6f8f00 x1562585786934704/t0(0) o104->sultan-OST0018@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:3.0:1490199768.962021:0:20501:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff88006cc7d700 x1562585786931872/t0(0) o104->sultan-OST0028@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:6.0:1490199768.962058:0:20511:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff880068527800 x1562585786934880/t0(0) o104->sultan-OST0014@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:19.0:1490199768.962109:0:20412:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff88077b6f9e00 x1562585786936128/t0(0) o104->sultan-OST0018@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
......
And the context of one of the above error messages is something like:
00000100:00000001:20.0:1490199768.961593:0:20430:0:(client.c:1251:ptlrpc_check_status()) Process entered
Are there any error messages in the syslog/console logs on the client and OSS nodes? Could you please also gather the debug log on the OSS node? Thank you. |
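[Editor's note: the `ptlrpc_check_status()` entries above all report status -22 (-EINVAL) against several OST targets. When triaging a reconnect storm like this, it can help to summarize the debug log by opcode, target, and status. Below is a minimal, hypothetical sketch (not part of Lustre's tooling) that tallies such lines, using entries from the log above as sample input:]

```python
import re
from collections import Counter

# Sample ptlrpc_check_status() lines copied from the attached debug log.
LOG_LINES = """\
00000100:00000040:20.0:1490199768.961594:0:20430:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff880775b17800 x1562585786929952/t0(0) o104->sultan-OST000c@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:19.0:1490199768.961965:0:20412:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff88077b6f8f00 x1562585786934704/t0(0) o104->sultan-OST0018@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
00000100:00000040:3.0:1490199768.962021:0:20501:0:(client.c:1271:ptlrpc_check_status()) @@@ status is -22 req@ffff88006cc7d700 x1562585786931872/t0(0) o104->sultan-OST0028@3@gni1:15/16 lens 296/192 e 0 to 0 dl 1490199781 ref 1 fl Rpc:R/2/0 rc 0/-22
"""

# Match the status code, RPC opcode, and target in lines like:
#   "... status is -22 req@... x.../t0(0) o104->sultan-OST000c@3@gni1:15/16 ..."
PATTERN = re.compile(r"status is (-?\d+) req@\S+ \S+ o(\d+)->(\S+?)@(\S+?):")

def tally_errors(text):
    """Count (opcode, target, status) triples seen in ptlrpc_check_status lines."""
    counts = Counter()
    for match in PATTERN.finditer(text):
        status, opcode, target, nid = match.groups()
        counts[(opcode, target, status)] += 1
    return counts

for (opcode, target, status), n in sorted(tally_errors(LOG_LINES).items()):
    print(f"opcode o{opcode} -> {target}: status {status} x{n}")
```

[A summary like this makes it easy to see whether the -22 errors are concentrated on one OST or spread across many, which is useful when deciding whether to pull logs from a single OSS or all of them.]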
| Comment by James A Simmons [ 24/Mar/17 ] |
|
The attached debug logs are from the OSS nodes. I'm seeing the following console messages on the OSS server:
[16278.899599] Lustre: sultan-OST0000: Client ecf7eb35-3b79-6131-416a-61beb23f4a96 (at 30@gni1) reconnecting
On the client node I see in the console logs:
Lustre: Skipped 55 previous similar messages
I will get the client logs as well tomorrow.
|
| Comment by Jian Yu [ 25/Mar/17 ] |
|
Hi James, |
| Comment by Peter Jones [ 17/Apr/17 ] |
|
James, do you still have this test environment set up? Is this simply due to Multi-Rail LNET not being able to work in this configuration? Peter |
| Comment by Peter Jones [ 25/Apr/17 ] |
|
James, it does not seem that this is still a live issue for you, so I will close it for the time being. If you do see this issue on more current builds and can provide enough data to debug, then we'll of course reopen. Peter |
| Comment by James A Simmons [ 26/Apr/17 ] |
|
I would say it's safe to keep this closed. I have updated my test bed to the latest master and I'm not seeing the application failures. |
| Comment by Peter Jones [ 26/Apr/17 ] |
|
thanks for confirming James |