Lustre / LU-10961

Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.12.0
    • Affects Version: Lustre 2.12.0
    • Environment: soak cluster
    • Severity: 3

    Description

      We are seeing repeated hard hangs on clients after server failover.
      'df' on a client will hang and user tasks do not complete. So far there are no hard faults; the node just grinds to a halt. Yesterday this occurred on soak-17 and soak-23. I have dumped stacks on both nodes, and crash dumps are available on soak.
      We see:

      • Connections to one or more OSTs drop, and the client does not reconnect:
        Apr 27 03:28:42 soak-23 kernel: Lustre: 2084:0:(client.c:2099:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1524799714/real 0]  req@ffff8808fee67500 x1598738343197024/t0(0) o400->soaked-OST000b-osc-ffff8807f6ba0800@192.168.1.107@o2ib:28/4 lens 224/224 e 0 to 1 dl 1524799721 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
        Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST0011-osc-ffff8807f6ba0800: Connection to soaked-OST0011 (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete
        Apr 27 03:28:42 soak-23 kernel: Lustre: Skipped 3 previous similar messages
        Apr 27 03:28:42 soak-23 kernel: Lustre: 2084:0:(client.c:2099:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
        Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST000b-osc-ffff8807f6ba0800: Connection to soaked-OST000b (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete
        

        As of 1700 hours (14 hours after failover) the node still has not reconnected to this OST.

      We also see repeated errors referencing the MDT:

      Apr 27 17:25:42 soak-23 kernel: LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4
      

      The error appears very repeatable. Logs and stack traces are attached.

      Attachments

        1. mds.lustre.log.txt.gz
          17.85 MB
        2. s-17.client.hang.txt.gz
          7.80 MB
        3. soak-17.log.gz
          281 kB
        4. soak-17.lustre.log.txt.gz
          0.9 kB
        5. soak-17.stacktrace.txt
          553 kB
        6. soak-18.lustre.log.txt.gz
          1.68 MB
        7. soak-19.lustre.log.txt.gz
          1.52 MB
        8. soak-21.06-05-2018.gz
          17.40 MB
        9. soak-23.client.hang.txt.gz
          7.92 MB
        10. soak-23.stacks.txt
          574 kB
        11. soak-24.0430.txt.gz
          19.33 MB
        12. soak-24.stack.txt
          567 kB
        13. soak-42.log.gz
          355 kB
        14. soak-42.lustre.log.txt.gz
          1.14 MB
        15. soak-44.fini.txt
          136.00 MB
        16. soak-8.console.log.gz
          2.43 MB
        17. soak-8.log.gz
          152 kB
        18. soak-8.lustre.log.2018-06-05.gz
          26.76 MB
        19. soak-8.syslog.log.gz
          3.42 MB


          Activity


            cliffw Cliff White (Inactive) added a comment:

            Hit the bug right away; Lustre logs for two clients are attached. Also dumped stacks and crash-dumped those nodes, bits are on Spirit. Restarting with full debug.

            gerrit Gerrit Updater added a comment:

            Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/32710
            Subject: LU-10961 osc: add debug code
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 523ac2ce306d182b2dc5db7c9a0f401b39124963

            jamesanunez James Nunez (Inactive) added a comment:

            Mike - Would you please upload a patch with any necessary debug information that will cause an error, so we get a failure instead of waiting for a timeout?

            Thanks.

            tappro Mikhail Pershin added a comment:

            Yes, my question was about the particular time when the error happened; since it is not possible to select any specific load causing it, only logs from the moment of failure could make that clear. For this we can try to inject code which causes the error immediately instead of a timeout.

            Dmitry has already shown the place where the timeout happens; we can probably output more debug info there and return an error immediately instead of waiting, so the Lustre logs will contain useful data.
            cliffw Cliff White (Inactive) added a comment (edited):

            I thought we had explained soak to you. The tests running at the time of failure were:
            blogbench, iorssf, iorfpp, kcompile, mdtestfpp, mdtestssf, simul,
            fio (random, sequential, SAS simulation)
            The random mix is distributed across the clients, with the intent of seriously loading each client. It is difficult to tell exactly what is running on a specific node at the time of failure, but generally a node will have 2-3 different jobs running at any given time.

            tappro Mikhail Pershin added a comment:

            Cliff, what type of load is used in this testing? Is it something particular, e.g. 'dd' or 'tar', or a mix? I am thinking about the possibility of reproducing this with a simpler test.

            cliffw Cliff White (Inactive) added a comment:

            Thanks. I forced the core dump several hours after the fault; it's difficult to catch, as the fault normally occurs in the middle of my night.

            dmiter Dmitry Eremin (Inactive) added a comment:

            Thanks. I submitted the DCO ticket and now I'm copying the core from onyx...

            cliffw Cliff White (Inactive) added a comment:

            In the meantime, the crash dump is on onyx at /home/cliffwhi/lu-10961 - if you can't reach it, point me to a better directory and I'll put it there.
            cliffw Cliff White (Inactive) added a comment (edited):

            You should be able to get a Spirit account quickly: file a DCO ticket and label it account-mgmnt. It usually happens in minutes - we've advised DCO, and they are ready to do it now. All they need is your public ssh key.

            dmiter Dmitry Eremin (Inactive) added a comment:

            Sorry, I don't have a login on Spirit. Could you copy it to somewhere on onyx?

            People

              Assignee: tappro Mikhail Pershin
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 10
