
Random eviction of clients on lock callback timeout

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Labels: None
    • Severity: 3
    • Rank: 11068

    Description

      Hi,

      At Meteo France, Lustre clients are randomly evicted with the following message:

      1381739209 2013 Oct 14 08:26:49 beaufixoss9 kern err kernel LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.67.157@o2ib ns: filter-scratch-OST0024_UUID lock: ffff880215b3c6c0/0xe3c417cfbe584025 lrc: 3/0,0 mode: PW/PW res: 87670641/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x964c642755ae9fd expref: 4 pid: 20133 timeout 4581190301

      Some attempts were made to increase ldlm_timeout, but even with a timeout of 300 seconds the timer eventually expires and the client is evicted.
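
      (For reference, a hedged sketch of how ldlm_timeout can be raised; the commands below are the usual ones for this Lustre generation, and the value 300 is simply the one tried above:)

      lctl set_param ldlm_timeout=300              # runtime, on each OSS
      echo 300 > /proc/sys/lustre/ldlm_timeout     # equivalent /proc path on 2.1.x
      lctl conf_param scratch.sys.ldlm_timeout=300 # persistent, run on the MGS (fsname taken from the logs)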

      After collecting some debug logs, it seems there is a communication problem: the client sends the lock callback RPC, but it is never received by the OSS, which therefore times out.
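
      (For context, a hedged sketch of how such debug logs are typically captured on both client and OSS; the debug mask and buffer size are illustrative, not necessarily the exact settings used here:)

      lctl set_param debug=-1        # enable all debug flags
      lctl set_param debug_mb=512    # enlarge the kernel debug buffer
      lctl clear                     # reset the buffer before reproducing
      # ... reproduce the eviction ...
      lctl dk /tmp/lustre-logs-$(date +%Y%m%d-%H%M%S)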

      Here is an example:

      [root@beaufixmngt0 beaufix1046]# grep evict lustre-logs-20131011-082535
      00000100:00080000:17.0:1381479935.274633:0:4383:0:(import.c:915:ptlrpc_connect_interpret()) @@@ scratch-OST0017-osc-ffff880454e9b400: evicting (reconnect/recover flags not set: 4) req@ffff88087945e000 x1448401857359985/t0(0) o8->scratch-OST0017-osc-ffff880454e9b400@172.23.80.69@o2ib1:28/4 lens 368/264 e 0 to 0 dl 1381479946 ref 1 fl Interpret:RN/0/0 rc 0/0
      00000100:02020000:17.0:1381479935.274641:0:4383:0:(import.c:1299:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by scratch-OST0017; in progress operations using this service will fail.
      00000100:00080000:17.0:1381479935.285351:0:4383:0:(import.c:1303:ptlrpc_import_recovery_state_machine()) evicted from scratch-OST0017_UUID@172.23.80.69@o2ib1; invalidating

      172.23.80.69 = beaufixoss15-ib1

      [root@beaufixmngt0 beaufix1046]# head -n1 ../beaufixoss15/lustre-logs-20131011-082458
      00010000:00000010:14.1F:1381479186.063701:0:0:0:(ldlm_lock.c:198:ldlm_lock_free()) slab-freed 'lock': 560 at ffff880067e6c480.
      [root@beaufixmngt0 beaufix1046]# tail -n 2 ../beaufixoss15/lustre-logs-20131011-082458 | head -n1
      00002000:00000001:2.0:1381479898.458969:0:20716:0:(filter_io_26.c:325:filter_do_bio()) Process entered

      OSS debug log covers from 1381479186 to 1381479898

      Here is the list of RPCs sent during this time (total of 5 RPCs):
      [root@beaufixmngt0 beaufix1046]# grep -f requests lustre-logs-20131011-082535 | grep ptl_send_buf
      00000100:00000200:22.0:1381479685.229939:0:4404:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857358474, offset 0
      00000100:00000200:46.0:1381479697.155610:0:4424:0:(niobuf.c:87:ptl_send_buf()) Sending 296 bytes to portal 28, xid 1448401857358807, offset 0
      00000100:00000200:6.0:1381479797.753246:0:4384:0:(niobuf.c:87:ptl_send_buf()) Sending 456 bytes to portal 6, xid 1448401857359681, offset 0
      00000100:00000200:27.0:1381479835.230381:0:4411:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359731, offset 0
      00000100:00000200:6.0:1381479885.229892:0:4409:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359816, offset 0

      [root@beaufixmngt0 beaufix1046]# grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
      00000100:00000040:0.0:1381479885.230810:0:10038:0:(events.c:284:request_in_callback()) incoming req@ffff88005b69f400 x1448401857359816 msgsize 192

      -> Only 1 RPC received out of the 5 sent by the client
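
      (For reference, a hedged sketch of how this comparison can be scripted; the contents of the "requests" pattern file are not shown above, the assumption being that it simply lists the xids of the RPCs in flight:)

      # take one of the xids sent by the client and check whether it ever reached the OSS
      grep 1448401857359681 ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
      # or check all of them at once, reusing the same pattern file
      grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming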

      This example only shows what happens at the LNet level and above; so far we see no errors at the InfiniBand level.

      This is a blocking problem, as it prevents the cluster from being used in production.

      I will upload traces to help analysis.

      Thanks for your help,
      Sebastien.

      Attachments

        1. 2014-02-11-run6-client.out.gz
          970 kB
        2. LU-4112-reproducer.tar.gz
          2 kB
        3. LU-4112-reproducer-v2.tar.gz
          7 kB
        4. lustre-logs.tar.bz2.aa
          0.2 kB
        5. lustre-logs.tar.bz2.ab
          0.2 kB
        6. server_log_2_10_2014.log
          11 kB
        7. xfer_client_2_10_2014.log
          13 kB

        Activity

          [LU-4112] Random eviction of clients on lock callback timeout

          Old blocker for unsupported version

          simmonsja James A Simmons added a comment

          I was able today to perform new tests.

          Running the reproducer on a regular kernel took between 1 and 5 minutes (of course leading to multiple lock callback timeouts after 160 seconds).
          Using the kernel.sched_features parameter requires a kernel compiled with CONFIG_SCHED_DEBUG, so I installed one of our debug kernels, which has CONFIG_SCHED_DEBUG among many other DEBUG features.
          Running the same reproducer on this kernel with the default sched_features completed quite consistently in 20-21 seconds.
          Running with your kernel (2.6.32-358.11.1.el6_lustre), which has far fewer DEBUG features, gave the following times:

          • between 40 seconds and 2.5 minutes with default sched_features (=3183)
          • constant time of 15 seconds with sched_features=1135

          So this is very encouraging, and I guess we now need to understand the phenomenon more deeply to see whether there is actually a problem with our platform, or whether something simply goes wrong when running this kind of load on systems with many cores (3 nodes x 24 cores in this case).
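
          (As a quick illustrative check, the two sched_features values above differ by exactly one bit, 2048 = 1<<11, i.e. a single scheduler feature is being toggled, OWNER_SPIN per the next comment:)

          echo $((3183 ^ 1135)) $((1 << 11))    # both print 2048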

          spiechurski Sebastien Piechurski added a comment

          Humm, I am late again, but this is a really tricky problem to debug from crash-dumps and Lustre debug-logs ...
          Seb, I wonder if you could re-run the same last test session with the OWNER_SPIN sched_feature disabled?
          This can be done as follows:

          # sysctl -n kernel.sched_features
          3183
          # sysctl -w kernel.sched_features=1135
          kernel.sched_features = 1135
          # sysctl -n kernel.sched_features
          1135
          #
          

          BTW, Bull's kernel version (2.6.32-279.5.2.bl6.Bull.36.x86_64), which seems to be installed on-site, does not allow this, but the kernel versions (>= 2.6.32-358.11.1.el6_lustre.x86_64) we ship with Lustre 2.1.6/2.4.3 from our downloads area do.
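
          (For reference, on a kernel built with CONFIG_SCHED_DEBUG the same feature can also be toggled by name through debugfs, which avoids computing the bitmask by hand; a hedged sketch, debugfs must be mounted:)

          mount -t debugfs none /sys/kernel/debug 2>/dev/null    # ensure debugfs is mounted
          cat /sys/kernel/debug/sched_features                   # disabled features show a NO_ prefix
          echo NO_OWNER_SPIN > /sys/kernel/debug/sched_features  # disable mutex owner spinning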

          bfaccini Bruno Faccini (Inactive) added a comment

          Seb, just to let you know that your latest info confirms my earlier thoughts. I will try to give an update on where I am in the "livelock" analysis by the end of today.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hello Seb,
          Thanks again for the extra work this caused you ... I am currently working on these new materials and will let you know soon whether they confirm my last idea.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hi Bruno,

          I am currently uploading an archive on the ftp site under uploads/LU-4112/LU-4112-complete-dumps-20140425.tar.bz2 which contains:

          • crash dumps of all involved nodes
          • lustre traces (all debug flags enabled) for each node as well. The trace for sid683, the client that was crashed at eviction time, was extracted from the dump but, for an unknown reason, does not contain the last traces, so I included the last lines from the console in the file final_message.sid683 (see the conversion sketch after this list)
          • nids: a file to match nids to node names
          • nodes: the role of each node
          • the vmlinux and System.map files
          • the lustre modules
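
          (For reference, a hedged sketch of how a binary Lustre debug log carved out of a crash dump can be converted to text; the file names are hypothetical:)

          lctl debug_file debug.sid683.bin debug.sid683.txt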

          This was reproduced with the kernel (2.6.32-358.11.1.el6_lustre) and lustre version (vanilla 2.1.6) from the downloads.whamcloud.com site except for the LBUG() call introduced in ptlrpc_invalidate_import_thread().
          I was also able to reproduce the problem with version 2.4.3.

          Note that the nodes used here are based on a 10-core (40 threads per node) version of the processor we used for the first occurrences (instead of a 12-core one), and that it took me many more attempts to reproduce the problem, whereas it was systematically reproduced with the 12-core version (48 threads).
          This leads me to think that we are close to a limit that exposes the scheduling problem.

          I hope you will be able to confirm your thoughts with these traces.
          Tell me if anything is missing.

          spiechurski Sebastien Piechurski added a comment

          Yes, a crash-dump is required to confirm the behavior found in the debug-traces, and also what I think I found in the crash-dump taken when running with the instrumented Lustre modules, where you inserted an LBUG() in ptlrpc_invalidate_import_thread() when obd_dump_on_eviction != 0. Can you do the same?
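
          (For reference, a hedged sketch of arming that instrumentation on a client; the LBUG() itself is the local patch described in this ticket, not stock Lustre, and the parameter path may vary by version:)

          lctl set_param dump_on_eviction=1           # sets obd_dump_on_eviction != 0
          echo 1 > /proc/sys/lustre/dump_on_eviction  # equivalent /proc path on 2.1.x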

          bfaccini Bruno Faccini (Inactive) added a comment

          Hi Bruno,

          Sorry I was not able to give an update earlier.
          I had the people onsite take debug traces on a reproduced occurrence of the problem.
          I have uploaded them to the ftp site under uploads/LU-4112/LU-4112-occurence-20140401.tar.xz
          Unfortunately, I forgot to tell them to get a crashdump of the clients as well.
          Tell me if the debug traces are not enough and I will arrange to get new traces together with the crashdump.

          spiechurski Sebastien Piechurski added a comment

          Seb, can you give an update? Thanks in advance.

          bfaccini Bruno Faccini (Inactive) added a comment

          Seb: To ensure I did not miss anything during your/our earlier testing/reproducing/debugging sessions, I again went through all the infos/logs/crash-dumps/... and any other material that we generated before.

          And going deeper into the crash-dump that you generated on the client upon eviction (with your Lustre distro including the patch that introduces an LBUG() in ptlrpc_invalidate_import_thread() when obd_dump_on_eviction != 0), I found something really interesting that may be a new direction to investigate (a possible live-lock/scheduling issue during cl_lock mutex operations that may cause the timeouts/evictions ...).

          I know it is more work for you, but since you are only able to really reproduce this on your platform, could you re-run the reproducer on a set of clients running with the same patch, in order to provide me with a new crash-dump to analyze and help me confirm this?

          bfaccini Bruno Faccini (Inactive) added a comment

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 14

            Dates

              Created:
              Updated:
              Resolved: