Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
Description
Hi,
At CEA they quite often see a problem on Lustre clients where processes get stuck consuming a lot of CPU time in the Lustre layers. Unfortunately, the only way to really fix this for now is to reboot the impacted nodes (after waiting for them for several hours), since the involved processes are not killable.
Crash dump analysis shows processes stuck with the following stack traces (the crash dumps can only be analyzed on the customer's site):
=========================================================
_spin_lock()
cl_page_gang_lookup()
cl_lock_page_out()
osc_lock_flush()
osc_lock_cancel()
cl_lock_cancel0()
.....
=========================================================
and/or
=========================================================
__cond_resched()
_cond_resched()
cfs_cond_resched()
cl_lock_page_out()
osc_lock_flush()
osc_lock_cancel()
cl_lock_cancel0()
.....
=========================================================
Attached you will find 3 files:
- node1330_dmesg is the dmesg of the faulty client;
- node1330_lctl_dk is the 'lctl dk' output from the faulty client;
- cmds.txt is the sequence of commands run to get the 'lctl dk' output.
There are also "ll_imp_inval" threads stuck because of this problem, leaving OSCs in "IN"active state for too long, which finally causes timeouts and EIOs for client processes.
The data structures involved are cl_object_header.coh_page_guard and cl_object_header.coh_tree, respectively the lock and the radix tree used to manage the page cache associated with a Lustre client object.
It seems to be a race around the OSC object's page lock/radix tree when concurrent accesses occur (OOM, flush, invalidation, concurrent I/O). The problem seems to occur when, on the same Lustre client, there are concurrent accesses to the same Lustre objects, inducing contention on the associated lock and radix tree from multiple CPUs.
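As a rough illustration only (this is not Lustre code; all names, sizes and batch counts below are made up), the pattern the stacks suggest is many per-CPU walkers repeatedly re-taking a single spinlock to scan a shared page index in batches, something like this stand-alone user-space model:
=========================================================
/*
 * Stand-alone user-space model of the suspected contention pattern
 * (NOT Lustre code): one spinlock guarding a shared page index, scanned
 * in batches by one walker thread per CPU, loosely mirroring the
 * coh_page_guard/coh_tree usage seen in the cl_page_gang_lookup() /
 * cl_lock_page_out() stacks above.
 * Build: gcc -O2 -pthread contention_model.c -o contention_model
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NPAGES (1 << 20)        /* entries tracked per "object" (model only) */
#define BATCH  64               /* entries scanned per lock acquisition      */

static pthread_spinlock_t page_guard;   /* stands in for coh_page_guard */
static int page_state[NPAGES];          /* stands in for the coh_tree   */

static void *walker(void *arg)
{
    long id = (long)arg;
    unsigned long idx = 0, touched = 0;

    /* Walk the whole index in batches, re-taking the lock each time;
     * all walkers on all CPUs serialize on this one spinlock. */
    while (idx < NPAGES) {
        pthread_spin_lock(&page_guard);
        for (int i = 0; i < BATCH && idx < NPAGES; i++, idx++)
            touched += ++page_state[idx];
        pthread_spin_unlock(&page_guard);
        sched_yield();  /* plays the role of the cond_resched() between batches */
    }
    printf("walker %ld touched %lu entries\n", id, touched);
    return NULL;
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t *tids = calloc(ncpus, sizeof(*tids));

    pthread_spin_init(&page_guard, PTHREAD_PROCESS_PRIVATE);
    for (long i = 0; i < ncpus; i++)
        pthread_create(&tids[i], NULL, walker, (void *)i);
    for (long i = 0; i < ncpus; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    return 0;
}
=========================================================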
To reproduce this issue, CEA is using one of their proprietary benchmarks. Basically, on a single node there are as many processes as cores on the machine, each process mapping a lot of memory. The processes write this memory to Lustre, preferably to the same OST, to reproduce the problem (a sketch of this kind of load is given after the eviction commands below). CEA noticed that the OSC inactivation process during client eviction can be involved while reproducing the
issue. So part of the reproducer can be to manually force client eviction on the OSS side by using either:
lctl set_param obdfilter.<fs_name>-<OST_name>.evict_client=nid:<ipoib_clnt_addr>@<portal_name>
or:
echo 'nid:<ipoib_clnt_addr>@<portal_name>' > /proc/fs/lustre/obdfilter/<fs_name>/<OST_name>/evict_client
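For reference, a minimal stand-in for the write load described above (this is only a sketch, not CEA's benchmark; the /mnt/lustre/repro path, the file names and the 1 GiB size per process are made up) could look like:
=========================================================
/*
 * Sketch of the kind of load described above (not CEA's benchmark):
 * one writer process per core, each mapping a large anonymous region
 * and writing it to a file under an assumed Lustre mount point.
 * The /mnt/lustre/repro directory is hypothetical; to push all writers
 * to the same OST it could first be striped with e.g.
 * "lfs setstripe -c 1 -i 0 /mnt/lustre/repro".
 * Build: gcc -O2 repro_load.c -o repro_load
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_SIZE (1UL << 30)    /* 1 GiB mapped and written per process */

static void writer(int id)
{
    char path[128];
    snprintf(path, sizeof(path), "/mnt/lustre/repro/file.%d", id);

    char *buf = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); _exit(1); }
    memset(buf, id, MAP_SIZE);          /* touch every mapped page */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror(path); _exit(1); }

    /* Write the whole mapping out; many such writers on one client
     * generate concurrent page-cache activity through the same OSC
     * when all the files land on one OST. */
    for (size_t off = 0; off < MAP_SIZE; ) {
        ssize_t n = write(fd, buf + off, MAP_SIZE - off);
        if (n <= 0) { perror("write"); _exit(1); }
        off += n;
    }
    close(fd);
    munmap(buf, MAP_SIZE);
    _exit(0);
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    for (long i = 0; i < ncpus; i++)
        if (fork() == 0)
            writer((int)i);
    while (wait(NULL) > 0)      /* wait for all writer children */
        ;
    return 0;
}
=========================================================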
In order to cope with production imperatives, CEA has set up a workaround that consists in freeing the page cache with "echo 1 > /proc/sys/vm/drop_caches". By doing so, clients are able to reconnect. On the contrary, and it is interesting to note, clearing the LRU with "lctl set_param ldlm.namespaces.*.lru_size=clear" will hang the node!
Does this issue sound familiar?
Of course CEA really needs a fix for this, as soon as possible.
Sebastien.
Bull/CEA confirm that this issue was resolved by the LU-394 patch.