[LU-3471] "client_obd_lock_t cl_loi_list_lock" in struct client_obd should not be a spin lock (b1_8) Created: 14/Jun/13  Updated: 25/Apr/14  Resolved: 25/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Vladimir Saveliev Assignee: Keith Mannthey (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 8699

 Description   

client_obd_lock_t cl_loi_list_lock in struct client_obd protects async page operations, which are not guaranteed to be non-blocking even on Linux. Therefore a
spinlock (used for the Linux implementation of cl_loi_list_lock) is not appropriate.

For example, in the call chain:
osc_check_rpcs() -> osc_send_oap_rpc() -> ptlrpcd_add_req():

osc_check_rpcs() is called with the cli->cl_loi_list_lock spinlock held, and
ptlrpcd_add_req() may wait with a timeout.

In http://jira-nss.xy01.xyratex.com:8080/browse/MRP-1053 a hang was discovered, caused by a process scheduling while holding cl_loi_list_lock.
The corresponding core dump has already been removed, and I do not remember exactly how it hung that time.

In newer kernels (for example, 3.0.42), the following call chain may block:
osc_check_rpcs() -> osc_send_oap_rpc() -> ll_ap_make_ready() -> clear_page_dirty_for_io() -> page_mkclean() -> page_mkclean_file()

page_mkclean_file() takes a mutex:
mutex_lock(&mapping->i_mmap_mutex);

See http://jira-nss.xy01.xyratex.com:8080/browse/LELUS-116 for more details
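
The hazard in both call chains can be illustrated with a minimal user-space sketch. It is only a model: every `model_*` name below is hypothetical and none of it is Lustre or Linux kernel code. It counts how often a call that may block is reached while a spinlock is held, and contrasts that with a lock type that permits sleeping, which is what the ticket title asks for.

```c
#include <assert.h>

/* Hypothetical user-space model; these are NOT Lustre or Linux kernel APIs. */
int model_spin_depth;        /* spinlocks currently held by the caller */
int model_sleep_violations;  /* blocking calls made with a spinlock held */

void model_reset(void)
{
    model_spin_depth = 0;
    model_sleep_violations = 0;
}

/* A spinlock puts the holder in a context where it must not sleep. */
void model_spin_lock(void)
{
    model_spin_depth++;
}

void model_spin_unlock(void)
{
    assert(model_spin_depth > 0);
    model_spin_depth--;
}

/* A sleeping lock (mutex-like) leaves the holder free to block. */
void model_sleeping_lock(void)
{
}

void model_sleeping_unlock(void)
{
}

/*
 * Stand-in for a call that may block, such as ptlrpcd_add_req() or
 * mutex_lock() in the chains above. It records a violation when it is
 * reached while a spinlock is held.
 */
void model_may_block(void)
{
    if (model_spin_depth > 0)
        model_sleep_violations++;
}

/* Shape of the reported chains: lock held across a call that may block. */
void model_chain_with_spinlock(void)
{
    model_spin_lock();
    model_may_block();
    model_spin_unlock();
}

/* Same chain under a lock type that allows sleeping. */
void model_chain_with_sleeping_lock(void)
{
    model_sleeping_lock();
    model_may_block();
    model_sleeping_unlock();
}
```

In this model, model_chain_with_spinlock() records one violation and model_chain_with_sleeping_lock() records none. The sketch does not model contention or real scheduling; it only shows why the lock type matters once a callee may block.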



 Comments   
Comment by Vladimir Saveliev [ 14/Jun/13 ]

Please take a look at the patch:
http://review.whamcloud.com/6646

Comment by Keith Mannthey (Inactive) [ 14/Jun/13 ]

When I click on http://jira-nss.xy01.xyratex.com:8080/browse/LELUS-116 I get a Server not found error. I get the same thing for the MRP link.

Comment by Keith Mannthey (Inactive) [ 14/Jun/13 ]

This seems to be a 1.8 functional improvement. I don't know if many improvements like this have been taken into the tree in a while.

What version of the 1.8 tree did the initial problem hit with?

Is this issue still relevant for Master?

Comment by Vladimir Saveliev [ 15/Jun/13 ]

> When I click on http://jira-nss.xy01.xyratex.com:8080/browse/LELUS-116 I get a Server not found error. I get the same thing for the MRP link

Ok

> This seems to be a 1.8 functional improvement.

This is a fix for reproducible lockups.

> What version of the 1.8 tree did the initial problem hit with?

LELUS-116 reports the failure on 2.2.
MRP-1053 is about this bug hit on Oracle's 1.8.

> Is this issue still relevant for Master?

2.4 does not have this problem since the new IO engine (https://jira.hpdd.intel.com/browse/LU-1030) was introduced.

Comment by Keith Mannthey (Inactive) [ 17/Jun/13 ]

Can you provide some more details about the tickets you reference?

Can you confirm the code version in MRP-1053?

How can the lockup be reproduced? Is there a test for this issue?

Comment by Vladimir Saveliev [ 04/Jul/13 ]

> Can you provide some more details about the tickets you reference?

These tickets are about lockups, discovered with the help of crash(8) and core dumps, where a process blocks and gets rescheduled while holding a spinlock.

> Can you confirm the code version in MRP-1053?

MRP-1053 is about Oracle's b1_8.

> How can the lockup be reproduced? Is there a test for this issue?

We reproduced the lockup with Oracle's b1_8 in order to recall the exact traces that deadlocked.

It turned out that the patch from https://projectlava.xyratex.com/show_bug.cgi?id=21812 is responsible for that particular lockup.
As long as Intel's 1.8 does not include that patch, you will not be able to reproduce it.

You can probably close the bug.

But please review the example in the description. It describes a call chain where a process may reschedule while holding the spinlock.

Also, scheduling under the spinlock may become possible due to changes coming to Linux.
For example, in linux-3.0.42, page_mkclean_file() (which is called with the spinlock held) may block.

Generated at Sat Feb 10 01:34:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.