[LU-3471] "client_obd_lock_t cl_loi_list_lock" in struct client_obd should not be a spin lock (b1_8) Created: 14/Jun/13 Updated: 25/Apr/14 Resolved: 25/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Vladimir Saveliev | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | patch | ||
| Severity: | 3 |
| Rank (Obsolete): | 8699 |
| Description |
|
client_obd_lock_t cl_loi_list_lock of struct client_obd is used to protect async page operations, which are not guaranteed to be non-blocking even on Linux; therefore it should not be a spin lock.
For example, in the call chain:
osc_check_rpc() is called with the cli->cl_loi_list_lock spinlock held and
In http://jira-nss.xy01.xyratex.com:8080/browse/MRP-1053 a hang was discovered, caused by scheduling from a process holding the cl_loi_list_lock.
In new kernels (3.0.42), the following call chain may block:
page_mkclean_file() locks mutex:
See http://jira-nss.xy01.xyratex.com:8080/browse/LELUS-116 for more details |
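The problem pattern described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every name in it (model_ctx, model_spin_lock, model_may_block, and so on) is hypothetical and is not a Lustre or Linux kernel API, and it does not reproduce the actual call chains or the patch attached to this ticket. It only counts how often an operation that may block is reached while a spinlock is held, and compares that with the same work done under a lock that is allowed to sleep.

```c
#include <assert.h>

/* Hypothetical model; none of these names exist in Lustre or the kernel. */
struct model_ctx {
    int spin_depth;       /* spinlocks currently held by this context */
    int sleep_violations; /* blocking calls made while spin_depth > 0 */
};

static void model_spin_lock(struct model_ctx *c)
{
    c->spin_depth++;
}

static void model_spin_unlock(struct model_ctx *c)
{
    assert(c->spin_depth > 0); /* unlock must pair with a lock */
    c->spin_depth--;
}

/* A sleeping lock (mutex-like) does not put the holder in atomic context. */
static void model_mutex_lock(struct model_ctx *c)   { (void)c; }
static void model_mutex_unlock(struct model_ctx *c) { (void)c; }

/* Stand-in for any operation that may block, e.g. taking a mutex. */
static void model_may_block(struct model_ctx *c)
{
    if (c->spin_depth > 0)
        c->sleep_violations++; /* would be "scheduling while atomic" */
}

/* Pattern from the description: blocking work under a spinlock. */
static int list_work_under_spinlock(void)
{
    struct model_ctx c = { 0, 0 };

    model_spin_lock(&c);
    model_may_block(&c);
    model_spin_unlock(&c);
    return c.sleep_violations;
}

/* Same work under a lock that is allowed to sleep. */
static int list_work_under_sleeping_lock(void)
{
    struct model_ctx c = { 0, 0 };

    model_mutex_lock(&c);
    model_may_block(&c);
    model_mutex_unlock(&c);
    return c.sleep_violations;
}
```

In this model, list_work_under_spinlock() records one violation and list_work_under_sleeping_lock() records none, which is the distinction the ticket title draws between a spin lock and a lock that may sleep.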
| Comments |
| Comment by Vladimir Saveliev [ 14/Jun/13 ] |
|
Please take a look at the patch: |
| Comment by Keith Mannthey (Inactive) [ 14/Jun/13 ] |
|
When I click on http://jira-nss.xy01.xyratex.com:8080/browse/LELUS-116 I get a Server not found error. I get the same thing for the MRP link. |
| Comment by Keith Mannthey (Inactive) [ 14/Jun/13 ] |
|
This seems to be a 1.8 functional improvement. I don't know if many improvements like this have been taken into the tree in a while. What version of the 1.8 tree did the initial problem hit with? Is this issue still relevant for Master? |
| Comment by Vladimir Saveliev [ 15/Jun/13 ] |
|
> When I click on http://jira-nss.xy01.xyratex.com:8080/browse/LELUS-116 I get a Server not found error. I get the same thing for the MRP link
Ok
> This seems to be a 1.8 functional improvement.
This is a fix for reproducible lockups.
> What version of the 1.8 tree did the initial problem hit with?
LELUS-116 reports the failure on 2.2.
> Is this issue still relevant for Master?
2.4 does not have this problem after new IO engine (https://jira.hpdd.intel.com/browse/LU-1030) was introduced. |
| Comment by Keith Mannthey (Inactive) [ 17/Jun/13 ] |
|
Can you provide some more details about the tickets you reference? Can you confirm the code version in MRP-1053? How can the lockup be reproduced? Is there a test for this issue? |
| Comment by Vladimir Saveliev [ 04/Jul/13 ] |
|
> Can you provide some more details about the tickets you reference?
These tickets are about lockups, discovered with the help of crash(8) and core dumps, where a process blocks and is rescheduled while holding a spinlock.
> Can you confirm the code version in MRP-1053?
MRP-1053 is about Oracle's b1_8.
> How can the lockup be reproduced? Is there a test for this issue?
We reproduced the lockup with Oracle's b1_8 in order to recall the exact traces which deadlocked. It turned out that the patch from https://projectlava.xyratex.com/show_bug.cgi?id=21812 is responsible for that particular lockup. You can probably close the bug. But please review the example in the description. It describes a call chain where a process may reschedule while holding the spinlock. Also, scheduling in that path may become possible due to upcoming changes in Linux. |