[LU-3581] Recurrence of LU-3020: Lustre returns EINTR during writes when SA_RESTART is set Created: 12/Jul/13 Updated: 23/Dec/13 Resolved: 03/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Peter Jones |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4, yuc2 | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9074 | ||||||||
| Description |
|
This is the same issue as described in As I did not hit this issue while testing the fix for This issue is easy to hit without debugging enabled, and very hard to hit with debugging enabled. Here is the relevant portion of the trace logs: This is hit during writes, specifically during ll_commit_write. I will be attaching the full log. This is happening due to a signal arriving during the following l_wait_event call, in osc_enter_cache: rc = l_wait_event(ocw.ocw_waitq, ocw_granted(cli, &ocw), &lwi); client_obd_list_lock(&cli->cl_loi_list_lock); /* l_wait_event is interrupted by signal */ — I will attach full trace logs. Search for -4 in the log to find the EINTR. The question is: Is it safe to return ERESTARTSYS here, instead of EINTR? More generally, Lustre's default behavior in l_wait_event is to return EINTR. Should we consider changing this to ERESTARTSYS and making EINTR the exceptional case? (This may be a terrible idea - I'm just floating it out of curiositiy.) |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 15/Jul/13 ] |
|
Attached is the reproducer for this. This is a slightly improved version of the reproducer for The shell script just helps to attempt to gather debug information. With debug set to 0, the reproducer hits the problem most of the time. With debugging set to +trace, it is very difficult to hit and may need to be looped for an hour or more. |
| Comment by Peter Jones [ 15/Jul/13 ] |
|
Niu Could you please look into this one? Thanks Peter |
| Comment by Niu Yawei (Inactive) [ 16/Jul/13 ] |
|
Yes, seems we'd return -ERESTARTSYS in such case too (interrupted by signal in osc_enter_cache()). I think we should return -ERESTARTSYS only in read/write path but not in l_wait_event(). Patrick, would you post a patch for this? Thanks. |
| Comment by Patrick Farrell (Inactive) [ 16/Jul/13 ] |
|
Niu, I'll put the patch up shortly. You said in the read/write path, so would you like it changed in osc_enter_cache? Or somewhere else in the path? This is what I was thinking: /* l_wait_event is interrupted by signal */ if (rc < 0) { /* Ensures restartability - LU-3581 */ if(rc == -EINTR) rc = -ERESTARTSYS; cfs_list_del_init(&ocw.ocw_entry); GOTO(out, rc); } |
| Comment by Niu Yawei (Inactive) [ 16/Jul/13 ] |
|
Patrick, yes, that's exactly what I meant. |
| Comment by Patrick Farrell (Inactive) [ 16/Jul/13 ] |
|
I've put a patch with that in it up here: I'll be testing whether it fixes my reproducer soon and will report back. |
| Comment by Patrick Farrell (Inactive) [ 16/Jul/13 ] |
|
Testing with the reproducer indicates this change resolves the issue. Thank you, niu. |
| Comment by Patrick Farrell (Inactive) [ 24/Jul/13 ] |
|
Niu, I've added you as a reviewer; could you also re-start the testing process? The test failure doesn't appear to be related to the patch. Thanks, |
| Comment by Peter Jones [ 03/Sep/13 ] |
|
Landed for 2.5 |