[LU-9952] soft lockup in osd_inode_iteration() for lustre 2.8.1 Created: 06/Sep/17 Updated: 05/Jun/18 Resolved: 05/Jun/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre ldiskfs server back end running version 2.8.1 with a few additional patches. The OS is RHEL6.9 |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 2 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
One of our production file systems running Lustre 2.8.1 experienced a soft lockup very similar to
| Comments |
| Comment by Peter Jones [ 07/Sep/17 ] |
|
Fan Yong, can you please advise on this one? Thanks, Peter |
| Comment by nasf (Inactive) [ 08/Sep/17 ] |
|
The known patches on master related to the OI scrub soft lockup have been backported as follows: https://review.whamcloud.com/28903 |
| Comment by James A Simmons [ 18/Sep/17 ] |
|
We are in the process of testing these patches. I attempted to recreate the problem with "lctl set_param fail_loc=0x1504" but that didn't work. What would you recommend to recreate this problem on a 2.8 system? Note we removed the offending files to make our production file system usable again. |
| Comment by nasf (Inactive) [ 19/Sep/17 ] |
|
I think we need a new fail_loc to simulate the osd_inode_iteration() trouble. For example, we could inject a new failure stub in osd_iit_next() to simulate various bitmap layout cases. |
| Comment by James A Simmons [ 19/Sep/17 ] |
|
Could you create a test condition before the 30th of September? |
| Comment by nasf (Inactive) [ 21/Sep/17 ] |
|
Let's check whether this one works: https://review.whamcloud.com/#/c/29133/ |
| Comment by James A Simmons [ 26/Sep/17 ] |
|
We wouldn't be running the test framework on our production system. It looks like I just need to create a bunch of files on the file system. lctl set_param -n osd*.MDT.force_sync=1 While you check status: Does this look right? What values do I use to reset it back to normal working conditions? |
| Comment by James A Simmons [ 27/Sep/17 ] |
|
We have been trying your reproducer by itself and we see it get stuck, but no soft lockups. Why are no soft lockups being reported? |
| Comment by nasf (Inactive) [ 29/Sep/17 ] |
|
"fail_loc=0x190" slows down the OI scrub scanning, which gives us time to inject other failures before the OI scrub completes. |
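The sequence described above could be scripted roughly as follows. This is only a sketch: the function names and the overridable `LCTL` variable are hypothetical conveniences, and it assumes that writing 0 to `fail_loc` disarms the injected fault, as is usual for Lustre fault injection. Nothing runs until a function is called.

```shell
#!/bin/sh
# Hypothetical helpers around the fail_loc value quoted in the comment above.
# LCTL can be overridden (for example LCTL=echo) to dry-run on a host
# that does not have Lustre installed.
LCTL="${LCTL:-lctl}"

slow_scrub() {
    # Arm the delay fault so the OI scrub does not complete before
    # other failures can be injected.
    $LCTL set_param fail_loc=0x190
}

clear_fail_loc() {
    # Assumption: setting fail_loc back to 0 returns the server to
    # normal behavior.
    $LCTL set_param fail_loc=0
}
```

On a test server this would be used as `slow_scrub`, then the other failures are injected, then `clear_fail_loc`. It should not be run on a production file system.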
| Comment by James A Simmons [ 29/Sep/17 ] |
|
We have been trying patch 28903 by itself and have not seen the soft lockup. |
| Comment by nasf (Inactive) [ 09/Oct/17 ] |
You mean you can reproduce the soft lockup every time without any of the above patches, right? And the soft lockup disappears if only 28903 is applied, right? |
| Comment by nasf (Inactive) [ 24/Nov/17 ] |
|
Any further feedback? Thanks! |
| Comment by nasf (Inactive) [ 05/Jun/18 ] |
|
simmonsja, |
| Comment by James A Simmons [ 05/Jun/18 ] |
|
You can close this. With various patches applied we haven't seen problems in some time. Thanks. |
| Comment by nasf (Inactive) [ 05/Jun/18 ] |
|
The issue has been resolved by backporting the following patches: