[LU-9952] soft lockup in osd_inode_iteration() for lustre 2.8.1 Created: 06/Sep/17  Updated: 05/Jun/18  Resolved: 05/Jun/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre ldiskfs server back end running version 2.8.1 with a few additional patches. The OS is RHEL 6.9.


Attachments: Text File vmcore-dmesg-f1.txt    
Issue Links:
Related
is related to LU-9040 Soft lockup on CPU during lfsck Resolved
is related to LU-9488 soft lockup in osd_inode_iteration() Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

One of our production file systems running Lustre 2.8.1 experienced a soft lockup very similar to LU-9488. I attempted to back-port the patch, but too many changes have happened between 2.8.1 and Lustre 2.10.0, and I am unsure I would get the port right. I have attached the backtrace.



 Comments   
Comment by Peter Jones [ 07/Sep/17 ]

Fan Yong

Can you please advise on this one?

Thanks

Peter

Comment by nasf (Inactive) [ 08/Sep/17 ]

The known patches on master related to the OI scrub soft lockup have been back-ported as follows:

https://review.whamcloud.com/28903
https://review.whamcloud.com/28904
https://review.whamcloud.com/28905
https://review.whamcloud.com/28906

Comment by James A Simmons [ 18/Sep/17 ]

We are in the process of testing these patches. I attempted to recreate the problem with "lctl set_param fail_loc=0x1504", but that didn't work. What would you recommend to recreate this problem on a 2.8 system? Note that we removed the offending files to make our production file system usable again.

Comment by nasf (Inactive) [ 19/Sep/17 ]

I think we need a new fail_loc to simulate trouble in osd_inode_iteration(). For example, we could inject a new failure stub in osd_iit_next() to simulate various bitmap layout cases.
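
For illustration, here is a minimal userspace sketch of that injection idea. The kernel side would go through the OBD_FAIL_CHECK() machinery; the stub name, its placement, and the iterator structure below are assumptions made for this sketch, not actual Lustre code. The fail_loc value 0x198 matches the one used later in this ticket.

/*
 * Sketch: a failure stub inside an inode-table iterator that, when
 * armed, prevents the scan position from advancing, so the caller
 * keeps rescanning the same bits (the soft-lockup scenario).
 */
#include <stdio.h>

#define FAIL_OSD_IIT_REPEAT 0x198       /* assumed fail_loc value */

static unsigned long fail_loc;  /* the kernel sets this via "lctl set_param fail_loc=..." */

static int fail_check(unsigned long id)
{
        return fail_loc == id;
}

/* Advance to the next in-use bit of the inode bitmap; 1 == done. */
static int iit_next(unsigned int *pos, unsigned int nbits)
{
        if (fail_check(FAIL_OSD_IIT_REPEAT))
                return 0;               /* injected failure: do not advance */
        *pos += 1;
        return *pos < nbits ? 0 : 1;
}

int main(void)
{
        unsigned int pos = 0;
        int i;

        fail_loc = FAIL_OSD_IIT_REPEAT;
        /* With the failure armed, pos never advances: an unpatched
         * scrub would spin here forever. */
        for (i = 0; i < 5; i++) {
                iit_next(&pos, 64);
                printf("pass %d: pos=%u\n", i, pos);
        }
        return 0;
}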

Comment by James A Simmons [ 19/Sep/17 ]

Could you create a test condition before the 30th of September?

Comment by nasf (Inactive) [ 21/Sep/17 ]

Let's check whether this one works or not: https://review.whamcloud.com/#/c/29133/

Comment by James A Simmons [ 26/Sep/17 ]

We wouldn't be running the test framework on our production system. It looks like I just need to create a bunch of files on the file system.

lctl set_param -n osd*.*MDT*.force_sync=1
lctl set_param fail_val=1 fail_loc=0x190
lctl lfsck_start -M lustre-MDT0000
lctl set_param fail_val=0 fail_loc=0x198

While you check status:
lctl get_param -n osd-ldiskfs.lustre-MDT0000.oi_scrub | grep status

Does this look right? What values do I use to reset it back to normal working conditions?

Comment by James A Simmons [ 27/Sep/17 ]

We have been trying your reproducer by itself, and we see it get stuck, but with no soft lockups. Why are no soft lockups being reported?

Comment by nasf (Inactive) [ 29/Sep/17 ]

"fail_loc=0x190" will slow down the OI scrub scanning, then we can have time to inject other failures before the OI scrub complete.
"fail_loc=0x198" will make the OI scrub iteration repeatedly scan the same bits for inode table. If without our former patches (28903/4/5/6), then the OI scrub will fall into soft lockup. But because we have such patches, then OI scrub can detect such dead repeat then move forward. So no soft lockup is the expected behavior.

Comment by James A Simmons [ 29/Sep/17 ]

We have been trying patch 28903 by itself and have not seen the soft lockup. BTW, do we need to inject another failure?

Comment by nasf (Inactive) [ 09/Oct/17 ]

We have been trying patch 28903 by itself and have not seen the soft lockup. BTW, do we need to inject another failure?

You mean you can reproduce the soft lockup every time without any of the above patches, right? And the soft lockup disappears if only 28903 is applied, right?

Comment by nasf (Inactive) [ 24/Nov/17 ]

Any further feedback?

Thanks!

Comment by nasf (Inactive) [ 05/Jun/18 ]

simmonsja,
Any further feedback for this ticket?

Comment by James A Simmons [ 05/Jun/18 ]

You can close this. With the various patches applied, we haven't seen problems in some time. Thanks.

Comment by nasf (Inactive) [ 05/Jun/18 ]

The issue has been resolved by back-porting the following patches:
https://review.whamcloud.com/28903
https://review.whamcloud.com/28904
https://review.whamcloud.com/28905
https://review.whamcloud.com/28906
