[LU-9952] soft lockup in osd_inode_iteration() for lustre 2.8.1 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.8.0
Labels:
None
Environment:
Lustre ldiskfs server back end running version 2.8.1 with a few additional patches. The OS is RHEL6.9

Severity:
2
Rank (Obsolete):
9223372036854775807

Description

One of our production file systems running Lustre 2.8.1 experienced a soft lockup very similar to LU-9488. I attempted to backport the patch, but far too many changes have happened between 2.8.1 and Lustre 2.10.0, and I am unsure whether I would get the port right. I have attached the backtrace.

Attachments

vmcore-dmesg-f1.txt
512 kB
06/Sep/17 6:56 PM

Issue Links

is related to

LU-9040 Soft lockup on CPU during lfsck

Resolved

LU-9488 soft lockup in osd_inode_iteration()

Resolved

Activity

nasf (Inactive) added a comment - 29/Sep/17 3:13 AM

"fail_loc=0x190" slows down the OI scrub scan, so that there is time to inject other failures before the OI scrub completes.
"fail_loc=0x198" makes the OI scrub iteration repeatedly scan the same bits of the inode table. Without our earlier patches (28903/4/5/6), the OI scrub would fall into a soft lockup. With those patches applied, the OI scrub detects the dead repeat and moves forward, so no soft lockup is the expected behavior.

James A Simmons added a comment - 27/Sep/17 4:10 PM

We have been trying your reproducer by itself and we see it get stuck, but no soft lockups. Why are no soft lockups being reported?
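
To check whether the kernel actually reported a soft lockup during a reproducer run, the kernel log can be searched for the watchdog message. The sketch below is only illustrative: the helper name `count_soft_lockups` is hypothetical, and it assumes the kernel's usual "BUG: soft lockup" wording.

```shell
# Count soft-lockup reports in kernel log text supplied on stdin.
# Hypothetical helper; assumes the usual "BUG: soft lockup" wording.
count_soft_lockups() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so mask the exit status to keep callers using "set -e" happy.
    grep -c 'BUG: soft lockup' || true
}

# Usage on the server (assumption: dmesg is readable by the caller):
#   dmesg | count_soft_lockups
```

A count of 0 after the scrub has moved past the repeated region would be consistent with a patched server.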

James A Simmons added a comment - 26/Sep/17 7:00 PM

We wouldn't be running the test framework on our production system. It looks like I just need to create a bunch of files on the file system.

lctl set_param -n osd*.MDT.force_sync=1
lctl set_param fail_val=1 fail_loc=0x190
lctl lfsck_start -M lustre-MDT0000
lctl set_param fail_val=0 fail_loc=0x198

While you check status:
lctl get_param -n osd-ldiskfs.lustre-MDT000.oi_scrub | grep status

Does this look right? What values do I use to reset it back to normal working conditions?
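
On the reset question, the usual way to clear an injected failure is to set both `fail_loc` and `fail_val` back to 0. A minimal sketch follows. The helper names and the `LCTL` override variable are hypothetical, added only so the commands can be dry-run away from a Lustre server, and the `status: <state>` line format is assumed from the grep above.

```shell
# LCTL can be overridden (e.g. LCTL=echo) to dry-run without Lustre.
LCTL=${LCTL:-lctl}

# Clear any injected failure: a fail_loc of 0 disables failure injection.
reset_fail_loc() {
    $LCTL set_param fail_loc=0 fail_val=0
}

# Extract the state from oi_scrub output on stdin
# (assumed format: a line of the form "status: <state>").
scrub_status() {
    awk '$1 == "status:" { print $2; exit }'
}

# Usage on the MDS (the device name is site-specific, not filled in here):
#   lctl get_param -n osd-ldiskfs.<device>.oi_scrub | scrub_status
#   reset_fail_loc
```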

nasf (Inactive) added a comment - 21/Sep/17 10:11 AM

Let's check whether this one, https://review.whamcloud.com/#/c/29133/, works or not.

James A Simmons added a comment - 19/Sep/17 4:47 PM

Could you create a test condition before the 30th of September?

nasf (Inactive) added a comment - 19/Sep/17 1:17 AM

I think that we need a new fail_loc to simulate the osd_inode_iteration() trouble. For example, we could inject a new failure stub into osd_iit_next() to simulate various bitmap layout cases.

James A Simmons added a comment - 18/Sep/17 3:49 PM - edited

We are in the process of testing these patches. I attempted to recreate the problem with "lctl set_param fail_loc=0x1504" but that didn't work. What would you recommend to recreate this problem on a 2.8 system? Note we removed the offending files to make our production file system usable again.

nasf (Inactive) added a comment - 08/Sep/17 4:31 AM

The known patches on master that are related to the OI scrub soft lockup have been backported as follows:

https://review.whamcloud.com/28903
https://review.whamcloud.com/28904
https://review.whamcloud.com/28905
https://review.whamcloud.com/28906

Peter Jones added a comment - 07/Sep/17 5:21 PM

Fan Yong

Can you please advise on this one?

Thanks

Peter

People

Assignee: nasf (Inactive)

Reporter: James A Simmons

Votes: 0

Watchers: 4

Dates

Created: 06/Sep/17 6:56 PM

Updated: 05/Jun/18 4:42 PM

Resolved: 05/Jun/18 4:42 PM