soft lockup in osd_inode_iteration() for lustre 2.8.1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.8.0
    • Environment: Lustre ldiskfs server back end running version 2.8.1 with a few additional patches. The OS is RHEL 6.9.
    • Severity: 2
    • 9223372036854775807

    Description

      One of our production file systems running Lustre 2.8.1 experienced a soft lockup very similar to LU-9488. I attempted to back-port the patch, but too many changes have happened between 2.8.1 and Lustre 2.10.0, and I am unsure whether I would get the port right. I have attached the backtrace.

    Attachments

    Issue Links

    Activity

            [LU-9952] soft lockup in osd_inode_iteration() for lustre 2.8.1
            yong.fan nasf (Inactive) added a comment - The issue has been resolved by back-porting the following patches: https://review.whamcloud.com/28903 https://review.whamcloud.com/28904 https://review.whamcloud.com/28905 https://review.whamcloud.com/28906

            simmonsja James A Simmons added a comment - You can close this. With various patches applied we haven't seen problems in some time. Thanks.

            yong.fan nasf (Inactive) added a comment - simmonsja, any further feedback for this ticket?

            yong.fan nasf (Inactive) added a comment - Any further feedback? Thanks!

            yong.fan nasf (Inactive) added a comment - You mean you can reproduce the soft lockup every time without any of the above patches, right? And the soft lockup disappears if only 28903 is applied, right?
            simmonsja James A Simmons added a comment (edited) - We have been trying patch 28903 by itself and have not seen the soft lockup. BTW, do we need to inject another failure?

            yong.fan nasf (Inactive) added a comment - "fail_loc=0x190" slows down the OI scrub scanning, so we have time to inject other failures before the OI scrub completes.
            "fail_loc=0x198" makes the OI scrub iteration repeatedly scan the same bits of the inode table. Without our earlier patches (28903/4/5/6), the OI scrub would fall into a soft lockup. But because we have those patches, the OI scrub can detect such a dead repeat and move forward, so no soft lockup is the expected behavior.
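
            As a sanity check of the expected behavior described above, the OI scrub status can be watched while the failures are injected, to confirm that the scrub still reaches completion instead of spinning. The following is only a sketch: the target name lustre-MDT0000 and the "completed" status value are assumptions (taken from the commands further down this thread and from typical oi_scrub output), not from this comment.

                # Poll the OI scrub status every few seconds; with patches 28903-28906
                # applied it should eventually report "completed" instead of looping forever.
                while true; do
                    lctl get_param -n osd-ldiskfs.lustre-MDT0000.oi_scrub | grep -w status
                    sleep 5
                done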

            simmonsja James A Simmons added a comment - We have been trying your reproducer by itself and we see it get stuck, but no soft lockups. Why are no soft lockups being reported?

            simmonsja James A Simmons added a comment - We wouldn't be running the test framework on our production system. It looks like I just need to create a bunch of files on the file system.

            lctl set_param -n osd*.*MDT*.force_sync=1
            lctl set_param fail_val=1 fail_loc=0x190
            lctl lfsck_start -M lustre-MDT0000
            lctl set_param fail_val=0 fail_loc=0x198

            While checking status:
            lctl get_param -n osd-ldiskfs.lustre-MDT0000.oi_scrub | grep status

            Does this look right? What values do I use to reset it back to normal working conditions?
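
            As for resetting back to normal working conditions: a minimal sketch, assuming the standard Lustre convention that setting fail_loc and fail_val back to 0 disables fault injection (this is an assumption, not a confirmation from the thread):

                # Clear the injected failure points; fail_loc=0 means no fault injection.
                lctl set_param fail_val=0 fail_loc=0
                # Optionally confirm the OI scrub has finished before treating the MDT as back to normal.
                lctl get_param -n osd-ldiskfs.lustre-MDT0000.oi_scrub | grep status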

            yong.fan nasf (Inactive) added a comment - Let's check whether this one https://review.whamcloud.com/#/c/29133/ works or not.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 4