
soft lockup in osd_inode_iteration() for lustre 2.8.1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.8.0
    • Environment: Lustre ldiskfs server back end running version 2.8.1 with a few additional patches. The OS is RHEL 6.9.

    Description

      One of our production file systems running Lustre 2.8.1 experienced a soft lockup very similar to LU-9488. I attempted to back port the patch, but far too many changes have happened between 2.8.1 and Lustre 2.10.0, and I am unsure I would get the port right. I have attached the backtrace.

      Attachments

        Issue Links

          Activity

            simmonsja James A Simmons added a comment - - edited

            We have been trying patch 28903 by itself and have not seen the soft lockup. BTW, do we need to inject another failure?


            "fail_loc=0x190" will slow down the OI scrub scanning, then we can have time to inject other failures before the OI scrub complete.
            "fail_loc=0x198" will make the OI scrub iteration repeatedly scan the same bits for inode table. If without our former patches (28903/4/5/6), then the OI scrub will fall into soft lockup. But because we have such patches, then OI scrub can detect such dead repeat then move forward. So no soft lockup is the expected behavior.


            simmonsja James A Simmons added a comment -

            We have been trying your reproducer by itself and we see it get stuck, but no soft lockups. Why are no soft lockups being reported?


            simmonsja James A Simmons added a comment -

            We wouldn't be running the test framework on our production system. It looks like I just need to create a bunch of files on the file system.

            lctl set_param -n osd*.*MDT*.force_sync=1
            lctl set_param fail_val=1 fail_loc=0x190
            lctl lfsck_start -M lustre-MDT0000
            lctl set_param fail_val=0 fail_loc=0x198

            While you check status:
            lctl get_param -n osd-ldiskfs.lustre-MDT0000.oi_scrub | grep status

            Does this look right? What values do I use to reset it back to normal working conditions?
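            A minimal sketch of clearing the injection afterwards, assuming only the standard fail_loc mechanism is in play:

            lctl set_param fail_val=0 fail_loc=0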


            yong.fan nasf (Inactive) added a comment -

            Let's check whether this one https://review.whamcloud.com/#/c/29133/ works or not.


            simmonsja James A Simmons added a comment -

            Could you create a test condition before the 30th of September?


            yong.fan nasf (Inactive) added a comment -

            I think that we need some new fail_loc to simulate osd_inode_iteration() trouble. For example, inject a new failure stub in osd_iit_next() to simulate various bitmap layout cases.

            simmonsja James A Simmons added a comment - - edited

            We are in the process of testing these patches. I attempted to recreate the problem with "lctl set_param fail_loc=0x1504" but that didn't work. What would you recommend to recreate this problem on a 2.8 system? Note we removed the offending files to make our production file system usable again.


            yong.fan nasf (Inactive) added a comment -

            The known patches on master that are related to the OI scrub soft lockup have been back ported as follows (one way to fetch them locally is sketched after the list):

            https://review.whamcloud.com/28903
            https://review.whamcloud.com/28904
            https://review.whamcloud.com/28905
            https://review.whamcloud.com/28906
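            For local testing, a change can typically be pulled from Gerrit with a fetch of its refs/changes ref; the patch set number ("1") below is a placeholder, and the fs/lustre-release project path is assumed:

            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/03/28903/1
            git cherry-pick FETCH_HEAD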

            pjones Peter Jones added a comment -

            Fan Yong

            Can you please advise on this one?

            Thanks

            Peter


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 4
