Hardware problem resulting in bad blocks

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • None
    • Lustre 2.5.5
    • None
    • 1

    Description

      We encountered a hardware problem on the MDT storage device (DDN 7700) that resulted in bad blocks. The file system continued to operate, but yesterday it went read-only when it stumbled over a bad sector.

      We ran fsck against the file system with the most current e2fsprogs, which repaired the file system but dumped 90 objects/files into lost+found. All but 2 belonged to one user. However, one of the files/objects belongs to root, has a low inode number (#5749), and appears to be a data file.

      We are very concerned that this particular file may be relevant to Lustre and would like your guidance on what we should do. (Obviously we are able to mount the file system as ldiskfs.)

          Activity

            [LU-9068] Hardware problem resulting in bad blocks
            pjones Peter Jones added a comment -

            ok - thanks Ruth

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

            As Joe mentioned, we had already replaced the file and moved on to errors on some other log files.

            We finally removed the CATALOGS file and rebooted. This got us past the LBUG and the file system is back, minus a few items lost due to hardware.

            Ok to close.

            adilger Andreas Dilger added a comment - - edited

            Sorry, I didn't see your reply until now. Applying the patch to return the error from osp_sync_thread() is the proper fix. You may be able to work around this by creating an empty O/1/105729 file on the MDT (using the decimal object ID based on the error messages), but this may also return an error if the file content is bad, rather than only when the file is missing.
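
The decimal object ID in the path above corresponds to the hexadecimal llog ID printed in the error messages (0x19d01 is 105729). A minimal sketch of that conversion follows, assuming the O/<seq>/<decimal id> layout implied by the comment; the helper name `llog_object_path` and the default sequence of 1 are illustrative assumptions, not part of any Lustre tool.

```python
def llog_object_path(hex_id: str, seq: int = 1) -> str:
    """Build the relative path O/<seq>/<decimal id> from a hex llog object ID.

    The layout and the default sequence (1) are assumptions taken from the
    path quoted in the comment above, not from Lustre documentation.
    """
    # int() with base 16 accepts the "0x" prefix used in the log messages.
    return f"O/{seq}/{int(hex_id, 16)}"


if __name__ == "__main__":
    # IDs quoted in this ticket: the failing llog, the earlier failing llog,
    # and the one found in lost+found.
    for hex_id in ("0x19d01", "0x19cee", "0x19c6b"):
        print(hex_id, "->", llog_object_path(hex_id))
```

Running it on the three IDs mentioned in this ticket shows that they map to three different objects, which is consistent with the observation that the file in lost+found is not the one named in the MDS error.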

            jamervi Joe Mervini added a comment -

            We're looking for a work around to get the file system back up again.

            jamervi Joe Mervini added a comment -

            We're running the toss version of 2.5.5.

            adilger Andreas Dilger added a comment - - edited

            The error being reported on the MDS is for a different llog file 0x19cee than the one in lost+found, which is 0x19c6b.

            What version of Lustre are you running? It looks like this LBUG (ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK )) is a duplicate of LU-6696, which was fixed in Lustre 2.9.0 (patch http://review.whamcloud.com/19856).

            jamervi Joe Mervini added a comment - - edited

            We brought down all the OSSs and OSTs and rebooted them. We then brought up the MDS and MDT, which looked happy. When I started bringing the OSTs back online, it LBUGed against another OST.

            [ 1549.368141] Lustre: gscratch-MDT0000: trigger OI scrub by RPC for [0x1:0x19d01:0x0], rc = 0 [1]
            [ 1549.463828] LustreError: 5476:0:(llog_cat.c:195:llog_cat_id2handle()) gscratch-OST0036-osc-MDT0000: error opening log id 0x19d01:1:0: rc = -115
            [ 1549.614586] LustreError: 5476:0:(llog_cat.c:586:llog_cat_process_cb()) gscratch-OST0036-osc-MDT0000: cannot find handle for llog 0x19d01:1: -115
            [ 1549.765765] LustreError: 5476:0:(osp_sync.c:872:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 19 changes, 7 in progress, 0 in flight: -115
            Jan 31 12:45:50 [ 1549.919365] LustreError: 5476:0:(osp_sync.c:872:osp_sync_thread()) LBUG
            gmds1 kernel: [ 1549.765765] LustreError: 5476:0:(osp_sync.c:872:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 19 changes, 7 in progress, 0 in flight: -115
            Jan 31 12:45:50 gmds1 kernel: [ 1549.919365] LustreError: 5476:0:(osp_sync.c:872:osp_sync_thread()) LBUG
            [ 1549.996451] Pid: 5476, comm: osp-syn-54-0
            

            By the way - this is lustre 2.5.5 not 2.7.
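
When several llog files fail in turn, it can help to pull the llog ID and return code out of each console line. The sketch below does this for the llog_cat_id2handle() message format quoted above only. The regular expression, the function name `parse_llog_error`, and the field labels `seq` and `gen` for the two numbers after the ID are assumptions for illustration. The errno name comes from a one-entry table: on Linux, errno 115 is EINPROGRESS.

```python
import re

# Pattern for the "error opening log id" message quoted in this ticket.
# Other Lustre messages use different wording and will not match.
LLOG_ERR = re.compile(
    r"error opening log id (0x[0-9a-fA-F]+):(\d+):(\d+): rc = (-?\d+)"
)

# Linux errno value seen in this ticket; deliberately not a full table.
LINUX_ERRNO_NAMES = {115: "EINPROGRESS"}


def parse_llog_error(line):
    """Return a dict describing the llog error in `line`, or None if no match."""
    m = LLOG_ERR.search(line)
    if m is None:
        return None
    hex_id = m.group(1)
    rc = int(m.group(4))
    return {
        "id_hex": hex_id,
        "id_dec": int(hex_id, 16),   # decimal form, as used in O/1/<id>
        "seq": int(m.group(2)),      # assumed meaning of the second field
        "gen": int(m.group(3)),      # assumed meaning of the third field
        "rc": rc,
        "errno_name": LINUX_ERRNO_NAMES.get(-rc, "unknown"),
    }


if __name__ == "__main__":
    sample = (
        "[ 1549.463828] LustreError: 5476:0:(llog_cat.c:195:llog_cat_id2handle()) "
        "gscratch-OST0036-osc-MDT0000: error opening log id 0x19d01:1:0: rc = -115"
    )
    print(parse_llog_error(sample))
```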

            jamervi Joe Mervini added a comment -

            Note: All the OSS have remained up with all OSTs mounted.

            People

              pjones Peter Jones
              jamervi Joe Mervini