
LU-16689: upgrade to 2.15.2 lost several top-level directories

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      After upgrading the filesystem from 2.12 to 2.15.2, several top-level directories got corrupted.

      [root@nbp11-srv1 ~]# ls -l /nobackupp11/
      ls: cannot access '/nobackupp11/ylin4': No such file or directory
      ls: cannot access '/nobackupp11/mbarad': No such file or directory
      ls: cannot access '/nobackupp11/ldgrant': No such file or directory
      ls: cannot access '/nobackupp11/kknizhni': No such file or directory
      ls: cannot access '/nobackupp11/mzhao4': No such file or directory
      ls: cannot access '/nobackupp11/afahad': No such file or directory
      ls: cannot access '/nobackupp11/jliu7': No such file or directory
      ls: cannot access '/nobackupp11/jswest': No such file or directory
      ls: cannot access '/nobackupp11/hsp': No such file or directory
      ls: cannot access '/nobackupp11/vjespos1': No such file or directory
      ls: cannot access '/nobackupp11/ssepka': No such file or directory
      ls: cannot access '/nobackupp11/cjang1': No such file or directory

       

      debugfs:  stat ylin4
      Inode: 43051102   Type: directory    Mode:  0000   Flags: 0x80000
      Generation: 503057142    Version: 0x00000000:00000000
      User:     0   Group:     0   Project:     0   Size: 4096
      File ACL: 0
      Links: 2   Blockcount: 8
      Fragment:  Address: 0    Number: 0    Size: 0
       ctime: 0x63dd8e2f:22c83c08 -- Fri Feb  3 14:43:59 2023
       atime: 0x63dd8e2f:22c83c08 -- Fri Feb  3 14:43:59 2023
       mtime: 0x63dd8e2f:22c83c08 -- Fri Feb  3 14:43:59 2023
      crtime: 0x63dd8e2f:22c83c08 -- Fri Feb  3 14:43:59 2023
      Size of extra inode fields: 32
      Extended attributes:
        lma: fid=[0x280015902:0x2:0x0] compat=0 incompat=2
      EXTENTS:
      (0):671099035

       

      Without thinking, I deleted these via ldiskfs. The data is still there; how can we recover the directory data?

      lfs quota -u ylin4  /nobackupp11
        Disk quotas for usr ylin4 (uid 11560):
             Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /nobackupp11 11337707848* 1073741824 2147483648       -  208359  500000  600000       -

       

       

      Attachments

        Issue Links

          Activity

            [LU-16689] upgrade to 2.15.2 lost several top-level directories

            I used debugfs to dump all FIDs in /REMOTE_DIR on each MDT. Then I did fid2path lookups to match the directories that were missing. I then cd'd into /fs/.lustre/fid/fidnum and moved all contents to their new location.

            The dry-run lfsck is still running and finding lots of these:
            Mar 30 12:53:39 nbp11-srv5 kernel: Lustre: nbp11-MDT0002-osd: layout LFSCK master found bad lmm_oi for [0x2400ecb98:0x8467:0x0]: rc = 56
            Mar 30 12:53:39 nbp11-srv5 kernel: Lustre: nbp11-MDT0002-osd: layout LFSCK master found bad lmm_oi for [0x2400ecb98:0x8468:0x0]: rc = 56

            These are files under the directories that got corrupted.

            mhanafi Mahmoud Hanafi added a comment
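
            As a rough sketch, the recovery flow described in the comment above would look something like the following; the device name, new directory name, and exact FID here are placeholders rather than values confirmed for this system:

            # On the MDS, list the entries (and their FIDs) under /REMOTE_DIR in the ldiskfs image:
            debugfs -c -R 'ls -l /REMOTE_DIR' /dev/mdt0_device
            # From a client, map a FID back to its pathname:
            lfs fid2path /nobackupp11 [0x280015902:0x2:0x0]
            # Reach the directory object directly by FID and move its contents to a new directory:
            mkdir /nobackupp11/ylin4.recovered
            mv /nobackupp11/.lustre/fid/0x280015902:0x2:0x0/* /nobackupp11/ylin4.recovered/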

            It may be that mounting the MDT with "-o resetoi" would have rebuilt the OI files without having to move them from lost+found, in case someone finds this ticket in the future.

            adilger Andreas Dilger added a comment
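
            A minimal sketch of what that suggestion might look like, assuming an ldiskfs-backed MDT (device and mount point below are placeholders):

            # Remount the MDT with the resetoi option to rebuild the Object Index files:
            umount /mnt/lustre/mdt0
            mount -t lustre -o resetoi /dev/mdt0_device /mnt/lustre/mdt0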

            This looks like LU-16655, which was caused by a bad code change breaking the on-disk file format for OI Scrub. If Scrub has been run on a filesystem prior to upgrade then it will incorrectly read the fields from this file. The patch https://review.whamcloud.com/50455 "LU-16655 scrub: upgrade scrub_file from 2.12 format" fixes this issue and LU-16655 describes the details (though it is too late to avoid this bug for your system).

            adilger Andreas Dilger added a comment
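
            For anyone checking whether OI Scrub has run on an MDT before attempting such an upgrade, the scrub state can be inspected on the MDS with something along these lines (target name follows this filesystem's naming; treat it as an example):

            lctl get_param osd-ldiskfs.nbp11-MDT0000.oi_scrub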
            dongyang Dongyang Li added a comment

            Hi Mahmoud, 2 questions:
            What does stat look like on nobackupp11 via debugfs?
            How did you find ylin4 in debugfs?

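
            For reference, the debugfs invocations being asked about would look roughly like this (the device name is a placeholder; on an ldiskfs MDT the visible namespace sits under /ROOT):

            debugfs -c -R 'stat /ROOT' /dev/mdt0_device
            debugfs -c -R 'stat /ROOT/ylin4' /dev/mdt0_device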

            I recovered the files.

            I found the parent FID, cd'd into /fs/.lustre/fid/fidnum, and just moved all contents to a newly created directory.

            I would still like to understand what caused the corruption.

            mhanafi Mahmoud Hanafi added a comment

            I started an lfsck dry-run.

            On MDT0 I am getting a lot of these errors, for files with hard links:

            Mar 30 12:32:28 nbp11-srv1 kernel: ret_from_fork+0x1f/0x40
            Mar 30 12:32:28 nbp11-srv1 kernel: Lustre: nbp11-MDT0000-osd: namespace LFSCK add flags for [0x20004ca8c:0x8986:0x0] in the trace file, flags 1, old 0, new 1: rc = -22
            Mar 30 12:32:28 nbp11-srv1 kernel: CPU: 34 PID: 1520983 Comm: lfsck Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
            Mar 30 12:32:28 nbp11-srv1 kernel: Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/21/2022
            Mar 30 12:32:28 nbp11-srv1 kernel: Call Trace:
            Mar 30 12:32:28 nbp11-srv1 kernel: dump_stack+0x41/0x60
            Mar 30 12:32:28 nbp11-srv1 kernel: lfsck_trans_create.part.58+0x63/0x70 [lfsck]
            Mar 30 12:32:28 nbp11-srv1 kernel: lfsck_namespace_trace_update+0x972/0x980 [lfsck]
            Mar 30 12:32:28 nbp11-srv1 kernel: lfsck_namespace_exec_oit+0x87d/0x970 [lfsck]
            Mar 30 12:32:28 nbp11-srv1 kernel: lfsck_master_oit_engine+0xc56/0x1360 [lfsck]
            Mar 30 12:32:28 nbp11-srv1 kernel: lfsck_master_engine+0x512/0xcd0 [lfsck]
            Mar 30 12:32:28 nbp11-srv1 kernel: ? __schedule+0x2d9/0x860
            Mar 30 12:32:28 nbp11-srv1 kernel: ? finish_wait+0x80/0x80
            Mar 30 12:32:28 nbp11-srv1 kernel: ? lfsck_master_oit_engine+0x1360/0x1360 [lfsck]
            Mar 30 12:32:28 nbp11-srv1 kernel: kthread+0x10a/0x120
            Mar 30 12:32:28 nbp11-srv1 kernel: ? set_kthread_struct+0x50/0x50
            Mar 30 12:32:28 nbp11-srv1 kernel: ret_from_fork+0x1f/0x40

            On MDT2 I am getting these errors:

            Mar 30 12:33:43 nbp11-srv5 kernel: Lustre: nbp11-MDT0002-osd: layout LFSCK master found bad lmm_oi for [0x2400ecb78:0x1e8bb:0x0]: rc = 56
            Mar 30 12:33:43 nbp11-srv5 kernel: Lustre: nbp11-MDT0002-osd: layout LFSCK master found bad lmm_oi for [0x2400ecb78:0x1e8bc:0x0]: rc = 56

            These are the files for the bad directories.

            mhanafi Mahmoud Hanafi added a comment
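
            As a sketch of the dry-run workflow referenced above (target names follow this filesystem's naming; exact options may need adjusting for a given release):

            # Start an LFSCK dry run on one MDT and check its progress:
            lctl lfsck_start -M nbp11-MDT0000 -t all --dryrun on
            lctl get_param mdd.nbp11-MDT0000.lfsck_namespace
            lctl get_param mdd.nbp11-MDT0000.lfsck_layout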

            People

              Assignee: adilger Andreas Dilger
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: