LU-18620: LFSCK does not fix broken agent entries

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major

    Description

      Here is my attempt to re-create a missing agent entry for a remote stripe of a striped directory:

      [root@rocky tests]# ../utils/lfs mkdir -H fnv_1a_64 -c 2 /mnt/lustre/dir-c2
      [root@rocky tests]# ../utils/lfs path2fid /mnt/lustre/dir-c2
      [0x200000402:0x1:0x0]
      [root@rocky tests]# ../utils/lfs getdirstripe /mnt/lustre/dir-c2
      lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
      mdtidx		 FID[seq:oid:ver]
           0		 [0x200000400:0x2:0x0]		
           1		 [0x240000401:0x2:0x0]		
      [root@rocky tests]# debugfs /dev/mapper/mds2_flakey  -R "ls -lD REMOTE_PARENT_DIR"
      debugfs 1.46.2.wc5 (26-Mar-2022)
        25001   40755 (2)      0      0    4096  9-Jan-2025 16:04 .
            2   40755 (2)      0      0    4096  9-Jan-2025 15:31 ..
        25047   40755 (2)      0      0    4096  9-Jan-2025 16:04 0x240000401:0x2:0x0
      
      [root@rocky tests]# debugfs /dev/mapper/mds1_flakey  -R "ls ROOT/dir-c2"
      debugfs 1.46.2.wc5 (26-Mar-2022)
       25049  (12) .    25043  (28) ..    25050  (52) [0x200000400:0x2:0x0]:0   
       25051  (4004) [0x240000401:0x2:0x0]:1   
      [root@rocky tests]# debugfs /dev/mapper/mds1_flakey  -R "ls -lD ROOT/dir-c2"
      debugfs 1.46.2.wc5 (26-Mar-2022)
        25049   40755 (2)      0      0    4096  9-Jan-2025 16:04 .
        25043   40755 (18)      0      0    4096  9-Jan-2025 16:04 fid:[0x200000007:0x1:0x0] ..
        25050   40755 (18)      0      0    4096  9-Jan-2025 16:04 fid:[0x200000400:0x2:0x0] [0x200000400:0x2:0x0]:0
        25051   40000 (18)      0      0    4096  1-Jan-1970 03:00 fid:[0x240000401:0x2:0x0] [0x240000401:0x2:0x0]:1
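
      To confirm the stripe/agent correspondence mechanically, one can grep REMOTE_PARENT_DIR for the stripe FID. A minimal sketch based on the commands above, using the device path and FID from this reproducer:

      # check that the remote stripe FID has a matching agent entry on MDT0001
      FID=0x240000401:0x2:0x0        # remote stripe FID from "lfs getdirstripe"
      DEV=/dev/mapper/mds2_flakey    # device backing MDT0001
      if debugfs "$DEV" -R "ls REMOTE_PARENT_DIR" 2>/dev/null | grep -q "$FID"; then
              echo "agent entry present for $FID"
      else
              echo "agent entry MISSING for $FID"
      fi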
      

      Removing the agent entry (MDT0001 is unmounted first so that debugfs can safely modify the device):

      [root@rocky tests]# umount /mnt/lustre-mds2
      [root@rocky tests]# debugfs -w /dev/mapper/mds2_flakey  -R "unlink REMOTE_PARENT_DIR/0x240000401:0x2:0x0"
      debugfs 1.46.2.wc5 (26-Mar-2022)
      [root@rocky tests]# debugfs /dev/mapper/mds2_flakey  -R "ls -lD REMOTE_PARENT_DIR"
      debugfs 1.46.2.wc5 (26-Mar-2022)
        25001   40755 (2)      0      0    4096  9-Jan-2025 16:04 .
            2   40755 (2)      0      0    4096  9-Jan-2025 15:31 ..
      

      Starting namespace LFSCK on both MDTs:

      [root@rocky tests]# mount -t lustre /dev/mapper/mds2_flakey /mnt/lustre-mds2
      [root@rocky tests]# ../utils/lctl lfsck_start -M lustre-MDT0000  -t namespace
      Started LFSCK on the device lustre-MDT0000: scrub namespace
      [root@rocky tests]# ../utils/lctl lfsck_start -M lustre-MDT0001  -t namespace
      Started LFSCK on the device lustre-MDT0001: scrub namespace
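
      lfsck_start returns as soon as the scan is launched; to know when it has finished, the namespace LFSCK status can be polled on each MDT. A minimal sketch, assuming the mdd.*.lfsck_namespace parameter (the same one sanity-lfsck.sh polls):

      # wait for the namespace LFSCK to complete on both MDTs
      for mdt in lustre-MDT0000 lustre-MDT0001; do
              while [ "$(../utils/lctl get_param -n mdd.$mdt.lfsck_namespace |
                      awk '/^status:/ {print $2}')" != "completed" ]; do
                      sleep 1
              done
              echo "$mdt: namespace LFSCK completed"
      done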
      
      

      Checking the results of the LFSCK runs shows that an object with a new FID was created
      and a new agent entry was inserted into /REMOTE_PARENT_DIR:

      [root@rocky tests]# debugfs /dev/mapper/mds2_flakey  -R "ls -lD REMOTE_PARENT_DIR"
      debugfs 1.46.2.wc5 (26-Mar-2022)
        25001   40755 (2)      0      0    4096  9-Jan-2025 16:07 .
            2   40755 (2)      0      0    4096  9-Jan-2025 15:31 ..
        25048   40700 (2)      0      0    4096  9-Jan-2025 16:07 0x240000bd0:0x1:0x0
      

      It is a different FID from the FID of the remote stripe recorded in the master directory:

      [root@rocky tests]# debugfs /dev/mapper/mds1_flakey  -R "ls -lD ROOT/dir-c2"
      debugfs 1.46.2.wc5 (26-Mar-2022)
        25049   40755 (2)      0      0    4096  9-Jan-2025 16:04 .
        25043   40755 (18)      0      0    4096  9-Jan-2025 16:04 fid:[0x200000007:0x1:0x0] ..
        25050   40755 (18)      0      0    4096  9-Jan-2025 16:04 fid:[0x200000400:0x2:0x0] [0x200000400:0x2:0x0]:0
        25051   40000 (18)      0      0    4096  1-Jan-1970 03:00 fid:[0x240000401:0x2:0x0] [0x240000401:0x2:0x0]:1
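
      The mismatch can be verified mechanically by comparing the stripe FID recorded in the master directory with the FID of the new agent entry. A sketch using the same devices as above:

      STRIPE_FID=$(debugfs /dev/mapper/mds1_flakey -R "ls ROOT/dir-c2" 2>/dev/null |
              grep -o '0x240[0-9a-fx:]*' | head -1)
      AGENT_FID=$(debugfs /dev/mapper/mds2_flakey -R "ls REMOTE_PARENT_DIR" 2>/dev/null |
              grep -o '0x240[0-9a-fx:]*' | head -1)
      [ "$STRIPE_FID" = "$AGENT_FID" ] ||
              echo "mismatch: stripe $STRIPE_FID vs agent entry $AGENT_FID"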
      

      The directory inode of the remote stripe is still marked in use, yet it remains an orphan, not linked from any directory:

      [root@rocky tests]# debugfs /dev/mapper/mds2_flakey  
      debugfs 1.46.2.wc5 (26-Mar-2022)
      debugfs:  testi <25047>
      Inode 25047 is marked in use
      debugfs:  ncheck 25047
      Inode	Pathname
      debugfs:   
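
      The same check can be scripted non-interactively; a sketch using the inode number from the session above:

      # inode 25047 backs the remote stripe: "testi" shows it is still
      # allocated, while "ncheck" prints no pathname because no directory
      # entry links to it
      debugfs /dev/mapper/mds2_flakey -R "testi <25047>" 2>/dev/null
      debugfs /dev/mapper/mds2_flakey -R "ncheck 25047" 2>/dev/null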
      


        Activity


          Sometimes a modified version of sanity-lfsck.sh:test_35 (with the fault injection replaced by a debugfs unlink command; a sketch of the change follows the log below) fails this way:

          == sanity-lfsck test 35: LFSCK can rebuild the lost agent entry ========================================================== 21:36:47 (1736534207)
          preparing... 1 * 1 files will be created Fri Jan 10 21:36:47 MSK 2025.
          total: 1 mkdir in 0.00 seconds: 3506.94 ops/second
          total: 1 create in 0.00 seconds: 4462.03 ops/second
          total: 1 mkdir in 0.00 seconds: 4148.67 ops/second
          prepared Fri Jan 10 21:36:48 MSK 2025.
          fail_loc=0
          Stopping /mnt/lustre-mds2 (opts:) on rocky.localnet
          debugfs 1.46.2.wc5 (26-Mar-2022)
          debugfs:  unlink REMOTE_PARENT_DIR/0x240000402:0x1:0x0
          debugfs:  debugfs 1.46.2.wc5 (26-Mar-2022)
          debugfs:  unlink REMOTE_PARENT_DIR/0x240000401:0x2:0x0
          debugfs:  Starting mds2: -o localrecov  /dev/mapper/mds2_flakey /mnt/lustre-mds2
          Started lustre-MDT0001
          Started LFSCK on the device lustre-MDT0000: scrub namespace
          stopall to cleanup object cache
          setupall
          Using TIMEOUT=20
          Started LFSCK on the device lustre-MDT0000: scrub namespace
          debugfs 1.46.2.wc5 (26-Mar-2022)
          /dev/mapper/mds2_flakey: catastrophic mode - not reading inode or group bitmaps
          debugfs:  ls -l REMOTE_PARENT_DIR
            20001   40755 (2)      0      0    4096 10-Jan-2025 21:37 .
                2   40755 (2)      0      0    4096 10-Jan-2025 21:36 ..
            20099   40700 (2)      0      0    4096 10-Jan-2025 21:37 0x240000bd0:0x1:0x0
          
          debugfs:  debugfs 1.46.2.wc5 (26-Mar-2022)
          /dev/mapper/mds2_flakey: catastrophic mode - not reading inode or group bitmaps
          REMOTE_PARENT_DIR/0x240000401:0x2:0x0: File not found by ext2_lookup 
          debugfs:  ls -l REMOTE_PARENT_DIR/0x240000401:0x2:0x0
          debugfs:  debugfs 1.46.2.wc5 (26-Mar-2022)
          /dev/mapper/mds2_flakey: catastrophic mode - not reading inode or group bitmaps
          REMOTE_PARENT_DIR/0x240000402:0x1:0x0: File not found by ext2_lookup 
           sanity-lfsck test_35: @@@@@@ FAIL: (8) remote dir agent entry is missing or incorrect 
            Trace dump:
            = ./../tests/test-framework.sh:7229:error()
            = sanity-lfsck.sh:5599:test_35()
            = ./../tests/test-framework.sh:7602:run_one()
            = ./../tests/test-framework.sh:7665:run_one_logged()
            = ./../tests/test-framework.sh:7468:run_test()
            = sanity-lfsck.sh:5607:main()
          Dumping lctl log to /tmp/test_logs/1736534154/sanity-lfsck.test_35.*.1736534280.log
          Dumping logs only on local client.
          FAIL 35 (73s)
          [root@rocky tests]# 
          
          zam Alexander Zarochentsev added a comment
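
          For reference, a rough sketch of the modification described above (an assumption of how test_35's fault injection could be swapped for a debugfs unlink, not the actual patch; stop, start and mdsdevname are the usual test-framework.sh helpers, and $fid is the remote stripe FID from "lfs getdirstripe"):

          # instead of triggering fail_loc, drop the agent entry directly:
          stop mds2
          debugfs -w -R "unlink REMOTE_PARENT_DIR/$fid" $(mdsdevname 2)
          start mds2 $(mdsdevname 2) $MDS_MOUNT_OPTS
          $LCTL lfsck_start -M $FSNAME-MDT0000 -t namespace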

          "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57714
          Subject: LU-18620 tests: sanity_lfsck test 35 improvements
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 4bd3d4f496d85cdec313a2b31bb1d132437b3ce4

          gerrit Gerrit Updater added a comment

          People

            Assignee: Alexander Zarochentsev (zam)
            Reporter: Alexander Zarochentsev (zam)
            Votes: 0
            Watchers: 4