Lustre > LU-18225

LDISKFS-fs: initial error at time 1724092382: ldiskfs_generic_delete_entry

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      While running a build from https://build.whamcloud.com/job/lustre-reviews/107525/ on both the servers and clients of WR-SOAK, with no HA enabled, soak hit the following issue:

      [63239.265986] Lustre: Lustre: Build Version: 2.15.90_24_ge863f95f

      Both VMs from the 2nd controller rebooted; the following errors were seen in the console log.

      vcontroller vm0

      [63664.348413] Lustre: sfa18k03-OST0025: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0x340000401:20963886 to 0x340000401:20968833)
      [63664.813575] Lustre: sfa18k03-OST0026: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0xc40000400:20942873 to 0xc40000400:20948833)
      [63664.817519] Lustre: sfa18k03-OST0024: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0x5c0000400:20671906 to 0x5c0000400:20675073)
      [63670.858601] Lustre: sfa18k03-OST0024: new connection from sfa18k03-MDT0001-mdtlov (cleaning up unused objects from 0x5c0000401:19996851 to 0x5c0000401:19999185)
      [63670.878393] Lustre: sfa18k03-OST001c: new connection from sfa18k03-MDT0001-mdtlov (cleaning up unused objects from 0x780000400:19109574 to 0x780000400:19111665)
      [63705.015859] Lustre: sfa18k03-OST0029-osc-MDT0001: Connection restored to 172.25.80.53@tcp (at 172.25.80.53@tcp)
      [63705.019483] Lustre: Skipped 14 previous similar messages
      [63711.321154] LustreError: sfa18k03-OST0031-osc-MDT0001: operation ost_connect to node 172.25.80.53@tcp failed: rc = -19
      [63711.323272] LustreError: Skipped 10 previous similar messages
      [63936.604947] LDISKFS-fs (sdao): error count since last fsck: 38
      [63936.604946] LDISKFS-fs (sdap): error count since last fsck: 2
      [63936.604966] LDISKFS-fs (sdap): initial error at time 1723874458: ldiskfs_find_dest_de:2297
      [63936.609560] LDISKFS-fs (sdao): initial error at time 1723660739: ldiskfs_find_dest_de:2297
      [63936.610905] : inode 182190082
      [63936.612597] : inode 73138214
      [63936.614252] : block 5830242584
      [63936.615147] : block 60217102
      [63936.615989] 
      [63936.616849] 
      [63936.617671] LDISKFS-fs (sdap): last error at time 1723874458: ldiskfs_evict_inode:257
      [63936.618286] LDISKFS-fs (sdao): last error at time 1723878231: ldiskfs_evict_inode:257
      [63936.618893] 
      [63936.620394] 
      [75516.204047] LustreError: sfa18k03-OST002c: not available for connect from 172.25.80.50@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [75516.206751] LustreError: sfa18k03-OST002b-osc-MDT0001: operation ost_statfs to node 172.25.80.53@tcp failed: rc = -107
      [75516.206773] Lustre: sfa18k03-MDT0003-osp-MDT0001: Connection to sfa18k03-MDT0003 (at 172.25.80.53@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [75516.206775] Lustre: Skipped 1 previous similar message
      [75516.209189] LustreError: Skipped 238 previous similar messages
      [75516.211099] LustreError: Skipped 10 previous similar messages
      [  OK  ] Stopped target resource-agents dependencies.
               Stopping Restore /run/initramfs on shutdown...
               Stopping LVM event activation on device 8:0...
               Stopping LVM event activation on device 8:48...
               Stopping LVM event activation on device 252:3...
      [  OK  ] Stopped target rpc_pipefs.target.
               Unmounting RPC Pipe File System...
               Stopping LVM event activation on device 8:32...
               Stopping LVM event activation on device 252:2...
               Stopping LVM event activation on device 8:64...
               Stopping Hostname Service...
      [  OK  ] Stopped irqbalance daemon.
      [  OK  ] Stopped Self Monitoring and Reporting Technology (SMART) Daemon.
      [  OK  ] Stopped Prometheus exporter for Lustre filesystem.
      [  OK  ] Stopped Hostname Service.
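For reference, the "initial error"/"last error" values in the LDISKFS lines are Unix epoch timestamps; a quick conversion (plain Python, values copied from the log and the ticket title) places them in mid-August 2024:

```python
from datetime import datetime, timezone

# Epochs from the LDISKFS error-count messages and the ticket title.
timestamps = {
    "sdao initial error (ldiskfs_find_dest_de)": 1723660739,
    "sdap initial error (ldiskfs_find_dest_de)": 1723874458,
    "title: ldiskfs_generic_delete_entry":       1724092382,
}

for label, ts in timestamps.items():
    when = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(f"{when:%Y-%m-%d %H:%M:%S} UTC  {label}")
# 2024-08-14 18:38:59 UTC  sdao initial error (ldiskfs_find_dest_de)
# 2024-08-17 06:00:58 UTC  sdap initial error (ldiskfs_find_dest_de)
# 2024-08-19 18:33:02 UTC  title: ldiskfs_generic_delete_entry
```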
      
      

      Attachments

        1. debugfs-bd (16 kB)
        2. debugfs-c-R-1014 (15 kB)
        3. debugfs-Rbd (16 kB)
        4. debugfs-stat (15 kB)
        5. e2fsck-fn-1015 (3 kB)
        6. e2fsck-output (12 kB)
        7. messages-20240923.gz (1.11 MB)
        8. sdao-dump (2 kB)

          Activity
            Gerrit Updater added a comment (edited)

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57661
            Subject: LU-18225 build: remove virtio patch from el9.3
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a234d1f243dd1b9561bfb3b6358154a8bff3043a

            Gerrit Updater added a comment (edited)

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57657
            Subject: LU-18225 build: remove virtio patch from el9.3-
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 71d13b4292ecceb68728c11f1fe89b86f506f7a7

            Gerrit Updater added a comment

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56644/
            Subject: LU-18225 kernel: silent page allocation failure in virtio_scsi
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 80a9e94df1865287dc67676528d3141a020288be

            Sarah Liu added a comment

            Sure, I will load the patch and update the ticket if it happens again.

            Dongyang Li added a comment

            Looks like the block content has already been changed since Sep 24; the name_len is now 6:

            4120  1ef4 7401 1000 0601 6631 6239 6630 0000  ..t.....f1b9f0..
            

            I guess after the e2fsck the corrupted data is all gone.
            I have made a debug patch which should give us more info when the error is triggered again; could you give it a try? Thanks.
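For what it's worth, those 16 bytes parse cleanly as a regular ext4/ldiskfs directory entry (a quick userspace sketch; byte values copied from the dump above). The inode field is still 24441886, the same value reported in the Sep 24 kernel message quoted below; only name_len has changed from 3 to 6:

```python
import struct

# Bytes from the debugfs block dump above:
# 1ef4 7401 1000 0601 6631 6239 6630
raw = bytes.fromhex("1ef4740110000601663162396630")

# ext4_dir_entry_2 layout: __le32 inode, __le16 rec_len, __u8 name_len,
# __u8 file_type, then the (unpadded) name bytes.
inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", raw, 0)
name = raw[8:8 + name_len].decode("ascii")

print(f"inode={inode} rec_len={rec_len} name_len={name_len} "
      f"file_type={file_type} name={name!r}")
# inode=24441886 rec_len=16 name_len=6 file_type=1 name='f1b9f0'
```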

            Gerrit Updater added a comment

            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56706
            Subject: LU-18225 ldiskfs: add ext4-de-debug.patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 738d9c83967e75370ae333a1b43cac5d7536f74f

            Sarah Liu added a comment (edited)

            I can collect the info. The server VM is 10.25.80.52.

            Please check the attachments debugfs-Rbd and e2fsck-fn-1015.

            Dongyang Li added a comment

            And this is the same inode, with the same block 2349208195 and the same offset 2128.
            Could we dump the block with

            debugfs -R "bd 2349208195" /dev/sdao
            

            and the output of a read-only e2fsck (sdao needs to be unmounted first):

            e2fsck -fn -m [num of cpus] /dev/sdao
            

            And if it's easier, could you point me to the server? I can log in and collect/debug there.

            Sarah Liu added a comment

            I hit the same ldiskfs error on the same node, same sdao and same inode, on Sep 24:

            [root@snp-18k-03-v3 log]# less messages|grep "LDISKFS-fs er"
            Sep 24 13:30:57 snp-18k-03-v3 kernel: LDISKFS-fs error (device sdao): ldiskfs_find_dest_de:2295: inode #73400343: block 2349208195: comm ll_ost03_040: bad entry in directory: rec_len is too small for name_len - offset=2128, inode=24441886, rec_len=16, name_len=3, size=4096
            

            Uploaded the output of "debugfs -c -R 'stat <73400343>' /dev/sdao". I have not run e2fsck yet; please let me know if any other information is needed.
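The "bad entry in directory" message appears to come from the ldiskfs directory-entry sanity check (upstream __ext4_check_dir_entry()). Upstream's minimum is align4(8 + name_len); the ldiskfs dirdata feature reserves additional per-entry space on top of that, which may be why rec_len=16 with name_len=3 still failed here. A simplified userspace rendition of the upstream rule only:

```python
def rec_len_too_small(rec_len: int, name_len: int) -> bool:
    """Upstream rule: rec_len must cover the 8-byte dirent header plus
    the name, rounded up to a 4-byte boundary (EXT4_DIR_REC_LEN)."""
    required = (8 + name_len + 3) & ~3
    return rec_len < required

# A 6-character name fits exactly in a 16-byte record: align4(14) == 16.
print(rec_len_too_small(16, 6))   # False
# A corrupted name_len of, say, 13 would need 24 bytes and trip the check.
print(rec_len_too_small(16, 13))  # True
```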

            Gerrit Updater added a comment

            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56644
            Subject: LU-18225 kernel: silent page allocation failure in virtio_scsi
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01cfbb23defae93dbdeef344ce6d9687410d8b4c

            Andreas Dilger added a comment

            If the allocation failure is handled transparently at a higher level, then using __GFP_NOWARN to quiet this failure is fine. Possibly that should also go upstream, or is there already a patch fixing this code that could be backported?

            It looks like the GFP flag is passed down from the caller through a few layers, so it is a bit tricky to know where the right place to add __GFP_NOWARN is. Ideally, this would be done at the same layer that does the allocation fallback, since that layer knows the allocation failure is not harmful.


            People

              Assignee: Dongyang Li
              Reporter: Sarah Liu
              Votes: 0
              Watchers: 11