Lustre > LU-18225

LDISKFS-fs: initial error at time 1724092382: ldiskfs_generic_delete_entry

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      While running a build from https://build.whamcloud.com/job/lustre-reviews/107525/ on both the servers and clients of WR-SOAK, with no HA enabled, soak hit the following issue:

      [63239.265986] Lustre: Lustre: Build Version: 2.15.90_24_ge863f95f

      Both VMs from the 2nd controller rebooted; the following errors were seen in the console log.

      vcontroller vm0

      [63664.348413] Lustre: sfa18k03-OST0025: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0x340000401:20963886 to 0x340000401:20968833)
      [63664.813575] Lustre: sfa18k03-OST0026: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0xc40000400:20942873 to 0xc40000400:20948833)
      [63664.817519] Lustre: sfa18k03-OST0024: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0x5c0000400:20671906 to 0x5c0000400:20675073)
      [63670.858601] Lustre: sfa18k03-OST0024: new connection from sfa18k03-MDT0001-mdtlov (cleaning up unused objects from 0x5c0000401:19996851 to 0x5c0000401:19999185)
      [63670.878393] Lustre: sfa18k03-OST001c: new connection from sfa18k03-MDT0001-mdtlov (cleaning up unused objects from 0x780000400:19109574 to 0x780000400:19111665)
      [63705.015859] Lustre: sfa18k03-OST0029-osc-MDT0001: Connection restored to 172.25.80.53@tcp (at 172.25.80.53@tcp)
      [63705.019483] Lustre: Skipped 14 previous similar messages
      [63711.321154] LustreError: sfa18k03-OST0031-osc-MDT0001: operation ost_connect to node 172.25.80.53@tcp failed: rc = -19
      [63711.323272] LustreError: Skipped 10 previous similar messages
      [63936.604947] LDISKFS-fs (sdao): error count since last fsck: 38
      [63936.604946] LDISKFS-fs (sdap): error count since last fsck: 2
      [63936.604966] LDISKFS-fs (sdap): initial error at time 1723874458: ldiskfs_find_dest_de:2297
      [63936.609560] LDISKFS-fs (sdao): initial error at time 1723660739: ldiskfs_find_dest_de:2297
      [63936.610905] : inode 182190082
      [63936.612597] : inode 73138214
      [63936.614252] : block 5830242584
      [63936.615147] : block 60217102
      [63936.615989] 
      [63936.616849] 
      [63936.617671] LDISKFS-fs (sdap): last error at time 1723874458: ldiskfs_evict_inode:257
      [63936.618286] LDISKFS-fs (sdao): last error at time 1723878231: ldiskfs_evict_inode:257
      [63936.618893] 
      [63936.620394] 
      [75516.204047] LustreError: sfa18k03-OST002c: not available for connect from 172.25.80.50@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [75516.206751] LustreError: sfa18k03-OST002b-osc-MDT0001: operation ost_statfs to node 172.25.80.53@tcp failed: rc = -107
      [75516.206773] Lustre: sfa18k03-MDT0003-osp-MDT0001: Connection to sfa18k03-MDT0003 (at 172.25.80.53@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [75516.206775] Lustre: Skipped 1 previous similar message
      [75516.209189] LustreError: Skipped 238 previous similar messages
      [75516.211099] LustreError: Skipped 10 previous similar messages
      [  OK  ] Stopped target resource-agents dependencies.
               Stopping Restore /run/initramfs on shutdown...
               Stopping LVM event activation on device 8:0...
               Stopping LVM event activation on device 8:48...
               Stopping LVM event activation on device 252:3...
      [  OK  ] Stopped target rpc_pipefs.target.
               Unmounting RPC Pipe File System...
               Stopping LVM event activation on device 8:32...
               Stopping LVM event activation on device 252:2...
               Stopping LVM event activation on device 8:64...
               Stopping Hostname Service...
      [  OK  ] Stopped irqbalance daemon.
      [  OK  ] Stopped Self Monitoring and Reporting Technology (SMART) Daemon.
      [  OK  ] Stopped Prometheus exporter for Lustre filesystem.
      [  OK  ] Stopped Hostname Service.
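For reference, the "initial error"/"last error" values in the LDISKFS lines are Unix epoch timestamps; a quick conversion (plain Python, values copied from the log and the ticket title) places them in mid-August 2024:

```python
from datetime import datetime, timezone

# Epochs from the LDISKFS error-count messages and the ticket title.
timestamps = {
    "sdao initial error (ldiskfs_find_dest_de)": 1723660739,
    "sdap initial error (ldiskfs_find_dest_de)": 1723874458,
    "title: ldiskfs_generic_delete_entry":       1724092382,
}

for label, ts in timestamps.items():
    when = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(f"{when:%Y-%m-%d %H:%M:%S} UTC  {label}")
# 2024-08-14 18:38:59 UTC  sdao initial error (ldiskfs_find_dest_de)
# 2024-08-17 06:00:58 UTC  sdap initial error (ldiskfs_find_dest_de)
# 2024-08-19 18:33:02 UTC  title: ldiskfs_generic_delete_entry
```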
      
      

      Attachments

        1. debugfs-bd (16 kB)
        2. debugfs-c-R-1014 (15 kB)
        3. debugfs-Rbd (16 kB)
        4. debugfs-stat (15 kB)
        5. e2fsck-fn-1015 (3 kB)
        6. e2fsck-output (12 kB)
        7. messages-20240923.gz (1.11 MB)
        8. sdao-dump (2 kB)

          Activity
            Gerrit Updater added a comment (edited)

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57661
            Subject: LU-18225 build: remove virtio patch from el9.3
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a234d1f243dd1b9561bfb3b6358154a8bff3043a

            Gerrit Updater added a comment (edited)

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57657
            Subject: LU-18225 build: remove virtio patch from el9.3-
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 71d13b4292ecceb68728c11f1fe89b86f506f7a7

            Gerrit Updater added a comment

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56644/
            Subject: LU-18225 kernel: silent page allocation failure in virtio_scsi
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 80a9e94df1865287dc67676528d3141a020288be

            Sarah Liu added a comment

            Sure, I will load the patch and update the ticket if it happens again.

            Dongyang Li added a comment

            Looks like the block content has already been changed since Sep 24; the name_len is now 6:

            4120  1ef4 7401 1000 0601 6631 6239 6630 0000  ..t.....f1b9f0..
            

            I guess after the e2fsck the corrupted data is all gone.
            I have made a debug patch which should give us more info when the error is triggered again; could you give it a try? Thanks.
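For what it's worth, those 16 bytes parse cleanly as a regular ext4/ldiskfs directory entry (a quick userspace sketch; byte values copied from the dump above). The inode field is still 24441886, the same value reported in the Sep 24 kernel message quoted below; only name_len has changed from 3 to 6:

```python
import struct

# Bytes from the debugfs block dump above:
# 1ef4 7401 1000 0601 6631 6239 6630
raw = bytes.fromhex("1ef4740110000601663162396630")

# ext4_dir_entry_2 layout: __le32 inode, __le16 rec_len, __u8 name_len,
# __u8 file_type, then the (unpadded) name bytes.
inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", raw, 0)
name = raw[8:8 + name_len].decode("ascii")

print(f"inode={inode} rec_len={rec_len} name_len={name_len} "
      f"file_type={file_type} name={name!r}")
# inode=24441886 rec_len=16 name_len=6 file_type=1 name='f1b9f0'
```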

            Gerrit Updater added a comment

            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56706
            Subject: LU-18225 ldiskfs: add ext4-de-debug.patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 738d9c83967e75370ae333a1b43cac5d7536f74f

            Sarah Liu added a comment (edited)

            I can collect the info. The server VM is 10.25.80.52.

            Please check the attachments debugfs-Rbd and e2fsck-fn-1015.

            Dongyang Li added a comment

            And this is the same inode, with the same block 2349208195 and the same offset 2128.
            Could we dump the block with

            debugfs -R "bd 2349208195" /dev/sdao
            

            and the output of a read-only e2fsck (sdao needs to be unmounted first):

            e2fsck -fn -m [num of cpus] /dev/sdao
            

            And if it's easier, could you point me to the server? I can log in and collect/debug there.

            Sarah Liu added a comment

            I hit the same ldiskfs error on the same node, same sdao and same inode, on Sep 24:

            [root@snp-18k-03-v3 log]# less messages|grep "LDISKFS-fs er"
            Sep 24 13:30:57 snp-18k-03-v3 kernel: LDISKFS-fs error (device sdao): ldiskfs_find_dest_de:2295: inode #73400343: block 2349208195: comm ll_ost03_040: bad entry in directory: rec_len is too small for name_len - offset=2128, inode=24441886, rec_len=16, name_len=3, size=4096
            

            Uploaded the output of "debugfs -c -R 'stat <73400343>' /dev/sdao". I have not run e2fsck yet; please let me know if any other information is needed.
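The "bad entry in directory" message appears to come from the ldiskfs directory-entry sanity check (upstream __ext4_check_dir_entry()). Upstream's minimum is align4(8 + name_len); the ldiskfs dirdata feature reserves additional per-entry space on top of that, which may be why rec_len=16 with name_len=3 still failed here. A simplified userspace rendition of the upstream rule only:

```python
def rec_len_too_small(rec_len: int, name_len: int) -> bool:
    """Upstream rule: rec_len must cover the 8-byte dirent header plus
    the name, rounded up to a 4-byte boundary (EXT4_DIR_REC_LEN)."""
    required = (8 + name_len + 3) & ~3
    return rec_len < required

# A 6-character name fits exactly in a 16-byte record: align4(14) == 16.
print(rec_len_too_small(16, 6))   # False
# A corrupted name_len of, say, 13 would need 24 bytes and trip the check.
print(rec_len_too_small(16, 13))  # True
```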

            Gerrit Updater added a comment

            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56644
            Subject: LU-18225 kernel: silent page allocation failure in virtio_scsi
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01cfbb23defae93dbdeef344ce6d9687410d8b4c

            Andreas Dilger added a comment

            If the allocation failure is handled transparently at a higher level, then using __GFP_NOWARN to quiet this failure is fine. Possibly that should also go upstream, or is there already a patch fixing this code that could be backported?

            It looks like the GFP flag is passed down from the caller through a few layers, so it is a bit tricky to know where the right place to add __GFP_NOWARN is. Ideally, this would be done at the same layer that does the allocation fallback, since that layer knows the allocation failure is not harmful.


            People

              Assignee: Dongyang Li
              Reporter: Sarah Liu
              Votes: 0
              Watchers: 11