
LDISKFS-fs: initial error at time 1724092382: ldiskfs_generic_delete_entry

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0
    • Severity: 3

    Description

Loading the build from https://build.whamcloud.com/job/lustre-reviews/107525/ on both servers and clients of WR-SOAK, with no HA enabled, soak hit the following issue:

      [63239.265986] Lustre: Lustre: Build Version: 2.15.90_24_ge863f95f

Both VMs from the 2nd controller rebooted; see the following errors in the console log.

      vcontroller vm0

      [63664.348413] Lustre: sfa18k03-OST0025: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0x340000401:20963886 to 0x340000401:20968833)
      [63664.813575] Lustre: sfa18k03-OST0026: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0xc40000400:20942873 to 0xc40000400:20948833)
      [63664.817519] Lustre: sfa18k03-OST0024: new connection from sfa18k03-MDT0003-mdtlov (cleaning up unused objects from 0x5c0000400:20671906 to 0x5c0000400:20675073)
      [63670.858601] Lustre: sfa18k03-OST0024: new connection from sfa18k03-MDT0001-mdtlov (cleaning up unused objects from 0x5c0000401:19996851 to 0x5c0000401:19999185)
      [63670.878393] Lustre: sfa18k03-OST001c: new connection from sfa18k03-MDT0001-mdtlov (cleaning up unused objects from 0x780000400:19109574 to 0x780000400:19111665)
      [63705.015859] Lustre: sfa18k03-OST0029-osc-MDT0001: Connection restored to 172.25.80.53@tcp (at 172.25.80.53@tcp)
      [63705.019483] Lustre: Skipped 14 previous similar messages
      [63711.321154] LustreError: sfa18k03-OST0031-osc-MDT0001: operation ost_connect to node 172.25.80.53@tcp failed: rc = -19
      [63711.323272] LustreError: Skipped 10 previous similar messages
      [63936.604947] LDISKFS-fs (sdao): error count since last fsck: 38
      [63936.604946] LDISKFS-fs (sdap): error count since last fsck: 2
      [63936.604966] LDISKFS-fs (sdap): initial error at time 1723874458: ldiskfs_find_dest_de:2297
      [63936.609560] LDISKFS-fs (sdao): initial error at time 1723660739: ldiskfs_find_dest_de:2297
      [63936.610905] : inode 182190082
      [63936.612597] : inode 73138214
      [63936.614252] : block 5830242584
      [63936.615147] : block 60217102
      [63936.615989] 
      [63936.616849] 
      [63936.617671] LDISKFS-fs (sdap): last error at time 1723874458: ldiskfs_evict_inode:257
      [63936.618286] LDISKFS-fs (sdao): last error at time 1723878231: ldiskfs_evict_inode:257
      [63936.618893] 
      [63936.620394] 
      [75516.204047] LustreError: sfa18k03-OST002c: not available for connect from 172.25.80.50@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [75516.206751] LustreError: sfa18k03-OST002b-osc-MDT0001: operation ost_statfs to node 172.25.80.53@tcp failed: rc = -107
      [75516.206773] Lustre: sfa18k03-MDT0003-osp-MDT0001: Connection to sfa18k03-MDT0003 (at 172.25.80.53@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [75516.206775] Lustre: Skipped 1 previous similar message
      [75516.209189] LustreError: Skipped 238 previous similar messages
      [75516.211099] LustreError: Skipped 10 previous similar messages
      [  OK  ] Stopped target resource-agents dependencies.
               Stopping Restore /run/initramfs on shutdown...
               Stopping LVM event activation on device 8:0...
               Stopping LVM event activation on device 8:48...
               Stopping LVM event activation on device 252:3...
      [  OK  ] Stopped target rpc_pipefs.target.
               Unmounting RPC Pipe File System...
               Stopping LVM event activation on device 8:32...
               Stopping LVM event activation on device 252:2...
               Stopping LVM event activation on device 8:64...
               Stopping Hostname Service...
      [  OK  ] Stopped t[  OK  ] Stopped irqbalance daemon.
      [  OK  ] Stopped Self Monitoring and Reporting Technology (SMART) Daemon.
      [  OK  ] Stopped Prometheus exporter for Lustre filesystem.
      [  OK  ] Stopped Hostname Service.
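
      For reference, the "error count since last fsck" / "initial error" / "last error" lines above are printed from persistent error-tracking fields in the ldiskfs (ext4) superblock, which survive reboots and are only cleared by e2fsck. A sketch of the relevant fields, grouped here into a standalone struct purely for illustration (the field names match ext4's struct ext4_super_block in fs/ext4/ext4.h):

      /* Persistent error tracking in the ext4/ldiskfs on-disk superblock
       * (subset of struct ext4_super_block; grouped for illustration). */
      struct ldiskfs_sb_error_info {
              __le32  s_error_count;          /* number of fs errors seen */
              __le32  s_first_error_time;     /* first time an error happened */
              __le32  s_first_error_ino;      /* inode involved in first error */
              __le64  s_first_error_block;    /* block involved in first error */
              __u8    s_first_error_func[32]; /* function where error happened */
              __le32  s_first_error_line;     /* line number of first error */
              __le32  s_last_error_time;      /* most recent error time */
              __le32  s_last_error_ino;       /* inode of most recent error */
              __le64  s_last_error_block;     /* block of most recent error */
              __u8    s_last_error_func[32];  /* function of most recent error */
              __le32  s_last_error_line;      /* line number of last error */
      };

      The "ldiskfs_find_dest_de:2297" strings in the log are the stored func:line pair, and the ": inode ..." / ": block ..." continuation lines are the stored inode and block numbers, interleaved here between the sdao and sdap devices.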
      
      

      Attachments

        1. debugfs-bd
          16 kB
        2. debugfs-c-R-1014
          15 kB
        3. debugfs-Rbd
          16 kB
        4. debugfs-stat
          15 kB
        5. e2fsck-fn-1015
          3 kB
        6. e2fsck-output
          12 kB
        7. messages-20240923.gz
          1.11 MB
        8. sdao-dump
          2 kB


          Activity

            [LU-18225] LDISKFS-fs: initial error at time 1724092382: ldiskfs_generic_delete_entry

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56644/
            Subject: LU-18225 kernel: silent page allocation failure in virtio_scsi
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 80a9e94df1865287dc67676528d3141a020288be

            sarah Sarah Liu added a comment -

Sure, I will load the patch and update the ticket if it happens again.

            dongyang Dongyang Li added a comment -

Looks like the block content has already changed since Sep 24; the name_len is now 6:

            4120  1ef4 7401 1000 0601 6631 6239 6630 0000  ..t.....f1b9f0..
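
            For reference, the bytes being decoded here are a standard ext4 directory entry header (ldiskfs is patched ext4); a sketch of the on-disk layout and how the dumped words map onto it:

            /* On-disk ext4/ldiskfs directory entry header (fs/ext4/ext4.h).
             * All multi-byte fields are little-endian. */
            struct ext4_dir_entry_2 {
                    __le32  inode;          /* inode number, 0 = unused entry */
                    __le16  rec_len;        /* length of this record in bytes */
                    __u8    name_len;       /* length of the file name */
                    __u8    file_type;      /* EXT4_FT_* value, 1 = regular file */
                    char    name[];         /* file name, not NUL-terminated */
            };

            /* Decoding the dump at offset 4120 (0x1018):
             *   1ef4 7401      -> inode     = 0x0174f41e = 24441886
             *   1000           -> rec_len   = 0x0010 = 16
             *   06             -> name_len  = 6
             *   01             -> file_type = 1 (regular file)
             *   6631 6239 6630 -> name      = "f1b9f0"
             */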
            

I guess after the e2fsck the corrupted data is all gone.
I just made a debug patch, which should give us more info when the error is triggered again. Could you give it a try? Thanks


            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56706
            Subject: LU-18225 ldiskfs: add ext4-de-debug.patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 738d9c83967e75370ae333a1b43cac5d7536f74f

            sarah Sarah Liu added a comment - - edited

I can collect the info. The server vm is 10.25.80.52.

Please check debugfs-Rbd and e2fsck-fn-1015.

            dongyang Dongyang Li added a comment -

And this is the same inode, with the same block 2349208195 and the same offset 2128.
Could we dump the block with

            debugfs -R "bd 2349208195" /dev/sdao
            

and also get the output of a read-only e2fsck? sdao needs to be unmounted first:

            e2fsck -fn -m [num of cpus] /dev/sdao
            

            And if it's easier, could you point me to the server and I can login and collect/debug there?

            dongyang Dongyang Li added a comment - And this the same inode with the same block 2349208195, with the same offset 2128. could we dump the block with debugfs -R "bd 2349208195" /dev/sdao and the output of read-only e2fsck, need to umount sdao first e2fsck -fn -m [num of cpus] /dev/sdao And if it's easier, could you point me to the server and I can login and collect/debug there?
            sarah Sarah Liu added a comment -

I hit the same ldiskfs error on the same node, same sdao device and inode, on Sep 24:

            [root@snp-18k-03-v3 log]# less messages|grep "LDISKFS-fs er"
            Sep 24 13:30:57 snp-18k-03-v3 kernel: LDISKFS-fs error (device sdao): ldiskfs_find_dest_de:2295: inode #73400343: block 2349208195: comm ll_ost03_040: bad entry in directory: rec_len is too small for name_len - offset=2128, inode=24441886, rec_len=16, name_len=3, size=4096
            

Uploaded the output of "debugfs -c -R 'stat <73400343>' /dev/sdao". I have not run e2fsck yet; please let me know if any other information is needed.
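
            For context, the message above comes from the ldiskfs copy of ext4's directory-entry sanity check. A simplified sketch of the conditions it applies, modeled on __ext4_check_dir_entry() in fs/ext4/dir.c with locking and the actual error reporting stripped (note that ldiskfs's dirdata patch reserves extra space after the name for the Lustre FID, so the minimum rec_len it requires can be larger than the plain macro shown here):

            /* Minimal sketch of the directory-entry checks; vanilla ext4 logic. */
            #define EXT4_DIR_REC_LEN(name_len)  (((name_len) + 8 + 3) & ~3)

            static const char *check_dir_entry(const struct ext4_dir_entry_2 *de,
                                               unsigned int rlen, unsigned int offset,
                                               unsigned int size)
            {
                    if (rlen < EXT4_DIR_REC_LEN(1))
                            return "rec_len is smaller than minimal";
                    if (rlen % 4 != 0)
                            return "rec_len % 4 != 0";
                    if (rlen < EXT4_DIR_REC_LEN(de->name_len))
                            return "rec_len is too small for name_len";
                    if (offset + rlen > size)
                            return "directory entry overrun";
                    return NULL;    /* entry looks sane */
            }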


            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56644
            Subject: LU-18225 kernel: silent page allocation failure in virtio_scsi
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01cfbb23defae93dbdeef344ce6d9687410d8b4c


adilger Andreas Dilger added a comment -

If the allocation failure is handled transparently at a higher level, then using __GFP_NOWARN to quiet this failure is fine. Possibly that should also go upstream, or is there already a patch fixing this code that could be backported?

            It looks like the GFP flag is passed down from the caller through a few layers, so it is a bit tricky to know where is the right place to add __GFP_NOWARN. Ideally, this would be done at the same layer that is doing the allocation fallback, since it knows that the allocation failure is not harmful.
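
            As an illustration of the approach being discussed (not the actual patch; the function below and its name are hypothetical), a sketch of quieting a high-order allocation at the layer that owns the fallback:

            #include <linux/slab.h>

            /* Hypothetical example: an allocation whose failure is handled by a
             * fallback path can pass __GFP_NOWARN, so a failed attempt does not
             * dump a page-allocation-failure warning to the console. */
            static void *alloc_indirect_table(size_t nents, gfp_t gfp)
            {
                    /* High-order GFP_ATOMIC allocations fail readily under memory
                     * pressure; the warning is suppressed because the caller can
                     * fall back to direct descriptors when NULL is returned. */
                    return kmalloc_array(nents, sizeof(u64), gfp | __GFP_NOWARN);
            }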

            dongyang Dongyang Li added a comment -

Yes, the allocation failure is a separate problem that needs looking into.

            sarah Sarah Liu added a comment -

Hi dongyang, no, there was no ldiskfs error on vm1. After I ran e2fsck and restarted the test, there was no ldiskfs error on any server vm when it timed out. Sorry, I didn't record the output of e2fsck.

I rebooted all the server vms since the fs was not responding (ls hung). The test ran overnight and seems to be ongoing, but I found page allocation errors on all server vms, with dmesg output like the following.
Should I open a new ticket for this?

            [16247.823902] Lustre: 68562:0:(mdt_recovery.c:148:mdt_req_from_lrd()) @@@ restoring transno  req@000000001dd58399 x1811746499175296/t523986305680(0) o101->7ec4b321-fc8d-49b2-9fc1-4ae12b6cd4c0@172.25.80.33@tcp:657/0 lens 376/1784 e 0 to 0 dl 1728427442 ref 1 fl Interpret:H/202/0 rc 0/0 job:'' uid:0 gid:0
            [22473.618001] Lustre: 127072:0:(mdt_recovery.c:148:mdt_req_from_lrd()) @@@ restoring transno  req@00000000abb6e3d3 x1811759166812672/t523989095606(0) o101->ce0abb5d-3cd7-43eb-bb94-be1985fd7ad4@172.25.80.22@tcp:63/0 lens 376/1864 e 0 to 0 dl 1728433643 ref 1 fl Interpret:H/202/0 rc 0/0 job:'' uid:0 gid:0
            [22474.401775] Lustre: 129905:0:(mdt_recovery.c:148:mdt_req_from_lrd()) @@@ restoring transno  req@000000007252b13d x1811759170103296/t523989095619(0) o101->ce0abb5d-3cd7-43eb-bb94-be1985fd7ad4@172.25.80.22@tcp:64/0 lens 376/1864 e 0 to 0 dl 1728433644 ref 1 fl Interpret:H/202/0 rc 0/0 job:'' uid:0 gid:0
            [22581.834967] ll_ost_io02_039: page allocation failure: order:3, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
            [22581.838485] CPU: 8 PID: 127347 Comm: ll_ost_io02_039 Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.16.1.el8_lustre.x86_64 #1
            [22581.841877] Hardware name: DDN SFA18KXE, BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
            [22581.844339] Call Trace:
            [22581.845495]  dump_stack+0x41/0x60
            [22581.846780]  warn_alloc.cold.127+0x7b/0x108
            [22581.848191]  __alloc_pages_slowpath+0xcb2/0xcd0
            [22581.849511]  __alloc_pages_nodemask+0x2e2/0x330
            [22581.850575]  kmalloc_order+0x28/0x90
            [22581.851486]  kmalloc_order_trace+0x1d/0xb0
            [22581.852635]  __kmalloc+0x25c/0x2d0
            [22581.853670]  virtqueue_add+0x493/0xc70
            [22581.854731]  ? finish_wait+0x80/0x80
            [22581.855904]  virtqueue_add_sgs+0x80/0xa0
            [22581.857186]  __virtscsi_add_cmd+0x148/0x270 [virtio_scsi]
            [22581.858910]  ? scsi_alloc_sgtables+0x84/0x1c0
            [22581.860249]  virtscsi_add_cmd+0x38/0xa0 [virtio_scsi]
            [22581.861689]  virtscsi_queuecommand+0x186/0x2d0 [virtio_scsi]
            [22581.863242]  scsi_queue_rq+0x512/0xb10
            [22581.864494]  __blk_mq_try_issue_directly+0x163/0x200
            [22581.866006]  blk_mq_request_issue_directly+0x4e/0xb0
            [22581.867558]  blk_mq_try_issue_list_directly+0x62/0x100
            [22581.869078]  blk_mq_sched_insert_requests+0xa4/0xf0
            [22581.870562]  blk_mq_flush_plug_list+0x135/0x220
            [22581.871979]  blk_flush_plug_list+0xd7/0x100
            [22581.873287]  blk_finish_plug+0x25/0x36
            [22581.874491]  osd_do_bio+0x7f1/0xc70 [osd_ldiskfs]
            [22581.876012]  ? ldiskfs_map_blocks+0x607/0x610 [ldiskfs]
            [22581.877596]  ? osd_ldiskfs_map_inode_pages+0x2c2/0xb20 [osd_ldiskfs]
            [22581.879360]  osd_ldiskfs_map_inode_pages+0x2c2/0xb20 [osd_ldiskfs]
            [22581.881043]  osd_write_commit+0x4c6/0x990 [osd_ldiskfs]
            [22581.882571]  ofd_commitrw_write+0x9df/0x1b00 [ofd]
            [22581.884021]  ofd_commitrw+0x5dc/0xd80 [ofd]
            [22581.885272]  ? obd_commitrw+0x1b6/0x370 [ptlrpc]
            [22581.887103]  obd_commitrw+0x1b6/0x370 [ptlrpc]
            [22581.888655]  tgt_brw_write+0xf54/0x1f90 [ptlrpc]
            [22581.890213]  ? _cond_resched+0x15/0x30
            [22581.891400]  ? flush_work+0x42/0x1d0
            [22581.892500]  ? internal_add_timer+0x42/0x70
            [22581.893720]  ? _cond_resched+0x15/0x30
            [22581.894884]  ? mutex_lock+0xe/0x30
            [22581.895949]  tgt_request_handle+0x3f4/0x1a30 [ptlrpc]
            [22581.897495]  ? ptlrpc_update_export_timer+0x3d/0x500 [ptlrpc]
            [22581.899225]  ptlrpc_server_handle_request+0x2aa/0xcf0 [ptlrpc]
            [22581.900939]  ? lprocfs_counter_add+0x10e/0x180 [obdclass]
            [22581.902509]  ptlrpc_main+0xc9e/0x15c0 [ptlrpc]
            [22581.903979]  ? ptlrpc_wait_event+0x5b0/0x5b0 [ptlrpc]
            [22581.904744] ll_ost_io01_034: page allocation failure: order:3, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
            [22581.905528]  kthread+0x134/0x150
            [22581.907356] ll_ost_io03_029: page allocation failure: order:3, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
            [22581.909345] ll_ost_io01_064: page allocation failure: order:3, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
            [22581.911672]  ? set_kthread_struct+0x50/0x50
            [22581.912135] ll_ost_io02_073: page allocation failure: order:3, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
            [22581.917402]  ret_from_fork+0x1f/0x40
            [22581.918250] Mem-Info:
            [22581.918251] CPU: 12 PID: 127683 Comm: ll_ost_io03_029 Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.16.1.el8_lustre.x86_64 #1
            [22581.918886] active_anon:9586 inactive_anon:235852 isolated_anon:0
                            active_file:26214948 inactive_file:9997372 isolated_file:34
                            unevictable:0 dirty:172 writeback:0
                            slab_reclaimable:462546 slab_unreclaimable:544066
                            mapped:29737 shmem:197 pagetables:3250 bounce:0
                            free:558197 free_pcp:4297 free_cma:0
            [22581.921880] Hardware name: DDN SFA18KXE, BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
            [22581.928597] Node 0 active_anon:38344kB inactive_anon:943408kB active_file:104869260kB inactive_file:39989456kB unevictable:0kB isolated(anon):0kB isolated(file):620kB mapped:118948kB dirty:688kB writeback:0kB shmem:788kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:34656kB pagetables:13000kB all_unreclaimable? no
            [22581.930768] Call Trace:
            [22581.936536] Node 0 DMA free:11264kB min:4kB low:16kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
            [22581.937430]  dump_stack+0x41/0x60
            [22581.941697] lowmem_reserve[]: 0 913 150007 150007 150007
            [22581.942766]  warn_alloc.cold.127+0x7b/0x108
            [22581.943921] Node 0 DMA32 free:596536kB min:300kB low:1232kB high:2164kB active_anon:0kB inactive_anon:83612kB active_file:176000kB inactive_file:60624kB unevictable:0kB writepending:0kB present:2080608kB managed:966496kB mlocked:0kB bounce:0kB free_pcp:15096kB local_pcp:372kB free_cma:0kB
            [22581.946334]  __alloc_pages_slowpath+0xcb2/0xcd0
            [22581.951399] lowmem_reserve[]: 0 0 149093 149093 149093
            [22581.952769]  __alloc_pages_nodemask+0x2e2/0x330
            [22581.954004] Node 0 Normal free:1694996kB min:49268kB low:201940kB high:354612kB active_anon:38344kB inactive_anon:859796kB active_file:104623944kB inactive_file:39919140kB unevictable:0kB writepending:688kB present:155189248kB managed:152680592kB mlocked:0kB bounce:0kB free_pcp:2208kB local_pcp:24kB free_cma:0kB
            [22581.955501]  kmalloc_order+0x28/0x90
            [22581.961168] lowmem_reserve[]: 0 0 0 0 0
            [22581.962424]  kmalloc_order_trace+0x1d/0xb0
            [22581.963433] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 
            [22581.964729]  __kmalloc+0x25c/0x2d0
            [22581.964734] 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
            [22581.967434]  virtqueue_add+0x493/0xc70
            [22581.968864] Node 0 DMA32: 8556*4kB (UME) 4961*8kB (UME) 2066*16kB (ME) 
            [22581.970182]  virtqueue_add_sgs+0x80/0xa0
            [22581.970186] 725*32kB (UME) 311*64kB (UMEH) 122*128kB (UME) 73*256kB (UM) 51*512kB (UM) 17*1024kB (ME) 18*2048kB 
            [22581.973014]  __virtscsi_add_cmd+0x148/0x270 [virtio_scsi]
            [22581.973019] (UM) 81*4096kB (UM) = 596536kB
            [22581.976714]  ? scsi_alloc_sgtables+0x84/0x1c0
            [22581.977790] Node 0 Normal: 279769*4kB (UMEH) 108102*8kB (UMEH) 9366*16kB (UMH) 12*32kB 
            [22581.979216]  virtscsi_add_cmd+0x38/0xa0 [virtio_scsi]
            [22581.980930] (UMEH) 6*64kB (EH) 4*128kB (E) 2*256kB (E) 1*512kB (E) 0*1024kB 0*2048kB 0*4096kB = 2136052kB
            [22581.982477]  virtscsi_queuecommand+0x186/0x2d0 [virtio_scsi]
            [22581.984429] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
            [22581.986082]  scsi_queue_rq+0x512/0xb10
            [22581.987909] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
            [22581.989257]  __blk_mq_try_issue_directly+0x163/0x200
            [22581.991057] 36019216 total pagecache pages
            [22581.992606]  blk_mq_request_issue_directly+0x4e/0xb0
            [22581.993679] 4589 pages in swap cache
            [22581.995184]  blk_mq_try_issue_list_directly+0x62/0x100
            [22581.996150] Swap cache stats: add 236588, delete 231998, find 24220/31008
            [22581.997572]  blk_mq_sched_insert_requests+0xa4/0xf0
            [22581.999028] Free swap  = 10341124kB
            [22582.000489]  blk_mq_flush_plug_list+0x135/0x220
            [22582.001436] Total swap = 11075580kB
            [22582.002794]  blk_flush_plug_list+0xd7/0x100
            [22582.003723] 39321462 pages RAM
            [22582.005003]  blk_finish_plug+0x25/0x36
            [22582.005933] 0 pages HighMem/MovableOnly
            [22582.007125]  osd_do_bio+0x7f1/0xc70 [osd_ldiskfs]
            [22582.008103] 905850 pages reserved
            [22582.009447]  ? ldiskfs_map_blocks+0x607/0x610 [ldiskfs]
            [22582.010354] 0 pages hwpoisoned
            [22582.011810]  ? osd_ldiskfs_map_inode_pages+0x2c2/0xb20 [osd_ldiskfs]
            [22582.014521]  osd_ldiskfs_map_inode_pages+0x2c2/0xb20 [osd_ldiskfs]
            [22582.016448]  osd_write_commit+0x4c6/0x990 [osd_ldiskfs]
            [22582.017985]  ofd_commitrw_write+0x9df/0x1b00 [ofd]
            [22582.019536]  ofd_commitrw+0x5dc/0xd80 [ofd]
            [22582.020816]  ? obd_commitrw+0x1b6/0x370 [ptlrpc]
            [22582.022466]  obd_commitrw+0x1b6/0x370 [ptlrpc]
            [22582.023992]  tgt_brw_write+0xf54/0x1f90 [ptlrpc]
            [22582.025495]  ? _cond_resched+0x15/0x30
            [22582.026671]  ? flush_work+0x42/0x1d0
            [22582.027752]  ? internal_add_timer+0x42/0x70
            [22582.028976]  ? _cond_resched+0x15/0x30
            [22582.030064]  ? mutex_lock+0xe/0x30
            [22582.031072]  tgt_request_handle+0x3f4/0x1a30 [ptlrpc]
            [22582.032636]  ? ptlrpc_update_export_timer+0x3d/0x500 [ptlrpc]
            [22582.034103]  ptlrpc_server_handle_request+0x2aa/0xcf0 [ptlrpc]
            [22582.035813]  ? lprocfs_counter_add+0x10e/0x180 [obdclass]
            [22582.037335]  ptlrpc_main+0xc9e/0x15c0 [ptlrpc]
            [22582.038763]  ? ptlrpc_wait_event+0x5b0/0x5b0 [ptlrpc]
            [22582.040292]  kthread+0x134/0x150
            [22582.041227]  ? set_kthread_struct+0x50/0x50
            [22582.042376]  ret_from_fork+0x1f/0x40
            [22582.043397] CPU: 9 PID: 127537 Comm: ll_ost_io02_073 Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.16.1.el8_lustre.x86_64 #1
            [22582.045848] Hardware name: DDN SFA18KXE, BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
            [22582.047678] Call Trace:
            

            People

Assignee: dongyang Dongyang Li
Reporter: sarah Sarah Liu
Votes: 0
Watchers: 11

              Dates

                Created:
                Updated: