Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.6
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a1f8a33a-6478-4a6b-b792-2466443c2fe8

      test_iozone failed with the following error:

      O_DIRECT feature enabled
      	Include fsync in write timing
      	>>> I/O Diagnostic mode enabled. <<<
      	Performance measurements are invalid in this mode.
      	Record Size 512 kB
      	File size set to 5694480 kB
      	Command line used: iozone -I -i 0 -i 1 -i 2 -e -+d -r 512 -s 5694480 -f /mnt/lustre/d0.iozone/iozone.odir
      	Output is in kBytes/sec
      	Time Resolution = 0.000001 seconds.
      	Processor cache size set to 1024 kBytes.
      	Processor cache line size set to 32 bytes.
      	File stride size set to 17 * record size.
                                                                    random    random     bkwd    record    stride                                    
                    kB  reclen    write  rewrite    read    reread    read     write     read   rewrite      read   fwrite frewrite    fread  freread
               5694480     512
      
      Timeout occurred after 408 minutes, last suite running was sanity-benchmark
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4581 - 5.14.0-427.31.1.el9_4.x86_64
      servers: https://build.whamcloud.com/job/lustre-master/4581 - 5.14.0-427.31.1_lustre.el9.x86_64

      <<Please provide additional information about the failure here>>
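
      For reference, the failing step should be reproducible by hand with the same iozone invocation that the test logged above. Everything except the directory creation is copied from the command line in the output; the mkdir is an assumption about the test's working directory.

        # Run on a Lustre client with /mnt/lustre mounted (paths as in the test log).
        mkdir -p /mnt/lustre/d0.iozone
        # -I: O_DIRECT, -e: include fsync in write timing, -+d: I/O diagnostic mode,
        # -r 512: 512 kB record size, -s 5694480: 5694480 kB file size.
        iozone -I -i 0 -i 1 -i 2 -e -+d -r 512 -s 5694480 \
            -f /mnt/lustre/d0.iozone/iozone.odir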

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-benchmark test_iozone - Timeout occurred after 408 minutes, last suite running was sanity-benchmark

      Attachments

        Issue Links

          Activity

            [LU-18296] sanity-benchmark test_iozone hung

            bzzz Alex Zhuravlev added a comment -

            > Alex Zhuravlev is this the same issue you were describing on the call today? Client stack ending in balance_dirty_pages()?

            yes

            bzzz Alex Zhuravlev added a comment -

            There are two different symptoms. In one, the Lustre client is stuck:

            [Wed Mar 12 03:52:17 2025] task:iozone          state:I stack:0     pid:179119 ppid:176185 flags:0x00000080
            [Wed Mar 12 03:52:17 2025] Call Trace:
            [Wed Mar 12 03:52:17 2025]  __schedule+0x2d1/0x870
            [Wed Mar 12 03:52:17 2025]  ? cl_io_iter_fini+0x58/0x140 [obdclass]
            [Wed Mar 12 03:52:17 2025]  schedule+0x55/0xf0
            [Wed Mar 12 03:52:17 2025]  cl_sync_io_wait+0x215/0x2f0 [obdclass]
            [Wed Mar 12 03:52:17 2025]  ? finish_wait+0x80/0x80
            [Wed Mar 12 03:52:17 2025]  cl_sync_io_wait_recycle+0x27/0x40 [obdclass]
            [Wed Mar 12 03:52:17 2025]  ll_file_io_generic+0x53e/0xdf0 [lustre]
            [Wed Mar 12 03:52:17 2025]  ll_file_write_iter+0x62e/0x880 [lustre]
            [Wed Mar 12 03:52:17 2025]  ? apic_timer_interrupt+0xa/0x20
            [Wed Mar 12 03:52:17 2025]  new_sync_write+0x112/0x160
            [Wed Mar 12 03:52:17 2025]  vfs_write+0xa5/0x1b0
            [Wed Mar 12 03:52:17 2025]  ksys_write+0x4f/0xb0
            [Wed Mar 12 03:52:17 2025]  do_syscall_64+0x5b/0x1a0
            

            In the other, it is the kernel writeback path that is stuck:

            [Sat Feb 22 12:53:15 2025] task:iozone          state:D stack:0     pid:176266 ppid:176211 flags:0x00004002
            [Sat Feb 22 12:53:15 2025] Call Trace:
            [Sat Feb 22 12:53:15 2025]  <TASK>
            [Sat Feb 22 12:53:15 2025]  __schedule+0x21b/0x550
            [Sat Feb 22 12:53:15 2025]  ? _raw_spin_unlock_irqrestore+0xa/0x30
            [Sat Feb 22 12:53:15 2025]  schedule+0x2d/0x70
            [Sat Feb 22 12:53:15 2025]  schedule_timeout+0x88/0x160
            [Sat Feb 22 12:53:15 2025]  ? __pfx_process_timeout+0x10/0x10
            [Sat Feb 22 12:53:15 2025]  io_schedule_timeout+0x4c/0x80
            [Sat Feb 22 12:53:15 2025]  balance_dirty_pages+0x2fc/0xcf0
            [Sat Feb 22 12:53:15 2025]  balance_dirty_pages_ratelimited_flags+0x27a/0x380
            [Sat Feb 22 12:53:15 2025]  generic_perform_write+0x141/0x200
            [Sat Feb 22 12:53:15 2025]  __generic_file_write_iter+0xe5/0x1b0
            [Sat Feb 22 12:53:15 2025]  vvp_io_write_start+0x44f/0xcb0 [lustre]
            [Sat Feb 22 12:53:15 2025]  cl_io_start+0x61/0x130 [obdclass]
            [Sat Feb 22 12:53:15 2025]  cl_io_loop+0x99/0x220 [obdclass]
            [Sat Feb 22 12:53:15 2025]  ll_file_io_generic+0x53d/0xef0 [lustre]
            [Sat Feb 22 12:53:15 2025]  ? lov_io_fini+0x290/0x420 [lov]
            [Sat Feb 22 12:53:15 2025]  do_file_write_iter+0x4e3/0x770 [lustre]
            [Sat Feb 22 12:53:15 2025]  vfs_write+0x2ce/0x410
            

            Both the 4.18 and 5.14 kernels seem to be affected, and there is nothing interesting on the OST/MDS side in the logs.
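
            For the next occurrence, a minimal diagnostic sketch (assuming root on the stuck client; these are standard procfs/lctl interfaces, not commands taken from the logs above):

            cat /proc/$(pidof -s iozone)/stack     # kernel stack of the stalled writer
            echo w > /proc/sysrq-trigger           # dump all blocked (D-state) tasks to dmesg
            lctl dk /tmp/lustre-debug.log          # flush the Lustre debug buffer for this window
            lctl get_param osc.*.import            # per-OST import state and RPCs in flight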


            adilger Andreas Dilger added a comment -

            bzzz is this the same issue you were describing on the call today? Client stack ending in balance_dirty_pages()?
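
            Since one of the client stacks ends in balance_dirty_pages(), a rough way to check the writeback state while iozone is stalled (the tunables below are standard kernel and Lustre parameters, not values taken from this failure):

            grep -E 'nr_dirty|nr_writeback' /proc/vmstat     # global dirty/writeback page counts
            cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio
            lctl get_param osc.*.max_dirty_mb                # per-OSC dirty cache limit
            lctl get_param llite.*.max_cached_mb             # client-wide cached-data limit
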
            yujian Jian Yu added a comment - Lustre 2.15.6 RC1: https://testing.whamcloud.com/test_sets/bc81ea51-e142-45f7-a81a-7935612ef5fa
            yujian Jian Yu added a comment - +1 on Lustre b2_15 branch: https://testing.whamcloud.com/test_sets/a73e5900-88e5-4588-b96e-b724b1ae835f
            yujian Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/a77bfd4c-8f42-4c4c-a95e-0e359d70962d
            emoly.liu Emoly Liu added a comment -

            Similar test_iozone failures have been occurring at around a 20% rate for more than two years. One is a crash caused by OOM (EX-7304), and the other is a timeout caused by the error "lustre-OST000x-osc-ffff94260fc92000: disconnect after 21s idle".

            It seems both of them are related to the testing environment.
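
            If the "disconnect after 21s idle" message is suspected as the trigger, one possible isolation test is to disable the OSC idle disconnect on the client for a run. idle_timeout is a standard osc parameter; this is only a sketch of the check, not a confirmed root cause.

            lctl get_param osc.*.idle_timeout      # idle-disconnect interval in seconds (typically 20)
            lctl set_param osc.*.idle_timeout=0    # 0 disables idle disconnect, for testing only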

            lixi_wc Li Xi added a comment -

            emoly.liu Would you please take a look?


            People

              Assignee: emoly.liu Emoly Liu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 7
