Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.6
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a1f8a33a-6478-4a6b-b792-2466443c2fe8

      test_iozone failed with the following error:

      O_DIRECT feature enabled
      	Include fsync in write timing
      	>>> I/O Diagnostic mode enabled. <<<
      	Performance measurements are invalid in this mode.
      	Record Size 512 kB
      	File size set to 5694480 kB
      	Command line used: iozone -I -i 0 -i 1 -i 2 -e -+d -r 512 -s 5694480 -f /mnt/lustre/d0.iozone/iozone.odir
      	Output is in kBytes/sec
      	Time Resolution = 0.000001 seconds.
      	Processor cache size set to 1024 kBytes.
      	Processor cache line size set to 32 bytes.
      	File stride size set to 17 * record size.
                                                                    random    random     bkwd    record    stride                                    
                    kB  reclen    write  rewrite    read    reread    read     write     read   rewrite      read   fwrite frewrite    fread  freread
               5694480     512
      
      Timeout occurred after 408 minutes, last suite running was sanity-benchmark
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4581 - 5.14.0-427.31.1.el9_4.x86_64
      servers: https://build.whamcloud.com/job/lustre-master/4581 - 5.14.0-427.31.1_lustre.el9.x86_64

      <<Please provide additional information about the failure here>>
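
      For reference, the failing step should be reproducible by hand with the same iozone invocation that the test logged above. Everything except the directory creation is copied from the command line in the output; the mkdir is an assumption about the test's working directory.

        # Run on a Lustre client with /mnt/lustre mounted (paths as in the test log).
        mkdir -p /mnt/lustre/d0.iozone
        # -I: O_DIRECT, -e: include fsync in write timing, -+d: I/O diagnostic mode,
        # -r 512: 512 kB record size, -s 5694480: 5694480 kB file size.
        iozone -I -i 0 -i 1 -i 2 -e -+d -r 512 -s 5694480 \
            -f /mnt/lustre/d0.iozone/iozone.odir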

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-benchmark test_iozone - Timeout occurred after 408 minutes, last suite running was sanity-benchmark

      Attachments

        Issue Links

          Activity

            [LU-18296] sanity-benchmark test_iozone hung

            bzzz Alex Zhuravlev added a comment -

            > Alex Zhuravlev is this the same issue you were describing on the call today? Client stack ending in balance_dirty_pages()?

            yes

            bzzz Alex Zhuravlev added a comment -

            There are two different symptoms. In one, the Lustre client is stuck:

            [Wed Mar 12 03:52:17 2025] task:iozone          state:I stack:0     pid:179119 ppid:176185 flags:0x00000080
            [Wed Mar 12 03:52:17 2025] Call Trace:
            [Wed Mar 12 03:52:17 2025]  __schedule+0x2d1/0x870
            [Wed Mar 12 03:52:17 2025]  ? cl_io_iter_fini+0x58/0x140 [obdclass]
            [Wed Mar 12 03:52:17 2025]  schedule+0x55/0xf0
            [Wed Mar 12 03:52:17 2025]  cl_sync_io_wait+0x215/0x2f0 [obdclass]
            [Wed Mar 12 03:52:17 2025]  ? finish_wait+0x80/0x80
            [Wed Mar 12 03:52:17 2025]  cl_sync_io_wait_recycle+0x27/0x40 [obdclass]
            [Wed Mar 12 03:52:17 2025]  ll_file_io_generic+0x53e/0xdf0 [lustre]
            [Wed Mar 12 03:52:17 2025]  ll_file_write_iter+0x62e/0x880 [lustre]
            [Wed Mar 12 03:52:17 2025]  ? apic_timer_interrupt+0xa/0x20
            [Wed Mar 12 03:52:17 2025]  new_sync_write+0x112/0x160
            [Wed Mar 12 03:52:17 2025]  vfs_write+0xa5/0x1b0
            [Wed Mar 12 03:52:17 2025]  ksys_write+0x4f/0xb0
            [Wed Mar 12 03:52:17 2025]  do_syscall_64+0x5b/0x1a0
            

            In the other, it is the kernel writeback path that is stuck:

            [Sat Feb 22 12:53:15 2025] task:iozone          state:D stack:0     pid:176266 ppid:176211 flags:0x00004002
            [Sat Feb 22 12:53:15 2025] Call Trace:
            [Sat Feb 22 12:53:15 2025]  <TASK>
            [Sat Feb 22 12:53:15 2025]  __schedule+0x21b/0x550
            [Sat Feb 22 12:53:15 2025]  ? _raw_spin_unlock_irqrestore+0xa/0x30
            [Sat Feb 22 12:53:15 2025]  schedule+0x2d/0x70
            [Sat Feb 22 12:53:15 2025]  schedule_timeout+0x88/0x160
            [Sat Feb 22 12:53:15 2025]  ? __pfx_process_timeout+0x10/0x10
            [Sat Feb 22 12:53:15 2025]  io_schedule_timeout+0x4c/0x80
            [Sat Feb 22 12:53:15 2025]  balance_dirty_pages+0x2fc/0xcf0
            [Sat Feb 22 12:53:15 2025]  balance_dirty_pages_ratelimited_flags+0x27a/0x380
            [Sat Feb 22 12:53:15 2025]  generic_perform_write+0x141/0x200
            [Sat Feb 22 12:53:15 2025]  __generic_file_write_iter+0xe5/0x1b0
            [Sat Feb 22 12:53:15 2025]  vvp_io_write_start+0x44f/0xcb0 [lustre]
            [Sat Feb 22 12:53:15 2025]  cl_io_start+0x61/0x130 [obdclass]
            [Sat Feb 22 12:53:15 2025]  cl_io_loop+0x99/0x220 [obdclass]
            [Sat Feb 22 12:53:15 2025]  ll_file_io_generic+0x53d/0xef0 [lustre]
            [Sat Feb 22 12:53:15 2025]  ? lov_io_fini+0x290/0x420 [lov]
            [Sat Feb 22 12:53:15 2025]  do_file_write_iter+0x4e3/0x770 [lustre]
            [Sat Feb 22 12:53:15 2025]  vfs_write+0x2ce/0x410
            

            Both the 4.18 and 5.14 kernels seem to be affected, and there is nothing interesting on the OST/MDS side in the logs.
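
            For the next occurrence, a minimal diagnostic sketch (assuming root on the stuck client; these are standard procfs/lctl interfaces, not commands taken from the logs above):

            cat /proc/$(pidof -s iozone)/stack     # kernel stack of the stalled writer
            echo w > /proc/sysrq-trigger           # dump all blocked (D-state) tasks to dmesg
            lctl dk /tmp/lustre-debug.log          # flush the Lustre debug buffer for this window
            lctl get_param osc.*.import            # per-OST import state and RPCs in flight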


            adilger Andreas Dilger added a comment -

            bzzz is this the same issue you were describing on the call today? Client stack ending in balance_dirty_pages()?
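
            Since one of the client stacks ends in balance_dirty_pages(), a rough way to check the writeback state while iozone is stalled (the tunables below are standard kernel and Lustre parameters, not values taken from this failure):

            grep -E 'nr_dirty|nr_writeback' /proc/vmstat     # global dirty/writeback page counts
            cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio
            lctl get_param osc.*.max_dirty_mb                # per-OSC dirty cache limit
            lctl get_param llite.*.max_cached_mb             # client-wide cached-data limit
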
            yujian Jian Yu added a comment - Lustre 2.15.6 RC1: https://testing.whamcloud.com/test_sets/bc81ea51-e142-45f7-a81a-7935612ef5fa
            yujian Jian Yu added a comment - +1 on Lustre b2_15 branch: https://testing.whamcloud.com/test_sets/a73e5900-88e5-4588-b96e-b724b1ae835f
            yujian Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/a77bfd4c-8f42-4c4c-a95e-0e359d70962d
            emoly.liu Emoly Liu added a comment -

            Similar test_iozone failures have been occurring at around a 20% rate for more than two years. One is a crash caused by OOM (EX-7304), and the other is a timeout caused by the error "lustre-OST000x-osc-ffff94260fc92000: disconnect after 21s idle".

            It seems both of them are related to the testing environment.
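
            If the "disconnect after 21s idle" message is suspected as the trigger, one possible isolation test is to disable the OSC idle disconnect on the client for a run. idle_timeout is a standard osc parameter; this is only a sketch of the check, not a confirmed root cause.

            lctl get_param osc.*.idle_timeout      # idle-disconnect interval in seconds (typically 20)
            lctl set_param osc.*.idle_timeout=0    # 0 disables idle disconnect, for testing only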

            lixi_wc Li Xi added a comment -

            emoly.liu Would you please take a look?


            People

              Assignee: emoly.liu Emoly Liu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 7
