LU-17361

lustre-rsync-test test_2a: Timeout occurred after 97 minutes, last suite running was lustre-rsync-test

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.5, Lustre 2.15.6

    Description

      This issue was created by maloo for Arshad <arshad.hussain@aeoncomputing.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8d022068-68bb-4bbe-8686-e7928146c95e

      test_2a failed with the following error:

      Timeout occurred after 97 minutes, last suite running was lustre-rsync-test
      

       

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/100812 - 4.18.0-477.27.1.el8_8.x86_64 (x86_64/RHEL 8.8/2.15.59.54)
      servers: https://build.whamcloud.com/job/lustre-reviews/100812 - 4.18.0-477.27.1.el8_lustre.x86_64 (ldiskfs/x86_64/RHEL 8.8/2.15.59.54)

      Client logs

      == lustre-rsync-test test 2a: Replicate files created by dbench. ===== 18:06:53 (1702404413)
      CMD: trevis-127vm1 /usr/sbin/lctl get_param mdd.lustre-MDT0000.changelog_mask -n
      CMD: trevis-127vm1 /usr/sbin/lctl set_param mdd.lustre-MDT0000.changelog_mask=+hsm
      mdd.lustre-MDT0000.changelog_mask=+hsm
      CMD: trevis-127vm1 /usr/sbin/lctl --device lustre-MDT0000 changelog_register -n
      CMD: trevis-127vm2 /usr/sbin/lctl get_param mdd.lustre-MDT0001.changelog_mask -n
      CMD: trevis-127vm2 /usr/sbin/lctl set_param mdd.lustre-MDT0001.changelog_mask=+hsm
      mdd.lustre-MDT0001.changelog_mask=+hsm
      CMD: trevis-127vm2 /usr/sbin/lctl --device lustre-MDT0001 changelog_register -n
      CMD: trevis-127vm1 /usr/sbin/lctl get_param mdd.lustre-MDT0002.changelog_mask -n
      CMD: trevis-127vm1 /usr/sbin/lctl set_param mdd.lustre-MDT0002.changelog_mask=+hsm
      mdd.lustre-MDT0002.changelog_mask=+hsm
      CMD: trevis-127vm1 /usr/sbin/lctl --device lustre-MDT0002 changelog_register -n
      CMD: trevis-127vm2 /usr/sbin/lctl get_param mdd.lustre-MDT0003.changelog_mask -n
      CMD: trevis-127vm2 /usr/sbin/lctl set_param mdd.lustre-MDT0003.changelog_mask=+hsm
      mdd.lustre-MDT0003.changelog_mask=+hsm
      CMD: trevis-127vm2 /usr/sbin/lctl --device lustre-MDT0003 changelog_register -n
      Registered 4 changelog users: 'cl3 cl3 cl3 cl3'
      lustre-MDT0000: Registered changelog user cl3
      looking for dbench program
      /usr/bin/dbench
      found dbench client file /usr/share/dbench/client.txt
      '/usr/share/dbench/client.txt' -> 'client.txt'
      copying required dbench files to chroot directory /mnt/lustre/d2a.lustre-rsync-test
      '/usr/bin/dbench' -> 'dbench'
      lib64/libpopt.so.0
      lib64/libc.so.6
      lib64/ld-linux-x86-64.so.2
      running 'dbench 2 -t 60' in /mnt/lustre/d2a.lustre-rsync-test at Tue Dec 12 18:06:58 UTC 2023
      waiting for dbench pid 27630
      dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
      Running for 60 seconds with load 'client.txt' and minimum warmup 12 secs
      failed to create barrier semaphore 
      0 of 2 processes prepared for launch   0 sec
      2 of 2 processes prepared for launch   0 sec
      releasing clients
         2       670    42.34 MB/sec  warmup   1 sec  latency 243.676 ms
         2       761    21.39 MB/sec  warmup   2 sec  latency 221.017 ms
         2       946    14.56 MB/sec  warmup   3 sec  latency 191.389 ms
      :
         2     34715     5.27 MB/sec  execute  55 sec  latency 241.709 ms
         2     35719     5.47 MB/sec  execute  56 sec  latency 119.864 ms
         2     36221     5.41 MB/sec  execute  57 sec  latency 173.668 ms
         2     36352     5.33 MB/sec  execute  58 sec  latency 132.522 ms
         2     36769     5.33 MB/sec  execute  59 sec  latency 160.981 ms
         2  cleanup  60 sec
         0  cleanup  60 sec
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX      10567     2.825   123.841
       Close           7710     0.642    24.838
       Rename           440     3.282    41.200
       Unlink          2156     0.976    22.880
       Qpathinfo       9492     0.636    38.188
       Qfileinfo       1629     0.076     3.208
       Qfsinfo         1719     0.018     1.539
       Sfileinfo        868     2.031    39.754
       Find            3625     0.296    41.384
       WriteX          5127     0.449    36.828
       ReadX          16023     0.009     0.055
       LockX             32     0.440     0.998
       UnlockX           32     0.504     1.832
       Flush            745    93.845   969.365
      Throughput 5.32869 MB/sec  2 clients  2 procs  max_latency=969.378 ms
      stopping dbench in /mnt/lustre/d2a.lustre-rsync-test at Tue Dec 12 18:08:10 UTC 2023 with return code 0
      clean dbench files in /mnt/lustre/d2a.lustre-rsync-test
      /mnt/lustre/d2a.lustre-rsync-test /mnt/lustre/d2a.lustre-rsync-test
      removed directory 'clients/client0/~dmtmp/PARADOX'
      removed directory 'clients/client0/~dmtmp/PM'
      :
      removed directory 'clients/client1/~dmtmp/PWRPNT'
      removed directory 'clients/client1/~dmtmp'
      removed directory 'clients/client1'
      removed directory 'clients'
      removed 'client.txt'
      removed 'dbench'
      removed 'lib64/libpopt.so.0'
      removed 'lib64/libc.so.6'
      removed 'lib64/ld-linux-x86-64.so.2'
      /mnt/lustre/d2a.lustre-rsync-test
      dbench successfully finished
      

       

      ======= 18:06:53 (1702404413)
      [  302.317896] Lustre: DEBUG MARKER: == lustre-rsync-test test 2a: Replicate files created by dbench. ==== 18:06:53 (1702404413)
      [  305.333993] Lustre: lustre-OST0005-osc-ffff895045192000: disconnect after 24s idle
      [  305.335802] Lustre: Skipped 1 previous similar message
      [  407.730814] Lustre: lustre-OST0000-osc-ffff895045192000: disconnect after 24s idle
      [  412.850667] Lustre: lustre-OST0006-osc-ffff895045192000: disconnect after 23s idle
      [  412.852206] Lustre: Skipped 6 previous similar messages
      [ 5749.820145] sysrq: SysRq : Changing Loglevel
      [ 5749.821211] sysrq: Loglevel set to 8
      [ 5749.823561] sysrq: SysRq : Show State
      :
      [ 5752.514580] task:lustre_rsync    state:D stack:    0 pid:27687 ppid: 27182 flags:0x00004080
      [ 5752.516178] Call Trace:
      [ 5752.516719]  __schedule+0x2d1/0x870
      [ 5752.518361]  schedule+0x55/0xf0
      [ 5752.519019]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
      [ 5752.520751]  ext4_sync_file+0x365/0x3b0 [ext4]
      [ 5752.521695]  ext4_file_write_iter+0x1a6/0x3e0 [ext4]
      [ 5752.522710]  new_sync_write+0x112/0x160
      [ 5752.523537]  vfs_write+0xa5/0x1b0
      [ 5752.524238]  ksys_write+0x4f/0xb0
      [ 5752.524948]  do_syscall_64+0x5b/0x1b0
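      For context, test 2a replicates the dbench-created tree from the Lustre mount to target directories on the client's local filesystem using lustre_rsync, driven by the changelog users registered above; the targets and the status log live on local ext4, which matches the ext4_sync_file frames in the lustre_rsync stack. A rough sketch of the kind of invocation involved (the MDT name and changelog user are taken from the log above; the paths are illustrative, not copied from the test script):

      # roughly what the test does after dbench completes (illustrative paths);
      # the targets and the status log are on the client's local ext4 filesystem
      lustre_rsync --source=/mnt/lustre \
                   --target=/tmp/target --target=/tmp/target2 \
                   --mdt=lustre-MDT0000 --user=cl3 \
                   --statuslog=/tmp/lustre_rsync.log --verbose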
      

      The jbd2 thread is stuck on IO submission to the VM storage (vda):

      [ 5750.277661] task:jbd2/vda1-8     state:D stack:    0 pid:  368 ppid:     2 flags:0x80004000
      [ 5750.279261] Call Trace:
      [ 5750.279801]  __schedule+0x2d1/0x870
      [ 5750.281340]  schedule+0x55/0xf0
      [ 5750.282004]  io_schedule+0x12/0x40
      [ 5750.282714]  bit_wait_io+0xd/0x50
      [ 5750.283418]  __wait_on_bit+0x2d/0x90
      [ 5750.284155]  out_of_line_wait_on_bit+0x91/0xb0
      [ 5750.285919]  jbd2_journal_commit_transaction+0x14db/0x19f0 [jbd2]
      [ 5750.288069]  kjournald2+0xbd/0x270 [jbd2]
      [ 5750.290560]  kthread+0x134/0x150
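      If this state can be caught live on the client VM, the journal and virtual-disk state can be inspected directly; a small diagnostic sketch (the jbd2/vda1-8 journal name and vda device come from the stacks above; these are generic kernel interfaces, not part of the Lustre test framework):

      # commit statistics for the stuck journal thread (average commit time, etc.)
      cat /proc/fs/jbd2/vda1-8/info
      # requests currently in flight to the virtual disk (9th field of the stat file)
      cat /sys/block/vda/stat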
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      lustre-rsync-test test_2a - Timeout occurred after 97 minutes, last suite running was lustre-rsync-test

      Activity

            [LU-17361] lustre-rsync-test test_2a: Timeout occurred after 97 minutes, last suite running was lustre-rsync-test

            From the stack traces it looks like the lustre_rsync and auditd processes are stuck in {{ext4_sync_file->jbd2_log_wait_commit}}, so this is writing to the local VM filesystem.

            The jbd2 thread is stuck in IO wait on the underlying storage (vda1 device), and doesn't appear to relate to Lustre at all:

            [ 5750.277661] task:jbd2/vda1-8     state:D stack:    0 pid:  368 ppid:     2 flags:0x80004000
            [ 5750.279261] Call Trace:
            [ 5750.279801]  __schedule+0x2d1/0x870
            [ 5750.281340]  schedule+0x55/0xf0
            [ 5750.282004]  io_schedule+0x12/0x40
            [ 5750.282714]  bit_wait_io+0xd/0x50
            [ 5750.283418]  __wait_on_bit+0x2d/0x90
            [ 5750.284155]  out_of_line_wait_on_bit+0x91/0xb0
            [ 5750.285919]  jbd2_journal_commit_transaction+0x14db/0x19f0 [jbd2]
            [ 5750.288069]  kjournald2+0xbd/0x270 [jbd2]
            [ 5750.290560]  kthread+0x134/0x150
            

            Of course there might be some hidden connection, such as the Lustre client using too much memory in the previous subtest, but it seems equally or more likely that the VM host is overloaded and syncing a lot of dbench data on the host filesystem is simply too slow; at this point Lustre isn't even being tested yet.
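            Purely as a diagnostic sketch (not something the test framework currently does), a few generic checks on the client VM while the test is stuck would help distinguish the memory-pressure and slow-local-I/O theories:

            # how much dirty/writeback data is queued for the local disk?
            grep -E '^(Dirty|Writeback):' /proc/meminfo
            # how many dentries are cached (first field is the total in memory)?
            cat /proc/sys/fs/dentry-state
            # which tasks are stuck in uninterruptible sleep, and where?
            ps -eo state,pid,wchan:32,comm | awk '$1 == "D"'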

            Purely speculation, but I see that patch https://review.whamcloud.com/52770 ("LU-6142 tests: Add missing /tmp/target2 under cleanup_src_tgt") landed on 2023-11-03 and changes test behavior (cleaning up the test directory), so it seems possible that huge numbers of dentries or inodes are in memory, or that a lot of dirty blocks on the host are slowing down the test. Is it worthwhile to push a revert of that patch for testing, to see whether the issue is avoided? The problem is that the subtest fails only intermittently, so it is hard to know whether it is "fixed" or not.
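            If someone does want to try that, a rough sketch of a test-only revert pushed through the usual Gerrit flow (the commit hash is a placeholder for the landed 52770 change, and the Test-Parameters line follows the usual autotest convention for re-running a suite several times to catch an intermittent failure):

            git revert <hash-of-landed-52770-change>   # placeholder, not a real hash
            # in the commit message, ask autotest to repeat the suite, e.g.:
            #   Test-Parameters: trivial testlist=lustre-rsync-test,lustre-rsync-test,lustre-rsync-test
            git push ssh://review.whamcloud.com:29418/fs/lustre-release HEAD:refs/for/master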

            adilger Andreas Dilger added a comment

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57266
            Subject: LU-17361 tests: debug lustre-rsync-test/2a
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 67e2596e5b6335c57717a46bc1a3c9d340dae402
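            For anyone wanting to try that debug patch locally, it should be fetchable with the standard Gerrit change-ref layout (patch set 1, per the above):

            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/66/57266/1
            git checkout FETCH_HEAD   # or cherry-pick onto a local branch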

            gerrit Gerrit Updater added a comment

            I forgot to mention that the test log also stopped at the same point as in the occurrence this ticket was filed for, meaning that the lustre_rsync process/command had only just started and had not even written its startup messages...

            bfaccini-nvda Bruno Faccini added a comment
            bfaccini-nvda Bruno Faccini added a comment (edited)

            +1 on master branch at https://testing.whamcloud.com/test_sets/4554eb6b-d6a7-43d4-8fd0-ab26a675170d .

            I had a quick look at the logs for this occurrence, and in the Lustre client console all-task stack dump taken at the timeout, the lustre_rsync process appears to be stuck waiting in jbd2_log_wait_commit():

            [ 5761.527765] task:lustre_rsync    state:D stack:0     pid:26569 ppid:26042  flags:0x00004002
            [ 5761.527767] Call Trace:
            [ 5761.527767]  <TASK>
            [ 5761.527768]  __schedule+0x20a/0x550
            [ 5761.527770]  ? __wake_up_common_lock+0x91/0xd0
            [ 5761.527773]  schedule+0x2d/0x70
            [ 5761.527775]  jbd2_log_wait_commit+0xa3/0x110 [jbd2]
            [ 5761.527787]  ? __pfx_autoremove_wake_function+0x10/0x10
            [ 5761.527790]  ext4_sync_file+0x15c/0x390 [ext4]
            [ 5761.527826]  ext4_buffered_write_iter+0xe8/0x110 [ext4]
            [ 5761.527867]  new_sync_write+0xff/0x190
            [ 5761.527871]  vfs_write+0x1ef/0x280
            [ 5761.527874]  ksys_write+0x5f/0xe0
            [ 5761.527876]  do_syscall_64+0x5c/0x90
            [ 5761.527878]  ? switch_fpu_return+0x4c/0xd0
            [ 5761.527882]  ? exit_to_user_mode_prepare+0xec/0x100
            [ 5761.527884]  ? syscall_exit_to_user_mode+0x22/0x40
            [ 5761.527886]  ? do_syscall_64+0x69/0x90
            [ 5761.527888]  ? do_syscall_64+0x69/0x90
            [ 5761.527889]  ? do_syscall_64+0x69/0x90
            [ 5761.527890]  ? sysvec_apic_timer_interrupt+0x3c/0x90
            [ 5761.527893]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
            [ 5761.527896] RIP: 0033:0x7f39c1f3e927
            [ 5761.527898] RSP: 002b:00007ffd4175fd88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
            [ 5761.527900] RAX: ffffffffffffffda RBX: 000055882b909260 RCX: 00007f39c1f3e927
            [ 5761.527901] RDX: 0000000000001100 RSI: 000055882b909260 RDI: 0000000000000006
            [ 5761.527902] RBP: 0000000000000006 R08: 0000000000000000 R09: 00007ffd4175fcb7
            [ 5761.527903] R10: 0000000000000180 R11: 0000000000000246 R12: 00007f39c20126c0
            [ 5761.527904] R13: 0000000000000000 R14: 000055882b84b7a4 R15: 000055882b84b790 

            A quick look at the other tasks stuck in the JBD2 layer shows auditd blocked in the same way:

            [ 5760.327293]  </TASK>
            [ 5760.327294] task:auditd          state:D stack:0     pid:512   ppid:1      flags:0x00004002
            [ 5760.327296] Call Trace:
            [ 5760.327297]  <TASK>
            [ 5760.327298]  __schedule+0x20a/0x550
            [ 5760.327300]  ? __wake_up_common_lock+0x91/0xd0
            [ 5760.327303]  schedule+0x2d/0x70
            [ 5760.327306]  jbd2_log_wait_commit+0xa3/0x110 [jbd2]
            [ 5760.327317]  ? __pfx_autoremove_wake_function+0x10/0x10
            [ 5760.327320]  ext4_sync_file+0x15c/0x390 [ext4]
            [ 5760.327355]  __x64_sys_fsync+0x33/0x60
            [ 5760.327373]  do_syscall_64+0x5c/0x90
            [ 5760.327375]  ? exit_to_user_mode_prepare+0xec/0x100
            [ 5760.327377]  ? syscall_exit_to_user_mode+0x22/0x40
            [ 5760.327379]  ? do_syscall_64+0x69/0x90
            [ 5760.327381]  ? syscall_exit_to_user_mode+0x22/0x40
            [ 5760.327383]  ? do_syscall_64+0x69/0x90
            [ 5760.327385]  ? do_syscall_64+0x69/0x90
            [ 5760.327386]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
            [ 5760.327389] RIP: 0033:0x7fbe37f4506b
            [ 5760.327392] RSP: 002b:00007fbe377fec50 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
            [ 5760.327393] RAX: ffffffffffffffda RBX: 0000555b51b17120 RCX: 00007fbe37f4506b
            [ 5760.327394] RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000005
            [ 5760.327395] RBP: 0000555b51b170e0 R08: 0000000000000000 R09: 00000000ffffffff
            [ 5760.327397] R10: 0000000000000000 R11: 0000000000000293 R12: 00007fbe377ff640
            [ 5760.327398] R13: 0000000000000002 R14: 00007fbe37e9f530 R15: 0000000000000000 

            and the jbd2 thread for the vda1 partition on the vda virtual device is also stuck, with this stack:

            [ 5760.326453]  </TASK>
            [ 5760.326454] task:jbd2/vda1-8     state:D stack:0     pid:398   ppid:2      flags:0x00004000
            [ 5760.326456] Call Trace:
            [ 5760.326457]  <TASK>
            [ 5760.326458]  __schedule+0x20a/0x550
            [ 5760.326460]  schedule+0x2d/0x70
            [ 5760.326462]  io_schedule+0x42/0x70
            [ 5760.326464]  bit_wait_io+0xd/0x60
            [ 5760.326467]  __wait_on_bit+0x4b/0x150
            [ 5760.326469]  ? __pfx_bit_wait_io+0x10/0x10
            [ 5760.326472]  out_of_line_wait_on_bit+0x92/0xb0
            [ 5760.326474]  ? __pfx_wake_bit_function+0x10/0x10
            [ 5760.326478]  jbd2_journal_commit_transaction+0x14b7/0x1900 [jbd2]
            [ 5760.326549]  ? update_load_avg+0x7e/0x740
            [ 5760.326554]  kjournald2+0xaf/0x280 [jbd2]
            [ 5760.326565]  ? __pfx_autoremove_wake_function+0x10/0x10
            [ 5760.326568]  ? __pfx_kjournald2+0x10/0x10 [jbd2]
            [ 5760.326578]  kthread+0xe0/0x100
            [ 5760.326584]  ? __pfx_kthread+0x10/0x10
            [ 5760.326586]  ret_from_fork+0x2c/0x50 

            All three tasks are in "D" state. Also, the lustre_rsync process is doing an ext4 write, so it is very likely copying data to a target file under the local /tmp filesystem.

            So it might be worth having a look at the bare-metal node trevis-24 hosting the VMs.
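            If access to the hypervisor is possible, a quick host-side check along these lines would show whether the storage backing the VM images is saturated (a generic sketch; it assumes the sysstat tools are available on trevis-24):

            # on the bare-metal host (e.g. trevis-24) while the test is stuck:
            iostat -x 5 3                              # per-device utilization and await times
            vmstat 5 3                                 # blocked tasks and swap activity
            ps -eo state,pid,comm | awk '$1 == "D"'    # host tasks stuck in uninterruptible I/O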

            yujian Jian Yu added a comment - Lustre 2.15.6 RC1: https://testing.whamcloud.com/test_sets/1eaa43a3-7227-4068-bad1-5c7bf860bfa1
            yujian Jian Yu added a comment - Also failed on b2_15 branch: https://testing.whamcloud.com/test_sets/ab11289f-d306-43b7-9643-a5f025536996
            yujian Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/9ea55376-2703-4959-99de-3a52d2a4aba9
            yujian Jian Yu added a comment - The failure occurred frequently in the last 6 months.
            yujian Jian Yu added a comment - Test session details: clients: https://build.whamcloud.com/job/lustre-b2_15/94 - 5.15.0-88-generic servers: https://build.whamcloud.com/job/lustre-master/4581 - 5.14.0-427.31.1_lustre.el9.x86_64 https://testing.whamcloud.com/test_sets/be9f839e-1409-467b-ba1c-20774af86e10
            yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/8f2c7567-3730-4160-bcc5-13974d72d292

            People

              yujian Jian Yu
              maloo Maloo
              Votes: 0
              Watchers: 8