Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17361

lustre-rsync-test test_2a: Timeout occurred after 97 minutes, last suite running was lustre-rsync-test

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • Lustre 2.17.0
    • Lustre 2.16.0, Lustre 2.15.5, Lustre 2.15.6
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Arshad <arshad.hussain@aeoncomputing.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8d022068-68bb-4bbe-8686-e7928146c95e

      test_2a failed with the following error:

      Timeout occurred after 97 minutes, last suite running was lustre-rsync-test
      

       

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/100812 - 4.18.0-477.27.1.el8_8.x86_64 (x86_64/RHEL 8.8/2.15.59.54)
      servers: https://build.whamcloud.com/job/lustre-reviews/100812 - 4.18.0-477.27.1.el8_lustre.x86_64 (ldiskfs/x86_64/RHEL 8.8/2.15.59.54)

      Client logs

      == lustre-rsync-test test 2a: Replicate files created by dbench. ===== 18:06:53 (1702404413)
      CMD: trevis-127vm1 /usr/sbin/lctl get_param mdd.lustre-MDT0000.changelog_mask -n
      CMD: trevis-127vm1 /usr/sbin/lctl set_param mdd.lustre-MDT0000.changelog_mask=+hsm
      mdd.lustre-MDT0000.changelog_mask=+hsm
      CMD: trevis-127vm1 /usr/sbin/lctl --device lustre-MDT0000 changelog_register -n
      CMD: trevis-127vm2 /usr/sbin/lctl get_param mdd.lustre-MDT0001.changelog_mask -n
      CMD: trevis-127vm2 /usr/sbin/lctl set_param mdd.lustre-MDT0001.changelog_mask=+hsm
      mdd.lustre-MDT0001.changelog_mask=+hsm
      CMD: trevis-127vm2 /usr/sbin/lctl --device lustre-MDT0001 changelog_register -n
      CMD: trevis-127vm1 /usr/sbin/lctl get_param mdd.lustre-MDT0002.changelog_mask -n
      CMD: trevis-127vm1 /usr/sbin/lctl set_param mdd.lustre-MDT0002.changelog_mask=+hsm
      mdd.lustre-MDT0002.changelog_mask=+hsm
      CMD: trevis-127vm1 /usr/sbin/lctl --device lustre-MDT0002 changelog_register -n
      CMD: trevis-127vm2 /usr/sbin/lctl get_param mdd.lustre-MDT0003.changelog_mask -n
      CMD: trevis-127vm2 /usr/sbin/lctl set_param mdd.lustre-MDT0003.changelog_mask=+hsm
      mdd.lustre-MDT0003.changelog_mask=+hsm
      CMD: trevis-127vm2 /usr/sbin/lctl --device lustre-MDT0003 changelog_register -n
      Registered 4 changelog users: 'cl3 cl3 cl3 cl3'
      lustre-MDT0000: Registered changelog user cl3
      looking for dbench program
      /usr/bin/dbench
      found dbench client file /usr/share/dbench/client.txt
      '/usr/share/dbench/client.txt' -> 'client.txt'
      copying required dbench files to chroot directory /mnt/lustre/d2a.lustre-rsync-test
      '/usr/bin/dbench' -> 'dbench'
      lib64/libpopt.so.0
      lib64/libc.so.6
      lib64/ld-linux-x86-64.so.2
      running 'dbench 2 -t 60' in /mnt/lustre/d2a.lustre-rsync-test at Tue Dec 12 18:06:58 UTC 2023
      waiting for dbench pid 27630
      dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
      Running for 60 seconds with load 'client.txt' and minimum warmup 12 secs
      failed to create barrier semaphore 
      0 of 2 processes prepared for launch   0 sec
      2 of 2 processes prepared for launch   0 sec
      releasing clients
         2       670    42.34 MB/sec  warmup   1 sec  latency 243.676 ms
         2       761    21.39 MB/sec  warmup   2 sec  latency 221.017 ms
         2       946    14.56 MB/sec  warmup   3 sec  latency 191.389 ms
      :
         2     34715     5.27 MB/sec  execute  55 sec  latency 241.709 ms
         2     35719     5.47 MB/sec  execute  56 sec  latency 119.864 ms
         2     36221     5.41 MB/sec  execute  57 sec  latency 173.668 ms
         2     36352     5.33 MB/sec  execute  58 sec  latency 132.522 ms
         2     36769     5.33 MB/sec  execute  59 sec  latency 160.981 ms
         2  cleanup  60 sec
         0  cleanup  60 sec
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX      10567     2.825   123.841
       Close           7710     0.642    24.838
       Rename           440     3.282    41.200
       Unlink          2156     0.976    22.880
       Qpathinfo       9492     0.636    38.188
       Qfileinfo       1629     0.076     3.208
       Qfsinfo         1719     0.018     1.539
       Sfileinfo        868     2.031    39.754
       Find            3625     0.296    41.384
       WriteX          5127     0.449    36.828
       ReadX          16023     0.009     0.055
       LockX             32     0.440     0.998
       UnlockX           32     0.504     1.832
       Flush            745    93.845   969.365
      Throughput 5.32869 MB/sec  2 clients  2 procs  max_latency=969.378 ms
      stopping dbench in /mnt/lustre/d2a.lustre-rsync-test at Tue Dec 12 18:08:10 UTC 2023 with return code 0
      clean dbench files in /mnt/lustre/d2a.lustre-rsync-test
      /mnt/lustre/d2a.lustre-rsync-test /mnt/lustre/d2a.lustre-rsync-test
      removed directory 'clients/client0/~dmtmp/PARADOX'
      removed directory 'clients/client0/~dmtmp/PM'
      :
      removed directory 'clients/client1/~dmtmp/PWRPNT'
      removed directory 'clients/client1/~dmtmp'
      removed directory 'clients/client1'
      removed directory 'clients'
      removed 'client.txt'
      removed 'dbench'
      removed 'lib64/libpopt.so.0'
      removed 'lib64/libc.so.6'
      removed 'lib64/ld-linux-x86-64.so.2'
      /mnt/lustre/d2a.lustre-rsync-test
      dbench successfully finished
      

       

      ======= 18:06:53 \(1702404413\)
      [  302.317896] Lustre: DEBUG MARKER: == lustre-rsync-test test 2a: Replicate files created by dbench. ==== 18:06:53 (1702404413)
      [  305.333993] Lustre: lustre-OST0005-osc-ffff895045192000: disconnect after 24s idle
      [  305.335802] Lustre: Skipped 1 previous similar message
      [  407.730814] Lustre: lustre-OST0000-osc-ffff895045192000: disconnect after 24s idle
      [  412.850667] Lustre: lustre-OST0006-osc-ffff895045192000: disconnect after 23s idle
      [  412.852206] Lustre: Skipped 6 previous similar messages
      [ 5749.820145] sysrq: SysRq : Changing Loglevel
      [ 5749.821211] sysrq: Loglevel set to 8
      [ 5749.823561] sysrq: SysRq : Show State
      :
      [ 5752.514580] task:lustre_rsync    state:D stack:    0 pid:27687 ppid: 27182 flags:0x00004080
      [ 5752.516178] Call Trace:
      [ 5752.516719]  __schedule+0x2d1/0x870
      [ 5752.518361]  schedule+0x55/0xf0
      [ 5752.519019]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
      [ 5752.520751]  ext4_sync_file+0x365/0x3b0 [ext4]
      [ 5752.521695]  ext4_file_write_iter+0x1a6/0x3e0 [ext4]
      [ 5752.522710]  new_sync_write+0x112/0x160
      [ 5752.523537]  vfs_write+0xa5/0x1b0
      [ 5752.524238]  ksys_write+0x4f/0xb0
      [ 5752.524948]  do_syscall_64+0x5b/0x1b0
      

      The jbd2 thread is stuck on IO submission to the VM storage (vda):

      [ 5750.277661] task:jbd2/vda1-8     state:D stack:    0 pid:  368 ppid:     2 flags:0x80004000
      [ 5750.279261] Call Trace:
      [ 5750.279801]  __schedule+0x2d1/0x870
      [ 5750.281340]  schedule+0x55/0xf0
      [ 5750.282004]  io_schedule+0x12/0x40
      [ 5750.282714]  bit_wait_io+0xd/0x50
      [ 5750.283418]  __wait_on_bit+0x2d/0x90
      [ 5750.284155]  out_of_line_wait_on_bit+0x91/0xb0
      [ 5750.285919]  jbd2_journal_commit_transaction+0x14db/0x19f0 [jbd2]
      [ 5750.288069]  kjournald2+0xbd/0x270 [jbd2]
      [ 5750.290560]  kthread+0x134/0x150
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      lustre-rsync-test test_2a - Timeout occurred after 97 minutes, last suite running was lustre-rsync-test

      Attachments

        Issue Links

          Activity

            People

              yujian Jian Yu
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: