  Lustre / LU-9273

replay-ost-single test_5: timeout after ost failover

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.10.3, Lustre 2.10.4
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d94fa898-0a02-11e7-9053-5254006e85c2.

      The sub-test test_5 failed with the following error:

      test failed to respond and timed out
      

      Env:
      server: tag-2.9.54 el7
      client: tag-2.9.54 SLES12SP2

      test log

      == replay-ost-single test 5: Fail OST during iozone ================================================== 04:17:10 (1489576630)
      iozone bg pid=7403
      + iozone -i 0 -i 1 -i 2 -+d -r 4 -s 1048576 -f /mnt/lustre/d0.replay-ost-single/f5.replay-ost-single
      tmppipe=/tmp/replay-ost-single.test_5.pipe
      iozone pid=7406
      Iozone: Performance Test of File I/O
      Version $Revision: 3.373 $
      Compiled for 64 bit mode.
      Build: linux-AMD64
      
      Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
      Al Slater, Scott Rhine, Mike Wisner, Ken Goss
      Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
      Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
      Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
      Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
      Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
      
      Run began: Wed Mar 15 04:17:10 2017
      
      >>> I/O Diagnostic mode enabled. <<<
      Performance measurements are invalid in this mode.
      Record Size 4 KB
      File size set to 1048576 KB
      Command line used: iozone -i 0 -i 1 -i 2 -+d -r 4 -s 1048576 -f /mnt/lustre/d0.replay-ost-single/f5.replay-ost-single
      Output is in Kbytes/sec
      Time Resolution = 0.000001 seconds.
      Processor cache size set to 1024 Kbytes.
      Processor cache line size set to 32 bytes.
      File stride size set to 17 * record size.
                                                              random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
      Failing ost1 on onyx-32vm4
      CMD: onyx-32vm4 grep -c /mnt/lustre-ost1' ' /proc/mounts
      Stopping /mnt/lustre-ost1 (opts:) on onyx-32vm4
      CMD: onyx-32vm4 umount /mnt/lustre-ost1
      CMD: onyx-32vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      reboot facets: ost1
      Failover ost1 to onyx-32vm4
      04:17:30 (1489576650) waiting for onyx-32vm4 network 900 secs ...
      04:17:30 (1489576650) network interface is UP
      CMD: onyx-32vm4 hostname
      mount facets: ost1
      CMD: onyx-32vm4 test -b /dev/lvm-Role_OSS/P1
      CMD: onyx-32vm4 e2label /dev/lvm-Role_OSS/P1
      Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
      CMD: onyx-32vm4 mkdir -p /mnt/lustre-ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
      CMD: onyx-32vm4 /usr/sbin/lctl get_param -n health_check
      CMD: onyx-32vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
      CMD: onyx-32vm4 e2label /dev/lvm-Role_OSS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      CMD: onyx-32vm4 e2label /dev/lvm-Role_OSS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      CMD: onyx-32vm4 e2label /dev/lvm-Role_OSS/P1 2>/dev/null
      Started lustre-OST0000
      CMD: onyx-32vm5,onyx-32vm6 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state_mount FULL osc.lustre-OST0000-osc-*.ost_server_uuid 
      onyx-32vm5: CMD: onyx-32vm5 lctl get_param -n at_max
      onyx-32vm6: CMD: onyx-32vm6 lctl get_param -n at_max
      onyx-32vm5: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 2 sec
      onyx-32vm6: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 2 sec
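
      The tail of the log shows the test framework polling each client until its OSC import for the restarted OST returns to FULL. The same state can be checked by hand; a minimal sketch, assuming the default target name lustre-OST0000 (device names vary with the setup):

      # on a client: import state for the failed-over OST
      lctl get_param osc.lustre-OST0000-osc-*.ost_server_uuid
      # on the OSS: recovery progress for that target
      lctl get_param obdfilter.lustre-OST0000.recovery_status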
      
      

      Info required for matching: replay-ost-single 5

      Activity


            Gerrit Updater added a comment -

            John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/33053/
            Subject: LU-9273 tests: disable random I/O in replay-ost-single/5
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 52809289d5e81557784346bc53a436541214690f

            Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33053
            Subject: LU-9273 tests: disable random I/O in replay-ost-single/5
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 3362371fa079a532b82bb8922781a1dc6ad54572
            James Nunez (Inactive) added a comment -

            Alex - Is this another instance of this hang with ZFS? https://testing.whamcloud.com/test_sets/420c0390-9ac1-11e8-b0aa-52540065bddc

            Alex Zhuravlev added a comment -

            I think this isn't an issue any more?

            Sarah Liu added a comment -

            +1 on b2_10: https://testing.hpdd.intel.com/test_sets/3fba602a-5910-11e8-93e6-52540065bddc

            Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31671/
            Subject: LU-9273 tests: disable random I/O in replay-ost-single/5
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e3bc6e681666aa2c60ada5f997966efa31fae68c

            Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/31671
            Subject: LU-9273 tests: disable random I/O in replay-ost-single/5
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2a63c9ad83eb910128fe476250e9bb0b799459b8
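
            For context, iozone's -i 2 selects its random read/write phase (-i 0 and -i 1 are the sequential write and read phases). Presumably these patches drop that phase from the subtest's iozone invocation; a sketch of the change under that assumption, not the verbatim diff (variable names illustrative):

            # lustre/tests/replay-ost-single.sh, test_5 - before:
            iozone -i 0 -i 1 -i 2 -+d -r 4 -s $size -f $testfile
            # after: random read/write (-i 2) disabled
            iozone -i 0 -i 1 -+d -r 4 -s $size -f $testfile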
            Alex Zhuravlev added a comment (edited) -

            This data, collected with autotest, confirms my local findings.

            ldiskfs:

                                 read        |        write
            pages per rpc   rpcs  %  cum %   |   rpcs   %  cum %
            1:                 0  0      0   |     19   3      3
            2:                 0  0      0   |      1   0      3
            4:                 0  0      0   |      3   0      4
            8:                 0  0      0   |      0   0      4
            16:                0  0      0   |     18   3      8
            32:                0  0      0   |     18   3     11
            64:                0  0      0   |     59  11     23
            128:               0  0      0   |     31   6     29
            256:               0  0      0   |     29   5     35
            512:               0  0      0   |    202  39     74
            1024:              0  0      0   |    128  25    100

            ZFS:

                                 read        |        write
            pages per rpc   rpcs  %  cum %   |   rpcs   %  cum %
            1:                 0  0      0   |      2   0      0
            2:                 0  0      0   |  32534  98     98
            4:                 0  0      0   |    144   0     98
            8:                 0  0      0   |      0   0     98
            16:                0  0      0   |      1   0     98
            32:                0  0      0   |      1   0     98
            64:                0  0      0   |      0   0     98
            128:               0  0      0   |      0   0     98
            256:               0  0      0   |    512   1    100

            Random writes consume granted space too quickly, causing early writes to recycle grants. (In the ZFS run above, 98% of write RPCs carry only 2 pages each, while on ldiskfs most writes go out as 512- or 1024-page RPCs.)
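
            Histograms like these come from Lustre's brw_stats counters. A minimal sketch of how to collect them, assuming current parameter names (the exact path differs across Lustre versions and OSD backends):

            # on the OSS: RPC size histogram for the OST
            lctl get_param osd-ldiskfs.lustre-OST0000.brw_stats   # ldiskfs backend
            lctl get_param osd-zfs.lustre-OST0000.brw_stats       # ZFS backend
            # on a client: grant currently held from that OST
            lctl get_param osc.lustre-OST0000-osc-*.cur_grant_bytes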

            Alex Zhuravlev added a comment -

            Well, it's ZFS in the reported cases, and I think I roughly understand the root cause. It probably makes sense to disable this subtest with ZFS for a while.
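
            Disabling a subtest per backend usually goes through the test framework's skip helpers; a hypothetical sketch only (the merged patches took the other route and disabled random I/O instead):

            test_5() {
                # hypothetical guard: skip on ZFS-backed OSTs (LU-9273)
                [ "$(facet_fstype ost1)" = zfs ] &&
                    skip "random writes recycle grants too fast on ZFS" && return 0
                ...
            }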


            James Nunez (Inactive) added a comment -

            It looks like we are seeing this issue again. Please see https://testing.hpdd.intel.com/test_sets/5ba4a4fe-2746-11e8-9e0e-52540065bddc for more logs. In the MDS console, we see

            [56215.091992] jbd2/vda1-8     D ffff880036d2bf40     0   265      2 0x00000000
            [56215.092773] Call Trace:
            [56215.093088]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
            [56215.093589]  [<ffffffff816b40e9>] schedule+0x29/0x70
            [56215.094079]  [<ffffffff816b1a49>] schedule_timeout+0x239/0x2c0
            [56215.094745]  [<ffffffff812fd2d0>] ? generic_make_request_checks+0x1a0/0x3a0
            [56215.095421]  [<ffffffff81063f5e>] ? kvm_clock_get_cycles+0x1e/0x20
            [56215.096029]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
            [56215.096605]  [<ffffffff816b35ed>] io_schedule_timeout+0xad/0x130
            [56215.097194]  [<ffffffff816b3688>] io_schedule+0x18/0x20
            [56215.097791]  [<ffffffff816b2071>] bit_wait_io+0x11/0x50
            [56215.098300]  [<ffffffff816b1b97>] __wait_on_bit+0x67/0x90
            [56215.098849]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
            [56215.099422]  [<ffffffff816b1c41>] out_of_line_wait_on_bit+0x81/0xb0
            [56215.100036]  [<ffffffff810b5080>] ? wake_bit_function+0x40/0x40
            [56215.100689]  [<ffffffff8123b3fa>] __wait_on_buffer+0x2a/0x30
            [56215.101386]  [<ffffffffc00d4891>] jbd2_journal_commit_transaction+0x1781/0x19b0 [jbd2]
            [56215.102164]  [<ffffffff810c28a0>] ? finish_task_switch+0x50/0x170
            [56215.102869]  [<ffffffffc00d9b69>] kjournald2+0xc9/0x260 [jbd2]
            [56215.103444]  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
            [56215.104030]  [<ffffffffc00d9aa0>] ? commit_timeout+0x10/0x10 [jbd2]
            [56215.104734]  [<ffffffff810b4031>] kthread+0xd1/0xe0
            [56215.105204]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
            [56215.105807]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
            [56215.106405]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40

            Yet, on the OST console, I see

            [56212.701386] txg_sync        D ffff88003fa35ee0     0 16826      2 0x00000080
            [56212.702137] Call Trace:
            [56212.702393]  [<ffffffff81240605>] ? bio_alloc_bioset+0x115/0x310
            [56212.703000]  [<ffffffff816b40e9>] schedule+0x29/0x70
            [56212.703610]  [<ffffffff816b1a49>] schedule_timeout+0x239/0x2c0
            [56212.704224]  [<ffffffff81063f5e>] ? kvm_clock_get_cycles+0x1e/0x20
            [56212.704842]  [<ffffffff810ecec2>] ? ktime_get_ts64+0x52/0xf0
            [56212.705458]  [<ffffffff816b35ed>] io_schedule_timeout+0xad/0x130
            [56212.706073]  [<ffffffff810b4cb6>] ? prepare_to_wait_exclusive+0x56/0x90
            [56212.706806]  [<ffffffff816b3688>] io_schedule+0x18/0x20
            [56212.707364]  [<ffffffffc065e502>] cv_wait_common+0xb2/0x150 [spl]
            [56212.707983]  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
            [56212.708637]  [<ffffffffc065e5f8>] __cv_wait_io+0x18/0x20 [spl]
            [56212.709275]  [<ffffffffc0806833>] zio_wait+0x113/0x1c0 [zfs]
            [56212.709879]  [<ffffffffc07bafd1>] vdev_config_sync+0xf1/0x180 [zfs]
            [56212.710612]  [<ffffffffc079b2b4>] spa_sync+0xa24/0xdf0 [zfs]
            [56212.711225]  [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
            [56212.711883]  [<ffffffffc07aef91>] txg_sync_thread+0x301/0x510 [zfs]
            [56212.712604]  [<ffffffffc07aec90>] ? txg_fini+0x2a0/0x2a0 [zfs]
            [56212.713231]  [<ffffffffc0658fc3>] thread_generic_wrapper+0x73/0x80 [spl]
            [56212.713926]  [<ffffffffc0658f50>] ? __thread_exit+0x20/0x20 [spl]
            [56212.714604]  [<ffffffff810b4031>] kthread+0xd1/0xe0
            [56212.715128]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
            [56212.715746]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
            [56212.716360]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40

            Which looks like LU-9247.

            Here are two more examples:

            https://testing.hpdd.intel.com/test_sets/03b2ab72-2761-11e8-9e0e-52540065bddc

            https://testing.hpdd.intel.com/test_sets/73816850-2773-11e8-b3c6-52540065bddc
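
            Traces like these show kernel threads stuck in uninterruptible (D) sleep; the hung-task watchdog prints them on its own, and when reproducing they can also be dumped on demand with standard sysrq (nothing Lustre-specific):

            # dump all blocked (D-state) tasks to the console/dmesg
            echo w > /proc/sysrq-trigger
            dmesg | grep -A 25 'txg_sync'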

            People

              Assignee: Alex Zhuravlev
              Reporter: Maloo
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: