Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9845

ost-pools test_22 hangs with ‘WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.’

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.11.0, Lustre 2.10.4
    • Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3, Lustre 2.10.7, Lustre 2.12.3
    • ZFS 0.7.x
    • 3
    • 9223372036854775807

    Description

      ost-pools test 22 hangs when testing with ZFS 0.7.0. The last lines in the test_log before the hang are:

      loop for testpool2 complete
      total: 100 open/close in 0.09 seconds: 1061.34 ops/second
      CMD: onyx-35vm9 lctl pool_remove lustre.testpool OST[0001-0001/1]
      onyx-35vm9: OST lustre-OST0001_UUID removed from pool lustre.testpool
      Pool testpool, iteration 2
      CMD: onyx-35vm9 lctl pool_remove lustre.testpool OST[0001-0001/1]
      onyx-35vm9: OST lustre-OST0001_UUID removed from pool lustre.testpool
      Pool testpool, iteration 3
      CMD: onyx-35vm9 lctl pool_remove lustre.testpool OST[0001-0001/1]
      onyx-35vm9: OST lustre-OST0001_UUID removed from pool lustre.testpool
      Pool testpool, iteration 4
      

      In the MDS console logs, we see:

      04:17:47:[11669.022658] Lustre: DEBUG MARKER: lctl pool_remove lustre.testpool OST[0001-0001/1]
      04:17:47:[11677.206802] Lustre: DEBUG MARKER: lctl pool_add lustre.testpool OST[0000-0001/1]
      04:17:47:[11687.391683] Lustre: DEBUG MARKER: lctl pool_remove lustre.testpool OST[0001-0001/1]
      04:17:47:[11693.557227] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.
      04:17:47:[11693.557227] 
      04:17:47:[11697.575911] Lustre: DEBUG MARKER: lctl pool_add lustre.testpool OST[0000-0001/1]
      04:17:47:[15221.223579] SysRq : Show State
      04:17:47:[15221.224071]   task                        PC stack   pid father
      04:17:47:[15221.224729] systemd         S 0000000000000000     0     1      0 0x00000000
      04:17:47:[15221.225617]  ffff88007c7bbdb8 0000000000000082 ffff88007c7b0000 ffff88007c7bbfd8
      04:17:47:[15221.226469]  ffff88007c7bbfd8 ffff88007c7bbfd8 ffff88007c7b0000 0000000000000000
      04:17:47:[15221.227325]  0000000000000000 ffff8800369d1a20 ffff88007c7b0000 0000000000000000
      04:17:47:[15221.228217] Call Trace:
      04:17:47:[15221.228581]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15221.229094]  [<ffffffff8168bb4d>] schedule_hrtimeout_range_clock+0x12d/0x150
      04:17:47:[15221.229864]  [<ffffffff812486f9>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
      04:17:47:[15221.230669]  [<ffffffff8168bb83>] schedule_hrtimeout_range+0x13/0x20
      04:17:47:[15221.231363]  [<ffffffff8124898e>] ep_poll+0x23e/0x360
      04:17:47:[15221.231927]  [<ffffffff81200873>] ? __fput+0x183/0x260
      04:17:47:[15221.232481]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      04:17:47:[15221.233121]  [<ffffffff81249bfd>] SyS_epoll_wait+0xed/0x120
      04:17:47:[15221.233737]  [<ffffffff81697a49>] system_call_fastpath+0x16/0x1b
      …
      04:17:47:[15222.536161] cfs_rh_00       S ffff88007a8667b0     0  2617      2 0x00000080
      04:17:47:[15222.536916]  ffff8800793ebe28 0000000000000046 ffff88000001bec0 ffff8800793ebfd8
      04:17:47:[15222.537791]  ffff8800793ebfd8 ffff8800793ebfd8 ffff88000001bec0 ffff88007a866780
      04:17:47:[15222.538646]  ffff88007a866798 ffff88007a866790 0000000000000000 ffff88007a8667b0
      04:17:47:[15222.539495] Call Trace:
      04:17:47:[15222.539749]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.540295]  [<ffffffffa05cab58>] cfs_wi_scheduler+0x2f8/0x440 [libcfs]
      04:17:47:[15222.541047]  [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
      04:17:47:[15222.541647]  [<ffffffffa05ca860>] ? cfs_wi_schedule+0x190/0x190 [libcfs]
      04:17:47:[15222.542384]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.542905]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.543561]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.544161]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.544827] cfs_rh_01       S ffff88007a8667b0     0  2618      2 0x00000080
      04:17:47:[15222.545646]  ffff88007b4ffe28 0000000000000046 ffff880000018fb0 ffff88007b4fffd8
      04:17:47:[15222.546465]  ffff88007b4fffd8 ffff88007b4fffd8 ffff880000018fb0 ffff88007a866780
      04:17:47:[15222.547284]  ffff88007a866798 ffff88007a866790 0000000000000000 ffff88007a8667b0
      04:17:47:[15222.548152] Call Trace:
      04:17:47:[15222.548437]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.549020]  [<ffffffffa05cab58>] cfs_wi_scheduler+0x2f8/0x440 [libcfs]
      04:17:47:[15222.549696]  [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
      04:17:47:[15222.550348]  [<ffffffffa05ca860>] ? cfs_wi_schedule+0x190/0x190 [libcfs]
      04:17:47:[15222.551049]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.551554]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.552281]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.552858]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.553513] spl_system_task S ffff8800791e43c0     0  3362      2 0x00000080
      04:17:47:[15222.554288]  ffff88007bae3e00 0000000000000046 ffff88007a24af10 ffff88007bae3fd8
      04:17:47:[15222.555149]  ffff88007bae3fd8 ffff88007bae3fd8 ffff88007a24af10 ffff8800791e4300
      04:17:47:[15222.556011]  ffff88007a9c6de0 ffff8800791e4390 ffff8800366fbd30 ffff8800791e43c0
      04:17:47:[15222.556870] Call Trace:
      04:17:47:[15222.557125]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.557656]  [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
      04:17:47:[15222.558369]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      04:17:47:[15222.558965]  [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
      04:17:47:[15222.559630]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.560181]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.560842]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.561451]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.562133] spl_delay_taskq S ffff88007b8fe4c0     0  3363      2 0x00000080
      04:17:47:[15222.562894]  ffff88007a907e00 0000000000000046 ffff88007a24bec0 ffff88007a907fd8
      04:17:47:[15222.563748]  ffff88007a907fd8 ffff88007a907fd8 ffff88007a24bec0 ffff88007b8fe400
      04:17:47:[15222.564605]  ffff88007a9c6f00 ffff88007b8fe490 0000000000000000 ffff88007b8fe4c0
      04:17:47:[15222.565474] Call Trace:
      04:17:47:[15222.565733]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.566242]  [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
      04:17:47:[15222.566941]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      04:17:47:[15222.567515]  [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
      04:17:47:[15222.568229]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.568761]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.569429]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.570037]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.570703] spl_dynamic_tas S ffff88007b8fe2c0     0  3364      2 0x00000080
      04:17:47:[15222.571482]  ffff88007b28fe00 0000000000000046 ffff88007a248000 ffff88007b28ffd8
      04:17:47:[15222.572328]  ffff88007b28ffd8 ffff88007b28ffd8 ffff88007a248000 ffff88007b8fe200
      04:17:47:[15222.573178]  ffff88007a9c6c60 ffff88007b8fe290 ffff8800668c9670 ffff88007b8fe2c0
      04:17:47:[15222.574045] Call Trace:
      04:17:47:[15222.574314]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.574893]  [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
      04:17:47:[15222.575505]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      04:17:47:[15222.576122]  [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
      04:17:47:[15222.576785]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.577352]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.578036]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.578596]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.579309] spl_kmem_cache  S ffff88007b8feec0     0  3365      2 0x00000080
      04:17:47:[15222.580047]  ffff88007ba3fe00 0000000000000046 ffff88007a24ce70 ffff88007ba3ffd8
      04:17:47:[15222.580902]  ffff88007ba3ffd8 ffff88007ba3ffd8 ffff88007a24ce70 ffff88007b8fee00
      04:17:47:[15222.581750]  ffff88007a9c6a20 ffff88007b8fee90 ffff8800387fe800 ffff88007b8feec0
      04:17:47:[15222.582597] Call Trace:
      04:17:47:[15222.582853]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.583358]  [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
      04:17:47:[15222.584057]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      04:17:47:[15222.584645]  [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
      04:17:47:[15222.585372]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.585891]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.586549]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.587153]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.587820] zvol            S ffff88007b5abbc0     0  3366      2 0x00000080
      04:17:47:[15222.588648]  ffff88007b29be00 0000000000000046 ffff88007a249f60 ffff88007b29bfd8
      04:17:47:[15222.589473]  ffff88007b29bfd8 ffff88007b29bfd8 ffff88007a249f60 ffff88007b5abb00
      04:17:47:[15222.590300]  ffff8800364fa6c0 ffff88007b5abb90 0000000000000000 ffff88007b5abbc0
      04:17:47:[15222.591167] Call Trace:
      04:17:47:[15222.591437]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.592015]  [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
      04:17:47:[15222.592643]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      04:17:47:[15222.593264]  [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
      04:17:47:[15222.593951]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.594450]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.595156]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.595709]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.596425] arc_prune       S ffff880079bc1ac0     0  3372      2 0x00000080
      04:17:47:[15222.597166]  ffff88007b8a7e00 0000000000000046 ffff88007a9d8000 ffff88007b8a7fd8
      04:17:47:[15222.598033]  ffff88007b8a7fd8 ffff88007b8a7fd8 ffff88007a9d8000 ffff880079bc1a00
      04:17:47:[15222.598901]  ffff8800796aad80 ffff880079bc1a90 0000000000000000 ffff880079bc1ac0
      04:17:47:[15222.599755] Call Trace:
      04:17:47:[15222.600008]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.600537]  [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
      04:17:47:[15222.601256]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      04:17:47:[15222.601857]  [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
      04:17:47:[15222.602514]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.603059]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.603722]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.604329]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.605006] arc_reclaim     S ffffffffa0914a08     0  3373      2 0x00000080
      04:17:47:[15222.605764]  ffff88007b933cf8 0000000000000046 ffff88007a9dedd0 ffff88007b933fd8
      04:17:47:[15222.606622]  ffff88007b933fd8 ffff88007b933fd8 ffff88007a9dedd0 ffff88007b933dc8
      04:17:47:[15222.607473]  00000000000186a0 0000000000000001 ffffffffa0914a88 ffffffffa0914a08
      04:17:47:[15222.608292] Call Trace:
      04:17:47:[15222.608635]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.609129]  [<ffffffff8168bad2>] schedule_hrtimeout_range_clock+0xb2/0x150
      04:17:47:[15222.609843]  [<ffffffff810b4640>] ? hrtimer_get_res+0x50/0x50
      04:17:47:[15222.610414]  [<ffffffff8168bac6>] ? schedule_hrtimeout_range_clock+0xa6/0x150
      04:17:47:[15222.611121]  [<ffffffff8168bb83>] schedule_hrtimeout_range+0x13/0x20
      04:17:47:[15222.611769]  [<ffffffffa06236ab>] cv_timedwait_hires_common+0x12b/0x220 [spl]
      04:17:47:[15222.612503]  [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
      04:17:47:[15222.613080]  [<ffffffffa06237d4>] cv_timedwait_sig_hires+0x14/0x20 [spl]
      04:17:47:[15222.613844]  [<ffffffffa071ad31>] arc_reclaim_thread+0x121/0x260 [zfs]
      04:17:47:[15222.614529]  [<ffffffffa071ac10>] ? arc_shrink+0xc0/0xc0 [zfs]
      04:17:47:[15222.615337]  [<ffffffffa061df91>] thread_generic_wrapper+0x71/0x80 [spl]
      04:17:47:[15222.616077]  [<ffffffffa061df20>] ? __thread_exit+0x20/0x20 [spl]
      04:17:47:[15222.616766]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.617319]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.618001]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.618548]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.619262] dbu_evict       S ffff88003641fbc0     0  3374      2 0x00000080
      04:17:47:[15222.620014]  ffff88000005be00 0000000000000046 ffff88007a9daf10 ffff88000005bfd8
      04:17:47:[15222.620866]  ffff88000005bfd8 ffff88000005bfd8 ffff88007a9daf10 ffff88003641fb00
      04:17:47:[15222.621722]  ffff8800796aad20 ffff88003641fb90 ffff88004caa2070 ffff88003641fbc0
      04:17:47:[15222.622573] Call Trace:
      04:17:47:[15222.622826]  [<ffffffff8168ca49>] schedule+0x29/0x70
      04:17:47:[15222.623336]  [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
      04:17:47:[15222.624031]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
      04:17:47:[15222.624606]  [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
      04:17:47:[15222.625315]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      04:17:47:[15222.625832]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      04:17:47:[15222.626491]  [<ffffffff81697998>] ret_from_fork+0x58/0x90
      04:17:47:[15222.627083]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      

      From ost-pools.sh,

      1196 test_22() {
      1197         local POOL_ROOT=${POOL_ROOT:-$DIR/$tdir}
      1198         [[ $OSTCOUNT -le 1 ]] && skip_env "Need at least 2 OSTs" && return
      1199 
      1200         local numfiles=100
      1201 
      1202         create_pool_nofail $POOL
      1203         add_pool $POOL "OST0000" "$FSNAME-OST0000_UUID "
      1204         create_pool_nofail $POOL2
      1205         add_pool $POOL2 "OST0000" "$FSNAME-OST0000_UUID "
      1206 
      1207         add_loop $POOL 1 &
      1208         add_loop $POOL2 2 &
      1209         sleep 5
      1210         create_dir $POOL_ROOT $POOL
      1211         createmany -o $POOL_ROOT/${tfile} $numfiles ||
      1212                 error "createmany $POOL_ROOT/${tfile} failed!"
      1213         wait
      1214 
      1215         return 0
      1216 }
      1217 run_test 22 "Simultaneous manipulation of a pool"
      

      There’s only one instance of this failure. Logs are at:
      https://testing.hpdd.intel.com/test_sets/d0619996-799f-11e7-9b69-5254006e85c2

      Attachments

        1. mds02_faulted.log
          20 kB
          Kim Sebo
        2. zdb-lustre-ost1.log
          75 kB
          Nathaniel Clark

        Issue Links

          Activity

            People

              utopiabound Nathaniel Clark
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              23 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: