Lustre / LU-7797

Can't mount zpools after OSS restart


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Blocker
    • lola
      build: 2.8.50-6-gf9ca359; commit f9ca359284357d145819beb08b316e932f7a3060
    • Severity: 3

    Description

      The error occurred during soak testing of build '20160218' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160218). DNE is enabled.
      MDTs were formatted using ldiskfs, OSTs using ZFS.

      Sequence of events:

      • 2016-02-18 18:24:30,824:fsmgmt.fsmgmt:INFO executing cmd pm -h powerman -c lola-5 (restart of OSS)
      • Boot process hung with several errors (see line 25105 in console-lola-5.log, after timestamp 'Feb 18, 18:20:01')
          25105 WARNING: Pool 'soaked-ost11' has encountered an uncorrectable I/O failure and has been suspended.
          25106 
          25107 INFO: task zpool:5003 blocked for more than 120 seconds.
          25108       Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gf9ca359.x86_64 #1
          25109 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          25110 zpool         D 0000000000000011     0  5003   4993 0x00000000
          25111  ffff880830f7bbe8 0000000000000086 0000000000000000 ffffffff81064a6e
          25112  ffff880830f7bba8 0000000000000019 0000000d6e7b4a08 0000000000000001
          25113  ffff880830f7bb68 00000000fffc4649 ffff8808317c5068 ffff880830f7bfd8
          25114 Call Trace:
          25115  [<ffffffff81064a6e>] ? try_to_wake_up+0x24e/0x3e0
          25116  [<ffffffffa02e178d>] cv_wait_common+0x11d/0x130 [spl]
          25117  [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
          25118  [<ffffffffa02e17f5>] __cv_wait+0x15/0x20 [spl]
          25119  [<ffffffffa039884b>] txg_wait_synced+0x8b/0xd0 [zfs]
          25120  [<ffffffffa039038c>] spa_config_update+0xcc/0x120 [zfs]
          25121  [<ffffffffa038de8a>] spa_import+0x56a/0x730 [zfs]
          25122  [<ffffffffa02fe454>] ? nvlist_lookup_common+0x84/0xd0 [znvpair]
          25123  [<ffffffffa03c0134>] zfs_ioc_pool_import+0xe4/0x120 [zfs]
          25124  [<ffffffffa03c2955>] zfsdev_ioctl+0x495/0x4d0 [zfs]
          25125  [<ffffffff811a3ff2>] vfs_ioctl+0x22/0xa0
          25126  [<ffffffff811a4194>] do_vfs_ioctl+0x84/0x580
          25127  [<ffffffff81190101>] ? __fput+0x1a1/0x210
          25128  [<ffffffff811a4711>] sys_ioctl+0x81/0xa0
          25129  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
        
      • After power-cycling the node, the zpool soaked-ost3 fails to mount with the
        error:
        LustreError: 11505:0:(llog_obd.c:209:llog_setup()) MGC192.168.1.108@o2ib10: ctxt 0 lop_setup=ffffffffa06da310 failed: rc = -5
        LustreError: 11505:0:(obd_mount_server.c:308:server_mgc_set_fs()) can't set_fs -5
        LustreError: 11505:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -5
        LustreError: 11505:0:(obd_mount_server.c:1512:server_put_super()) no obd soaked-OST0003
        LustreError: 11505:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-OST0003 not registered
        

        The MGS is available and the IB fabric is operational.

      • Trying to mount zpool soaked-ost7 leads to a kernel panic:
        LustreError: 11938:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-OST0007 not registered
        VERIFY3(0 == dmu_buf_hold_array(os, object, offset, size, 0, ((char *)__func__), &numbufs, &dbp)) failed (0 == 5)
        PANIC at dmu.c:819:dmu_write()
        Showing stack for process 9182
        Pid: 9182, comm: txg_sync Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gf9ca359.x86_64 #1
        Call Trace:
         [<ffffffffa02df7cd>] ? spl_dumpstack+0x3d/0x40 [spl]
         [<ffffffffa02df9c2>] ? spl_panic+0xc2/0xe0 [spl]
         [<ffffffffa0349c51>] ? dmu_buf_hold_array_by_dnode+0x231/0x560 [zfs]
         [<ffffffffa035a8b4>] ? dnode_rele_and_unlock+0x64/0xb0 [zfs]
         [<ffffffffa035a943>] ? dnode_rele+0x43/0x50 [zfs]
         [<ffffffffa034a79b>] ? dmu_write+0x19b/0x1a0 [zfs]
         [<ffffffffa0342af2>] ? dmu_buf_will_dirty+0xb2/0x100 [zfs]
         [<ffffffffa0397421>] ? space_map_write+0x361/0x5f0 [zfs]
         [<ffffffffa037b01b>] ? metaslab_sync+0x11b/0x760 [zfs]
         [<ffffffffa0373cf4>] ? dsl_scan_sync+0x54/0xb80 [zfs]
         [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
         [<ffffffffa039be3f>] ? vdev_sync+0x6f/0x140 [zfs]
         [<ffffffffa03839bb>] ? spa_sync+0x4bb/0xb90 [zfs]
         [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
         [<ffffffff8105bd83>] ? __wake_up+0x53/0x70
         [<ffffffff81014a29>] ? read_tsc+0x9/0x20
         [<ffffffffa0399079>] ? txg_sync_thread+0x389/0x5f0 [zfs]
         [<ffffffffa0398cf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
         [<ffffffffa0398cf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
         [<ffffffffa02dcfb8>] ? thread_generic_wrapper+0x68/0x80 [spl]
         [<ffffffffa02dcf50>] ? thread_generic_wrapper+0x0/0x80 [spl]
         [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
         [<ffffffff8100c28a>] ? child_rip+0xa/0x20
         [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
         [<ffffffff8100c280>] ? child_rip+0x0/0x20
        

        Both OSTs were mounted and operational before, and both errors can be reproduced consistently.

      Attached are the messages and console log files of lola-5.
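
      The mount failures above all report rc = -5. Lustre passes kernel error codes back as negative errno values, so -5 is -EIO ("Input/output error"), which is consistent with the pool having been suspended after an uncorrectable I/O failure. A minimal sketch for decoding such return codes (illustrative only, not part of the ticket):

      ```python
      import errno
      import os

      # Lustre console lines report "rc = -5"; the kernel convention is a
      # negated errno, so negate it back before looking it up.
      rc = -5
      name = errno.errorcode[-rc]   # symbolic name, e.g. 'EIO'
      message = os.strerror(-rc)    # human-readable description

      print(f"rc = {rc} -> {name}: {message}")
      ```

      This matches the three llog_setup/server_mgc_set_fs/server_fill_super errors, which are all propagating the same -EIO from the suspended pool.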

Attachments

Issue Links

Activity

People

Assignee: bzzz Alex Zhuravlev
Reporter: heckes Frank Heckes (Inactive)
Votes: 0
Watchers: 5

Dates

Created:
Updated:
Resolved: