Lustre / LU-7797

Can't mount zpools after OSS restart

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Blocker
    • Environment: lola
      build: 2.8.50-6-gf9ca359; commit f9ca359284357d145819beb08b316e932f7a3060
    • Severity: 3

    Description

      Error happened during soak testing of build '20160218' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160218). DNE is enabled.
      The MDTs have been formatted using ldiskfs, the OSTs using ZFS.

      Sequence of events:

      • 2016-02-18 18:24:30,824:fsmgmt.fsmgmt:INFO executing cmd pm -h powerman -c lola-5 (restart of OSS)
      • The boot process hung with several errors (see line 25015 in console-lola-5.log, after timestamp 'Feb 18, 18:20:01')
          25105 WARNING: Pool 'soaked-ost11' has encountered an uncorrectable I/O failure and has been suspended.
          25106 
          25107 INFO: task zpool:5003 blocked for more than 120 seconds.
          25108       Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gf9ca359.x86_64 #1
          25109 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          25110 zpool         D 0000000000000011     0  5003   4993 0x00000000
          25111  ffff880830f7bbe8 0000000000000086 0000000000000000 ffffffff81064a6e
          25112  ffff880830f7bba8 0000000000000019 0000000d6e7b4a08 0000000000000001
          25113  ffff880830f7bb68 00000000fffc4649 ffff8808317c5068 ffff880830f7bfd8
          25114 Call Trace:
          25115  [<ffffffff81064a6e>] ? try_to_wake_up+0x24e/0x3e0
          25116  [<ffffffffa02e178d>] cv_wait_common+0x11d/0x130 [spl]
          25117  [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
          25118  [<ffffffffa02e17f5>] __cv_wait+0x15/0x20 [spl]
          25119  [<ffffffffa039884b>] txg_wait_synced+0x8b/0xd0 [zfs]
          25120  [<ffffffffa039038c>] spa_config_update+0xcc/0x120 [zfs]
          25121  [<ffffffffa038de8a>] spa_import+0x56a/0x730 [zfs]
          25122  [<ffffffffa02fe454>] ? nvlist_lookup_common+0x84/0xd0 [znvpair]
          25123  [<ffffffffa03c0134>] zfs_ioc_pool_import+0xe4/0x120 [zfs]
          25124  [<ffffffffa03c2955>] zfsdev_ioctl+0x495/0x4d0 [zfs]
          25125  [<ffffffff811a3ff2>] vfs_ioctl+0x22/0xa0
          25126  [<ffffffff811a4194>] do_vfs_ioctl+0x84/0x580
          25127  [<ffffffff81190101>] ? __fput+0x1a1/0x210
          25128  [<ffffffff811a4711>] sys_ioctl+0x81/0xa0
          25129  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
        
      • After power-cycling the node, the zpool soaked-ost3 fails to mount with
        the following errors:
        LustreError: 11505:0:(llog_obd.c:209:llog_setup()) MGC192.168.1.108@o2ib10: ctxt 0 lop_setup=ffffffffa06da310 failed: rc = -5
        LustreError: 11505:0:(obd_mount_server.c:308:server_mgc_set_fs()) can't set_fs -5
        LustreError: 11505:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -5
        LustreError: 11505:0:(obd_mount_server.c:1512:server_put_super()) no obd soaked-OST0003
        LustreError: 11505:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-OST0003 not registered
        

        The MGS is available and the IB fabric is operational.

      • Trying to mount zpool soaked-ost7 leads to a kernel panic:
        LustreError: 11938:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-OST0007 not registered
        VERIFY3(0 == dmu_buf_hold_array(os, object, offset, size, 0, ((char *)__func__), &numbufs, &dbp)) failed (0 == 5)
        PANIC at dmu.c:819:dmu_write()
        Showing stack for process 9182
        Pid: 9182, comm: txg_sync Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gf9ca359.x86_64 #1
        Call Trace:
         [<ffffffffa02df7cd>] ? spl_dumpstack+0x3d/0x40 [spl]
         [<ffffffffa02df9c2>] ? spl_panic+0xc2/0xe0 [spl]
         [<ffffffffa0349c51>] ? dmu_buf_hold_array_by_dnode+0x231/0x560 [zfs]
         [<ffffffffa035a8b4>] ? dnode_rele_and_unlock+0x64/0xb0 [zfs]
         [<ffffffffa035a943>] ? dnode_rele+0x43/0x50 [zfs]
         [<ffffffffa034a79b>] ? dmu_write+0x19b/0x1a0 [zfs]
         [<ffffffffa0342af2>] ? dmu_buf_will_dirty+0xb2/0x100 [zfs]
         [<ffffffffa0397421>] ? space_map_write+0x361/0x5f0 [zfs]
         [<ffffffffa037b01b>] ? metaslab_sync+0x11b/0x760 [zfs]
         [<ffffffffa0373cf4>] ? dsl_scan_sync+0x54/0xb80 [zfs]
         [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
         [<ffffffffa039be3f>] ? vdev_sync+0x6f/0x140 [zfs]
         [<ffffffffa03839bb>] ? spa_sync+0x4bb/0xb90 [zfs]
         [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
         [<ffffffff8105bd83>] ? __wake_up+0x53/0x70
         [<ffffffff81014a29>] ? read_tsc+0x9/0x20
         [<ffffffffa0399079>] ? txg_sync_thread+0x389/0x5f0 [zfs]
         [<ffffffffa0398cf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
         [<ffffffffa0398cf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
         [<ffffffffa02dcfb8>] ? thread_generic_wrapper+0x68/0x80 [spl]
         [<ffffffffa02dcf50>] ? thread_generic_wrapper+0x0/0x80 [spl]
         [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
         [<ffffffff8100c28a>] ? child_rip+0xa/0x20
         [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
         [<ffffffff8100c280>] ? child_rip+0x0/0x20
        

        Both OSTs were mounted and operational before the restart, and both errors can be reproduced consistently.

      Attached are the messages and console log files of lola-5.
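      For reference, a minimal sketch of how such a mount attempt can be reproduced by hand on the OSS. The dataset name soaked-ost3/ost3 follows the naming visible in the zpool status output further down; the mount point is a hypothetical example.

        # Import the pool and mount the Lustre OST dataset manually
        # (mount point /mnt/soaked-ost3 is a hypothetical example)
        [root@lola-5 ~]# zpool import soaked-ost3
        [root@lola-5 ~]# mount -t lustre soaked-ost3/ost3 /mnt/soaked-ost3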

      Attachments

        Issue Links

          Activity

            [LU-7797] Can't mount zpools after OSS restart

            heckes Frank Heckes (Inactive) added a comment -

            I think this ticket can be closed. The error was caused by the node set-up not reflecting the ZFS constraints (i.e. importing the same zpool (OST) simultaneously on two nodes (OSSes)).
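            Purely as an illustration, a hedged sketch of how such a double import can be checked for from either OSS before mounting; the vdev path passed to zdb is a hypothetical example.

            # Pools currently imported on this node
            [root@lola-5 ~]# zpool list
            # Pools visible but not imported here; a pool missing from both lists
            # on this node may be imported on the failover partner
            [root@lola-5 ~]# zpool import
            # The pool labels record the hostname of the last system to import it
            # (device path below is a hypothetical example)
            [root@lola-5 ~]# zdb -l /dev/mapper/lola-5_ost3_disk_0 | grep -i hostname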

            heckes Frank Heckes (Inactive) added a comment -

            Andreas: Many thanks for the pointers. I'll try to fix the OSTs and enhance the soak framework to follow the HA best practices for ZFS.

            adilger Andreas Dilger added a comment -

            Unfortunately, there is no OI scrub functionality for ZFS today, so it isn't possible to just delete the corrupted OI file and have OI Scrub rebuild it. ZFS is not supposed to become corrupted during use, and none of the APIs that Lustre uses to modify the filesystem should allow the pool to become corrupt.

            However, in LU-7798 Frank wrote:

            I forgot to mention that the OSS nodes have been extended to operate in an active-active failover configuration
            for disk resources as of Feb 17th, 2016. So the failover partner node lola-4 can see all disks of node lola-5 and also has its ZFS pools imported.
            There's no start-up (boot) wrapper script that prevents the (primary) zpools of the other node from being imported.

            ZFS is definitely vulnerable to corruption if it is mounted on multiple nodes at the same time, and there is currently no MMP feature, as there is in ldiskfs, that actively prevents it from being accessed by two nodes. This kind of corruption would show up first for files that are modified frequently (e.g. the OI file seen here), because each node will be assigning different blocks and updating the tree differently. There needs to be strict Linux HA control of the zpools so that they are not imported on the backup node unless they are failed over, and when failover happens there needs to be STONITH to turn off the primary node before the pool is imported on the backup, to avoid concurrent access. It is worthwhile to contact Gabriele or Zhiqi to see if there is a best-practices guide for installing ZFS in an HA failover configuration.
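            As an illustration only, a minimal Pacemaker/pcs sketch of this kind of control, assuming the ocf:heartbeat:ZFS resource agent and IPMI-based fencing are available; all resource names, addresses and credentials below are hypothetical, and fence_ipmilan parameter names vary between fence-agents versions.

            # STONITH so the old owner is powered off before its pool moves
            pcs stonith create fence-lola-4 fence_ipmilan ipaddr=10.0.0.4 login=admin passwd=secret pcmk_host_list=lola-4
            pcs stonith create fence-lola-5 fence_ipmilan ipaddr=10.0.0.5 login=admin passwd=secret pcmk_host_list=lola-5
            pcs property set stonith-enabled=true

            # Each zpool is a cluster resource, so it is only ever imported on one node
            pcs resource create ost3-pool ocf:heartbeat:ZFS pool=soaked-ost3
            pcs constraint location ost3-pool prefers lola-5=100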

            Also, see patch http://review.whamcloud.com/16611 "LU-7134 utils: Ensure hostid set for ZFS during mkfs", which requires that /etc/hostid is set and would at least provide basic protection against importing a pool while it is in use on another node.
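            A minimal sketch of setting and verifying the host ID on each node, assuming the genhostid utility is available; making sure the values are unique across nodes remains a manual step.

            # Generate a random host ID and persist it in /etc/hostid
            [root@lola-5 ~]# genhostid
            # Verify that the value is non-zero (and different on every node)
            [root@lola-5 ~]# hostid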

            You may be able to recover the corrupted OST zpool if it hasn't been mounted for a long time by reverting to an older uberblock or snapshot that does not have the corruption in it. See for example http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script or https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSMetadataRecovery for details. If that doesn't work then it would be necessary to reformat the filesystem. However, at a minimum the /etc/hostid should be set (and unique!) to prevent gratuitous imports, and failover between OSSes should be disabled until proper HA configuration is done.
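            As an illustration, a hedged sketch of such a recovery attempt using the documented zpool import rewind options; this discards the most recent transactions, so it should only be run once the pool is exported everywhere, and ideally after checking the data via a read-only import.

            # Dry run (-n with -F): report whether discarding the last transactions
            # would make the pool importable, without changing anything
            [root@lola-5 ~]# zpool import -F -n soaked-ost3
            # Optionally inspect the data through a read-only import first
            [root@lola-5 ~]# zpool import -o readonly=on soaked-ost3
            [root@lola-5 ~]# zpool export soaked-ost3
            # Rewind to an older consistent state (recent transactions are lost)
            [root@lola-5 ~]# zpool import -F soaked-ost3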

            pjones Peter Jones added a comment -

            Alex

            Could you please look into how this could have occurred?

            Thanks

            Peter


            heckes Frank Heckes (Inactive) added a comment -

            One remark: the called-out zpool 'soaked-ost11' ('...has unrecoverable errors...') can be mounted and is operational after Lustre recovery completes.

            heckes Frank Heckes (Inactive) added a comment -

            The states of the zpools are as follows:

            soaked-ost3

            [root@lola-5 ~]# zpool status -v soaked-ost3
              pool: soaked-ost3
             state: ONLINE
            status: One or more devices has experienced an error resulting in data
                    corruption.  Applications may be affected.
            action: Restore the file in question if possible.  Otherwise restore the
                    entire pool from backup.
               see: http://zfsonlinux.org/msg/ZFS-8000-8A
              scan: none requested
            config:
            
                    NAME                    STATE     READ WRITE CKSUM
                    soaked-ost3             ONLINE       0     0     0
                      raidz2-0              ONLINE       0     0     0
                        lola-5_ost3_disk_0  ONLINE       0     0     0
                        lola-5_ost3_disk_1  ONLINE       0     0     0
                        lola-5_ost3_disk_2  ONLINE       0     0     0
                        lola-5_ost3_disk_3  ONLINE       0     0     0
                        lola-5_ost3_disk_4  ONLINE       0     0     0
                        lola-5_ost3_disk_5  ONLINE       0     0     0
                        lola-5_ost3_disk_6  ONLINE       0     0     0
                        lola-5_ost3_disk_7  ONLINE       0     0     0
                        lola-5_ost3_disk_8  ONLINE       0     0     0
                        lola-5_ost3_disk_9  ONLINE       0     0     0
            
            errors: Permanent errors have been detected in the following files:
            
                    soaked-ost3/ost3:/oi.10
            

            soaked-ost7

            [root@lola-5 ~]# zpool status -v soaked-ost7
              pool: soaked-ost7
             state: ONLINE
              scan: none requested
            config:
            
                    NAME                    STATE     READ WRITE CKSUM
                    soaked-ost7             ONLINE       0     0     0
                      raidz2-0              ONLINE       0     0     0
                        lola-5_ost7_disk_0  ONLINE       0     0     0
                        lola-5_ost7_disk_1  ONLINE       0     0     0
                        lola-5_ost7_disk_2  ONLINE       0     0     0
                        lola-5_ost7_disk_3  ONLINE       0     0     0
                        lola-5_ost7_disk_4  ONLINE       0     0     0
                        lola-5_ost7_disk_5  ONLINE       0     0     0
                        lola-5_ost7_disk_6  ONLINE       0     0     0
                        lola-5_ost7_disk_7  ONLINE       0     0     0
                        lola-5_ost7_disk_8  ONLINE       0     0     0
                        lola-5_ost7_disk_9  ONLINE       0     0     0
            
            errors: No known data errors
            

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 5
