[LU-7797] Can't mount zpools after OSS restart Created: 19/Feb/16  Updated: 24/Feb/16  Resolved: 24/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Alex Zhuravlev
Resolution: Not a Bug Votes: 0
Labels: soak
Environment:

lola
build: 2.8.50-6-gf9ca359 ;commit f9ca359284357d145819beb08b316e932f7a3060


Attachments: File console-lola-5.log.bz2     File messages-lola-5.log.bz2    
Issue Links:
Related
is related to LU-7798 ll_prep_inode()) ASSERTION( fid_is_sa... Resolved
is related to LU-7585 Implement OI Scrub for ZFS Resolved
is related to LU-7134 Ensure ZFS hostid protection if servi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error happened during soak testing of build '20160218' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160218). DNE is enabled.
The MDTs have been formatted using ldiskfs, the OSTs using ZFS.

Sequence of events:

  • 2016-02-18 18:24:30,824:fsmgmt.fsmgmt:INFO executing cmd pm -h powerman -c lola-5 (restart of OSS)
  • The boot process hung with several errors (see line 25015 in console-lola-5.log, after timestamp 'Feb 18, 18:20:01'):
      25105 WARNING: Pool 'soaked-ost11' has encountered an uncorrectable I/O failure and has been suspended.
      25106 
      25107 INFO: task zpool:5003 blocked for more than 120 seconds.
      25108       Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gf9ca359.x86_64 #1
      25109 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      25110 zpool         D 0000000000000011     0  5003   4993 0x00000000
      25111  ffff880830f7bbe8 0000000000000086 0000000000000000 ffffffff81064a6e
      25112  ffff880830f7bba8 0000000000000019 0000000d6e7b4a08 0000000000000001
      25113  ffff880830f7bb68 00000000fffc4649 ffff8808317c5068 ffff880830f7bfd8
      25114 Call Trace:
      25115  [<ffffffff81064a6e>] ? try_to_wake_up+0x24e/0x3e0
      25116  [<ffffffffa02e178d>] cv_wait_common+0x11d/0x130 [spl]
      25117  [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
      25118  [<ffffffffa02e17f5>] __cv_wait+0x15/0x20 [spl]
      25119  [<ffffffffa039884b>] txg_wait_synced+0x8b/0xd0 [zfs]
      25120  [<ffffffffa039038c>] spa_config_update+0xcc/0x120 [zfs]
      25121  [<ffffffffa038de8a>] spa_import+0x56a/0x730 [zfs]
      25122  [<ffffffffa02fe454>] ? nvlist_lookup_common+0x84/0xd0 [znvpair]
      25123  [<ffffffffa03c0134>] zfs_ioc_pool_import+0xe4/0x120 [zfs]
      25124  [<ffffffffa03c2955>] zfsdev_ioctl+0x495/0x4d0 [zfs]
      25125  [<ffffffff811a3ff2>] vfs_ioctl+0x22/0xa0
      25126  [<ffffffff811a4194>] do_vfs_ioctl+0x84/0x580
      25127  [<ffffffff81190101>] ? __fput+0x1a1/0x210
      25128  [<ffffffff811a4711>] sys_ioctl+0x81/0xa0
      25129  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
    
  • After power-cycling the node, the zpool soaked-ost3 fails to mount with the
    following errors:
    LustreError: 11505:0:(llog_obd.c:209:llog_setup()) MGC192.168.1.108@o2ib10: ctxt 0 lop_setup=ffffffffa06da310 failed: rc = -5
    LustreError: 11505:0:(obd_mount_server.c:308:server_mgc_set_fs()) can't set_fs -5
    LustreError: 11505:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -5
    LustreError: 11505:0:(obd_mount_server.c:1512:server_put_super()) no obd soaked-OST0003
    LustreError: 11505:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-OST0003 not registered
    

    The MGS is available and the IB fabric is operational.

  • Trying to mount zpool soaked-ost7 led to a kernel panic:
    LustreError: 11938:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-OST0007 not registered
    VERIFY3(0 == dmu_buf_hold_array(os, object, offset, size, 0, ((char *)__func__), &numbufs, &dbp)) failed (0 == 5)
    PANIC at dmu.c:819:dmu_write()
    Showing stack for process 9182
    Pid: 9182, comm: txg_sync Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gf9ca359.x86_64 #1
    Call Trace:
     [<ffffffffa02df7cd>] ? spl_dumpstack+0x3d/0x40 [spl]
     [<ffffffffa02df9c2>] ? spl_panic+0xc2/0xe0 [spl]
     [<ffffffffa0349c51>] ? dmu_buf_hold_array_by_dnode+0x231/0x560 [zfs]
     [<ffffffffa035a8b4>] ? dnode_rele_and_unlock+0x64/0xb0 [zfs]
     [<ffffffffa035a943>] ? dnode_rele+0x43/0x50 [zfs]
     [<ffffffffa034a79b>] ? dmu_write+0x19b/0x1a0 [zfs]
     [<ffffffffa0342af2>] ? dmu_buf_will_dirty+0xb2/0x100 [zfs]
     [<ffffffffa0397421>] ? space_map_write+0x361/0x5f0 [zfs]
     [<ffffffffa037b01b>] ? metaslab_sync+0x11b/0x760 [zfs]
     [<ffffffffa0373cf4>] ? dsl_scan_sync+0x54/0xb80 [zfs]
     [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
     [<ffffffffa039be3f>] ? vdev_sync+0x6f/0x140 [zfs]
     [<ffffffffa03839bb>] ? spa_sync+0x4bb/0xb90 [zfs]
     [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
     [<ffffffff8105bd83>] ? __wake_up+0x53/0x70
     [<ffffffff81014a29>] ? read_tsc+0x9/0x20
     [<ffffffffa0399079>] ? txg_sync_thread+0x389/0x5f0 [zfs]
     [<ffffffffa0398cf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
     [<ffffffffa0398cf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
     [<ffffffffa02dcfb8>] ? thread_generic_wrapper+0x68/0x80 [spl]
     [<ffffffffa02dcf50>] ? thread_generic_wrapper+0x0/0x80 [spl]
     [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
     [<ffffffff8100c28a>] ? child_rip+0xa/0x20
     [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
     [<ffffffff8100c280>] ? child_rip+0x0/0x20
    

    Both OSTs were mounted and operational before, and both errors can be reproduced consistently.

The messages and console log files of lola-5 are attached.



 Comments   
Comment by Frank Heckes (Inactive) [ 19/Feb/16 ]

The states of the zpool are as follows:

soaked-ost3

[root@lola-5 ~]# zpool status -v soaked-ost3
  pool: soaked-ost3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        soaked-ost3             ONLINE       0     0     0
          raidz2-0              ONLINE       0     0     0
            lola-5_ost3_disk_0  ONLINE       0     0     0
            lola-5_ost3_disk_1  ONLINE       0     0     0
            lola-5_ost3_disk_2  ONLINE       0     0     0
            lola-5_ost3_disk_3  ONLINE       0     0     0
            lola-5_ost3_disk_4  ONLINE       0     0     0
            lola-5_ost3_disk_5  ONLINE       0     0     0
            lola-5_ost3_disk_6  ONLINE       0     0     0
            lola-5_ost3_disk_7  ONLINE       0     0     0
            lola-5_ost3_disk_8  ONLINE       0     0     0
            lola-5_ost3_disk_9  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        soaked-ost3/ost3:/oi.10

soaked-ost7

[root@lola-5 ~]# zpool status -v soaked-ost7
  pool: soaked-ost7
 state: ONLINE
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        soaked-ost7             ONLINE       0     0     0
          raidz2-0              ONLINE       0     0     0
            lola-5_ost7_disk_0  ONLINE       0     0     0
            lola-5_ost7_disk_1  ONLINE       0     0     0
            lola-5_ost7_disk_2  ONLINE       0     0     0
            lola-5_ost7_disk_3  ONLINE       0     0     0
            lola-5_ost7_disk_4  ONLINE       0     0     0
            lola-5_ost7_disk_5  ONLINE       0     0     0
            lola-5_ost7_disk_6  ONLINE       0     0     0
            lola-5_ost7_disk_7  ONLINE       0     0     0
            lola-5_ost7_disk_8  ONLINE       0     0     0
            lola-5_ost7_disk_9  ONLINE       0     0     0

errors: No known data errors
Comment by Frank Heckes (Inactive) [ 19/Feb/16 ]

One remark: the zpool called out above, 'soaked-ost11' ('...has encountered an uncorrectable I/O failure...'), can be mounted and is operational after Lustre recovery completes.

Comment by Peter Jones [ 19/Feb/16 ]

Alex

Could you please look into how this could have occurred?

Thanks

Peter

Comment by Andreas Dilger [ 19/Feb/16 ]

Unfortunately, there is no OI scrub functionality for ZFS today, so it isn't possible to just delete the corrupted OI file and have OI Scrub rebuild it. ZFS is not supposed to become corrupted during normal usage, and none of the APIs that Lustre is using to modify the filesystem should allow the pool to become corrupted.

However, in LU-7798 Frank wrote:

I forgot to mention that the OSS nodes have been extended to operate in an active-active failover configuration
for disk resources as of Feb 17th, 2016. So the failover partner node lola-4 can see all disks of node lola-5 and also has its ZFS pools imported.
There's no start-up (boot) wrapper script that prevents the (primary) zpools of the other node from being imported.

ZFS is definitely vulnerable to corruption if it is mounted on multiple nodes at the same time, and there is not currently an MMP feature like the one in ldiskfs that actively prevents it from being accessed by two nodes. This kind of corruption would show up first for files that are modified frequently (e.g. the OI file seen here), because each node will be assigning different blocks and updating the tree differently. There needs to be strict Linux HA control of the zpools so that they are not imported on the backup node unless they are failed over, and when failover happens there needs to be STONITH to power off the primary node before the pool is imported on the backup, to avoid concurrent access. It is worthwhile to contact Gabriele or Zhiqi to see if there is a best-practices guide to installing ZFS in an HA failover configuration.
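
For illustration only, that ordering boils down to something like the following manual sequence on the surviving node; the BMC address, credentials and mount point are placeholders, and in practice a real HA stack such as Pacemaker would drive this rather than an operator:

# 1. STONITH: power off the failed primary before touching its pools.
ipmitool -I lanplus -H <lola-5-bmc> -U <user> -P <password> chassis power off

# 2. Only once the primary is verifiably down, import its pool and mount the
#    OST; with a unique /etc/hostid on each node, no forced import (-f)
#    should ever be needed here.
zpool import soaked-ost3
mount -t lustre soaked-ost3/ost3 /mnt/soaked-ost3   # mount point is a placeholder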

Also, see patch http://review.whamcloud.com/16611 "LU-7134 utils: Ensure hostid set for ZFS during mkfs", which requires that /etc/hostid is set and would at least provide basic protection against importing a pool while it is in use on another node.
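
As a minimal sketch of that basic protection (the values shown are examples, and the exact zpool error text varies between ZFS versions):

# Run once per OSS so that SPL/ZFS pick up a persistent, unique hostid.
genhostid
hostid          # must print a different value on lola-4 and lola-5

# With /etc/hostid set, a pool that is still imported on the failover partner
# is refused with "pool may be in use from other system" and can only be
# taken over with an explicit 'zpool import -f'.
zpool import soaked-ost3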

If the corrupted OST zpool hasn't been mounted for a long time, you may be able to recover it by reverting to an older uberblock or snapshot that does not contain the corruption. See for example http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script or https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSMetadataRecovery for details. If that doesn't work, it will be necessary to reformat the filesystem. However, at a minimum /etc/hostid should be set (and be unique!) to prevent gratuitous imports, and failover between OSSes should be disabled until a proper HA configuration is in place.
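
As a concrete starting point for the rollback approach, a sketch based on the standard zpool(8) recovery options (not on the scripts linked above):

# Export the pool first if it is currently imported.
zpool export soaked-ost3

# Dry run: report whether discarding the last few transaction groups would
# make the pool importable again, without changing anything on disk.
zpool import -F -n soaked-ost3

# If that looks sane, import read-only in recovery mode to inspect the data
# before committing to the rollback (or to a reformat).
zpool import -o readonly=on -F soaked-ost3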

Comment by Frank Heckes (Inactive) [ 22/Feb/16 ]

Andreas: Many thanks for the pointers. I'll try to fix the OSTs and enhance the soak framework to follow the HA best practices for ZFS.
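
One possible shape for the missing start-up wrapper is sketched below; the /etc/soak/zpool-owners mapping file and the script itself are hypothetical, and in a proper HA setup the cluster manager, not a boot script, would decide which pools to import:

#!/bin/bash
# import-local-zpools (hypothetical): import only the zpools assigned to this
# node in /etc/soak/zpool-owners, a file of "<hostname> <pool>" lines.
set -eu

OWNERS=/etc/soak/zpool-owners
HOST=$(hostname -s)

while read -r owner pool; do
    # Skip blank lines, comments, and pools owned by the failover partner.
    [ -z "${owner:-}" ] && continue
    case "$owner" in \#*) continue ;; esac
    [ "$owner" = "$HOST" ] || continue

    # Never force the import: if the pool is still active on the other node,
    # zpool import fails and the pool is left untouched.
    zpool list "$pool" >/dev/null 2>&1 || zpool import "$pool" \
        || echo "skipping $pool: import refused" >&2
done < "$OWNERS"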

Comment by Frank Heckes (Inactive) [ 24/Feb/16 ]

I think this ticket can be closed. The error was caused by the node set-up not respecting the ZFS constraints (i.e. the same zpool (OST) was imported simultaneously on two nodes (OSSes)).
