Lustre / LU-7797

Can't mount zpools after OSS restart

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Blocker
    • Environment: lola
      build: 2.8.50-6-gf9ca359; commit f9ca359284357d145819beb08b316e932f7a3060
    • Severity: 3

    Description

      Error happened during soak testing of build '20160218' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160218). DNE is enabled.
      The MDTs have been formatted using ldiskfs, the OSTs using ZFS.

      Sequence of events:

      • 2016-02-18 18:24:30,824:fsmgmt.fsmgmt:INFO executing cmd pm -h powerman -c lola-5 (restart of OSS)
      • The boot process hung with several errors (see line 25015 in console-lola-5.log, after timestamp 'Feb 18, 18:20:01')
          25105 WARNING: Pool 'soaked-ost11' has encountered an uncorrectable I/O failure and has been suspended.
          25106 
          25107 INFO: task zpool:5003 blocked for more than 120 seconds.
          25108       Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gf9ca359.x86_64 #1
          25109 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          25110 zpool         D 0000000000000011     0  5003   4993 0x00000000
          25111  ffff880830f7bbe8 0000000000000086 0000000000000000 ffffffff81064a6e
          25112  ffff880830f7bba8 0000000000000019 0000000d6e7b4a08 0000000000000001
          25113  ffff880830f7bb68 00000000fffc4649 ffff8808317c5068 ffff880830f7bfd8
          25114 Call Trace:
          25115  [<ffffffff81064a6e>] ? try_to_wake_up+0x24e/0x3e0
          25116  [<ffffffffa02e178d>] cv_wait_common+0x11d/0x130 [spl]
          25117  [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
          25118  [<ffffffffa02e17f5>] __cv_wait+0x15/0x20 [spl]
          25119  [<ffffffffa039884b>] txg_wait_synced+0x8b/0xd0 [zfs]
          25120  [<ffffffffa039038c>] spa_config_update+0xcc/0x120 [zfs]
          25121  [<ffffffffa038de8a>] spa_import+0x56a/0x730 [zfs]
          25122  [<ffffffffa02fe454>] ? nvlist_lookup_common+0x84/0xd0 [znvpair]
          25123  [<ffffffffa03c0134>] zfs_ioc_pool_import+0xe4/0x120 [zfs]
          25124  [<ffffffffa03c2955>] zfsdev_ioctl+0x495/0x4d0 [zfs]
          25125  [<ffffffff811a3ff2>] vfs_ioctl+0x22/0xa0
          25126  [<ffffffff811a4194>] do_vfs_ioctl+0x84/0x580
          25127  [<ffffffff81190101>] ? __fput+0x1a1/0x210
          25128  [<ffffffff811a4711>] sys_ioctl+0x81/0xa0
          25129  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
        
      • After power-cycling the node, the zpool soaked-ost3 fails to mount with
        the following errors:
        LustreError: 11505:0:(llog_obd.c:209:llog_setup()) MGC192.168.1.108@o2ib10: ctxt 0 lop_setup=ffffffffa06da310 failed: rc = -5
        LustreError: 11505:0:(obd_mount_server.c:308:server_mgc_set_fs()) can't set_fs -5
        LustreError: 11505:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -5
        LustreError: 11505:0:(obd_mount_server.c:1512:server_put_super()) no obd soaked-OST0003
        LustreError: 11505:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-OST0003 not registered
        

        The MGS is available and the IB fabric is operational.

      • Trying to mount zpool soaked-ost7 leads to a kernel panic:
        LustreError: 11938:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-OST0007 not registered
        VERIFY3(0 == dmu_buf_hold_array(os, object, offset, size, 0, ((char *)__func__), &numbufs, &dbp)) failed (0 == 5)
        PANIC at dmu.c:819:dmu_write()
        Showing stack for process 9182
        Pid: 9182, comm: txg_sync Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gf9ca359.x86_64 #1
        Call Trace:
         [<ffffffffa02df7cd>] ? spl_dumpstack+0x3d/0x40 [spl]
         [<ffffffffa02df9c2>] ? spl_panic+0xc2/0xe0 [spl]
         [<ffffffffa0349c51>] ? dmu_buf_hold_array_by_dnode+0x231/0x560 [zfs]
         [<ffffffffa035a8b4>] ? dnode_rele_and_unlock+0x64/0xb0 [zfs]
         [<ffffffffa035a943>] ? dnode_rele+0x43/0x50 [zfs]
         [<ffffffffa034a79b>] ? dmu_write+0x19b/0x1a0 [zfs]
         [<ffffffffa0342af2>] ? dmu_buf_will_dirty+0xb2/0x100 [zfs]
         [<ffffffffa0397421>] ? space_map_write+0x361/0x5f0 [zfs]
         [<ffffffffa037b01b>] ? metaslab_sync+0x11b/0x760 [zfs]
         [<ffffffffa0373cf4>] ? dsl_scan_sync+0x54/0xb80 [zfs]
         [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
         [<ffffffffa039be3f>] ? vdev_sync+0x6f/0x140 [zfs]
         [<ffffffffa03839bb>] ? spa_sync+0x4bb/0xb90 [zfs]
         [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
         [<ffffffff8105bd83>] ? __wake_up+0x53/0x70
         [<ffffffff81014a29>] ? read_tsc+0x9/0x20
         [<ffffffffa0399079>] ? txg_sync_thread+0x389/0x5f0 [zfs]
         [<ffffffffa0398cf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
         [<ffffffffa0398cf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
         [<ffffffffa02dcfb8>] ? thread_generic_wrapper+0x68/0x80 [spl]
         [<ffffffffa02dcf50>] ? thread_generic_wrapper+0x0/0x80 [spl]
         [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
         [<ffffffff8100c28a>] ? child_rip+0xa/0x20
         [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
         [<ffffffff8100c280>] ? child_rip+0x0/0x20
        

        Both OSTs were mounted and operational before the restart, and both errors can be reproduced consistently.

      Attached are the messages and console log files of lola-5.
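      For reference, a minimal sketch of how such a mount attempt can be reproduced by hand on the OSS. The dataset name soaked-ost3/ost3 follows the naming visible in the zpool status output further down; the mount point is a hypothetical example.

        # Import the pool and mount the Lustre OST dataset manually
        # (mount point /mnt/soaked-ost3 is a hypothetical example)
        [root@lola-5 ~]# zpool import soaked-ost3
        [root@lola-5 ~]# mount -t lustre soaked-ost3/ost3 /mnt/soaked-ost3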

      Attachments

        Issue Links

          Activity

            [LU-7797] Can't mount zpools after OSS restart

            heckes Frank Heckes (Inactive) added a comment -

            I think this ticket can be closed. The error was caused by the node set-up not reflecting the ZFS constraints (i.e. importing the same zpool (OST) simultaneously on two nodes (OSSes)).
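            Purely as an illustration, a hedged sketch of how such a double import can be checked for from either OSS before mounting; the vdev path passed to zdb is a hypothetical example.

            # Pools currently imported on this node
            [root@lola-5 ~]# zpool list
            # Pools visible but not imported here; a pool missing from both lists
            # on this node may be imported on the failover partner
            [root@lola-5 ~]# zpool import
            # The pool labels record the hostname of the last system to import it
            # (device path below is a hypothetical example)
            [root@lola-5 ~]# zdb -l /dev/mapper/lola-5_ost3_disk_0 | grep -i hostname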

            heckes Frank Heckes (Inactive) added a comment -

            Andreas: Many thanks for the pointers. I'll try to fix the OSTs and enhance the soak framework to follow the HA best practices for ZFS.

            adilger Andreas Dilger added a comment -

            Unfortunately, there is no OI scrub functionality for ZFS today, so it isn't possible to just delete the corrupted OI file and have OI Scrub rebuild it. ZFS is not supposed to become corrupted during use, and none of the APIs that Lustre uses to modify the filesystem should allow the pool to become corrupt.

            However, in LU-7798 Frank wrote:

            I forgot to mention that the OSS nodes have been extended to operate in an active-active failover configuration
            for disk resources as of Feb 17th, 2016. So the failover partner node lola-4 can see all disks of node lola-5 and also has its ZFS pools imported.
            There's no start-up (boot) wrapper script that prevents the (primary) zpools of the other node from being imported.

            ZFS is definitely vulnerable to corruption if it is mounted on multiple nodes at the same time, and there is currently no MMP feature, as there is in ldiskfs, that actively prevents it from being accessed by two nodes. This kind of corruption would show up first for files that are modified frequently (e.g. the OI file seen here), because each node will be assigning different blocks and updating the tree differently. There needs to be strict Linux HA control of the zpools so that they are not imported on the backup node unless they are failed over, and when failover happens there needs to be STONITH to turn off the primary node before the pool is imported on the backup, to avoid concurrent access. It is worthwhile to contact Gabriele or Zhiqi to see if there is a best-practices guide for installing ZFS in an HA failover configuration.
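            As an illustration only, a minimal Pacemaker/pcs sketch of this kind of control, assuming the ocf:heartbeat:ZFS resource agent and IPMI-based fencing are available; all resource names, addresses and credentials below are hypothetical, and fence_ipmilan parameter names vary between fence-agents versions.

            # STONITH so the old owner is powered off before its pool moves
            pcs stonith create fence-lola-4 fence_ipmilan ipaddr=10.0.0.4 login=admin passwd=secret pcmk_host_list=lola-4
            pcs stonith create fence-lola-5 fence_ipmilan ipaddr=10.0.0.5 login=admin passwd=secret pcmk_host_list=lola-5
            pcs property set stonith-enabled=true

            # Each zpool is a cluster resource, so it is only ever imported on one node
            pcs resource create ost3-pool ocf:heartbeat:ZFS pool=soaked-ost3
            pcs constraint location ost3-pool prefers lola-5=100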

            Also, see patch http://review.whamcloud.com/16611 "LU-7134 utils: Ensure hostid set for ZFS during mkfs", which requires that /etc/hostid is set and would at least provide basic protection against importing a pool while it is in use on another node.
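            A minimal sketch of setting and verifying the host ID on each node, assuming the genhostid utility is available; making sure the values are unique across nodes remains a manual step.

            # Generate a random host ID and persist it in /etc/hostid
            [root@lola-5 ~]# genhostid
            # Verify that the value is non-zero (and different on every node)
            [root@lola-5 ~]# hostid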

            You may be able to recover the corrupted OST zpool if it hasn't been mounted for a long time by reverting to an older uberblock or snapshot that does not have the corruption in it. See for example http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script or https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSMetadataRecovery for details. If that doesn't work then it would be necessary to reformat the filesystem. However, at a minimum the /etc/hostid should be set (and unique!) to prevent gratuitous imports, and failover between OSSes should be disabled until proper HA configuration is done.
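            As an illustration, a hedged sketch of such a recovery attempt using the documented zpool import rewind options; this discards the most recent transactions, so it should only be run once the pool is exported everywhere, and ideally after checking the data via a read-only import.

            # Dry run (-n with -F): report whether discarding the last transactions
            # would make the pool importable, without changing anything
            [root@lola-5 ~]# zpool import -F -n soaked-ost3
            # Optionally inspect the data through a read-only import first
            [root@lola-5 ~]# zpool import -o readonly=on soaked-ost3
            [root@lola-5 ~]# zpool export soaked-ost3
            # Rewind to an older consistent state (recent transactions are lost)
            [root@lola-5 ~]# zpool import -F soaked-ost3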

            pjones Peter Jones added a comment -

            Alex

            Could you please look into how this could have occurred?

            Thanks

            Peter


            heckes Frank Heckes (Inactive) added a comment -

            One remark: the called-out zpool 'soaked-ost11' ('...has unrecoverable errors...') can be mounted and is operational after Lustre recovery completes.

            heckes Frank Heckes (Inactive) added a comment -

            The states of the zpools are as follows:

            soaked-ost3

            [root@lola-5 ~]# zpool status -v soaked-ost3
              pool: soaked-ost3
             state: ONLINE
            status: One or more devices has experienced an error resulting in data
                    corruption.  Applications may be affected.
            action: Restore the file in question if possible.  Otherwise restore the
                    entire pool from backup.
               see: http://zfsonlinux.org/msg/ZFS-8000-8A
              scan: none requested
            config:
            
                    NAME                    STATE     READ WRITE CKSUM
                    soaked-ost3             ONLINE       0     0     0
                      raidz2-0              ONLINE       0     0     0
                        lola-5_ost3_disk_0  ONLINE       0     0     0
                        lola-5_ost3_disk_1  ONLINE       0     0     0
                        lola-5_ost3_disk_2  ONLINE       0     0     0
                        lola-5_ost3_disk_3  ONLINE       0     0     0
                        lola-5_ost3_disk_4  ONLINE       0     0     0
                        lola-5_ost3_disk_5  ONLINE       0     0     0
                        lola-5_ost3_disk_6  ONLINE       0     0     0
                        lola-5_ost3_disk_7  ONLINE       0     0     0
                        lola-5_ost3_disk_8  ONLINE       0     0     0
                        lola-5_ost3_disk_9  ONLINE       0     0     0
            
            errors: Permanent errors have been detected in the following files:
            
                    soaked-ost3/ost3:/oi.10
            

            soaked-ost7

            [root@lola-5 ~]# zpool status -v soaked-ost7
              pool: soaked-ost7
             state: ONLINE
              scan: none requested
            config:
            
                    NAME                    STATE     READ WRITE CKSUM
                    soaked-ost7             ONLINE       0     0     0
                      raidz2-0              ONLINE       0     0     0
                        lola-5_ost7_disk_0  ONLINE       0     0     0
                        lola-5_ost7_disk_1  ONLINE       0     0     0
                        lola-5_ost7_disk_2  ONLINE       0     0     0
                        lola-5_ost7_disk_3  ONLINE       0     0     0
                        lola-5_ost7_disk_4  ONLINE       0     0     0
                        lola-5_ost7_disk_5  ONLINE       0     0     0
                        lola-5_ost7_disk_6  ONLINE       0     0     0
                        lola-5_ost7_disk_7  ONLINE       0     0     0
                        lola-5_ost7_disk_8  ONLINE       0     0     0
                        lola-5_ost7_disk_9  ONLINE       0     0     0
            
            errors: No known data errors
            

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 5
