[LU-16072] snapshot support to foreign host Created: 03/Aug/22  Updated: 25/Jan/24  Resolved: 04/Oct/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream, Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: New Feature Priority: Minor
Reporter: Akash B Assignee: Akash B
Resolution: Fixed Votes: 0
Labels: ZFS, server
Environment:

Lustre filesystem with ZFS as backend filesystem.


Issue Links:
Related
is related to LU-12638 lsnapshot ignores failover host in ld... Resolved
Epic/Theme: zfs
Epic: server, zfs
Rank (Obsolete): 9223372036854775807

 Description   

lctl snapshot_create does not work if one of the nodes is not reachable.

The lctl snapshot commands do not work if the resources fail over to the partner node, even though the partner node details are listed in ldev.conf.

Sample ldev.conf is as follows:

cslmo4702       cslmo4703       testfs-MDT0000  zfs:/dev/pool-mds65/mdt65 - -
cslmo4703       cslmo4702       testfs-MDT0001  zfs:/dev/pool-mds66/mdt66 - -
cslmo4704       cslmo4705       testfs-OST0000  zfs:/dev/pool-oss0/ost0 - -
cslmo4705       cslmo4704       testfs-OST0001  zfs:/dev/pool-oss1/ost1 - -

For example, suppose there are two nodes, cslmo4704 and cslmo4705, configured as failover partners of each other. cslmo4704 has dataset zfs:/dev/pool-oss0/ost0 and cslmo4705 has dataset zfs:/dev/pool-oss1/ost1. If host cslmo4705 is failed/powered off, the dataset /dev/pool-oss1/ost1 correctly fails over to cslmo4704, so cslmo4704 now hosts both datasets. In this situation, trying to create a Lustre snapshot with "lctl snapshot_create" fails on the dataset /dev/pool-oss1/ost1:

[root@cslmo4702 ~]# lctl snapshot_create -F testfs -n snap_test5
ssh: connect to host cslmo4705 port 22: No route to host
ssh: connect to host cslmo4705 port 22: No route to host
ssh: connect to host cslmo4705 port 22: No route to host
Can't create the snapshot snap_test5
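
The fix makes the snapshot utility fall back to the foreign host from ldev.conf when the local host is unreachable. As a minimal illustrative sketch only (this is not the actual lsnapshot patch; the function name, the 5-second timeout, and the host/dataset names are assumptions), the fallback logic amounts to:

snapshot_with_failover() {
    # Args: local host, foreign (failover) host, ZFS dataset, snapshot name
    local lhost=$1 fhost=$2 dataset=$3 snap=$4
    local pool=${dataset%%/*} host
    for host in "$lhost" "$fhost"; do
        # ConnectTimeout keeps a powered-off node from hanging the run;
        # 'zpool list <pool>' succeeds only where the pool is imported.
        if ssh -o ConnectTimeout=5 "$host" "zpool list -H -o name $pool" \
               >/dev/null 2>&1; then
            ssh "$host" "zfs snapshot $dataset@$snap"
            return $?
        fi
    done
    echo "pool $pool not imported on $lhost or $fhost" >&2
    return 1
}

# e.g.: snapshot_with_failover cslmo4705 cslmo4704 pool-oss1/ost1 snap_test5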


 Comments   
Comment by Akash B [ 03/Aug/22 ]

HPE bug-id: LUS-10648

Reproduced with Lustre 2.15:

snapshot config used:

[root@cslmo1602 ~]# cat /etc/ldev.conf
#local foreign/- label [md|zfs:]device-path [journal-path]/- [raidtab]
cslmo1602 cslmo1603 testfs-MDT0000 zfs:pool-mds65/mdt65
cslmo1603 cslmo1602 testfs-MDT0001 zfs:pool-mds66/mdt66
cslmo1604 cslmo1605 testfs-OST0000 zfs:pool-oss0/ost0
cslmo1605 cslmo1604 testfs-OST0001 zfs:pool-oss1/ost1
cslmo1606 cslmo1607 testfs-OST0002 zfs:pool-oss0/ost0
cslmo1607 cslmo1606 testfs-OST0003 zfs:pool-oss1/ost1
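
As the comment line shows, each non-comment entry carries the local host, the foreign (failover) host, the target label, and the device path, so the failover host is available to any tool that parses the file. An illustrative one-liner (not part of Lustre) that prints the failover pairing per target:

awk '!/^#/ && NF >= 4 { printf "%s: local=%s foreign=%s dev=%s\n", $3, $1, $2, $4 }' /etc/ldev.conf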

Lustre targets when nodes are in a failed-over state:

[root@cslmo1600 ~]# pdsh -g lustre mount -t lustre | sort
cslmo1602: pool-mds65/mdt65 on /data/mdt65 type lustre (ro,svname=testfs-MDT0000,mgs,osd=osd-zfs)
cslmo1602: pool-mds66/mdt66 on /data/mdt66 type lustre (ro,svname=testfs-MDT0001,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)
cslmo1605: pool-oss0/ost0 on /data/ost0 type lustre (ro,svname=testfs-OST0000,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)
cslmo1605: pool-oss1/ost1 on /data/ost1 type lustre (ro,svname=testfs-OST0001,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)
cslmo1606: pool-oss0/ost0 on /data/ost0 type lustre (ro,svname=testfs-OST0002,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)
cslmo1606: pool-oss1/ost1 on /data/ost1 type lustre (ro,svname=testfs-OST0003,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)

Before PATCH:
=========

snapshot create:

[root@cslmo1602 ~]# lctl snapshot_create -F testfs -n snap1
ssh: connect to host cslmo1604 port 22: No route to host
Can't create the snapshot snap1
 
[root@cslmo1602 ~]# lctl snapshot_list -F testfs
 
filesystem_name: testfs
snapshot_name: pre_snap
create_time: Thu Jul  7 16:19:49 2022
modify_time: Thu Jul  7 16:19:49 2022
snapshot_fsname: 01a29921
status: not mount
[root@cslmo1602 ~]#

snapshot list:

[root@cslmo1602 ~]# lctl snapshot_list -F testfs -d
 
filesystem_name: testfs
snapshot_name: pre_snap
 
snapshot_role: MDT0000
create_time: Thu Jul  7 16:19:49 2022
modify_time: Thu Jul  7 16:19:49 2022
snapshot_fsname: 01a29921
status: not mount
 
snapshot_role: MDT0001
cannot open 'pool-mds66/mdt66@pre_snap': dataset does not exist
status: not mount
 
snapshot_role: OST0000
cannot open 'pool-oss0/ost0@pre_snap': dataset does not exist
status: not mount
 
snapshot_role: OST0001
create_time: Thu Jul  7 16:19:49 2022
modify_time: Thu Jul  7 16:19:49 2022
snapshot_fsname: 01a29921
status: not mount
 
snapshot_role: OST0002
snapshot_fsname: 01a29921
modify_time: Thu Jul  7 16:19:49 2022
create_time: Thu Jul  7 16:19:49 2022
status: not mount
 
snapshot_role: OST0003
cannot open 'pool-oss1/ost1@pre_snap': dataset does not exist
status: not mount
 

snapshot mount/umount:

[root@cslmo1602 ~]# lctl snapshot_mount -F testfs -n pre_snap
mount.lustre: pool-mds66/mdt66@pre_snap has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool
mount.lustre: pool-oss0/ost0@pre_snap has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool
mount.lustre: pool-oss1/ost1@pre_snap has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool
3 of 6 pieces of the snapshot pre_snap can't be mounted: No such device
 

snapshot modify:

[root@cslmo1602 ~]# lctl snapshot_modify -F testfs -n pre_snap -N mod_snap
cannot open 'pool-oss0/ost0@pre_snap': dataset does not exist
cannot open 'pool-mds66/mdt66@pre_snap': dataset does not exist
cannot open 'pool-oss1/ost1@pre_snap': dataset does not exist
Can't modify the snapshot pre_snap
 

snapshot destroy:

[root@cslmo1602 ~]# lctl snapshot_destroy -F testfs -n pre_snap
Miss snapshot piece on the OST0000. Use '-f' option if want to destroy it by force.
Can't destroy the snapshot pre_snap
[root@cslmo1602 ~]#

After PATCH:
========

Applied Lustre fix:

snapshot create:

[root@cslmo1602 ~]# lctl snapshot_create -F testfs -n snap1
[root@cslmo1602 ~]#

snapshot list:

[root@cslmo1602 ~]# lctl snapshot_list -F testfs
 
filesystem_name: testfs
snapshot_name: snap1
modify_time: Fri Jul  8 14:06:05 2022
create_time: Fri Jul  8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
 
filesystem_name: testfs
snapshot_name: pre_snap
create_time: Thu Jul  7 16:19:49 2022
modify_time: Thu Jul  7 16:35:34 2022
snapshot_fsname: 01a29921
status: not mount
[root@cslmo1602 ~]#
[root@cslmo1602 ~]# lctl snapshot_list -F testfs -n snap1 -d
 
filesystem_name: testfs
snapshot_name: snap1
 
snapshot_role: MDT0000
modify_time: Fri Jul  8 14:06:05 2022
create_time: Fri Jul  8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
 
snapshot_role: MDT0001
modify_time: Fri Jul  8 14:06:05 2022
create_time: Fri Jul  8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
 
snapshot_role: OST0000
create_time: Fri Jul  8 14:06:05 2022
modify_time: Fri Jul  8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
 
snapshot_role: OST0001
snapshot_fsname: 4d4aaffb
create_time: Fri Jul  8 14:06:05 2022
modify_time: Fri Jul  8 14:06:05 2022
status: not mount
 
snapshot_role: OST0002
snapshot_fsname: 4d4aaffb
create_time: Fri Jul  8 14:06:05 2022
modify_time: Fri Jul  8 14:06:05 2022
status: not mount
 
snapshot_role: OST0003
create_time: Fri Jul  8 14:06:05 2022
modify_time: Fri Jul  8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
[root@cslmo1602 ~]#

snapshot mount/umount:

[root@cslmo1602 ~]# lctl snapshot_mount -F testfs -n snap1
mounted the snapshot snap1 with fsname 4d4aaffb
[root@cslmo1602 ~]#
[root@cslmo1602 ~]# lctl snapshot_umount -F testfs -n snap1
[root@cslmo1602 ~]#

snapshot modify:

[root@cslmo1602 ~]# lctl snapshot_modify -F testfs -n snap1 -N Snap1
[root@cslmo1602 ~]#

snapshot destroy:

[root@cslmo1602 ~]# lctl snapshot_destroy -F testfs -n Snap1
[root@cslmo1602 ~]#
[root@cslmo1602 ~]# lctl snapshot_list -F testfs -n Snap1
Can't list the snapshot Snap1
[root@cslmo1602 ~]#

With the fix applied, we are able to create/destroy/modify/list/mount/umount a Lustre snapshot even when Lustre targets have failed over to the partner nodes defined in the /etc/ldev.conf configuration file. Previously this would fail because the foreign host field in /etc/ldev.conf was ignored.
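
To sanity-check where each pool is actually imported before running the snapshot commands, a loop like the following can help (illustrative only; host names taken from the test setup above):

for h in cslmo1602 cslmo1603 cslmo1604 cslmo1605 cslmo1606 cslmo1607; do
    # List pools imported on each node; unreachable nodes just time out quietly
    ssh -o ConnectTimeout=5 "$h" 'zpool list -H -o name' 2>/dev/null |
        sed "s/^/$h: /"
done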

Comment by Cory Spitz [ 03/Aug/22 ]

pjones, can you please assign this to akash-b? Jira gives me some "communications breakdown" error when I try to do it.

Comment by Peter Jones [ 03/Aug/22 ]

Cory

I think that was JIRA's helpful error message to tell you that Akash B was not a valid selection. I have added Akash B to the developers group and tickets can now be assigned OK.

Peter

Comment by Akash B [ 24/Aug/22 ]

CR/Patch: https://review.whamcloud.com/#/c/48226/

Subject: LU-16072 utils: snapshot support to foreign host

Project: fs/lustre-release
Branch: master

Comment by Gerrit Updater [ 04/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48226/
Subject: LU-16072 utils: snapshot support to foreign host
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 815ca64afc8e54f9707ed9f458e14a9c99629ed7

Comment by Peter Jones [ 04/Oct/22 ]

Landed for 2.16
