[LU-16072] snapshot support to foreign host Created: 03/Aug/22 Updated: 25/Jan/24 Resolved: 04/Oct/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Akash B | Assignee: | Akash B |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ZFS, server | ||
| Environment: |
Lustre filesystem with ZFS as backend filesystem. |
||
| Issue Links: |
|
||||||||
| Epic/Theme: | zfs | ||||||||
| Epic: | server, zfs | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
lctl snapshot_create does not work if one of the nodes is not reachable. The lctl snapshot commands fail if the resources fail over to the partner node, even though the partner node details are given in ldev.conf.

Sample ldev.conf is as follows:

cslmo4702 cslmo4703 testfs-MDT0000 zfs:/dev/pool-mds65/mdt65 - -
cslmo4703 cslmo4702 testfs-MDT0001 zfs:/dev/pool-mds66/mdt66 - -
cslmo4704 cslmo4705 testfs-OST0000 zfs:/dev/pool-oss0/ost0 - -
cslmo4705 cslmo4704 testfs-OST0001 zfs:/dev/pool-oss1/ost1 - -

For example, take two nodes cslmo4704 and cslmo4705. cslmo4705 is the partner of cslmo4704 and vice versa; cslmo4704 has dataset zfs:/dev/pool-oss0/ost0 and cslmo4705 has dataset zfs:/dev/pool-oss1/ost1. If I fail/power off host cslmo4705, the dataset /dev/pool-oss1/ost1 correctly fails over to cslmo4704, so cslmo4704 then hosts both datasets. In this situation, trying to create a Lustre snapshot with "lctl snapshot_create" fails on the dataset /dev/pool-oss1/ost1:

[root@cslmo4702 ~]# lctl snapshot_create -F testfs -n snap_test5
ssh: connect to host cslmo4705 port 22: No route to host
ssh: connect to host cslmo4705 port 22: No route to host
ssh: connect to host cslmo4705 port 22: No route to host
Can't create the snapshot snap_test5
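For illustration only, the snippet below is a minimal sketch of the fallback behaviour this ticket asks for; it is not the lctl implementation. It assumes the ldev.conf layout shown above (local host, foreign host, label, zfs device path), the hypothetical snapshot name snap_test5, and simply retries each snapshot on the foreign (partner) host when the local host is unreachable or does not hold the dataset:

#!/bin/bash
# Sketch only: walk /etc/ldev.conf and snapshot each target, preferring the
# local host but falling back to the configured foreign (partner) host.
LDEV_CONF=/etc/ldev.conf
SNAP=snap_test5                              # hypothetical snapshot name

grep -v '^#' "$LDEV_CONF" | while read -r lhost fhost label device _; do
    dataset=${device#zfs:}                   # drop the "zfs:" prefix
    dataset=${dataset#/dev/}                 # tolerate /dev/-style paths as above
    for host in "$lhost" "$fhost"; do
        [ "$host" = "-" ] && continue        # no failover partner configured
        if ssh -o ConnectTimeout=5 "$host" zfs snapshot "${dataset}@${SNAP}"; then
            echo "created ${dataset}@${SNAP} on ${host} (${label})"
            break                            # done with this target
        fi
        echo "${label}: ${host} unreachable or dataset absent, trying partner" >&2
    done
done

In this sketch a failed ssh or a failed zfs snapshot on the local host both trigger the retry on the partner, which matches the failover scenario described above where the pool has been imported on the surviving node. |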
| Comments |
| Comment by Akash B [ 03/Aug/22 ] |
|
HPE bug-id: LUS-10648

Reproduced with Lustre 2.15.

Snapshot config used:

[root@cslmo1602 ~]# cat /etc/ldev.conf
#local  foreign/-  label  [md|zfs:]device-path  [journal-path]/-  [raidtab]
cslmo1602 cslmo1603 testfs-MDT0000 zfs:pool-mds65/mdt65
cslmo1603 cslmo1602 testfs-MDT0001 zfs:pool-mds66/mdt66
cslmo1604 cslmo1605 testfs-OST0000 zfs:pool-oss0/ost0
cslmo1605 cslmo1604 testfs-OST0001 zfs:pool-oss1/ost1
cslmo1606 cslmo1607 testfs-OST0002 zfs:pool-oss0/ost0
cslmo1607 cslmo1606 testfs-OST0003 zfs:pool-oss1/ost1

Lustre targets when nodes are in a failed-over state:

[root@cslmo1600 ~]# pdsh -g lustre mount -t lustre | sort
cslmo1602: pool-mds65/mdt65 on /data/mdt65 type lustre (ro,svname=testfs-MDT0000,mgs,osd=osd-zfs)
cslmo1602: pool-mds66/mdt66 on /data/mdt66 type lustre (ro,svname=testfs-MDT0001,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)
cslmo1605: pool-oss0/ost0 on /data/ost0 type lustre (ro,svname=testfs-OST0000,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)
cslmo1605: pool-oss1/ost1 on /data/ost1 type lustre (ro,svname=testfs-OST0001,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)
cslmo1606: pool-oss0/ost0 on /data/ost0 type lustre (ro,svname=testfs-OST0002,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)
cslmo1606: pool-oss1/ost1 on /data/ost1 type lustre (ro,svname=testfs-OST0003,mgsnode=:10.230.26.5@o2ib,10.230.26.6@o2ib:10.230.26.7@o2ib,10.230.26.8@o2ib,osd=osd-zfs)

Before PATCH:

snapshot create:
[root@cslmo1602 ~]# lctl snapshot_create -F testfs -n snap1
ssh: connect to host cslmo1604 port 22: No route to host
Can't create the snapshot snap1
[root@cslmo1602 ~]# lctl snapshot_list -F testfs
filesystem_name: testfs
snapshot_name: pre_snap
create_time: Thu Jul 7 16:19:49 2022
modify_time: Thu Jul 7 16:19:49 2022
snapshot_fsname: 01a29921
status: not mount
[root@cslmo1602 ~]#

snapshot list:
[root@cslmo1602 ~]# lctl snapshot_list -F testfs -d
filesystem_name: testfs
snapshot_name: pre_snap
snapshot_role: MDT0000
create_time: Thu Jul 7 16:19:49 2022
modify_time: Thu Jul 7 16:19:49 2022
snapshot_fsname: 01a29921
status: not mount
snapshot_role: MDT0001
cannot open 'pool-mds66/mdt66@pre_snap': dataset does not exist
status: not mount
snapshot_role: OST0000
cannot open 'pool-oss0/ost0@pre_snap': dataset does not exist
status: not mount
snapshot_role: OST0001
create_time: Thu Jul 7 16:19:49 2022
modify_time: Thu Jul 7 16:19:49 2022
snapshot_fsname: 01a29921
status: not mount
snapshot_role: OST0002
snapshot_fsname: 01a29921
modify_time: Thu Jul 7 16:19:49 2022
create_time: Thu Jul 7 16:19:49 2022
status: not mount
snapshot_role: OST0003
cannot open 'pool-oss1/ost1@pre_snap': dataset does not exist
status: not mount

snapshot mount/umount:
[root@cslmo1602 ~]# lctl snapshot_mount -F testfs -n pre_snap
mount.lustre: pool-mds66/mdt66@pre_snap has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool
mount.lustre: pool-oss0/ost0@pre_snap has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool
mount.lustre: pool-oss1/ost1@pre_snap has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool
3 of 6 pieces of the snapshot pre_snap can't be mounted: No such device

snapshot modify:
[root@cslmo1602 ~]# lctl snapshot_modify -F testfs -n pre_snap -N mod_snap
cannot open 'pool-oss0/ost0@pre_snap': dataset does not exist
cannot open 'pool-mds66/mdt66@pre_snap': dataset does not exist
cannot open 'pool-oss1/ost1@pre_snap': dataset does not exist
Can't modify the snapshot pre_snap

snapshot destroy:
[root@cslmo1602 ~]# lctl snapshot_destroy -F testfs -n pre_snap
Miss snapshot piece on the OST0000. Use '-f' option if want to destroy it by force.
Can't destroy the snapshot pre_snap
[root@cslmo1602 ~]#

After PATCH (Lustre fix applied):

snapshot create:
[root@cslmo1602 ~]# lctl snapshot_create -F testfs -n snap1
[root@cslmo1602 ~]#

snapshot list:
[root@cslmo1602 ~]# lctl snapshot_list -F testfs
filesystem_name: testfs
snapshot_name: snap1
modify_time: Fri Jul 8 14:06:05 2022
create_time: Fri Jul 8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
filesystem_name: testfs
snapshot_name: pre_snap
create_time: Thu Jul 7 16:19:49 2022
modify_time: Thu Jul 7 16:35:34 2022
snapshot_fsname: 01a29921
status: not mount
[root@cslmo1602 ~]#
[root@cslmo1602 ~]# lctl snapshot_list -F testfs -n snap1 -d
filesystem_name: testfs
snapshot_name: snap1
snapshot_role: MDT0000
modify_time: Fri Jul 8 14:06:05 2022
create_time: Fri Jul 8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
snapshot_role: MDT0001
modify_time: Fri Jul 8 14:06:05 2022
create_time: Fri Jul 8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
snapshot_role: OST0000
create_time: Fri Jul 8 14:06:05 2022
modify_time: Fri Jul 8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
snapshot_role: OST0001
snapshot_fsname: 4d4aaffb
create_time: Fri Jul 8 14:06:05 2022
modify_time: Fri Jul 8 14:06:05 2022
status: not mount
snapshot_role: OST0002
snapshot_fsname: 4d4aaffb
create_time: Fri Jul 8 14:06:05 2022
modify_time: Fri Jul 8 14:06:05 2022
status: not mount
snapshot_role: OST0003
create_time: Fri Jul 8 14:06:05 2022
modify_time: Fri Jul 8 14:06:05 2022
snapshot_fsname: 4d4aaffb
status: not mount
[root@cslmo1602 ~]#

snapshot mount/umount:
[root@cslmo1602 ~]# lctl snapshot_mount -F testfs -n snap1
mounted the snapshot snap1 with fsname 4d4aaffb
[root@cslmo1602 ~]#
[root@cslmo1602 ~]# lctl snapshot_umount -F testfs -n snap1
[root@cslmo1602 ~]#

snapshot modify:
[root@cslmo1602 ~]# lctl snapshot_modify -F testfs -n snap1 -N Snap1
[root@cslmo1602 ~]#

snapshot destroy:
[root@cslmo1602 ~]# lctl snapshot_destroy -F testfs -n Snap1
[root@cslmo1602 ~]#
[root@cslmo1602 ~]# lctl snapshot_list -F testfs -n Snap1
Can't list the snapshot Snap1
[root@cslmo1602 ~]#

With the fix applied, we can create/destroy/modify/list/mount/umount a Lustre snapshot even when Lustre targets have failed over to the partner nodes defined in the /etc/ldev.conf configuration file. Previously this failed because the foreign host field in /etc/ldev.conf was ignored.
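As a side note, a quick way to confirm which node of a failover pair currently has a given pool imported (and therefore where the snapshot pieces live) is to probe both hosts. The hostnames and pool below are taken from the test configuration above; this is only an assumed helper for checking the environment, not part of the patch:

#!/bin/bash
# Assumed helper: find the node that currently has the pool imported by
# asking both hosts of the failover pair, local host first.
POOL=pool-oss1
for host in cslmo1605 cslmo1604; do
    if ssh -o ConnectTimeout=5 "$host" zpool list -H -o name "$POOL" >/dev/null 2>&1; then
        echo "$POOL is imported on $host"
        break
    fi
done

Running such a check before snapshot_mount or snapshot_destroy makes the pre-patch failures easier to see: the commands only contacted the host in the local column even after the pool had moved to the partner. |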
| Comment by Cory Spitz [ 03/Aug/22 ] |
|
pjones, can you please assign this to akash-b? Jira gives me some "communications breakdown" error when I try to do it. |
| Comment by Peter Jones [ 03/Aug/22 ] |
|
Cory, I think that was JIRA's helpful error message to tell you that Akash B was not a valid selection.

Peter |
| Comment by Akash B [ 24/Aug/22 ] |
|
CR/Patch: https://review.whamcloud.com/#/c/48226/
Subject:
Project: fs/lustre-release |
| Comment by Gerrit Updater [ 04/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48226/ |
| Comment by Peter Jones [ 04/Oct/22 ] |
|
Landed for 2.16 |