[LU-16054] changes in LU-14797 create issues with older clients accessing fs through nodemaps Created: 28/Jul/22  Updated: 29/Sep/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Beal Assignee: Sebastien Buisson
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Text File LU-14797-nodemap-map-project-id_b2_14_CLIENTONLY.patch     Text File LU-14797-sec-add-projid-to-nodemap_b2_14_CLIENTONLY.patch     Text File LU-15661-nodemap-fix-map-mode-value-for-both_b2_14_CLIENTONLY.patch     Text File nodemap.txt    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have a client accessing a file system via a nodemap.

ubuntu@lus2526-tcp1:~$ cat /sys/fs/lustre/version 
2.14.0_2_gb280f22
ubuntu@lus2526-tcp1:~$ df /lustre/scratch12{5..6}
Filesystem                                     1K-blocks  Used     Available Use% Mounted on
10.160.40.37@tcp1:10.160.40.36@tcp1:/lus25 5161226796192 21176 5109139362568   1% /lustre/scratch125
10.160.42.37@tcp1:10.160.42.36@tcp1:/lus26 4301022330160 17604 4257616135516   1% /lustre/scratch126
ubuntu@lus2526-tcp1:~$ id
uid=1000(ubuntu) gid=1000(ubuntu) groups=1000(ubuntu),4(adm),20(dialout),24(cdrom),25(floppy),27(sudo),29(audio),30(dip),44(video),46(plugdev),117(netdev),118(lxd)
ubuntu@lus2526-tcp1:~$ echo > /lustre/scratch125/$(uname -n) ; echo > /lustre/scratch126/$(uname -n)
ubuntu@lus2526-tcp1:~$ ls -l /lustre/scratch12*/*
-rw-rw-r-- 1 ubuntu ubuntu 1 Jul 28 07:45 /lustre/scratch125/lus2526-tcp1
-rw-rw-r-- 1 ubuntu ubuntu 1 Jul 28 07:45 /lustre/scratch126/lus2526-tcp1

The external default client sees:

jb23@gen3-os0000011:~$ id jb23
uid=12296(jb23) gid=1105(team94) groups=1105(team94),15283(sag-secure-fileshare),1415(ssg-confluence),1400(www-pagesmith),15016(isg-dcim),4999(jb23test),15141(sag-mfa),1533(ssg-isg),1490(docker),706(ssg),15404(dir-admins),15456(IDS),15264(sag-mso365-apps)
jb23@gen3-os0000011:~$ find  /lustre/scratch12[56]/admin/team94/jb23/tcp* -type f -ls
144115339507007490      4 -rw-rw-r--   1 99       acedbdoc        1 Jul 28 08:45 /lustre/scratch125/admin/team94/jb23/tcp1/lus2526-tcp1
144115406599094274      4 -rw-rw-r--   1 99       acedbdoc        1 Jul 28 08:45 /lustre/scratch126/admin/team94/jb23/tcp1/lus2526-tcp1
jb23@gen3-os0000011:~$ ls -l /lustre/scratch12*/admin/team94/jb23/tcp1/lus2526-tcp1
-rw-rw-r-- 1 99 acedbdoc 1 Jul 28 08:45 /lustre/scratch125/admin/team94/jb23/tcp1/lus2526-tcp1
-rw-rw-r-- 1 99 acedbdoc 1 Jul 28 08:45 /lustre/scratch126/admin/team94/jb23/tcp1/lus2526-tcp1
jb23@gen3-os0000011:~$ getent group acedbdoc
acedbdoc:*:99:image
jb23@gen3-os0000011:~$ ls -l .bashrc
-rw-r--r-- 1 jb23 team94 1130 May 26 12:47 .bashrc 

 

The nodemap is configured as follows:

[root@lus25-mds1 ~]# cat /sys/fs/lustre/version 
2.12.6_ddn66
[root@lus25-mds1 lustre_casm16]# for i in *
> do
> echo $i $(cat $i )
> done
admin_nodemap 0
audit_mode 1
deny_unknown 0
exports [ { nid: 10.177.127.35@tcp1, uuid: 720c5dc0-efbf-4d55-9340-8a2bde26d039 }, ]
fileset /admin/team94/jb23/tcp1
id 1
idmap [ { idtype: uid, client_id: 1000, fs_id: 12296 }, { idtype: gid, client_id: 1000, fs_id: 1105 } ]
map_mode all
ranges [ { id: 1, start_nid: 10.177.126.0@tcp1, end_nid: 10.177.127.255@tcp1 } ]
sepol
squash_gid 1105
squash_projid  
squash_uid 12296
trusted_nodemap 0
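
For reference, a minimal sketch of the lctl commands that would typically produce a nodemap matching the values shown above, run on the MGS. The nodemap name lustre_casm16 is only inferred from the shell prompt, and the exact sequence is illustrative rather than a record of what was actually run:

lctl nodemap_add lustre_casm16
# NID range covering 10.177.126.0@tcp1 - 10.177.127.255@tcp1
lctl nodemap_add_range --name lustre_casm16 --range '10.177.[126-127].[0-255]@tcp1'
# map client uid/gid 1000 to filesystem uid 12296 / gid 1105
lctl nodemap_add_idmap --name lustre_casm16 --idtype uid --idmap 1000:12296
lctl nodemap_add_idmap --name lustre_casm16 --idtype gid --idmap 1000:1105
lctl nodemap_modify --name lustre_casm16 --property squash_uid --value 12296
lctl nodemap_modify --name lustre_casm16 --property squash_gid --value 1105
lctl nodemap_modify --name lustre_casm16 --property admin --value 0
lctl nodemap_modify --name lustre_casm16 --property trusted --value 0
# subtree restriction, persistent across the filesystem
lctl set_param -P nodemap.lustre_casm16.fileset=/admin/team94/jb23/tcp1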


 Comments   
Comment by James Beal [ 28/Jul/22 ]

root@gen3-os0000011:~# ls -l  /lustre/scratch125/admin/team94/
total 4
drwxr-xr-x 18 99 acedbdoc 4096 Jul 21 14:39 jb23
root@gen3-os0000011:~# ls -l  /lustre/scratch125/admin/team94
total 4
drwxr-xr-x 18 99 acedbdoc 4096 Jul 21 14:39 jb23
root@gen3-os0000011:~# ls -l  /lustre/scratch125/admin
total 4
drwxr-xr-x 3 99 acedbdoc 4096 Jul 19 15:07 team94
root@gen3-os0000011:~# ls -l  /lustre/scratch125
total 4
drwxr-xr-x 3 99 acedbdoc 4096 Jul 19 15:07 admin
It looked like all the files/directories have had their owner/group changed.

root@gen3-os0000011:~# ls -ld /lustre/scratch125/admin/team94/jb23/lfs_test
drwxrwxrwx 2 99 acedbdoc 4096 Jul 27 15:30 /lustre/scratch125/admin/team94/jb23/lfs_test
root@gen3-os0000011:~# chown jb23 /lustre/scratch125/admin/team94/jb23/lfs_test
chown: changing ownership of '/lustre/scratch125/admin/team94/jb23/lfs_test': Operation not permitted
root@gen3-os0000011:~# 
root@gen3-os0000011:~# lctl list_nids
172.27.71.121@tcp

The default nodemap appears to have been changed. This may be unrelated; however, it is not something I have done deliberately. Another system we are commissioning has the same issue.

[root@lus25-mds1 ost-survey]# lctl nodemap_test_nid 172.27.71.121@tcp
default
[root@lus25-mds1 exports]# lctl get_param -R 'nodemap.default' 
nodemap.default.admin_nodemap=0
nodemap.default.audit_mode=1
nodemap.default.exports=
[
 { nid: 10.177.161.188@tcp5, uuid: 12ebc53b-9e45-427d-8e30-5a2439569728 }, { nid: 172.27.71.121@tcp, uuid: a64511ef-8ced-4702-892e-50a74c631d98 }, { nid: 10.160.40.12@tcp, uuid: lus25-MDT0000-lwp-OST0008_UUID }, { nid: 10.160.40.12@tcp, uuid: lus25-MDT0000-lwp-OST0009_UUID }, { nid: 10.160.40.13@tcp, uuid: lus25-MDT0000-lwp-OST000a_UUID }, { nid: 10.160.40.10@tcp, uuid: lus25-MDT0000-lwp-OST0005_UUID }, { nid: 10.160.40.11@tcp, uuid: lus25-MDT0000-lwp-OST0006_UUID }, { nid: 10.160.40.9@tcp, uuid: lus25-MDT0000-lwp-OST0002_UUID }, { nid: 10.160.40.8@tcp, uuid: lus25-MDT0000-lwp-OST0001_UUID }, { nid: 10.160.40.9@tcp, uuid: lus25-MDT0000-lwp-OST0003_UUID }, { nid: 10.160.40.8@tcp, uuid: lus25-MDT0000-lwp-OST0000_UUID }, { nid: 10.160.40.5@tcp, uuid: lus25-MDT0000-lwp-MDT0001_UUID }, { nid: 10.160.40.5@tcp, uuid: lus25-MDT0001-mdtlov_UUID }, { nid: 0@lo, uuid: lus25-MDT0000-lwp-MDT0000_UUID }, { nid: 10.160.40.6@tcp, uuid: lus25-MDT0000-lwp-MDT0002_UUID }, { nid: 10.160.40.6@tcp, uuid: lus25-MDT0002-mdtlov_UUID }, { nid: 10.160.40.7@tcp, uuid: lus25-MDT0000-lwp-MDT0003_UUID }, { nid: 10.160.40.7@tcp, uuid: lus25-MDT0003-mdtlov_UUID }, { nid: 10.160.40.13@tcp, uuid: lus25-MDT0000-lwp-OST000b_UUID }, { nid: 10.160.40.10@tcp, uuid: lus25-MDT0000-lwp-OST0004_UUID }, { nid: 10.160.40.11@tcp, uuid: lus25-MDT0000-lwp-OST0007_UUID },
]
nodemap.default.fileset=
nodemap.default.id=0
nodemap.default.squash_gid=99
nodemap.default.squash_projid=99
nodemap.default.squash_uid=99
nodemap.default.trusted_nodemap=0   
Comment by James Beal [ 28/Jul/22 ]

 

I have attached the full nodemap information from lus25. For comparison, here is another system we have:

 

[root@lus24-mds1 ~]# lctl get_param -R 'nodemap.default'
nodemap.default.admin_nodemap=0
nodemap.default.audit_mode=1
nodemap.default.exports=[
]
nodemap.default.fileset=/null
nodemap.default.id=0
nodemap.default.squash_gid=65534
nodemap.default.squash_uid=65534
nodemap.default.trusted_nodemap=0
 

 

I am wondering if the issue is not that writes are being poorly translated via nodemaps, but that the default nodemap has been damaged somehow; all files then appear to be owned by user 99 because I am accessing via the default nodemap, which has uid squashing turned on.

Comment by Sebastien Buisson [ 28/Jul/22 ]

Indeed, for lus25 you have the admin and trusted properties set to 0, so all accesses from nodes considered part of this 'default' nodemap will be squashed to the ID 99 that you defined.
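
If it helps, here is a minimal sketch (run on the MGS) of how one could confirm which nodemap a client NID falls into and inspect or relax the properties that drive the squashing. Whether the default nodemap should be trusted is a site policy decision, so the last two commands are only an illustration:

lctl nodemap_test_nid 172.27.71.121@tcp   # which nodemap does this client NID resolve to?
lctl get_param -R 'nodemap.default'       # inspect admin/trusted/squash settings
# only if unmapped clients should see real IDs / unsquashed root:
lctl nodemap_modify --name default --property trusted --value 1
lctl nodemap_modify --name default --property admin --value 1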

Comment by Sebastien Buisson [ 28/Jul/22 ]

I think the first thing to do would be to clean up the situation regarding LU-14797, so that we work from a sane base.

Comment by James Beal [ 28/Jul/22 ]

Do all clients, including those using the default nodemap, need those patches?

Are the patches in community release 2.15.0 ?

Comment by Sebastien Buisson [ 28/Jul/22 ]

Do all clients, including those using the default nodemap, need those patches?

Yes, no matter which nodemap they are in, they can hit the problem.

Are the patches in community release 2.15.0 ?

The patches are included in 2.15.0 (as well as EXA 5.2.5 and EXA 6.1).

Comment by James Beal [ 28/Jul/22 ]

Comparing lus24, which works how we would expect with no mapping, with lus25: I can see the squashed uid and gid are different, but the admin and trusted settings are the same?

[root@lus24-mds1 ~]# lctl nodemap_test_nid 10.10.10.1@tcp0
Native
[root@lus25-mds1 exports]# lctl nodemap_test_nid 10.10.10.1@tcp
default

I think this is an important difference.

Comment by Sebastien Buisson [ 28/Jul/22 ]

I can see no mapping for node 10.10.10.1@tcp in your definitions of lus25. Is that intended to have a different behavior than with lus24, which maps it to the 'Native' nodemap?

Comment by James Beal [ 28/Jul/22 ]

What I am saying is that the two systems are different, and the behaviour that is useful is the one lus24 has.

 

(How do we say all@tcp0 is Native?)
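
One way to do that, sketched below, is a nodemap whose NID range covers the whole @tcp network and which has both admin and trusted set; the name 'Native' mirrors the lus24 configuration and the range expression is illustrative:

lctl nodemap_add Native
lctl nodemap_add_range --name Native --range '[0-255].[0-255].[0-255].[0-255]@tcp'
lctl nodemap_modify --name Native --property admin --value 1
lctl nodemap_modify --name Native --property trusted --value 1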

Comment by Sebastien Buisson [ 02/Aug/22 ]

I think the first thing to do would be to clean up the situation regarding LU-14797, so that we work from a sane base.

  • The first problem is that your server-side Lustre version 2.12.6_ddn66 is missing an important fix to LU-14797. This is explained in LU-15661. The fix is included in EXA 5.2.5 (rpm tag 2.12.8-ddn6), and it needs to be applied on both the server and client sides;
  • on your custom 2.14 clients, you need to apply the 3 patches attached to this ticket, in this order: LU-14797-sec-add-projid-to-nodemap_b2_14_CLIENTONLY.patch, LU-14797-nodemap-map-project-id_b2_14_CLIENTONLY.patch, LU-15661-nodemap-fix-map-mode-value-for-both_b2_14_CLIENTONLY.patch. Please note these patches are for the 2.14 client side only; the server part has been expunged from them.

The client part is actually optional for getting LU-14797 fixed. This is nodemap related, and the nodemap is not exported to the client. So you can stick with your custom 2.14 client, the essential part being to have the 2.12.8-ddn6 rpms installed on your servers.
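
For completeness, a minimal sketch of how the three attached client-only patches could be applied to a 2.14 client source tree before rebuilding, assuming they have been downloaded into the top of that tree (the directory name lustre-release is illustrative):

cd lustre-release    # 2.14 client source tree (illustrative path)
patch -p1 < LU-14797-sec-add-projid-to-nodemap_b2_14_CLIENTONLY.patch
patch -p1 < LU-14797-nodemap-map-project-id_b2_14_CLIENTONLY.patch
patch -p1 < LU-15661-nodemap-fix-map-mode-value-for-both_b2_14_CLIENTONLY.patch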

Comment by Sebastien Buisson [ 02/Aug/22 ]

Do all clients, including those using the default nodemap, need those patches?

Yes, no matter which nodemap they are in, they can hit the problem.

What I meant is that LU-14797 is a server-side problem, so no matter which nodemap clients belong to (including default), they will be hit by the problem as long as the server side is not fixed.

Comment by James Beal [ 26/Sep/23 ]

I have just run into this again on a newly reinstalled system...

 

Server is

[root@lus22-mds1 secure-lustre]# cat /sys/fs/lustre/version 
2.12.9_ddn8

Client is

ubuntu@lus2526-tcp15:~$ cat /sys/fs/lustre/version 
2.15.3

I see

[root@lus22-mds1 secure-lustre]#  lctl nodemap_test_nid 10.10.10.1@tcp
default
Comment by Peter Jones [ 28/Sep/23 ]

James

This is puzzling. I understand that you've opened a support ticket to track this. Sébastien will need to review the logs supplied there and get back to you.

Peter

Comment by James Beal [ 29/Sep/23 ]

Support pointed out:

"He has spotted one issue with the nodemap configuration though: there is no privileged nodemap, as explained in the Lustre Operations Manual:
https://doc.lustre.org/lustre_manual.xhtml#idm140715165297696

For proper operations, the Lustre file system requires to have a privileged group that
covers all Lustre server nodes. So the very first step when working with nodemaps is to
create such a group with both properties admin and trusted set. It is recommended to
give this group an explicit label such as “TrustedSystems” or some identifier that makes
the association clear."

Hopefully next time we make that mistake this comment will remind me.

[root@lus22-mds1 secure-lustre]# lctl nodemap_add Native
[root@lus22-mds1 secure-lustre]# lctl nodemap_add_range --name Native --range "[0-255].[0-255].[0-255].[0-255]@tcp"
[root@lus22-mds1 secure-lustre]# lctl nodemap_modify --name Native --property admin --value 1
[root@lus22-mds1 secure-lustre]# lctl nodemap_modify --name Native --property trusted --value 1
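
A quick way to confirm the new nodemap is picked up is to test a NID against it (the NID below is the one queried earlier and should now resolve differently):

lctl nodemap_test_nid 10.10.10.1@tcp   # should now report Native rather than default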

Thank you and sorry for the confusion.
