Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: Lustre 2.10.0, Lustre 2.10.1
- Environment: client: lustre-client-2.10.0-1.el7.x86_64, lustre-2.10.1_RC1_srcc01-1.el7.centos.x86_64 (2.10.1-RC1 + patch from LU-9929)
- Severity: 2
Description
We're using the nodemap feature with map_mode=gid_only in production, and we are seeing more and more issues with GID mapping, which seems to fall back to squash_gid instead of being properly mapped. The nodemap configuration hasn't changed for these groups; we just add new groups from time to time.
Example: the nodemap configuration for 'sherlock' on the MGS:
[root@oak-md1-s1 sherlock]# pwd
/proc/fs/lustre/nodemap/sherlock
[root@oak-md1-s1 sherlock]# cat ranges
[
 { id: 6, start_nid: 0.0.0.0@o2ib4, end_nid: 255.255.255.255@o2ib4 },
 { id: 5, start_nid: 0.0.0.0@o2ib3, end_nid: 255.255.255.255@o2ib3 }
]
[root@oak-md1-s1 sherlock]# cat idmap
[
 { idtype: gid, client_id: 3525, fs_id: 3741 }
 { idtype: gid, client_id: 6401, fs_id: 3752 }
 { idtype: gid, client_id: 99001, fs_id: 3159 }
 { idtype: gid, client_id: 10525, fs_id: 3351 }
 { idtype: gid, client_id: 11886, fs_id: 3593 }
 { idtype: gid, client_id: 12193, fs_id: 3636 }
 { idtype: gid, client_id: 13103, fs_id: 3208 }
 { idtype: gid, client_id: 17079, fs_id: 3700 }
 { idtype: gid, client_id: 19437, fs_id: 3618 }
 { idtype: gid, client_id: 22959, fs_id: 3745 }
 { idtype: gid, client_id: 24369, fs_id: 3526 }
 { idtype: gid, client_id: 26426, fs_id: 3352 }
 { idtype: gid, client_id: 29361, fs_id: 3746 }
 { idtype: gid, client_id: 29433, fs_id: 3479 }
 { idtype: gid, client_id: 30289, fs_id: 3262 }
 { idtype: gid, client_id: 32264, fs_id: 3199 }
 { idtype: gid, client_id: 32774, fs_id: 3623 }
 { idtype: gid, client_id: 38517, fs_id: 3702 }
 { idtype: gid, client_id: 40387, fs_id: 3708 }
 { idtype: gid, client_id: 47235, fs_id: 3674 }
 { idtype: gid, client_id: 48931, fs_id: 3325 }
 { idtype: gid, client_id: 50590, fs_id: 3360 }
 { idtype: gid, client_id: 52892, fs_id: 3377 }
 { idtype: gid, client_id: 56316, fs_id: 3353 }
 { idtype: gid, client_id: 56628, fs_id: 3411 }
 { idtype: gid, client_id: 59943, fs_id: 3372 }
 { idtype: gid, client_id: 63938, fs_id: 3756 }
 { idtype: gid, client_id: 100533, fs_id: 3281 }
 { idtype: gid, client_id: 244300, fs_id: 3617 }
 { idtype: gid, client_id: 254778, fs_id: 3362 }
 { idtype: gid, client_id: 267829, fs_id: 3748 }
 { idtype: gid, client_id: 270331, fs_id: 3690 }
 { idtype: gid, client_id: 305454, fs_id: 3371 }
 { idtype: gid, client_id: 308753, fs_id: 3367 }
]
[root@oak-md1-s1 sherlock]# cat squash_gid
99
[root@oak-md1-s1 sherlock]# cat map_mode
gid_only
[root@oak-md1-s1 sherlock]# cat admin_nodemap
0
[root@oak-md1-s1 sherlock]# cat deny_unknown
1
[root@oak-md1-s1 sherlock]# cat trusted_nodemap
0
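For reference, that configuration corresponds roughly to the following lctl nodemap commands (a sketch rather than the exact commands we ran; the NID range expression shown is illustrative, and the accepted value string for map_mode may differ by Lustre version):

# on the MGS, with the nodemap feature already activated (lctl nodemap_activate 1)
lctl nodemap_add sherlock
lctl nodemap_add_range --name sherlock --range '10.9.[0-255].[0-255]@o2ib4'   # illustrative; our ranges cover the whole o2ib3/o2ib4 networks as shown above
lctl nodemap_modify --name sherlock --property map_mode --value gid_only
lctl nodemap_modify --name sherlock --property squash_gid --value 99
lctl nodemap_modify --name sherlock --property deny_unknown --value 1
lctl nodemap_add_idmap --name sherlock --idtype gid --idmap 11886:3593        # client GID 11886 -> filesystem GID 3593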
Issue with group: GID 3593 (mapped to GID 11886 on sherlock)
lfs quota, not mapped (using canonical GID 3593):
[root@oak-rbh01 ~]# lfs quota -g oak_euan /oak
Disk quotas for group oak_euan (gid 3593):
Filesystem kbytes quota limit grace files quota limit grace
/oak 33255114444 50000000000 50000000000 - 526016 7500000 7500000 -
Broken lfs quota, mapped on sherlock (o2ib4):
[root@sh-113-01 ~]# lfs quota -g euan /oak
Disk quotas for grp euan (gid 11886):
Filesystem kbytes quota limit grace files quota limit grace
/oak 2875412844* 1 1 - 26* 1 1 -
[root@sh-113-01 ~]# lctl list_nids
10.9.113.1@o2ib4
It matches the quota usage for the squash_gid (GID 99):
[root@oak-rbh01 ~]# lfs quota -g 99 /oak
Disk quotas for group 99 (gid 99):
Filesystem kbytes quota limit grace files quota limit grace
/oak 2875412844* 1 1 - 26* 1 1 -
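For reference, the server-side view of the mapping can also be checked on the MGS with the nodemap test helpers (a sketch; I'd expect these to report 'sherlock' and '3593' here):

# on the MGS
lctl nodemap_test_nid 10.9.113.1@o2ib4                                # which nodemap does this client NID fall into?
lctl nodemap_test_id --nid 10.9.113.1@o2ib4 --idtype gid --id 11886   # how is client GID 11886 mapped for that NID?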
Please note that GID mapping works fine for most of the groups, though:
3199 -> 32264 (sherlock)
canonical:
[root@oak-rbh01 ~]# lfs quota -g oak_ruthm /oak
Disk quotas for group oak_ruthm (gid 3199):
Filesystem kbytes quota limit grace files quota limit grace
/oak 10460005688 20000000000 20000000000 - 1683058 3000000 3000000 -
mapped (sherlock):
[root@sh-113-01 ~]# lfs quota -g ruthm /oak
Disk quotas for grp ruthm (gid 32264):
Filesystem kbytes quota limit grace files quota limit grace
/oak 10460005688 20000000000 20000000000 - 1683058 3000000 3000000 -
Failing over the MDT resolved the problem for a few groups, but not all of them. Failing the MDT back showed the issue on exactly the same groups as originally (currently 4-5 groups affected).
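To keep track of exactly which groups are affected, the canonical and mapped usage can be compared for every idmap entry, for example with a rough script like this (hostnames, mount point and GID pairs are illustrative; a group whose mapped usage matches the squash_gid usage instead of its canonical usage is broken):

#!/bin/bash
# Sketch: compare block usage for each "client_gid:fs_gid" pair as seen from
# an unmapped node (canonical GID) and from a mapped sherlock client.
CANONICAL=oak-rbh01      # node that sees canonical filesystem GIDs
MAPPED=sh-113-01         # client on o2ib4, mapped by the 'sherlock' nodemap
MNT=/oak
for pair in 11886:3593 32264:3199; do                 # extend with the full idmap list
    client_gid=${pair%%:*}; fs_gid=${pair##*:}
    canon=$(ssh "$CANONICAL" "lfs quota -q -g $fs_gid $MNT" | awk '{print $2; exit}')
    mapped=$(ssh "$MAPPED" "lfs quota -q -g $client_gid $MNT" | awk '{print $2; exit}')
    [ "$canon" != "$mapped" ] && echo "fs GID $fs_gid (client GID $client_gid): canonical=$canon mapped=$mapped <-- MISMATCH"
done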
While I haven't seen it myself yet, the issue seems to affect users, as a few of them have reported erroneous EDQUOT errors. This is why it is quite urgent to figure out what's wrong. Please note that the issue was already present before applying the patch from LU-9929.
I'm willing to attach some debug logs, but what debug flags should I enable to troubleshoot such a quota+nodemap issue on the client and server?
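For reference, here is how I'd capture them once the right mask is known (a sketch; 'quota' and 'sec' are just my guesses at the relevant flags):

# on both the client and the MDS: add the flags, reproduce, then dump the buffer
lctl set_param debug="+quota +sec"
lctl set_param debug_mb=512                     # enlarge the debug buffer if needed
lctl clear
#   ... reproduce: lfs quota -g euan /oak on sh-113-01 ...
lctl dk > /tmp/lustre-debug.$(hostname).log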
Thanks!
Stephane
Attachments
Issue Links
- is related to: LU-10135 nodemap_del_idmap() calls nodemap_idx_idmap_del() while holding rwlock (Closed)