[LU-10040] nodemap and quota issues (ineffective GID mapping) - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.11.0, Lustre 2.10.2
Affects Version/s: Lustre 2.10.0, Lustre 2.10.1
Labels:
- patch
Environment:

Hide
client: lustre-client-2.10.0-1.el7.x86_64, lustre-2.10.1_RC1_srcc01-1.el7.centos.x86_64 (2.10.1-RC1 + patch from ~~LU-9929~~)

Show
client: lustre-client-2.10.0-1.el7.x86_64, lustre-2.10.1_RC1_srcc01-1.el7.centos.x86_64 (2.10.1-RC1 + patch from LU-9929 )

Severity:
2
Rank (Obsolete):
9223372036854775807

Description

We're using the nodemap feature with map_mode=gid_only in production and we are seeing more and more issues with GID mapping, which seems to default to squash_gid instead of being properly mapped. The nodemap hasn't changed for these groups, we just add new groups from time to time.

Example, configuration for mapping 'sherlock' on MGS:

[root@oak-md1-s1 sherlock]# pwd
/proc/fs/lustre/nodemap/sherlock

[root@oak-md1-s1 sherlock]# cat ranges 
[
 { id: 6, start_nid: 0.0.0.0@o2ib4, end_nid: 255.255.255.255@o2ib4 },
 { id: 5, start_nid: 0.0.0.0@o2ib3, end_nid: 255.255.255.255@o2ib3 }
]

[root@oak-md1-s1 sherlock]# cat idmap 
[
 { idtype: gid, client_id: 3525, fs_id: 3741 } { idtype: gid, client_id: 6401, fs_id: 3752 } { idtype: gid, client_id: 99001, fs_id: 3159 } { idtype: gid, client_id: 10525, fs_id: 3351 } { idtype: gid, client_id: 11886, fs_id: 3593 } { idtype: gid, client_id: 12193, fs_id: 3636 } { idtype: gid, client_id: 13103, fs_id: 3208 } { idtype: gid, client_id: 17079, fs_id: 3700 } { idtype: gid, client_id: 19437, fs_id: 3618 } { idtype: gid, client_id: 22959, fs_id: 3745 } { idtype: gid, client_id: 24369, fs_id: 3526 } { idtype: gid, client_id: 26426, fs_id: 3352 } { idtype: gid, client_id: 29361, fs_id: 3746 } { idtype: gid, client_id: 29433, fs_id: 3479 } { idtype: gid, client_id: 30289, fs_id: 3262 } { idtype: gid, client_id: 32264, fs_id: 3199 } { idtype: gid, client_id: 32774, fs_id: 3623 } { idtype: gid, client_id: 38517, fs_id: 3702 } { idtype: gid, client_id: 40387, fs_id: 3708 } { idtype: gid, client_id: 47235, fs_id: 3674 } { idtype: gid, client_id: 48931, fs_id: 3325 } { idtype: gid, client_id: 50590, fs_id: 3360 } { idtype: gid, client_id: 52892, fs_id: 3377 } { idtype: gid, client_id: 56316, fs_id: 3353 } { idtype: gid, client_id: 56628, fs_id: 3411 } { idtype: gid, client_id: 59943, fs_id: 3372 } { idtype: gid, client_id: 63938, fs_id: 3756 } { idtype: gid, client_id: 100533, fs_id: 3281 } { idtype: gid, client_id: 244300, fs_id: 3617 } { idtype: gid, client_id: 254778, fs_id: 3362 } { idtype: gid, client_id: 267829, fs_id: 3748 } { idtype: gid, client_id: 270331, fs_id: 3690 } { idtype: gid, client_id: 305454, fs_id: 3371 } { idtype: gid, client_id: 308753, fs_id: 3367 }

[root@oak-md1-s1 sherlock]# cat squash_gid 
99
[root@oak-md1-s1 sherlock]# cat map_mode 
gid_only

[root@oak-md1-s1 sherlock]# cat admin_nodemap 
0
[root@oak-md1-s1 sherlock]# cat deny_unknown 
1
[root@oak-md1-s1 sherlock]# cat trusted_nodemap 
0

Issue with group: GID 3593 (mapped to GID 11886 on sherlock)

lfs quota, not mapped (using canonical GID 3593):

[root@oak-rbh01 ~]# lfs quota -g oak_euan /oak
Disk quotas for group oak_euan (gid 3593):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 33255114444  50000000000 50000000000       -  526016  7500000 7500000       -

Broken lfs quota mapped on sherlock (o2ib4):

[root@sh-113-01 ~]# lfs quota -g euan /oak
Disk quotas for grp euan (gid 11886):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 2875412844*      1       1       -      26*      1       1       -
[root@sh-113-01 ~]# lctl list_nids
10.9.113.1@o2ib4

It matches the quota usage for squash_gid:

[root@oak-rbh01 ~]# lfs quota -g 99 /oak
Disk quotas for group 99 (gid 99):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 2875412844*      1       1       -      26*      1       1       -

Please note that GID mapping works OK for most of the groups though:

3199 -> 32264(sherlock)

canonical:
[root@oak-rbh01 ~]# lfs quota -g oak_ruthm /oak
Disk quotas for group oak_ruthm (gid 3199):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 10460005688  20000000000 20000000000       - 1683058  3000000 3000000       -

mapped (sherlock):
[root@sh-113-01 ~]# lfs quota -g ruthm /oak
Disk quotas for grp ruthm (gid 32264):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 10460005688  20000000000 20000000000       - 1683058  3000000 3000000       -

Failing over the MDT resolved a few groups, but not all. Failing the MDT back showed an issue on the exact same original groups having issues (currently 4-5).

While I haven't seen it by myself yet, the issue seems to affect users as a few of them reported erroneous EDQUOT errors. This is why it is quite urgent to figure out what's wrong. Please note that the issue was already there before using the patch from ~~LU-9929~~.

I'm willing to attach some debug logs, but what debug flags should I enable to troubleshoot such a quota+nodemap issue on client and server?

Thanks!
Stephane

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

break_nodemap_rbtree.sh
0.7 kB
07/Oct/17 6:49 AM
oak-md1-s1.glb-grp.txt
11 kB
29/Sep/17 4:44 PM
oak-md1-s2.dk.log
1.25 MB
28/Sep/17 6:33 PM
oak-md1-s2.glb-grp.txt
11 kB
29/Sep/17 4:44 PM
oak-md1-s2.mdt.dk.full.log
53.84 MB
02/Oct/17 5:41 AM
reproducer.log
3 kB
07/Oct/17 6:49 AM
sh-101-59.client.dk.full.log
2.25 MB
02/Oct/17 5:41 AM
sh-113-01.dk.log
547 kB
28/Sep/17 6:33 PM

Issue Links

is related to

LU-10135 nodemap_del_idmap() calls nodemap_idx_idmap_del() while holding rwlock

Closed

nodemap and quota issues (ineffective GID mapping)

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates