[LU-10040] nodemap and quota issues (ineffective GID mapping) Created: 28/Sep/17  Updated: 15/Mar/19  Resolved: 22/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.10.1
Fix Version/s: Lustre 2.11.0, Lustre 2.10.2

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: patch
Environment:

client: lustre-client-2.10.0-1.el7.x86_64, lustre-2.10.1_RC1_srcc01-1.el7.centos.x86_64 (2.10.1-RC1 + patch from LU-9929)


Attachments: File break_nodemap_rbtree.sh     Text File oak-md1-s1.glb-grp.txt     Text File oak-md1-s2.dk.log     Text File oak-md1-s2.glb-grp.txt     Text File oak-md1-s2.mdt.dk.full.log     Text File reproducer.log     Text File sh-101-59.client.dk.full.log     Text File sh-113-01.dk.log    
Issue Links:
Related
is related to LU-10135 nodemap_del_idmap() calls nodemap_idx... Closed
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

We're using the nodemap feature with map_mode=gid_only in production and we are seeing more and more issues with GID mapping: it seems to fall back to squash_gid instead of being properly mapped. The nodemap configuration hasn't changed for the affected groups; we just add new groups from time to time.

Example: configuration of the 'sherlock' nodemap on the MGS:

[root@oak-md1-s1 sherlock]# pwd
/proc/fs/lustre/nodemap/sherlock

[root@oak-md1-s1 sherlock]# cat ranges 
[
 { id: 6, start_nid: 0.0.0.0@o2ib4, end_nid: 255.255.255.255@o2ib4 },
 { id: 5, start_nid: 0.0.0.0@o2ib3, end_nid: 255.255.255.255@o2ib3 }
]

[root@oak-md1-s1 sherlock]# cat idmap 
[
 { idtype: gid, client_id: 3525, fs_id: 3741 } { idtype: gid, client_id: 6401, fs_id: 3752 } { idtype: gid, client_id: 99001, fs_id: 3159 } { idtype: gid, client_id: 10525, fs_id: 3351 } { idtype: gid, client_id: 11886, fs_id: 3593 } { idtype: gid, client_id: 12193, fs_id: 3636 } { idtype: gid, client_id: 13103, fs_id: 3208 } { idtype: gid, client_id: 17079, fs_id: 3700 } { idtype: gid, client_id: 19437, fs_id: 3618 } { idtype: gid, client_id: 22959, fs_id: 3745 } { idtype: gid, client_id: 24369, fs_id: 3526 } { idtype: gid, client_id: 26426, fs_id: 3352 } { idtype: gid, client_id: 29361, fs_id: 3746 } { idtype: gid, client_id: 29433, fs_id: 3479 } { idtype: gid, client_id: 30289, fs_id: 3262 } { idtype: gid, client_id: 32264, fs_id: 3199 } { idtype: gid, client_id: 32774, fs_id: 3623 } { idtype: gid, client_id: 38517, fs_id: 3702 } { idtype: gid, client_id: 40387, fs_id: 3708 } { idtype: gid, client_id: 47235, fs_id: 3674 } { idtype: gid, client_id: 48931, fs_id: 3325 } { idtype: gid, client_id: 50590, fs_id: 3360 } { idtype: gid, client_id: 52892, fs_id: 3377 } { idtype: gid, client_id: 56316, fs_id: 3353 } { idtype: gid, client_id: 56628, fs_id: 3411 } { idtype: gid, client_id: 59943, fs_id: 3372 } { idtype: gid, client_id: 63938, fs_id: 3756 } { idtype: gid, client_id: 100533, fs_id: 3281 } { idtype: gid, client_id: 244300, fs_id: 3617 } { idtype: gid, client_id: 254778, fs_id: 3362 } { idtype: gid, client_id: 267829, fs_id: 3748 } { idtype: gid, client_id: 270331, fs_id: 3690 } { idtype: gid, client_id: 305454, fs_id: 3371 } { idtype: gid, client_id: 308753, fs_id: 3367 }

[root@oak-md1-s1 sherlock]# cat squash_gid 
99
[root@oak-md1-s1 sherlock]# cat map_mode 
gid_only

[root@oak-md1-s1 sherlock]# cat admin_nodemap 
0
[root@oak-md1-s1 sherlock]# cat deny_unknown 
1
[root@oak-md1-s1 sherlock]# cat trusted_nodemap 
0
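
For reference, a configuration like this is typically created on the MGS with commands along these lines (a sketch; the NID ranges are example subnets and the idmap list is abbreviated to one entry):

# sketch only -- values mirror the 'sherlock' settings shown above
lctl nodemap_add sherlock
lctl nodemap_add_range --name sherlock --range '10.210.[0-255].[0-255]@o2ib3'   # example range
lctl nodemap_add_range --name sherlock --range '10.9.[0-255].[0-255]@o2ib4'     # example range
lctl nodemap_modify --name sherlock --property map_mode --value gid_only
lctl nodemap_modify --name sherlock --property squash_gid --value 99
lctl nodemap_modify --name sherlock --property admin --value 0
lctl nodemap_modify --name sherlock --property trusted --value 0
lctl nodemap_modify --name sherlock --property deny_unknown --value 1
lctl nodemap_add_idmap --name sherlock --idtype gid --idmap 11886:3593   # one per group
lctl nodemap_activate 1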


Issue with group: GID 3593 (mapped to GID 11886 on sherlock)

lfs quota, not mapped (using canonical GID 3593):

[root@oak-rbh01 ~]# lfs quota -g oak_euan /oak
Disk quotas for group oak_euan (gid 3593):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 33255114444  50000000000 50000000000       -  526016  7500000 7500000       -

Broken lfs quota mapped on sherlock (o2ib4):

[root@sh-113-01 ~]# lfs quota -g euan /oak
Disk quotas for grp euan (gid 11886):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 2875412844*      1       1       -      26*      1       1       -
[root@sh-113-01 ~]# lctl list_nids
10.9.113.1@o2ib4

It matches the quota usage for squash_gid:

[root@oak-rbh01 ~]# lfs quota -g 99 /oak
Disk quotas for group 99 (gid 99):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 2875412844*      1       1       -      26*      1       1       -

 

Please note that GID mapping works OK for most of the groups though:

3199 -> 32264 (sherlock)

canonical:
[root@oak-rbh01 ~]# lfs quota -g oak_ruthm /oak
Disk quotas for group oak_ruthm (gid 3199):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 10460005688  20000000000 20000000000       - 1683058  3000000 3000000       -

mapped (sherlock):
[root@sh-113-01 ~]# lfs quota -g ruthm /oak
Disk quotas for grp ruthm (gid 32264):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 10460005688  20000000000 20000000000       - 1683058  3000000 3000000       -


Failing over the MDT resolved the problem for a few groups, but not all. Failing the MDT back showed the issue on exactly the same groups as before (currently 4-5).

While I haven't seen it myself yet, the issue seems to affect users, as a few of them have reported erroneous EDQUOT errors. This is why it is quite urgent to figure out what's wrong. Please note that the issue was already present before we applied the patch from LU-9929.

I'm willing to attach some debug logs, but what debug flags should I enable to troubleshoot such a quota+nodemap issue on client and server?

Thanks!
Stephane



 Comments   
Comment by Emoly Liu [ 28/Sep/17 ]

Stephane,
There is no dedicated debug mask for nodemap, so I think enabling the quota mask should be OK. Is this issue related to LU-9671?
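For example, something like this on the client and the MDS around a single lfs quota run should be enough (a sketch; adjust the debug mask and group name as needed):

lctl set_param debug=+quota      # add the quota flag to the debug mask
lctl clear                       # empty the kernel debug buffer
lfs quota -g <group> /oak        # reproduce the problem (run on the client)
lctl dk /tmp/dk.quota.log        # dump the debug buffer to a file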
Thanks,
Emoly
 

Comment by Stephane Thiell [ 28/Sep/17 ]

Hi Emoly,

Yes, it might be related to LU-9671. I just attached some logs from the MDS and one client. I ran the following on the client: first a group that is incorrectly mapped, then one that is correctly mapped:

 

[root@sh-113-01 ~]# lfs quota -hg euan /oak
Disk quotas for grp euan (gid 11886):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
           /oak  2.678T*     1k      1k       -      26*      1       1       -
[root@sh-113-01 ~]# lfs quota -hg ruthm /oak
Disk quotas for grp ruthm (gid 32264):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
           /oak  9.886T  18.63T  18.63T       - 1683246  3000000 3000000       -


I think the first lfs quota ends up like this on the MDT:

1691 00040000:04000000:0.0:1506621079.799639:0:200838:0:(qmt_handler.c:65:qmt_get()) $$$ fetch settings qmt:oak-QMT0000 pool:0-md id:99 enforced:1 hard:1 soft:1 granted:27 time:1506754418 qunit:1024 edquot:1 may_rel:0 revoke:135062
1692 00040000:04000000:0.0:1506621079.799642:0:200838:0:(qmt_handler.c:65:qmt_get()) $$$ fetch settings qmt:oak-QMT0000 pool:0-dt id:99 enforced:1 hard:1 soft:1 granted:2875370220 time:1506717740 qunit:1024 edquot:1 may_rel:0 revoke:135032


Notice the id:99, which is the squash_gid

The good one seems to be:

3417 00040000:04000000:7.0:1506621082.680316:0:200864:0:(qmt_handler.c:65:qmt_get()) $$$ fetch settings qmt:oak-QMT0000 pool:0-md id:3199 enforced:1 hard:3000000 soft:3000000 granted:1769473 time:0 qunit:262144 edquot:0 may_rel:0 revoke:0
3418 00040000:04000000:7.0:1506621082.680324:0:200864:0:(qmt_handler.c:65:qmt_get()) $$$ fetch settings qmt:oak-QMT0000 pool:0-dt id:3199 enforced:1 hard:20000000000 soft:20000000000 granted:12762252828 time:0 qunit:67108864 edquot:0 may_rel:0 revoke:0

Any idea of what could be wrong there?
Thanks!!
Stephane

Comment by Emoly Liu [ 29/Sep/17 ]

Stephane,
Can you provide the GIDs of "euan" and "ruthm" in your example above? Because I don't know if they are 11886 and 32264, respectively.
And BTW, I just noticed that in the following example the group names are different:

lfs quota, not mapped (using canonical GID 3593):

[root@oak-rbh01 ~]# lfs quota -g oak_euan /oak
Disk quotas for group oak_euan (gid 3593):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 33255114444  50000000000 50000000000       -  526016  7500000 7500000       -

Broken lfs quota mapped on sherlock (o2ib4):

[root@sh-113-01 ~]# lfs quota -g euan /oak
Disk quotas for grp euan (gid 11886):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 2875412844*      1       1       -      26*      1       1       -
[root@sh-113-01 ~]# lctl list_nids
10.9.113.1@o2ib4

Can you show me the GID of "oak_euan" on oak-rbh01 and the GID of "euan" on sh-113-01? I will analyze your logs and see if I can reproduce this issue.

Thanks,
Emoly

 

Comment by Emoly Liu [ 29/Sep/17 ]

And, can you run "getfacl /oak" and post the output here? Thanks.

Comment by Stephane Thiell [ 29/Sep/17 ]

oak_euan GID 3593 maps to euan GID 11886:

[root@oak-rbh01 ~]# getent group oak_euan
oak_euan:*:3593:(...)

[root@sh-113-01 ~]# getent group euan
euan:*:11886:(...)



Similarly, oak_ruthm GID 3199 maps to ruthm GID 32264

getfacl /oak results below:

[root@oak-rbh01 ~]# getfacl /oak
getfacl: Removing leading '/' from absolute path names
# file: oak
# owner: root
# group: root
user::rwx
group::r-x
other::r-x

[root@sh-113-01 ~]# getfacl /oak
getfacl: Removing leading '/' from absolute path names
# file: oak
# owner: root
# group: nobody
user::rwx
group::r-x
other::r-x


Thanks!

Comment by Emoly Liu [ 29/Sep/17 ]

Your nodemap policy { idtype: gid, client_id: 11886, fs_id: 3593 } means that GID=11886 from a client with any IP address will be mapped to the canonical GID=3593 on the server. But oak_euan (3593) is not defined as a client_id in any idmap, so there is no mapping for it. That's why you see the following output. This is correct.

[root@oak-rbh01 ~]# lfs quota -g oak_euan /oak
Disk quotas for group oak_euan (gid 3593):

Comment by Stephane Thiell [ 29/Sep/17 ]

No, my nodemap policy applies to o2ib3 and o2ib4 (sherlock). oak-rbh01 is on o2ib5 and uses the canonical GIDs; sorry for the confusion, I only used it to show the canonical GID. sh-113-01 is the client with the mapped GID in my example.

[root@oak-md1-s2 ~]# cat /proc/fs/lustre/nodemap/sherlock/ranges 
[
 { id: 6, start_nid: 0.0.0.0@o2ib4, end_nid: 255.255.255.255@o2ib4 },
 { id: 5, start_nid: 0.0.0.0@o2ib3, end_nid: 255.255.255.255@o2ib3 }
]


The 'sherlock' policy has

{ idtype: gid, client_id: 11886, fs_id: 3593 }

with ranges o2ib3 and o2ib4, so GID=11886 on sherlock (e.g. sh-113-01, o2ib4) should be mapped to the canonical GID 3593, but it is not...

 

 

Comment by Stephane Thiell [ 29/Sep/17 ]

Actually, I wasn't able to reproduce the 'Disk quota exceeded' errors that a few users from the 'euan' group have reported. That suggests the GID mapping is done correctly when actually using the filesystem from 'sherlock'. What is clearly broken right now is "lfs quota -g euan /oak" from sherlock (under the sherlock nodemap), yet it works for almost all other groups, which is weird. Thanks for your help.

Comment by Emoly Liu [ 29/Sep/17 ]

When you run "lfs quota -g euan /oak" on sh-113-01, as you said, since sh-113-01 is on o2ib4 and euan is GID=11886, it will be mapped to the canonical GID=3593 on the server. But this is only half of the story: there is also an unmapping step when the server returns the GID to the client, where 3593 is unmapped back to 11886.

Comment by Stephane Thiell [ 29/Sep/17 ]

To illustrate what I am trying to say:

Using policy

{ idtype: gid, client_id: 11886, fs_id: 3593 }

On a client with mapped GID 11886:

[euan@sherlock-ln01 login_node ~]$ id
uid=11886(euan) gid=11886(euan) groups=11886(euan),…
[euan@sherlock-ln01 login_node ~]$ cd $OAK
[euan@sherlock-ln01 login_node /oak/stanford/groups/euan]$ touch test
[euan@sherlock-ln01 login_node /oak/stanford/groups/euan]$ ls -l test
-rw-rw----+ 1 euan euan 0 Sep 28 22:14 test
[euan@sherlock-ln01 login_node /oak/stanford/groups/euan]$ ls -ln test
-rw-rw----+ 1 11886 11886 0 Sep 28 22:14 test
[euan@sherlock-ln01 login_node /oak/stanford/groups/euan]$ lfs quota -g euan /oak
Disk quotas for grp euan (gid 11886):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 2875169252*      1       1       -      25*      1       1       -


On a client with canonical GID 3593:

[root@oak-rbh01 ~]# ls -l /oak/stanford/groups/euan/test
-rw-rw----+ 1 euan oak_euan 0 Sep 28 22:17 /oak/stanford/groups/euan/test
[root@oak-rbh01 ~]# ls -ln /oak/stanford/groups/euan/test
-rw-rw----+ 1 11886 3593 0 Sep 28 22:17 /oak/stanford/groups/euan/test

[root@oak-rbh01 ~]# lfs quota -g oak_euan /oak
Disk quotas for group oak_euan (gid 3593):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 33246312992  50000000000 50000000000       -  525598  7500000 7500000       -

Comment by Emoly Liu [ 29/Sep/17 ]

Stephane, never mind. If you can hit this issue next time, can you collect the logs for the whole lfs quota command? Because this time I didn't see any nodemap related information in your mds log file. I need to check if there is something wrong with nodemap mapping or unmapping. Thanks.

Comment by Stephane Thiell [ 29/Sep/17 ]

OK, will do that first thing tomorrow morning, thanks! If you have the debug flags and subsystems I should use, please let me know. Otherwise, I will enable full debug, but the MDS is in production so that might be a bit verbose. Thanks!

Comment by Emoly Liu [ 29/Sep/17 ]

I see your question now: you mean both of these two groups should return the same quota information. I have checked our code and we do have nodemap handling in the quota code. I will keep investigating.

Comment by Emoly Liu [ 29/Sep/17 ]

Stephane,

Could you please run the following command on all the MDS server(s)? If the output is too big to paste here, you can save it to a file and upload it here.

lctl get_param qmt.*.*.glb-grp

BTW, you said you failed over the MDT; could you please check whether that MDS runs the same Lustre version as the others?

Thanks,
Emoly

Comment by Stephane Thiell [ 29/Sep/17 ]

Hi Emoly,

I attached oak-md1-s2.glb-grp.txt (glb-grp output from the primary MDS). Then I failed over the MDT to oak-md1-s1 (the MDS failover partner) and ran the same command; its output is in the attached file oak-md1-s1.glb-grp.txt.

Thanks,
Stephane

Comment by Emoly Liu [ 30/Sep/17 ]

Since GID mapping works OK for most of the groups and this issue can't be reproduced right now, I suspect it is related to the process of assigning or getting the nodemap to/from the given export. I will check.

Comment by Stephane Thiell [ 02/Oct/17 ]

Hi Emoly,

The issue with 'lfs quota' is easily reproducible (for me). The one with the real quota errors isn't, but a few users provided job logs with 'Disk quota exceeded' errors. So far, the users who have reported this error belong to the groups that are broken in "lfs quota", so I think there is a link.

I took new, cleaner logs from the MDT and one mapped client (sh-101-59):

oak-md1-s2.mdt.dk.full.log sh-101-59.client.dk.full.log

with two lfs quota commands:

  • one lfs quota that doesn't work (euan),
  • and the second one that does work (ruthm).
[root@sh-101-59 ~]# lfs quota -gh euan /oak
Disk quotas for grp euan (gid 11886):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
           /oak  2.687T*     1k      1k       -      40*      1       1       -
[root@sh-101-59 ~]# lfs quota -gh ruthm /oak
Disk quotas for grp ruthm (gid 32264):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
           /oak  14.86T  18.63T  18.63T       - 2958342  3000000 3000000       -
[root@sh-101-59 ~]# lctl dk /tmp/dk.full.log


As a reminder, gid 11886 is mapped to 3593 on MDT, and gid 32264 to 3199.

For the working GID 32264->3199, I can find two log entries in the MDT log:

115253 00000001:00000001:1.0:1506921877.591507:0:124387:0:(nodemap_handler.c:672:nodemap_map_id()) Process leaving (rc=3199 : 3199 : c7f)

116197 00000001:00000001:1.0:1506921877.595021:0:124387:0:(nodemap_handler.c:672:nodemap_map_id()) Process leaving (rc=3199 : 3199 : c7f)


Unfortunately, for the first one, if I look for "rc=3593", I cannot find anything :/

Please note that the client was doing nothing when the logs were taken, but the MDT is in production.

Please let me know if I can further help debug this issue.

Thanks!
Stephane

Comment by Stephane Thiell [ 05/Oct/17 ]

I tried to reproduce the issue on two VMs today, one client and one server (llmount.sh), after importing the exact same 'sherlock' idmap config, but I wasn't able to reproduce it: all groups are properly mapped, including 'euan' and the two other broken groups. This is super weird.
Also, users from the group 'euan' are still reporting sporadic 'Disk quota exceeded' errors, apparently on block quota, not inodes. We now have 2.10.1 GA on the clients and 2.10.1 RC1 on the servers (I'll upgrade the servers to 2.10.1 GA soon too).
It is as if this particular group mapping (euan) is ignored for lfs quota, and sometimes for I/O too.
The nodemap code doesn't have many debugging capabilities, which is very unfortunate. Could you help me add pertinent debugging statements so that I can build a special version to retrieve more information about this?
Thanks!
Stephane

Comment by Stephane Thiell [ 07/Oct/17 ]

I just found this useful server-side 'lctl nodemap_test_id' command. At least I know I'm not crazy now.

From the MGS (2.10.1 RC1):

[root@oak-md1-s1 ~]# cat /proc/fs/lustre/nodemap/sherlock/ranges 
[
 { id: 6, start_nid: 0.0.0.0@o2ib4, end_nid: 255.255.255.255@o2ib4 },
 { id: 5, start_nid: 0.0.0.0@o2ib3, end_nid: 255.255.255.255@o2ib3 }
]
[root@oak-md1-s1 ~]# cat /proc/fs/lustre/nodemap/sherlock/idmap 
[
 { idtype: gid, client_id: 3525, fs_id: 3741 } { idtype: gid, client_id: 6401, fs_id: 3752 } { idtype: gid, client_id: 99001, fs_id: 3159 } { idtype: gid, client_id: 10525, fs_id: 3351 } { idtype: gid, client_id: 11886, fs_id: 3593 } { idtype: gid, client_id: 12193, fs_id: 3636 } { idtype: gid, client_id: 13103, fs_id: 3208 } { idtype: gid, client_id: 17079, fs_id: 3700 } { idtype: gid, client_id: 19437, fs_id: 3618 } { idtype: gid, client_id: 22959, fs_id: 3745 } { idtype: gid, client_id: 24369, fs_id: 3526 } { idtype: gid, client_id: 26426, fs_id: 3352 } { idtype: gid, client_id: 29361, fs_id: 3746 } { idtype: gid, client_id: 29433, fs_id: 3479 } { idtype: gid, client_id: 30289, fs_id: 3262 } { idtype: gid, client_id: 32264, fs_id: 3199 } { idtype: gid, client_id: 32774, fs_id: 3623 } { idtype: gid, client_id: 38517, fs_id: 3702 } { idtype: gid, client_id: 40387, fs_id: 3708 } { idtype: gid, client_id: 40837, fs_id: 3768 } { idtype: gid, client_id: 47235, fs_id: 3674 } { idtype: gid, client_id: 48931, fs_id: 3325 } { idtype: gid, client_id: 50590, fs_id: 3360 } { idtype: gid, client_id: 52892, fs_id: 3377 } { idtype: gid, client_id: 56316, fs_id: 3353 } { idtype: gid, client_id: 56628, fs_id: 3411 } { idtype: gid, client_id: 59943, fs_id: 3372 } { idtype: gid, client_id: 63938, fs_id: 3756 } { idtype: gid, client_id: 100533, fs_id: 3281 } { idtype: gid, client_id: 244300, fs_id: 3617 } { idtype: gid, client_id: 254778, fs_id: 3362 } { idtype: gid, client_id: 267829, fs_id: 3748 } { idtype: gid, client_id: 270331, fs_id: 3690 } { idtype: gid, client_id: 305454, fs_id: 3371 } { idtype: gid, client_id: 308753, fs_id: 3367 }
]

These are the two pertinent entries:
{ idtype: gid, client_id: 32264, fs_id: 3199 }
{ idtype: gid, client_id: 11886, fs_id: 3593 }

[root@oak-md1-s1 ~]# lctl nodemap_test_id --nid 10.210.47.253@o2ib3 --idtype gid --id 32264
3199
[root@oak-md1-s1 ~]# lctl nodemap_test_id --nid 10.210.47.253@o2ib3 --idtype gid --id 11886
99
[root@oak-md1-s1 ~]# cat /proc/fs/lustre/nodemap/sherlock/squash_gid 
99
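
A quick way to check every entry at once (a small sketch that parses the idmap proc file and compares each mapping against nodemap_test_id; the NID is just an example client NID within the sherlock ranges):

# flag every client GID whose client->fs mapping no longer matches the idmap file
nid=10.210.47.253@o2ib3
grep -o 'client_id: [0-9]*, fs_id: [0-9]*' /proc/fs/lustre/nodemap/sherlock/idmap |
while read -r _ cid _ fid; do
        cid=${cid%,}
        got=$(lctl nodemap_test_id --nid $nid --idtype gid --id $cid)
        [ "$got" = "$fid" ] || echo "client GID $cid: expected fs GID $fid, got $got"
done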

Comment by Stephane Thiell [ 07/Oct/17 ]

OK, so I found the origin of the problem. It looks like it started when we changed a client GID, which somehow broke the nodemap idmap rbtree (only the client-to-fs GID map, I think; the reverse direction seems to work). I would bet the 'replace' code path of idmap_insert() is broken.

I was able to reproduce the problem on a test VM. It is easy: please run llmount.sh followed by ./break_nodemap_rbtree.sh.

Reproducer logs and script attached. You'll see that client GID 99001 is no longer mapped and can no longer be removed, yet it is still listed in the idmap output.
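
In essence, the sequence is something like this (a sketch, not the attached script verbatim; it assumes a disposable llmount.sh test filesystem, and the range/NID are examples):

# re-adding an idmap for an existing fs GID with a different client GID goes
# through the 'replace' path of idmap_insert() and corrupts the client->fs tree
lctl nodemap_activate 1
lctl nodemap_add test
lctl nodemap_add_range --name test --range '192.168.1.[1-254]@tcp'   # example range
lctl nodemap_add_idmap --name test --idtype gid --idmap 3525:3741
lctl nodemap_add_idmap --name test --idtype gid --idmap 10000:3159
lctl nodemap_add_idmap --name test --idtype gid --idmap 10525:3351
lctl nodemap_add_idmap --name test --idtype gid --idmap 270331:3690
# "change the client GID" for fs GID 3159 without deleting the old idmap first
lctl nodemap_add_idmap --name test --idtype gid --idmap 99001:3159
# 99001 now shows up in nodemap.test.idmap, but the lookup falls back to squash_gid
lctl nodemap_test_id --nid 192.168.1.10@tcp --idtype gid --id 99001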

On the impacted production system, I'm not sure this can be fixed without stopping production and recreating the whole nodemap. Please advise.

Thanks,

Stephane

Comment by Gerrit Updater [ 07/Oct/17 ]

Stephan Thiell (sthiell@stanford.edu) uploaded a new patch: https://review.whamcloud.com/29364
Subject: LU-10040 nodemap: only replace exact same idmap in idmap_insert
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 95d487db29c6a6d9e3289ea865d5dc8ea6ad51a9

Comment by Stephane Thiell [ 07/Oct/17 ]

With a better understanding of the situation, we were able to fix the issue on the production system without downtime. We duplicated the corrupted 'sherlock' idmap config into a new 'sherlock2' nodemap (without the NID ranges at first). After double-checking the config, we then moved the ranges from sherlock to sherlock2; that might have generated some bad GIDs during a short window (a few seconds), but now all the groups are properly mapped. I submitted a patch for review to avoid corrupting the idmap red-black trees in the future. I'll delete the corrupted nodemap config on Monday (it worked on the test system). The priority of this ticket may be lowered.
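
For the record, the recovery was roughly the following (a sketch; the idmap list is abbreviated, the property values mirror the original nodemap, and the range arguments are placeholders for the o2ib3/o2ib4 ranges shown earlier):

# rebuild the idmap config under a fresh nodemap, then move the NID ranges over
lctl nodemap_add sherlock2
lctl nodemap_modify --name sherlock2 --property map_mode --value gid_only
lctl nodemap_modify --name sherlock2 --property squash_gid --value 99
lctl nodemap_modify --name sherlock2 --property deny_unknown --value 1
lctl nodemap_add_idmap --name sherlock2 --idtype gid --idmap 11886:3593
lctl nodemap_add_idmap --name sherlock2 --idtype gid --idmap 32264:3199
# ... one nodemap_add_idmap per group ...
# move the ranges last; clients briefly fall back to the default nodemap here
lctl nodemap_del_range --name sherlock --range '<o2ib4 range>'
lctl nodemap_add_range --name sherlock2 --range '<o2ib4 range>'
lctl nodemap_del_range --name sherlock --range '<o2ib3 range>'
lctl nodemap_add_range --name sherlock2 --range '<o2ib3 range>'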

Stephane

Comment by Emoly Liu [ 09/Oct/17 ]

sthiell,

Thanks for your investigation of this issue and for your patch! I'm just back to work from my vacation. I will review your patch now.

Thanks,
Emoly

Comment by Emoly Liu [ 09/Oct/17 ]

The root cause does come from idmap_insert():

        if (!replace) {
                ...
        } else {
                rb_replace_node(&cur->id_client_to_fs,
                                &idmap->id_client_to_fs,
                                fwd_root);
                rb_replace_node(&cur->id_fs_to_client,
                                &idmap->id_fs_to_client,
                                bck_root);
                idmap_destroy(cur);
        }

When idmap_insert() calls rb_replace_node() to replace the "cur" idmap, only one of id_client_to_fs and id_fs_to_client ends up positioned correctly; the other one does not.
In your "break_nodemap_rbtree.sh" example, {3525:3741, 10000(99001):3159, 10525:3351, 270331:3690}, 3159 was found in id_fs_to_client, but 10000 was replaced with 99001 at the wrong position in id_client_to_fs, i.e. the node kept the wrong left and right children, so when idmap_search() was called later, nothing was found. My debugging information below shows that 10525's right child was not 99001 but 270331.

00000001:00000040:1.0:1507545424.048143:0:18591:0:(nodemap_idmap.c:254:idmap_search()) tree_type=1, id=99001, idmap->id_client=10525
00000001:00000040:1.0:1507545424.048144:0:18591:0:(nodemap_idmap.c:254:idmap_search()) tree_type=1, id=99001, idmap->id_client=270331
00000001:00000001:1.0:1507545424.048144:0:18591:0:(nodemap_idmap.c:264:idmap_search()) Process leaving (rc=0 : 0 : 0)

But when you call "lctl get_param nodemap.sherlock.idmap", it still prints every entry, as follows, because it walks the idmap starting from the leftmost node (rb_first()):

nodemap.sherlock.idmap=
[
 { idtype: gid, client_id: 3525, fs_id: 3741 } { idtype: gid, client_id: 99001, fs_id: 3159 } { idtype: gid, client_id: 10525, fs_id: 3351 } { idtype: gid, client_id: 270331, fs_id: 3690 }
]

I will fix this issue.

Comment by Stephane Thiell [ 10/Oct/17 ]

Hi Emoly,

Thanks! That was my understanding too. Glad to have you back!

When you say "when you call "lctl get_param nodemap.sherlock.idmap", it outputs everything correctly": I noticed that the output is usually sorted by client_id, but when the tree is "corrupted", the output is not sorted (in the case of the reproducer, we get 3525, 99001, 10525, 270331 instead of 3525, 10525, 99001, 270331).

The full idmap output of my current (good) gid mapping looks like that:

[root@oak-md1-s1 ~]# cat /proc/fs/lustre/nodemap/sherlock2/idmap 
[
 { idtype: gid, client_id: 3525, fs_id: 3741 } { idtype: gid, client_id: 6401, fs_id: 3752 } { idtype: gid, client_id: 10525, fs_id: 3351 } { idtype: gid, client_id: 11886, fs_id: 3593 } { idtype: gid, client_id: 12193, fs_id: 3636 } { idtype: gid, client_id: 13103, fs_id: 3208 } { idtype: gid, client_id: 17079, fs_id: 3700 } { idtype: gid, client_id: 19437, fs_id: 3618 } { idtype: gid, client_id: 22959, fs_id: 3745 } { idtype: gid, client_id: 24369, fs_id: 3526 } { idtype: gid, client_id: 26426, fs_id: 3352 } { idtype: gid, client_id: 29361, fs_id: 3746 } { idtype: gid, client_id: 29433, fs_id: 3479 } { idtype: gid, client_id: 30289, fs_id: 3262 } { idtype: gid, client_id: 32264, fs_id: 3199 } { idtype: gid, client_id: 32774, fs_id: 3623 } { idtype: gid, client_id: 38517, fs_id: 3702 } { idtype: gid, client_id: 40387, fs_id: 3708 } { idtype: gid, client_id: 40837, fs_id: 3768 } { idtype: gid, client_id: 47235, fs_id: 3674 } { idtype: gid, client_id: 48931, fs_id: 3325 } { idtype: gid, client_id: 50590, fs_id: 3360 } { idtype: gid, client_id: 52892, fs_id: 3377 } { idtype: gid, client_id: 56316, fs_id: 3353 } { idtype: gid, client_id: 56628, fs_id: 3411 } { idtype: gid, client_id: 59943, fs_id: 3372 } { idtype: gid, client_id: 63938, fs_id: 3756 } { idtype: gid, client_id: 99001, fs_id: 3159 } { idtype: gid, client_id: 100533, fs_id: 3281 } { idtype: gid, client_id: 244300, fs_id: 3617 } { idtype: gid, client_id: 254778, fs_id: 3362 } { idtype: gid, client_id: 267829, fs_id: 3748 } { idtype: gid, client_id: 270331, fs_id: 3690 } { idtype: gid, client_id: 305454, fs_id: 3371 } { idtype: gid, client_id: 308753, fs_id: 3367 }
]

This is very clean now, and users don't seem to have any issues anymore.

But, unfortunately, when I tried to remove the corrupted nodemap 'sherlock' with "lctl nodemap_del sherlock", it was properly removed on the MGS, but I got the following message on all MDS and OSS nodes:

[1197072.957459] Lustre: 123959:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.0.2.51@o2ib5: error processing nodemap log nodemap: rc = -2

Thanks,
Stephane

Comment by Stephane Thiell [ 16/Oct/17 ]

Hi Emoly,

Since we tried to remove the bad nodemap 'sherlock' using:

lctl nodemap_del sherlock

we cannot add or remove any idmaps on the servers anymore. Changes only take effect on the MGS, and then the following errors occur on the servers:
 

# clush -w@mds,@oss 'dmesg | tail -1'
oak-md1-s2: [1784564.587104] Lustre: 123959:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.0.2.51@o2ib5: error processing nodemap log nodemap: rc = -2
oak-io2-s1: [1025664.072965] Lustre: 317684:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.0.2.51@o2ib5: error processing nodemap log nodemap: rc = -2
oak-io2-s2: [1025667.518836] Lustre: 309962:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.0.2.51@o2ib5: error processing nodemap log nodemap: rc = -2
oak-io1-s2: [1465169.609494] Lustre: 149882:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.0.2.51@o2ib5: error processing nodemap log nodemap: rc = -2
oak-io1-s1: [1462808.944808] Lustre: 226436:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.0.2.51@o2ib5: error processing nodemap log nodemap: rc = -2

I think that because of the idmap_search() bug, the servers cannot properly remove the nodemap. I should have left it as is... Do you think your patch could handle this case, fix the trees, and allow the subsequent nodemap logs to be processed? I'll start working on a new build with your patch just in case.

Thanks,
Stephane

Comment by Stephane Thiell [ 16/Oct/17 ]

Also, do you know a way to force-purge the nodemap log on the MGS? I would like to get rid of that 'nodemap_del' entry; I can easily replay all subsequent ones.

Many thanks,
Stephane

Comment by Stephane Thiell [ 16/Oct/17 ]

Hi,
I applied the patch and tried it on the MDS, but unfortunately it is still not able to process the nodemap log. I will need to find a way to purge the nodemap log.

oak-MDT0000:

[  127.492117] Lustre: Lustre: Build Version: 2.10.1_srcc02
[  127.527461] LNet: Using FMR for registration
[  127.553475] LNet: Added LNI 10.0.2.52@o2ib5 [8/256/0/180]
[  190.367048] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
[  191.433340] LustreError: 137-5: oak-MDT0000_UUID: not available for connect from 10.210.45.60@o2ib3 (no target). If you are running an HA pair check that the target is mounted on the other server.
[  191.452861] LustreError: Skipped 3 previous similar messages
[  191.471790] Lustre: 13119:0:(mgc_request.c:1797:mgc_process_recover_nodemap_log()) MGC10.0.2.51@o2ib5: error processing nodemap log nodemap: rc = -2
[  191.523256] Lustre: oak-MDT0000: Not available for connect from 10.210.47.38@o2ib3 (not set up)
[  191.532970] Lustre: Skipped 3 previous similar messages
[  192.015060] Lustre: oak-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[  192.501895] Lustre: oak-MDD0000: changelog on
[  192.549977] Lustre: oak-MDT0000: Will be in recovery for at least 2:30, or until 1212 clients reconnect
[  192.560492] Lustre: oak-MDT0000: Connection restored to cd0e08e0-aa22-f4da-21ed-94f218f886a1 (at 10.210.45.100@o2ib3)
[  192.595309] Lustre: oak-MDT0000: root_squash is set to 99:99
[  192.603004] Lustre: oak-MDT0000: nosquash_nids set to 10.0.2.[1-3]@o2ib5 10.0.2.[51-58]@o2ib5 10.0.2.[101-120]@o2ib5 10.0.2.[221-223]@o2ib5 10.0.2.[226-229]@o2ib5 10.0.2.[232-235]@o2ib5 10.0.2.[240-241]@o2ib5 10.210.47.253@o2ib3 10.9.0.[1-2]@o2ib4
...

Thanks,
Stephane

Comment by Stephane Thiell [ 16/Oct/17 ]

Also, I cannot remount the MGS anymore (2.10.1 + patch gerrit 29364):

[ 1174.919438] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 1174.932548] Lustre: 17247:0:(osd_handler.c:7008:osd_mount()) MGS-osd: device /dev/mapper/md1-rbod1-mgt was upgraded from Lustre-1.x without enabling the dirdata feature. If you do not want to downgrade to Lustre-1.x again, you can enable it via 'tune2fs -O dirdata device'
[ 1175.062057] Lustre: 17247:0:(nodemap_storage.c:914:nodemap_load_entries()) MGS-osd: failed to load nodemap configuration: rc = -2
[ 1175.075067] LustreError: 17247:0:(mgs_fs.c:187:mgs_fs_setup()) MGS: error loading nodemap config file, file must be removed via ldiskfs: rc = -2
[ 1175.089557] LustreError: 17247:0:(mgs_handler.c:1297:mgs_init0()) MGS: MGS filesystem method init failed: rc = -2
[ 1175.145812] LustreError: 17247:0:(obd_config.c:608:class_setup()) setup MGS failed (-2)
[ 1175.154748] LustreError: 17247:0:(obd_mount.c:203:lustre_start_simple()) MGS setup error -2
[ 1175.164081] LustreError: 17247:0:(obd_mount_server.c:135:server_deregister_mount()) MGS not registered
[ 1175.174463] LustreError: 15e-a: Failed to start MGS 'MGS' (-2). Is the 'mgs' module loaded?
[ 1175.282230] Lustre: server umount MGS complete

Comment by Emoly Liu [ 17/Oct/17 ]

Stephane, I saw the same "-2" logs from my server on Oct. 9 when I tried to reproduce this issue. Let me see how to purge the nodemap log.

Comment by Emoly Liu [ 17/Oct/17 ]

Here are some steps to remove the nodemap config log from the MGS. This will remove all nodemap information from the MGS, so before doing that, you'd better save all of the nodemap information with "cp -r /proc/fs/lustre/nodemap $nodemap_dir" or "lctl get_param nodemap.*.* > $nodemap_file".

  1. umount your MGS
  2. mount your MGS with the ldiskfs type: mount -t ldiskfs $your_MGS_device $mountpoint
  3. cd $mountpoint; you will see the file ./CONFIGS/nodemap. I also suggest saving a backup (e.g. to /tmp/nodemap) before removing it.
  4. umount your MGS and remount it with the lustre type (a consolidated sketch follows below)
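
A consolidated sketch of the above (the device name is taken from your dmesg output and the mount point is an example; adjust both to your setup):

lctl get_param 'nodemap.*.*' > /root/nodemap.settings.bak    # save the current nodemap settings
umount /mnt/mgs
mount -t ldiskfs /dev/mapper/md1-rbod1-mgt /mnt/mgs
cp /mnt/mgs/CONFIGS/nodemap /tmp/nodemap                     # keep a backup of the config log
rm /mnt/mgs/CONFIGS/nodemap                                  # or rename it instead of removing it
umount /mnt/mgs
mount -t lustre /dev/mapper/md1-rbod1-mgt /mnt/mgs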

Please let me know if this works for you.

Comment by Stephane Thiell [ 17/Oct/17 ]

Hi Emoly,

Good news. I renamed ./CONFIGS/nodemap to ./CONFIGS/nodemap.corrupted instead of removing it, and it worked! I was then able to mount the MGS and recreate all nodemaps by hand from there. And now I can add new idmaps again and they are properly propagated to the targets. The corrupted 'sherlock' nodemap is no longer visible on the MGS.

After some time, maybe a few minutes (not immediately), the corrupted 'sherlock' nodemap was also automatically removed from all targets (MDTs, OSTs). This is great.

Thanks again! By the way, I am now running 2.10.1 with the patch on the MGS/MDS.

Stephane

Comment by Emoly Liu [ 18/Oct/17 ]

Stephane,

That's great. After the MGS restarts/remounts, the other targets detect that the config log lock has changed and then fetch the config log from the MGS to update their local copies.

Thanks,

Emoly

Comment by Gerrit Updater [ 22/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29364/
Subject: LU-10040 nodemap: add nodemap idmap correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 253ccbd55ffe7fcdc405c9fcc4f72a47578920fe

Comment by Peter Jones [ 22/Nov/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 22/Nov/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30206
Subject: LU-10040 nodemap: add nodemap idmap correctly
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: a4de3c0f0ae3dbb684ba63874fa70e171c219cdf

Comment by Gerrit Updater [ 27/Nov/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30206/
Subject: LU-10040 nodemap: add nodemap idmap correctly
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: e881c665bb60543fd2bbbd2d195ccce99a65f16b
