[LU-16457] sanity-pcc test_101a: Error: 'could not map uid 500 to root in namespace' Created: 09/Jan/23  Updated: 08/Feb/23  Resolved: 03/Feb/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.1
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Sebastien Buisson
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b93b43b5-a8f2-4ae5-895a-e04966fbb5dd

test_101a failed with the following error:

could not map uid 500 to root in namespace

== sanity-pcc test 101a: Test auto attach in mount namespace (simulated container) ========================================================== 17:56:00 (1672854960)
CMD: trevis-48vm4 cat /proc/sys/user/max_user_namespaces
CMD: trevis-48vm4 echo 10 > /proc/sys/user/max_user_namespaces
creating user namespace for 500
CMD: trevis-48vm4 runas -u 500 -g 500 unshare -Um sleep 600
trevis-48vm4: running as uid/gid/euid/egid 500/500/500/500, groups:
trevis-48vm4: [unshare] [-Um] [sleep] [600]
CMD: trevis-48vm4 pgrep sleep
pdsh@trevis-48vm3: trevis-48vm4: ssh exited with exit code 1
Created NS: child (sleep) pid
CMD: trevis-48vm4 runas -u 500 -g 500 newuidmap 0 500 1
trevis-48vm4: running as uid/gid/euid/egid 500/500/500/500, groups:
trevis-48vm4: [newuidmap] [0] [500] [1]
trevis-48vm4: newuidmap: Could not open proc directory for target 0
pdsh@trevis-48vm3: trevis-48vm4: ssh exited with exit code 1
sanity-pcc test_101a: @@@@@@ FAIL: could not map uid 500 to root in namespace
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6406:error()
= /usr/lib64/lustre/tests/sanity-pcc.sh:1523:test_101a()
= /usr/lib64/lustre/tests/test-framework.sh:6723:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:6770:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:6611:run_test()
= /usr/lib64/lustre/tests/sanity-pcc.sh:1611:main()
Dumping lctl log to /autotest/autotest-1/2023-01-04/lustre-b2_15_full-part-2_47_38_91d6a73b-aa04-4158-9b08-8e31c70d3c23//sanity-pcc.test_101a.*.1672854968.log
CMD: trevis-48vm3.trevis.whamcloud.com,trevis-48vm4,trevis-48vm5,trevis-48vm6 /usr/sbin/lctl dk > /autotest/autotest-1/2023-01-04/lustre-b2_15_full-part-2_47_38_91d6a73b-aa04-4158-9b08-8e31c70d3c23//sanity-pcc.test_101a.debug_log.\$(hostname -s).1672854968.log;
dmesg > /autotest/autotest-1/2023-01-04/lustre-b2_15_full-part-2_47_38_91d6a73b-aa04-4158-9b08-8e31c70d3c23//sanity-pcc.test_101a.dmesg.\$(hostname -s).1672854968.log
CMD: trevis-48vm4 kill -9
trevis-48vm4: kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
pdsh@trevis-48vm3: trevis-48vm4: ssh exited with exit code 2

Note that the same test passed on 2.15.2.RC1 but failed in RC2.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-pcc test_101a - could not map uid 500 to root in namespace



 Comments   
Comment by Andreas Dilger [ 10/Jan/23 ]

Looks like this may be a continuation of DCO-9004?

Comment by Andreas Dilger [ 10/Jan/23 ]

It looks like the original error is "newuidmap: Could not open proc directory for target 0" but I don't know much about what this test is doing.

There was one failure on 2022-12-07 with "execvp fails running newuidmap (2): No such file or directory" (from DCO-9004), then failures with the "target 0" error on 2022-12-19 and 2022-12-21 on master (2/374 runs), and the one reported here on b2_15 (1/56 runs).

Comment by Andreas Dilger [ 10/Jan/23 ]

Sebastien, could you please provide some analysis of what this failure means, how serious its impact is, and the likelihood of hitting it in production? Is it a testing environment issue or a race in the code (during configuration or at runtime), and does it break security or is it just an inconvenience?

Comment by Sebastien Buisson [ 10/Jan/23 ]

newuidmap is a system command not related to Lustre. I do not know much about what sanity-pcc test_101a is trying to do, but as far as I can see it starts by creating a user+mount namespace on the agent node. Then it maps user $RUNAS_ID to root inside the namespace, via the newuidmap command. This is where it fails in the various cases reported above, and it has not even started using PCC or Lustre.

I checked recent test results: every time sanity-pcc test_101a fails with this error, it is because the PID of the sleep process launched inside the namespace cannot be found. This can be seen in the message:

Created NS: child (sleep) pid 

which shows an empty $PID variable. As a consequence, the subsequent newuidmap call is incorrect, because it is missing its first argument, the PID:

trevis-48vm4: [newuidmap] [0] [500] [1]

(this command requires at least 4 args).
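
For reference, a correct invocation with the child PID present would look as follows (a sketch based on the uid values in the log above; $PID stands for whatever pgrep returns for the sleep process):

newuidmap $PID 0 500 1

i.e. in the user namespace owned by process $PID, map uid 0 (root) inside the namespace to uid 500 outside, for a range of 1 uid.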

My advice would be to use a longer sleep in the namespace, and to retry creating the namespace if the PID of sleep cannot be found.
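
For example, a minimal sketch of one variant of that approach (it retries looking up the sleep PID rather than re-creating the namespace; this is not the actual patch). do_node, runas, error and $RUNAS_ID follow the test-framework usage visible in the log above, while $agt_host and the retry bounds are illustrative assumptions:

# retry until the sleep process started inside the namespace is visible
pid=""
for i in $(seq 1 10); do
	pid=$(do_node $agt_host pgrep -u $RUNAS_ID sleep)
	[[ -n "$pid" ]] && break
	sleep 1
done
[[ -n "$pid" ]] || error "could not find sleep process for uid $RUNAS_ID"
# only call newuidmap once the child PID is known, so it gets all 4 arguments
do_node $agt_host runas -u $RUNAS_ID -g $RUNAS_ID newuidmap $pid 0 $RUNAS_ID 1 ||
	error "could not map uid $RUNAS_ID to root in namespace"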

Comment by Andreas Dilger [ 10/Jan/23 ]

Another oddity I just noticed is that the failure cases all take just over 600s, which is the duration of the remote sleep command, while a pass takes about 30-40s (one pass took 130s, but none took longer). This makes me wonder whether the problem is in the remote ssh to the agent node rather than in the "sleep 2" that is waiting for it.

Comment by Gerrit Updater [ 10/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49587
Subject: LU-16457 tests: wait for remote sleep in sanity-pcc/101a
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9927aad023e4bf7447823c34cc344090078af82b

Comment by Gerrit Updater [ 03/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49587/
Subject: LU-16457 tests: wait for remote sleep in sanity-pcc/101a
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4b47c233b308dcfefe77a6a493c01d3b4fc59bbe

Comment by Peter Jones [ 03/Feb/23 ]

Landed for 2.16
