[LU-16457] sanity-pcc test_101a: Error: 'could not map uid 500 to root in namespace' Created: 09/Jan/23 Updated: 08/Feb/23 Resolved: 03/Feb/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.1 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Sebastien Buisson |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b93b43b5-a8f2-4ae5-895a-e04966fbb5dd

test_101a failed with the following error:

could not map uid 500 to root in namespace

== sanity-pcc test 101a: Test auto attach in mount namespace (simulated container) ========================================================== 17:56:00 (1672854960)

Note: the same test passed on 2.15.2.RC1 but failed in RC2.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 10/Jan/23 ] |
|
Looks like this may be a continuation of DCO-9004? |
| Comment by Andreas Dilger [ 10/Jan/23 ] |
|
It looks like the original error is "newuidmap: Could not open proc directory for target 0" but I don't know much about what this test is doing. There was one failure on 2022-12-07 with "execvp fails running newuidmap (2): No such file or directory" from DCO-9004, and then with the "target 0" error on 2022-12-19 and 2022-12-21 on master (2/374 runs) and the one reported here on b2_15 (1/56 runs). |
| Comment by Andreas Dilger [ 10/Jan/23 ] |
|
Sebastien, could you please provide some analysis of what this failure means, how serious its impact is, and how likely it is to be hit in production? Is it a testing environment issue or a race in the code (during configuration or at runtime), and does it break security or is it just an inconvenience? |
| Comment by Sebastien Buisson [ 10/Jan/23 ] |
|
newuidmap is a system command not related to Lustre. I do not know much about what sanity-pcc test_101a is trying to do, but as far as I can see it starts by creating a user+mount namespace on the agent node. It then maps user $RUNAS_ID to root inside the namespace via the newuidmap command. This is where it fails in the various cases reported above, before it has even started using PCC or Lustre. I checked recent test results: every time sanity-pcc test_101a fails with such an error, it is because the PID of the sleep process launched inside the namespace cannot be found. This can be seen in the message: Created NS: child (sleep) pid which shows an empty $PID variable. As a consequence, the subsequent newuidmap call is incorrect, as it is missing its first argument, the PID: trevis-48vm4: [newuidmap] [0] [500] [1] (this command requires at least 4 args). My advice would be to use a longer sleep in the namespace, and to retry creating the namespace if the PID of sleep cannot be found. |
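The advice above could be sketched roughly as follows. This is only an illustration and is not taken from the patch that later landed. The function names is_pid, find_ns_pid and map_uid_to_root are hypothetical, as is the idea of passing the PID lookup as a command. The newuidmap argument order (pid, uid inside the namespace, uid outside, count) follows the failing call quoted above.

```shell
# Hypothetical helper: succeed only if $1 is a non-empty string of digits.
is_pid() {
	case $1 in
	'' | *[!0-9]*) return 1 ;;
	esac
	return 0
}

# Hypothetical helper: run a command that should print the PID of the
# sleep process inside the namespace, retrying while the output is empty
# or not numeric.
# Usage: find_ns_pid <attempts> <delay_seconds> <command...>
find_ns_pid() {
	local attempts delay pid i
	attempts=$1
	delay=$2
	shift 2
	i=0
	while [ "$i" -lt "$attempts" ]; do
		pid=$("$@" 2>/dev/null) || pid=""
		if is_pid "$pid"; then
			echo "$pid"
			return 0
		fi
		i=$((i + 1))
		sleep "$delay"
	done
	return 1
}

# Hypothetical wrapper: refuse to call newuidmap without a valid PID, so
# that the remaining arguments cannot shift into the wrong positions.
# Usage: map_uid_to_root <pid> <uid_outside_namespace>
map_uid_to_root() {
	if ! is_pid "$1"; then
		echo "no namespace PID found, not calling newuidmap" >&2
		return 1
	fi
	newuidmap "$1" 0 "$2" 1
}
```

With helpers like these, the test would fail early with a clear message (or retry the lookup) when the PID is empty, instead of passing three arguments to newuidmap.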
| Comment by Andreas Dilger [ 10/Jan/23 ] |
|
Another oddity I just noticed is that the failure cases all take just over 600s, which is the duration of the remote sleep command, while a pass takes about 30-40s (one pass took 130s, but none took longer). This makes me wonder if the problem is in the remote ssh to the agent node and not the "sleep 2" that is waiting for it? |
| Comment by Gerrit Updater [ 10/Jan/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49587 |
| Comment by Gerrit Updater [ 03/Feb/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49587/ |
| Comment by Peter Jones [ 03/Feb/23 ] |
|
Landed for 2.16 |