[LU-14202] sanity-sec test 16 fails with 'unable to write quota test file' Created: 08/Dec/20 Updated: 14/Dec/20 Resolved: 14/Dec/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Sebastien Buisson |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | rhel8.3 |
| Environment: | RHEL 8.3 clients/servers |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity-sec test_16 fails for RHEL 8.3 client/server testing. We've only seen this test fail once with this error: https://testing.whamcloud.com/test_sets/1c1e2670-df76-498e-a3a9-d74b490d63a9 .

Looking at the suite_log for this failure, we see several "pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2" messages:

…
pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
CMD: trevis-8vm8 runas -u60005 -g60005 -G60005 rm /mnt/lustre/d16.sanity-sec/f16.sanity-sec
trevis-8vm8: running as uid/gid/euid/egid 60005/60005/60005/60005, groups: 60005
trevis-8vm8: [rm] [/mnt/lustre/d16.sanity-sec/f16.sanity-sec]
CMD: trevis-24vm6 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-8vm8 runas -u60005 -g60005 -G60005 lfs quota -q /mnt/lustre
trevis-8vm8: running as uid/gid/euid/egid 60005/60005/60005/60005, groups: 60005
trevis-8vm8: [lfs] [quota] [-q] [/mnt/lustre]
pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
CMD: trevis-8vm7.trevis.whamcloud.com chmod 000 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm7.trevis.whamcloud.com chmod 003 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm7.trevis.whamcloud.com chmod 300 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm7.trevis.whamcloud.com chmod 303 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm7.trevis.whamcloud.com chmod 777 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm8 runas -u60006 -g60006 -G60006 lfs quota -q /mnt/lustre
trevis-8vm8: running as uid/gid/euid/egid 60006/60006/60006/60006, groups: 60006
trevis-8vm8: [lfs] [quota] [-q] [/mnt/lustre]
pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file

Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6257:error()
= /usr/lib64/lustre/tests/sanity-sec.sh:1290:do_fops_quota_test()
= /usr/lib64/lustre/tests/sanity-sec.sh:1483:test_fops()
= /usr/lib64/lustre/tests/sanity-sec.sh:1567:test_16()

In the client2 (8vm8) console log, we see the following error:

[23001.757304] Lustre: DEBUG MARKER: runas -u60006 -g60006 -G60006 dd if=/dev/zero of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec oflag=sync bs=1M count=1
[23002.132686] LustreError: 8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
[23002.680119] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file |
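For anyone trying to reproduce this outside autotest, the failing step boils down to the following. This is a sketch based only on the commands visible in the log above; the mount point, directory name, and uid/gid are taken from the log, and runas is the helper shipped with the Lustre test suite:

```shell
# The step that fails: a 1 MB synchronous write as an unprivileged mapped uid.
# uid/gid 60006 and the d16.sanity-sec path come from the failure log above;
# adjust them (and the /mnt/lustre mount point) to match your setup.
runas -u60006 -g60006 -G60006 \
    dd if=/dev/zero of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec \
       oflag=sync bs=1M count=1
```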
| Comments |
| Comment by Peter Jones [ 09/Dec/20 ] |
|
Sebastien

Could you please advise on this one? Will this be addressed by your unlanded patches in this area?

Peter |
| Comment by Gerrit Updater [ 10/Dec/20 ] |
|
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40934 |
| Comment by Sebastien Buisson [ 10/Dec/20 ] |
|
This is not an issue I was aware of, and it is not going to be addressed by any pending patch I have. Although the issue was uncovered during sanity-sec, I am not sure it is a security-related issue. sanity-sec test_16 just plays with nodemaps and checks that the quotas set are correctly enforced. In the case reported by James, neither SHARED_KEY nor SELinux is being used.

The error finally returned by the test comes from a failed attempt to write to a file with dd, as user uid 60006. The relevant messages on the client side are:

[23001.757304] Lustre: DEBUG MARKER: runas -u60006 -g60006 -G60006 dd if=/dev/zero of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec oflag=sync bs=1M count=1
[23002.132686] LustreError: 8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
[23002.680119] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file

and

00000008:00020000:0.0:1607377928.061849:0:8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
00000080:00200000:0.0:1607377928.098285:0:382201:0:(vvp_io.c:1299:vvp_io_write_start()) f16.sanity-sec: write position mismatch: ki_pos 1048576 vs. pos 0, written 1048576, commit 0: rc = -71
00000080:00200000:0.0:1607377928.098314:0:382201:0:(file.c:1639:ll_file_io_generic()) f16.sanity-sec: 2 io complete with rc: -71, result: 0, restart: 0
00000080:00200000:0.0:1607377928.098315:0:382201:0:(file.c:1695:ll_file_io_generic()) iot: 2, result: 0

Unfortunately, on the OSS side, the debug level does not leave us with many details:

00000100:00100000:0.0:1607377928.044135:0:602125:0:(service.c:2299:ptlrpc_server_handle_request()) Handling RPC req@0000000066b395ba pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_002:66c97887-330e-4bd1-930b-29b4c28d8385+6:8532:x1685434418579008:12345-10.9.4.88@tcp:4:dd.60006
00000100:00100000:0.0:1607377928.061461:0:602125:0:(service.c:2348:ptlrpc_server_handle_request()) Handled RPC req@0000000066b395ba pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_002:66c97887-330e-4bd1-930b-29b4c28d8385+7:8532:x1685434418579008:12345-10.9.4.88@tcp:4:dd.60006 Request processed in 17325us (17368us total) trans 0 rc 1/1

At least we can see that the 'unexpected positive size 1' reported by the client is in fact returned by the server. As noticed by James, even before the error there are suspicious ssh problems in the test log, in the form of "pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2". I have pushed patch #40934 to add more traces and try to reproduce the issue, in the hope that the debug level will be higher (but maybe the default used in Maloo testing is not enough in this case). |
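When reproducing this by hand, the server-side detail that is missing here can usually be recovered by widening the debug mask on the OSS before rerunning the test. A minimal sketch using standard lctl commands (run as root on the OSS; the choice of +rpctrace and +quota as the flags likely to help with this particular failure is my assumption, not something stated in the ticket):

```shell
# Show the current debug mask (the autotest default may be narrow).
lctl get_param debug

# Enlarge the in-kernel debug buffer so a full test run fits (size in MB).
lctl set_param debug_mb=256

# Add RPC tracing and quota messages to the debug mask.
lctl set_param debug=+rpctrace
lctl set_param debug=+quota

# Clear old entries, rerun the failing test, then dump the log to a file.
lctl clear
# ... rerun sanity-sec test_16 here ...
lctl debug_kernel /tmp/oss-debug.log
```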
| Comment by James Nunez (Inactive) [ 10/Dec/20 ] |
|
This may be the same as or related to |
| Comment by James Nunez (Inactive) [ 10/Dec/20 ] |
|
Similar issue seen on sanity-sec test 21 at https://testing.whamcloud.com/test_sets/9f7fdf4a-6ec8-4cd2-9a15-631ece42d85b. |
| Comment by Sebastien Buisson [ 11/Dec/20 ] |
|
OK, so I think we can definitely take the nodemap feature out of the equation. Could we mark this ticket as a duplicate of Thanks, |
| Comment by Peter Jones [ 13/Dec/20 ] |
|
Sébastien
Peter |
| Comment by Sebastien Buisson [ 14/Dec/20 ] |
|
Yes, I think we should mark this one as dup of |
| Comment by Peter Jones [ 14/Dec/20 ] |
|
thanks Sébastien! |