[LU-14202] sanity-sec test 16 fails with 'unable to write quota test file' Created: 08/Dec/20  Updated: 14/Dec/20  Resolved: 14/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Sebastien Buisson
Resolution: Duplicate Votes: 0
Labels: rhel8.3
Environment:

RHEL 8.3 clients/servers


Issue Links:
Duplicate
duplicates LU-14205 RHEL8.3: sanity test 398c fails with ... Resolved
Related
is related to LU-14200 ost-pools test 23b fails with 'dd did... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-sec test_16 fails for RHEL 8.3 client/server testing. We’ve only seen this test fail once with this error: https://testing.whamcloud.com/test_sets/1c1e2670-df76-498e-a3a9-d74b490d63a9

Looking at the suite_log for this failure, we see several “pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2” messages:

…
pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
CMD: trevis-8vm8 runas -u60005 -g60005 -G60005 rm /mnt/lustre/d16.sanity-sec/f16.sanity-sec
trevis-8vm8: running as uid/gid/euid/egid 60005/60005/60005/60005, groups: 60005
trevis-8vm8:  [rm] [/mnt/lustre/d16.sanity-sec/f16.sanity-sec]
CMD: trevis-24vm6 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: trevis-8vm8 runas -u60005 -g60005 -G60005 lfs quota -q /mnt/lustre
trevis-8vm8: running as uid/gid/euid/egid 60005/60005/60005/60005, groups: 60005
trevis-8vm8:  [lfs] [quota] [-q] [/mnt/lustre]
pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
CMD: trevis-8vm7.trevis.whamcloud.com chmod 000 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm7.trevis.whamcloud.com chmod 003 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm7.trevis.whamcloud.com chmod 300 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm7.trevis.whamcloud.com chmod 303 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm7.trevis.whamcloud.com chmod 777 /mnt/lustre/d16.sanity-sec
CMD: trevis-8vm8 runas -u60006 -g60006 -G60006 lfs quota -q /mnt/lustre
trevis-8vm8: running as uid/gid/euid/egid 60006/60006/60006/60006, groups: 60006
trevis-8vm8:  [lfs] [quota] [-q] [/mnt/lustre]
pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
 sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6257:error()
  = /usr/lib64/lustre/tests/sanity-sec.sh:1290:do_fops_quota_test()
  = /usr/lib64/lustre/tests/sanity-sec.sh:1483:test_fops()
  = /usr/lib64/lustre/tests/sanity-sec.sh:1567:test_16()

In the client2 (8vm8) console log, we see the following error:

[23001.757304] Lustre: DEBUG MARKER: runas -u60006 -g60006 -G60006 dd if=/dev/zero of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec oflag=sync bs=1M count=1
[23002.132686] LustreError: 8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
[23002.680119] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file 


 Comments   
Comment by Peter Jones [ 09/Dec/20 ]

Sebastien

Could you please advise on this one? Will this be addressed by your unlanded patches in this area?

Peter

Comment by Gerrit Updater [ 10/Dec/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40934
Subject: LU-14202 tests: try to reproduce
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e9d63ad0ed558f578ce868e65b2250b26568bfa4

Comment by Sebastien Buisson [ 10/Dec/20 ]

This is not an issue I was aware of, and it is not going to be addressed by any pending patch I have.

Although the issue is uncovered during sanity-sec, I am not sure it is a security related issue. sanity-sec test_16 just plays with nodemaps, and checks that quotas set are correctly enforced. In the case reported by James, neither SHARED_KEY nor SELinux are being used.
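For context, the failing step can be replayed by hand with the same commands the test framework issues (copied from the logs in the description; the quota limits themselves are configured earlier by the test, so this is only a sketch of the failing sequence, and it requires a Lustre client mounted at /mnt/lustre):

```
# Run as the mapped test user, write a small file with synchronous I/O,
# then query quotas; test_16 fails when this dd does not complete.
runas -u60006 -g60006 -G60006 dd if=/dev/zero \
    of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec oflag=sync bs=1M count=1
runas -u60006 -g60006 -G60006 lfs quota -q /mnt/lustre
```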

The error finally returned by the test comes from a failed attempt to write to a file with dd, as user uid 60006. The relevant messages on client side are:

[23001.757304] Lustre: DEBUG MARKER: runas -u60006 -g60006 -G60006 dd if=/dev/zero of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec oflag=sync bs=1M count=1
[23002.132686] LustreError: 8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
[23002.680119] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file 

and

00000008:00020000:0.0:1607377928.061849:0:8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
00000080:00200000:0.0:1607377928.098285:0:382201:0:(vvp_io.c:1299:vvp_io_write_start()) f16.sanity-sec: write position mismatch: ki_pos 1048576 vs. pos 0, written 1048576, commit 0: rc = -71
00000080:00200000:0.0:1607377928.098314:0:382201:0:(file.c:1639:ll_file_io_generic()) f16.sanity-sec: 2 io complete with rc: -71, result: 0, restart: 0
00000080:00200000:0.0:1607377928.098315:0:382201:0:(file.c:1695:ll_file_io_generic()) iot: 2, result: 0
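Note that rc = -71 in these client traces is -EPROTO ("Protocol error") on Linux, consistent with the client rejecting the unexpected reply size from the server. A quick way to confirm the errno mapping (no Lustre needed; assumes a Linux host, where errno 71 is EPROTO):

```shell
# Map errno 71 to its symbolic name and message (Linux: EPROTO).
python3 -c 'import errno, os; print(errno.errorcode[71]); print(os.strerror(71))'
```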

Unfortunately, on OSS side, the debug level does not leave us with many details:

00000100:00100000:0.0:1607377928.044135:0:602125:0:(service.c:2299:ptlrpc_server_handle_request()) Handling RPC req@0000000066b395ba pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_002:66c97887-330e-4bd1-930b-29b4c28d8385+6:8532:x1685434418579008:12345-10.9.4.88@tcp:4:dd.60006
00000100:00100000:0.0:1607377928.061461:0:602125:0:(service.c:2348:ptlrpc_server_handle_request()) Handled RPC req@0000000066b395ba pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_002:66c97887-330e-4bd1-930b-29b4c28d8385+7:8532:x1685434418579008:12345-10.9.4.88@tcp:4:dd.60006 Request processed in 17325us (17368us total) trans 0 rc 1/1

At least we can see that the 'unexpected positive size 1' reported by the client is in fact returned by the server.

As noticed by James, even before the error there are suspicious ssh problems in the test log, in the form of “pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2”. I have pushed patch #40934 to add more traces and try to reproduce the issue, in the hope that the debug level will be higher (but maybe the default used in Maloo testing is not enough in this case).
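If the default debug mask used in autotest is indeed too low, it can be raised on the OSS before the run along these lines (a sketch only; the exact mask and buffer size to use depend on what the reproducer needs, and these commands must run on a Lustre server):

```
# Enlarge the kernel debug buffer and enable RPC/VFS tracing
lctl set_param debug_mb=256
lctl set_param debug=+rpctrace
lctl set_param debug=+vfstrace
# ... run the test, then dump the accumulated debug log
lctl dk /tmp/lustre-debug.log
```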

Comment by James Nunez (Inactive) [ 10/Dec/20 ]

This may be the same as or related to LU-14200 since we see similar errors on the client.

Comment by James Nunez (Inactive) [ 10/Dec/20 ]

Similar issue seen on sanity-sec test 21 at https://testing.whamcloud.com/test_sets/9f7fdf4a-6ec8-4cd2-9a15-631ece42d85b.

Comment by Sebastien Buisson [ 11/Dec/20 ]

OK, so I think we can definitely take the nodemap feature out of the equation.

Could we mark this ticket as a duplicate of LU-14200, as it already references multiple tests like ost-pools, sanity-quota and sanity-flr?

Thanks,
Sebastien.

Comment by Peter Jones [ 13/Dec/20 ]

Sébastien

LU-14200 has in turn been marked a duplicate of LU-14205 so should we mark this one as a duplicate of that?

Peter

Comment by Sebastien Buisson [ 14/Dec/20 ]

Yes, I think we should mark this one as dup of LU-14205, as it is exactly the same error. Thanks.

Comment by Peter Jones [ 14/Dec/20 ]

thanks Sébastien!

Generated at Sat Feb 10 03:07:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.