LU-14202: sanity-sec test 16 fails with 'unable to write quota test file'

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.14.0
    • Environment: RHEL 8.3 clients/servers
    • Severity: 3

    Description

      sanity-sec test_16 fails in RHEL 8.3 client/server testing. We’ve only seen this test fail once with this error: https://testing.whamcloud.com/test_sets/1c1e2670-df76-498e-a3a9-d74b490d63a9 .

      Looking at the suite_log for this failure, we see several “pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2” messages; a manual way to recheck those exit codes is sketched after the excerpt below.

      …
      pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
      CMD: trevis-8vm8 runas -u60005 -g60005 -G60005 rm /mnt/lustre/d16.sanity-sec/f16.sanity-sec
      trevis-8vm8: running as uid/gid/euid/egid 60005/60005/60005/60005, groups: 60005
      trevis-8vm8:  [rm] [/mnt/lustre/d16.sanity-sec/f16.sanity-sec]
      CMD: trevis-24vm6 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
      CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-24vm6 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-8vm8 runas -u60005 -g60005 -G60005 lfs quota -q /mnt/lustre
      trevis-8vm8: running as uid/gid/euid/egid 60005/60005/60005/60005, groups: 60005
      trevis-8vm8:  [lfs] [quota] [-q] [/mnt/lustre]
      pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
      CMD: trevis-8vm7.trevis.whamcloud.com chmod 000 /mnt/lustre/d16.sanity-sec
      CMD: trevis-8vm7.trevis.whamcloud.com chmod 003 /mnt/lustre/d16.sanity-sec
      CMD: trevis-8vm7.trevis.whamcloud.com chmod 300 /mnt/lustre/d16.sanity-sec
      CMD: trevis-8vm7.trevis.whamcloud.com chmod 303 /mnt/lustre/d16.sanity-sec
      CMD: trevis-8vm7.trevis.whamcloud.com chmod 777 /mnt/lustre/d16.sanity-sec
      CMD: trevis-8vm8 runas -u60006 -g60006 -G60006 lfs quota -q /mnt/lustre
      trevis-8vm8: running as uid/gid/euid/egid 60006/60006/60006/60006, groups: 60006
      trevis-8vm8:  [lfs] [quota] [-q] [/mnt/lustre]
      pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2
       sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6257:error()
        = /usr/lib64/lustre/tests/sanity-sec.sh:1290:do_fops_quota_test()
        = /usr/lib64/lustre/tests/sanity-sec.sh:1483:test_fops()
        = /usr/lib64/lustre/tests/sanity-sec.sh:1567:test_16()
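
      Those pdsh messages simply mean the remote command exited non-zero; a minimal way to recheck one of them by hand (assuming the trevis nodes and the runas test users are still available) would be:

      # re-run one of the remote commands and print its exit status explicitly;
      # pdsh reports "ssh exited with exit code N" when the remote command exits with a non-zero N,
      # so the rc printed here shows whether the command itself failed
      pdsh -w trevis-8vm8 'runas -u60005 -g60005 -G60005 lfs quota -q /mnt/lustre; echo rc=$?'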
      

      In the client2 (trevis-8vm8) console log, we see the following error:

      [23001.757304] Lustre: DEBUG MARKER: runas -u60006 -g60006 -G60006 dd if=/dev/zero of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec oflag=sync bs=1M count=1
      [23002.132686] LustreError: 8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
      [23002.680119] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file 
      

    Attachments

    Issue Links

    Activity

            pjones Peter Jones added a comment -

            thanks Sébastien!


            sebastien Sebastien Buisson added a comment -

            Yes, I think we should mark this one as dup of LU-14205, as it is exactly the same error. Thanks.
            pjones Peter Jones added a comment -

            Sébastien

            LU-14200 has in turn been marked a duplicate of LU-14205 so should we mark this one as a duplicate of that?

            Peter


            sebastien Sebastien Buisson added a comment -

            OK, so I think we can definitely take the nodemap feature out of the equation.

            Could we mark this ticket as a duplicate of LU-14200, as it already references multiple tests like ost-pools, sanity-quota and sanity-flr?

            Thanks,
            Sebastien.
            jamesanunez James Nunez (Inactive) added a comment -

            Similar issue seen on sanity-sec test 21 at https://testing.whamcloud.com/test_sets/9f7fdf4a-6ec8-4cd2-9a15-631ece42d85b .

            jamesanunez James Nunez (Inactive) added a comment -

            This may be the same as or related to LU-14200 since we see similar errors on the client.

            sebastien Sebastien Buisson added a comment -

            This is not an issue I was aware of, and it is not going to be addressed by any pending patch I have.

            Although the issue is uncovered during sanity-sec, I am not sure it is a security related issue. sanity-sec test_16 just plays with nodemaps, and checks that quotas set are correctly enforced. In the case reported by James, neither SHARED_KEY nor SELinux are being used.
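
            For context, the quota part of the test boils down to something like the following (a rough sketch of the kind of check performed, not the actual sanity-sec code; the quota limit shown is illustrative):

            # set a block hard limit for the (mapped) test user, then verify it is enforced
            lfs setquota -u 60006 -B 50M /mnt/lustre
            # a small synchronous write below the limit is expected to succeed;
            # in this failure it errors out long before the limit is reached
            runas -u60006 -g60006 -G60006 dd if=/dev/zero \
                of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec oflag=sync bs=1M count=1 ||
                echo "unable to write quota test file"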

            The error finally returned by the test comes from a failed attempt to write to a file with dd, as user uid 60006. The relevant messages on client side are:

            [23001.757304] Lustre: DEBUG MARKER: runas -u60006 -g60006 -G60006 dd if=/dev/zero of=/mnt/lustre/d16.sanity-sec/f16.sanity-sec oflag=sync bs=1M count=1
            [23002.132686] LustreError: 8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
            [23002.680119] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-sec test_16: @@@@@@ FAIL: unable to write quota test file 
            

            and

            00000008:00020000:0.0:1607377928.061849:0:8532:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0002-osc-ffff8a310b0e7800: unexpected positive size 1
            00000080:00200000:0.0:1607377928.098285:0:382201:0:(vvp_io.c:1299:vvp_io_write_start()) f16.sanity-sec: write position mismatch: ki_pos 1048576 vs. pos 0, written 1048576, commit 0: rc = -71
            00000080:00200000:0.0:1607377928.098314:0:382201:0:(file.c:1639:ll_file_io_generic()) f16.sanity-sec: 2 io complete with rc: -71, result: 0, restart: 0
            00000080:00200000:0.0:1607377928.098315:0:382201:0:(file.c:1695:ll_file_io_generic()) iot: 2, result: 0
            

            Unfortunately, on OSS side, the debug level does not leave us with many details:

            00000100:00100000:0.0:1607377928.044135:0:602125:0:(service.c:2299:ptlrpc_server_handle_request()) Handling RPC req@0000000066b395ba pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_002:66c97887-330e-4bd1-930b-29b4c28d8385+6:8532:x1685434418579008:12345-10.9.4.88@tcp:4:dd.60006
            00000100:00100000:0.0:1607377928.061461:0:602125:0:(service.c:2348:ptlrpc_server_handle_request()) Handled RPC req@0000000066b395ba pname:cluuid+ref:pid:xid:nid:opc:job ll_ost_io00_002:66c97887-330e-4bd1-930b-29b4c28d8385+7:8532:x1685434418579008:12345-10.9.4.88@tcp:4:dd.60006 Request processed in 17325us (17368us total) trans 0 rc 1/1
            

            At least we can see that the 'unexpected positive size 1' reported by the client is in fact returned by the server (the request is handled with "rc 1/1"), and the rc = -71 seen in the client debug log is -EPROTO, consistent with the client treating that positive reply status as a protocol error.

            As noticed by James, even before the error, there are suspicious ssh problems in the form of “pdsh@trevis-8vm7: trevis-8vm8: ssh exited with exit code 2” in the test log. I have pushed patch #40934 to add more traces and try to reproduce the issue, in the hope that the debug level will be higher (but maybe the default used in Maloo testing is not enough in this case).
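
            For reference, a minimal way to widen the debug mask by hand on the client and OSS before rerunning (the flags and buffer size here are illustrative, not necessarily what the patch or the Maloo defaults use):

            # enable all debug flags and enlarge the in-kernel trace buffer
            lctl set_param debug=-1 debug_mb=512
            # ... rerun the failing dd, then dump the accumulated debug log to a file
            lctl dk /tmp/lustre-debug-$(hostname).log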


            gerrit Gerrit Updater added a comment -

            Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40934
            Subject: LU-14202 tests: try to reproduce
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e9d63ad0ed558f578ce868e65b2250b26568bfa4
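
            For anyone wanting to try it locally, the patch set can be fetched with the usual Gerrit ref derived from the change number and patch set above (a sketch, assuming a checkout of fs/lustre-release):

            # fetch patch set 1 of change 40934 from Gerrit and check it out
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/34/40934/1 &&
                git checkout FETCH_HEAD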

            pjones Peter Jones added a comment -

            Sebastien

            Could you please advise on this one? Will this be addressed by your unlanded patches in this area?

            Peter


    People

      Assignee: sebastien Sebastien Buisson
      Reporter: jamesanunez James Nunez (Inactive)
      Votes: 0
      Watchers: 4
