Lustre / LU-14358

interop: sanity-pcc and sanity-flr tests fail with ‘cannot open volatile file’ on the MDS

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0
    • 2.13.0 servers and master clients with Lustre version >= 2.13.55.16
    • 3

    Description

      A variety of sanity-pcc tests fail with different error messages, but all of them show the following error in the MDS console log:

      [67028.845779] LustreError: 3916:0:(mdt_open.c:1613:mdt_reint_open()) lustre-MDT0000: cannot open volatile file [0x2000766a8:0x4:0x0], orphan file will be left in PENDING directory until next reboot, rc = -2
      

      and the action on that file does not complete.
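When scanning many console logs for this message, it can help to pull out the FID and the return code mechanically. The following is only a minimal sketch: the regular expression is written against the single console line quoted above, the function name `parse_volatile_error` is hypothetical, and the decoding relies on the usual kernel convention that a negative `rc` is a negated errno value (so `rc = -2` corresponds to `ENOENT`).

```python
import errno
import re

# Pattern written against the console line quoted in this ticket (an assumption:
# other Lustre versions may format the message differently).
VOLATILE_RE = re.compile(
    r"\((?P<src>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"(?P<target>[\w-]+): cannot open volatile file "
    r"\[(?P<fid>0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+)\].*rc = (?P<rc>-?\d+)"
)


def parse_volatile_error(line):
    """Return target, FID, rc and errno name for a matching line, else None."""
    m = VOLATILE_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "target": m.group("target"),
        "fid": m.group("fid"),
        "rc": rc,
        # Kernel return codes are negated errno values.
        "errno_name": errno.errorcode.get(-rc, "unknown"),
    }


sample = (
    "[67028.845779] LustreError: 3916:0:(mdt_open.c:1613:mdt_reint_open()) "
    "lustre-MDT0000: cannot open volatile file [0x2000766a8:0x4:0x0], orphan file "
    "will be left in PENDING directory until next reboot, rc = -2"
)
print(parse_volatile_error(sample))
```

Running the same function over each MDS console log would give a quick list of the volatile-file FIDs involved in a session.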

      This issue has only been seen in interop testing with 2.13.0 servers and master clients at Lustre version >= 2.13.55.16. The error message and the failures appear to have started on 08 AUG 2020, with failures in sanity-pcc and sanity-flr in the test session https://testing.whamcloud.com/test_sessions/7175020e-210c-436a-a2a6-be91c6c12aad
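To triage test sessions against the version combination described above, a small helper can flag which server/client pairs fall into the reported range. This is a sketch under assumptions: the helper names are hypothetical, version strings are assumed to be plain dotted integers, and the `2.13.55.16` bound is only the lowest client version observed in this ticket, not a confirmed regression point.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.13.57.53' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_reported_interop_pair(server, client):
    """True for a 2.13.0 server paired with a client at or above 2.13.55.16.

    Tuples compare element by element, so (2, 13, 57) sorts above (2, 13, 55, 16).
    """
    return (
        parse_version(server) == (2, 13, 0)
        and parse_version(client) >= (2, 13, 55, 16)
    )


# Server/client pairs; the first two are taken from the examples in this ticket.
for server, client in [("2.13.0", "2.13.57.53"), ("2.13.0", "2.13.57"), ("2.13.0", "2.13.0")]:
    print(server, client, is_reported_interop_pair(server, client))
```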

      A few examples of this failure are:
      2021-01-16 Lustre server 2.13.0 and Lustre client 2.13.57.53 - https://testing.whamcloud.com/test_sets/c1c698e2-e40e-4035-a3bb-de8acfacc5b5
      sanity-pcc test_1a fails with 'request on 0x2000766a8:0x2:0x0 is not SUCCEED on mds1'
      sanity-pcc test_1e fails with 'failed to attach file /mnt/lustre/d1e.sanity-pcc/f1e.sanity-pcc'
      sanity-pcc test_3a fails with 'failed to attach file /mnt/lustre/d3a.sanity-pcc/f3a.sanity-pcc'

      2021-01-07 Lustre server 2.13.0 and Lustre client 2.13.57.44 - https://testing.whamcloud.com/test_sets/c445c494-4d76-4185-9554-f5eb131d5b03
      2021-12-25 Lustre server 2.13.0 and Lustre client 2.13.57.36 - https://testing.whamcloud.com/test_sets/e26503ed-449e-4070-84e3-8af9d129fad9
      2021-12-10 Lustre server 2.13.0 and Lustre client 2.13.57 - https://testing.whamcloud.com/test_sets/9e2b1a2a-1d6a-4f56-b375-cd177c058eb0
      sanity-pfl test 6 fails with 'Migrate(v1 -> composite) /mnt/lustre/d6.sanity-pfl/f6.sanity-pfl failed'

      2021-12-25 Lustre server 2.13.0 and Lustre client 2.13.57.12 - https://testing.whamcloud.com/test_sets/3f774e46-eb0f-4b4b-90e5-a70a48ae9c56
      sanity-pcc test_1c fails with 'request on 0x200076e71:0x2:0x0 is not SUCCEED on mds1'
      sanity-pcc test 7b fails with 'multiop mmap write failed'
      sanity-pcc test 12 fails with 'request on 0x200079d51:0x14:0x0 is not SUCCEED on mds1'
      sanity-pcc test 19 fails with 'Failed to attach /mnt/lustre/f19.sanity-pcc'

      We also see this error message occasionally in ost-pools test 28, but in those cases the test passes:
      2021-12-25 Lustre server 2.13.0 and Lustre client 2.13.57.36 - https://testing.whamcloud.com/test_sets/fe713439-9aa7-4a71-beb0-b1b715fca596
      2021-12-10 Lustre server 2.13.0 and Lustre client 2.13.57 - https://testing.whamcloud.com/test_sets/f48d3beb-c6d3-4451-8dc1-dcbe7dfe0b14

          Activity

            laisiyao Lai Siyao added a comment -

            I reproduced it on a local system, but I'm not familiar with the PCC code and test cases; Qian will look into it.

            pjones Peter Jones added a comment -

            Lai

            Could you please comment on this one?

            Thanks

            Peter

            jhammond John Hammond added a comment -

            https://testing.whamcloud.com/test_sessions/7175020e-210c-436a-a2a6-be91c6c12aad

            Earlier in the same session I see sanity 185 passing and sanity-hsm passes.

            This needs to be reproduced with trace enabled on the MDT. Unfortunately, the MDT debug logs for sanity-pcc test_1c are useless due to 829055 messages of the form:

            00000004:00020000:0.0:1596889762.316083:0:25563:0:(mdd_orphans.c:329:mdd_orphan_destroy()) lustre-MDD0000: could not delete orphan [0x1d6:0xf096f8eb:0x0]: rc = -2
            00000004:00080000:0.0:1596889762.316093:0:25563:0:(mdd_orphans.c:376:mdd_orphan_key_test_and_delete()) Found orphan [0x1d6:0xf096f8eb:0x0], delete it
            00000004:00020000:0.0:1596889762.316098:0:25563:0:(mdd_orphans.c:329:mdd_orphan_destroy()) lustre-MDD0000: could not delete orphan [0x1d6:0xf096f8eb:0x0]: rc = -2
            

            I wonder if this is related.
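One way to make a flooded debug log readable offline is to count messages by the function that emitted them and then drop the noisy ones. The sketch below assumes every debug line carries a `(file:line:function())` tag as in the lines quoted above; the helper names `count_by_function` and `drop_functions` are hypothetical, and this only post-processes a saved log rather than changing what the MDT records.

```python
import re
from collections import Counter

# Matches the "(file.c:123:function())" tag seen in the quoted debug lines.
LOC_RE = re.compile(r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)")


def count_by_function(lines):
    """Count debug messages per emitting function to find what floods the log."""
    counts = Counter()
    for line in lines:
        m = LOC_RE.search(line)
        if m:
            counts[m.group("func")] += 1
    return counts


def drop_functions(lines, noisy):
    """Yield only lines that were not emitted by one of the noisy functions."""
    for line in lines:
        m = LOC_RE.search(line)
        if m and m.group("func") in noisy:
            continue
        yield line


# Two of the debug lines quoted in the comment above.
sample = [
    "00000004:00080000:0.0:1596889762.316093:0:25563:0:(mdd_orphans.c:376:mdd_orphan_key_test_and_delete()) Found orphan [0x1d6:0xf096f8eb:0x0], delete it",
    "00000004:00020000:0.0:1596889762.316098:0:25563:0:(mdd_orphans.c:329:mdd_orphan_destroy()) lustre-MDD0000: could not delete orphan [0x1d6:0xf096f8eb:0x0]: rc = -2",
]
print(count_by_function(sample).most_common())
print(list(drop_functions(sample, {"mdd_orphan_destroy", "mdd_orphan_key_test_and_delete"})))
```

Filtering by function rather than by message text keeps the filter stable even when the FID in the message changes.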


            jamesanunez James Nunez (Inactive) added a comment -

            The question was asked whether we see this error when using 2.13.0 for both servers and clients. In patch https://review.whamcloud.com/#/c/41301/, we tried to reproduce this error with 2.13.0 clients and servers but, so far, we have not been able to reproduce it.

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41301
            Subject: LU-14358 test: check for error 2.13 server/client
            Project: fs/lustre-release
            Branch: b2_13
            Current Patch Set: 1
            Commit: 784779d06ed7f2b23f0a56281dd9fdc25309b239

            People

              Assignee: qian_wc Qian Yingjin
              Reporter: jamesanunez James Nunez (Inactive)