Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15551

Interop: Various tests fail with fallocate “Protocol error” errors

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.15.0
    • Lustre 2.15.0
    • 3
    • 9223372036854775807

    Description

      Several tests are failing with the a variety error messages, but they all have the message “do_preallocate: fallocate: Protocol error” . These failures started September 11, 2021.

      sanity-benchmark test_fsx and sanityn test_16a, test_16b, test_16c fail with the following in the suite_log:

      do_preallocate: fallocate: Protocol error
      

      sanity test_150b fail with the following in the suite_log:

      fallocate failed, error Protocol error, mode 0, offset 62914560, len 4194304
      

      sanity-benchmark test_fsx fails with 'fsx failed' starting with 2.14.54.45 servers and 2.14.0 clients; https://testing.whamcloud.com/test_sets/daab9d17-e641-427e-9572-168989a4b23b. The following in the suite_log:

      == sanity-benchmark test fsx: fsx ==================================================================== 12:02:50 (1631361770)
      keep default fallocate mode: 0
      debug=0
      Using: /usr/lib64/lustre/tests/fsx -c 50 -p 1000 -S 5151 -P /tmp -l 5461544 	     -N 100000  /mnt/lustre/f0.fsxfile
      Chance of close/open is 1 in 50
      Seed set to 5151
      fsx: test_fallocate: filesystem does not support fallocate mode 0x3, disabling!: Operation not supported
      fsx: test_fallocate: filesystem does not support fallocate mode 0x10, disabling!: Operation not supported
      truncating to largest ever: 0x3531dc
      fallocate: 118b1 to 469048
      do_preallocate: fallocate: Protocol error
      LOG DUMP (10 total operations):
      1: 1631361772.173266 TRUNCATE UP	from 0x00000 to 0x3531dc
      2: 1631361772.176123 WRITE    0x475dee thru 0x476c9e (0x00eb1 bytes) HOLE
      3: 1631361772.176366 TRUNCATE DOWN	from 0x476c9f to 0x18202b
      4: 1631361772.178621 MAPWRITE 0xd455f thru 0xd6920 (0x023c2 bytes)
      5: 1631361772.246526 SKIPPED (no operation)
      6: 1631361772.246564 READ     0x1b680 thru 0x26b75 (0x0b4f6 bytes)
      7: 1631361772.247881 MAPWRITE 0x22e11f thru 0x2313a0 (0x03282 bytes)
      8: 1631361772.280197 MAPWRITE 0x374883 thru 0x3810ae (0x0c82c bytes)
      9: 1631361772.315333 MAPWRITE 0x4796a2 thru 0x47a8f8 (0x01257 bytes)
      10: 1631361772.346572 FALLOC  	from 0x118b1 to 0x1efe9	(0x0d738 bytes)INTERIOR
      Correct content saved for comparison
      (maybe hexdump "/mnt/lustre/f0.fsxfile" vs "/tmp//f0.fsxfile.fsxgood")
       sanity-benchmark test_fsx: @@@@@@ FAIL: fsx failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/sanity-benchmark.sh:196:test_fsx()
      

      sanityn test_16a, test_16b, test_16c fails with 'fsx failed' starting on same data and same client/server versions;

      == sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 18:58:04 (1631386684)
      CMD: onyx-23vm7 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
      keep default fallocate mode: 0
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.577856 s, 12.7 MB/s
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.46945 s, 15.6 MB/s
      Chance of close/open is 1 in 50
      Seed set to 6688
      fd 0: /mnt/lustre/f16a.sanityn
      fd 1: /mnt/lustre2/f16a.sanityn
      fsx: test_fallocate: filesystem does not support fallocate mode 0x3, disabling!: Operation not supported
      fsx: test_fallocate: filesystem does not support fallocate mode 0x10, disabling!: Operation not supported
      fallocate: 5233c8 to 4b288
      do_preallocate: fallocate: Protocol error
      LOG DUMP (2 total operations):
      1[1]: 1631386688.420122 WRITE    0x561c9c thru 0x56e64f (0x0c9b4 bytes) HOLE
      2[0]: 1631386688.427553 FALLOC  	from 0x5233c8 to 0x52350f	(0x00147 bytes)INTERIOR
      Correct content saved for comparison
      (maybe hexdump "/mnt/lustre/f16a.sanityn" vs "/mnt/lustre/f16a.sanityn.fsxgood")
       sanityn test_16a: @@@@@@ FAIL: fsx failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/sanityn.sh:402:test_16a()
      …
      == sanityn test 16b: 12500 iterations of dual-mount fsx at small size ================================ 18:58:12 (1631386692)
      keep default fallocate mode: 0
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.515146 s, 14.2 MB/s
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.453771 s, 16.2 MB/s
      Chance of close/open is 1 in 50
      Seed set to 6694
      fd 0: /mnt/lustre/f16b.sanityn
      fd 1: /mnt/lustre2/f16b.sanityn
      fsx: test_fallocate: filesystem does not support fallocate mode 0x3, disabling!: Operation not supported
      fsx: test_fallocate: filesystem does not support fallocate mode 0x10, disabling!: Operation not supported
      skipping zero size read
      fallocate: 3a858e to 5e8cc6
      do_preallocate: fallocate: Protocol error
      LOG DUMP (4 total operations):
      1[18446744073708685781]: 1631386695.071343 SKIPPED (no operation)
      2[1]: 1631386695.073361 SKIPPED (no operation)
      3[1]: 1631386695.073642 WRITE    0x98f7df thru 0x991253 (0x01a75 bytes) HOLE
      4[0]: 1631386695.081362 FALLOC  	from 0x3a858e to 0x3aa51d	(0x01f8f bytes)INTERIOR
      Correct content saved for comparison
      (maybe hexdump "/mnt/lustre/f16b.sanityn" vs "/mnt/lustre/f16b.sanityn.fsxgood")
       sanityn test_16b: @@@@@@ FAIL: fsx with tiny write failed. 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/sanityn.sh:431:test_16b()
      …
      == sanityn test 16c: verify data consistency on ldiskfs with cache disabled (b=17397) ================ 18:58:18 (1631386698)
      CMD: onyx-23vm7 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
      keep default fallocate mode: 0
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.632607 s, 11.6 MB/s
      7+0 records in
      7+0 records out
      7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.479825 s, 15.3 MB/s
      CMD: onyx-23vm7 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
      CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      Waiting for local destroys to complete
      CMD: onyx-23vm6 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      pdsh@onyx-23vm4: onyx-23vm6: ssh exited with exit code 2
      CMD: onyx-23vm6 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.read_cache_enable=0 		osd-*.lustre-OST*.read_cache_enable=0 2>&1
      pdsh@onyx-23vm4: onyx-23vm6: ssh exited with exit code 2
      CMD: onyx-23vm6 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=0 		osd-*.lustre-OST*.writethrough_cache_enable=0 2>&1
      pdsh@onyx-23vm4: onyx-23vm6: ssh exited with exit code 2
      Chance of close/open is 1 in 50
      Seed set to 6716
      fd 0: /mnt/lustre/f16c.sanityn
      fd 1: /mnt/lustre2/f16c.sanityn
      fsx: test_fallocate: filesystem does not support fallocate mode 0x3, disabling!: Operation not supported
      fsx: test_fallocate: filesystem does not support fallocate mode 0x10, disabling!: Operation not supported
      fallocate: 4154fd to ffdea255
      do_preallocate: fallocate: Protocol error
      LOG DUMP (5 total operations):
      1[18446744073708595669]: 1631386716.729518 SKIPPED (no operation)
      2[18446744073708595669]: 1631386716.730419 SKIPPED (no operation)
      3[1]: 1631386716.736190 MAPWRITE 0x1f481f thru 0x1ff751 (0x0af33 bytes)
      4[1]: 1631386716.938927 SKIPPED (no operation)
      5[1]: 1631386716.939194 FALLOC  	from 0x4154fd to 0x4246d1	(0x0f1d4 bytes)INTERIOR
      Correct content saved for comparison
      (maybe hexdump "/mnt/lustre/f16c.sanityn" vs "/mnt/lustre/f16c.sanityn.fsxgood")
       sanityn test_16c: @@@@@@ FAIL: fsx failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/sanityn.sh:462:test_16c()
      
      

      sanity test_150b fails with a similar message ever since the test was introduced; https://testing.whamcloud.com/test_sets/9ea27d14-cf74-483f-98b4-46fc0efa0146.

      == sanity test 150b: Verify fallocate (prealloc) functionality ======================================= 07:55:05 (1623225305)
      keep default fallocate mode: 0
      fallocate failed, error Protocol error, mode 0, offset 62914560, len 4194304
       sanity test_150b: @@@@@@ FAIL: fallocate failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/sanity.sh:13441:test_150b()
       

      Attachments

        Issue Links

          Activity

            [LU-15551] Interop: Various tests fail with fallocate “Protocol error” errors
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46516/
            Subject: LU-15551 ofd: Return EOPNOTSUPP instead of EPROTO
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2f496148c31dd435681e595129834cecdaec82dc

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46516/ Subject: LU-15551 ofd: Return EOPNOTSUPP instead of EPROTO Project: fs/lustre-release Branch: master Current Patch Set: Commit: 2f496148c31dd435681e595129834cecdaec82dc

            Hi Andreas,

            Thanks for your comments and analysis. The client is getting -EPROTO from ofd_fallocate_hdl functin (server side) Changed it to return EOPNOTSUPP. The client part looks ok to me. LU-14301 does not need further changes IMO.

            Thanks!

            arshad512 Arshad Hussain added a comment - Hi Andreas, Thanks for your comments and analysis. The client is getting - EPROTO from ofd_fallocate_hdl functin (server side) Changed it to return EOPNOTSUPP. The client part looks ok to me. LU-14301 does not need further changes IMO. Thanks!

            "Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/46516
            Subject: LU-15551 ofd: Return EOPNOTSUPP instead of EPROTO
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 454de1abd0de78d628147259f44c3eaabe6b1414

            gerrit Gerrit Updater added a comment - "Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/46516 Subject: LU-15551 ofd: Return EOPNOTSUPP instead of EPROTO Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 454de1abd0de78d628147259f44c3eaabe6b1414
            adilger Andreas Dilger added a comment - - edited

            I've fixed the URL to https://review.whamcloud.com/40877 "LU-14160 fallocate: Add punch mode to fallocate".

            It probably needs the server to return -EOPNOTSUPP instead of -EPROTO for unsupported fallocate modes. On the client (for interop) it should convert -EPROTO to -EOPNOTSUPP in fallocate. I thought there was already a patch to do this, but I guess it is not in the right place - https://review.whamcloud.com/41148 "LU-14301 llite: return EOPNOTSUPP if fallocate is not supported", but it looks like that patch is converting -ENOTSUPP to -EOPNOTSUPP and should be enhanced to do the same for -EPROTO?

            adilger Andreas Dilger added a comment - - edited I've fixed the URL to https://review.whamcloud.com/40877 " LU-14160 fallocate: Add punch mode to fallocate ". It probably needs the server to return -EOPNOTSUPP instead of -EPROTO for unsupported fallocate modes. On the client (for interop) it should convert -EPROTO to -EOPNOTSUPP in fallocate. I thought there was already a patch to do this, but I guess it is not in the right place - https://review.whamcloud.com/41148 " LU-14301 llite: return EOPNOTSUPP if fallocate is not supported ", but it looks like that patch is converting -ENOTSUPP to -EOPNOTSUPP and should be enhanced to do the same for -EPROTO ?

            >(Specifically https://review.whamcloud.com/#/c/46221/ )

            Hi Patrick, The URL fails to open with "Not found " and/or permisison. Sorry I cannot check this out. I am however, looking into other info in bug report.

            arshad512 Arshad Hussain added a comment - >(Specifically https://review.whamcloud.com/#/c/46221/ ) Hi Patrick, The URL fails to open with "Not found " and/or permisison. Sorry I cannot check this out. I am however, looking into other info in bug report.

            >Can you please advise? Does this seem related to LU-14160?

            Hi Peter, I am checking on this.

            arshad512 Arshad Hussain added a comment - >Can you please advise? Does this seem related to LU-14160 ? Hi Peter, I am checking on this.
            paf0186 Patrick Farrell added a comment - - edited
            paf0186 Patrick Farrell added a comment - - edited (Specifically https://review.whamcloud.com/40877 )
            pjones Peter Jones added a comment -

            Arshad

            Can you please advise? Does this seem related to LU-14160?

            Peter

            pjones Peter Jones added a comment - Arshad Can you please advise? Does this seem related to LU-14160 ? Peter

            People

              arshad512 Arshad Hussain
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: