Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13205

sanity-pfl test 16a fails with “setstripe /mnt/lustre/d16.sanity-pfl/f16.sanity-pfl.copy failed“

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4
    • PPC clients
    • 3
    • 9223372036854775807

    Description

      Looking at a recent failure, https://testing.whamcloud.com/test_sets/52073ba0-4715-11ea-b69a-52540065bddc, sanity-pfl test 16a fails with the following in the client test_log

      == sanity-pfl test 16a: Verify setstripe/getstripe with YAML config file ============================= 05:25:02 (1580793902)
      CMD: trevis-4vm12 dumpe2fs -h /dev/mapper/mds1_flakey 2>&1 |
      		grep -E -q '(ea_inode|large_xattr)'
      1. PFL file
      getstripe --yaml /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl
      setstripe --yaml=/mnt/lustre/d16a.sanity-pfl/template /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl.copy
      Set stripe size 4096 failed: Invalid argument
      lfs setstripe: cannot build layout from YAML file /mnt/lustre/d16a.sanity-pfl/template.
      error: setstripe: can't create composite layout from template file /mnt/lustre/d16a.sanity-pfl/template
       sanity-pfl test_16a: @@@@@@ FAIL: setstripe /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl.copy failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6121:error()
        = /usr/lib64/lustre/tests/test-framework.sh:9710:verify_yaml_layout()
        = /usr/lib64/lustre/tests/sanity-pfl.sh:761:test_16a()
      

      The only hint of a problem in the console logs is for client 2 (vm9) where we see

      [ 1212.789487] LustreError: 1758:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
      

      We’ve seen this same error message in sanity-pfl test 14 failures in LU-13186.

      Patch https://review.whamcloud.com/#/c/28425/ for LU-9846 landed to Lustre 2.12.54 on 01 JUNE 2019 and was not back ported to b2_12. This patch moved test_16 to test_16a and created a new test 16b.

      sanity-pfl test 16a started failing with this error message on 30 JULY 2019 with Lustre 2.12.56.72 and test 16 started failing for b2_12 on 13 AUG 2019 for Lustre 2.12.2.115.

      sanity-pfl test 16a fails only for PPC clients and fails 100% of the time for PPC.

      Logs for recent failures are at
      https://testing.whamcloud.com/test_sets/fd407770-4706-11ea-a1c8-52540065bddc
      https://testing.whamcloud.com/test_sets/07588ab8-2592-11ea-80b4-52540065bddc
      https://testing.whamcloud.com/test_sets/a2a89ace-1fdb-11ea-adca-52540065bddc

      Attachments

        Issue Links

          Activity

            [LU-13205] sanity-pfl test 16a fails with “setstripe /mnt/lustre/d16.sanity-pfl/f16.sanity-pfl.copy failed“

            I've captured a stack trace from the crash. It looks like it is calling DEBUG_REQ() in mdc_close() with a bad message buffer:

            [ 1188.641162] Lustre: DEBUG MARKER: == sanity-pfl test 16b: Verify setstripe/getstripe with YAML config file + overstriping == 23:01:06 (1581548466)
            [ 1189.084215] LustreError: 3150:0:(pack_generic.c:2488:lustre_swab_lov_comp_md_v1()) lov: invalid magic 0x1 for component 3
            [ 1189.087595] LustreError: 3150:0:(pack_generic.c:1199:lustre_msg_get_transno()) incorrect message magic: 00000200
            [ 1189.087680] LustreError: 3150:0:(pack_generic.c:1089:lustre_msg_get_opc()) incorrect message magic: 00000200 (msg:c0000000049cb800)
            [ 1189.087807] ------------[ cut here ]------------
            [ 1189.087904] WARNING: CPU: 1 PID: 3150 at lustre/ptlrpc/pack_generic.c:1090 .lustre_msg_get_opc+0x108/0x330 [ptlrpc]
             CPU: 1 PID: 3150 Comm: lfs Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.9.1.el7.ppc64 #1
             Call Trace:
             .lustre_msg_get_opc+0xf8/0x330 [ptlrpc] (unreliable)
             ._debug_req+0x674/0x830 [ptlrpc]
             .mdc_close+0x568/0x1150 [mdc]
             .lmv_close+0x238/0x580 [lmv]
             .ll_close_inode_openhandle+0xb7c/0x1380 [lustre]
             .ll_md_real_close+0x190/0x330 [lustre]
             .ll_file_release+0x728/0xcf0 [lustre]
             .ll_dir_release+0x15c/0x210 [lustre]
             .__fput+0xe8/0x330
            
            adilger Andreas Dilger added a comment - I've captured a stack trace from the crash. It looks like it is calling DEBUG_REQ() in mdc_close() with a bad message buffer: [ 1188.641162] Lustre: DEBUG MARKER: == sanity-pfl test 16b: Verify setstripe/getstripe with YAML config file + overstriping == 23:01:06 (1581548466) [ 1189.084215] LustreError: 3150:0:(pack_generic.c:2488:lustre_swab_lov_comp_md_v1()) lov: invalid magic 0x1 for component 3 [ 1189.087595] LustreError: 3150:0:(pack_generic.c:1199:lustre_msg_get_transno()) incorrect message magic: 00000200 [ 1189.087680] LustreError: 3150:0:(pack_generic.c:1089:lustre_msg_get_opc()) incorrect message magic: 00000200 (msg:c0000000049cb800) [ 1189.087807] ------------[ cut here ]------------ [ 1189.087904] WARNING: CPU: 1 PID: 3150 at lustre/ptlrpc/pack_generic.c:1090 .lustre_msg_get_opc+0x108/0x330 [ptlrpc] CPU: 1 PID: 3150 Comm: lfs Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7.ppc64 #1 Call Trace: .lustre_msg_get_opc+0xf8/0x330 [ptlrpc] (unreliable) ._debug_req+0x674/0x830 [ptlrpc] .mdc_close+0x568/0x1150 [mdc] .lmv_close+0x238/0x580 [lmv] .ll_close_inode_openhandle+0xb7c/0x1380 [lustre] .ll_md_real_close+0x190/0x330 [lustre] .ll_file_release+0x728/0xcf0 [lustre] .ll_dir_release+0x15c/0x210 [lustre] .__fput+0xe8/0x330

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37494
            Subject: LU-13205 lov: debug lov buffer swabbing crash
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e17fe2322008ce7520f4496932ea8638018a4261

            gerrit Gerrit Updater added a comment - Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37494 Subject: LU-13205 lov: debug lov buffer swabbing crash Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e17fe2322008ce7520f4496932ea8638018a4261

            It seems that "caused" might be a strong word. Prior to the landing of the LU-10100 patch, sanity-pfl just failed every test. After the landing of LU-10100, many of the tests started passing, and test_16a, test_16b, test_17 began crashing, but only because they actually started doing something rather than exiting as soon as the "lfs setstripe" command failed. That means reverting LU-10100 would not improve the situation, as much as revert to a state where 100% of the tests are broken on ppc64 compared to only the PFL tests.

            I'm going to push a debug patch to see if it can identify where the missing swabbing is, which will hopefully allow us to solve the problem. The current issue is that the machine state is so fatally corrupted that the machine reboots without actually printing any stack traces of what is being run.

            adilger Andreas Dilger added a comment - It seems that "caused" might be a strong word. Prior to the landing of the LU-10100 patch, sanity-pfl just failed every test. After the landing of LU-10100 , many of the tests started passing, and test_16a, test_16b, test_17 began crashing, but only because they actually started doing something rather than exiting as soon as the " lfs setstripe " command failed. That means reverting LU-10100 would not improve the situation, as much as revert to a state where 100% of the tests are broken on ppc64 compared to only the PFL tests. I'm going to push a debug patch to see if it can identify where the missing swabbing is, which will hopefully allow us to solve the problem. The current issue is that the machine state is so fatally corrupted that the machine reboots without actually printing any stack traces of what is being run.

            Based on the date of the start of failures and the area affected, it looks like this was very likely caused by the landing of:

            commit 9d17996766e0fa93b1029d2422d45d087edde389
            CommitDate: Sat Jul 27 00:21:20 2019 +0000
            
                LU-10100 llite: swab LOV EA user data
                
                Many sub-tests failed with "Invalid argument" failures
                on PPC client because of the endianness issue.
                
                This patch fixes the issue by adding a common function
                lustre_swab_lov_user_md() to swab the LOV EA user data.
                
                Test-Parameters: clientarch=ppc64 envdefinitions=ONLY=27 testlist=sanity
            
                Reviewed-on: https://review.whamcloud.com/35291
            
            adilger Andreas Dilger added a comment - Based on the date of the start of failures and the area affected, it looks like this was very likely caused by the landing of: commit 9d17996766e0fa93b1029d2422d45d087edde389 CommitDate: Sat Jul 27 00:21:20 2019 +0000 LU-10100 llite: swab LOV EA user data Many sub-tests failed with "Invalid argument" failures on PPC client because of the endianness issue. This patch fixes the issue by adding a common function lustre_swab_lov_user_md() to swab the LOV EA user data. Test-Parameters: clientarch=ppc64 envdefinitions=ONLY=27 testlist=sanity Reviewed-on: https://review.whamcloud.com/35291

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: