
racer test_1: Invalid layout: The component end must be aligned by the stripe size

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.13.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4c310714-c728-11e9-9fc9-52540065bddc

      test_1 failed with the following error:

      layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
      Invalid layout: The component end must be aligned by the stripe size
      

      <<Please provide additional information about the failure here>>

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      racer test_1 - Timeout occurred after 833 mins, last suite running was racer, restarting cluster to continue tests
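
The error message above states the rule that the layout violated: a component's end offset has to be a whole multiple of the stripe size. A minimal sketch of that check follows. The helper names and the `COMP_END_EOF` marker are hypothetical and are not taken from the Lustre sources or from the racer scripts; the sketch also assumes offsets small enough that rounding up cannot overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker for a component that runs to end of file. */
#define COMP_END_EOF UINT64_MAX

/* True when comp_end is EOF or a whole multiple of stripe_size. */
static bool comp_end_is_aligned(uint64_t comp_end, uint64_t stripe_size)
{
	if (stripe_size == 0)
		return false;	/* a zero stripe size cannot align anything */
	if (comp_end == COMP_END_EOF)
		return true;
	return comp_end % stripe_size == 0;
}

/*
 * Round comp_end up to the next multiple of stripe_size.
 * Assumes stripe_size != 0 and that the result fits in 64 bits.
 */
static uint64_t comp_end_round_up(uint64_t comp_end, uint64_t stripe_size)
{
	uint64_t rem = comp_end % stripe_size;

	return rem == 0 ? comp_end : comp_end + (stripe_size - rem);
}
```

A test script that picks component ends at random could apply a rounding step of this kind before building the layout, so that every generated end offset passes the check.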

          Activity

            [LU-12704] racer test_1: Invalid layout: The component end must be aligned by the stripe size
            pjones Peter Jones added a comment -

            Moving to 2.14 until we can understand the crashes in Olegtest


            adilger Andreas Dilger added a comment -

            Mike, Oleg is still hitting crashes with https://review.whamcloud.com/36300 so it can't land as-is. Does it make sense to rebase your patch to be directly on master so that it can land independently, or does it depend on 36300 in order to work properly?

            tappro Mikhail Pershin added a comment -

            Vladimir, do you think https://review.whamcloud.com/#/c/36300 is still needed? I was thinking that we are safe from layout change while the LSM is referenced, and that there is no need to initialize IO in llite.
            vsaveliev Vladimir Saveliev added a comment -

            Mike, wouldn't it be better to have https://review.whamcloud.com/36368 on top of https://review.whamcloud.com/#/c/36300?

            tappro Mikhail Pershin added a comment - - edited

            Vladimir, yes, I think it is what we need to add

            P.S. I've updated patch with that code


            vsaveliev Vladimir Saveliev added a comment -

            It seems that all elements of lov_dispatch[] need all members of struct lov_layout_operations initialized.

            So, how about setting lov_dispatch[LLT_EMPTY].llo_flush and lov_dispatch[LLT_RELEASED].llo_flush to a function like

            static int lov_flush_empty()
            {
               return 0;
            }

            Mike, would that work?
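
The suggestion above can be illustrated with a simplified, self-contained model of a per-layout-type dispatch table. The types and functions below are stand-ins invented for illustration (only the names `LLT_EMPTY`, `LLT_RELEASED` and `llo_flush` come from the comment); they are not the real `struct lov_layout_operations` or `lov_dispatch[]`, whose members and signatures differ. The point of the sketch is only that every table entry gets a callable `llo_flush`, so the dispatcher never jumps through a NULL pointer.

```c
#include <stddef.h>

/* Simplified stand-ins; not the real Lustre definitions. */
enum model_layout_type { LLT_EMPTY, LLT_RELEASED, LLT_COMP, LLT_NR };

struct model_layout_ops {
	int (*llo_flush)(void *obj);
};

/* No-op flush for layout types that have nothing to flush. */
static int model_flush_empty(void *obj)
{
	(void)obj;
	return 0;
}

/* Placeholder for a layout type that does real flush work. */
static int model_flush_composite(void *obj)
{
	(void)obj;
	return 1;	/* pretend one component was flushed */
}

/* Every entry initializes llo_flush, including the "empty" types. */
static const struct model_layout_ops model_dispatch[LLT_NR] = {
	[LLT_EMPTY]    = { .llo_flush = model_flush_empty },
	[LLT_RELEASED] = { .llo_flush = model_flush_empty },
	[LLT_COMP]     = { .llo_flush = model_flush_composite },
};

/* Dispatch by layout type; safe because no entry is left NULL. */
static int model_object_flush(enum model_layout_type type, void *obj)
{
	return model_dispatch[type].llo_flush(obj);
}
```

Had the `LLT_EMPTY` entry been left out of the initializer, its `llo_flush` member would be zero-initialized, and calling through it would be a jump to address zero.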

            vsaveliev Vladimir Saveliev added a comment -

            Could you check if patch above fixes that problem?

            [ 2091.709375] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [ 2091.712428] IP: [<          (null)>]           (null)
            ...
            [ 2091.736582] CPU: 0 PID: 20963 Comm: ldlm_bl_08 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.5.1.el7.x86_64 #1
            ...
            [ 2091.773122] Call Trace:
            [ 2091.775002]  [<ffffffffc0c9a1a2>] ? lov_object_flush+0x22/0x60 [lov]
            [ 2091.777577]  [<ffffffffc093e193>] cl_object_flush+0x63/0x120 [obdclass]
            [ 2091.780136]  [<ffffffffc0e1c408>] ll_lock_cancel_bits+0x9b8/0xc00 [lustre]

            According to the crash dump (from Cray's test system), the BUG happened when lov_object_flush() tried to dispatch to lov_flush_composite(), so the change in lov_flush_composite() will probably not help.
            However, the lov_object already has an LLT_EMPTY layout:

            crash> lov_object.lo_type,lo_lsm 0xffff8f46e1af42e0
              lo_type = LLT_EMPTY
              lo_lsm = 0x0

            That probably happened earlier, when the layout lock was canceled:

            00000080:00010000:0.0:1569576242.662719:0:21009:0:(namei.c:248:ll_lock_cancel_bits()) ### to cancel bits 0x19 ns: lustre-MDT0001-mdc-ffff8f4777b0e000 lock: ffff8f4765861d40/0x7909e24de166dec lrc: 3/0,0 mode: PR/PR res: [0x240000406:0x5278:0x0].0x0 bits 0x19/0x19 rrc: 3 type: IBT flags: 0x429400000000 nid: local remote: 0x96dffd219df32662 expref: -99 pid: 3158 timeout: 0 lvb_type: 0
            00000080:00200000:0.0:1569576242.662727:0:21009:0:(vvp_object.c:140:vvp_conf_set()) [0x240000406:0x5278:0x0]: losing layout lock

            Then, while canceling the DoM lock, the ldlm_bl_08 thread encountered the empty layout:

            00000080:00010000:0.0:1569576242.728877:0:20963:0:(namei.c:248:ll_lock_cancel_bits()) ### to cancel bits 0x40 ns: lustre-MDT0001-mdc-ffff8f4777b0e000 lock: ffff8f47663ee000/0x7909e24de166f19 lrc: 2/0,0 mode: PR/PR res: [0x240000406:0x5278:0x0].0x0 bits 0x48/0x40 rrc: 3 type: IBT flags: 0x460400000000 nid: local remote: 0x96dffd219df32c12 expref: -99 pid: 3288 timeout: 0 lvb_type: 3
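
The "to cancel bits" values in the debug log lines above are inodebits masks. A small decoder makes the sequence easier to read: the first cancel (0x19) includes the layout bit, and the second (0x40) is the DoM bit alone. The bit values in the table are quoted from memory of the `MDS_INODELOCK_*` constants in the Lustre headers and should be treated as an assumption to verify against the tree being debugged; the helper itself is hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct ibit_name {
	unsigned long long bit;
	const char *name;
};

/* Assumed values of the MDS_INODELOCK_* bits; verify against the headers. */
static const struct ibit_name ibit_names[] = {
	{ 0x01, "LOOKUP" }, { 0x02, "UPDATE" }, { 0x04, "OPEN" },
	{ 0x08, "LAYOUT" }, { 0x10, "PERM" },   { 0x20, "XATTR" },
	{ 0x40, "DOM" },
};

/* Write a "A|B|C" style description of bits into buf and return buf. */
static char *ibits_to_str(unsigned long long bits, char *buf, size_t len)
{
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (i = 0; i < sizeof(ibit_names) / sizeof(ibit_names[0]); i++) {
		if (!(bits & ibit_names[i].bit))
			continue;
		/* strncat appends at most n characters plus the terminator */
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, ibit_names[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

Under these assumed values, 0x19 decodes to LOOKUP, LAYOUT and PERM, which matches the "losing layout lock" message that follows it, and the lock in the last log line (bits 0x48) holds LAYOUT and DOM while only DOM is being canceled.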

            tappro Mikhail Pershin added a comment -

            I was fixing the DOM entry check in lov_flush_composite() and thought that taking an LSM reference should protect us from layout change. Could you check if the patch above fixes that problem?
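
The idea of holding a reference on the layout while flushing can be sketched as a get/put pair around the work. This is a single-threaded toy model with invented names, not the Lustre LSM API: real code would need locking or atomic operations around both the pointer read and the counter. It also shows why a reference alone does not cover the case seen in the crash dump, where the object has no layout at all and the getter returns NULL.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy layout descriptor with a plain (non-atomic) reference count. */
struct model_lsm {
	int refcount;
};

struct model_object {
	struct model_lsm *lsm;	/* NULL models an empty layout */
};

/* Take a reference on the object's layout; NULL if there is none. */
static struct model_lsm *model_lsm_get(struct model_object *obj)
{
	struct model_lsm *lsm = obj->lsm;

	if (lsm != NULL)
		lsm->refcount++;
	return lsm;
}

/* Drop a reference and free the layout when the last one goes away. */
static void model_lsm_put(struct model_lsm *lsm)
{
	if (lsm != NULL && --lsm->refcount == 0)
		free(lsm);
}

/* Returns 1 if a layout was present and "flushed", 0 if it was empty. */
static int model_flush(struct model_object *obj)
{
	struct model_lsm *lsm = model_lsm_get(obj);

	if (lsm == NULL)
		return 0;	/* empty layout: nothing to flush */
	/* ... the layout cannot be freed while this reference is held ... */
	model_lsm_put(lsm);
	return 1;
}
```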

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36368
            Subject: LU-12704 lov: take lsm reference in lov_flush_composite
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9c5d50763ac4ce36f48a05e968ad3c84ffcdbe96

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36174/
            Subject: LU-12704 tests: component end must be multiple of stripesize
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1cb7bdb883b0b19d944a4bee8403d0b5898a3998

            gerrit Gerrit Updater added a comment -

            Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/36300
            Subject: LU-12704 llite: init i/o for cl_object_flush
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: aa6e3dbaf3249ca26537b21565882aec42b2aa57

            People

              Assignee: vsaveliev Vladimir Saveliev
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 7
