
ASSERTION( tgd->tgd_tot_granted >= ted->ted_grant ) on OSS

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.8
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, 3.10.0-957.1.3.el7_lustre.x86_64
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      We just hit the following LBUG on an OSS (Fir) running Lustre 2.12. All clients are also running Lustre 2.12.

      [1708550.581820] LustreError: 123124:0:(tgt_grant.c:1079:tgt_grant_discard()) ASSERTION( tgd->tgd_tot_granted >= ted->ted_grant ) failed: fir-OST001b: tot_granted 50041695803 cli d5e4b60f-fe33-b991-7d48-5b8db7e07ab0/ffff926b10975c00 ted_grant -49152
      [1708550.603611] LustreError: 123124:0:(tgt_grant.c:1079:tgt_grant_discard()) LBUG
      [1708550.610923] Pid: 123124, comm: ll_ost00_019 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [1708550.621180] Call Trace:
      [1708550.623814]  [<ffffffffc0aa37cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [1708550.630548]  [<ffffffffc0aa387c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [1708550.636935]  [<ffffffffc0f220bc>] tgt_grant_discard+0x1dc/0x1e0 [ptlrpc]
      [1708550.643892]  [<ffffffffc14c81d4>] ofd_obd_disconnect+0x74/0x220 [ofd]
      [1708550.650541]  [<ffffffffc0e60157>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [1708550.658005]  [<ffffffffc0efeb77>] tgt_disconnect+0x37/0x140 [ptlrpc]
      [1708550.664609]  [<ffffffffc0f0635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [1708550.671734]  [<ffffffffc0eaa92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1708550.679628]  [<ffffffffc0eae25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [1708550.686136]  [<ffffffff8dcc1c31>] kthread+0xd1/0xe0
      [1708550.691224]  [<ffffffff8e374c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1708550.697873]  [<ffffffffffffffff>] 0xffffffffffffffff
      [1708550.703065] Kernel panic - not syncing: LBUG
      [1708550.707509] CPU: 20 PID: 123124 Comm: ll_ost00_019 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
      [1708550.720273] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.6.7 10/29/2018
      [1708550.728015] Call Trace:
      [1708550.730645]  [<ffffffff8e361e41>] dump_stack+0x19/0x1b
      [1708550.735962]  [<ffffffff8e35b550>] panic+0xe8/0x21f
      [1708550.740937]  [<ffffffffc0aa38cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [1708550.747346]  [<ffffffffc0f220bc>] tgt_grant_discard+0x1dc/0x1e0 [ptlrpc]
      [1708550.754230]  [<ffffffffc14c81d4>] ofd_obd_disconnect+0x74/0x220 [ofd]
      [1708550.760880]  [<ffffffffc0e9ed81>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      [1708550.767783]  [<ffffffffc0ec3933>] ? req_capsule_server_pack+0x43/0xf0 [ptlrpc]
      [1708550.775207]  [<ffffffffc0e60157>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [1708550.782634]  [<ffffffffc0efeb77>] tgt_disconnect+0x37/0x140 [ptlrpc]
      [1708550.789194]  [<ffffffffc0f0635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [1708550.796272]  [<ffffffffc0edfa51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [1708550.804022]  [<ffffffffc0aa3bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [1708550.811281]  [<ffffffffc0eaa92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1708550.819142]  [<ffffffffc0ea77b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [1708550.826110]  [<ffffffff8dcd67c2>] ? default_wake_function+0x12/0x20
      [1708550.832548]  [<ffffffff8dccba9b>] ? __wake_up_common+0x5b/0x90
      [1708550.838589]  [<ffffffffc0eae25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [1708550.845068]  [<ffffffffc0ead760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [1708550.852636]  [<ffffffff8dcc1c31>] kthread+0xd1/0xe0
      [1708550.857688]  [<ffffffff8dcc1b60>] ? insert_kthread_work+0x40/0x40
      [1708550.863956]  [<ffffffff8e374c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1708550.870567]  [<ffffffff8dcc1b60>] ? insert_kthread_work+0x40/0x40
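
The assertion message carries the two values being compared, and `ted_grant` is printed as a negative number. The sketch below pulls those values out of the log line and compares them twice: once as signed integers and once after reinterpreting `ted_grant` as an unsigned 64-bit value. It assumes, without confirmation from this ticket, that `tgd_tot_granted` is an unsigned 64-bit counter and `ted_grant` a signed one, so that the C comparison is effectively unsigned. The helper names `parse_grant_assertion` and `as_u64` are made up for illustration.

```python
import re

# Assertion line copied from the console log above (timestamp dropped).
LINE = (
    "LustreError: 123124:0:(tgt_grant.c:1079:tgt_grant_discard()) "
    "ASSERTION( tgd->tgd_tot_granted >= ted->ted_grant ) failed: "
    "fir-OST001b: tot_granted 50041695803 cli "
    "d5e4b60f-fe33-b991-7d48-5b8db7e07ab0/ffff926b10975c00 ted_grant -49152"
)

# Matches the "failed: <target>: tot_granted N cli <uuid>/<ptr> ted_grant N" tail.
PATTERN = re.compile(
    r"failed: (?P<target>\S+): tot_granted (?P<tot>\d+) "
    r"cli (?P<uuid>[0-9a-f-]+)/\S+ ted_grant (?P<grant>-?\d+)"
)


def parse_grant_assertion(line):
    """Return the target, client UUID and both counters, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "target": m.group("target"),
        "client": m.group("uuid"),
        "tot_granted": int(m.group("tot")),
        "ted_grant": int(m.group("grant")),
    }


def as_u64(value):
    """Reinterpret a signed integer as an unsigned 64-bit value (assumed C behaviour)."""
    return value & 0xFFFFFFFFFFFFFFFF


info = parse_grant_assertion(LINE)
# Plain signed comparison: a negative grant is trivially below the total.
signed_ok = info["tot_granted"] >= info["ted_grant"]
# Unsigned comparison: the negative grant wraps to a value near 2**64.
unsigned_ok = info["tot_granted"] >= as_u64(info["ted_grant"])

print(info)
print("signed comparison holds:  ", signed_ok)
print("unsigned comparison holds:", unsigned_ok)
```

Under that assumption, the negative per-client grant would wrap to a value near 2**64 and the check would fail even though the total is large. This would suggest the underlying problem is the client's grant going negative rather than the total itself.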
       

          Activity

            [LU-11939] ASSERTION( tgd->tgd_tot_granted >= ted->ted_grant ) on OSS

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45489/
            Subject: LU-11939 tgt: Do not assert during grant cleanup
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 372c77f0a11573e9f8818751c24735e151aafc74

            gerrit Gerrit Updater added a comment -

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45489
            Subject: LU-11939 tgt: Do not assert during grant cleanup
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 81f17cf04fc7d4d4bd7ab87cfe572b7f59cf81f3

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34215/
            Subject: LU-11939 tgt: Do not assert during grant cleanup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: af2d3ac30eafead6b47c5db20d76433c091d89de

            pjones Peter Jones added a comment -

            Mike confirms that this is a duplicate of LU-12120


            pfarrell Patrick Farrell (Inactive) added a comment -

            Yes, that looks like the right one. Do you agree that should take care of this issue as well?

            tappro Mikhail Pershin added a comment -

            Patrick, do you mean the patch from LU-12120?

            pfarrell Patrick Farrell (Inactive) added a comment -

            tappro:

            Mike,

            Didn't you fix this grant bug in another LU? I can't find it right now...

            pfarrell Patrick Farrell (Inactive) added a comment -

            Nah, we've still got a patch to track under this.
            pjones Peter Jones added a comment -

            So ok to close this one as a duplicate of LU-11919?


            sthiell Stephane Thiell added a comment -

            OK, we never noticed that before (with 2.10 clients). Thanks for your help! I used set_param -P on the MGS of Fir to set max_dirty_mb to 256, and it worked.

            lctl set_param -P osc.*.max_dirty_mb=256

            [root@sh-ln06 ~]# lctl get_param osc.*.max_dirty_mb
            osc.fir-OST0000-osc-ffff9bad01395000.max_dirty_mb=256
            osc.fir-OST0001-osc-ffff9bad01395000.max_dirty_mb=256
            ...
            osc.fir-OST002e-osc-ffff9bad01395000.max_dirty_mb=256
            osc.fir-OST002f-osc-ffff9bad01395000.max_dirty_mb=256
            osc.oak-OST0000-osc-ffff9baceaa3d800.max_dirty_mb=256
            osc.oak-OST0001-osc-ffff9baceaa3d800.max_dirty_mb=256
            ...
            osc.oak-OST0070-osc-ffff9baceaa3d800.max_dirty_mb=256
            osc.oak-OST0071-osc-ffff9baceaa3d800.max_dirty_mb=256
            osc.regal-OST0000-osc-ffff9bace6e28800.max_dirty_mb=256
            osc.regal-OST0001-osc-ffff9bace6e28800.max_dirty_mb=256
            osc.regal-OST0002-osc-ffff9bace6e28800.max_dirty_mb=256
            ...
            osc.regal-OST006b-osc-ffff9bace6e28800.max_dirty_mb=256

            So that should be much better. I'll report any new occurrence of this issue, but so far so good. Thanks again.
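
With this many OSC devices, it can be easier to scan the `lctl get_param` output for stragglers than to read it by eye. The sketch below is a hypothetical helper, not part of Lustre: it only parses text in the `name=value` form shown in the comment above, and the function name `find_mismatched_osc` and its `expected_mb` parameter are made up for illustration.

```python
# Abbreviated sample of `lctl get_param osc.*.max_dirty_mb` output,
# taken from the comment above (elisions kept as "...").
SAMPLE = """\
osc.fir-OST0000-osc-ffff9bad01395000.max_dirty_mb=256
osc.fir-OST0001-osc-ffff9bad01395000.max_dirty_mb=256
...
osc.oak-OST0000-osc-ffff9baceaa3d800.max_dirty_mb=256
osc.regal-OST006b-osc-ffff9bace6e28800.max_dirty_mb=256
"""


def find_mismatched_osc(output, expected_mb=256):
    """Return (parameter, value) pairs whose max_dirty_mb differs from expected_mb."""
    mismatches = []
    for raw in output.splitlines():
        line = raw.strip()
        # Split on the last "=" so the device name is kept whole.
        name, sep, value = line.rpartition("=")
        # Skip elision markers, blank lines and unrelated parameters.
        if not sep or not name.endswith(".max_dirty_mb"):
            continue
        if not value.isdigit():
            continue
        if int(value) != expected_mb:
            mismatches.append((name, int(value)))
    return mismatches


print(find_mismatched_osc(SAMPLE))
```

On a client, the same function could be fed the captured output of `lctl get_param osc.*.max_dirty_mb`; an empty list means every listed device reports the expected value.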

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell