[LU-11939] ASSERTION( tgd->tgd_tot_granted >= ted->ted_grant ) on OSS Created: 06/Feb/19  Updated: 22/Nov/21  Resolved: 18/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.8

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.6, 3.10.0-957.1.3.el7_lustre.x86_64


Attachments: Text File vmcore-dmesg_fir-io1-s2_2019-02-06_21_29_56.txt     Text File vmcore-dmesg_fir-io3-s2_2019-02-06_12_50_35.txt    
Issue Links:
Duplicate
duplicates LU-12120 LustreError: 15069:0:(tgt_grant.c:561... Resolved
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We just hit the following LBUG with Lustre 2.12 on an OSS (Fir). All clients are running Lustre 2.12 also.

[1708550.581820] LustreError: 123124:0:(tgt_grant.c:1079:tgt_grant_discard()) ASSERTION( tgd->tgd_tot_granted >= ted->ted_grant ) failed: fir-OST001b: tot_granted 50041695803 cli d5e4b60f-fe33-b991-7d48-5b8db7e07ab0/ffff926b10975c00 ted_grant -49152
[1708550.603611] LustreError: 123124:0:(tgt_grant.c:1079:tgt_grant_discard()) LBUG
[1708550.610923] Pid: 123124, comm: ll_ost00_019 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[1708550.621180] Call Trace:
[1708550.623814]  [<ffffffffc0aa37cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[1708550.630548]  [<ffffffffc0aa387c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[1708550.636935]  [<ffffffffc0f220bc>] tgt_grant_discard+0x1dc/0x1e0 [ptlrpc]
[1708550.643892]  [<ffffffffc14c81d4>] ofd_obd_disconnect+0x74/0x220 [ofd]
[1708550.650541]  [<ffffffffc0e60157>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
[1708550.658005]  [<ffffffffc0efeb77>] tgt_disconnect+0x37/0x140 [ptlrpc]
[1708550.664609]  [<ffffffffc0f0635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[1708550.671734]  [<ffffffffc0eaa92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1708550.679628]  [<ffffffffc0eae25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[1708550.686136]  [<ffffffff8dcc1c31>] kthread+0xd1/0xe0
[1708550.691224]  [<ffffffff8e374c24>] ret_from_fork_nospec_begin+0xe/0x21
[1708550.697873]  [<ffffffffffffffff>] 0xffffffffffffffff
[1708550.703065] Kernel panic - not syncing: LBUG
[1708550.707509] CPU: 20 PID: 123124 Comm: ll_ost00_019 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
[1708550.720273] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.6.7 10/29/2018
[1708550.728015] Call Trace:
[1708550.730645]  [<ffffffff8e361e41>] dump_stack+0x19/0x1b
[1708550.735962]  [<ffffffff8e35b550>] panic+0xe8/0x21f
[1708550.740937]  [<ffffffffc0aa38cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[1708550.747346]  [<ffffffffc0f220bc>] tgt_grant_discard+0x1dc/0x1e0 [ptlrpc]
[1708550.754230]  [<ffffffffc14c81d4>] ofd_obd_disconnect+0x74/0x220 [ofd]
[1708550.760880]  [<ffffffffc0e9ed81>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[1708550.767783]  [<ffffffffc0ec3933>] ? req_capsule_server_pack+0x43/0xf0 [ptlrpc]
[1708550.775207]  [<ffffffffc0e60157>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
[1708550.782634]  [<ffffffffc0efeb77>] tgt_disconnect+0x37/0x140 [ptlrpc]
[1708550.789194]  [<ffffffffc0f0635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[1708550.796272]  [<ffffffffc0edfa51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1708550.804022]  [<ffffffffc0aa3bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1708550.811281]  [<ffffffffc0eaa92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1708550.819142]  [<ffffffffc0ea77b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1708550.826110]  [<ffffffff8dcd67c2>] ? default_wake_function+0x12/0x20
[1708550.832548]  [<ffffffff8dccba9b>] ? __wake_up_common+0x5b/0x90
[1708550.838589]  [<ffffffffc0eae25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[1708550.845068]  [<ffffffffc0ead760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1708550.852636]  [<ffffffff8dcc1c31>] kthread+0xd1/0xe0
[1708550.857688]  [<ffffffff8dcc1b60>] ? insert_kthread_work+0x40/0x40
[1708550.863956]  [<ffffffff8e374c24>] ret_from_fork_nospec_begin+0xe/0x21
[1708550.870567]  [<ffffffff8dcc1b60>] ? insert_kthread_work+0x40/0x40
 


 Comments   
Comment by Stephane Thiell [ 07/Feb/19 ]

Same crash happened on OST unmount while trying to fix another issue (@tcp announcing clients, LU-11888):

[1739660.322150] Lustre: Failing over fir-OST000b
[1739660.326649] Lustre: Skipped 5 previous similar messages
[1739660.337361] LustreError: 24209:0:(tgt_grant.c:1079:tgt_grant_discard()) ASSERTION( tgd->tgd_tot_granted >= ted->ted_grant ) failed: fir-OST0005: tot_granted 2114072895 cli e43da944-d239-923f-8f68-10646264727b/ffff90a28977ac00 ted_grant -12582912
[1739660.359265] LustreError: 24209:0:(tgt_grant.c:1079:tgt_grant_discard()) LBUG
[1739660.366492] Pid: 24209, comm: umount 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[1739660.376143] Call Trace:
[1739660.378780]  [<ffffffffc0afd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[1739660.385519]  [<ffffffffc0afd87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[1739660.391906]  [<ffffffffc13fb0bc>] tgt_grant_discard+0x1dc/0x1e0 [ptlrpc]
[1739660.398876]  [<ffffffffc12631d4>] ofd_obd_disconnect+0x74/0x220 [ofd]
[1739660.405522]  [<ffffffffc0d2e7d6>] class_disconnect_export_list+0x1c6/0x680 [obdclass]
[1739660.413582]  [<ffffffffc0d2eda5>] class_disconnect_exports+0x115/0x310 [obdclass]
[1739660.421279]  [<ffffffffc0d493e7>] class_cleanup+0x297/0xbd0 [obdclass]
[1739660.428031]  [<ffffffffc0d4a9ac>] class_process_config+0x65c/0x2830 [obdclass]
[1739660.435465]  [<ffffffffc0d4cd46>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[1739660.442816]  [<ffffffffc0d7dc2e>] server_put_super+0x8de/0xcd0 [obdclass]
[1739660.449827]  [<ffffffff84243dbd>] generic_shutdown_super+0x6d/0x100
[1739660.456303]  [<ffffffff842441b2>] kill_anon_super+0x12/0x20
[1739660.462077]  [<ffffffffc0d4f8b2>] lustre_kill_super+0x32/0x50 [obdclass]
[1739660.468989]  [<ffffffff8424456e>] deactivate_locked_super+0x4e/0x70
[1739660.475456]  [<ffffffff84244cf6>] deactivate_super+0x46/0x60
[1739660.481314]  [<ffffffff8426326f>] cleanup_mnt+0x3f/0x80
[1739660.486739]  [<ffffffff84263302>] __cleanup_mnt+0x12/0x20
[1739660.492338]  [<ffffffff840be79b>] task_work_run+0xbb/0xe0
[1739660.497948]  [<ffffffff8402bc65>] do_notify_resume+0xa5/0xc0
[1739660.503813]  [<ffffffff84775124>] int_signal+0x12/0x17
[1739660.509154]  [<ffffffffffffffff>] 0xffffffffffffffff
[1739660.514347] Kernel panic - not syncing: LBUG
[1739660.518793] CPU: 0 PID: 24209 Comm: umount Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
[1739660.530863] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.6.7 10/29/2018
[1739660.538601] Call Trace:
[1739660.541232]  [<ffffffff84761e41>] dump_stack+0x19/0x1b
[1739660.546543]  [<ffffffff8475b550>] panic+0xe8/0x21f
[1739660.551512]  [<ffffffffc0afd8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[1739660.557903]  [<ffffffffc13fb0bc>] tgt_grant_discard+0x1dc/0x1e0 [ptlrpc]
[1739660.564783]  [<ffffffffc12631d4>] ofd_obd_disconnect+0x74/0x220 [ofd]
[1739660.571411]  [<ffffffffc0c89f02>] ? libcfs_nid2str_r+0xe2/0x130 [lnet]
[1739660.578124]  [<ffffffffc0d2e7d6>] class_disconnect_export_list+0x1c6/0x680 [obdclass]
[1739660.586143]  [<ffffffffc0d2eda5>] class_disconnect_exports+0x115/0x310 [obdclass]
[1739660.593814]  [<ffffffffc0d493e7>] class_cleanup+0x297/0xbd0 [obdclass]
[1739660.600517]  [<ffffffffc0b03f07>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[1739660.607324]  [<ffffffffc0d30836>] ? class_name2dev_nolock+0x46/0xb0 [obdclass]
[1739660.614735]  [<ffffffffc0d4a9ac>] class_process_config+0x65c/0x2830 [obdclass]
[1739660.622131]  [<ffffffffc0b03f07>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[1739660.628939]  [<ffffffffc0d4cd46>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[1739660.636265]  [<ffffffffc0d7dc2e>] server_put_super+0x8de/0xcd0 [obdclass]
[1739660.643223]  [<ffffffff84243dbd>] generic_shutdown_super+0x6d/0x100
[1739660.649661]  [<ffffffff842441b2>] kill_anon_super+0x12/0x20
[1739660.655427]  [<ffffffffc0d4f8b2>] lustre_kill_super+0x32/0x50 [obdclass]
[1739660.662296]  [<ffffffff8424456e>] deactivate_locked_super+0x4e/0x70
[1739660.668736]  [<ffffffff84244cf6>] deactivate_super+0x46/0x60
[1739660.674571]  [<ffffffff8426326f>] cleanup_mnt+0x3f/0x80
[1739660.679968]  [<ffffffff84263302>] __cleanup_mnt+0x12/0x20
[1739660.685541]  [<ffffffff840be79b>] task_work_run+0xbb/0xe0
[1739660.691115]  [<ffffffff8402bc65>] do_notify_resume+0xa5/0xc0
[1739660.696947]  [<ffffffff84775124>] int_signal+0x12/0x17

vmcore is available if needed

Comment by Peter Jones [ 07/Feb/19 ]

Patrick is investigating

Comment by Patrick Farrell (Inactive) [ 07/Feb/19 ]

A vmcore would be welcome; first off, the dmesg in both cases would be great to have.

A few questions:

  1. OST backing file system - ZFS or ldiskfs?
  2. Any evictions?  (Hoping to see that in dmesg)
Comment by Stephane Thiell [ 07/Feb/19 ]

Hi Patrick,

Thanks for investigating and congrats on your new position!!

1. ldiskfs
2. many evictions due to clients configured with tcp0 nids even though the servers are IB only, see LU-11888/LU-11937

Attached the dmesg to the ticket:

  • first crash is vmcore-dmesg_fir-io3-s2_2019-02-06_12_50_35.txt
  • second crash while unmounting is vmcore-dmesg_fir-io1-s2_2019-02-06_21_29_56.txt

I also just uploaded the two vmcores to the ftp:

  • vmcore_fir-io3-s2_2019-02-06_12_50_35.gz
  • vmcore_fir-io1-s2_2019-02-06_21_29_56.gz

I believe the debuginfo rpms are already available in the ftp:

  • kernel-debuginfo-3.10.0-957.1.3.el7_lustre.x86_64.rpm
  • kernel-debuginfo-common-x86_64-3.10.0-957.1.3.el7_lustre.x86_64.rpm

and I also uploaded our lustre debuginfo rpm:

  • lustre-debuginfo-2.12.0-1.el7.x86_64.rpm

No more crashes so far since we restarted all servers and fixed our clients announcing *@tcp NIDs.

Thanks!
Stephane

Comment by Patrick Farrell (Inactive) [ 08/Feb/19 ]

Thanks, Stephane!

So as you've probably guessed, there's some sort of bug related to grant handling on evictions.  If you're not having evictions, you shouldn't see this.  So you're probably good to go from a "site reliability" perspective.

I'm going to look into the grant behavior at eviction, and also see about not asserting here: clean up and print an error instead.
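For the record, here's a minimal userspace sketch of that idea, not the actual patch (the field names are taken from the LBUG message above; the struct layouts and the repair logic are illustrative assumptions):

#include <stdio.h>

struct tg_grants_data { unsigned long long tgd_tot_granted; };
struct tg_export_data { long ted_grant; };

static void grant_discard(struct tg_grants_data *tgd, struct tg_export_data *ted)
{
	/* Instead of LASSERT(tgd->tgd_tot_granted >= ted->ted_grant),
	 * detect the inconsistency, report it, and repair what we can. */
	if (ted->ted_grant < 0 ||
	    tgd->tgd_tot_granted < (unsigned long long)ted->ted_grant) {
		fprintf(stderr, "grant inconsistency: tot_granted %llu, ted_grant %ld\n",
			tgd->tgd_tot_granted, ted->ted_grant);
		/* reclaim what we safely can; a negative export grant
		 * leaves the total untouched */
		if (ted->ted_grant > 0)
			tgd->tgd_tot_granted = 0;
	} else {
		tgd->tgd_tot_granted -= (unsigned long long)ted->ted_grant;
	}
	ted->ted_grant = 0;
}

int main(void)
{
	struct tg_grants_data tgd = { 50041695803ULL }; /* values from the first LBUG */
	struct tg_export_data ted = { -49152 };

	grant_discard(&tgd, &ted); /* logs the inconsistency instead of panicking */
	printf("tot_granted now %llu, ted_grant now %ld\n",
	       tgd.tgd_tot_granted, ted.ted_grant);
	return 0;
}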

Comment by Gerrit Updater [ 08/Feb/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34215
Subject: LU-11939 tgt: Do not assert during grant cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f44d793b8a4786042a4cf38cf967ce9686da5b81

Comment by Patrick Farrell (Inactive) [ 08/Feb/19 ]

Stephane,

Looking a little into the underlying bug...

What are your max_pages_per_rpc, max_dirty_mb, and max_rpcs_in_flight settings on the client?

Comment by Stephane Thiell [ 08/Feb/19 ]

Thanks Patrick!

On our clients, we have:

  • max_rpcs_in_flight=8 (default), only the data transfer nodes and robinhood server have max_rpcs_in_flight=32
  • max_dirty_mb=2000 (default), only the data transfer nodes and robinhood server have max_dirty_mb=128

As for max_pages_per_rpc, it should be set to 4096 and brw_size=16, but I noticed that it doesn't seem to be the case on all clients:

[root@sh-101-20 ~]# cd /proc/fs/lustre/osc; for o in fir-*; do echo -n "$o:"; cat $o/max_pages_per_rpc; done
fir-OST0000-osc-ffff9d0cad3de000:4096
fir-OST0001-osc-ffff9d0cad3de000:4096
fir-OST0002-osc-ffff9d0cad3de000:1024
fir-OST0003-osc-ffff9d0cad3de000:1024
fir-OST0004-osc-ffff9d0cad3de000:1024
fir-OST0005-osc-ffff9d0cad3de000:1024
fir-OST0006-osc-ffff9d0cad3de000:1024
fir-OST0007-osc-ffff9d0cad3de000:4096
fir-OST0008-osc-ffff9d0cad3de000:1024
fir-OST0009-osc-ffff9d0cad3de000:1024
fir-OST000a-osc-ffff9d0cad3de000:1024
fir-OST000b-osc-ffff9d0cad3de000:1024
fir-OST000c-osc-ffff9d0cad3de000:1024
fir-OST000d-osc-ffff9d0cad3de000:1024
fir-OST000e-osc-ffff9d0cad3de000:1024
fir-OST000f-osc-ffff9d0cad3de000:1024
fir-OST0010-osc-ffff9d0cad3de000:4096
fir-OST0011-osc-ffff9d0cad3de000:4096
fir-OST0012-osc-ffff9d0cad3de000:4096
fir-OST0013-osc-ffff9d0cad3de000:1024
fir-OST0014-osc-ffff9d0cad3de000:1024
fir-OST0015-osc-ffff9d0cad3de000:4096
fir-OST0016-osc-ffff9d0cad3de000:1024
fir-OST0017-osc-ffff9d0cad3de000:1024
fir-OST0018-osc-ffff9d0cad3de000:4096
fir-OST0019-osc-ffff9d0cad3de000:4096
fir-OST001a-osc-ffff9d0cad3de000:1024
fir-OST001b-osc-ffff9d0cad3de000:1024
fir-OST001c-osc-ffff9d0cad3de000:4096
fir-OST001d-osc-ffff9d0cad3de000:4096
fir-OST001e-osc-ffff9d0cad3de000:1024
fir-OST001f-osc-ffff9d0cad3de000:1024
fir-OST0020-osc-ffff9d0cad3de000:4096
fir-OST0021-osc-ffff9d0cad3de000:4096
fir-OST0022-osc-ffff9d0cad3de000:1024
fir-OST0023-osc-ffff9d0cad3de000:1024
fir-OST0024-osc-ffff9d0cad3de000:1024
fir-OST0025-osc-ffff9d0cad3de000:1024
fir-OST0026-osc-ffff9d0cad3de000:1024
fir-OST0027-osc-ffff9d0cad3de000:1024
fir-OST0028-osc-ffff9d0cad3de000:1024
fir-OST0029-osc-ffff9d0cad3de000:1024
fir-OST002a-osc-ffff9d0cad3de000:4096
fir-OST002b-osc-ffff9d0cad3de000:4096
fir-OST002c-osc-ffff9d0cad3de000:4096
fir-OST002d-osc-ffff9d0cad3de000:1024
fir-OST002e-osc-ffff9d0cad3de000:1024
fir-OST002f-osc-ffff9d0cad3de000:1024

We used this on the MGS:

lctl set_param -P fir-OST*.osc.max_pages_per_rpc=4096

So this isn't good. I just re-applied this command on the MGS:

[138882.544463] Lustre: Modifying parameter osc.fir-OST*.osc.max_pages_per_rpc in log params

and a newly mounted client is now set up at 4096. Do you think that could have caused this issue?

As for brw_size:

# clush -w @oss -b 'cat /proc/fs/lustre/obdfilter/*/brw_size'
---------------
fir-io[1-4]-s[1-2] (8)
---------------
16
16
16
16
16
16
Comment by Patrick Farrell (Inactive) [ 08/Feb/19 ]

Pages per RPC isn't a big deal (so, no, probably not), but max_dirty_mb may be.

Hmm, 2000 is not the default (the default is, I think, max_rpcs_in_flight * RPC size... it's certainly much smaller than this value). So that's getting set somewhere.

It's also potentially too high.  max_dirty_mb is used in calculating grant requests, and there are some grant overflow bugs that occur when it's set that high.  (Particularly with 16 MiB RPCs.)  All the ones I know of are fixed in 2.12, but...

 

I strongly suspect you may have hit an overflow, leading to the grant inconsistency, leading to this crash. 

The grant value reported for the export for this client is negative - ted_grant -49152 in one case, in the other ted_grant -12582912.  These small-ish negative values strongly suggest overflow.  The server side value being compared against (tot_granted) is unsigned, and comparison with this negative value is why the "total grant >= grant for this export" assert we hit failed.  (The fact that your max_dirty_mb is at 2 GiB just makes the overflow explanation more likely.)
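To make the failure mode concrete, here's a tiny standalone C example of that comparison (assuming, per the above, an unsigned 64-bit total and a signed per-export grant; not the actual tgt_grant.c code):

#include <stdio.h>

int main(void)
{
	unsigned long long tot_granted = 50041695803ULL; /* from the first LBUG message */
	long ted_grant = -49152;                         /* the negative export grant */

	/* The usual arithmetic conversions turn the signed value into a
	 * huge unsigned one, so "total >= per-export grant" fails. */
	if (tot_granted >= (unsigned long long)ted_grant)
		printf("assertion holds\n");
	else
		printf("assertion fails: %ld compares as %llu\n",
		       ted_grant, (unsigned long long)ted_grant);
	return 0;
}

This prints "assertion fails: -49152 compares as 18446744073709502464".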

 

Your max_dirty_mb is above what should help performance, so tuning it down is a good idea.

The rule I use for max_dirty_mb is 2 * mb_per_rpc * rpcs_in_flight - the idea being that you can accumulate some dirty data so you're always ready to make an RPC when one completes, but you don't have tons of dirty data sitting around if it isn't getting processed fast enough.  (There's some docs floating around that say 4*, but with RPC sizes and counts increasing, that tends to be too much data.  2* should be plenty for good performance.)

So for you, that's 2*16*8 = 256, or in the case of your RBH nodes, that's 2*16*32=1024.

So I'd suggest turning down your max_dirty_mb to no more than 1 GiB.
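As a tiny compilable version of that arithmetic (the factor of 2 is my heuristic, not a documented Lustre constant; the values are the ones from this ticket):

#include <stdio.h>

/* Rule of thumb from above: 2 * mb_per_rpc * rpcs_in_flight. */
static unsigned int suggested_max_dirty_mb(unsigned int mb_per_rpc,
					   unsigned int rpcs_in_flight)
{
	return 2 * mb_per_rpc * rpcs_in_flight;
}

int main(void)
{
	/* brw_size=16 on the servers; 8 RPCs in flight by default,
	 * 32 on the DTN/robinhood nodes */
	printf("default clients: %u MiB\n", suggested_max_dirty_mb(16, 8));
	printf("DTN/robinhood:   %u MiB\n", suggested_max_dirty_mb(16, 32));
	return 0;
}

This prints 256 and 1024, matching the numbers above.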

Comment by Stephane Thiell [ 09/Feb/19 ]

Wow, thanks much for the detailed explanation. This is SUPER helpful. But... I don't think we have explicitly changed the value of max_dirty_mb. So I've been trying to track down why it is so high for ALL of our Lustre filesystems mounted on Sherlock (regal, oak and fir). If I understand correctly, /sys/fs/lustre/max_dirty_mb is used by the udev script provided by the lustre-client RPM, right? And then the values probably max out at 2000?
 

[root@sh-ln06 ~]# cat /sys/fs/lustre/version
2.12.0
[root@sh-ln06 ~]# cat /sys/fs/lustre/max_dirty_mb 
32107
[root@sh-ln06 ~]# ls -l /sys/fs/lustre/max_dirty_mb
-rw-r--r-- 1 root root 4096 Feb  8 16:54 /sys/fs/lustre/max_dirty_mb
[root@sh-ln06 ~]# cat /etc/udev/rules.d/99-lustre.rules
KERNEL=="obd", MODE="0666"

# set sysfs values on client
SUBSYSTEM=="lustre", ACTION=="change", ENV{PARAM}=="?*", RUN+="/usr/sbin/lctl set_param '$env{PARAM}=$env{SETTING}'"


[root@sh-ln06 ~]# rpm -q --info lustre-client
Name        : lustre-client
Version     : 2.12.0
Release     : 1.el7
Architecture: x86_64
Install Date: Tue 05 Feb 2019 05:17:58 PM PST
Group       : System Environment/Kernel
Size        : 2007381
License     : GPL
Signature   : (none)
Source RPM  : lustre-client-2.12.0-1.el7.src.rpm
Build Date  : Fri 21 Dec 2018 01:53:18 PM PST
Build Host  : trevis-307-el7-x8664-3.trevis.whamcloud.com
Relocations : (not relocatable)
URL         : https://wiki.whamcloud.com/
Summary     : Lustre File System
Description :
Userspace tools and files for the Lustre file system.


[root@sh-ln06 ~]# lctl get_param osc.*.max_dirty_mb
osc.fir-OST0000-osc-ffff9bad01395000.max_dirty_mb=2000
osc.fir-OST0001-osc-ffff9bad01395000.max_dirty_mb=2000
...
osc.fir-OST002e-osc-ffff9bad01395000.max_dirty_mb=2000
osc.fir-OST002f-osc-ffff9bad01395000.max_dirty_mb=2000
osc.oak-OST0000-osc-ffff9baceaa3d800.max_dirty_mb=2000
osc.oak-OST0001-osc-ffff9baceaa3d800.max_dirty_mb=2000
...
osc.oak-OST0071-osc-ffff9baceaa3d800.max_dirty_mb=2000
osc.regal-OST0000-osc-ffff9bace6e28800.max_dirty_mb=2000
...
osc.regal-OST006a-osc-ffff9bace6e28800.max_dirty_mb=2000
osc.regal-OST006b-osc-ffff9bace6e28800.max_dirty_mb=2000

 

Comment by Patrick Farrell (Inactive) [ 09/Feb/19 ]

Hmm, I'm not familiar with the script, so I don't really know.  I don't think so, though...?

It's possible you're hitting:
https://jira.whamcloud.com/browse/LU-11919

Which is basically "cl_max_dirty_mb is supposed to start at zero, but instead starts with whatever was in memory".  Then, whatever was in memory is processed like it was a setting from userspace.  So if it's not zero (the most likely case, especially at startup), it's reasonably likely (though not guaranteed - it's more complicated than just "existing value in memory is > 2000 means 2000") to get set to the max.
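As a rough illustration of that failure mode (a simplification under my reading of LU-11919, not the actual osc code; the ceiling constant and helper name are made up for the example):

#include <stdio.h>

#define DIRTY_MB_CEILING 2000 /* hypothetical ceiling matching the observed 2000 */

/* The userspace-setting path clamps out-of-range values to the ceiling. */
static unsigned int apply_dirty_mb_setting(unsigned int requested)
{
	return requested > DIRTY_MB_CEILING ? DIRTY_MB_CEILING : requested;
}

int main(void)
{
	/* cl_max_dirty_mb is never zeroed first, so whatever the memory
	 * happened to hold is treated as if userspace had written it;
	 * 32107 is the node-wide value from the output above. */
	unsigned int stale_value = 32107;

	printf("effective max_dirty_mb = %u\n", apply_dirty_mb_setting(stale_value));
	return 0;
}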

Anyway, you can override that with a set_param -P.

Comment by Stephane Thiell [ 09/Feb/19 ]

OK, we never noticed that before (with 2.10 clients). Thanks for your help! I used set_param -P on the MGS of Fir to set max_dirty_mb to 256 and it did work.

lctl set_param -P osc.*.max_dirty_mb=256
[root@sh-ln06 ~]# lctl get_param osc.*.max_dirty_mb
osc.fir-OST0000-osc-ffff9bad01395000.max_dirty_mb=256
osc.fir-OST0001-osc-ffff9bad01395000.max_dirty_mb=256
...
osc.fir-OST002e-osc-ffff9bad01395000.max_dirty_mb=256
osc.fir-OST002f-osc-ffff9bad01395000.max_dirty_mb=256
osc.oak-OST0000-osc-ffff9baceaa3d800.max_dirty_mb=256
osc.oak-OST0001-osc-ffff9baceaa3d800.max_dirty_mb=256
...
osc.oak-OST0070-osc-ffff9baceaa3d800.max_dirty_mb=256
osc.oak-OST0071-osc-ffff9baceaa3d800.max_dirty_mb=256
osc.regal-OST0000-osc-ffff9bace6e28800.max_dirty_mb=256
osc.regal-OST0001-osc-ffff9bace6e28800.max_dirty_mb=256
osc.regal-OST0002-osc-ffff9bace6e28800.max_dirty_mb=256
...
osc.regal-OST006b-osc-ffff9bace6e28800.max_dirty_mb=256

So that should be much better. I'll report any new event regarding this issue, but so far so good. Thanks again.

Comment by Peter Jones [ 16/Feb/19 ]

So ok to close this one as a duplicate of LU-11919?

Comment by Patrick Farrell (Inactive) [ 08/Apr/19 ]

Nah, we've still got a patch to track under this

Comment by Patrick Farrell (Inactive) [ 12/Jul/19 ]

tappro:

Mike,

Didn't you fix this grant bug in another LU?  I can't find it right now...

Comment by Mikhail Pershin [ 12/Jul/19 ]

Patrick, do you mean patch from LU-12120?

Comment by Patrick Farrell (Inactive) [ 12/Jul/19 ]

Yes, that looks like the right one.  Do you agree that should take care of this issue as well?

Comment by Peter Jones [ 18/Sep/19 ]

Mike confirms that this is a duplicate of LU-12120

Comment by Gerrit Updater [ 08/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34215/
Subject: LU-11939 tgt: Do not assert during grant cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: af2d3ac30eafead6b47c5db20d76433c091d89de

Comment by Gerrit Updater [ 08/Nov/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45489
Subject: LU-11939 tgt: Do not assert during grant cleanup
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 81f17cf04fc7d4d4bd7ab87cfe572b7f59cf81f3

Comment by Gerrit Updater [ 17/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45489/
Subject: LU-11939 tgt: Do not assert during grant cleanup
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 372c77f0a11573e9f8818751c24735e151aafc74
