[LU-10826] Regression in LU-9372 on OPA environment and no recovery triggered Created: 19/Mar/18  Updated: 24/Jul/18  Resolved: 09/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

master, centos7.4
1 x MDS/MDT, 1 x OSS/OST and 1 client with OPA-10.6


Issue Links:
Related
is related to LU-9372 OOM happens on OSS during Lustre reco... Resolved
is related to LU-10993 Fix for LU-10826 is problematic and s... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like LU-9372 introduced a regression on an OPA environment: when test_req_buffer_pressure is enabled (test_req_buffer_pressure=1) on the OSS or MDS, clients never reconnect and Lustre recovery is never triggered.

[root@es14k-vm1 ~]# umount -t lustre -a
[root@es14k-vm1 ~]# lustre_rmmod 
[root@es14k-vm1 ~]# vi /etc/modprobe.d/lustre.conf

options ptlrpc test_req_buffer_pressure=1

[root@es14k-vm1 ~]# mount -t lustre /dev/ddn/scratch0_ost0000 /lustre/scratch0/ost0000

[root@es14k-vm1 ~]# cat /proc/fs/lustre/obdfilter/scratch0-OST0000/recovery_status 
status: WAITING_FOR_CLIENTS

Clients never reconnected and Lustre recovery was never triggered.
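
As a quick sanity check while reproducing, the option can be read back once the modules are loaded; this is only a sketch and assumes test_req_buffer_pressure is exported like a regular module parameter under /sys/module.

# Confirm the ptlrpc option from /etc/modprobe.d/lustre.conf is in effect
# (path assumes the parameter is exported read-only under /sys/module)
cat /sys/module/ptlrpc/parameters/test_req_buffer_pressure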



 Comments   
Comment by Bruno Faccini (Inactive) [ 19/Mar/18 ]

Hmm, I may have been too restrictive in my LU-9372 patches when test_req_buffer_pressure is being configured.

I will double-check on a test platform.

 

Comment by Gerrit Updater [ 20/Mar/18 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31690
Subject: LU-10826 ptlrpc: fix test_req_buffer_pressure behavior
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8420c6b4f04ee29c4374f9fdbe216c3f87e6fc26

Comment by Shuichi Ihara (Inactive) [ 21/Mar/18 ]

Thanks Bruno, patch https://review.whamcloud.com/31690 works and Lustre recovery is now triggered.
However, the main concern is how resource allocation for all client requests is tuned when test_req_buffer_pressure=1.
In fact, as LU-9372 described, if a single OSS hosts many OSTs with a limited amount of memory, recovery starting in parallel for many OSTs triggers an OOM. We want to prevent the OOM in that situation. Currently, once we set test_req_buffer_pressure=1, the allocation is either driven by memory size or set manually through req_buffers_max; there is still no automated setting.
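
For reference, the manual workaround mentioned above would look roughly like this; the exact lctl parameter path for req_buffers_max is an assumption here and should be checked against the LU-9372 changes on the running tree.

# Hypothetical manual cap on request buffers for the OSS ost_io service;
# the parameter path is assumed, verify it with: lctl list_param -R '*req_buffers*'
lctl set_param ost.OSS.ost_io.req_buffers_max=2048
lctl get_param ost.OSS.ost_io.req_buffers_max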

Comment by Bruno Faccini (Inactive) [ 21/Mar/18 ]

Shuichi, I am already looking for a way to auto-tune req_buffers_max, possibly based on the current number of active OSTs/targets, the number of connected/to-be-recovered clients, the memory available, ...
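
Purely to illustrate what such auto-tuning could take into account (a hypothetical sketch, not part of any patch on this ticket), a heuristic combining target count and memory might look like:

#!/bin/bash
# Hypothetical heuristic only: derive a req_buffers_max suggestion from
# total memory and the number of local OST targets. The 10% memory budget
# and the 32 KiB per-buffer cost are illustrative assumptions.
ntargets=$(ls -d /proc/fs/lustre/obdfilter/*-OST* 2>/dev/null | wc -l)
[ "$ntargets" -gt 0 ] || ntargets=1
memkb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "suggested req_buffers_max: $(( memkb / 10 / 32 / ntargets ))"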

I also want to check some other options which may help limit each rqbd buffer's memory footprint (each buffer is presently allocated from a 32K slab even though its size is only 17K, since the patch for LU-4755 "ptlrpc: enlarge OST_MAXREQSIZE for 4MB RPC"): either use a dedicated kmem_cache (if there is no underlying interference/drift from the Slab/Slub layer), or try to reduce OST_MAXREQSIZE so that the full buffer size drops to 16K (provided it keeps the "4MB RPC" capability) and the buffers can therefore be allocated from a 16K slab.
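
Just to quantify the per-buffer overhead being discussed (illustrative arithmetic only; the buffer count is a made-up figure):

# Rounding a ~17 KiB rqbd buffer up to a 32 KiB allocation wastes ~15 KiB
# per buffer; for a hypothetical 10000 buffers:
nbufs=10000
echo "allocated: $(( nbufs * 32 )) KiB  used: $(( nbufs * 17 )) KiB  overhead: $(( nbufs * 15 )) KiB"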

 

Comment by Gerrit Updater [ 09/Apr/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31690/
Subject: LU-10826 ptlrpc: fix test_req_buffer_pressure behavior
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 040eca67f8d5422b0099d1b70594b5eb40a0f9ef

Comment by Peter Jones [ 09/Apr/18 ]

Landed for 2.12
