
Regression in LU-9372 on OPA environment and no recovery triggered

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.12.0
    • Environment: master, centos7.4
      1 x MDS/MDT, 1 x OSS/OST and 1 client with OPA-10.6
    • Severity: 3

    Description

      It looks like LU-9372 introduced a regression in an OPA environment: when test_req_buffer_pressure is enabled (test_req_buffer_pressure=1) on the OSS or MDS, clients never reconnect and Lustre recovery is never triggered.

      [root@es14k-vm1 ~]# umount -t lustre -a
      [root@es14k-vm1 ~]# lustre_rmmod 
      [root@es14k-vm1 ~]# vi /etc/modprobe.d/lustre.conf
      
      options ptlrpc test_req_buffer_pressure=1
      
      [root@es14k-vm1 ~]# mount -t lustre /dev/ddn/scratch0_ost0000 /lustre/scratch0/ost0000
      
      [root@es14k-vm1 ~]# cat /proc/fs/lustre/obdfilter/scratch0-OST0000/recovery_status 
      status: WAITING_FOR_CLIENTS
      

      The client never reconnected and Lustre recovery was never triggered.
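
      As a sanity check (a sketch, assuming test_req_buffer_pressure is exported as a readable ptlrpc module parameter under /sys/module/ptlrpc/parameters, which depends on how the parameter is declared), the setting can be verified after the mount has reloaded the modules:

      [root@es14k-vm1 ~]# cat /sys/module/ptlrpc/parameters/test_req_buffer_pressure
      1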

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31690/
            Subject: LU-10826 ptlrpc: fix test_req_buffer_pressure behavior
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 040eca67f8d5422b0099d1b70594b5eb40a0f9ef


            bfaccini Bruno Faccini (Inactive) added a comment -

            Shuichi, I am already looking for a way to auto-tune req_buffers_max, possibly based on the current number of active OSTs/targets, the number of connected/to-be-recovered clients, the available memory, ...

            I also want to check some other options that may help to limit each rqbd buffer's memory footprint (presently allocated from a 32K slab even though its size is only 17K, since the patch for LU-4755 "ptlrpc: enlarge OST_MAXREQSIZE for 4MB RPC"), such as using a dedicated kmem_cache (if there is no underlying interference/drift from the Slab/Slub layer), or trying to reduce OST_MAXREQSIZE so that the full buffer size goes down to 16K (while keeping the "4MB RPC" capability) and can thus be allocated from a 16K slab.
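
            To make the footprint concrete (a rough sketch with a hypothetical buffer count, not figures from this ticket), fitting each buffer into a 16K slab object instead of a 32K one halves the memory pinned by posted request buffers:

            [root@es14k-vm1 ~]# echo $(( 2048 * 32768 / 1048576 ))MB   # 2048 hypothetical rqbds, 17K each rounded up to 32K slab objects
            64MB
            [root@es14k-vm1 ~]# echo $(( 2048 * 16384 / 1048576 ))MB   # same count if the buffer fits a 16K slab
            32MB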

             


            ihara Shuichi Ihara (Inactive) added a comment -

            Thanks Bruno, patch https://review.whamcloud.com/31690 works and Lustre recovery is now triggered.
            However, the main concern is the tuning of resource allocation for all client requests with test_req_buffer_pressure=1.
            In fact, as LU-9372 described, if a single OSS has many OSTs and a limited amount of memory, Lustre recovery triggers an OOM when recovery starts in parallel for many OSTs. We want to prevent OOM in that situation. However, currently, once we set test_req_buffer_pressure=1, it is either controlled by the memory size or we use req_buffers_max for manual setting. It is still not an automated setting.
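
            For the manual req_buffers_max setting mentioned above, a minimal sketch following the same /etc/modprobe.d convention as the description (this assumes the LU-9372 limit is exposed as a ptlrpc module parameter named req_buffers_max; the value 2048 is only an illustrative cap, not a recommendation):

            options ptlrpc test_req_buffer_pressure=1 req_buffers_max=2048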


            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31690
            Subject: LU-10826 ptlrpc: fix test_req_buffer_pressure behavior
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8420c6b4f04ee29c4374f9fdbe216c3f87e6fc26


            bfaccini Bruno Faccini (Inactive) added a comment -

            Hmm, I may have been too restrictive in my LU-9372 patches when test_req_buffer_pressure is being configured.

            I will double-check on a test platform.

             


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
