Test Plan for the "suppress_pings" ptlrpc Module Parameter 1. Introduction In Lustre 2.4, a new ptlrpc module parameter, "suppress_pings", is introduced to provide an option for reducing excessive OBD_PING messages in large clusters. The parameter is a switch and affects all MDTs and OSTs on a node. (MGS pings can not be suppressed.) By default, it is off (zero), giving a behavior identical to previous implementations. If it is on (non-zero), all clients of the affected targets who understand OBD_CONNECT_PINGLESS will know, at connect time, that pings are not required and will suppress keep-alive pings. In production environments, when suppressing pings, there must be an external mechanism to notify the targets of client deaths, via the targets' "evict_client" procfs entries. In addition, a highly available standalone MGS is also recommended when suppressing pings, so that clients are notified (through Imperative Recovery) of target recoveries. 2. Test Cases The following configurations will be referenced by later sections: Normal. All servers do not have "suppress_pings" set to any value. No other requirements. For example, either a combined or standalone MGS will do. Suppressed. All servers have "suppress_pings" set to "1" or any non-zero value. The MGS is standalone. 2.1. Pings not suppressed by default On a "normal" cluster, stop all workloads and check the numbers of "obd_ping" (or "ping", on the OSSs) samples by reading the following procfs files: MGS: /proc/fs/lustre/mgs/MGS/mgs/stats MDSs: /proc/fs/lustre/mds/MDS/mdt/stats OSSs: /proc/fs/lustre/obdfilter/*/stats Wait a while (e.g., an obd_timeout) and check the numbers again. All the numbers should have grown larger. 2.2. Pings suppressed on request On a "suppressed" cluster, stop all workloads, wait a minute (for all the transactions to be committed and for all the clients to learn that fact), check the same procfs files as in section 2.1, wait a while (e.g., an obd_timeout), and check the numbers again. 
The number on the MGS should have grown, while all the other numbers should
have remained constant.

2.3. Clients notified of target recoveries in the absence of pings

On a "suppressed" cluster, stop all workloads, wait a minute (for all
transactions to be committed and for all clients to learn that fact), and
restart an OST. Monitor the corresponding OSC states on the clients by
reading this procfs file:

  /proc/fs/lustre/osc//state

All OSCs should eventually be notified of the OST restart (i.e., turn into
"DISCONN") and reconnect (i.e., then move through the recovery states into
"FULL"). Do the same to an MDT and check the MDC states on the clients by
reading this procfs file:

  /proc/fs/lustre/mdc//state

All MDCs should behave in the same way as the OSCs.

2.4. Pings unsuppressible when uncommitted requests exist

On a "suppressed" cluster, stop all workloads, wait a minute (for all
transactions to be committed and for all clients to learn that fact), and
check the same procfs files as in section 2.1. Create a file and get the FIDs
and versions of its MDT object and all OST objects with the following
commands:

  Client: lfs setstripe --count=-1
  Client: dd if=/dev/zero of= bs=1M count= oflag=sync
  Client: lfs getstripe
  Client: lfs path2fid
  MDS:    lctl --device getobjversion
  OSSs:   lctl --device getobjversion -i -g (LU-2783)
  Client: echo -n >

Check the "peer_committed" transaction numbers in these procfs files on the
clients:

  /proc/fs/lustre/mdc//import
  /proc/fs/lustre/osc/*/import

The transaction numbers should eventually grow to the corresponding versions,
which are transaction numbers themselves. Check the same procfs files as in
section 2.1 again. The numbers of pings should have grown.

2.5. Normal workloads not affected with pings suppressed

On a "suppressed" cluster, run the usual benchmarks (i.e., the "IO section"
of SWL) with the same parameters as in the release test plan. No errors,
OOMs, or panics should occur.
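The "peer_committed should eventually grow to the object version" check in section 2.4 can be sketched as a polling loop. This is an illustrative sketch only: the `wait_committed` helper name is hypothetical, the field name "peer_committed" is taken from the test plan, and the sample import-file layout below is an assumption about the general shape of the procfs `import` output, not its exact format.

```shell
# Hypothetical helper: poll an import file until its peer_committed
# transaction number reaches the given object version, then print it.
wait_committed() {
    local import_file=$1 version=$2 committed
    while :; do
        # Assumed layout: a "peer_committed:" line with the transno after it.
        committed=$(awk -F: '/peer_committed/ { gsub(/[ ,]/, "", $2); print $2; exit }' "$import_file")
        [ "${committed:-0}" -ge "$version" ] && break
        sleep 1
    done
    echo "$committed"
}

# Illustrative sample of an import file (contents are made up for the demo).
cat > /tmp/import.sample <<'EOF'
import:
    transactions:
       last_replay: 0
       peer_committed: 8589934617
EOF

wait_committed /tmp/import.sample 8589934617
```

In the real test, the loop would run against `/proc/fs/lustre/mdc//import` and `/proc/fs/lustre/osc/*/import` on the clients, with the version argument taken from the `lctl getobjversion` output.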