As an update on this bug, Tyler and I spoke at length on the phone this morning. After restarting the OSTs and clients, the filesystem mounted without problems, and at least "lfs df" worked for all OSTs while we were on the phone.
However, the corruption on some of the OSTs, combined with the fact that all files are striped over all OSTs, means that some fraction of all files in the filesystem will have missing data. Since the filesystem is used only as a staging area, it is recommended that the filesystem simply be reformatted to get it back into a known state, instead of spending more time isolating which files were corrupted and then having to restore them into the filesystem anyway. This will also avoid any potential bugs or data corruption that may not be evident with limited testing.
We also discussed the current default configuration of striping all files across all 16 OSTs. I recommended that Tyler use the "lfs setstripe -c {stripes} {new file}" command to create some test files with different numbers of stripes and measure the performance, to determine the minimum stripe count that will hit the peak single-client performance, since the clients are largely doing independent IO to different files. At that point, running multiple parallel read/write jobs on files with the smaller stripe count should be compared with running the same workload on all wide-striped files.
Based on our discussion of the workload, it seems likely that the IO performance of a small number of OSTs (2-4) would match the current peak performance seen by the clients, while reducing contention on the OSTs when multiple clients are doing IO. Reducing the stripe count may increase the aggregate performance seen by multiple clients doing concurrent IO, because there is less chance of contention (seeking) on OSTs being used by multiple clients at once.
Reducing the stripe count would also help isolate the clients from any problems or slowdowns caused by individual OSTs. If an OST is unavailable, then any file that is striped over that OST will also be unavailable.
If an OST is slow for some reason (e.g. RAID rebuild, marginal disk hardware, etc.) then IO to any file striped over it will be limited by that slowest OST, so the more OSTs a file is striped over, the more likely such a problem is to hit a particular file. That said, if there is a minimum bandwidth requirement for a single file, instead of a desire to maximize the aggregate performance of multiple clients doing independent IO, then the file needs enough stripes that N * {slow OST} is still fast enough to meet that minimum bandwidth.
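As a worked example with assumed numbers (neither figure is from this ticket): if a degraded OST can still deliver roughly 100 MB/s and a single file must sustain 350 MB/s, the file needs at least ceil(350/100) = 4 stripes.

```shell
MIN_BW=350      # required single-file bandwidth, MB/s (assumption)
SLOW_OST=100    # worst-case per-OST bandwidth, MB/s (assumption)
# ceiling division: smallest N such that N * SLOW_OST >= MIN_BW
STRIPES=$(( (MIN_BW + SLOW_OST - 1) / SLOW_OST ))
echo "minimum stripe count: $STRIPES"
```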
Rob Baker of LMCO has confirmed that the critical situation is over and production is stable. Residual issues will be tracked under a new ticket.