[LU-12504] Lustre stalls with "slow creates" on disabled OST Created: 02/Jul/19 Updated: 03/Jul/19 Resolved: 03/Jul/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | Chris Hanna | Assignee: | Chris Hanna |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
Greetings,

We had an OST which was physically damaged recently on our Lustre 2.5.5 system. We were able to deactivate new file creation on the OST from the MDS (using lctl --device data-OST0036-osc-MDT0000 deactivate), and lfs_migrate the data off, but then there were still quota problems when contacting the damaged OST. So, we tried to disable the OST from the client side as well.

That worked, but now there are stray messages from our MDS warning of “slow creates” to this supposedly disabled OST, and filesystem creates are now very slow:

    Jul 2 08:40:21 mds1 kernel: Lustre: data-OST0036-osc-MDT0000: slow creates, last=[0x100360000:0xe4f61:0x0], next=[0x100360000:0xe4f61:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-19

All of the below have been tried to fix this on the MDS:

    lctl --device data-OST0036-osc-MDT0000 deactivate

On clients, the OST is disabled, and the logs show “Lustre: setting import data-OST0036_UUID INACTIVE by administrator request”:

    client$ lctl get_param osc.*-OST0036*.active

The MDS also believes this OST is inactive:

    mds$ cat /proc/fs/lustre/osp/data-OST0036-osc-MDT0000/active

However, the slow creates message persists on the MDS, about one every 10 minutes, always with the same “last” and “next” ids.

Is there something we have missed, or some other way this should have been resolved to permanently remove this OST? We have not yet tried standing up a new OST at the same index, or restarting the MDS.

(Update: Standing up a new OST to replace the defunct blank one, and setting it back to active, cleaned this up. It still would be nice to know the proper way to handle this situation, though.)

Thanks for any advice you may have,
Chris
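
For anyone triaging the same symptom, the "slow creates" message in the description can be picked apart with a few lines of shell. This is only a sketch: the function names (slow_create_status, errno_name, slow_create_window_empty) are made up for illustration, and reading last == next as "no precreated objects available" is an interpretation, not something stated in this ticket. The errno values are the standard Linux ones; status=-19 is ENODEV ("No such device").

```shell
#!/bin/sh
# Hypothetical helpers for reading an MDS "slow creates" console message.

# Print the numeric status= field of a "slow creates" message, e.g. -19.
slow_create_status() {
    printf '%s\n' "$1" |
        sed -n 's/.*slow creates.*status=\(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}

# Map a few negative Linux errno values to their symbolic names.
errno_name() {
    case "$1" in
        -5)   echo EIO ;;
        -19)  echo ENODEV ;;
        -107) echo ENOTCONN ;;
        -110) echo ETIMEDOUT ;;
        *)    echo UNKNOWN ;;
    esac
}

# Succeed when the last= and next= FIDs in the message are identical.
# Assumption: identical values mean the precreate pool is empty.
slow_create_window_empty() {
    last=$(printf '%s\n' "$1" | sed -n 's/.*last=\(\[[^]]*]\).*/\1/p')
    next=$(printf '%s\n' "$1" | sed -n 's/.*next=\(\[[^]]*]\).*/\1/p')
    [ -n "$last" ] && [ "$last" = "$next" ]
}
```

Feeding it the log line quoted above, slow_create_status yields the raw status and errno_name translates it, which makes it easier to tell "device gone" apart from a timeout or a disconnect when grepping the MDS logs.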
|
| Comments |
| Comment by Andreas Dilger [ 03/Jul/19 ] |
|
The handling of deactivated OSTs in 2.5.x was definitely not ideal. There were a number of fixes to this code over the years, in particular the addition of osp.*.max_create_count=0, which disables only object precreation on that OST. Unlike setting active=0, it does not also disable the unlinking of objects as the files are migrated. There were also a number of other fixes to better handle object cleanup after disconnect and reconnect.

It may be that the issue on your system was that the MDS thread was already in the loop trying to create objects on the failed OST, and it continued trying to create rather than checking/noticing that the OST was no longer available. The main patches to fix this in later releases were developed under: with a couple of follow-on fixes: These issues were (AFAIK) fully resolved in 2.10.7 and later releases. |
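
As a rough illustration of the max_create_count approach mentioned above, the sketch below builds the MDS-side commands for one OST. It is not a tested procedure: the helper names (run, stop_precreate, show_state) are hypothetical, the default of MDT0000 assumes a single-MDT filesystem like the one in this ticket, and on an old release such as 2.5.5 the parameter may be missing or behave differently. By default it only prints the commands; setting DRY_RUN=0 would execute them.

```shell
#!/bin/sh
# Hypothetical wrapper: print a command (default) or run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# stop_precreate TARGET [MDT]
# TARGET is the OST target name, e.g. data-OST0036; MDT defaults to MDT0000.
# Stops new object precreation on the OST without deactivating it,
# so objects can still be unlinked while files are migrated away.
stop_precreate() {
    run lctl set_param "osp.$1-osc-${2:-MDT0000}.max_create_count=0"
}

# show_state TARGET [MDT]
# Reads back the two settings discussed in this ticket for the same device.
show_state() {
    run lctl get_param \
        "osp.$1-osc-${2:-MDT0000}.active" \
        "osp.$1-osc-${2:-MDT0000}.max_create_count"
}

# Example (dry run): stop_precreate data-OST0036
```

Keeping the dry-run default means the exact parameter path can be reviewed on the MDS before anything is changed.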
| Comment by Chris Hanna [ 03/Jul/19 ] |
|
Thanks Andreas! Right now, we are tied to this old version. I'll keep this in mind when doing these replacements in the future. I'm going to resolve this issue since we've been able to fix it with the new OST. Chris
|