[LU-11115] OST selection algorithm broken with max_create_count=0 or empty OSTs Created: 03/Jul/18 Updated: 15/Apr/19 Resolved: 30/Jul/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Nathan Dauchy (Inactive) | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | Server running: |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have blocked new object creation to some of our OSTs with commands like:

lctl set_param osp.$OSTNAME.max_create_count=0

This is to drain data off of storage to be repurposed as spares. Three targets are already at 0% and confirmed to have no remaining objects with e2scan and lester. Eleven other targets are blocked and data is being migrated off.

We noticed that a few of the other targets were filling up while others had plenty of space, and watching it over a few days the imbalance is getting worse. We confirmed that we are using default allocation settings:

nbp7-mds1 ~ # lctl get_param lov.*.qos_*

Tests creating 100k new files of stripe count 1 showed that the more full OSTs are indeed getting allocated objects more often. This looks like it might be similar to
|
| Comments |
| Comment by Nathan Dauchy (Inactive) [ 03/Jul/18 ] |
|
Implemented an emergency change to force round-robin allocation and block new writes to the most full OST...

nbp7-mds1 ~ # lctl set_param lov.nbp7-MDT0000-mdtlov.qos_threshold_rr=100 |
| Comment by Nathan Dauchy (Inactive) [ 03/Jul/18 ] |
|
Current list of OSTs for which writes are blocked, and current capacity...
nbp7-mds1 ~ # lctl get_param osp.*.max_create_count | grep -v "=20000"
osp.nbp7-OST0026-osc-MDT0000.max_create_count=0
osp.nbp7-OST002a-osc-MDT0000.max_create_count=0
osp.nbp7-OST002e-osc-MDT0000.max_create_count=0
osp.nbp7-OST0032-osc-MDT0000.max_create_count=0
osp.nbp7-OST0036-osc-MDT0000.max_create_count=0
osp.nbp7-OST003a-osc-MDT0000.max_create_count=0
osp.nbp7-OST003e-osc-MDT0000.max_create_count=0
osp.nbp7-OST0042-osc-MDT0000.max_create_count=0
osp.nbp7-OST0046-osc-MDT0000.max_create_count=0
osp.nbp7-OST0048-osc-MDT0000.max_create_count=0
osp.nbp7-OST004a-osc-MDT0000.max_create_count=0
osp.nbp7-OST004e-osc-MDT0000.max_create_count=0
osp.nbp7-OST0052-osc-MDT0000.max_create_count=0
# lfs df -h /nobackupp7
UUID                   bytes    Used Available Use% Mounted on
nbp7-MDT0000_UUID     767.9G   35.6G    730.8G   5% /nobackupp7[MDT:0]
nbp7-OST0000_UUID      21.8T   11.3T      9.4T  54% /nobackupp7[OST:0]
nbp7-OST0001_UUID      21.8T   11.9T      8.8T  57% /nobackupp7[OST:1]
nbp7-OST0002_UUID      21.8T   11.6T      9.1T  56% /nobackupp7[OST:2]
nbp7-OST0003_UUID      21.8T   11.2T      9.5T  54% /nobackupp7[OST:3]
nbp7-OST0004_UUID      21.8T   11.7T      9.0T  56% /nobackupp7[OST:4]
nbp7-OST0005_UUID      21.8T   11.6T      9.1T  56% /nobackupp7[OST:5]
nbp7-OST0006_UUID      21.8T   11.7T      9.0T  57% /nobackupp7[OST:6]
nbp7-OST0007_UUID      21.8T   11.6T      9.1T  56% /nobackupp7[OST:7]
nbp7-OST0008_UUID      21.8T   11.9T      8.8T  58% /nobackupp7[OST:8]
nbp7-OST0009_UUID      21.8T   11.9T      8.8T  57% /nobackupp7[OST:9]
nbp7-OST000a_UUID      21.8T   11.5T      9.2T  56% /nobackupp7[OST:10]
nbp7-OST000b_UUID      21.8T   11.2T      9.5T  54% /nobackupp7[OST:11]
nbp7-OST000c_UUID      21.8T   12.3T      8.4T  59% /nobackupp7[OST:12]
nbp7-OST000d_UUID      21.8T   11.8T      8.9T  57% /nobackupp7[OST:13]
nbp7-OST000e_UUID      21.8T   11.6T      9.1T  56% /nobackupp7[OST:14]
nbp7-OST000f_UUID      21.8T   12.1T      8.7T  58% /nobackupp7[OST:15]
nbp7-OST0010_UUID      21.8T   12.2T      8.6T  59% /nobackupp7[OST:16]
nbp7-OST0011_UUID      21.8T   11.5T      9.2T  55% /nobackupp7[OST:17]
nbp7-OST0012_UUID      21.8T   12.1T      8.6T  58% /nobackupp7[OST:18]
nbp7-OST0013_UUID      21.8T   11.9T      8.8T  57% /nobackupp7[OST:19]
nbp7-OST0014_UUID      21.8T   11.6T      9.1T  56% /nobackupp7[OST:20]
nbp7-OST0015_UUID      21.8T   11.5T      9.2T  55% /nobackupp7[OST:21]
nbp7-OST0016_UUID      21.8T   11.5T      9.2T  55% /nobackupp7[OST:22]
nbp7-OST0017_UUID      21.8T   11.3T      9.4T  54% /nobackupp7[OST:23]
nbp7-OST0018_UUID      21.8T   12.0T      8.7T  58% /nobackupp7[OST:24]
nbp7-OST0019_UUID      21.8T   11.9T      8.8T  57% /nobackupp7[OST:25]
nbp7-OST001a_UUID      21.8T   12.1T      8.6T  58% /nobackupp7[OST:26]
nbp7-OST001b_UUID      21.8T   12.2T      8.5T  59% /nobackupp7[OST:27]
nbp7-OST001c_UUID      21.8T   11.2T      9.5T  54% /nobackupp7[OST:28]
nbp7-OST001d_UUID      21.8T   11.2T      9.5T  54% /nobackupp7[OST:29]
nbp7-OST001e_UUID      21.8T   11.9T      8.8T  58% /nobackupp7[OST:30]
nbp7-OST001f_UUID      21.8T   11.7T      9.0T  57% /nobackupp7[OST:31]
nbp7-OST0020_UUID      21.8T   11.2T      9.5T  54% /nobackupp7[OST:32]
nbp7-OST0021_UUID      21.8T   11.6T      9.1T  56% /nobackupp7[OST:33]
nbp7-OST0022_UUID      21.8T   11.4T      9.3T  55% /nobackupp7[OST:34]
nbp7-OST0023_UUID      21.8T   11.9T      8.9T  57% /nobackupp7[OST:35]
nbp7-OST0024_UUID      21.8T   12.8T      7.9T  62% /nobackupp7[OST:36]
nbp7-OST0025_UUID      21.8T   17.4T      3.3T  84% /nobackupp7[OST:37]
nbp7-OST0026_UUID      21.8T    1.7T     19.0T   8% /nobackupp7[OST:38]
nbp7-OST0028_UUID      21.8T   16.3T      4.4T  79% /nobackupp7[OST:40]
nbp7-OST0029_UUID      21.8T   14.7T      6.0T  71% /nobackupp7[OST:41]
nbp7-OST002a_UUID      21.8T    1.6T     19.1T   8% /nobackupp7[OST:42]
nbp7-OST002c_UUID      21.8T   12.5T      8.3T  60% /nobackupp7[OST:44]
nbp7-OST002d_UUID      21.8T   11.4T      9.3T  55% /nobackupp7[OST:45]
nbp7-OST002e_UUID      21.8T    1.6T     19.1T   8% /nobackupp7[OST:46]
nbp7-OST0030_UUID      21.8T   13.3T      7.4T  64% /nobackupp7[OST:48]
nbp7-OST0031_UUID      21.8T   11.4T      9.3T  55% /nobackupp7[OST:49]
nbp7-OST0032_UUID      21.8T    1.5T     19.2T   7% /nobackupp7[OST:50]
nbp7-OST0034_UUID      21.8T   15.6T      5.1T  75% /nobackupp7[OST:52]
nbp7-OST0035_UUID      21.8T   12.7T      8.0T  62% /nobackupp7[OST:53]
nbp7-OST0036_UUID      21.8T    1.7T     19.0T   8% /nobackupp7[OST:54]
nbp7-OST0038_UUID      21.8T   12.3T      8.4T  59% /nobackupp7[OST:56]
nbp7-OST0039_UUID      21.8T   12.2T      8.5T  59% /nobackupp7[OST:57]
nbp7-OST003a_UUID      21.8T    1.4T     19.2T   7% /nobackupp7[OST:58]
nbp7-OST003c_UUID      21.8T   15.7T      5.0T  76% /nobackupp7[OST:60]
nbp7-OST003d_UUID      21.8T   15.4T      5.3T  75% /nobackupp7[OST:61]
nbp7-OST003e_UUID      21.8T    1.6T     19.1T   8% /nobackupp7[OST:62]
nbp7-OST0040_UUID      21.8T   16.1T      4.7T  77% /nobackupp7[OST:64]
nbp7-OST0041_UUID      21.8T   16.1T      4.7T  77% /nobackupp7[OST:65]
nbp7-OST0042_UUID      21.8T    2.0T     18.7T  10% /nobackupp7[OST:66]
nbp7-OST0044_UUID      21.8T   11.4T      9.3T  55% /nobackupp7[OST:68]
nbp7-OST0045_UUID      21.8T   11.6T      9.1T  56% /nobackupp7[OST:69]
nbp7-OST0046_UUID      21.8T    1.5T     19.2T   7% /nobackupp7[OST:70]
nbp7-OST0048_UUID      21.8T   19.3T      1.4T  93% /nobackupp7[OST:72]
nbp7-OST0049_UUID      21.8T   11.9T      8.8T  57% /nobackupp7[OST:73]
nbp7-OST004a_UUID      21.8T  236.0M     20.7T   0% /nobackupp7[OST:74]
nbp7-OST004c_UUID      21.8T   12.6T      8.1T  61% /nobackupp7[OST:76]
nbp7-OST004d_UUID      21.8T   16.1T      4.7T  77% /nobackupp7[OST:77]
nbp7-OST004e_UUID      21.8T  216.9M     20.7T   0% /nobackupp7[OST:78]
nbp7-OST0050_UUID      21.8T   12.8T      8.0T  62% /nobackupp7[OST:80]
nbp7-OST0051_UUID      21.8T   16.0T      4.8T  77% /nobackupp7[OST:81]
nbp7-OST0052_UUID      21.8T  233.5M     20.7T   0% /nobackupp7[OST:82]
filesystem_summary:     1.5P  772.9T    718.2T  52% /nobackupp7 |
| Comment by Nathan Dauchy (Inactive) [ 03/Jul/18 ] |
|
Test case...

$ lfs setstripe --stripe_count 1 ostplace/
$ lfs getstripe ostplace
ostplace
stripe_count:  1 stripe_size:   1048576 stripe_offset: -1
$ cd ostplace/
$ for i in $(seq 1 100); do mkdir $i; done
$ for j in $(seq 1 1000); do echo $j; for i in $(seq 1 100); do touch $i/$j; done; done
$ ls */* | xargs -P 32 lfs getstripe | grep " 0x" | awk '{print $1}' | sort -n | uniq -c

Example of imbalanced placement, more objects on the most full OST...

$ ls */* | xargs -P 32 lfs getstripe | grep " 0x" | awk '{print $1}' | sort -n | uniq -c | sort -nr | head
   5695 72
   3616 60
   3577 61
   3555 77
   3523 41
   3459 64
   3402 52
   3377 37
   3333 81
   3306 65 |
| Comment by Peter Jones [ 04/Jul/18 ] |
|
Jian, could you please investigate? Thanks, Peter |
| Comment by Andreas Dilger [ 04/Jul/18 ] |
|
I suspect this problem has been around for a long time. It relates to the OST after the deactivated OST getting double the number of objects allocated, or triple the number of allocations if two OSTs are inactive. I think we need to determine whether the problem is with the starting OST or with the intermediate OST selection. In either case, I think that when the selected OST is the inactive one, it is skipped and the next active OST is used; then the starting OST index is incremented by one and the same OST is selected again. This is probably in lod_qos_alloc() or similar. |
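To make the suspected pattern concrete, here is a minimal sketch of a round-robin style selection loop showing that behavior. It is illustrative only, not the actual lod code; the function name, the active[] array, and the index handling are hypothetical, chosen purely to show how a skipped inactive OST funnels its allocations onto its next active neighbor.

/* Illustrative sketch, not lod_qos.c: pick_ost_rr(), active[], and the
 * index handling are hypothetical. */
static int pick_ost_rr(const int active[], int nr_ost, int *start)
{
        int idx = *start;

        /* Skip inactive OSTs until an active one is found. */
        while (!active[idx % nr_ost])
                idx++;

        /*
         * Suspected bug pattern: the starting index is advanced by one
         * from the ORIGINAL position instead of from the OST actually
         * used.  If OST k is inactive, both start == k and start == k+1
         * resolve to OST k+1, so it receives twice its share (three
         * times if k-1 and k are both inactive).
         */
        *start = (*start + 1) % nr_ost;

        return idx % nr_ost;
}

Under this sketch, advancing the start index from the OST actually used (idx + 1) rather than from the original start would avoid the doubling.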
| Comment by Jian Yu [ 05/Jul/18 ] |
|
Yes, Andreas. The exact function is lod_alloc_qos(), and in this function the weighted random allocation algorithm is used. As per the comments, it finds the available OSTs and calculates their weights (based on free space) first, then selects OSTs using those weights as the probability: an OST with a higher weight is proportionately more likely to be selected than one with a lower weight. We can use the QOS_DEBUG() code in the function to debug the allocation algorithm. Hi Nathan, |
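For reference, a minimal sketch of that kind of weighted random selection (illustrative only; the real lod_alloc_qos() derives per-OST weights from statfs data and penalties, and the names below are hypothetical):

#include <stdlib.h>

/*
 * Pick an index with probability proportional to its weight (think
 * "free space").  Sketch of the general technique, not the
 * lod_alloc_qos() implementation.
 */
static int pick_weighted(const unsigned long long weight[], int n)
{
        unsigned long long total = 0, point, acc = 0;
        int i;

        for (i = 0; i < n; i++)
                total += weight[i];
        if (total == 0)
                return -1;              /* no usable candidate */

        /* random point in [0, total) */
        point = (((unsigned long long)rand() << 32) | (unsigned int)rand()) % total;

        for (i = 0; i < n; i++) {
                acc += weight[i];
                if (point < acc)
                        return i;       /* larger weight => larger interval */
        }
        return n - 1;
}

Under such a scheme, a drained OST that still reports a large amount of free space gets a large weight unless it is explicitly excluded from the candidate list, which is the gap described in the analysis below.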
| Comment by Nathan Dauchy (Inactive) [ 05/Jul/18 ] |
|
Jian, I'm scheduled for vacation through Monday and only checking in occasionally, so I won't be able to try the reproducer on our system. Besides that, I would rather not remove the workaround we have in place forcing round-robin. In fact, if you have suggestions for a better one, please let us know! So, if you can take the test case provided and reproduce on one of your test systems or in a VM setup for debugging, that would be preferred. Thanks! |
| Comment by Jian Yu [ 05/Jul/18 ] |
|
Sure, Nathan. Let me reproduce and investigate further. Have a nice vacation! |
| Comment by Jian Yu [ 12/Jul/18 ] |
|
The issue can be reproduced. The Lustre debug log on the MDS shows that while choosing an OST to create an object on, lod_qos_prep_create() first tries the QoS algorithm by calling lod_alloc_qos(). If free space is distributed evenly among OSTs, lod_alloc_qos() returns -EAGAIN and lod_qos_prep_create() falls back to lod_alloc_rr() to use the round-robin algorithm. In both lod_alloc_qos() and lod_alloc_rr(), lod_statfs_and_check() is used to check whether an OST target is available for new OST objects. However, an OST target with max_create_count=0 is not caught by that check and is returned as an available OST. This issue affects lod_alloc_qos() but not lod_alloc_rr(), because the following extra code in lod_check_and_reserve_ost() checks for and skips OST targets with max_create_count=0:
lod_check_and_reserve_ost()
/*
* We expect number of precreated objects in f_ffree at
* the first iteration, skip OSPs with no objects ready
*/
if (sfs->os_fprecreated == 0 && speed == 0) {
QOS_DEBUG("#%d: precreation is empty\n", ost_idx);
goto out_return;
}
I'm creating a patch to fix lod_alloc_qos(). |
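Presumably the fix adds an equivalent guard on the QoS path. A minimal sketch of that idea follows; it is purely illustrative and not the actual patch (see the Gerrit link below). It reuses the sfs/ost_idx names from the snippet above, but the placement and surrounding logic are assumptions.

/*
 * Illustrative sketch of the idea, not the actual change from
 * https://review.whamcloud.com/32823: on the QoS path, skip any OST
 * that has no precreated objects available (e.g. because the admin set
 * max_create_count=0 to drain it), mirroring the os_fprecreated check
 * already done on the round-robin path.
 */
if (sfs->os_fprecreated == 0) {
        QOS_DEBUG("#%d: precreation is empty, skipping\n", ost_idx);
        continue;       /* try the next candidate OST */
}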
| Comment by Gerrit Updater [ 17/Jul/18 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32823 |
| Comment by Jay Lan (Inactive) [ 23/Jul/18 ] |
|
Hi Jian Yu, could you backport your patch to b2_10? There were conflicts. Thanks!

Changes to be committed:
        modified:   lustre/lod/lod_qos.c
Unmerged paths:
        deleted by us:   lustre/include/uapi/linux/lustre/lustre_user.h |
| Comment by Jian Yu [ 23/Jul/18 ] |
|
Sure, Jay. |
| Comment by Gerrit Updater [ 23/Jul/18 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32859 |
| Comment by Gerrit Updater [ 30/Jul/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32823/ |
| Comment by Peter Jones [ 30/Jul/18 ] |
|
Landed for 2.12 |
| Comment by Jay Lan (Inactive) [ 01/Aug/18 ] |
|
Peter, we need a nasa label on this ticket. Thanks. |
| Comment by Gerrit Updater [ 02/Aug/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32859/ |