[LU-11115] OST selection algorithm broken with max_create_count=0 or empty OSTs Created: 03/Jul/18  Updated: 15/Apr/19  Resolved: 30/Jul/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.10.0
Fix Version/s: Lustre 2.12.0, Lustre 2.10.5

Type: Bug Priority: Critical
Reporter: Nathan Dauchy (Inactive) Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Server running:
CentOS-6 centos-release-6-9.el6.12.3.x86_64
lustre-2.7.3-1nasS_mofed33v3g_2.6.32_642.15.1.el6.20170609.x86_64.lustre273.x86_64


Issue Links:
Related
is related to LU-4825 lfs migrate not freeing space on OST Resolved
is related to LU-10823 max_create_count triggering uneven di... Resolved
is related to LU-11605 create_count stuck in 0 after changei... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have blocked new object creation to some of our OSTs with commands like:

lctl set_param osp.$OSTNAME.max_create_count=0

This is to drain data off of storage that will be repurposed as spares. Three targets are already at 0% usage and are confirmed to have no remaining objects with e2scan and lester. 11 other targets are blocked and data is being migrated off.

We noticed that a few of the other targets were filling up while others had plenty of space. Watching over a few days, the imbalance is getting worse.

Confirmed that we are using default allocation settings:

nbp7-mds1 ~ # lctl get_param lov.*.qos_* 
lov.nbp7-MDT0000-mdtlov.qos_maxage=5 Sec
lov.nbp7-MDT0000-mdtlov.qos_prio_free=91%
lov.nbp7-MDT0000-mdtlov.qos_threshold_rr=17%

Tests creating 100k new files with stripe count 1 showed that the fuller OSTs are indeed being allocated objects more often.

This looks like it might be similar to LU-10823.

 



 Comments   
Comment by Nathan Dauchy (Inactive) [ 03/Jul/18 ]

Implemented an emergency change to force round-robin allocation and block new writes to the fullest OST...

nbp7-mds1 ~ # lctl set_param lov.nbp7-MDT0000-mdtlov.qos_threshold_rr=100
lov.nbp7-MDT0000-mdtlov.qos_threshold_rr=100
nbp7-mds1 ~ # lctl set_param osp.nbp7-OST0048-osc-MDT0000.max_create_count=0
osp.nbp7-OST0048-osc-MDT0000.max_create_count=0

Comment by Nathan Dauchy (Inactive) [ 03/Jul/18 ]

Current list of OSTs for which writes are blocked, and current capacity...

nbp7-mds1 ~ # lctl get_param osp.*.max_create_count | grep -v "=20000"
osp.nbp7-OST0026-osc-MDT0000.max_create_count=0
osp.nbp7-OST002a-osc-MDT0000.max_create_count=0
osp.nbp7-OST002e-osc-MDT0000.max_create_count=0
osp.nbp7-OST0032-osc-MDT0000.max_create_count=0
osp.nbp7-OST0036-osc-MDT0000.max_create_count=0
osp.nbp7-OST003a-osc-MDT0000.max_create_count=0
osp.nbp7-OST003e-osc-MDT0000.max_create_count=0
osp.nbp7-OST0042-osc-MDT0000.max_create_count=0
osp.nbp7-OST0046-osc-MDT0000.max_create_count=0
osp.nbp7-OST0048-osc-MDT0000.max_create_count=0
osp.nbp7-OST004a-osc-MDT0000.max_create_count=0
osp.nbp7-OST004e-osc-MDT0000.max_create_count=0
osp.nbp7-OST0052-osc-MDT0000.max_create_count=0
# lfs df -h /nobackupp7 
UUID bytes Used Available Use% Mounted on
nbp7-MDT0000_UUID 767.9G 35.6G 730.8G 5% /nobackupp7[MDT:0]
nbp7-OST0000_UUID 21.8T 11.3T 9.4T 54% /nobackupp7[OST:0]
nbp7-OST0001_UUID 21.8T 11.9T 8.8T 57% /nobackupp7[OST:1]
nbp7-OST0002_UUID 21.8T 11.6T 9.1T 56% /nobackupp7[OST:2]
nbp7-OST0003_UUID 21.8T 11.2T 9.5T 54% /nobackupp7[OST:3]
nbp7-OST0004_UUID 21.8T 11.7T 9.0T 56% /nobackupp7[OST:4]
nbp7-OST0005_UUID 21.8T 11.6T 9.1T 56% /nobackupp7[OST:5]
nbp7-OST0006_UUID 21.8T 11.7T 9.0T 57% /nobackupp7[OST:6]
nbp7-OST0007_UUID 21.8T 11.6T 9.1T 56% /nobackupp7[OST:7]
nbp7-OST0008_UUID 21.8T 11.9T 8.8T 58% /nobackupp7[OST:8]
nbp7-OST0009_UUID 21.8T 11.9T 8.8T 57% /nobackupp7[OST:9]
nbp7-OST000a_UUID 21.8T 11.5T 9.2T 56% /nobackupp7[OST:10]
nbp7-OST000b_UUID 21.8T 11.2T 9.5T 54% /nobackupp7[OST:11]
nbp7-OST000c_UUID 21.8T 12.3T 8.4T 59% /nobackupp7[OST:12]
nbp7-OST000d_UUID 21.8T 11.8T 8.9T 57% /nobackupp7[OST:13]
nbp7-OST000e_UUID 21.8T 11.6T 9.1T 56% /nobackupp7[OST:14]
nbp7-OST000f_UUID 21.8T 12.1T 8.7T 58% /nobackupp7[OST:15]
nbp7-OST0010_UUID 21.8T 12.2T 8.6T 59% /nobackupp7[OST:16]
nbp7-OST0011_UUID 21.8T 11.5T 9.2T 55% /nobackupp7[OST:17]
nbp7-OST0012_UUID 21.8T 12.1T 8.6T 58% /nobackupp7[OST:18]
nbp7-OST0013_UUID 21.8T 11.9T 8.8T 57% /nobackupp7[OST:19]
nbp7-OST0014_UUID 21.8T 11.6T 9.1T 56% /nobackupp7[OST:20]
nbp7-OST0015_UUID 21.8T 11.5T 9.2T 55% /nobackupp7[OST:21]
nbp7-OST0016_UUID 21.8T 11.5T 9.2T 55% /nobackupp7[OST:22]
nbp7-OST0017_UUID 21.8T 11.3T 9.4T 54% /nobackupp7[OST:23]
nbp7-OST0018_UUID 21.8T 12.0T 8.7T 58% /nobackupp7[OST:24]
nbp7-OST0019_UUID 21.8T 11.9T 8.8T 57% /nobackupp7[OST:25]
nbp7-OST001a_UUID 21.8T 12.1T 8.6T 58% /nobackupp7[OST:26]
nbp7-OST001b_UUID 21.8T 12.2T 8.5T 59% /nobackupp7[OST:27]
nbp7-OST001c_UUID 21.8T 11.2T 9.5T 54% /nobackupp7[OST:28]
nbp7-OST001d_UUID 21.8T 11.2T 9.5T 54% /nobackupp7[OST:29]
nbp7-OST001e_UUID 21.8T 11.9T 8.8T 58% /nobackupp7[OST:30]
nbp7-OST001f_UUID 21.8T 11.7T 9.0T 57% /nobackupp7[OST:31]
nbp7-OST0020_UUID 21.8T 11.2T 9.5T 54% /nobackupp7[OST:32]
nbp7-OST0021_UUID 21.8T 11.6T 9.1T 56% /nobackupp7[OST:33]
nbp7-OST0022_UUID 21.8T 11.4T 9.3T 55% /nobackupp7[OST:34]
nbp7-OST0023_UUID 21.8T 11.9T 8.9T 57% /nobackupp7[OST:35]
nbp7-OST0024_UUID 21.8T 12.8T 7.9T 62% /nobackupp7[OST:36]
nbp7-OST0025_UUID 21.8T 17.4T 3.3T 84% /nobackupp7[OST:37]
nbp7-OST0026_UUID 21.8T 1.7T 19.0T 8% /nobackupp7[OST:38]
nbp7-OST0028_UUID 21.8T 16.3T 4.4T 79% /nobackupp7[OST:40]
nbp7-OST0029_UUID 21.8T 14.7T 6.0T 71% /nobackupp7[OST:41]
nbp7-OST002a_UUID 21.8T 1.6T 19.1T 8% /nobackupp7[OST:42]
nbp7-OST002c_UUID 21.8T 12.5T 8.3T 60% /nobackupp7[OST:44]
nbp7-OST002d_UUID 21.8T 11.4T 9.3T 55% /nobackupp7[OST:45]
nbp7-OST002e_UUID 21.8T 1.6T 19.1T 8% /nobackupp7[OST:46]
nbp7-OST0030_UUID 21.8T 13.3T 7.4T 64% /nobackupp7[OST:48]
nbp7-OST0031_UUID 21.8T 11.4T 9.3T 55% /nobackupp7[OST:49]
nbp7-OST0032_UUID 21.8T 1.5T 19.2T 7% /nobackupp7[OST:50]
nbp7-OST0034_UUID 21.8T 15.6T 5.1T 75% /nobackupp7[OST:52]
nbp7-OST0035_UUID 21.8T 12.7T 8.0T 62% /nobackupp7[OST:53]
nbp7-OST0036_UUID 21.8T 1.7T 19.0T 8% /nobackupp7[OST:54]
nbp7-OST0038_UUID 21.8T 12.3T 8.4T 59% /nobackupp7[OST:56]
nbp7-OST0039_UUID 21.8T 12.2T 8.5T 59% /nobackupp7[OST:57]
nbp7-OST003a_UUID 21.8T 1.4T 19.2T 7% /nobackupp7[OST:58]
nbp7-OST003c_UUID 21.8T 15.7T 5.0T 76% /nobackupp7[OST:60]
nbp7-OST003d_UUID 21.8T 15.4T 5.3T 75% /nobackupp7[OST:61]
nbp7-OST003e_UUID 21.8T 1.6T 19.1T 8% /nobackupp7[OST:62]
nbp7-OST0040_UUID 21.8T 16.1T 4.7T 77% /nobackupp7[OST:64]
nbp7-OST0041_UUID 21.8T 16.1T 4.7T 77% /nobackupp7[OST:65]
nbp7-OST0042_UUID 21.8T 2.0T 18.7T 10% /nobackupp7[OST:66]
nbp7-OST0044_UUID 21.8T 11.4T 9.3T 55% /nobackupp7[OST:68]
nbp7-OST0045_UUID 21.8T 11.6T 9.1T 56% /nobackupp7[OST:69]
nbp7-OST0046_UUID 21.8T 1.5T 19.2T 7% /nobackupp7[OST:70]
nbp7-OST0048_UUID 21.8T 19.3T 1.4T 93% /nobackupp7[OST:72]
nbp7-OST0049_UUID 21.8T 11.9T 8.8T 57% /nobackupp7[OST:73]
nbp7-OST004a_UUID 21.8T 236.0M 20.7T 0% /nobackupp7[OST:74]
nbp7-OST004c_UUID 21.8T 12.6T 8.1T 61% /nobackupp7[OST:76]
nbp7-OST004d_UUID 21.8T 16.1T 4.7T 77% /nobackupp7[OST:77]
nbp7-OST004e_UUID 21.8T 216.9M 20.7T 0% /nobackupp7[OST:78]
nbp7-OST0050_UUID 21.8T 12.8T 8.0T 62% /nobackupp7[OST:80]
nbp7-OST0051_UUID 21.8T 16.0T 4.8T 77% /nobackupp7[OST:81]
nbp7-OST0052_UUID 21.8T 233.5M 20.7T 0% /nobackupp7[OST:82]

filesystem_summary: 1.5P 772.9T 718.2T 52% /nobackupp7
Comment by Nathan Dauchy (Inactive) [ 03/Jul/18 ]

Test case...

$ lfs setstripe --stripe_count 1 ostplace/
$ lfs getstripe ostplace
ostplace
stripe_count:  1 stripe_size:   1048576 stripe_offset: -1
$ cd ostplace/
$ for i in $(seq 1 100); do mkdir $i; done
$ for j in $(seq 1 1000); do echo $j; for i in $(seq 1 100); do touch $i/$j; done; done
$ ls */* | xargs -P 32 lfs getstripe | grep " 0x" | awk '{print $1}' | sort -n | uniq -c

Example of imbalanced placement, with more objects on the fullest OST...

$ ls */* | xargs -P 32 lfs getstripe | grep " 0x" | awk '{print $1}' | sort -n | uniq -c | sort -nr | head   
   5695 72
   3616 60
   3577 61
   3555 77
   3523 41
   3459 64
   3402 52
   3377 37
   3333 81
   3306 65
Comment by Peter Jones [ 04/Jul/18 ]

Jian

Could you please investigate?

Thanks

Peter

Comment by Andreas Dilger [ 04/Jul/18 ]

I suspect this problem has been around for a long time. It relates to the OST after a deactivated OST getting double the number of objects allocated, or triple the number of allocations if two OSTs are inactive.

I think we need to determine whether the problem is with the starting OST selection or with the intermediate OST selection. In either case, I suspect that when the selected OST is the inactive one, it is skipped and the next active OST is used, then the starting OST index is incremented by one and the same OST is selected again.

This is probably in lod_qos_alloc() or similar.
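
To make the suspected pattern concrete, here is a hypothetical sketch (made-up names, not the actual lod_qos.c code): if the round-robin cursor lands on an inactive OST, the allocator skips ahead to the next active OST but only advances the cursor by one, so that same active OST is returned again on the following pass.

#include <stdbool.h>

/*
 * Hypothetical sketch of the suspected behaviour only, not the real
 * allocator.  Assumes at least one OST in ost_active[] is active.
 */
static int next_ost(int *cursor, const bool *ost_active, int ost_count)
{
        int idx = *cursor % ost_count;

        while (!ost_active[idx])                /* skip deactivated OSTs */
                idx = (idx + 1) % ost_count;

        /*
         * The cursor advances by only one, so if it pointed at an
         * inactive OST, the next call starts on the active OST that was
         * just returned, and that OST receives a double share of objects.
         */
        *cursor = (*cursor + 1) % ost_count;
        return idx;
}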

Comment by Jian Yu [ 05/Jul/18 ]

Yes, Andreas. The exact function is lod_alloc_qos(), which uses the weighted random allocation algorithm. As per the comments, it first finds the available OSTs and calculates their weights (free space), then selects OSTs using the weights as the probability. An OST with a higher weight is proportionately more likely to be selected than one with a lower weight.
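
As an illustration, a minimal standalone sketch of the weight-proportional selection idea (not the actual lod_alloc_qos() code; pick_ost_by_weight, ost_weight and total_weight are made-up names):

#include <stdint.h>

/*
 * Illustrative only.  Each available OST gets a weight derived from its
 * free space; a random cursor in [0, total_weight) lands on an OST with
 * probability proportional to its weight, so emptier OSTs are picked
 * more often.  total_weight is assumed to be the sum of ost_weight[].
 */
static int pick_ost_by_weight(const uint64_t *ost_weight, int ost_count,
                              uint64_t total_weight, uint64_t rand_val)
{
        uint64_t cursor;
        int i;

        if (ost_count == 0 || total_weight == 0)
                return -1;                      /* nothing to choose from */

        cursor = rand_val % total_weight;
        for (i = 0; i < ost_count; i++) {
                if (cursor < ost_weight[i])
                        return i;               /* this OST wins the draw */
                cursor -= ost_weight[i];
        }
        return ost_count - 1;                   /* not reached if weights sum up */
}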

We can use the QOS_DEBUG() calls in the function to debug the allocation algorithm.

Hi Nathan,
Could you please enable QOS debug by running lctl set_param debug='+other', reproduce the issue and gather Lustre debug logs for investigation? Thank you.

Comment by Nathan Dauchy (Inactive) [ 05/Jul/18 ]

Jian,

I'm scheduled for vacation through Monday and only checking in occasionally, so I won't be able to try the reproducer on our system. Besides that, I would rather not remove the workaround we have in place forcing round-robin allocation. In fact, if you have suggestions for a better one, please let us know!

So, if you can take the test case provided and reproduce on one of your test systems or in a VM setup for debugging, that would be preferred. Thanks!

Comment by Jian Yu [ 05/Jul/18 ]

Sure, Nathan. Let me reproduce and investigate further. Have a nice vacation!

Comment by Jian Yu [ 12/Jul/18 ]

The issue can be reproduced.

The Lustre debug log on the MDS shows that while choosing an OST to create an object, lod_qos_prep_create() tries the QoS algorithm first by calling lod_alloc_qos(). If free space is distributed evenly among the OSTs, lod_alloc_qos() returns -EAGAIN, and lod_qos_prep_create() then calls lod_alloc_rr() to use the RR algorithm.
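
In outline (a simplified stand-in, not the actual lod_qos_prep_create() code; the *_sketch names are made up):

#include <errno.h>

/* Stand-ins for lod_alloc_qos() and lod_alloc_rr(); illustrative only. */
static int alloc_qos_sketch(void) { return -EAGAIN; /* free space is balanced */ }
static int alloc_rr_sketch(void)  { return 0;       /* round-robin succeeds */ }

static int prep_create_sketch(void)
{
        int rc;

        rc = alloc_qos_sketch();        /* try the QoS (weighted) allocator first */
        if (rc == -EAGAIN)
                rc = alloc_rr_sketch(); /* fall back to round-robin */

        return rc;
}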

In both lod_alloc_qos() and lod_alloc_rr(), lod_statfs_and_check() is used to check whether an OST target is available for new OST objects. However, an OST target with max_create_count=0 is not checked in that function and is simply returned as an available OST.

This issue affects lod_alloc_qos() but not lod_alloc_rr(), because the following extra code in lod_check_and_reserve_ost() checks for and skips an OST target with max_create_count=0:

lod_check_and_reserve_ost()
        /*
         * We expect number of precreated objects in f_ffree at
         * the first iteration, skip OSPs with no objects ready
         */
        if (sfs->os_fprecreated == 0 && speed == 0) {
                QOS_DEBUG("#%d: precreation is empty\n", ost_idx);
                goto out_return;
        }

I'm creating a patch to fix lod_alloc_qos().
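
For context, the gist of the idea is to apply the same "no precreated objects" check in the QoS path as well. A rough sketch only, not the actual patch; the struct below is a stand-in for the real statfs data:

/*
 * An OST whose max_create_count is 0 reports no precreated objects, so
 * the QoS path should skip it just as lod_check_and_reserve_ost() does
 * for the RR path.
 */
struct statfs_sketch {
        unsigned long long os_bavail;           /* free blocks */
        unsigned long long os_fprecreated;      /* precreated objects ready */
};

static int ost_usable_for_qos(const struct statfs_sketch *sfs)
{
        if (sfs->os_fprecreated == 0)
                return 0;       /* max_create_count=0: nothing can be created here */
        return 1;
}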

Comment by Gerrit Updater [ 17/Jul/18 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32823
Subject: LU-11115 lod: skip max_create_count=0 OST in QoS and RR algorithms
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9c390821eec3eea991b820544dce52fba3a73494

Comment by Jay Lan (Inactive) [ 23/Jul/18 ]

Hi Jian Yu,

Could you backport your patch to b2_10?

There were conflicts. Thanks!

 

Changes to be committed:

modified: lustre/lod/lod_qos.c
modified: lustre/osp/osp_precreate.c

Unmerged paths:
(use "git add/rm <file>..." as appropriate to mark resolution)

deleted by us: lustre/include/uapi/linux/lustre/lustre_user.h
both modified: lustre/lod/lod_object.c

Comment by Jian Yu [ 23/Jul/18 ]

Sure, Jay.

Comment by Gerrit Updater [ 23/Jul/18 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32859
Subject: LU-11115 lod: skip max_create_count=0 OST in QoS and RR algorithms
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 7889706668c671cdadb8febfe819f7e475bdf257

Comment by Gerrit Updater [ 30/Jul/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32823/
Subject: LU-11115 lod: skip max_create_count=0 OST in QoS and RR algorithms
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5b147e47de651f1c140f69314a2d6b56ff6b14d7

Comment by Peter Jones [ 30/Jul/18 ]

Landed for 2.12

Comment by Jay Lan (Inactive) [ 01/Aug/18 ]

Peter, we need a nasa label on this ticket. Thanks.

Comment by Gerrit Updater [ 02/Aug/18 ]

John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32859/
Subject: LU-11115 lod: skip max_create_count=0 OST in QoS and RR algorithms
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: d2f9ed4a20b5fae836560efc607e443fa996c2e2
