Loading...

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: Lustre 2.14.0
Affects Version/s: Lustre 2.12.8
Labels:
None
Environment:
CentOS 7.9 (3.10.0-1160.6.1.el7_lustre.pl1.x86_64)

Severity:
3
Rank (Obsolete):
9223372036854775807

We are seeing a weird problem on our Oak storage system where MDT0000 doesn’t seem to be able to allocate new objects on specific OSTs after a MDT0000 failover. Lustre doesn’t seem to complain about that in the logs (which is already a problem per se), but we can see that something is wrong as prealloc_status is set to -11 (EAGAIN) for those:

[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | grep -c "=-11$"
76

While 388 other OSTs are fine:

[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | grep -c "=0$"
388

Note: prealloc_status for MDT0001 and MDT0002 located on the same MDS are fine, which likely indicates that the OSTs are fine:

[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0001/prealloc_status | grep -c "=0$"
464
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0002/prealloc_status | grep -c "=0$"
464

MDT0 prealloc info for a problematic OST:

[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/prealloc*
osp.oak-OST0098-osc-MDT0000.prealloc_last_id=1
osp.oak-OST0098-osc-MDT0000.prealloc_last_seq=0x0
osp.oak-OST0098-osc-MDT0000.prealloc_next_id=2
osp.oak-OST0098-osc-MDT0000.prealloc_next_seq=0x0
osp.oak-OST0098-osc-MDT0000.prealloc_reserved=0
osp.oak-OST0098-osc-MDT0000.prealloc_status=-11

osp stats for a problematic OST:

[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/stats
osp.oak-OST0098-osc-MDT0000.stats=
snapshot_time             1676840498.011953710 secs.nsecs
req_waittime              22039 samples [usec] 31 4156496 65699472 70586321859268
req_active                22039 samples [reqs] 1 20 23968 43578
ost_setattr               47 samples [usec] 77 19916 25528 397438254
ost_destroy               480 samples [usec] 132 4079970 38902621 41255678589979
ost_connect               1 samples [usec] 99 99 99 9801
obd_ping                  21511 samples [usec] 31 4156496 26771224 29330245821234

vs. a working one for comparison purposes:

[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0099-osc-MDT0000/stats
osp.oak-OST0099-osc-MDT0000.stats=
snapshot_time             1676840500.538529214 secs.nsecs
req_waittime              133680 samples [usec] 30 5731842 227870174 257553761260390
req_active                133680 samples [reqs] 1 21 136003 155255
ost_setattr               66 samples [usec] 75 12523 61768 584301738
ost_create                429 samples [usec] 851 392236 42632140 5687640118670
ost_destroy               5919 samples [usec] 98 5731842 84323316 91990540215144
ost_connect               1 samples [usec] 110 110 110 12100
ost_statfs                105758 samples [usec] 30 5181321 64867010 151688813146608
obd_ping                  21507 samples [usec] 31 1552632 35985830 8186183466130

State for the same problematic OST:

[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/state
osp.oak-OST0098-osc-MDT0000.state=
current_state: FULL
state_history:
 - [ 1676310563, CONNECTING ]
 - [ 1676310563, FULL ]

So far, we have no user reports of any issue with the filesystem, probably thanks to Lustre which does a good job at NOT using these OSTs for new objects, likely due to a proper check on prealloc_status. We only noticed the problem thanks to a micro-benchmark monitoring script that periodically performs some I/O on every OSTs of the filesystem by creating files on MDT0000:

# Just to make sure here, we’re using a directory on MDT0000
[root@oak-rbh02 stripes]# lfs getdirstripe .
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none

# Allocate a file using plain layout on a single OST with prealloc_status=-11
[root@oak-rbh02 stripes]# time lfs setstripe -c 1 -o 152 test152
lfs setstripe: setstripe error for 'test152': Connection timed out

real    1m40.006s
user    0m0.000s
sys    0m0.005s

[root@oak-rbh02 stripes]# lfs getstripe test152
test152 has no stripe info

We can see that setstripe does time out after 100s on this, which is Lustre’s timeout value.

It looks like to me that MDT0000 is somehow not able to run the preallocation routines for these 76 OSTs and they are stuck with this status. But nothing in the logs at the start of the MDT seems to indicate a problem. Just in case, I am attaching the Lustre logs at the start of both MDT0000 and MDT0003 (due to manual failover to doing a service maintenance on a server) as oak-md1-s1-lustre-mdt-failover.log along with lustre log file lustre-log.1676310742.6098.gz which was dumped during the start. But it’s not clear from these logs there was any preallocation problem. Lustre should have a better way of logging that when this happens.

Some ideas…

I tried things like setting force_sync=1, or changing the value of max_create_count to try to force a refresh of the object preallocation, to no avail.

Note that we do have the patch (available in 2.12.9):
LU-15009 ofd: continue precreate if LAST_ID is less on MDT
and I am wondering if that could be related, as the patch transforms a CERROR into a LCONSOLE(D_INFO,…) in some cases, which we would have likely missed, as we don’t run with +info, it’s way too verbose. It’s important enough that it should probably be logged as an error instead. But really not sure this is related to our issue..

As for the server version we’re running here, sorry it’s not ideal but it’s somewhere between 2.12.8 and 2.12.9 + additional patches not in 2.12.x yet (but only with patches from Gerrit). To see that patches we have since 2.12.8, see git-patches.log.

Any ideas, suggestions of things to try, before trying a stop/start of the impacted OSTs maybe to force reconnect (or is there a way to force a reconnect for a specific OST just for MDT0?).

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

oak-md1-s1-lustre-mdt-failover.log
19/Feb/23 9:20 PM
12 kB
Stephane Thiell
MDT0_prealloc_status_full.log
19/Feb/23 9:39 PM
21 kB
Stephane Thiell
lustre-log.1676310742.6098.gz
19/Feb/23 9:20 PM
2.61 MB
Stephane Thiell
LU-16578-oak-md1-s2_foreach_bt_2.gz
08/Mar/23 5:31 AM
195 kB
Stephane Thiell
LU-16578-oak-md1-s2_foreach_bt_1.gz
08/Mar/23 5:31 AM
191 kB
Stephane Thiell
ldebug.log.gz
08/Mar/23 5:19 AM
67.06 MB
Stephane Thiell
git-patches.log
19/Feb/23 9:21 PM
2 kB
Stephane Thiell

is duplicated by

LU-12397 osp: race around opd_new_connection

Resolved

is related to

LU-16623 lod_statfs_and_check() does not skip unusable OSTs

Resolved

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates