Details
- Type: Bug
- Resolution: Duplicate
- Priority: Major
- Affects Version: Lustre 2.12.8
- Fix Version: None
- Environment: CentOS 7.9 (3.10.0-1160.6.1.el7_lustre.pl1.x86_64)
- Severity: 3
Description
We are seeing a weird problem on our Oak storage system where MDT0000 doesn't seem to be able to allocate new objects on specific OSTs after an MDT0000 failover. Lustre doesn't complain about this in the logs (which is already a problem per se), but we can see that something is wrong, as prealloc_status is set to -11 (EAGAIN) for those OSTs:
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | grep -c "=-11$" 76
While 388 other OSTs are fine:
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | grep -c "=0$" 388
Note: prealloc_status for MDT0001 and MDT0002, which are located on the same MDS, is fine, which likely indicates that the OSTs themselves are fine:
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0001/prealloc_status | grep -c "=0$"
464
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0002/prealloc_status | grep -c "=0$"
464
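As a side note, a one-liner along these lines (same lctl parameter as above, just filtering with awk instead of counting; shown here only as an example, not output from our system) can list exactly which OSP devices are stuck at -11:

# List the OSP devices on MDT0000 whose prealloc_status is -11 (EAGAIN)
lctl get_param osp.oak-OST*-osc-MDT0000.prealloc_status | awk -F= '$2 == -11 {print $1}'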
MDT0 prealloc info for a problematic OST:
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/prealloc*
osp.oak-OST0098-osc-MDT0000.prealloc_last_id=1
osp.oak-OST0098-osc-MDT0000.prealloc_last_seq=0x0
osp.oak-OST0098-osc-MDT0000.prealloc_next_id=2
osp.oak-OST0098-osc-MDT0000.prealloc_next_seq=0x0
osp.oak-OST0098-osc-MDT0000.prealloc_reserved=0
osp.oak-OST0098-osc-MDT0000.prealloc_status=-11
osp stats for a problematic OST:
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/stats
osp.oak-OST0098-osc-MDT0000.stats=
snapshot_time             1676840498.011953710 secs.nsecs
req_waittime              22039 samples [usec] 31 4156496 65699472 70586321859268
req_active                22039 samples [reqs] 1 20 23968 43578
ost_setattr               47 samples [usec] 77 19916 25528 397438254
ost_destroy               480 samples [usec] 132 4079970 38902621 41255678589979
ost_connect               1 samples [usec] 99 99 99 9801
obd_ping                  21511 samples [usec] 31 4156496 26771224 29330245821234
vs. a working one for comparison purposes:
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0099-osc-MDT0000/stats
osp.oak-OST0099-osc-MDT0000.stats=
snapshot_time             1676840500.538529214 secs.nsecs
req_waittime              133680 samples [usec] 30 5731842 227870174 257553761260390
req_active                133680 samples [reqs] 1 21 136003 155255
ost_setattr               66 samples [usec] 75 12523 61768 584301738
ost_create                429 samples [usec] 851 392236 42632140 5687640118670
ost_destroy               5919 samples [usec] 98 5731842 84323316 91990540215144
ost_connect               1 samples [usec] 110 110 110 12100
ost_statfs                105758 samples [usec] 30 5181321 64867010 151688813146608
obd_ping                  21507 samples [usec] 31 1552632 35985830 8186183466130
Note that the problematic OST shows no ost_create (and no ost_statfs) samples at all, unlike the working one. State for the same problematic OST:
[root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/state
osp.oak-OST0098-osc-MDT0000.state=
current_state: FULL
state_history:
 - [ 1676310563, CONNECTING ]
 - [ 1676310563, FULL ]
So far, we have no user reports of any issue with the filesystem, probably because Lustre does a good job of NOT using these OSTs for new objects, likely thanks to a proper check on prealloc_status. We only noticed the problem thanks to a micro-benchmark monitoring script that periodically performs some I/O on every OST of the filesystem by creating files on MDT0000:
# Just to make sure here, we're using a directory on MDT0000
[root@oak-rbh02 stripes]# lfs getdirstripe .
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
# Allocate a file using plain layout on a single OST with prealloc_status=-11
[root@oak-rbh02 stripes]# time lfs setstripe -c 1 -o 152 test152
lfs setstripe: setstripe error for 'test152': Connection timed out

real    1m40.006s
user    0m0.000s
sys     0m0.005s
[root@oak-rbh02 stripes]# lfs getstripe test152
test152 has no stripe info
We can see that setstripe times out after 100s here, which is Lustre's timeout value.
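(For reference, and assuming the standard obd timeout tunable is what applies here, the timeout in effect can be checked with the command below; 100s is the stock default.)

# Show the global Lustre RPC timeout (obd_timeout)
lctl get_param timeout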
It looks to me like MDT0000 is somehow unable to run the preallocation routines for these 76 OSTs, and they are stuck with this status. But nothing in the logs at MDT start seems to indicate a problem. Just in case, I am attaching the Lustre logs from the start of both MDT0000 and MDT0003 (due to a manual failover to perform service maintenance on a server) as oak-md1-s1-lustre-mdt-failover.log, along with the Lustre log file lustre-log.1676310742.6098.gz which was dumped during the start. But it's not clear from these logs that there was any preallocation problem. Lustre should have a better way of logging this when it happens.
Some ideas…
- I tried things like setting force_sync=1, or changing the value of max_create_count, to try to force a refresh of the object preallocation, to no avail (see the example commands after this list).
- Note that we do have the patch (available in 2.12.9):
LU-15009 ofd: continue precreate if LAST_ID is less on MDT
and I am wondering if that could be related, as the patch turns a CERROR into an LCONSOLE(D_INFO, …) in some cases, which we would likely have missed since we don't run with +info (it's way too verbose). It's important enough that it should probably be logged as an error instead. But I'm really not sure this is related to our issue.
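For the record, the commands tried in the first item above were along these lines (OST0098 is used as an example device, and the exact values may have differed):

# Attempt to force a refresh of object preallocation on a stuck OSP device
lctl set_param osp.oak-OST0098-osc-MDT0000.force_sync=1
# Temporarily change max_create_count, then restore it, to try to kick precreation
lctl set_param osp.oak-OST0098-osc-MDT0000.max_create_count=0
lctl set_param osp.oak-OST0098-osc-MDT0000.max_create_count=20000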
As for the server version we're running here, sorry, it's not ideal: it's somewhere between 2.12.8 and 2.12.9, plus additional patches not in 2.12.x yet (but only patches from Gerrit). To see the patches we have applied since 2.12.8, see git-patches.log.
Any ideas or suggestions of things to try before attempting a stop/start of the impacted OSTs to force a reconnect? (Or is there a way to force a reconnect to a specific OST just for MDT0000?)
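Regarding that last question, the only per-device mechanism that comes to mind is deactivating and then reactivating the OSP device on the MDS, sketched below. This is untested here and it is unclear whether it would clear the -11 status, so treat it purely as a hypothetical idea.

# Hypothetical per-device approach (not tried here): deactivate, then reactivate,
# the OSP device for this OST on the MDS hosting MDT0000.
# Note: while deactivated, no new objects are allocated to this OST from this MDT.
lctl --device oak-OST0098-osc-MDT0000 deactivate
lctl --device oak-OST0098-osc-MDT0000 activate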