LU-16578: osp prealloc_status stuck at -11 after MDT failover

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.14.0, Lustre 2.12.8
    • Environment: CentOS 7.9 (3.10.0-1160.6.1.el7_lustre.pl1.x86_64)
    • Severity: 3

    Description

      We are seeing a weird problem on our Oak storage system where MDT0000 does not seem to be able to allocate new objects on specific OSTs after an MDT0000 failover. Lustre does not complain about that in the logs (which is already a problem per se), but we can see that something is wrong because prealloc_status is set to -11 (EAGAIN) for those OSTs:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | grep -c "=-11$"
      76
      

      While 388 other OSTs are fine:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | grep -c "=0$"
      388
      

      Note: prealloc_status for MDT0001 and MDT0002, which are located on the same MDS, is fine for all OSTs, which likely indicates that the OSTs themselves are fine:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0001/prealloc_status | grep -c "=0$"
      464
      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0002/prealloc_status | grep -c "=0$"
      464
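
      For reference, a quick one-liner to list the affected OSP devices by name (rather than just counting them); this is only a combination of the get_param output shown above with awk, nothing more:

      # List the OSP devices on MDT0000 whose prealloc_status is -11
      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000.prealloc_status | awk -F= '$2 == "-11" {print $1}'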
      

      MDT0 prealloc info for a problematic OST:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/prealloc*
      osp.oak-OST0098-osc-MDT0000.prealloc_last_id=1
      osp.oak-OST0098-osc-MDT0000.prealloc_last_seq=0x0
      osp.oak-OST0098-osc-MDT0000.prealloc_next_id=2
      osp.oak-OST0098-osc-MDT0000.prealloc_next_seq=0x0
      osp.oak-OST0098-osc-MDT0000.prealloc_reserved=0
      osp.oak-OST0098-osc-MDT0000.prealloc_status=-11
      

      osp stats for a problematic OST:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/stats
      osp.oak-OST0098-osc-MDT0000.stats=
      snapshot_time             1676840498.011953710 secs.nsecs
      req_waittime              22039 samples [usec] 31 4156496 65699472 70586321859268
      req_active                22039 samples [reqs] 1 20 23968 43578
      ost_setattr               47 samples [usec] 77 19916 25528 397438254
      ost_destroy               480 samples [usec] 132 4079970 38902621 41255678589979
      ost_connect               1 samples [usec] 99 99 99 9801
      obd_ping                  21511 samples [usec] 31 4156496 26771224 29330245821234
      

      vs. a working one for comparison purposes:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0099-osc-MDT0000/stats
      osp.oak-OST0099-osc-MDT0000.stats=
      snapshot_time             1676840500.538529214 secs.nsecs
      req_waittime              133680 samples [usec] 30 5731842 227870174 257553761260390
      req_active                133680 samples [reqs] 1 21 136003 155255
      ost_setattr               66 samples [usec] 75 12523 61768 584301738
      ost_create                429 samples [usec] 851 392236 42632140 5687640118670
      ost_destroy               5919 samples [usec] 98 5731842 84323316 91990540215144
      ost_connect               1 samples [usec] 110 110 110 12100
      ost_statfs                105758 samples [usec] 30 5181321 64867010 151688813146608
      obd_ping                  21507 samples [usec] 31 1552632 35985830 8186183466130
      

      State for the same problematic OST:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/state
      osp.oak-OST0098-osc-MDT0000.state=
      current_state: FULL
      state_history:
       - [ 1676310563, CONNECTING ]
       - [ 1676310563, FULL ]
      

      So far, we have no user reports of any issue with the filesystem, probably because Lustre does a good job of NOT using these OSTs for new objects, likely thanks to a proper check on prealloc_status. We only noticed the problem thanks to a micro-benchmark monitoring script that periodically performs some I/O on every OST of the filesystem by creating files on MDT0000:

      # Just to make sure here, we’re using a directory on MDT0000
      [root@oak-rbh02 stripes]# lfs getdirstripe .
      lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
      
      # Allocate a file using plain layout on a single OST with prealloc_status=-11
      [root@oak-rbh02 stripes]# time lfs setstripe -c 1 -o 152 test152
      lfs setstripe: setstripe error for 'test152': Connection timed out
      
      real    1m40.006s
      user    0m0.000s
      sys    0m0.005s
      
      [root@oak-rbh02 stripes]# lfs getstripe test152
      test152 has no stripe info
      

      We can see that setstripe times out after 100s here, which is Lustre's obd_timeout value.
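
      For completeness, a quick way to confirm that timeout and to keep a probe from tying up a shell much longer than that; this is a sketch, the probe file name is made up and the `timeout` wrapper is just a convenience, not part of our monitoring script:

      # Confirm the Lustre RPC timeout (obd_timeout); it is 100s on this system
      [root@oak-md1-s1 ~]# lctl get_param timeout

      # Bound a single-OST probe so a stuck OST releases the shell shortly after obd_timeout
      [root@oak-rbh02 stripes]# timeout 130 lfs setstripe -c 1 -o 152 test152.probe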

      It looks to me like MDT0000 is somehow not able to run the preallocation routines for these 76 OSTs, and they are stuck with this status. But nothing in the logs at the start of the MDT seems to indicate a problem. Just in case, I am attaching the Lustre logs from the start of both MDT0000 and MDT0003 (due to a manual failover for service maintenance on a server) as oak-md1-s1-lustre-mdt-failover.log, along with the Lustre log file lustre-log.1676310742.6098.gz, which was dumped during the start. But it is not clear from these logs that there was any preallocation problem. Lustre should have a better way of logging this condition when it happens.

      Some ideas…

      • I tried things like setting force_sync=1, or changing the value of max_create_count, to try to force a refresh of the object preallocation, to no avail (see the sketch after this list).
      • Note that we do have the patch (available in 2.12.9):
        LU-15009 ofd: continue precreate if LAST_ID is less on MDT
        and I am wondering if that could be related, as the patch turns a CERROR into an LCONSOLE(D_INFO,…) in some cases, which we would likely have missed since we do not run with +info (it is way too verbose). It is important enough that it should probably be logged as an error instead. But I am really not sure this is related to our issue.
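
      Concretely, the attempts mentioned in the first bullet were along these lines (a sketch from memory; the exact max_create_count values we tried varied):

      # Force a sync on one of the stuck OSP devices (no effect observed)
      [root@oak-md1-s1 ~]# lctl set_param osp.oak-OST0098-osc-MDT0000.force_sync=1

      # Nudge the precreate machinery by changing max_create_count (no effect observed)
      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000.max_create_count
      [root@oak-md1-s1 ~]# lctl set_param osp.oak-OST0098-osc-MDT0000.max_create_count=20000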

      As for the server version we are running here, sorry, it is not ideal: it is somewhere between 2.12.8 and 2.12.9, plus additional patches not yet in 2.12.x (but only patches taken from Gerrit). For the list of patches we carry on top of 2.12.8, see git-patches.log.

      Any ideas or suggestions of things to try before we resort to a stop/start of the impacted OSTs to force a reconnect? (Or is there a way to force a reconnect of a specific OST just for MDT0000?)
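
      For that last question, one candidate (untested here, so take it as a sketch) is the manual's procedure for deactivating and reactivating a single OST from the MDS side, which operates on exactly the per-MDT OSP device; whether reactivation is enough to restart precreation is precisely what we are unsure about:

      # Deactivate the OSP device for one OST on MDT0000 only (MDT0001/MDT0002 untouched)
      [root@oak-md1-s1 ~]# lctl set_param osp.oak-OST0098-osc-MDT0000.active=0

      # ...then reactivate it, hoping the precreate thread reconnects and refreshes
      [root@oak-md1-s1 ~]# lctl set_param osp.oak-OST0098-osc-MDT0000.active=1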

      Attachments

        1. oak-md1-s1-lustre-mdt-failover.log
          12 kB
        2. lustre-log.1676310742.6098.gz
          2.61 MB
        3. git-patches.log
          2 kB
        4. MDT0_prealloc_status_full.log
          21 kB
        5. ldebug.log.gz
          67.06 MB
        6. LU-16578-oak-md1-s2_foreach_bt_1.gz
          191 kB
        7. LU-16578-oak-md1-s2_foreach_bt_2.gz
          195 kB

        Issue Links

          Activity

            [LU-16578] osp prealloc_status stuck at -11 after MDT failover

            > Stephane, are you using a PFL layout with "-c -1" that stripes over all OSTs?
            Andreas, no, as far as I know we are not doing that on this system. We have a monitoring script that does explicit single-OST plain striping on each OST to monitor performance. Perhaps this contributed to the high load we saw. Our default PFL layout is very simple on this system, which is designed for longer-term storage:

            # lfs getstripe -d /oak
              lcm_layout_gen:    0
              lcm_mirror_count:  1
              lcm_entry_count:   2
                lcme_id:             N/A
                lcme_mirror_id:      N/A
                lcme_flags:          0
                lcme_extent.e_start: 0
                lcme_extent.e_end:   2199023255552
                  stripe_count:  1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1
            
                lcme_id:             N/A
                lcme_mirror_id:      N/A
                lcme_flags:          0
                lcme_extent.e_start: 2199023255552
                lcme_extent.e_end:   EOF
                  stripe_count:  8       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1
            

            Anyway, we applied the patch cherry-picked from https://review.whamcloud.com/c/fs/lustre-release/+/35078 (LU-12397) and we don't see any single OST with prealloc_status=-11 anymore. This ticket can be closed as duplicate of LU-12397. Thanks!
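
            For the record, the check here is essentially the same count as in the original report, which now matches no devices (command only):

            # Count OSP devices on MDT0000 still reporting prealloc_status=-11 (now zero)
            [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000.prealloc_status | grep -c "=-11$"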

            sthiell Stephane Thiell added a comment

            Actually LU-12397 would be a good candidate for that ( https://review.whamcloud.com/c/fs/lustre-release/+/35078 ) – the commit message description fits exactly and Stéphane doesn't have it on his branch.

            I'll have him give it a try.

            asmadeus Dominique Martinet added a comment
            asmadeus Dominique Martinet added a comment (edited)

            Hi Andreas, thanks for the pointers!

            The LU-15393 patch looks like it could help with the high load observed, thanks!

            In this case, though, that is just chasing the symptoms, so I'd like to understand why the prealloc isn't working. Going back to the 'foreach bt' Stéphane provided, at that time he had 60 osp-pre-X-0 threads stuck at the "need to be connected to OST" wait at the start of osp_precreate_thread:

                            while (osp_precreate_running(d)) {
                                    if ((d->opd_pre == NULL || d->opd_pre_recovering) &&
                                        d->opd_imp_connected &&
                                        !d->opd_got_disconnected)
                                            break;
                                    l_wait_event(d->opd_pre_waitq,
                                                 !osp_precreate_running(d) ||
                                                 d->opd_new_connection,
                                                 &lwi);
            

            (the other threads, and all threads from MDT0003 running on the same MDS, are in the "connected, can handle precreates now" loop, as they should be)

            I've had a quick look and didn't see any merged patch that seems to touch this, but perhaps you'll have an idea why threads would be stuck there?


            Stephane, are you using a PFL layout with "-c -1" that stripes over all OSTs? It would be interesting to change this to some fixed number fewer than the number of OSTs (e.g. "-c 32" or whatever) to see whether that keeps the allocations from blocking. In newer Lustre you could instead set "lod.*.max_stripecount=N" on the MDS and it would force all "-c -1" layouts to use "-c N" instead of "all OSTs". This was added in patch https://review.whamcloud.com/45532 "LU-9162 lod: option to set max stripe count per filesystem", but it hasn't been backported to 2.12.
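
            For example, on a release that carries the LU-9162 patch, that would look something like this on the MDS (a sketch; the stripe count value is arbitrary and the setting shown is not persistent):

            # Cap "-c -1" layouts at 32 stripes for all MDTs on this MDS
            mds# lctl set_param lod.*.max_stripecount=32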

            Lustre previously also had a heuristic that allowed fully-striped files to be content with at least 3/4 of the OSTs (if some OSTs are unavailable), but this was lost during some code reorganization, so stripe_count = -1 files will always try to allocate from all OSTs even if some of them are having problems. I've started fixing this in patch https://review.whamcloud.com/50250 "LU-16623 lod: handle object allocation consistently" but I haven't had a chance to finish the patch yet.

            Separately, there was a patch under LU-15393 to avoid blocking the MDS threads in OST object creation for a long time that may be useful here.

            adilger Andreas Dilger added a comment
            asmadeus Dominique Martinet added a comment (edited)

            For the high load / blocked threads first: it's just a side effect / another manifestation of the same underlying problem, but basically:

            • the thread you noticed in osp_precreate_reserve has a write lock on lod->lod_qos.lq_rw_sem (taken in lod_alloc_qos a bit higher in the stack), and is waiting for prealloc_status to become ready (osp_precreate_ready_condition) on some OST
            • other threads (both lod_alloc_qos and lod_qos_statfs_update) are waiting on the same rw_sem (also in write mode)

            Since prealloc_status never becomes ready, these wait until the timeout (obd_timeout, 100s), at which point the lock is released, but the next lod_alloc_qos will wait again... presumably once per corresponding create (or until all OSTs in a bad precreate state have been iterated over; I didn't really check that far).

            lod_alloc_qos first calls osp_precreate_reserve with "slow" (in lod_alloc_qos) / "can_block" (in the rest of the stack) set to false, so that call won't block with the lock held; but if that wasn't enough and a file was requested with more stripes than precreated objects are available, it'll be called again in blocking mode.
            Basically the MDT will hang a bit every time a job tries to create a batch of files with "lfs setstripe -c -1"...

            I don't think much can be done about this here – we could just tell the client sorry there aren't enough OSTs available for your stripes, but in an optimal case it shouldn't be waiting that long in the first place so let's fix the underlying precreate issue instead.

            I'll have a look at that next (I was hoping I'd get a hint in there, but no luck)


            This issue persists. We restarted MDT0000 several times. Every time, some OSTs report prealloc_status=-11. We got a different number of them each time, but we could never get all of them back to prealloc_status=0. Right now, 29 of them are still in that bad state.

            I captured some debug logs with +info enabled and noticed messages like these:

            00000004:00000040:31.0:1677771272.969914:0:16555:0:(osp_precreate.c:1050:osp_pre_update_msfs()) oak-OST018b-osc-MDT0000: blocks=34861959081 free=14858726726 avail=14507148013 avail_mb=56668546 hwm_mb=265975 files=137329664 ffree=132179391 state=0: rc = -11
            
            00000004:00000040:28.0:1677771272.979391:0:16556:0:(osp_precreate.c:1050:osp_pre_update_msfs()) oak-OST018c-osc-MDT0000: blocks=34861959081 free=13552189747 avail=13200616749 avail_mb=51564909 hwm_mb=265975 files=137329664 ffree=132293767 state=0: rc = -11
            
            00000004:00000040:18.0:1677771273.126947:0:16534:0:(osp_precreate.c:1050:osp_pre_update_msfs()) oak-OST019d-osc-MDT0000: blocks=34861959081 free=13716915320 avail=13365340891 avail_mb=52208362 hwm_mb=265975 files=137329664 ffree=131951371 state=0: rc = -11
            
            00000004:00000040:16.0:1677771273.191533:0:16527:0:(osp_precreate.c:1050:osp_pre_update_msfs()) oak-OST01a5-osc-MDT0000: blocks=34861959081 free=15940628589 avail=15589045678 avail_mb=60894709 hwm_mb=265975 files=137329664 ffree=132008888 state=0: rc = -11
            

            Attaching the debug logs as ldebug.log.gz
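
            For reference, the capture was essentially the usual debug-log procedure; this is a sketch from memory, and the exact debug mask and buffer size may have differed:

            # Enable the info debug flag and enlarge the debug buffer before reproducing
            [root@oak-md1-s1 ~]# lctl set_param debug=+info debug_mb=1024

            # After reproducing, dump the kernel debug buffer and restore the debug mask
            [root@oak-md1-s1 ~]# lctl dk /tmp/ldebug.log
            [root@oak-md1-s1 ~]# lctl set_param debug=-info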

            Now... and very likely related to the issue described in this ticket, we are also seeing occurrences of high load on this MDS due to many blocked threads. I took two dumps of "foreach bt" in a live crash session to investigate (attached as LU-16578-oak-md1-s2_foreach_bt_1.gz and LU-16578-oak-md1-s2_foreach_bt_2.gz).

            What I can see each time is that many threads are blocked in mdt_object_local_lock() or lod_qos_prep_create().

            But each time, a single thread is blocked in osp_precreate_reserve():

            PID: 20498  TASK: ffff9d9b38a1d280  CPU: 28  COMMAND: "mdt07_076"
             #0 [ffff9d9b38b1f3b0] __schedule at ffffffffb65b78d8
             #1 [ffff9d9b38b1f418] schedule at ffffffffb65b7ca9
             #2 [ffff9d9b38b1f428] schedule_timeout at ffffffffb65b5778
             #3 [ffff9d9b38b1f4d8] osp_precreate_reserve at ffffffffc17b2fc8 [osp]
             #4 [ffff9d9b38b1f598] osp_declare_create at ffffffffc17a78ec [osp]
             #5 [ffff9d9b38b1f5f8] lod_sub_declare_create at ffffffffc16ee203 [lod]
             #6 [ffff9d9b38b1f658] lod_qos_declare_object_on at ffffffffc16e62ae [lod]
             #7 [ffff9d9b38b1f6a8] lod_alloc_qos.constprop.18 at ffffffffc16e7c57 [lod]
             #8 [ffff9d9b38b1f7c8] lod_qos_prep_create at ffffffffc16ed460 [lod]
             #9 [ffff9d9b38b1f8d0] lod_declare_instantiate_components at ffffffffc16cdf5a [lod]
            #10 [ffff9d9b38b1f908] lod_declare_layout_change at ffffffffc16e11e5 [lod]
            #11 [ffff9d9b38b1f970] mdd_declare_layout_change at ffffffffc1754d02 [mdd]
            #12 [ffff9d9b38b1f998] mdd_layout_change at ffffffffc175dd86 [mdd]
            #13 [ffff9d9b38b1fa08] mdt_layout_change at ffffffffc15e926f [mdt]
            #14 [ffff9d9b38b1fa48] mdt_intent_layout at ffffffffc15f0ff0 [mdt]
            #15 [ffff9d9b38b1fb08] mdt_intent_opc at ffffffffc15e695a [mdt]
            #16 [ffff9d9b38b1fb68] mdt_intent_policy at ffffffffc15eecac [mdt]
            #17 [ffff9d9b38b1fba8] ldlm_lock_enqueue at ffffffffc114e636 [ptlrpc]
            #18 [ffff9d9b38b1fc20] ldlm_handle_enqueue0 at ffffffffc1176166 [ptlrpc]
            #19 [ffff9d9b38b1fcb0] tgt_enqueue at ffffffffc1201182 [ptlrpc]
            #20 [ffff9d9b38b1fcd0] tgt_request_handle at ffffffffc120803a [ptlrpc]
            #21 [ffff9d9b38b1fd58] ptlrpc_server_handle_request at ffffffffc11abe4b [ptlrpc]
            #22 [ffff9d9b38b1fdf8] ptlrpc_main at ffffffffc11af7f4 [ptlrpc]
            #23 [ffff9d9b38b1fec8] kthread at ffffffffb5ecb511
            #24 [ffff9d9b38b1ff50] ret_from_fork_nospec_begin at ffffffffb65c51dd
            

            This is weird to me, as osp_precreate_reserve() is supposed to skip any OST with prealloc_status != 0. So I am wondering if we may have another problem where OSTs would not be in a good state for preallocation, either leading to prealloc_status=-11, or blocking osp_precreate_reserve() for OSTs that have a prealloc_status=0.
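
            A simple thing that could be checked while one of these hangs is ongoing, to see whether reservations really only pile up on OSTs whose prealloc_status is 0, is the following (a sketch; nothing beyond the parameters already shown above):

            # Any OSP devices on MDT0000 with a non-zero prealloc_status?
            [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000.prealloc_status | grep -v '=0$'

            # Any OSP devices on MDT0000 with outstanding reservations?
            [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000.prealloc_reserved | grep -v '=0$'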

            Eventually, the MDT recovers. Logs look like this:

            Mar 07 16:27:49 oak-md1-s2 kernel: Pid: 20362, comm: mdt04_015 3.10.0-1160.83.1.el7_lustre.pl1.x86_64 #1 SMP Sun Feb 19 18:38:37 PST 2023
            Mar 07 16:27:49 oak-md1-s2 kernel: Call Trace:
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffb61ae0d7>] call_rwsem_down_write_failed+0x17/0x30
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc16e6d5d>] lod_alloc_qos.constprop.18+0x20d/0x1870 [lod]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc16ed460>] lod_qos_prep_create+0x1300/0x18c0 [lod]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc16edc3d>] lod_prepare_create+0x21d/0x2e0 [lod]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc16dd9ee>] lod_declare_striped_create+0x1ee/0x980 [lod]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc16e22d4>] lod_declare_create+0x204/0x590 [lod]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc1759c8a>] mdd_declare_create_object_internal+0xea/0x370 [mdd]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc174933c>] mdd_declare_create+0x4c/0xdf0 [mdd]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc174ce97>] mdd_create+0x877/0x14b0 [mdd]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc1615618>] mdt_reint_open+0x2588/0x3970 [mdt]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc1608303>] mdt_reint_rec+0x83/0x210 [mdt]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc15e35e3>] mdt_reint_internal+0x6f3/0xb00 [mdt]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc15f0432>] mdt_intent_open+0x82/0x3a0 [mdt]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc15e695a>] mdt_intent_opc+0x1ba/0xb50 [mdt]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc15eecac>] mdt_intent_policy+0x1ac/0x370 [mdt]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc114e636>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc1176166>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc1201182>] tgt_enqueue+0x62/0x220 [ptlrpc]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc120803a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc11abe4b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffc11af7f4>] ptlrpc_main+0xb44/0x1480 [ptlrpc]
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffb5ecb511>] kthread+0xd1/0xe0
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffb65c51dd>] ret_from_fork_nospec_begin+0x7/0x21
            Mar 07 16:27:49 oak-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            
            <skip>
            
            Mar 07 16:49:28 oak-md1-s2 kernel: LNet: Service thread pid 20362 completed after 1499.99s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            Mar 07 16:49:28 oak-md1-s2 kernel: LustreError: 20477:0:(service.c:2132:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.51.13.9@o2ib3: deadline 100:85s ago
                                                 req@ffff9da42db43600 x1759758327487744/t0(0) o38-><?>@<?>:0/0 lens 520/0 e 0 to 0 dl 1678236483 ref 1 fl Interpret:/0/ffffffff rc 0/-1
            Mar 07 16:49:28 oak-md1-s2 kernel: LustreError: 20477:0:(service.c:2132:ptlrpc_server_handle_request()) Skipped 24 previous similar messages
            Mar 07 16:49:28 oak-md1-s2 kernel: Lustre: 20477:0:(service.c:2169:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (100:85s); client may timeout.  req@ffff9da42db43600 x1759758327487744/t0(0) o38-><?>@<?>:0/0 lens 520/0 e 0 to 
            Mar 07 16:49:28 oak-md1-s2 kernel: Lustre: 20477:0:(service.c:2169:ptlrpc_server_handle_request()) Skipped 320 previous similar messages
            Mar 07 16:49:28 oak-md1-s2 kernel: LustreError: 20514:0:(tgt_handler.c:651:process_req_last_xid()) @@@ Unexpected xid 63a065158cbc0 vs. last_xid 63a06515a7e7f
                                                 req@ffff9d7b32f82400 x1752648669252544/t0(0) o41->bf148b27-5bbf-4c21-8739-59abeeaf25b2@10.51.2.59@o2ib3:496/0 lens 440/0 e 0 to 0 dl 1678237146 ref 1 fl Interpret:/0/ffffffff rc 0/-1
            
            <back to normal, for now>

            Do you have any suggestion? Thanks!

            sthiell Stephane Thiell added a comment

            People

              Assignee: wc-triage WC Triage
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: