LU-16578: osp prealloc_status stuck at -11 after MDT failover

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.14.0, Lustre 2.12.8
    • Fix Version/s: None
    • Environment: CentOS 7.9 (3.10.0-1160.6.1.el7_lustre.pl1.x86_64)
    • Severity: 3
    • 9223372036854775807

    Description

      We are seeing a weird problem on our Oak storage system where MDT0000 doesn’t seem to be able to allocate new objects on specific OSTs after an MDT0000 failover. Lustre doesn’t complain about this in the logs (which is already a problem per se), but we can see that something is wrong because prealloc_status is set to -11 (EAGAIN) for those OSTs:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | grep -c "=-11$"
      76
      

      While 388 other OSTs are fine:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | grep -c "=0$"
      388
      

      Note: prealloc_status for MDT0001 and MDT0002, located on the same MDS, is fine, which likely indicates that the OSTs themselves are fine:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0001/prealloc_status | grep -c "=0$"
      464
      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST*-osc-MDT0002/prealloc_status | grep -c "=0$"
      464
      

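      For reference, a quick way to list exactly which OSP devices report -11 is to post-process the name=value output of the commands above; this is just a sketch, not something the diagnosis depends on:

      # List the OSP devices on MDT0000 whose prealloc_status is -11 (EAGAIN)
      lctl get_param osp.oak-OST*-osc-MDT0000/prealloc_status | awk -F= '$2 == -11 { print $1 }'
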
      MDT0 prealloc info for a problematic OST:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/prealloc*
      osp.oak-OST0098-osc-MDT0000.prealloc_last_id=1
      osp.oak-OST0098-osc-MDT0000.prealloc_last_seq=0x0
      osp.oak-OST0098-osc-MDT0000.prealloc_next_id=2
      osp.oak-OST0098-osc-MDT0000.prealloc_next_seq=0x0
      osp.oak-OST0098-osc-MDT0000.prealloc_reserved=0
      osp.oak-OST0098-osc-MDT0000.prealloc_status=-11
      

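      One additional data point that might be worth collecting, on the OSS currently serving oak-OST0098, is the OST-side LAST_ID, to compare against the prealloc_last_id/prealloc_next_id values above. This is a sketch; the obdfilter parameter name is assumed for our 2.12-based version:

      # On the OSS serving oak-OST0098: dump the OST-side LAST_ID for comparison with the MDT-side values
      lctl get_param obdfilter.oak-OST0098.last_id
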
      osp stats for a problematic OST:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/stats
      osp.oak-OST0098-osc-MDT0000.stats=
      snapshot_time             1676840498.011953710 secs.nsecs
      req_waittime              22039 samples [usec] 31 4156496 65699472 70586321859268
      req_active                22039 samples [reqs] 1 20 23968 43578
      ost_setattr               47 samples [usec] 77 19916 25528 397438254
      ost_destroy               480 samples [usec] 132 4079970 38902621 41255678589979
      ost_connect               1 samples [usec] 99 99 99 9801
      obd_ping                  21511 samples [usec] 31 4156496 26771224 29330245821234
      

      vs. a working one for comparison purposes (note that the problematic OSP shows no ost_create or ost_statfs RPCs at all):

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0099-osc-MDT0000/stats
      osp.oak-OST0099-osc-MDT0000.stats=
      snapshot_time             1676840500.538529214 secs.nsecs
      req_waittime              133680 samples [usec] 30 5731842 227870174 257553761260390
      req_active                133680 samples [reqs] 1 21 136003 155255
      ost_setattr               66 samples [usec] 75 12523 61768 584301738
      ost_create                429 samples [usec] 851 392236 42632140 5687640118670
      ost_destroy               5919 samples [usec] 98 5731842 84323316 91990540215144
      ost_connect               1 samples [usec] 110 110 110 12100
      ost_statfs                105758 samples [usec] 30 5181321 64867010 151688813146608
      obd_ping                  21507 samples [usec] 31 1552632 35985830 8186183466130
      

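      As a sketch (not something we have relied on), the same absence of ost_create RPCs could be checked across all OSPs on MDT0000 with something like:

      # Report any OSP device on MDT0000 whose stats contain no ost_create samples at all
      for p in $(lctl list_param osp.oak-OST*-osc-MDT0000.stats); do
          lctl get_param -n $p | grep -q ost_create || echo "${p%.stats}: no ost_create"
      done
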
      State for the same problematic OST:

      [root@oak-md1-s1 ~]# lctl get_param osp.oak-OST0098-osc-MDT0000/state
      osp.oak-OST0098-osc-MDT0000.state=
      current_state: FULL
      state_history:
       - [ 1676310563, CONNECTING ]
       - [ 1676310563, FULL ]
      

      So far, we have no user reports of any issue with the filesystem, probably because Lustre does a good job of NOT using these OSTs for new objects, likely due to a proper check on prealloc_status. We only noticed the problem thanks to a micro-benchmark monitoring script that periodically performs some I/O on every OST of the filesystem by creating files on MDT0000:

      # Just to make sure here, we’re using a directory on MDT0000
      [root@oak-rbh02 stripes]# lfs getdirstripe .
      lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
      
      # Allocate a file using plain layout on a single OST with prealloc_status=-11
      [root@oak-rbh02 stripes]# time lfs setstripe -c 1 -o 152 test152
      lfs setstripe: setstripe error for 'test152': Connection timed out
      
      real    1m40.006s
      user    0m0.000s
      sys    0m0.005s
      
      [root@oak-rbh02 stripes]# lfs getstripe test152
      test152 has no stripe info
      

      We can see that setstripe times out after 100s here, which is Lustre’s obd_timeout value.
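
      The monitoring script essentially repeats a probe like the one above for every OST index. A simplified, hypothetical sketch of that kind of probe (file names, the index range, and the 120s timeout are illustrative, not the actual script):

      # Hypothetical per-OST probe (not the actual script): create a 1-stripe file on each OST index
      for idx in $(seq 0 463); do
          timeout 120 lfs setstripe -c 1 -o $idx probe.$idx || echo "OST index $idx: setstripe failed"
      done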

      It looks to me like MDT0000 is somehow not able to run the preallocation routines for these 76 OSTs, and they are stuck with this status. But nothing in the logs at the start of the MDT seems to indicate a problem. Just in case, I am attaching the Lustre logs from the start of both MDT0000 and MDT0003 (the result of a manual failover done for service maintenance on a server) as oak-md1-s1-lustre-mdt-failover.log, along with the Lustre log file lustre-log.1676310742.6098.gz, which was dumped during the start. But it’s not clear from these logs that there was any preallocation problem. Lustre should have a better way of logging it when this happens.

      Some ideas…

      • I tried things like setting force_sync=1, or changing the value of max_create_count, to try to force a refresh of the object preallocation, to no avail (see the sketch after this list for the kind of commands used).
      • Note that we do have the patch (available in 2.12.9):
        LU-15009 ofd: continue precreate if LAST_ID is less on MDT
        and I am wondering if that could be related, as the patch turns a CERROR into an LCONSOLE(D_INFO, …) in some cases, which we would likely have missed, as we don’t run with +info (it’s way too verbose). It’s important enough that it should probably be logged as an error instead. But I’m really not sure this is related to our issue.

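      The commands used for the first point above looked roughly like this (a sketch; OST0098 and the max_create_count value of 20000 are just examples):

      # Sketch of the tuning attempts mentioned above (values are examples only)
      lctl set_param osp.oak-OST0098-osc-MDT0000.force_sync=1
      lctl get_param osp.oak-OST0098-osc-MDT0000.max_create_count
      lctl set_param osp.oak-OST0098-osc-MDT0000.max_create_count=0
      lctl set_param osp.oak-OST0098-osc-MDT0000.max_create_count=20000
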
      As for the server version we’re running here, sorry, it’s not ideal: it’s somewhere between 2.12.8 and 2.12.9, plus additional patches not in 2.12.x yet (but only patches from Gerrit). To see the patches we have added since 2.12.8, see git-patches.log.

      Any ideas or suggestions of things to try before we attempt a stop/start of the impacted OSTs to force a reconnect? (Or is there a way to force a reconnect to a specific OST just for MDT0000?)
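
      For what it’s worth, one candidate we are considering (not tried yet, and I am not sure it actually forces a full reconnect rather than just toggling the active flag) would be to deactivate/reactivate the corresponding OSP device on the MDS, roughly:

      # Untried idea: bounce the OSP device for this OST on the MDS
      lctl dl | grep oak-OST0098-osc-MDT0000
      lctl --device oak-OST0098-osc-MDT0000 deactivate
      lctl --device oak-OST0098-osc-MDT0000 activate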

      Attachments

        1. oak-md1-s1-lustre-mdt-failover.log
          12 kB
        2. MDT0_prealloc_status_full.log
          21 kB
        3. lustre-log.1676310742.6098.gz
          2.61 MB
        4. LU-16578-oak-md1-s2_foreach_bt_2.gz
          195 kB
        5. LU-16578-oak-md1-s2_foreach_bt_1.gz
          191 kB
        6. ldebug.log.gz
          67.06 MB
        7. git-patches.log
          2 kB

        Issue Links

          Activity

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Stephane Thiell (sthiell)
              Votes: 0
              Watchers: 4
