LU-12221

LBUG: (statahead.c:477:ll_sai_put()) ASSERTION( !sa_has_callback(sai) )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.11.0
    • Severity: 3

    Description

      crash_arm64> dmesg | grep Build
      [  409.666886] Lustre: Lustre: Build Version: 2.11.0.200_cray_79_g4c42971
      
       crash_arm64> dmesg | grep ': 102742:'
      [63861.735486] LustreError: 102742:0:(statahead.c:478:ll_sai_put()) ASSERTION( !sa_has_callback(sai) ) failed:
      [63861.744009] LustreError: 102742:0:(statahead.c:478:ll_sai_put()) LBUG
      
      crash_arm64> bt
      PID: 102742  TASK: ffff80be57ca8000  CPU: 96  COMMAND: "ll_sa_102181"
       #0 [ffff00001fceba80] __crash_kexec at ffff000008138910
       #1 [ffff00001fcebbb0] (null) at ffffffc4
       #2 [ffff00001fcebbe0] __const_udelay at ffff00000837dba0
       #3 [ffff00001fcebbf0] panic at ffff0000081c6d6c
       #4 [ffff00001fcebcc0] lbug_with_loc at ffff000000a7c9e4 [libcfs]
       #5 [ffff00001fcebce0] ll_sai_put at ffff0000013d5658 [lustre]   @ llite/statahead.c: 476
       #6 [ffff00001fcebd50] ll_statahead_thread at ffff0000013da360 [lustre]    @ llite/statahead.c: 1204
       #7 [ffff00001fcebe70] kthread at ffff0000080cb3a4
      

      This LBUG was first seen with a Lustre 2.7 client. It is easily reproduced by a heavy load of LTP statahead tests on a system of ~300 ARM nodes, each with 64 cores.

      The situation at the time of the LBUG: the ll_statahead_thread ll_sa_102181 has been stopped (sai_thread.t_flags == SVC_STOPPED) and is making its final call to ll_sai_put before returning through the out: code block. The LBUG is issued because a new entry has been placed on the sai_interim_entries list: a reply from the MDS was received and handled by ll_statahead_interpret while the thread was exiting. The ll_statahead_thread did not wait for this RPC to be completely processed before trying to exit.
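
      For reference, sa_has_callback() simply tests whether the interim list is non-empty, and ll_sai_put() asserts that it is empty on the final reference drop. A minimal paraphrase of the relevant pieces of llite/statahead.c (a sketch based on the 2.11-era source, not an exact quote):

      /* sai_interim_entries holds replies queued by ll_statahead_interpret()
       * that the statahead thread has not yet instantiated */
      static inline int sa_has_callback(struct ll_statahead_info *sai)
      {
              return !list_empty(&sai->sai_interim_entries);
      }

      /* in ll_sai_put(), on the final reference drop, no queued replies
       * may remain -- this is the LASSERT that fired at statahead.c:478 */
      LASSERT(!sa_has_callback(sai));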

      Dump available upon request.

      Values from bt -FF:
      ll_inode_info ffff809d0d6d99c0
      inode ffff809d0d6d9a48
      ll_statahead_info ffff809d04566800
      
      crash_arm64> epython sb_inodes -l ffff80be10087800 | grep ffff809d0d6d9a48
      0xffff80be10087800 lustre       0xffff809d0d6d9a48 144119544061917665         4096      1 0xffff809d80728cc0 /lus/snx11260/ostest.vers/alsorun.20190125183527.25496.ranger/CL_LLTPreadlinkat02.3.N2EP2O.1548522513/reaIck9AT (deleted)
       
      struct ll_statahead_info {
        sai_dentry = 0xffff809d80728cc0,
        sai_refcount = {
          counter = 0
        },
        sai_max = 2,
        sai_sent = 1,
        sai_replied = 1,
        sai_index = 2,
        sai_index_wait = 0,
        sai_hit = 0,
        sai_miss = 0,
        sai_consecutive_miss = 0,
        sai_miss_hidden = 0,
        sai_skip_hidden = 0,
        sai_ls_all = 0,
        sai_agl_valid = 0,
        sai_in_readpage = 0,
      ....
        sai_thread = {
          t_link = {
            next = 0x0,
            prev = 0x0
          },
          t_data = 0x0,
          t_flags = 1,
          t_id = 0,
          t_pid = 102742,
      ....
        sai_interim_entries = {
          next = 0xffff80be5aae9480,
          prev = 0xffff80be5aae9480
        },
      ....
      struct sa_entry {
        se_list = {
          next = 0xffff809d04566958,
          prev = 0xffff809d04566958
        },
        se_hash = {
          next = 0xffff809d04566988,
          prev = 0xffff809d04566988
        },
        se_index = 1,
        se_handle = 5636371587488280727,
        se_state = SA_ENTRY_INIT,
        se_size = 128,
        se_minfo = 0xffff80be5b17c400,
        se_req = 0xffff80be345d1340,
        se_inode = 0x0,
        se_qstr = {
          {
            {
              hash = 3580636640,
              len = 12
            },
            hash_len = 55120244192
          },
          name = 0xffff80be5aae94f0 "symlink_file"
        },
        se_fid = {
          f_seq = 8590194229,
    f_oid = 31203,
          f_ver = 0
        }
      
      crash_arm64> ll_statahead_info.sai_thread ffff809d04566800 | grep t_flags
          t_flags = 1,     == SVC_STOPPED
      
      crash_arm64> epython dumprpc 0xffff80be345d1340
      thread         pid    ptlrpc_request      xid                nid                 opc  phase:flags    R:W  sent/deadline          ptlrpc_body
      ==================================================================================================================================================
                     7495   0xffff80be345d1340  1623677299024992   snx11260-MDT0000    COMP:R         0:0  1548522747/1548522842  0xffff80be1295d838
      

      I think the likely cause of the bug is a tricky timing issue on ARM processors due to their weak memory consistency model. There is a race between a ptlrpcd daemon handling a statahead reply and the ll_statahead_thread trying to clear entries from its replied list as the thread terminates. The ll_statahead_thread sees the updates made by the RPC handler out of order.
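
      To make the ordering hazard concrete, here is a small userspace analogue (my illustration, not Lustre code) of the lockless pattern involved: a handler thread publishes a list entry and then bumps a reply counter, while a waiter spins on the counter and then checks the list. With relaxed atomics, a weakly ordered CPU such as ARM is permitted to let the waiter observe the counter increment before the list update:

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdbool.h>

      static atomic_int replied;
      static atomic_bool list_nonempty;   /* stands in for sai_interim_entries */
      static const int sent = 1;

      /* ptlrpcd side: publish the entry, then bump the reply counter */
      static void *handler_thread(void *arg)
      {
              (void)arg;
              atomic_store_explicit(&list_nonempty, true, memory_order_relaxed);
              /* relaxed: nothing orders the list store before the counter */
              atomic_fetch_add_explicit(&replied, 1, memory_order_relaxed);
              return NULL;
      }

      /* ll_statahead_thread side: wait for the counter, then check the list */
      static void *statahead_thread(void *arg)
      {
              bool nonempty;

              (void)arg;
              while (atomic_load_explicit(&replied, memory_order_relaxed) != sent)
                      ;       /* spin, like the wait loop at statahead.c:1184 */
              /* under relaxed ordering this load may still return false even
               * though the handler stored true first -- the empty-list view
               * that makes sa_handle_callback() a no-op just before the LBUG */
              nonempty = atomic_load_explicit(&list_nonempty, memory_order_relaxed);
              (void)nonempty;
              return NULL;
      }

      int main(void)
      {
              pthread_t h, s;

              pthread_create(&h, NULL, handler_thread, NULL);
              pthread_create(&s, NULL, statahead_thread, NULL);
              pthread_join(h, NULL);
              pthread_join(s, NULL);
              return 0;
      }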

      While terminating, the ll_statahead_thread waits for replies to all RPCs it has already sent. It then clears the entries from the reply list built by the RPC handler by calling sa_handle_callback; if the list is empty, sa_handle_callback does nothing.

      1184         while (sai->sai_sent != sai->sai_replied) {
      1185                 /* in case we're not woken up, timeout wait */
      1186                 lwi = LWI_TIMEOUT(msecs_to_jiffies(MSEC_PER_SEC >> 3),
      1187                                   NULL, NULL);
      1188                 l_wait_event(sa_thread->t_ctl_waitq,
      1189                         sai->sai_sent == sai->sai_replied, &lwi);
      1190         }
      1191
      1192         /* release resources held by statahead RPCs */
      1193         sa_handle_callback(sai);    
      .....
      1204         ll_sai_put(sai);              <---- LBUG 
      

      Meanwhile a reply is received and handled by a ptlrpcd daemon. It executes the following code from ll_statahead_interpret:

      700         spin_lock(&lli->lli_sa_lock);
      701         if (rc != 0) {
      702                 if (__sa_make_ready(sai, entry, rc))
      703                         waitq = &sai->sai_waitq;
      704         } else {
      705                 entry->se_minfo = minfo;
      706                 entry->se_req = ptlrpc_request_addref(req);
      ....
      711                 entry->se_handle = handle;
      712                 if (!sa_has_callback(sai))
      713                         waitq = &sai->sai_thread.t_ctl_waitq;
      714
      715                 list_add_tail(&entry->se_list, &sai->sai_interim_entries);       <--- list of rpc replies
      716         }
      717         sai->sai_replied++;
      718
      719         smp_mb();
      720         if (waitq != NULL)
      721                 wake_up(waitq);
      722         spin_unlock(&lli->lli_sa_lock);
      

      The LBUG occurs because sai_sent equals sai_replied but the sai_interim_entries list is not empty. The likely explanation for this state is that ll_statahead_thread sees the increment of sai_replied made by the ptlrpcd daemon but does not yet see the update to the sai_interim_entries list. ll_statahead_thread exits the wait event loop and calls sa_handle_callback, which sees an empty list and returns. But by the time ll_sai_put is called, the update to sai_interim_entries is visible and triggers the assertion.

       645 static void sa_handle_callback(struct ll_statahead_info *sai)
       646 {
       647         struct ll_inode_info *lli;
       648
       649         lli = ll_i2info(sai->sai_dentry->d_inode);
       650
 651         while (sa_has_callback(sai)) {         <--- list update not yet visible
       652                 struct sa_entry *entry;
       653
       654                 spin_lock(&lli->lli_sa_lock);
       655                 if (unlikely(!sa_has_callback(sai))) {
       656                         spin_unlock(&lli->lli_sa_lock);
       657                         break;
       658                 }
       659                 entry = list_entry(sai->sai_interim_entries.next,
       660                                    struct sa_entry, se_list);
       661                 list_del_init(&entry->se_list);
       662                 spin_unlock(&lli->lli_sa_lock);
       663
       664                 sa_instantiate(sai, entry);
       665         }
      

      There are two possible fixes:
      1. Add a memory barrier in ll_statahead_interpret to enforce the expected update ordering.
      2. Hold the lli_sa_lock spinlock whenever sa_has_callback() is called in sa_handle_callback().

      I think 2 is the better solution: it is easier to understand and more maintainable. Patch is on the way; a sketch of the idea follows.
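
      As a sketch of what option 2 could look like (my restructuring for illustration, not the actual patch), sa_handle_callback() would take lli_sa_lock before every sa_has_callback() check, so that the list test and the list manipulation form one critical section pairing with the locked update in ll_statahead_interpret:

      static void sa_handle_callback(struct ll_statahead_info *sai)
      {
              struct ll_inode_info *lli;

              lli = ll_i2info(sai->sai_dentry->d_inode);

              spin_lock(&lli->lli_sa_lock);
              while (sa_has_callback(sai)) {  /* now always checked under the lock */
                      struct sa_entry *entry;

                      entry = list_entry(sai->sai_interim_entries.next,
                                         struct sa_entry, se_list);
                      list_del_init(&entry->se_list);
                      spin_unlock(&lli->lli_sa_lock);

                      /* may block, so instantiate with the lock dropped */
                      sa_instantiate(sai, entry);

                      spin_lock(&lli->lli_sa_lock);
              }
              spin_unlock(&lli->lli_sa_lock);
      }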

          People

            Assignee: Ann Koehler (Inactive)
            Reporter: Ann Koehler (Inactive)
            Votes: 0
            Watchers: 5
