[LU-12221] LBUG: (statahead.c:477:ll_sai_put()) ASSERTION( !sa_has_callback(sai) ) Created: 24/Apr/19  Updated: 16/Sep/21  Resolved: 25/May/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Major
Reporter: Ann Koehler (Inactive) Assignee: Ann Koehler (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
crash_arm64> dmesg | grep Build
[  409.666886] Lustre: Lustre: Build Version: 2.11.0.200_cray_79_g4c42971

 crash_arm64> dmesg | grep ': 102742:'
[63861.735486] LustreError: 102742:0:(statahead.c:478:ll_sai_put()) ASSERTION( !sa_has_callback(sai) ) failed:
[63861.744009] LustreError: 102742:0:(statahead.c:478:ll_sai_put()) LBUG

crash_arm64> bt
PID: 102742  TASK: ffff80be57ca8000  CPU: 96  COMMAND: "ll_sa_102181"
 #0 [ffff00001fceba80] __crash_kexec at ffff000008138910
 #1 [ffff00001fcebbb0] (null) at ffffffc4
 #2 [ffff00001fcebbe0] __const_udelay at ffff00000837dba0
 #3 [ffff00001fcebbf0] panic at ffff0000081c6d6c
 #4 [ffff00001fcebcc0] lbug_with_loc at ffff000000a7c9e4 [libcfs]
 #5 [ffff00001fcebce0] ll_sai_put at ffff0000013d5658 [lustre]   @ llite/statahead.c: 476
 #6 [ffff00001fcebd50] ll_statahead_thread at ffff0000013da360 [lustre]    @ llite/statahead.c: 1204
 #7 [ffff00001fcebe70] kthread at ffff0000080cb3a4

This LBUG was first seen with a Lustre 2.7 client. It can be easily reproduced with a heavy load of LTP statahead tests on a system with ~300 ARM nodes with 64 cores per node.

The situation at the time of the LBUG: the ll_statahead_thread ll_sa_102181 has been stopped (sai_thread.t_flags == SVC_STOPPED) and is making its final call to ll_sai_put before returning through the out: code block. The LBUG fires because a new entry has been placed on the sai_interim_entries list: a reply from the MDS was received and handled by ll_statahead_interpret, but the ll_statahead_thread did not wait for the RPC to be fully processed before trying to exit.
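
For reference, sa_has_callback() is simply an emptiness test on sai_interim_entries, so the assertion in ll_sai_put() amounts to "no unprocessed replies may remain when the last reference is dropped." A minimal sketch, paraphrasing the 2.11-era llite/statahead.c (exact line numbers vary between builds, as the 476/477/478 references above show):

static inline int sa_has_callback(struct ll_statahead_info *sai)
{
        /* a non-empty list means at least one reply still awaits
         * processing by sa_instantiate() */
        return !list_empty(&sai->sai_interim_entries);
}

static void ll_sai_put(struct ll_statahead_info *sai)
{
.....
        /* the failing check: the final reference must not be dropped
         * while replies are still queued */
        LASSERT(!sa_has_callback(sai));
.....
}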

Dump available upon request.

Values from bt -FF:
ll_inode_info ffff809d0d6d99c0
inode ffff809d0d6d9a48
ll_statahead_info ffff809d04566800

crash_arm64> epython sb_inodes -l ffff80be10087800 | grep ffff809d0d6d9a48
0xffff80be10087800 lustre       0xffff809d0d6d9a48 144119544061917665         4096      1 0xffff809d80728cc0 /lus/snx11260/ostest.vers/alsorun.20190125183527.25496.ranger/CL_LLTPreadlinkat02.3.N2EP2O.1548522513/reaIck9AT (deleted)
 
struct ll_statahead_info {
  sai_dentry = 0xffff809d80728cc0,
  sai_refcount = {
    counter = 0
  },
  sai_max = 2,
  sai_sent = 1,
  sai_replied = 1,
  sai_index = 2,
  sai_index_wait = 0,
  sai_hit = 0,
  sai_miss = 0,
  sai_consecutive_miss = 0,
  sai_miss_hidden = 0,
  sai_skip_hidden = 0,
  sai_ls_all = 0,
  sai_agl_valid = 0,
  sai_in_readpage = 0,
....
  sai_thread = {
    t_link = {
      next = 0x0,
      prev = 0x0
    },
    t_data = 0x0,
    t_flags = 1,
    t_id = 0,
    t_pid = 102742,
....
  sai_interim_entries = {
    next = 0xffff80be5aae9480,
    prev = 0xffff80be5aae9480
  },
....
struct sa_entry {
  se_list = {
    next = 0xffff809d04566958,
    prev = 0xffff809d04566958
  },
  se_hash = {
    next = 0xffff809d04566988,
    prev = 0xffff809d04566988
  },
  se_index = 1,
  se_handle = 5636371587488280727,
  se_state = SA_ENTRY_INIT,
  se_size = 128,
  se_minfo = 0xffff80be5b17c400,
  se_req = 0xffff80be345d1340,
  se_inode = 0x0,
  se_qstr = {
    {
      {
        hash = 3580636640,
        len = 12
      },
      hash_len = 55120244192
    },
    name = 0xffff80be5aae94f0 "symlink_file"
  },
  se_fid = {
    f_seq = 8590194229,
    f_oid = 31203,
    f_ver = 0
  }

crash_arm64> ll_statahead_info.sai_thread ffff809d04566800 | grep t_flags
    t_flags = 1,     == SVC_STOPPED

crash_arm64> epython dumprpc 0xffff80be345d1340
thread         pid    ptlrpc_request      xid                nid                 opc  phase:flags    R:W  sent/deadline          ptlrpc_body
==================================================================================================================================================
               7495   0xffff80be345d1340  1623677299024992   snx11260-MDT0000    COMP:R         0:0  1548522747/1548522842  0xffff80be1295d838

I think the likely cause of the bug is a subtle timing issue on ARM processors due to their weakly ordered memory model. There is a race between a ptlrpcd daemon handling a statahead reply and the ll_statahead_thread clearing entries from its reply list as it terminates: the ll_statahead_thread sees the updates made by the RPC handler out of order.

While terminating, the ll_statahead_thread waits for replies to all RPCs it has already sent. It then drains the reply list populated by the RPC handler by calling sa_handle_callback. If the list is empty, sa_handle_callback does nothing.

1184         while (sai->sai_sent != sai->sai_replied) {
1185                 /* in case we're not woken up, timeout wait */
1186                 lwi = LWI_TIMEOUT(msecs_to_jiffies(MSEC_PER_SEC >> 3),
1187                                   NULL, NULL);
1188                 l_wait_event(sa_thread->t_ctl_waitq,
1189                         sai->sai_sent == sai->sai_replied, &lwi);
1190         }
1191
1192         /* release resources held by statahead RPCs */
1193         sa_handle_callback(sai);    
.....
1204         ll_sai_put(sai);              <---- LBUG 

Meanwhile, a reply is received and handled by a ptlrpcd daemon, which executes the following code from ll_statahead_interpret:

700         spin_lock(&lli->lli_sa_lock);
701         if (rc != 0) {
702                 if (__sa_make_ready(sai, entry, rc))
703                         waitq = &sai->sai_waitq;
704         } else {
705                 entry->se_minfo = minfo;
706                 entry->se_req = ptlrpc_request_addref(req);
....
711                 entry->se_handle = handle;
712                 if (!sa_has_callback(sai))
713                         waitq = &sai->sai_thread.t_ctl_waitq;
714
715                 list_add_tail(&entry->se_list, &sai->sai_interim_entries);       <--- list of rpc replies
716         }
717         sai->sai_replied++;
718
719         smp_mb();
720         if (waitq != NULL)
721                 wake_up(waitq);
722         spin_unlock(&lli->lli_sa_lock);

The LBUG occurs because sai_sent equals sai_replied but the sai_interim_entries list is not empty. The likely explanation is that the ll_statahead_thread sees the increment of sai_replied made by the ptlrpcd daemon but does not yet see the update to the sai_interim_entries list. The ll_statahead_thread exits the wait event loop and calls sa_handle_callback, which sees an empty list and returns. But by the time ll_sai_put is called, the update to sai_interim_entries has become visible and triggers the assertion.
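
Schematically, the suspected interleaving looks like this (an illustration assuming the two stores in ll_statahead_interpret become visible in the opposite order to a reader that has not yet taken lli_sa_lock, which a weakly ordered CPU permits):

ptlrpcd (reply handler)                   ll_statahead_thread (terminating)
-----------------------                   ---------------------------------
list_add_tail(&entry->se_list,
              &sai->sai_interim_entries)
sai->sai_replied++
                                          sees sai_sent == sai_replied,
                                          exits the wait loop
                                          sa_handle_callback(): list still
                                          appears empty, returns immediately
                                          ll_sai_put(): list update is now
                                          visible -> assertion fails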

 645 static void sa_handle_callback(struct ll_statahead_info *sai)
 646 {
 647         struct ll_inode_info *lli;
 648
 649         lli = ll_i2info(sai->sai_dentry->d_inode);
 650
 651         while (sa_has_callback(sai)) {         <--- list update not yet visible
 652                 struct sa_entry *entry;
 653
 654                 spin_lock(&lli->lli_sa_lock);
 655                 if (unlikely(!sa_has_callback(sai))) {
 656                         spin_unlock(&lli->lli_sa_lock);
 657                         break;
 658                 }
 659                 entry = list_entry(sai->sai_interim_entries.next,
 660                                    struct sa_entry, se_list);
 661                 list_del_init(&entry->se_list);
 662                 spin_unlock(&lli->lli_sa_lock);
 663
 664                 sa_instantiate(sai, entry);
 665         }

There are two possible fixes:
1. Add a memory barrier in ll_statahead_interpret to enforce the expected update ordering.
2. Hold the lli_sa_lock spinlock whenever sa_has_callback() is called in sa_handle_callback().

I think 2 is the better, more easily understood, and more maintainable solution. Patch is on the way.
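
For illustration, option 2 would reshape sa_handle_callback() roughly as follows (a sketch of the intended locking, not the verbatim patch): every sa_has_callback() check runs under lli_sa_lock, so the list_empty() test is ordered against the RPC handler's list_add_tail(), which happens under the same lock.

static void sa_handle_callback(struct ll_statahead_info *sai)
{
        struct ll_inode_info *lli;

        lli = ll_i2info(sai->sai_dentry->d_inode);

        spin_lock(&lli->lli_sa_lock);
        while (sa_has_callback(sai)) {
                struct sa_entry *entry;

                entry = list_entry(sai->sai_interim_entries.next,
                                   struct sa_entry, se_list);
                list_del_init(&entry->se_list);
                spin_unlock(&lli->lli_sa_lock);

                /* may block; must not be called with the spinlock held */
                sa_instantiate(sai, entry);

                spin_lock(&lli->lli_sa_lock);
        }
        spin_unlock(&lli->lli_sa_lock);
}

Since the ptlrpcd updates both sai_interim_entries and sai_replied under lli_sa_lock, acquiring the lock gives the statahead thread a coherent view of both without relying on explicit memory barriers.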



 Comments   
Comment by Gerrit Updater [ 25/Apr/19 ]

Ann Koehler (amk@cray.com) uploaded a new patch: https://review.whamcloud.com/34760
Subject: LU-12221 statahead: sa_handle_callback get lli_sa_lock earlier
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 702009c4b4f74ad77faf3b6743a21223faeca862

Comment by Andreas Dilger [ 25/Apr/19 ]

I had to laugh a bit at your comment:

It can be easily reproduced with a heavy load of LTP statahead tests on a system with ~300 ARM nodes with 64 cores per node.

Only in HPC/Cray is a 300 clients x 64 cores test an "easy reproducer".

Thanks as always for your excellent analysis. I agree that using the spinlock is more easily understood and maintained compared to a memory barrier.

Comment by Ann Koehler (Inactive) [ 30/Apr/19 ]

Andreas, Glad you caught the humor. I chuckled a bit myself when I wrote that. But in the Cray world, being able to consistently reproduce a bug with a small test suite on a specific configuration does constitute easy even if the config is not one many people have.

Comment by Gerrit Updater [ 25/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34760/
Subject: LU-12221 statahead: sa_handle_callback get lli_sa_lock earlier
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 31ef093c219798696e25535941f48636af662802

Comment by Peter Jones [ 25/May/19 ]

Landed for 2.13
