Details
Type: Bug
Resolution: Fixed
Priority: Major
Affects Version: Lustre 2.11.0
Description
crash_arm64> dmesg | grep Build
[ 409.666886] Lustre: Lustre: Build Version: 2.11.0.200_cray_79_g4c42971
crash_arm64> dmesg | grep ': 102742:'
[63861.735486] LustreError: 102742:0:(statahead.c:478:ll_sai_put()) ASSERTION( !sa_has_callback(sai) ) failed:
[63861.744009] LustreError: 102742:0:(statahead.c:478:ll_sai_put()) LBUG
crash_arm64> bt
PID: 102742  TASK: ffff80be57ca8000  CPU: 96  COMMAND: "ll_sa_102181"
 #0 [ffff00001fceba80] __crash_kexec at ffff000008138910
 #1 [ffff00001fcebbb0] (null) at ffffffc4
 #2 [ffff00001fcebbe0] __const_udelay at ffff00000837dba0
 #3 [ffff00001fcebbf0] panic at ffff0000081c6d6c
 #4 [ffff00001fcebcc0] lbug_with_loc at ffff000000a7c9e4 [libcfs]
 #5 [ffff00001fcebce0] ll_sai_put at ffff0000013d5658 [lustre] @ llite/statahead.c: 476
 #6 [ffff00001fcebd50] ll_statahead_thread at ffff0000013da360 [lustre] @ llite/statahead.c: 1204
 #7 [ffff00001fcebe70] kthread at ffff0000080cb3a4
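The assertion that fires at statahead.c:478 checks that the sai_interim_entries list has been drained before the statahead info is torn down. As far as I can tell, sa_has_callback() is just an emptiness test on that list; a rough paraphrase (the exact source may differ slightly):

/* entries are queued on sai_interim_entries by ll_statahead_interpret()
 * and are supposed to be drained by sa_handle_callback() */
static inline int sa_has_callback(struct ll_statahead_info *sai)
{
	return !list_empty(&sai->sai_interim_entries);
}

/* in ll_sai_put(), once the last reference is dropped:
 *	LASSERT(!sa_has_callback(sai));   <-- statahead.c:478, the LBUG above
 */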
This LBUG was first seen with a Lustre 2.7 client. It is easily reproduced by a heavy load of LTP statahead tests on a system of ~300 ARM nodes with 64 cores per node.
The situation at the time of the LBUG: the ll_statahead_thread ll_sa_102181 has been stopped (sai_thread.t_flags == SVC_STOPPED) and is making its final call to ll_sai_put before returning through the out: code block. The LBUG fires because a new entry has been placed on the sai_interim_entries list: a reply from the MDS was received and handled by ll_statahead_interpret. The ll_statahead_thread did not wait for that RPC to be completely processed before trying to exit.
Dump available upon request.
Values from bt -FF:
ll_inode_info ffff809d0d6d99c0
inode ffff809d0d6d9a48
ll_statahead_info ffff809d04566800
crash_arm64> epython sb_inodes -l ffff80be10087800 | grep ffff809d0d6d9a48
0xffff80be10087800 lustre 0xffff809d0d6d9a48 144119544061917665 4096 1 0xffff809d80728cc0 /lus/snx11260/ostest.vers/alsorun.20190125183527.25496.ranger/CL_LLTPreadlinkat02.3.N2EP2O.1548522513/reaIck9AT (deleted)
struct ll_statahead_info {
sai_dentry = 0xffff809d80728cc0,
sai_refcount = {
counter = 0
},
sai_max = 2,
sai_sent = 1,
sai_replied = 1,
sai_index = 2,
sai_index_wait = 0,
sai_hit = 0,
sai_miss = 0,
sai_consecutive_miss = 0,
sai_miss_hidden = 0,
sai_skip_hidden = 0,
sai_ls_all = 0,
sai_agl_valid = 0,
sai_in_readpage = 0,
....
sai_thread = {
t_link = {
next = 0x0,
prev = 0x0
},
t_data = 0x0,
t_flags = 1,
t_id = 0,
t_pid = 102742,
....
sai_interim_entries = {
next = 0xffff80be5aae9480,
prev = 0xffff80be5aae9480
},
....
struct sa_entry {
se_list = {
next = 0xffff809d04566958,
prev = 0xffff809d04566958
},
se_hash = {
next = 0xffff809d04566988,
prev = 0xffff809d04566988
},
se_index = 1,
se_handle = 5636371587488280727,
se_state = SA_ENTRY_INIT,
se_size = 128,
se_minfo = 0xffff80be5b17c400,
se_req = 0xffff80be345d1340,
se_inode = 0x0,
se_qstr = {
{
{
hash = 3580636640,
len = 12
},
hash_len = 55120244192
},
name = 0xffff80be5aae94f0 "symlink_file"
},
se_fid = {
f_seq = 8590194229,
f_oid = 31203,
f_ver = 0
}
crash_arm64> ll_statahead_info.sai_thread ffff809d04566800 | grep t_flags
t_flags = 1, == SVC_STOPPED
crash_arm64> epython dumprpc 0xffff80be345d1340
thread pid ptlrpc_request xid nid opc phase:flags R:W sent/deadline ptlrpc_body
==================================================================================================================================================
7495 0xffff80be345d1340 1623677299024992 snx11260-MDT0000 COMP:R 0:0 1548522747/1548522842 0xffff80be1295d838
I think the likely cause of the bug is a tricky timing issue with ARM processors due to their weak memory consistency model. There is a race between a ptlrpcd daemon handling a statahead reply and the ll_statahead_thread clearing entries from its reply list (sai_interim_entries) as the thread terminates: the ll_statahead_thread sees the updates made by the RPC handler out of order.
While terminating, the ll_statahead_thread waits for replies to all of the RPCs it has already sent. It then clears the entries from the reply list, populated by the RPC handler, by calling sa_handle_callback. If the list is empty, sa_handle_callback does nothing.
1184 while (sai->sai_sent != sai->sai_replied) {
1185 /* in case we're not woken up, timeout wait */
1186 lwi = LWI_TIMEOUT(msecs_to_jiffies(MSEC_PER_SEC >> 3),
1187 NULL, NULL);
1188 l_wait_event(sa_thread->t_ctl_waitq,
1189 sai->sai_sent == sai->sai_replied, &lwi);
1190 }
1191
1192 /* release resources held by statahead RPCs */
1193 sa_handle_callback(sai);
.....
1204 ll_sai_put(sai); <---- LBUG
Meanwhile a reply is received and handled by a ptlrpcd daemon. It executes the following code from ll_statahead_interpret:
700 spin_lock(&lli->lli_sa_lock);
701 if (rc != 0) {
702 if (__sa_make_ready(sai, entry, rc))
703 waitq = &sai->sai_waitq;
704 } else {
705 entry->se_minfo = minfo;
706 entry->se_req = ptlrpc_request_addref(req);
....
711 entry->se_handle = handle;
712 if (!sa_has_callback(sai))
713 waitq = &sai->sai_thread.t_ctl_waitq;
714
715 list_add_tail(&entry->se_list, &sai->sai_interim_entries); <--- list of rpc replies
716 }
717 sai->sai_replied++;
718
719 smp_mb();
720 if (waitq != NULL)
721 wake_up(waitq);
722 spin_unlock(&lli->lli_sa_lock);
The LBUG occurs because sai_sent equals sai_replied but the sai_interim_entries list is not empty. The likely explanation for this state is that ll_statahead_thread sees the increment of sai_replied made by the ptlrpcd daemon but does not yet see the update to the sai_interim_entries list. ll_statahead_thread exits the wait event loop and calls sa_handle_callback, which sees an empty list and returns. But by the time ll_sai_put is called, the update to sai_interim_entries is visible and triggers the assertion.
645 static void sa_handle_callback(struct ll_statahead_info *sai)
646 {
647 struct ll_inode_info *lli;
648
649 lli = ll_i2info(sai->sai_dentry->d_inode);
650
651 while (sa_has_callback(sai)) { <--- list update not yet visible
652 struct sa_entry *entry;
653
654 spin_lock(&lli->lli_sa_lock);
655 if (unlikely(!sa_has_callback(sai))) {
656 spin_unlock(&lli->lli_sa_lock);
657 break;
658 }
659 entry = list_entry(sai->sai_interim_entries.next,
660 struct sa_entry, se_list);
661 list_del_init(&entry->se_list);
662 spin_unlock(&lli->lli_sa_lock);
663
664 sa_instantiate(sai, entry);
665 }
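Purely as an illustration of that ordering (this is not Lustre code; all names below are made up), the access pattern reduces to a locked writer and an unlocked reader: the writer adds an entry and then bumps a counter inside a critical section, and the reader polls the counter and then checks the list without taking the lock. On x86 the reader effectively never observes the counter before the entry; ARM's weaker model allows it. The window is tiny, so this mostly documents the pattern rather than reliably reproducing it:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool entry_visible;	/* stands in for sai_interim_entries */
static atomic_int  replied;		/* stands in for sai_replied */
static const int   sent = 1;		/* stands in for sai_sent */

/* stands in for ll_statahead_interpret() running in a ptlrpcd thread */
static void *writer(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	/* "list_add_tail(...)" -- a plain store in the kernel, relaxed here */
	atomic_store_explicit(&entry_visible, true, memory_order_relaxed);
	/* "sai->sai_replied++" */
	atomic_store_explicit(&replied, 1, memory_order_relaxed);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* stands in for the terminating ll_statahead_thread */
static void *reader(void *arg)
{
	(void)arg;
	/* the unlocked l_wait_event() condition: sai_sent == sai_replied */
	while (atomic_load_explicit(&replied, memory_order_relaxed) != sent)
		;
	/* the unlocked sa_has_callback() check: on ARM the increment can
	 * become visible before the list update, so this can still read
	 * "empty" even though the entry was added first */
	if (!atomic_load_explicit(&entry_visible, memory_order_relaxed))
		printf("saw replied == sent but no entry -- same window as the LBUG\n");
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}

Built with -pthread. In the kernel the corresponding stores and loads are plain accesses, which on ARM behave like the relaxed atomics used here; the spin_lock in the writer does not order them for a reader that never takes the lock.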
There are two possible fixes:
1. Add a memory barrier in ll_statahead_interpret to enforce the expected update ordering.
2. Hold the lli_sa_lock spinlock whenever sa_has_callback() is called in sa_handle_callback().
I think 2 is the better, more easily understood, and more maintainable solution; a rough sketch of what it could look like is below. Patch is on the way.
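For what it's worth, a sketch of option 2, keeping the body of the existing loop but never evaluating sa_has_callback() outside lli_sa_lock (the actual patch may well look different):

static void sa_handle_callback(struct ll_statahead_info *sai)
{
	struct ll_inode_info *lli;

	lli = ll_i2info(sai->sai_dentry->d_inode);

	spin_lock(&lli->lli_sa_lock);
	while (sa_has_callback(sai)) {
		struct sa_entry *entry;

		/* sa_has_callback() now only runs under lli_sa_lock, so the
		 * list_add_tail() done by ll_statahead_interpret() under the
		 * same lock is guaranteed to be visible here, even on weakly
		 * ordered CPUs */
		entry = list_entry(sai->sai_interim_entries.next,
				   struct sa_entry, se_list);
		list_del_init(&entry->se_list);
		spin_unlock(&lli->lli_sa_lock);

		sa_instantiate(sai, entry);

		spin_lock(&lli->lli_sa_lock);
	}
	spin_unlock(&lli->lli_sa_lock);
}

Taking and dropping lli_sa_lock around every iteration mirrors what the existing code already does; the only behavioural change is that the emptiness check is no longer a racy unlocked read.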