Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 2.11.0
Description
crash_arm64> dmesg | grep Build
[  409.666886] Lustre: Lustre: Build Version: 2.11.0.200_cray_79_g4c42971
crash_arm64> dmesg | grep ': 102742:'
[63861.735486] LustreError: 102742:0:(statahead.c:478:ll_sai_put()) ASSERTION( !sa_has_callback(sai) ) failed:
[63861.744009] LustreError: 102742:0:(statahead.c:478:ll_sai_put()) LBUG
crash_arm64> bt
PID: 102742  TASK: ffff80be57ca8000  CPU: 96  COMMAND: "ll_sa_102181"
 #0 [ffff00001fceba80] __crash_kexec at ffff000008138910
 #1 [ffff00001fcebbb0] (null) at ffffffc4
 #2 [ffff00001fcebbe0] __const_udelay at ffff00000837dba0
 #3 [ffff00001fcebbf0] panic at ffff0000081c6d6c
 #4 [ffff00001fcebcc0] lbug_with_loc at ffff000000a7c9e4 [libcfs]
 #5 [ffff00001fcebce0] ll_sai_put at ffff0000013d5658 [lustre] @ llite/statahead.c: 476
 #6 [ffff00001fcebd50] ll_statahead_thread at ffff0000013da360 [lustre] @ llite/statahead.c: 1204
 #7 [ffff00001fcebe70] kthread at ffff0000080cb3a4
This LBUG was first seen with a Lustre 2.7 client. It is easily reproduced by a heavy load of LTP statahead tests on a system of ~300 ARM nodes with 64 cores per node.
The situation at the time of the LBUG: the ll_statahead_thread ll_sa_102181 has been stopped (sai_thread.t_flags == SVC_STOPPED) and is making its final call to ll_sai_put before returning through the out: code block. The LBUG fires because a new entry has been placed on the sai_interim_entries list: a reply from the MDS was received and handled by ll_statahead_interpret, and the ll_statahead_thread did not wait for that rpc to be completely processed before trying to exit.
Dump available upon request.
Values from bt -FF:
ll_inode_info     ffff809d0d6d99c0
inode             ffff809d0d6d9a48
ll_statahead_info ffff809d04566800

crash_arm64> epython sb_inodes -l ffff80be10087800 | grep ffff809d0d6d9a48
0xffff80be10087800 lustre 0xffff809d0d6d9a48 144119544061917665 4096 1 0xffff809d80728cc0 /lus/snx11260/ostest.vers/alsorun.20190125183527.25496.ranger/CL_LLTPreadlinkat02.3.N2EP2O.1548522513/reaIck9AT (deleted)

struct ll_statahead_info {
  sai_dentry = 0xffff809d80728cc0,
  sai_refcount = { counter = 0 },
  sai_max = 2,
  sai_sent = 1,
  sai_replied = 1,
  sai_index = 2,
  sai_index_wait = 0,
  sai_hit = 0,
  sai_miss = 0,
  sai_consecutive_miss = 0,
  sai_miss_hidden = 0,
  sai_skip_hidden = 0,
  sai_ls_all = 0,
  sai_agl_valid = 0,
  sai_in_readpage = 0,
  ....
  sai_thread = {
    t_link = { next = 0x0, prev = 0x0 },
    t_data = 0x0,
    t_flags = 1,
    t_id = 0,
    t_pid = 102742,
  ....
  sai_interim_entries = { next = 0xffff80be5aae9480, prev = 0xffff80be5aae9480 },
  ....

struct sa_entry {
  se_list = { next = 0xffff809d04566958, prev = 0xffff809d04566958 },
  se_hash = { next = 0xffff809d04566988, prev = 0xffff809d04566988 },
  se_index = 1,
  se_handle = 5636371587488280727,
  se_state = SA_ENTRY_INIT,
  se_size = 128,
  se_minfo = 0xffff80be5b17c400,
  se_req = 0xffff80be345d1340,
  se_inode = 0x0,
  se_qstr = { { { hash = 3580636640, len = 12 }, hash_len = 55120244192 }, name = 0xffff80be5aae94f0 "symlink_file" },
  se_fid = { f_seq = 8590194229, f_oid = 31203, f_ver = 0 }

crash_arm64> ll_statahead_info.sai_thread ffff809d04566800 | grep t_flags
  t_flags = 1,   == SVC_STOPPED

crash_arm64> epython dumprpc 0xffff80be345d1340
thread pid ptlrpc_request xid nid opc phase:flags R:W sent/deadline ptlrpc_body
==================================================================================================================================================
7495 0xffff80be345d1340 1623677299024992 snx11260-MDT0000 COMP:R 0:0 1548522747/1548522842 0xffff80be1295d838
I think the likely cause of the bug is a subtle timing issue on ARM processors, which use a weak memory consistency model. There is a race between a ptlrpcd daemon handling a statahead reply and the ll_statahead_thread clearing entries from the replied list as the thread terminates: the ll_statahead_thread sees the updates made by the rpc handler out of order.
While terminating, the ll_statahead_thread waits for replies to all rpcs it has already sent. It then calls sa_handle_callback to clear the entries the rpc handler placed on the reply list. If the list is empty, sa_handle_callback does nothing.
1184         while (sai->sai_sent != sai->sai_replied) {
1185                 /* in case we're not woken up, timeout wait */
1186                 lwi = LWI_TIMEOUT(msecs_to_jiffies(MSEC_PER_SEC >> 3),
1187                                   NULL, NULL);
1188                 l_wait_event(sa_thread->t_ctl_waitq,
1189                              sai->sai_sent == sai->sai_replied, &lwi);
1190         }
1191
1192         /* release resources held by statahead RPCs */
1193         sa_handle_callback(sai);
 ....
1204         ll_sai_put(sai);        <---- LBUG
Meanwhile a reply is received and handled by a ptlrpcd daemon. It executes the following code from ll_statahead_interpret:
 700         spin_lock(&lli->lli_sa_lock);
 701         if (rc != 0) {
 702                 if (__sa_make_ready(sai, entry, rc))
 703                         waitq = &sai->sai_waitq;
 704         } else {
 705                 entry->se_minfo = minfo;
 706                 entry->se_req = ptlrpc_request_addref(req);
 ....
 711                 entry->se_handle = handle;
 712                 if (!sa_has_callback(sai))
 713                         waitq = &sai->sai_thread.t_ctl_waitq;
 714
 715                 list_add_tail(&entry->se_list, &sai->sai_interim_entries);   <--- list of rpc replies
 716         }
 717         sai->sai_replied++;
 718
 719         smp_mb();
 720         if (waitq != NULL)
 721                 wake_up(waitq);
 722         spin_unlock(&lli->lli_sa_lock);
The LBUG occurs because sai_sent equals sai_replied but the sai_interim_entries list is not empty. The likely explanation for this state is that the ll_statahead_thread sees the increment of sai_replied made by the ptlrpcd daemon but does not yet see the update to the sai_interim_entries list. The ll_statahead_thread exits the wait-event loop and calls sa_handle_callback, which sees an empty list and returns. By the time ll_sai_put is called, the update to sai_interim_entries has become visible and triggers the assertion.
 645 static void sa_handle_callback(struct ll_statahead_info *sai)
 646 {
 647         struct ll_inode_info *lli;
 648
 649         lli = ll_i2info(sai->sai_dentry->d_inode);
 650
 651         while (sa_has_callback(sai)) {   <--- list update not yet visible
 652                 struct sa_entry *entry;
 653
 654                 spin_lock(&lli->lli_sa_lock);
 655                 if (unlikely(!sa_has_callback(sai))) {
 656                         spin_unlock(&lli->lli_sa_lock);
 657                         break;
 658                 }
 659                 entry = list_entry(sai->sai_interim_entries.next,
 660                                    struct sa_entry, se_list);
 661                 list_del_init(&entry->se_list);
 662                 spin_unlock(&lli->lli_sa_lock);
 663
 664                 sa_instantiate(sai, entry);
 665         }
There are two possible fixes:
1. Add a memory barrier in ll_statahead_interpret to enforce the expected update ordering.
2. Hold the lli_sa_lock spinlock whenever sa_has_callback() is called in sa_handle_callback().
I think 2 is the better solution: it is easier to understand and to maintain. Patch is on the way.