Details
- Type: Bug
- Resolution: Duplicate
- Priority: Major
- Labels: None
- Affects Version: Lustre 2.8.0
- Severity: 3
Description
On our 2.8 DNE testbed we fairly frequently see MDCs get stuck in the EVICTED state. The clients are running Lustre 2.8.0_0.0.llnlpreview.18 (see the lustre-release-fe-llnl repo).
The MDC seems to be permanently stuck. See the following example:
[root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# cat state
current_state: EVICTED
state_history:
 - [ 1470865592, DISCONN ]
 - [ 1470865612, CONNECTING ]
 - [ 1470865667, DISCONN ]
 - [ 1470865687, CONNECTING ]
 - [ 1470865742, DISCONN ]
 - [ 1470865762, CONNECTING ]
 - [ 1470865762, DISCONN ]
 - [ 1470865771, CONNECTING ]
 - [ 1470865771, REPLAY ]
 - [ 1470865771, REPLAY_LOCKS ]
 - [ 1470865771, REPLAY_WAIT ]
 - [ 1470865831, RECOVER ]
 - [ 1470865831, FULL ]
 - [ 1470950043, DISCONN ]
 - [ 1470950043, CONNECTING ]
 - [ 1470950043, EVICTED ]
[root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# date +%s
1471481367
Note that it appears to have stopped trying to connect after the eviction, and that was apparently over six days ago.
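For the record, the gap between the EVICTED transition recorded in state_history (1470950043) and the date +%s output (1471481367) works out as shown below. The same per-import state files can also be surveyed across all MDCs with lctl; this is only a sketch, assuming the stock lctl get_param interface on 2.8.

# Time since the EVICTED transition, from the timestamps above
echo $(( 1471481367 - 1470950043 ))            # 531324 seconds
echo $(( (1471481367 - 1470950043) / 86400 ))  # 6 whole days, i.e. "over six days ago"

# Survey the connection state of every MDC import on this client
lctl get_param mdc.*.state | grep -E 'state=$|current_state'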
On the client's console I see:
2016-08-17 17:24:02 [705378.725578] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:24:02 [705378.741582] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:24:02 [705378.754003] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:24:02 [705378.786766] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:24:02 [705378.799218] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:24:02 [705378.819955] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.145608] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:34:02 [705979.161601] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.174037] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:34:02 [705979.206710] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.219194] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:34:02 [705979.239870] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:35:08 [706044.254083] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-08-17 17:39:41 [706317.503378] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-08-17 17:44:02 [706579.565744] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:44:03 [706579.581674] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:44:03 [706579.594086] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:44:03 [706579.626684] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:44:03 [706579.639068] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:44:03 [706579.659692] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
The "lquake-MDT000a_UUID: rc = -110 waiting for callback" repeats every ten minutes.
PID: 27781 TASK: ffff880f6be4e780 CPU: 5 COMMAND: "ll_sa_25078"
 #0 [ffff88202318b700] __schedule+0x295 at ffffffff81651975
 #1 [ffff88202318b768] schedule+0x29 at ffffffff81652049
 #2 [ffff88202318b778] schedule_timeout+0x175 at ffffffff8164fa75
 #3 [ffff88202318b820] ptlrpc_set_wait+0x4c0 at ffffffffa0dafda0 [ptlrpc]
 #4 [ffff88202318b8c8] ptlrpc_queue_wait+0x7d at ffffffffa0db025d [ptlrpc]
 #5 [ffff88202318b8e8] mdc_getpage+0x1e1 at ffffffffa0fadf61 [mdc]
 #6 [ffff88202318b9c8] mdc_read_page_remote+0x135 at ffffffffa0fae535 [mdc]
 #7 [ffff88202318ba48] do_read_cache_page+0x7f at ffffffff81170cbf
 #8 [ffff88202318ba90] read_cache_page+0x1c at ffffffff81170e1c
 #9 [ffff88202318baa0] mdc_read_page+0x1b4 at ffffffffa0fab314 [mdc]
#10 [ffff88202318bb90] lmv_read_striped_page+0x5f8 at ffffffffa0ff14a7 [lmv]
#11 [ffff88202318bca8] lmv_read_page+0x521 at ffffffffa0fe34e1 [lmv]
#12 [ffff88202318bd00] ll_get_dir_page+0xc8 at ffffffffa1015178 [lustre]
#13 [ffff88202318bd40] ll_statahead_thread+0x2bc at ffffffffa10691cc [lustre]
#14 [ffff88202318bec8] kthread+0xcf at ffffffff810a997f
#15 [ffff88202318bf50] ret_from_fork+0x58 at ffffffff8165d658
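Both the import.c messages and the backtrace point at the same o37 readpage request (xid 1542220409542908) stuck in the "Unregistering" phase, which presumably is what keeps ptlrpc_invalidate_import() from ever finishing. A sketch of how to poke at the stuck import from the shell, assuming the standard mdc.*.import proc file and the usual lctl debug-log interface (the output path is arbitrary):

# Full status of the stuck import: state, connect flags, in-flight RPC count
lctl get_param mdc.lquake-MDT000a-mdc-*.import

# Dump the Lustre debug log and look for the stuck xid
lctl dk /tmp/lustre-debug.txt
grep 1542220409542908 /tmp/lustre-debug.txt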
Another client node, which has had a single MDC (out of 16) stuck in the EVICTED state for nearly 7 days, also has a single SA thread stuck:
crash> bt -sx 166895 PID: 166895 TASK: ffff880fdcd6a280 CPU: 1 COMMAND: "ll_sa_166412" #0 [ffff880f0f213700] __schedule+0x295 at ffffffff81651975 #1 [ffff880f0f213768] schedule+0x29 at ffffffff81652049 #2 [ffff880f0f213778] schedule_timeout+0x175 at ffffffff8164fa75 #3 [ffff880f0f213820] ptlrpc_set_wait+0x4c0 at ffffffffa0dc4da0 [ptlrpc] #4 [ffff880f0f2138c8] ptlrpc_queue_wait+0x7d at ffffffffa0dc525d [ptlrpc] #5 [ffff880f0f2138e8] mdc_getpage+0x1e1 at ffffffffa0fc2f61 [mdc] #6 [ffff880f0f2139c8] mdc_read_page_remote+0x135 at ffffffffa0fc3535 [mdc] #7 [ffff880f0f213a48] do_read_cache_page+0x7f at ffffffff81170cbf #8 [ffff880f0f213a90] read_cache_page+0x1c at ffffffff81170e1c #9 [ffff880f0f213aa0] mdc_read_page+0x1b4 at ffffffffa0fc0314 [mdc] #10 [ffff880f0f213b90] lmv_read_striped_page+0x5f8 at ffffffffa10064a7 [lmv] #11 [ffff880f0f213ca8] lmv_read_page+0x521 at ffffffffa0ff84e1 [lmv] #12 [ffff880f0f213d00] ll_get_dir_page+0xc8 at ffffffffa102a178 [lustre] #13 [ffff880f0f213d40] ll_statahead_thread+0x2bc at ffffffffa107e1cc [lustre] #14 [ffff880f0f213ec8] kthread+0xcf at ffffffff810a997f #15 [ffff880f0f213f50] ret_from_fork+0x58 at ffffffff8165d658
I can't necessarily say that SA is implicated, though. It could simply be that SA is hanging because someone discovered the problem by running "ls".
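If it would help to rule statahead in or out, one low-impact experiment (a sketch only; the tunable name assumes the usual llite statahead knob) is to disable statahead on a client and see whether MDCs still wedge in EVICTED:

# Disable directory statahead on this client (0 = off; the default is typically 32)
lctl set_param llite.*.statahead_max=0

# Verify the setting and check for lingering ll_sa_<pid> kernel threads
lctl get_param llite.*.statahead_max
ps ax | grep '[l]l_sa_'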
Issue Links
- is related to: LU-7434 lost bulk leads to a hang (Resolved)