[LU-8511] mdc stuck in EVICTED state Created: 18/Aug/16 Updated: 31/Oct/16 Resolved: 31/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On our 2.8 DNE testbed, we are seeing, not too infrequently, MDCs that get stuck in the EVICTED state. The clients are running Lustre 2.8.0_0.0.llnlpreview.18 (see the lustre-release-fe-llnl repo). The MDC seems to be permanently stuck. See the following example:

[root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# cat state
current_state: EVICTED
state_history:
 - [ 1470865592, DISCONN ]
 - [ 1470865612, CONNECTING ]
 - [ 1470865667, DISCONN ]
 - [ 1470865687, CONNECTING ]
 - [ 1470865742, DISCONN ]
 - [ 1470865762, CONNECTING ]
 - [ 1470865762, DISCONN ]
 - [ 1470865771, CONNECTING ]
 - [ 1470865771, REPLAY ]
 - [ 1470865771, REPLAY_LOCKS ]
 - [ 1470865771, REPLAY_WAIT ]
 - [ 1470865831, RECOVER ]
 - [ 1470865831, FULL ]
 - [ 1470950043, DISCONN ]
 - [ 1470950043, CONNECTING ]
 - [ 1470950043, EVICTED ]
[root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# date +%s
1471481367

Note that it appears to have stopped trying to connect after the eviction, and that was apparently over six days ago.

On the client's console I see:

2016-08-17 17:24:02 [705378.725578] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:24:02 [705378.741582] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:24:02 [705378.754003] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:24:02 [705378.786766] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:24:02 [705378.799218] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:24:02 [705378.819955] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.145608] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:34:02 [705979.161601] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.174037] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:34:02 [705979.206710] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.219194] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:34:02 [705979.239870] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:35:08 [706044.254083] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-08-17 17:39:41 [706317.503378] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-08-17 17:44:02 [706579.565744] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:44:03 [706579.581674] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:44:03 [706579.594086] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:44:03 [706579.626684] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:44:03 [706579.639068] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:44:03 [706579.659692] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages

The "lquake-MDT000a_UUID: rc = -110 waiting for callback" message repeats every ten minutes.
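The "over six days" figure follows directly from the timestamps reported above (the EVICTED transition at epoch 1470950043 versus the `date +%s` output of 1471481367); a quick shell check, using only those two numbers from the report:

```shell
# Timestamps copied from the state_history and `date +%s` output above.
evicted=1470950043   # epoch of the final DISCONN/CONNECTING/EVICTED transition
now=1471481367       # client's current time when the import was inspected
elapsed=$(( now - evicted ))
echo "stuck for $(( elapsed / 86400 )) days ($elapsed seconds)"
```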
A statahead (ll_sa) thread on the client is stuck with this stack:

PID: 27781  TASK: ffff880f6be4e780  CPU: 5  COMMAND: "ll_sa_25078"
 #0 [ffff88202318b700] __schedule+0x295 at ffffffff81651975
 #1 [ffff88202318b768] schedule+0x29 at ffffffff81652049
 #2 [ffff88202318b778] schedule_timeout+0x175 at ffffffff8164fa75
 #3 [ffff88202318b820] ptlrpc_set_wait+0x4c0 at ffffffffa0dafda0 [ptlrpc]
 #4 [ffff88202318b8c8] ptlrpc_queue_wait+0x7d at ffffffffa0db025d [ptlrpc]
 #5 [ffff88202318b8e8] mdc_getpage+0x1e1 at ffffffffa0fadf61 [mdc]
 #6 [ffff88202318b9c8] mdc_read_page_remote+0x135 at ffffffffa0fae535 [mdc]
 #7 [ffff88202318ba48] do_read_cache_page+0x7f at ffffffff81170cbf
 #8 [ffff88202318ba90] read_cache_page+0x1c at ffffffff81170e1c
 #9 [ffff88202318baa0] mdc_read_page+0x1b4 at ffffffffa0fab314 [mdc]
#10 [ffff88202318bb90] lmv_read_striped_page+0x5f8 at ffffffffa0ff14a7 [lmv]
#11 [ffff88202318bca8] lmv_read_page+0x521 at ffffffffa0fe34e1 [lmv]
#12 [ffff88202318bd00] ll_get_dir_page+0xc8 at ffffffffa1015178 [lustre]
#13 [ffff88202318bd40] ll_statahead_thread+0x2bc at ffffffffa10691cc [lustre]
#14 [ffff88202318bec8] kthread+0xcf at ffffffff810a997f
#15 [ffff88202318bf50] ret_from_fork+0x58 at ffffffff8165d658

Another client node that has had a single MDC (out of 16) stuck in EVICTED state for nearly 7 days also has a single SA thread stuck:

crash> bt -sx 166895
PID: 166895  TASK: ffff880fdcd6a280  CPU: 1  COMMAND: "ll_sa_166412"
 #0 [ffff880f0f213700] __schedule+0x295 at ffffffff81651975
 #1 [ffff880f0f213768] schedule+0x29 at ffffffff81652049
 #2 [ffff880f0f213778] schedule_timeout+0x175 at ffffffff8164fa75
 #3 [ffff880f0f213820] ptlrpc_set_wait+0x4c0 at ffffffffa0dc4da0 [ptlrpc]
 #4 [ffff880f0f2138c8] ptlrpc_queue_wait+0x7d at ffffffffa0dc525d [ptlrpc]
 #5 [ffff880f0f2138e8] mdc_getpage+0x1e1 at ffffffffa0fc2f61 [mdc]
 #6 [ffff880f0f2139c8] mdc_read_page_remote+0x135 at ffffffffa0fc3535 [mdc]
 #7 [ffff880f0f213a48] do_read_cache_page+0x7f at ffffffff81170cbf
 #8 [ffff880f0f213a90] read_cache_page+0x1c at ffffffff81170e1c
 #9 [ffff880f0f213aa0] mdc_read_page+0x1b4 at ffffffffa0fc0314 [mdc]
#10 [ffff880f0f213b90] lmv_read_striped_page+0x5f8 at ffffffffa10064a7 [lmv]
#11 [ffff880f0f213ca8] lmv_read_page+0x521 at ffffffffa0ff84e1 [lmv]
#12 [ffff880f0f213d00] ll_get_dir_page+0xc8 at ffffffffa102a178 [lustre]
#13 [ffff880f0f213d40] ll_statahead_thread+0x2bc at ffffffffa107e1cc [lustre]
#14 [ffff880f0f213ec8] kthread+0xcf at ffffffff810a997f
#15 [ffff880f0f213f50] ret_from_fork+0x58 at ffffffff8165d658

I can't necessarily say that SA is implicated, though. It could simply be that SA is hanging because someone discovered the problem by running "ls". |
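For finding affected clients, the per-import check done by hand above (`cat state` and looking for EVICTED) can be scripted. This is a minimal sketch, not part of Lustre; `find_evicted` is a hypothetical helper, and it assumes a 2.8-era client where the per-import state files live under /proc/fs/lustre/mdc/*/ (the same files are readable via `lctl get_param mdc.*.state`):

```shell
# find_evicted: print the path of every import "state" file whose
# current_state is EVICTED. Takes the directory containing the
# per-import subdirectories; defaults to the client's MDC proc tree.
find_evicted() {
    local root=${1:-/proc/fs/lustre/mdc}
    local f
    for f in "$root"/*/state; do
        [ -r "$f" ] || continue
        grep -q '^current_state: *EVICTED' "$f" && echo "$f"
    done
}
```

On the node shown above, running this with no argument would be expected to print the state file of the stuck lquake-MDT000a import while skipping the 15 healthy ones.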
| Comments |
| Comment by Peter Jones [ 18/Aug/16 ] |
|
Bobijam, could you please advise on this issue? Thanks. Peter |
| Comment by Zhenyu Xu [ 19/Aug/16 ] |
|
I think it relates to |
| Comment by Christopher Morrone [ 06/Oct/16 ] |
|
When can we expect the port to be reviewed? |
| Comment by Peter Jones [ 31/Oct/16 ] |
|
Duplicate of |