
Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.8.0
Labels: llnl
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

On our 2.8 DNE testbed, we fairly often see MDCs get stuck in the EVICTED state. The clients are running Lustre 2.8.0_0.0.llnlpreview.18 (see the lustre-release-fe-llnl repo).

The MDC seems to be permanently stuck. See the following example:

[root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# cat state
current_state: EVICTED
state_history:
 - [ 1470865592, DISCONN ]
 - [ 1470865612, CONNECTING ]
 - [ 1470865667, DISCONN ]
 - [ 1470865687, CONNECTING ]
 - [ 1470865742, DISCONN ]
 - [ 1470865762, CONNECTING ]
 - [ 1470865762, DISCONN ]
 - [ 1470865771, CONNECTING ]
 - [ 1470865771, REPLAY ]
 - [ 1470865771, REPLAY_LOCKS ]
 - [ 1470865771, REPLAY_WAIT ]
 - [ 1470865831, RECOVER ]
 - [ 1470865831, FULL ]
 - [ 1470950043, DISCONN ]
 - [ 1470950043, CONNECTING ]
 - [ 1470950043, EVICTED ]
[root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# date +%s
1471481367

Note that the import appears to have stopped trying to reconnect after the eviction at 1470950043. That was 531324 seconds, or just over six days, before the date output above.
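To check how long an import has been sitting in its last state, something like the following can be used. This is only a minimal sketch: it assumes the state file has exactly the layout shown above, and the function names (parse_import_state, seconds_stuck) are made up for illustration, not part of any Lustre tool. The sample text is a shortened copy of the history above.

```python
import re

# One history entry looks like " - [ 1470950043, EVICTED ]".
ENTRY_RE = re.compile(r"\[\s*(\d+),\s*(\w+)\s*\]")

def parse_import_state(text):
    """Return (current_state, [(epoch_seconds, state), ...]) from a state dump."""
    current = None
    history = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("current_state:"):
            current = line.split(":", 1)[1].strip()
            continue
        match = ENTRY_RE.search(line)
        if match:
            history.append((int(match.group(1)), match.group(2)))
    return current, history

def seconds_stuck(text, now):
    """Return (current_state, seconds since the last recorded transition)."""
    current, history = parse_import_state(text)
    if not history:
        raise ValueError("no state_history entries found")
    return current, now - history[-1][0]

# Shortened copy of the state file shown above.
SAMPLE = """current_state: EVICTED
state_history:
 - [ 1470865831, FULL ]
 - [ 1470950043, DISCONN ]
 - [ 1470950043, CONNECTING ]
 - [ 1470950043, EVICTED ]
"""

if __name__ == "__main__":
    # "now" is the value printed by date +%s in the example above.
    state, age = seconds_stuck(SAMPLE, now=1471481367)
    print(state, age, round(age / 86400, 2))
```

Run against the sample, this reports EVICTED with an age of 531324 seconds. Passing the current time instead of the fixed value would let the same check run across all MDC state files on a node.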

On the client's console I see:

2016-08-17 17:24:02 [705378.725578] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:24:02 [705378.741582] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:24:02 [705378.754003] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:24:02 [705378.786766] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:24:02 [705378.799218] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:24:02 [705378.819955] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.145608] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:34:02 [705979.161601] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.174037] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:34:02 [705979.206710] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.219194] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:34:02 [705979.239870] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:35:08 [706044.254083] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-08-17 17:39:41 [706317.503378] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-08-17 17:44:02 [706579.565744] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:44:03 [706579.581674] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:44:03 [706579.594086] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37->lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:44:03 [706579.626684] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:44:03 [706579.639068] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:44:03 [706579.659692] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages

The "lquake-MDT000a_UUID: rc = -110 waiting for callback" message (-110 is -ETIMEDOUT) repeats every ten minutes, always for the same request, req@ffff880f415d8000.
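The repeat interval can be confirmed from the kernel uptime stamps in the console output. The following is a rough sketch, assuming each console line carries a bracketed uptime stamp as in the log above; message_intervals is a hypothetical helper name, and the sample lines are copied from the log.

```python
import re

# Kernel uptime stamp as printed on the console, e.g. "[705378.725578]".
STAMP_RE = re.compile(r"\[\s*(\d+\.\d+)\]")

def message_intervals(lines, needle="waiting for callback"):
    """Return the gaps in seconds between consecutive lines containing needle."""
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        match = STAMP_RE.search(line)
        if match:
            stamps.append(float(match.group(1)))
    return [round(later - earlier, 2) for earlier, later in zip(stamps, stamps[1:])]

# The three "waiting for callback" lines from the console log above.
CONSOLE = [
    "2016-08-17 17:24:02 [705378.725578] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)",
    "2016-08-17 17:34:02 [705979.145608] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)",
    "2016-08-17 17:44:02 [706579.565744] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)",
]

if __name__ == "__main__":
    print(message_intervals(CONSOLE))
```

For the three lines above the gaps come out at roughly 600 seconds each, which matches the ten-minute repeat.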

A statahead (SA) thread on this client is blocked waiting on an RPC:

PID: 27781  TASK: ffff880f6be4e780  CPU: 5   COMMAND: "ll_sa_25078"
 #0 [ffff88202318b700] __schedule+0x295 at ffffffff81651975
 #1 [ffff88202318b768] schedule+0x29 at ffffffff81652049
 #2 [ffff88202318b778] schedule_timeout+0x175 at ffffffff8164fa75
 #3 [ffff88202318b820] ptlrpc_set_wait+0x4c0 at ffffffffa0dafda0 [ptlrpc]
 #4 [ffff88202318b8c8] ptlrpc_queue_wait+0x7d at ffffffffa0db025d [ptlrpc]
 #5 [ffff88202318b8e8] mdc_getpage+0x1e1 at ffffffffa0fadf61 [mdc]
 #6 [ffff88202318b9c8] mdc_read_page_remote+0x135 at ffffffffa0fae535 [mdc]
 #7 [ffff88202318ba48] do_read_cache_page+0x7f at ffffffff81170cbf
 #8 [ffff88202318ba90] read_cache_page+0x1c at ffffffff81170e1c
 #9 [ffff88202318baa0] mdc_read_page+0x1b4 at ffffffffa0fab314 [mdc]
#10 [ffff88202318bb90] lmv_read_striped_page+0x5f8 at ffffffffa0ff14a7 [lmv]
#11 [ffff88202318bca8] lmv_read_page+0x521 at ffffffffa0fe34e1 [lmv]
#12 [ffff88202318bd00] ll_get_dir_page+0xc8 at ffffffffa1015178 [lustre]
#13 [ffff88202318bd40] ll_statahead_thread+0x2bc at ffffffffa10691cc [lustre]
#14 [ffff88202318bec8] kthread+0xcf at ffffffff810a997f
#15 [ffff88202318bf50] ret_from_fork+0x58 at ffffffff8165d658

Another client node that has had a single MDC (out of 16) stuck in EVICTED state for nearly 7 days also has a single SA thread stuck:

crash> bt -sx 166895
PID: 166895  TASK: ffff880fdcd6a280  CPU: 1   COMMAND: "ll_sa_166412"
 #0 [ffff880f0f213700] __schedule+0x295 at ffffffff81651975
 #1 [ffff880f0f213768] schedule+0x29 at ffffffff81652049
 #2 [ffff880f0f213778] schedule_timeout+0x175 at ffffffff8164fa75
 #3 [ffff880f0f213820] ptlrpc_set_wait+0x4c0 at ffffffffa0dc4da0 [ptlrpc]
 #4 [ffff880f0f2138c8] ptlrpc_queue_wait+0x7d at ffffffffa0dc525d [ptlrpc]
 #5 [ffff880f0f2138e8] mdc_getpage+0x1e1 at ffffffffa0fc2f61 [mdc]
 #6 [ffff880f0f2139c8] mdc_read_page_remote+0x135 at ffffffffa0fc3535 [mdc]
 #7 [ffff880f0f213a48] do_read_cache_page+0x7f at ffffffff81170cbf
 #8 [ffff880f0f213a90] read_cache_page+0x1c at ffffffff81170e1c
 #9 [ffff880f0f213aa0] mdc_read_page+0x1b4 at ffffffffa0fc0314 [mdc]
#10 [ffff880f0f213b90] lmv_read_striped_page+0x5f8 at ffffffffa10064a7 [lmv]
#11 [ffff880f0f213ca8] lmv_read_page+0x521 at ffffffffa0ff84e1 [lmv]
#12 [ffff880f0f213d00] ll_get_dir_page+0xc8 at ffffffffa102a178 [lustre]
#13 [ffff880f0f213d40] ll_statahead_thread+0x2bc at ffffffffa107e1cc [lustre]
#14 [ffff880f0f213ec8] kthread+0xcf at ffffffff810a997f
#15 [ffff880f0f213f50] ret_from_fork+0x58 at ffffffff8165d658

I can't say for certain that statahead (SA) is implicated, though. It could simply be that the SA thread is hanging because someone discovered the problem by running "ls".

Attachments

Issue Links

is related to

LU-7434 lost bulk leads to a hang

Resolved

mdc stuck in EVICTED state
