Details
Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.12.3
Labels: None
Environment: CentOS 7.6, MOFED 4.7, Lustre 2.12.3
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
Hi,
We are seeing a few clients (5) evicted by MDTs or OSTs that never reconnect. We have only seen this since the upgrade to 2.12.3.
Example with sh-103-53 10.9.103.53@o2ib4:
[root@sh-103-53 ~]# lfs df -v /scratch/
UUID                   1K-blocks        Used   Available Use% Mounted on
fir-MDT0000_UUID     18287292984  8929938396  8420624916  52% /scratch[MDT:0] f
fir-MDT0001_UUID     18287292984  4272171644 13078803336  25% /scratch[MDT:1] f
MDT0002             : inactive device
fir-MDT0003_UUID     18287292984  3657699480 13693239252  22% /scratch[MDT:3] f
OST0000             : inactive device
OST0001             : inactive device
fir-OST0002_UUID     61986877596 30702519208 30658586164  51% /scratch[OST:2]
fir-OST0003_UUID     61986877596 30350994280 31010091948  50% /scratch[OST:3]
fir-OST0004_UUID     61986877596 30501193096 30860120040  50% /scratch[OST:4]
fir-OST0005_UUID     61986877596 30550135432 30811199588  50% /scratch[OST:5]
fir-OST0006_UUID     61986877596 30070489688 31290937284  50% /scratch[OST:6]
fir-OST0007_UUID     61986877596 30744050608 30617311348  51% /scratch[OST:7]
fir-OST0008_UUID     61986877596 30493910864 30867362968  50% /scratch[OST:8]
fir-OST0009_UUID     61986877596 30649739912 30711554208  50% /scratch[OST:9]
fir-OST000a_UUID     61986877596 29661667812 31699536896  49% /scratch[OST:10]
fir-OST000b_UUID     61986877596 29826938776 31534313376  49% /scratch[OST:11]
OST000c             : inactive device
<hung>
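To compare the affected clients quickly, the inactive targets can be pulled out of captured `lfs df` output. The following is only a minimal sketch: the helper name `inactive_targets` is hypothetical, and it assumes the `<target> : inactive device` wording shown above.

```python
import re

def inactive_targets(lfs_df_output):
    """Return the target names that lfs df reports as 'inactive device'."""
    # Matches e.g. "MDT0002 : inactive device" or "OST000c : inactive device".
    pattern = r"\b((?:MDT|OST)[0-9a-fA-F]{4})\s*:\s*inactive device"
    return re.findall(pattern, lfs_df_output)

# Excerpt of the output captured on sh-103-53 (numeric columns omitted).
SAMPLE = """
fir-MDT0001_UUID ... /scratch[MDT:1] f
MDT0002 : inactive device
fir-MDT0003_UUID ... /scratch[MDT:3] f
OST0000 : inactive device
OST0001 : inactive device
fir-OST0002_UUID ... /scratch[OST:2]
OST000c : inactive device
"""

print(inactive_targets(SAMPLE))
# prints ['MDT0002', 'OST0000', 'OST0001', 'OST000c']
```

Note that on this client `lfs df` hung after OST000c, so the list may be incomplete.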
They stay in the evicted state.
- MDT
[root@sh-103-53 ~]# cd /proc/fs/lustre/mdc/fir-MDT0002-mdc-ffff9781f2230800/
[root@sh-103-53 fir-MDT0002-mdc-ffff9781f2230800]# date +%s; cat state
1572895124
current_state: EVICTED
state_history:
 - [ 1572839940, CONNECTING ]
 - [ 1572839995, DISCONN ]
 - [ 1572840015, CONNECTING ]
 - [ 1572840070, DISCONN ]
 - [ 1572840090, CONNECTING ]
 - [ 1572840145, DISCONN ]
 - [ 1572840165, CONNECTING ]
 - [ 1572840165, REPLAY ]
 - [ 1572840165, REPLAY_LOCKS ]
 - [ 1572841556, CONNECTING ]
 - [ 1572841556, EVICTED ]
 - [ 1572841556, RECOVER ]
 - [ 1572841556, FULL ]
 - [ 1572888266, DISCONN ]
 - [ 1572888266, CONNECTING ]
 - [ 1572888266, EVICTED ]
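To quantify how long the import has been stuck, the `state` output above can be parsed and compared with the `date +%s` value. This is a rough sketch, not a Lustre tool: the function names are hypothetical, and it assumes the last `state_history` entry marks the transition into `current_state`, which holds for the output above.

```python
import re

def parse_import_state(text):
    """Return (current_state, [(timestamp, state), ...]) from a 'state' file."""
    current = re.search(r"current_state:\s*(\w+)", text).group(1)
    history = [(int(ts), st)
               for ts, st in re.findall(r"\[\s*(\d+),\s*(\w+)\s*\]", text)]
    return current, history

def seconds_in_current_state(text, now):
    """Seconds elapsed since the last recorded state transition."""
    current, history = parse_import_state(text)
    last_ts, last_state = history[-1]
    # Assumption: the most recent history entry is the current state.
    if last_state != current:
        raise ValueError("last history entry does not match current_state")
    return now - last_ts

# Tail of the fir-MDT0002 state output from sh-103-53.
STATE = """
current_state: EVICTED
state_history:
 - [ 1572841556, FULL ]
 - [ 1572888266, DISCONN ]
 - [ 1572888266, CONNECTING ]
 - [ 1572888266, EVICTED ]
"""

print(seconds_in_current_state(STATE, now=1572895124))
# prints 6858
```

With the values above, the client had been sitting in EVICTED for 6858 seconds (almost two hours) without another reconnect attempt being recorded.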
Logs from the MDS fir-md1-s3 (10.0.10.53@o2ib7):
Nov 04 09:24:19 fir-md1-s3 kernel: LustreError: 40361:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.9.103.53@o2ib4 ns: mdt-fir-MDT0002_UUID lock: ffff9a5eea6057c0/0x3428b9d2e97b844b lrc: 3/0,0 mode: PW/PW res: [0x2c0032e40:0x1:0x0].0x0 bits 0x40/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.9.103.53@o2ib4 remote: 0x51ab3c4efd129db expref: 24 pid: 43285 timeout: 55884 lvb_type: 0
- OST
[root@sh-103-53 ~]# cd /proc/fs/lustre/osc/fir-OST000c-osc-ffff9781f2230800
[root@sh-103-53 fir-OST000c-osc-ffff9781f2230800]# date +%s; cat state
1572895019
current_state: EVICTED
state_history:
 - [ 1572839919, DISCONN ]
 - [ 1572839940, CONNECTING ]
 - [ 1572839995, DISCONN ]
 - [ 1572840015, CONNECTING ]
 - [ 1572840070, DISCONN ]
 - [ 1572840090, CONNECTING ]
 - [ 1572840145, DISCONN ]
 - [ 1572840165, CONNECTING ]
 - [ 1572840165, REPLAY ]
 - [ 1572840165, REPLAY_LOCKS ]
 - [ 1572840165, REPLAY_WAIT ]
 - [ 1572840409, RECOVER ]
 - [ 1572840409, FULL ]
 - [ 1572888417, DISCONN ]
 - [ 1572888417, CONNECTING ]
 - [ 1572888417, EVICTED ]
Logs from the OSS serving fir-OST000c, fir-io2-s1 (10.0.10.103@o2ib7):
Nov 04 09:26:51 fir-io2-s1 kernel: LustreError: 53462:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.9.103.53@o2ib4 ns: filter-fir-OST000c_UUID lock: ffff966494a91f80/0xe2c2779bcc5329d5 lrc: 3/0,0 mode: PW/PW res: [0x3c0000400:0x1245ef1:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 8388608->536870911) flags: 0x60000400010020 nid: 10.9.103.53@o2ib4 remote: 0x51ab3c4efd129f0 expref: 6 pid: 56541 timeout: 237932 lvb_type: 0
Nov 04 09:30:44 fir-io2-s1 kernel: Lustre: fir-OST000c: haven't heard from client f169454a-d158-5ca7-0fb6-b3c51a09a392 (at 10.9.103.53@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff962e33420000, cur 1572888644 expire 1572888494 last 1572888417
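The timestamps in the second OSS message can be cross-checked against the client's state history. The sketch below is illustrative only: `parse_eviction` is a hypothetical helper, and reading `cur` as the server's current time and `last` as the last time it heard from the client is an assumption based on the field names.

```python
import re

def parse_eviction(line):
    """Extract the timing fields from a "haven't heard from client" message."""
    m = re.search(
        r"in (\d+) seconds\..*?cur (\d+) expire (\d+) last (\d+)", line)
    if m is None:
        return None
    reported, cur, expire, last = map(int, m.groups())
    return {
        "reported": reported,         # seconds quoted in the message text
        "last": last,                 # assumed: last contact from the client
        "silent_for": cur - last,     # should equal the reported value
        "past_expiry": cur - expire,  # how far past the expiry time we are
    }

LOG = ("Nov 04 09:30:44 fir-io2-s1 kernel: Lustre: fir-OST000c: haven't heard "
       "from client f169454a-d158-5ca7-0fb6-b3c51a09a392 (at 10.9.103.53@o2ib4) "
       "in 227 seconds. I think it's dead, and I am evicting it. "
       "exp ffff962e33420000, cur 1572888644 expire 1572888494 last 1572888417")

info = parse_eviction(LOG)
print(info["silent_for"], info["past_expiry"])
# prints 227 150
```

Here `cur - last` gives 227, matching the message, and `last` (1572888417) is the same second at which the client's fir-OST000c import recorded DISCONN, CONNECTING and EVICTED.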
Attaching kernel logs for the client sh-103-53 as sh-103-53.log
Attaching kernel logs for the MDS fir-md1-s3 as fir-md1-s3.log
Attaching kernel logs for the OSS fir-io2-s1 as fir-io2-s1.log
Thanks!
Stephane