[LU-12933] Evicted client doesn't reconnect Created: 04/Nov/19 Updated: 07/Dec/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Peter Jones |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6, MOFED 4.7, Lustre 2.12.3 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi,

We are noticing a few clients (5) being evicted by MDTs or OSTs and not reconnecting. We're only seeing this since the upgrade to 2.12.3. Example with sh-103-53 (10.9.103.53@o2ib4):

[root@sh-103-53 ~]# lfs df -v /scratch/
UUID                   1K-blocks        Used   Available Use% Mounted on
fir-MDT0000_UUID     18287292984  8929938396  8420624916  52% /scratch[MDT:0] f
fir-MDT0001_UUID     18287292984  4272171644 13078803336  25% /scratch[MDT:1] f
MDT0002             : inactive device
fir-MDT0003_UUID     18287292984  3657699480 13693239252  22% /scratch[MDT:3] f
OST0000             : inactive device
OST0001             : inactive device
fir-OST0002_UUID     61986877596 30702519208 30658586164  51% /scratch[OST:2]
fir-OST0003_UUID     61986877596 30350994280 31010091948  50% /scratch[OST:3]
fir-OST0004_UUID     61986877596 30501193096 30860120040  50% /scratch[OST:4]
fir-OST0005_UUID     61986877596 30550135432 30811199588  50% /scratch[OST:5]
fir-OST0006_UUID     61986877596 30070489688 31290937284  50% /scratch[OST:6]
fir-OST0007_UUID     61986877596 30744050608 30617311348  51% /scratch[OST:7]
fir-OST0008_UUID     61986877596 30493910864 30867362968  50% /scratch[OST:8]
fir-OST0009_UUID     61986877596 30649739912 30711554208  50% /scratch[OST:9]
fir-OST000a_UUID     61986877596 29661667812 31699536896  49% /scratch[OST:10]
fir-OST000b_UUID     61986877596 29826938776 31534313376  49% /scratch[OST:11]
OST000c             : inactive device
<hung>

They stay in the evicted state.
[root@sh-103-53 ~]# cd /proc/fs/lustre/mdc/fir-MDT0002-mdc-ffff9781f2230800/
[root@sh-103-53 fir-MDT0002-mdc-ffff9781f2230800]# date +%s; cat state
1572895124
current_state: EVICTED
state_history:
 - [ 1572839940, CONNECTING ]
 - [ 1572839995, DISCONN ]
 - [ 1572840015, CONNECTING ]
 - [ 1572840070, DISCONN ]
 - [ 1572840090, CONNECTING ]
 - [ 1572840145, DISCONN ]
 - [ 1572840165, CONNECTING ]
 - [ 1572840165, REPLAY ]
 - [ 1572840165, REPLAY_LOCKS ]
 - [ 1572841556, CONNECTING ]
 - [ 1572841556, EVICTED ]
 - [ 1572841556, RECOVER ]
 - [ 1572841556, FULL ]
 - [ 1572888266, DISCONN ]
 - [ 1572888266, CONNECTING ]
 - [ 1572888266, EVICTED ]

Logs from the MDS fir-md1-s3 or 10.0.10.53@o2ib7:

Nov 04 09:24:19 fir-md1-s3 kernel: LustreError: 40361:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.9.103.53@o2ib4 ns: mdt-fir-MDT0002_UUID lock: ffff9a5eea6057c0/0x3428b9d2e97b844b lrc: 3/0,0 mode: PW/PW res: [0x2c0032e40:0x1:0x0].0x0 bits 0x40/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.9.103.53@o2ib4 remote: 0x51ab3c4efd129db expref: 24 pid: 43285 timeout: 55884 lvb_type: 0
[root@sh-103-53 ~]# cd /proc/fs/lustre/osc/fir-OST000c-osc-ffff9781f2230800
[root@sh-103-53 fir-OST000c-osc-ffff9781f2230800]# date +%s; cat state
1572895019
current_state: EVICTED
state_history:
 - [ 1572839919, DISCONN ]
 - [ 1572839940, CONNECTING ]
 - [ 1572839995, DISCONN ]
 - [ 1572840015, CONNECTING ]
 - [ 1572840070, DISCONN ]
 - [ 1572840090, CONNECTING ]
 - [ 1572840145, DISCONN ]
 - [ 1572840165, CONNECTING ]
 - [ 1572840165, REPLAY ]
 - [ 1572840165, REPLAY_LOCKS ]
 - [ 1572840165, REPLAY_WAIT ]
 - [ 1572840409, RECOVER ]
 - [ 1572840409, FULL ]
 - [ 1572888417, DISCONN ]
 - [ 1572888417, CONNECTING ]
 - [ 1572888417, EVICTED ]

Logs from the OSS of fir-OST000c (fir-io2-s1 or 10.0.10.103@o2ib7):

Nov 04 09:26:51 fir-io2-s1 kernel: LustreError: 53462:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.9.103.53@o2ib4 ns: filter-fir-OST000c_UUID lock: ffff966494a91f80/0xe2c2779bcc5329d5 lrc: 3/0,0 mode: PW/PW res: [0x3c0000400:0x1245ef1:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 8388608->536870911) flags: 0x60000400010020 nid: 10.9.103.53@o2ib4 remote: 0x51ab3c4efd129f0 expref: 6 pid: 56541 timeout: 237932 lvb_type: 0
Nov 04 09:30:44 fir-io2-s1 kernel: Lustre: fir-OST000c: haven't heard from client f169454a-d158-5ca7-0fb6-b3c51a09a392 (at 10.9.103.53@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff962e33420000, cur 1572888644 expire 1572888494 last 1572888417

Attaching kernel logs for the client sh-103-53 as sh-103-53.log and for the MDS fir-md1-s3 as fir-md1-s3.log.

Thanks!
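For reference, the same per-target check can be repeated for every import on a client without visiting each /proc directory by hand. A minimal sketch, assuming the standard /proc/fs/lustre layout shown above:

# list every MDC/OSC import whose connection is not in the FULL state
for f in /proc/fs/lustre/mdc/*/state /proc/fs/lustre/osc/*/state; do
    dev=$(basename "$(dirname "$f")")
    st=$(awk '/^current_state:/ {print $2}' "$f")
    [ "$st" != "FULL" ] && echo "$dev: $st"
done

On the client above this would print the fir-MDT0002 MDC and fir-OST000c OSC imports as EVICTED.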
|
| Comments |
| Comment by Stephane Thiell [ 04/Nov/19 ] |
|
More info on some of the other clients found in that state (we can see that it's not always the same targets that are impacted). The rest of the nodes on Sherlock (1000+) are apparently fine.
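To survey which targets are impacted across the affected nodes, something like the following can be used (a sketch only; pdsh is assumed and the host list is just the nodes mentioned in this ticket):

pdsh -w sh-103-53,sh-103-68,sh-31-09 \
    "grep -l '^current_state: EVICTED' /proc/fs/lustre/mdc/*/state /proc/fs/lustre/osc/*/state" 2>/dev/null | sort

Each output line is prefixed with the client hostname and names the import directory stuck in EVICTED.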
|
| Comment by Stephane Thiell [ 04/Nov/19 ] |
|
| Comment by Andreas Dilger [ 05/Nov/19 ] |
|
Do the OSC/MDC connections that are inactive on a particular node change over time? For example, if you checked sh-103-68 and sh-31-09 again, do they still have the same inactive connections, or do they have different ones? The clients that are showing fir-MDT0000 as being disconnected might be fallout from a separate issue. Could you try running "lctl --device 8 activate" on e.g. sh-103-68 to see whether the client can reconnect to fir-OST0001 manually? |
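The device number in that command is per-client; a sketch of how one might look it up and retry the connection manually (index 8 is just the example above, and the target name is taken from this ticket):

# find the local device index of the inactive import, then try to reactivate it
lctl dl | grep fir-OST0001                      # first column is the device index, e.g. 8
lctl --device 8 activate                        # ask that import to reconnect
lctl get_param osc.fir-OST0001-osc-*.state      # check whether it leaves EVICTED/DISCONN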
| Comment by Stephane Thiell [ 05/Nov/19 ] |
|
Thanks! At this point we no longer have any clients in that state; they have all been rebooted since then. We now have a test in NHC (Node Health Check) on Sherlock that will detect any new occurrence of inactive target connections. |
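The actual NHC check is site-specific, but a minimal standalone sketch of such a test, assuming the /scratch mount and /proc layout shown earlier in this ticket, could look like:

#!/bin/bash
# fail the node if any Lustre import is stuck in EVICTED,
# or if lfs df reports an inactive target connection
if grep -qs '^current_state: EVICTED' /proc/fs/lustre/mdc/*/state /proc/fs/lustre/osc/*/state; then
    echo "ERROR: Lustre import stuck in EVICTED state" >&2
    exit 1
fi
# lfs df can hang when a target is unreachable (as seen above), so bound it
if timeout 30 lfs df /scratch 2>/dev/null | grep -q 'inactive device'; then
    echo "ERROR: inactive Lustre target connection on /scratch" >&2
    exit 1
fi
exit 0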