Lustre / LU-12933

Evicted client doesn't reconnect

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.3
    • Labels: None
    • Environment: CentOS 7.6, MOFED 4.7, Lustre 2.12.3
    • Severity: 3

    Description

      Hi,

      We are seeing a few clients (5) that get evicted by MDTs or OSTs and then never reconnect. We have only seen this since the upgrade to 2.12.3.

      Example with sh-103-53 (10.9.103.53@o2ib4):

      [root@sh-103-53 ~]# lfs df -v /scratch/
      UUID                   1K-blocks        Used   Available Use% Mounted on
      fir-MDT0000_UUID     18287292984  8929938396  8420624916  52% /scratch[MDT:0] f
      fir-MDT0001_UUID     18287292984  4272171644 13078803336  25% /scratch[MDT:1] f
      MDT0002             : inactive device
      fir-MDT0003_UUID     18287292984  3657699480 13693239252  22% /scratch[MDT:3] f
      OST0000             : inactive device
      OST0001             : inactive device
      fir-OST0002_UUID     61986877596 30702519208 30658586164  51% /scratch[OST:2]
      fir-OST0003_UUID     61986877596 30350994280 31010091948  50% /scratch[OST:3]
      fir-OST0004_UUID     61986877596 30501193096 30860120040  50% /scratch[OST:4]
      fir-OST0005_UUID     61986877596 30550135432 30811199588  50% /scratch[OST:5]
      fir-OST0006_UUID     61986877596 30070489688 31290937284  50% /scratch[OST:6]
      fir-OST0007_UUID     61986877596 30744050608 30617311348  51% /scratch[OST:7]
      fir-OST0008_UUID     61986877596 30493910864 30867362968  50% /scratch[OST:8]
      fir-OST0009_UUID     61986877596 30649739912 30711554208  50% /scratch[OST:9]
      fir-OST000a_UUID     61986877596 29661667812 31699536896  49% /scratch[OST:10]
      fir-OST000b_UUID     61986877596 29826938776 31534313376  49% /scratch[OST:11]
      OST000c             : inactive device
      <hung>
      

      They stay in the evicted state.
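For triage, the inactive targets in the `lfs df -v` listing above can be picked out with a short script. This is only a sketch: the sample text is an abridged copy of that listing, and the function name `inactive_targets` and the regular expression are illustrative assumptions, not part of any Lustre tool.

```python
import re

# Abridged copy of the `lfs df -v /scratch/` output shown above.
LFS_DF_OUTPUT = """\
fir-MDT0001_UUID     18287292984  4272171644 13078803336  25% /scratch[MDT:1] f
MDT0002             : inactive device
fir-MDT0003_UUID     18287292984  3657699480 13693239252  22% /scratch[MDT:3] f
OST0000             : inactive device
OST0001             : inactive device
fir-OST0002_UUID     61986877596 30702519208 30658586164  51% /scratch[OST:2]
OST000c             : inactive device
"""

# Assumed line shape: "<MDT|OST><4 hex digits>   : inactive device".
INACTIVE_RE = re.compile(
    r"^\s*((?:MDT|OST)[0-9a-fA-F]{4})\s*:\s*inactive device\s*$"
)


def inactive_targets(text):
    """Return the target names reported as 'inactive device', in order."""
    found = []
    for line in text.splitlines():
        match = INACTIVE_RE.match(line)
        if match:
            found.append(match.group(1))
    return found


print(inactive_targets(LFS_DF_OUTPUT))
# → ['MDT0002', 'OST0000', 'OST0001', 'OST000c']
```

On this client the inactive targets are MDT0002, OST0000, OST0001 and OST000c; the command hung before listing anything past OST000c.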

      • MDT
      [root@sh-103-53 ~]# cd /proc/fs/lustre/mdc/fir-MDT0002-mdc-ffff9781f2230800/
      [root@sh-103-53 fir-MDT0002-mdc-ffff9781f2230800]# date +%s; cat state 
      1572895124
      current_state: EVICTED
      state_history:
       - [ 1572839940, CONNECTING ]
       - [ 1572839995, DISCONN ]
       - [ 1572840015, CONNECTING ]
       - [ 1572840070, DISCONN ]
       - [ 1572840090, CONNECTING ]
       - [ 1572840145, DISCONN ]
       - [ 1572840165, CONNECTING ]
       - [ 1572840165, REPLAY ]
       - [ 1572840165, REPLAY_LOCKS ]
       - [ 1572841556, CONNECTING ]
       - [ 1572841556, EVICTED ]
       - [ 1572841556, RECOVER ]
       - [ 1572841556, FULL ]
       - [ 1572888266, DISCONN ]
       - [ 1572888266, CONNECTING ]
       - [ 1572888266, EVICTED ]
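
To put a number on "stays evicted", the `state` file contents above can be parsed and compared with the `date +%s` value. A rough sketch follows; the sample is an abridged copy of the MDC state output, and `parse_state` and `seconds_in_current_state` are hypothetical helper names, not Lustre utilities.

```python
import re

# Abridged copy of the MDC `state` file shown above.
STATE_TEXT = """\
current_state: EVICTED
state_history:
 - [ 1572841556, FULL ]
 - [ 1572888266, DISCONN ]
 - [ 1572888266, CONNECTING ]
 - [ 1572888266, EVICTED ]
"""
NOW = 1572895124  # value printed by `date +%s` in the same transcript

# Matches history entries of the form " - [ <epoch seconds>, <STATE> ]".
ENTRY_RE = re.compile(r"^\s*-\s*\[\s*(\d+),\s*(\w+)\s*\]\s*$")


def parse_state(text):
    """Return (current_state, [(timestamp, state), ...]) from the text."""
    current = None
    history = []
    for line in text.splitlines():
        if line.startswith("current_state:"):
            current = line.split(":", 1)[1].strip()
            continue
        match = ENTRY_RE.match(line)
        if match:
            history.append((int(match.group(1)), match.group(2)))
    return current, history


def seconds_in_current_state(text, now):
    """Seconds since the last history entry, if it matches current_state."""
    current, history = parse_state(text)
    if not history or history[-1][1] != current:
        return None
    return now - history[-1][0]


print(seconds_in_current_state(STATE_TEXT, NOW))
# → 6858
```

So at the time of the capture this import had been sitting in EVICTED for 6858 seconds (just under two hours) with no further CONNECTING attempt recorded.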
      

      Logs from the MDS (fir-md1-s3, 10.0.10.53@o2ib7):

      Nov 04 09:24:19 fir-md1-s3 kernel: LustreError: 40361:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.9.103.53@o2ib4  ns: mdt-fir-MDT0002_UUID lock: ffff9a5eea6057c0/0x3428b9d2e97b844b lrc: 3/0,0 mode: PW/PW res: [0x2c0032e40:0x1:0x0].0x0 bits 0x40/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.9.103.53@o2ib4 remote: 0x51ab3c4efd129db expref: 24 pid: 43285 timeout: 55884 lvb_type: 0
      
      • OST 
      [root@sh-103-53 ~]# cd /proc/fs/lustre/osc/fir-OST000c-osc-ffff9781f2230800
      [root@sh-103-53 fir-OST000c-osc-ffff9781f2230800]# date +%s; cat state 
      1572895019
      current_state: EVICTED
      state_history:
       - [ 1572839919, DISCONN ]
       - [ 1572839940, CONNECTING ]
       - [ 1572839995, DISCONN ]
       - [ 1572840015, CONNECTING ]
       - [ 1572840070, DISCONN ]
       - [ 1572840090, CONNECTING ]
       - [ 1572840145, DISCONN ]
       - [ 1572840165, CONNECTING ]
       - [ 1572840165, REPLAY ]
       - [ 1572840165, REPLAY_LOCKS ]
       - [ 1572840165, REPLAY_WAIT ]
       - [ 1572840409, RECOVER ]
       - [ 1572840409, FULL ]
       - [ 1572888417, DISCONN ]
       - [ 1572888417, CONNECTING ]
       - [ 1572888417, EVICTED ]
      

      Logs from the OSS of fir-OST000c (fir-io2-s1, 10.0.10.103@o2ib7):

      Nov 04 09:26:51 fir-io2-s1 kernel: LustreError: 53462:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.9.103.53@o2ib4  ns: filter-fir-OST000c_UUID lock: ffff966494a91f80/0xe2c2779bcc5329d5 lrc: 3/0,0 mode: PW/PW res: [0x3c0000400:0x1245ef1:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 8388608->536870911) flags: 0x60000400010020 nid: 10.9.103.53@o2ib4 remote: 0x51ab3c4efd129f0 expref: 6 pid: 56541 timeout: 237932 lvb_type: 0
      Nov 04 09:30:44 fir-io2-s1 kernel: Lustre: fir-OST000c: haven't heard from client f169454a-d158-5ca7-0fb6-b3c51a09a392 (at 10.9.103.53@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff962e33420000, cur 1572888644 expire 1572888494 last 1572888417
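
The "haven't heard from client" message above carries enough fields to line it up with the client-side state history. A minimal sketch of extracting them is below; the regular expression is an assumption based solely on this one log line, and `parse_eviction` is a hypothetical helper name.

```python
import re

# The OSS log line quoted above, split across string literals for readability.
LOG_LINE = (
    "Nov 04 09:30:44 fir-io2-s1 kernel: Lustre: fir-OST000c: "
    "haven't heard from client f169454a-d158-5ca7-0fb6-b3c51a09a392 "
    "(at 10.9.103.53@o2ib4) in 227 seconds. "
    "I think it's dead, and I am evicting it. exp ffff962e33420000, "
    "cur 1572888644 expire 1572888494 last 1572888417"
)

# Pattern inferred from the single message above, not from the Lustre source.
EVICT_RE = re.compile(
    r"(?P<target>\S+): haven't heard from client (?P<uuid>\S+) "
    r"\(at (?P<nid>[^)]+)\) in (?P<silent>\d+) seconds\..*"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def parse_eviction(line):
    """Return the message fields as a dict, or None if the line differs."""
    match = EVICT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("silent", "cur", "expire", "last"):
        info[key] = int(info[key])
    return info


info = parse_eviction(LOG_LINE)
print(info["target"], info["nid"], info["cur"] - info["last"])
# → fir-OST000c 10.9.103.53@o2ib4 227
```

Here `cur - last` equals the 227 seconds quoted in the message, and `last` (1572888417) is the same timestamp as the final DISCONN, CONNECTING and EVICTED entries in the client's OSC state history above.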
      

      Attaching kernel logs for the client sh-103-53 as sh-103-53.log
      Attaching kernel logs for the MDS fir-md1-s3 as fir-md1-s3.log
      Attaching kernel logs for the OSS fir-io2-s1 as fir-io2-s1.log

      Thanks!
      Stephane

       

      Attachments

        1. fir-io2-s1.log
          527 kB
        2. fir-md1-s3.log
          281 kB
        3. sh-103-53.log
          2.22 MB

          People

            pjones Peter Jones
            sthiell Stephane Thiell