Lustre / LU-180

Sometimes evicted clients never reconnect

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.0.0

    Description

      Hi,

      On our benchmarking cluster we observe very strange Lustre behavior: clients that are evicted by an OST sometimes never reconnect. As a consequence, many processes on the affected compute nodes become defunct, and we have to reboot those nodes.
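      The pile-up on an affected node can be spotted by counting defunct (zombie) processes; a minimal, generic sketch (nothing Lustre-specific is assumed here):

```shell
# Count defunct (zombie) processes on the node; a growing count is the
# symptom described above.  `ps -eo stat=` prints only the process-state
# column (no header), and states beginning with 'Z' are zombies.
ps -eo stat= | grep -c '^Z' || true   # `|| true`: grep exits 1 when the count is 0
```

      Run periodically (e.g. via pdsh across the compute nodes), a steadily increasing count points at the nodes that will need a reboot.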

      For instance, on a compute node we can see that one OST connection is missing:

      [root@kay297 ~]# lfs df
      UUID 1K-blocks Used Available Use% Mounted on
      scratch-MDT0000_UUID 983490512 887336 982603176 0% /scratch_lustre[MDT:0]
      scratch-OST0000_UUID 7691221300 547721748 7143497248 7% /scratch_lustre[OST:0]
      scratch-OST0001_UUID 7691221300 548640356 7142579792 7% /scratch_lustre[OST:1]
      scratch-OST0002_UUID 7691221300 555585984 7135634108 7% /scratch_lustre[OST:2]
      scratch-OST0003_UUID 7691221300 551102404 7140118212 7% /scratch_lustre[OST:3]
      scratch-OST0004_UUID 7691221300 569235872 7121985356 7% /scratch_lustre[OST:4]
      scratch-OST0005_UUID 7691221300 565971400 7125210680 7% /scratch_lustre[OST:5]
      scratch-OST0006_UUID 7691221300 551380176 7139839904 7% /scratch_lustre[OST:6]
      scratch-OST0007_UUID 7691221300 552060560 7139160472 7% /scratch_lustre[OST:7]
      scratch-OST0008_UUID 7691221300 540060736 7151160480 7% /scratch_lustre[OST:8]
      scratch-OST0009_UUID 7691221300 542803928 7148417244 7% /scratch_lustre[OST:9]
      scratch-OST000a_UUID 7691221300 549910932 7141309212 7% /scratch_lustre[OST:10]
      scratch-OST000b_UUID 7691221300 553465732 7137754416 7% /scratch_lustre[OST:11]
      scratch-OST000c_UUID 7691221300 547134756 7144086380 7% /scratch_lustre[OST:12]
      scratch-OST000d_UUID 7691221300 542512828 7148708296 7% /scratch_lustre[OST:13]
      scratch-OST000e_UUID 7691221300 540940940 7150278116 7% /scratch_lustre[OST:14]
      scratch-OST000f_UUID 7691221300 552187304 7139031756 7% /scratch_lustre[OST:15]
      scratch-OST0010_UUID 7691221300 553010540 7138207368 7% /scratch_lustre[OST:16]
      scratch-OST0011_UUID 7691221300 549111608 7142109332 7% /scratch_lustre[OST:17]
      OST0012 : inactive device
      scratch-OST0013_UUID 7691221300 545678392 7145547784 7% /scratch_lustre[OST:19]
      scratch-OST0014_UUID 7691221300 545673392 7145547784 7% /scratch_lustre[OST:20]
      scratch-OST0015_UUID 7691221300 553029372 7138191732 7% /scratch_lustre[OST:21]
      scratch-OST0016_UUID 7691221300 578557784 7112659292 7% /scratch_lustre[OST:22]
      scratch-OST0017_UUID 7691221300 553574948 7137640080 7% /scratch_lustre[OST:23]
      scratch-OST0018_UUID 7691221300 593382232 7097838936 7% /scratch_lustre[OST:24]
      scratch-OST0019_UUID 7691221300 550952100 7140232336 7% /scratch_lustre[OST:25]
      scratch-OST001a_UUID 7691221300 604897244 7086322904 7% /scratch_lustre[OST:26]
      scratch-OST001b_UUID 7691221300 545086976 7146133184 7% /scratch_lustre[OST:27]
      scratch-OST001c_UUID 7691221300 550491540 7140725472 7% /scratch_lustre[OST:28]
      scratch-OST001d_UUID 7691221300 542008368 7149211712 7% /scratch_lustre[OST:29]

      filesystem summary: 215354196400 16076170152 199823591804 7% /scratch_lustre

      The phenomenon is erratic: it does not always affect the same clients or the same OSTs.
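      The stuck connection is easy to detect from the `lfs df` output, since the missing OST shows up as an "inactive device" line. A minimal sketch, run here against a captured sample of the output above (on a live client one would pipe `lfs df` directly):

```shell
# Flag OSTs whose client connection is missing.  On a client this would be:
#   lfs df | grep 'inactive device'
# Here we grep a captured sample of the `lfs df` output shown in the report.
sample='scratch-OST0011_UUID 7691221300 549111608 7142109332 7% /scratch_lustre[OST:17]
OST0012 : inactive device
scratch-OST0013_UUID 7691221300 545678392 7145547784 7% /scratch_lustre[OST:19]'

printf '%s\n' "$sample" | grep 'inactive device'
# prints: OST0012 : inactive device
```

      A non-empty result means at least one OSC never re-established its connection after the eviction, which is exactly the state kay297 is stuck in.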

      I am attaching the syslogs of kay297 (client) and kay3 (OSS).

      Attachments

        1. client.log.gz
          43 kB
        2. deadlock-trace.log
          9 kB
        3. kay297
          39 kB
        4. kay3
          22 kB


          People

            Assignee: Niu Yawei (niu) (Inactive)
            Reporter: Sebastien Buisson (sebastien.buisson) (Inactive)
            Votes: 0
            Watchers: 5
