Details
Type: Bug
Resolution: Duplicate
Priority: Critical
Affects Version/s: Lustre 2.0.0
Severity: 3
Description
Hi,
On our benchmarking cluster we observe very strange Lustre behavior: sometimes clients that are evicted by an OST never reconnect. As a consequence we end up with many defunct processes on the affected compute nodes, and we have to reboot them.
For instance, on one compute node we can see that an OST connection is missing:
[root@kay297 ~]# lfs df
UUID 1K-blocks Used Available Use% Mounted on
scratch-MDT0000_UUID 983490512 887336 982603176 0% /scratch_lustre[MDT:0]
scratch-OST0000_UUID 7691221300 547721748 7143497248 7% /scratch_lustre[OST:0]
scratch-OST0001_UUID 7691221300 548640356 7142579792 7% /scratch_lustre[OST:1]
scratch-OST0002_UUID 7691221300 555585984 7135634108 7% /scratch_lustre[OST:2]
scratch-OST0003_UUID 7691221300 551102404 7140118212 7% /scratch_lustre[OST:3]
scratch-OST0004_UUID 7691221300 569235872 7121985356 7% /scratch_lustre[OST:4]
scratch-OST0005_UUID 7691221300 565971400 7125210680 7% /scratch_lustre[OST:5]
scratch-OST0006_UUID 7691221300 551380176 7139839904 7% /scratch_lustre[OST:6]
scratch-OST0007_UUID 7691221300 552060560 7139160472 7% /scratch_lustre[OST:7]
scratch-OST0008_UUID 7691221300 540060736 7151160480 7% /scratch_lustre[OST:8]
scratch-OST0009_UUID 7691221300 542803928 7148417244 7% /scratch_lustre[OST:9]
scratch-OST000a_UUID 7691221300 549910932 7141309212 7% /scratch_lustre[OST:10]
scratch-OST000b_UUID 7691221300 553465732 7137754416 7% /scratch_lustre[OST:11]
scratch-OST000c_UUID 7691221300 547134756 7144086380 7% /scratch_lustre[OST:12]
scratch-OST000d_UUID 7691221300 542512828 7148708296 7% /scratch_lustre[OST:13]
scratch-OST000e_UUID 7691221300 540940940 7150278116 7% /scratch_lustre[OST:14]
scratch-OST000f_UUID 7691221300 552187304 7139031756 7% /scratch_lustre[OST:15]
scratch-OST0010_UUID 7691221300 553010540 7138207368 7% /scratch_lustre[OST:16]
scratch-OST0011_UUID 7691221300 549111608 7142109332 7% /scratch_lustre[OST:17]
OST0012 : inactive device
scratch-OST0013_UUID 7691221300 545678392 7145547784 7% /scratch_lustre[OST:19]
scratch-OST0014_UUID 7691221300 545673392 7145547784 7% /scratch_lustre[OST:20]
scratch-OST0015_UUID 7691221300 553029372 7138191732 7% /scratch_lustre[OST:21]
scratch-OST0016_UUID 7691221300 578557784 7112659292 7% /scratch_lustre[OST:22]
scratch-OST0017_UUID 7691221300 553574948 7137640080 7% /scratch_lustre[OST:23]
scratch-OST0018_UUID 7691221300 593382232 7097838936 7% /scratch_lustre[OST:24]
scratch-OST0019_UUID 7691221300 550952100 7140232336 7% /scratch_lustre[OST:25]
scratch-OST001a_UUID 7691221300 604897244 7086322904 7% /scratch_lustre[OST:26]
scratch-OST001b_UUID 7691221300 545086976 7146133184 7% /scratch_lustre[OST:27]
scratch-OST001c_UUID 7691221300 550491540 7140725472 7% /scratch_lustre[OST:28]
scratch-OST001d_UUID 7691221300 542008368 7149211712 7% /scratch_lustre[OST:29]
filesystem summary: 215354196400 16076170152 199823591804 7% /scratch_lustre
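For reference, the import state of the missing OSC can also be checked on the client; this is only a sketch of the commands we would run (parameter names assume the standard Lustre 2.0 /proc layout, and the OSC device name is matched with a wildcard):
[root@kay297 ~]# lctl dl | grep osc                                      # list OSC devices with their setup state
[root@kay297 ~]# lctl get_param osc.scratch-OST0012-osc-*.state          # import state history of the stale OSC
[root@kay297 ~]# lctl get_param osc.scratch-OST0012-osc-*.ost_conn_uuid  # NID the OSC is (re)connecting to
[root@kay297 ~]# lfs check servers                                       # connectivity summary for all targets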
The phenomenon is erratic; it does not always affect the same clients or the same OSTs.
I have attached the syslogs of kay297 (client) and kay3 (OSS).
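For completeness, a manual attempt to bring the stale import back (instead of rebooting the node) would look roughly like the following; this is only a sketch, and <devno> is a placeholder for the device number printed by lctl dl:
[root@kay297 ~]# lctl dl | grep OST0012            # note the device number of the affected OSC
[root@kay297 ~]# lctl --device <devno> recover     # ask the import to reconnect to the OST
[root@kay297 ~]# lctl --device <devno> activate    # reactivate the OSC if it was left inactive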