Andreas,
I also tried to reproduce the problem on some test hardware by creating a filesystem with the exact same 2.5.2 version of Lustre that is installed on our /scratch filesystem, and I was unable to reproduce it either. There must be something else going on with our /scratch filesystem, either due to its large scale with 348 OSTs or due to the upgrade from the 2.1.5 version we were running, so I'm going to compare the two filesystem setups and see if I can find any differences.
In regard to the debug output, we could not wait to put the system back into production, so we developed a manual process to distribute files by setting the stripe offset to a random OST for active user directories. We are cycling the first two active OSTs so that files created in directories where the stripe_offset is still set to -1 get distributed as well. It is not efficient and the performance is not as good, but at least it lets users run jobs and spreads files across all the OSTs. I can certainly generate the debug output, but I am afraid it would be polluted with the activity from all the users. On top of that, we had to deactivate the first 16 OSTs since they reached > 93% capacity. We have a maintenance window scheduled for next Tuesday and can collect data on a quiet system then.

I've included the output from the prealloc information in case it might be useful. I noticed two OSTs had -5 as the prealloc_status; those OSTs are in the list of inactive OSTs, which is in the attached file as well. While looking through the prealloc output, I found these three sets of messages corresponding to those OSTs:
Oct 22 00:43:29 mds5 kernel: Lustre: setting import scratch-OST001d_UUID INACTIVE by administrator request
Oct 22 00:43:29 mds5 kernel: Lustre: Skipped 8 previous similar messages
Oct 22 00:43:29 mds5 kernel: LustreError: 22062:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST001d-osc-MDT0000: can't precreate: rc = -5
Oct 22 00:43:29 mds5 kernel: LustreError: 22062:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST001d-osc-MDT0000: cannot precreate objects: rc = -5
Oct 22 01:04:06 mds5 kernel: Lustre: setting import scratch-OST0021_UUID INACTIVE by administrator request
Oct 22 01:04:06 mds5 kernel: LustreError: 22070:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST0021-osc-MDT0000: can't precreate: rc = -5
Oct 22 01:04:06 mds5 kernel: LustreError: 22070:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST0021-osc-MDT0000: cannot precreate objects: rc = -5
Oct 22 15:07:21 mds5 kernel: Lustre: setting import scratch-OST0024_UUID INACTIVE by administrator request
Oct 22 15:07:21 mds5 kernel: Lustre: Skipped 5 previous similar messages
Oct 22 15:07:21 mds5 kernel: LustreError: 22084:0:(osp_precreate.c:464:osp_precreate_send()) scratch-OST0026-osc-MDT0000: can't precreate: rc = -5
Oct 22 15:07:21 mds5 kernel: LustreError: 22084:0:(osp_precreate.c:968:osp_precreate_thread()) scratch-OST0026-osc-MDT0000: cannot precreate objects: rc = -5
We faced the same issue at CEA. We analyzed the problem and tracked it down to lod_qos_statfs_update(). Indeed, in this function each OST is checked in turn. If an OST has active=0, lod_statfs_and_check() returns ENOTCONN and the for-loop breaks, so the following OSTs are never checked. Their statfs data is never updated (it stays at zero) and they are not used when allocating file objects.
A simple workaround for this bug is to disable QOS allocation by setting qos_threshold_rr to 100, which falls back to plain round-robin allocation.
A fix seems to be as simple as replacing the break with a continue.
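To make the effect concrete, here is a small self-contained C model of the loop (this is not the actual Lustre code; every name in it is invented for illustration). With the original break, every OST after the first inactive one is left with zeroed statfs data, so the QOS allocator never selects it; with continue, only the inactive OST is skipped.

/*
 * Toy model of the loop described above.  NOT the Lustre source:
 * ost_info, statfs_and_check, NR_OSTS, etc. are made-up names.
 * It only demonstrates why a `break` on the first inactive OST
 * leaves every later OST with zeroed statfs data, while a
 * `continue` skips only the inactive one.
 */
#include <stdio.h>
#include <errno.h>

#define NR_OSTS 8

struct ost_info {
    int active;                /* 0 = deactivated by the administrator */
    unsigned long long bavail; /* last refreshed free space, 0 = never */
};

/* Stand-in for lod_statfs_and_check(): fails with -ENOTCONN if inactive. */
static int statfs_and_check(struct ost_info *ost)
{
    if (!ost->active)
        return -ENOTCONN;
    ost->bavail = 1000;        /* pretend fresh statfs data arrived */
    return 0;
}

/* Stand-in for the lod_qos_statfs_update() loop. */
static void qos_statfs_update(struct ost_info *osts, int fixed)
{
    for (int i = 0; i < NR_OSTS; i++) {
        int rc = statfs_and_check(&osts[i]);
        if (rc) {
            if (fixed)
                continue;      /* proposed fix: skip only this OST */
            break;             /* 2.5 behaviour: give up on the rest */
        }
    }
}

int main(void)
{
    for (int fixed = 0; fixed <= 1; fixed++) {
        struct ost_info osts[NR_OSTS];

        for (int i = 0; i < NR_OSTS; i++) {
            osts[i].active = (i != 2); /* one inactive OST in the middle */
            osts[i].bavail = 0;
        }

        qos_statfs_update(osts, fixed);

        printf("%s:", fixed ? "continue (fixed)" : "break (2.5.x)  ");
        for (int i = 0; i < NR_OSTS; i++)
            printf(" OST%d=%llu", i, osts[i].bavail);
        printf("\n");
    }
    return 0;
}

Compiled with a C99 compiler, the break variant reports zero available space for every OST after the inactive one, so those OSTs are never chosen for new objects, which is consistent with the uneven object distribution described above.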