Details
Bug
Resolution: Unresolved
Minor
Lustre 2.15.0
CentOS 7.9, servers 2.12.8, client 2.15.0-RC3, QLogic/Intel QDR InfiniBand.
Description
While trying to work around lock timeouts during parallel compilation of GCC (client and servers both on Lustre 2.12.8), I upgraded a client node to 2.15.0-RC3. The lock timeouts went away, but errors like the following appeared in dmesg:
[ +0.309601] LustreError: cowardly refusing to write 4123 bytes in a page
[ +0.000010] LustreError: 11121:0:(jobid.c:348:cfs_get_environ()) key: SLURM_JOB_ID, entry: MAKEFLAGS=w --jobserver-fds=3,4 -j -- MAKEINFO=makeinfo\ --split-size=5000000\ --split-size=5000000\ --split-size=5000000 CONFIG_SHELL=/bin/sh TFLAGS= STAGEautofeedback_TFLAGS=-fchecking=1 STAGEautofeedback_GENERATOR_CFLAGS= STAGEautofeedback_CXXFLAGS=-g\ -O2\ -fchecking=1 STAGEautofeedback_CFLAGS=-g\ -O2\ -fchecking=1 STAGEautoprofile_TFLAGS=-fno-checking STAGEautoprofile_GENERATOR_CFLAGS= STAGEautoprofile_CXXFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -g STAGEautoprofile_CFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -g STAGEfeedback_TFLAGS= STAGEfeedback_GENERATOR_CFLAGS= STAGEfeedback_CXXFLAGS=-g\ -O2\ -fprofile-use STAGEfeedback_CFLAGS=-g\ -O2\ -fprofile-use STAGEtrain_TFLAGS= STAGEtrain_GENERATOR_CFLAGS= STAGEtrain_CXXFLAGS=-g\ -O2 STAGEtrain_CFLAGS=-g\ -O2 STAGEprofile_TFLAGS=-fno-checking STAGEprofile_GENERATOR_CFLAGS= STAGEprofile_CXXFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -fprofile-generate STAGEp
Our main Lustre file system, DFS-L, was configured to collect Slurm jobstats (jobid_var=SLURM_JOB_ID), but our scratch file system, XXX-L, was still at the default (jobid_var=disable). GCC was being compiled on XXX-L, but outside of Slurm, so SLURM_JOB_ID was not set in the build's environment.
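For anyone trying to reproduce this, the following is a rough Python sketch for inspecting a process's environment from user space: it checks whether SLURM_JOB_ID is present and reports the longest single entry (MAKEFLAGS is the one shown in the log above). It only reads the NUL-separated /proc/<pid>/environ format; it is not the kernel's cfs_get_environ() logic, and the helper names (read_environ, lookup, longest_entry) are made up for illustration.

```python
def read_environ(pid="self"):
    """Return the raw, NUL-separated environment block of a process (Linux only)."""
    with open("/proc/{}/environ".format(pid), "rb") as f:
        return f.read()


def split_environ(raw):
    """Split a raw environ block into its non-empty NAME=value entries."""
    return [entry for entry in raw.split(b"\0") if entry]


def lookup(raw, key):
    """Return the value of `key` in the raw environ block, or None if it is unset."""
    prefix = key.encode() + b"="
    for entry in split_environ(raw):
        if entry.startswith(prefix):
            return entry[len(prefix):].decode(errors="replace")
    return None


def longest_entry(raw):
    """Return (name, length in bytes) of the longest NAME=value entry."""
    entries = split_environ(raw)
    if not entries:
        return ("", 0)
    longest = max(entries, key=len)
    name = longest.split(b"=", 1)[0].decode(errors="replace")
    return (name, len(longest))


if __name__ == "__main__":
    raw = read_environ()  # pass a PID to inspect another process you own
    print("SLURM_JOB_ID:", lookup(raw, "SLURM_JOB_ID"))
    print("longest entry: %s (%d bytes)" % longest_entry(raw))
```

Running this from inside the make job (or pointing read_environ at one of the compiler PIDs) would show whether the job ID variable is missing and how large the biggest environment entry is.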
Turning off jobstats gathering on DFS-L with "lctl conf_param DFS-L.sys.jobid_var=disable" on the MGS made the error message go away.
Is it possible for multiple Lustre file systems to coexist on one client with different jobid_var settings? We are not currently using the Slurm jobstats, so keeping them off everywhere is fine for now.