[LU-15745] Client error when multiple filesystems mounted with differing jobid_var settings Created: 14/Apr/22 Updated: 14/Apr/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Nathan Crawford | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.9, servers 2.12.8, client 2.15.0-RC3, Q-logic/Intel QDR Infiniband. |
||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While trying to work around lock timeouts during parallel compilation of GCC (client and servers both on Lustre 2.12.8), I upgraded a client node to 2.15.0-RC3. The lock timeouts went away, but errors like the following appeared in dmesg:

[ +0.309601] LustreError: cowardly refusing to write 4123 bytes in a page
[ +0.000010] LustreError: 11121:0:(jobid.c:348:cfs_get_environ()) key: SLURM_JOB_ID, entry: MAKEFLAGS=w --jobserver-fds=3,4 -j -- MAKEINFO=makeinfo\ --split-size=5000000\ --split-size=5000000\ --split-size=5000000 CONFIG_SHELL=/bin/sh TFLAGS= STAGEautofeedback_TFLAGS=-fchecking=1 STAGEautofeedback_GENERATOR_CFLAGS= STAGEautofeedback_CXXFLAGS=-g\ -O2\ -fchecking=1 STAGEautofeedback_CFLAGS=-g\ -O2\ -fchecking=1 STAGEautoprofile_TFLAGS=-fno-checking STAGEautoprofile_GENERATOR_CFLAGS= STAGEautoprofile_CXXFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -g STAGEautoprofile_CFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -g STAGEfeedback_TFLAGS= STAGEfeedback_GENERATOR_CFLAGS= STAGEfeedback_CXXFLAGS=-g\ -O2\ -fprofile-use STAGEfeedback_CFLAGS=-g\ -O2\ -fprofile-use STAGEtrain_TFLAGS= STAGEtrain_GENERATOR_CFLAGS= STAGEtrain_CXXFLAGS=-g\ -O2 STAGEtrain_CFLAGS=-g\ -O2 STAGEprofile_TFLAGS=-fno-checking STAGEprofile_GENERATOR_CFLAGS= STAGEprofile_CXXFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -fprofile-generate STAGEp

Our main Lustre file system, DFS-L, was configured to collect Slurm jobstats (jobid_var=SLURM_JOB_ID), but our scratch file system, XXX-L, was still at the default (jobid_var=disable). GCC was being compiled on XXX-L, but outside of Slurm, so SLURM_JOB_ID was not set in the environment.

Turning off jobstats gathering on DFS-L by running "lctl conf_param DFS-L.sys.jobid_var=disable" on the MGS made the error message go away.

Is it possible for multiple Lustre file systems with different jobid_var settings to coexist on the same client? We are not currently using the Slurm jobstats, so keeping it off everywhere is fine for now. |
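
For anyone trying to reproduce this, the sketch below checks whether a key such as SLURM_JOB_ID is present in a NUL-separated environment block like /proc/<pid>/environ. It is only an illustration in userspace: the helper name env_lookup is hypothetical, it is not the logic in jobid.c, and it assumes values contain no embedded newlines.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): print the value of KEY found in a
# NUL-separated environment file such as /proc/<pid>/environ.
# Returns 0 if the key is present, 1 otherwise.
env_lookup() {
    key=$1
    file=$2
    # Convert NUL separators to newlines, then match entries that start
    # with "KEY=" exactly (so SLURM_JOB_ID does not match SLURM_JOB_ID_X).
    tr '\0' '\n' < "$file" | awk -v k="$key" '
        index($0, k "=") == 1 {
            print substr($0, length(k) + 2)
            found = 1
            exit
        }
        END { exit(found ? 0 : 1) }'
}
```

For example, `env_lookup SLURM_JOB_ID /proc/$$/environ || echo "SLURM_JOB_ID not set"` reports whether the current shell was started with the variable (the file reflects the environment at exec time, not later exports). On a Lustre client, the setting currently in effect can be read with `lctl get_param jobid_var`.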