[LU-15745] Client error when multiple filesystems mounted with differing jobid_var settings Created: 14/Apr/22  Updated: 14/Apr/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Nathan Crawford Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.9, servers 2.12.8, client 2.15.0-RC3, Q-logic/Intel QDR Infiniband.



Description

While trying to work around lock timeouts during parallel compilation of GCC (client and servers both on Lustre 2.12.8), I upgraded one client node to 2.15.0-RC3. The lock timeouts went away, but errors like the following appeared in dmesg:

[  +0.309601] LustreError: cowardly refusing to write 4123 bytes in a page
[  +0.000010] LustreError: 11121:0:(jobid.c:348:cfs_get_environ()) key: SLURM_JOB_ID, entry: MAKEFLAGS=w --jobserver-fds=3,4 -j -- MAKEINFO=makeinfo\ --split-size=5000000\ --split-size=5000000\ --split-size=5000000 CONFIG_SHELL=/bin/sh TFLAGS= STAGEautofeedback_TFLAGS=-fchecking=1 STAGEautofeedback_GENERATOR_CFLAGS= STAGEautofeedback_CXXFLAGS=-g\ -O2\ -fchecking=1 STAGEautofeedback_CFLAGS=-g\ -O2\ -fchecking=1 STAGEautoprofile_TFLAGS=-fno-checking STAGEautoprofile_GENERATOR_CFLAGS= STAGEautoprofile_CXXFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -g STAGEautoprofile_CFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -g STAGEfeedback_TFLAGS= STAGEfeedback_GENERATOR_CFLAGS= STAGEfeedback_CXXFLAGS=-g\ -O2\ -fprofile-use STAGEfeedback_CFLAGS=-g\ -O2\ -fprofile-use STAGEtrain_TFLAGS= STAGEtrain_GENERATOR_CFLAGS= STAGEtrain_CXXFLAGS=-g\ -O2 STAGEtrain_CFLAGS=-g\ -O2 STAGEprofile_TFLAGS=-fno-checking STAGEprofile_GENERATOR_CFLAGS= STAGEprofile_CXXFLAGS=-g\ -O2\ -fno-checking\ -gtoggle\ -fprofile-generate STAGEp
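
The log above reports 4123 bytes being refused "in a page" while cfs_get_environ() was looking for SLURM_JOB_ID and had reached the MAKEFLAGS entry. One possible reading is that a single environment entry was larger than one page. This has not been confirmed against the Lustre source. The following is a minimal diagnostic sketch, not part of Lustre, that lists oversized entries in the current process environment. The 4096-byte page size and the helper name oversized_env_entries are assumptions made for illustration.

```python
import os

# Assumption: 4 KiB pages, as is typical on x86_64. Adjust if needed.
PAGE_SIZE = 4096


def oversized_env_entries(environ=None, limit=PAGE_SIZE):
    """Return (name, size) pairs for KEY=VALUE entries longer than `limit` bytes.

    Hypothetical helper: it only measures the environment as seen from
    user space and does not reproduce what the kernel module does.
    """
    if environ is None:
        environ = os.environ
    oversized = []
    for name, value in environ.items():
        # Byte length of the "KEY=VALUE" string (without a trailing NUL).
        size = len(os.fsencode(f"{name}={value}"))
        if size > limit:
            oversized.append((name, size))
    # Largest entries first.
    return sorted(oversized, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, size in oversized_env_entries():
        print(f"{name}: {size} bytes")
```

Running this from inside the failing build (for example from a make recipe) would show whether MAKEFLAGS or another variable exceeds the assumed limit. It cannot tell whether that is what actually triggers the LustreError.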

  Our main Lustre file system, DFS-L, was configured to collect Slurm jobstats (jobid_var=SLURM_JOB_ID), but our scratch file system, XXX-L, was still at the default (jobid_var=disable). GCC was being compiled on XXX-L, but outside of Slurm, so SLURM_JOB_ID was not set.

  Turning off jobstats gathering on DFS-L with "lctl conf_param DFS-L.sys.jobid_var=disable" on the MGS made the error message go away.

  Is it possible for multiple Lustre file systems with different jobid_var settings to coexist on one client? We are not currently using the Slurm jobstats, so keeping them off everywhere is fine for now.

