Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version: Lustre 2.5.3
- Fix Version: None
- Severity: 3
Description
We are seeing "Too long env variable" errors on a Lustre cluster at Stanford, leading to broken JobStats reporting (using SLURM_JOB_ID). Jobids associated with processes reporting these errors are simply ignored:
LNetError: 15288:0:(linux-curproc.c:241:cfs_get_environ()) Too long env variable.
LNetError: 15288:0:(linux-curproc.c:241:cfs_get_environ()) Skipped 2097 previous similar messages
In our case, the user process's environ size is a bit more than 32K.
It seems the problem comes from lustre_get_jobid(), which reads the jobid from the process environment when jobstats is enabled, but cfs_get_environ() is not able to handle a large environ (which may be wise). However, we think a user shouldn't be able to disable jobstats like that, so a change to cfs_get_environ() alone might not be enough. Please advise.
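For what it's worth, the condition is trivial for any user to reproduce: a single large exported variable pushes the environ of every child process past 32 KiB (a sketch, assuming Linux procfs; note /proc/<pid>/environ is a snapshot of the environment at exec time, hence the child shell):

```shell
# Any user can inflate a process environment past 32 KiB with one
# large exported variable; jobs started this way would lose jobstats.
BIG=$(printf 'x%.0s' $(seq 1 40000))   # ~40 KB payload
export BIG
# A child process inherits the inflated environment, so its
# /proc/<pid>/environ exceeds the 32 KiB limit:
sh -c 'wc -c < /proc/$$/environ'       # prints a value above 32768
```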
Please find below the commands used to track the issue:
[root@gpu-13-1 ~]# ps uw -q 15288
USER       PID %CPU %MEM       VSZ     RSS TTY STAT START   TIME COMMAND
suuser   15288 98.4  6.2 108826468 4144960 ?   Sl   13:55 235:46 terachem run.in
[root@gpu-13-1 ~]# cat /proc/15288/environ | wc -c
32936
[root@gpu-13-1 ~]# scontrol pidinfo 15288
Slurm job id 2376464 ends at Sun Jun 07 13:55:09 2015
slurm_get_rem_time is 159433
[root@gpu-13-1 ~]# squeue -j 2376464
  JOBID PARTITION     NAME   USER ST    TIME NODES NODELIST(REASON)
2376464      slac temp800_ suuser  R 3:43:25     1 gpu-13-1
[root@gpu-13-1 ~]# lsof -p 15288 | grep /scratch
terachem 15288 suuser 1w REG 2395,496332 386348 144116383972642817 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
terachem 15288 suuser 2w REG 2395,496332     43 144116383972642818 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.err
[root@gpu-13-1 ~]# ls -l /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
-rw-r--r-- 1 suuser sugrp 386636 Jun  5 17:40 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
[root@gpu-13-1 ~]# date
Fri Jun  5 17:40:12 PDT 2015
The fsname is regal, mounted on /scratch.
No jobstats report seen from this job:
[root@rcf-mgnt ~]# clush -w regal-oss[00-07] lctl get_param obdfilter.*.job_stats \| grep 2376464
clush: regal-oss07: exited with exit code 1
clush: regal-oss06: exited with exit code 1
clush: regal-oss00: exited with exit code 1
clush: regal-oss01: exited with exit code 1
clush: regal-oss04: exited with exit code 1
clush: regal-oss03: exited with exit code 1
clush: regal-oss02: exited with exit code 1
clush: regal-oss05: exited with exit code 1
[root@regal-mds1 ~]# lctl get_param mdt.regal-MDT0000.job_stats | grep 2376464
[root@regal-mds1 ~]#