[LU-6695] Jobstats breaks when "Too long env variable." errors occur Created: 06/Jun/15 Updated: 31/Aug/15 Resolved: 16/Jun/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have "Too long env variable" errors on a Lustre cluster at Stanford, leading to broken jobstats reports (using SLURM_JOB_ID). Job IDs associated with processes reporting these errors are simply ignored:

LNetError: 15288:0:(linux-curproc.c:241:cfs_get_environ()) Too long env variable.
LNetError: 15288:0:(linux-curproc.c:241:cfs_get_environ()) Skipped 2097 previous similar messages

In our case, the user process environ size is a bit more than 32K. Please find below the commands used to track down the issue:

[root@gpu-13-1 ~]# ps uw -q 15288
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
suuser 15288 98.4 6.2 108826468 4144960 ? Sl 13:55 235:46 terachem run.in
[root@gpu-13-1 ~]# cat /proc/15288/environ | wc -c
32936
[root@gpu-13-1 ~]# scontrol pidinfo 15288
Slurm job id 2376464 ends at Sun Jun 07 13:55:09 2015
slurm_get_rem_time is 159433
[root@gpu-13-1 ~]# squeue -j 2376464
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2376464 slac temp800_ suuser R 3:43:25 1 gpu-13-1
[root@gpu-13-1 ~]# lsof -p 15288 | grep /scratch
terachem 15288 suuser 1w REG 2395,496332 386348 144116383972642817 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
terachem 15288 suuser 2w REG 2395,496332 43 144116383972642818 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.err
[root@gpu-13-1 ~]# ls -l /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
-rw-r--r-- 1 suuser sugrp 386636 Jun 5 17:40 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
[root@gpu-13-1 ~]# date
Fri Jun 5 17:40:12 PDT 2015
The fsname is regal, mounted on /scratch.

[root@rcf-mgnt ~]# clush -w regal-oss[00-07] lctl get_param obdfilter.*.job_stats \| grep 2376464
clush: regal-oss07: exited with exit code 1
clush: regal-oss06: exited with exit code 1
clush: regal-oss00: exited with exit code 1
clush: regal-oss01: exited with exit code 1
clush: regal-oss04: exited with exit code 1
clush: regal-oss03: exited with exit code 1
clush: regal-oss02: exited with exit code 1
clush: regal-oss05: exited with exit code 1
[root@regal-mds1 ~]# lctl get_param mdt.regal-MDT0000.job_stats | grep 2376464
[root@regal-mds1 ~]# |
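To find other jobs likely to hit the same error, a sweep like the following can list processes whose total environment exceeds 32 KiB. This is an editorial sketch, not part of the ticket; the 32768-byte threshold matches the "a bit more than 32K" environ size observed above, and reading other users' environ files requires root.

```shell
# Sketch: list PIDs whose total /proc/<pid>/environ size exceeds 32 KiB,
# i.e. candidates for the "Too long env variable." error above.
for pid in /proc/[0-9]*; do
    size=$(wc -c < "$pid/environ" 2>/dev/null) || continue
    if [ "${size:-0}" -gt 32768 ]; then
        echo "${pid#/proc/} $size"
    fi
done
```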
| Comments |
| Comment by Matt Ezell [ 06/Jun/15 ] |
|
We have also seen this during our recent testing of jobstats at ORNL. |
| Comment by Niu Yawei (Inactive) [ 08/Jun/15 ] |
Do you mean that a single env variable is larger than 32K, or the whole environ? cfs_get_environ() can't handle a variable that is larger than the page size. |
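This question can be answered from a shell by splitting the NUL-separated environ dump and measuring the longest single entry. A sketch (not from the ticket): it uses the current shell's own PID for demonstration; substitute the suspect PID, e.g. 15288 above.

```shell
# Sketch: report total size and the longest single variable of a process
# environment. $$ (this shell) is used for demonstration only.
pid=$$
wc -c < "/proc/$pid/environ"
tr '\0' '\n' < "/proc/$pid/environ" |
    awk '{ if (length > m) m = length } END { print m + 0 }'
```

If the second number stays under the page size while the first is over 32K, it is the total environ size, not a single variable, that trips the limit.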
| Comment by Stephane Thiell [ 08/Jun/15 ] |
|
Hi Niu, |
| Comment by Niu Yawei (Inactive) [ 08/Jun/15 ] |
|
I see; we didn't expect such long env variables. It looks like we should just skip these long variables in cfs_get_environ(). |
| Comment by Gerrit Updater [ 08/Jun/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/15177 |
| Comment by Gerrit Updater [ 16/Jun/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15177/ |
| Comment by Peter Jones [ 16/Jun/15 ] |
|
Landed for 2.8 |