  LU-6695

Jobstats breaks when "Too long env variable." errors occur

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Severity: 3
    • 9223372036854775807

    Description

      We have "Too long env variable" errors on a Lustre cluster at Stanford leading to broken JobStats report (using SLURM_JOB_ID). Jobids associated with processes reporting these errors are just ignored:

      LNetError: 15288:0:(linux-curproc.c:241:cfs_get_environ()) Too long env variable.
      LNetError: 15288:0:(linux-curproc.c:241:cfs_get_environ()) Skipped 2097 previous similar messages
      

      In our case, the user process environ size is a bit more than 32K.
      The problem seems to come from lustre_get_jobid(), which reads the jobid from the process environ when jobstats is enabled, but cfs_get_environ() is not able to handle a large environ (which may be wise). However, we think a user shouldn't be able to disable jobstats like that. A change to cfs_get_environ() might not be enough. Please advise.
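
      To illustrate the failure mode we suspect, here is a minimal userspace sketch (this is not the cfs_get_environ() kernel code; the 32K buffer size and the helper names are assumptions chosen to mirror our observation): a lookup that requires the whole environ to fit into a single fixed buffer gives up as soon as the environ outgrows that buffer, so the jobid variable is never found even though it is set.

      /* Illustrative userspace sketch only -- NOT cfs_get_environ(). */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/types.h>

      #define ENV_BUF_SIZE (32 * 1024)  /* assumed fixed buffer, mirroring ~32K */

      /* Look up "name" in /proc/<pid>/environ using one fixed buffer.
       * Returns 0 and copies the value into "val" on success, -1 otherwise. */
      static int lookup_env_var(pid_t pid, const char *name, char *val, size_t vlen)
      {
          char path[64];
          static char buf[ENV_BUF_SIZE];
          size_t n, off = 0, namelen = strlen(name);
          FILE *f;

          snprintf(path, sizeof(path), "/proc/%d/environ", (int)pid);
          f = fopen(path, "r");
          if (f == NULL)
              return -1;

          n = fread(buf, 1, sizeof(buf), f);
          /* Environ larger than the buffer: give up, like the reported error. */
          if (n == sizeof(buf) && fgetc(f) != EOF) {
              fclose(f);
              fprintf(stderr, "environ larger than %d bytes, giving up\n", ENV_BUF_SIZE);
              return -1;
          }
          fclose(f);

          /* environ holds NUL-separated "NAME=value" entries */
          while (off < n) {
              char *entry = buf + off;
              char *nul = memchr(entry, '\0', n - off);
              size_t entlen = nul ? (size_t)(nul - entry) : n - off;

              if (entlen > namelen && entry[namelen] == '=' &&
                  strncmp(entry, name, namelen) == 0) {
                  size_t vallen = entlen - namelen - 1;

                  if (vallen >= vlen)
                      vallen = vlen - 1;
                  memcpy(val, entry + namelen + 1, vallen);
                  val[vallen] = '\0';
                  return 0;
              }
              off += entlen + 1;
          }
          return -1;
      }

      int main(int argc, char **argv)
      {
          char jobid[64];

          if (argc < 2) {
              fprintf(stderr, "usage: %s <pid>\n", argv[0]);
              return 1;
          }
          if (lookup_env_var((pid_t)atoi(argv[1]), "SLURM_JOB_ID", jobid, sizeof(jobid)) == 0)
              printf("SLURM_JOB_ID=%s\n", jobid);
          else
              printf("SLURM_JOB_ID not found; jobstats would miss this process\n");
          return 0;
      }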

      Please find below the commands used to track the issue:

      [root@gpu-13-1 ~]# ps uw -q 15288
      USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
      suuser   15288 98.4  6.2 108826468 4144960 ?   Sl   13:55 235:46 terachem run.in
      
      [root@gpu-13-1 ~]# cat /proc/15288/environ | wc -c
      32936
      
      [root@gpu-13-1 ~]# scontrol pidinfo 15288
      Slurm job id 2376464 ends at Sun Jun 07 13:55:09 2015
      slurm_get_rem_time is 159433
      
      [root@gpu-13-1 ~]# squeue -j 2376464
                   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2376464      slac temp800_   suuser  R    3:43:25      1 gpu-13-1
      
      [root@gpu-13-1 ~]# lsof -p 15288 | grep /scratch
      terachem 15288 suuser    1w   REG 2395,496332    386348 144116383972642817 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
      terachem 15288 suuser    2w   REG 2395,496332        43 144116383972642818 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.err
      
      [root@gpu-13-1 ~]# ls -l /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
      -rw-r--r-- 1 suuser sugrp 386636 Jun  5 17:40 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
      [root@gpu-13-1 ~]# date
      Fri Jun  5 17:40:12 PDT 2015
      

      The fsname is regal, mounted on /scratch.
      No job_stats entries are seen for this job:

      [root@rcf-mgnt ~]# clush -w regal-oss[00-07] lctl get_param obdfilter.*.job_stats \| grep 2376464
      clush: regal-oss07: exited with exit code 1
      clush: regal-oss06: exited with exit code 1
      clush: regal-oss00: exited with exit code 1
      clush: regal-oss01: exited with exit code 1
      clush: regal-oss04: exited with exit code 1
      clush: regal-oss03: exited with exit code 1
      clush: regal-oss02: exited with exit code 1
      clush: regal-oss05: exited with exit code 1
      
      [root@regal-mds1 ~]# lctl get_param mdt.regal-MDT0000.job_stats | grep 2376464
      [root@regal-mds1 ~]# 
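
      To spot other processes on a node that are likely affected, here is a small companion sketch (again a userspace illustration; the 32K threshold mirrors the observation above and is an assumption, not a documented Lustre constant) that scans /proc and lists PIDs whose environ exceeds the limit:

      /* List PIDs whose /proc/<pid>/environ exceeds ~32 KiB (assumed threshold). */
      #include <ctype.h>
      #include <dirent.h>
      #include <stdio.h>

      #define ENV_LIMIT (32 * 1024)

      int main(void)
      {
          DIR *proc = opendir("/proc");
          struct dirent *de;
          char path[280], buf[4096];

          if (proc == NULL) {
              perror("opendir /proc");
              return 1;
          }
          while ((de = readdir(proc)) != NULL) {
              size_t total = 0, n;
              FILE *f;

              if (!isdigit((unsigned char)de->d_name[0]))
                  continue;    /* only numeric entries are PIDs */
              snprintf(path, sizeof(path), "/proc/%s/environ", de->d_name);
              f = fopen(path, "r");
              if (f == NULL)
                  continue;    /* process gone or permission denied */
              while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                  total += n;
              fclose(f);
              if (total > ENV_LIMIT)
                  printf("pid %s: environ is %zu bytes (over %d)\n",
                         de->d_name, total, ENV_LIMIT);
          }
          closedir(proc);
          return 0;
      }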
      

          People

            Assignee: Niu Yawei (Inactive)
            Reporter: Stephane Thiell
            Votes: 0
            Watchers: 10
