
Jobstats breaks when "Too long env variable." errors occur

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.5.3
    • None
    • 3

    Description

      We are seeing "Too long env variable" errors on a Lustre cluster at Stanford, which break jobstats reporting (using SLURM_JOB_ID). Job IDs associated with the processes that trigger these errors are simply ignored:

      LNetError: 15288:0:(linux-curproc.c:241:cfs_get_environ()) Too long env variable.
      LNetError: 15288:0:(linux-curproc.c:241:cfs_get_environ()) Skipped 2097 previous similar messages
      

      In our case, the user process environ is a bit more than 32K in size.
      The problem seems to come from lustre_get_jobid(), which looks up the job ID in the process environ when jobstats is enabled, while cfs_get_environ() is not able to handle a large environ (which may be wise). However, we think a user shouldn't be able to disable jobstats like that. A change to cfs_get_environ() might not be enough. Please advise.

      Please find below the commands used to track the issue:

      [root@gpu-13-1 ~]# ps uw -q 15288
      USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
      suuser   15288 98.4  6.2 108826468 4144960 ?   Sl   13:55 235:46 terachem run.in
      
      [root@gpu-13-1 ~]# cat /proc/15288/environ | wc -c
      32936
      
      [root@gpu-13-1 ~]# scontrol pidinfo 15288
      Slurm job id 2376464 ends at Sun Jun 07 13:55:09 2015
      slurm_get_rem_time is 159433
      
      [root@gpu-13-1 ~]# squeue -j 2376464
                   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2376464      slac temp800_   suuser  R    3:43:25      1 gpu-13-1
      
      [root@gpu-13-1 ~]# lsof -p 15288 | grep /scratch
      terachem 15288 suuser    1w   REG 2395,496332    386348 144116383972642817 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
      terachem 15288 suuser    2w   REG 2395,496332        43 144116383972642818 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.err
      
      [root@gpu-13-1 ~]# ls -l /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
      -rw-r--r-- 1 suuser sugrp 386636 Jun  5 17:40 /scratch/users/suuser/FeC2_catalyst/temp800_noFeECP_nanoFeC2/chunk_0080/run.out
      [root@gpu-13-1 ~]# date
      Fri Jun  5 17:40:12 PDT 2015
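
The `wc -c` check on /proc/<pid>/environ above gives only the total size. A per-variable breakdown can help spot which variables are responsible. The following is a minimal sketch, assuming a Linux /proc filesystem; the helper names `read_environ` and `environ_report` are made up for illustration and are not part of Lustre or Slurm.

```python
from pathlib import Path


def read_environ(pid: int) -> bytes:
    """Return the raw NUL-separated environ of a process (Linux only)."""
    return Path(f"/proc/{pid}/environ").read_bytes()


def environ_report(raw: bytes, top: int = 5):
    """Return (total size in bytes, largest entries as (name, size) pairs).

    The size of an entry is the length of "NAME=value" without the
    trailing NUL separator.
    """
    sizes = []
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty piece after the final NUL
        name, _, _value = entry.partition(b"=")
        sizes.append((name.decode(errors="replace"), len(entry)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return len(raw), sizes[:top]
```

For the process in this report, `environ_report(read_environ(15288))` would be run as root or as the process owner, since /proc/<pid>/environ is not world-readable.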
      

      fsname is regal mounted on /scratch.
      No jobstats report seen from this job:

      [root@rcf-mgnt ~]# clush -w regal-oss[00-07] lctl get_param obdfilter.*.job_stats \| grep 2376464
      clush: regal-oss07: exited with exit code 1
      clush: regal-oss06: exited with exit code 1
      clush: regal-oss00: exited with exit code 1
      clush: regal-oss01: exited with exit code 1
      clush: regal-oss04: exited with exit code 1
      clush: regal-oss03: exited with exit code 1
      clush: regal-oss02: exited with exit code 1
      clush: regal-oss05: exited with exit code 1
      
      [root@regal-mds1 ~]# lctl get_param mdt.regal-MDT0000.job_stats | grep 2376464
      [root@regal-mds1 ~]# 
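
The grep checks above can also be scripted when many job IDs need to be verified at once. This is a rough sketch that assumes the job_stats output lists each job as a `- job_id: <id>` line; the function name `job_ids` and the sample text in the usage note are illustrative, not real output from this cluster.

```python
import re

# Matches lines such as "- job_id:          2376001" (assumed layout).
JOB_ID_RE = re.compile(r"^\s*-\s*job_id:\s*(\S+)\s*$", re.MULTILINE)


def job_ids(job_stats_text: str) -> set:
    """Return the set of job IDs found in job_stats output."""
    return set(JOB_ID_RE.findall(job_stats_text))


def missing_jobs(job_stats_text: str, expected) -> set:
    """Return the expected job IDs that have no job_stats entry."""
    return {str(job) for job in expected} - job_ids(job_stats_text)
```

Given the text captured from `lctl get_param mdt.regal-MDT0000.job_stats`, `missing_jobs(text, [2376464])` would return `{"2376464"}` in the situation described here, because the job never appears in the stats.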
      

        Activity

          [LU-6695] Jobstats breaks when "Too long env variable." errors occur
          pjones Peter Jones added a comment -

          Landed for 2.8

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15177/
          Subject: LU-6695 jobstats: skip too long env variables
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 3c8a2d49ef4a17aad2973475178aea794b669f38

          gerrit Gerrit Updater added a comment -

          Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/15177
          Subject: LU-6695 jobstats: skip too long env variables
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 6463d85889467ce564c9fdcf0a792562d2c1aae6

          niu Niu Yawei (Inactive) added a comment -

          I see, we didn't expect such long env variables. It looks like we should just skip these long variables in cfs_get_environ().
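
The idea of skipping over-long variables can be illustrated with a small user-space model. This is a sketch only, not the Lustre patch: the function `find_env`, the `PAGE_SIZE = 4096` buffer limit, and the use of `OverflowError` to stand in for the "Too long env variable." failure are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed buffer limit; the real page size depends on the platform


def find_env(environ: bytes, key: str, buf_size: int = PAGE_SIZE,
             skip_long: bool = True):
    """Look up `key` in a NUL-separated environ blob.

    The blob is examined through a window of at most `buf_size` bytes,
    modelling a reader that cannot hold a variable larger than its buffer.
    """
    prefix = key.encode() + b"="
    offset = 0
    while offset < len(environ):
        window = environ[offset:offset + buf_size]
        end = window.find(b"\0")
        if end == -1 and len(window) == buf_size:
            # The current variable does not fit in the buffer.
            if not skip_long:
                # Modelled old behaviour: give up on the whole lookup.
                raise OverflowError("Too long env variable.")
            # Modelled fix: jump past this variable and keep scanning.
            next_nul = environ.find(b"\0", offset)
            if next_nul == -1:
                return None
            offset = next_nul + 1
            continue
        entry = window if end == -1 else window[:end]
        if entry.startswith(prefix):
            return entry[len(prefix):].decode()
        offset += len(entry) + 1
    return None
```

With an environ in which a very long PATH precedes SLURM_JOB_ID, `skip_long=False` fails before the job ID is reached, while `skip_long=True` still finds it. A variable that is itself longer than the buffer is never returned in this model.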

          sthiell Stephane Thiell added a comment -

          Hi Niu,
          Oh, I meant the whole environ. I've just checked and the two largest variables are PATH and LD_LIBRARY_PATH with 17559 and 6979 bytes, respectively, each one containing a large set of paths.

          niu Niu Yawei (Inactive) added a comment -

          "In our case, user process environ size is a bit more than 32K."

          Do you mean that a single env variable is larger than 32K, or the whole environ? cfs_get_environ() can't handle a variable that is larger than the page size.

          ezell Matt Ezell added a comment -

          We have also seen this during our recent testing of jobstats at ORNL.

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 10
