[LU-3577] BUG: soft lockup - CPU#25 stuck for 67s! [jbd2/dm-8-8:8966]; Kernel panic - not syncing: softlockup: hung tasks Created: 11/Jul/13  Updated: 21/Mar/18  Resolved: 21/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Roger Spellman (Inactive) Assignee: Peter Jones
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Kernel: 2.6.32-279.19.1.el6_lustre.2.1.5_1.0.3
Distro: CentOS release 6.4 (Final)
CPUs: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
32 Cores
128GB RAM on MDSes; 64GB RAM on OSSes
Active/Standby MDS
Two OSSes
12 OSTs
Each OST is 22T


Attachments: File bug.3     File bug.4    
Severity: 2
Rank (Obsolete): 9055

 Description   

We have a Lustre 2.1.5 system with two MDSes (active / standby), and two OSSes (active / active). Each OSS has 6 OSTs.

We filled the file system to 100%. To remove the files, one Lustre client ran the following script:

rm -rf /mnt/hss45/ost/ost-00/* &
rm -rf /mnt/hss45/ost/ost-01/* &
rm -rf /mnt/hss45/ost/ost-02/* &
rm -rf /mnt/hss45/ost/ost-03/* &
rm -rf /mnt/hss45/ost/ost-04/* &
rm -rf /mnt/hss45/ost/ost-05/* &
rm -rf /mnt/hss45/ost/ost-06/* &
rm -rf /mnt/hss45/ost/ost-07/* &
rm -rf /mnt/hss45/ost/ost-08/* &
rm -rf /mnt/hss45/ost/ost-09/* &
rm -rf /mnt/hss45/ost/ost-10/* &
rm -rf /mnt/hss45/ost/ost-11/* &

One OSS crashed with this error:

BUG: soft lockup - CPU#25 stuck for 67s! [jbd2/dm-8-8:8966]
. . .
Kernel panic - not syncing: softlockup: hung tasks

The OSS was STONITH'ed.

Shortly thereafter, the second OSS got the same error:

BUG: soft lockup - CPU#17 stuck for 67s! [jbd2/dm-6-8:21440]
Kernel panic - not syncing: softlockup: hung tasks

I have attached the full console output. There was nothing in /var/log/messages.



 Comments   
Comment by Peter Jones [ 11/Jul/13 ]

Thanks for the report, Roger. Given that you are running RHEL 6.4, is there any reason you chose 2.1.5 over 2.1.6? Other than rebuilding for RHEL 6.4, are there any other changes you made from a standard 2.1.5?

Comment by Roger Spellman (Inactive) [ 11/Jul/13 ]

Peter,

We are using 2.1.5 because we started this project a couple of months ago (before 2.1.6 was released), and we have promised a 2.1.x release to a customer pretty soon. We are pretty far into our QA cycle, so switching Lustre versions right now would set us back a bit.

We will go to 2.1.6 very soon. But if you say that this is a known bug in 2.1.5 that is fixed in 2.1.6, that will push us to 2.1.6 even sooner.

We make changes to configure scripts and Makefiles, so that we can build on our build machine.

We also make some minor functional changes to the code (we made them some time ago, in earlier releases). Here are the patches with those functional changes.

diff -rcN -x '~' -x '.orig' /build/lustre/lustre-2.1.5/lustre/ldlm/ldlm_pool.c 2.1.5/trunk/lustre-working_lustre.patch/lustre/ldlm/ldlm_pool.c
*** /build/lustre/lustre-2.1.5/lustre/ldlm/ldlm_pool.c  Tue Apr 30 10:34:06 2013
--- 2.1.5/trunk/lustre-working_lustre.patch/lustre/ldlm/ldlm_pool.c  Tue Jun 18 15:23:14 2013
***************
*** 143,149 ****
  /*
   * Max age for locks on clients.
   */
! #define LDLM_POOL_MAX_AGE (36000)

  /*
   * The granularity of SLV calculation.
--- 143,157 ----
  /*
   * Max age for locks on clients.
   */
! //#define LDLM_POOL_MAX_AGE (36000)
! /*
!  * Max age for locks on clients.
!  * Terascala: Set to default 2 minute max age
!  * Units are seconds.
!  * This actually kicks in lru eviction after 7 minutes at this setting.
!  */
! static u_int32_t ldlm_pool_max_age = 120;
!

  /*
   * The granularity of SLV calculation.
***************
*** 162,171 ****
  static inline __u64 ldlm_pool_slv_max(__u32 L)
  {
          /*
!          * Allow to have all locks for 1 client for 10 hrs.
!          * Formula is the following: limit * 10h / 1 client.
           */
!         __u64 lim = (__u64)L * LDLM_POOL_MAX_AGE / 1;
          return lim;
  }
--- 170,179 ----
  static inline __u64 ldlm_pool_slv_max(__u32 L)
  {
          /*
!          * Allow to have all locks for 1 client for 10 minutes.
!          * Formula is the following: limit * 2 min / 1 client.
           */
!         __u64 lim = (__u64)L * ldlm_pool_max_age / 1; /* Terascala */
          return lim;
  }
***************
*** 805,810 ****
--- 813,825 ----
          pool_vars[0].write_fptr = lprocfs_wr_atomic;
          lprocfs_add_vars(pl->pl_proc_dir, pool_vars, 0);

+         /* Terascala */
+         snprintf(var_name, MAX_STRING_SIZE, "ldlm_pool_max_age");
+         pool_vars[0].data = &ldlm_pool_max_age;
+         pool_vars[0].read_fptr = lprocfs_rd_uint;
+         pool_vars[0].write_fptr = lprocfs_wr_uint;
+         lprocfs_add_vars(pl->pl_proc_dir, pool_vars, 0);
+
          snprintf(var_name, MAX_STRING_SIZE, "state");
          pool_vars[0].data = pl;
          pool_vars[0].read_fptr = lprocfs_rd_pool_state;
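
For context, here is a standalone sketch (not part of the patch; the pool lock limit used below is a made-up example value) of how dropping the max lock age from the stock 36000 seconds to the patched default of 120 seconds shrinks the limit returned by ldlm_pool_slv_max():

/* Illustrative only: mirrors the ldlm_pool_slv_max() formula from the
 * patch above; the pool limit L is a made-up example value. */
#include <stdio.h>
#include <stdint.h>

static uint64_t slv_max(uint32_t limit, uint32_t max_age_sec)
{
        /* limit * max_age / 1 client, as in ldlm_pool.c */
        return (uint64_t)limit * max_age_sec / 1;
}

int main(void)
{
        uint32_t limit = 50000;  /* hypothetical pool lock limit */

        printf("stock   LDLM_POOL_MAX_AGE (36000 s): %llu\n",
               (unsigned long long)slv_max(limit, 36000));
        printf("patched ldlm_pool_max_age   (120 s): %llu\n",
               (unsigned long long)slv_max(limit, 120));
        return 0;
}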
diff -rcN -x '~' -x '.orig' /build/lustre/lustre-2.1.5/lustre/liblustre/super.c 2.1.5/trunk/lustre-working_lustre.patch/lustre/liblustre/super.c
*** /build/lustre/lustre-2.1.5/lustre/liblustre/super.c  Tue Apr 30 10:34:12 2013
--- 2.1.5/trunk/lustre-working_lustre.patch/lustre/liblustre/super.c  Tue Jun 18 15:23:14 2013
***************
*** 141,146 ****
--- 141,147 ----
          struct mdt_body *body = md->body;
          struct lov_stripe_md *lsm = md->lsm;
          struct intnl_stat *st = llu_i2stat(inode);
+         struct ll_sb_info *sbi = ll_i2sbi(inode);

          LASSERT ((lsm != NULL) == ((body->valid & OBD_MD_FLEASIZE) != 0));
***************
*** 181,187 ****
                  lli->lli_lvb.lvb_ctime = body->ctime;
          }
          if (S_ISREG(st->st_mode))
!                 st->st_blksize = min(2UL * PTLRPC_MAX_BRW_SIZE, LL_MAX_BLKSIZE);
          else
                  st->st_blksize = 4096;
          if (body->valid & OBD_MD_FLUID)
--- 182,188 ----
                  lli->lli_lvb.lvb_ctime = body->ctime;
          }
          if (S_ISREG(st->st_mode))
!                 st->st_blksize = min(2UL * PTLRPC_MAX_BRW_SIZE, 1UL << sbi->ll_max_blksize_bits);
          else
                  st->st_blksize = 4096;
          if (body->valid & OBD_MD_FLUID)
Hope this helps.
