Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5053

soft lockup in ptlrpcd on client

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0
    • Lustre 2.4.1
    • 3
    • 13959

    Description

      On our BG/Q client nodes (ppc64), we frequently see soft lockup messages like so:

      2014-05-12 17:24:44.936384 {RMP08Ma185908506} [mmcs]{0}.0.1:  kernel:BUG: soft lockup - CPU#32 stuck for 67s! [ptlrpcd_0:3202]
      2014-05-12 17:24:45.140774 {RMP08Ma185908506} [mmcs]{0}.16.1: BUG: soft lockup - CPU#65 stuck for 67s! [ptlrpcd_1:3203]
      2014-05-12 17:24:45.141140 {RMP08Ma185908506} [mmcs]{0}.16.1: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      2014-05-12 17:24:45.141506 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP: 80000000003715bc LR: 80000000003717cc CTR: 000000000000007d
      2014-05-12 17:24:45.141979 {RMP08Ma185908506} [mmcs]{0}.16.1: REGS: c0000003eb43ab40 TRAP: 0901   Not tainted  (2.6.32-358.11.1.bgq.4blueos.V1R2M1.bl2.1_0.ppc64)
      2014-05-12 17:24:45.142274 {RMP08Ma185908506} [mmcs]{0}.16.1: MSR: 0000000080029000 <EE,ME,CE>  CR: 44224424  XER: 00000000
      2014-05-12 17:24:45.142575 {RMP08Ma185908506} [mmcs]{0}.16.1: TASK = c0000003ecc14060[3203] 'ptlrpcd_1' THREAD: c0000003eb438000 CPU: 65
      2014-05-12 17:24:45.142979 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR00: 000000000003dfb8 c0000003eb43adc0 80000000003b5550 00000000000015b0 
      2014-05-12 17:24:45.143352 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR04: c00000031d82c330 0000000000000000 0000000000000000 0000000000000053 
      2014-05-12 17:24:45.143728 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR08: 0000000000000048 c00000031d82d110 0000000000000071 000000001eaffaf3 
      2014-05-12 17:24:45.144048 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR12: 0000000000002720 c0000000007f7200 
      2014-05-12 17:24:45.144424 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP [80000000003715bc] .__adler32+0x9c/0x220 [libcfs]
      2014-05-12 17:24:45.144861 {RMP08Ma185908506} [mmcs]{0}.16.1: LR [80000000003717cc] .adler32_update+0x1c/0x40 [libcfs]
      2014-05-12 17:24:45.145111 {RMP08Ma185908506} [mmcs]{0}.16.1: Call Trace:
      2014-05-12 17:24:45.145442 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43adc0] [c0000003eb43ae50] 0xc0000003eb43ae50 (unreliable)
      2014-05-12 17:24:45.145800 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ae40] [c0000000002041c4] .crypto_shash_update+0x4c/0x60
      2014-05-12 17:24:45.146124 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43aeb0] [c000000000204218] .shash_compat_update+0x40/0x80
      2014-05-12 17:24:45.146441 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43af70] [800000000037072c] .cfs_crypto_hash_update_page+0x8c/0xb0 [libcfs]
      2014-05-12 17:24:45.146758 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b030] [8000000001284f94] .osc_checksum_bulk+0x1c4/0x890 [osc]
      2014-05-12 17:24:45.147160 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b180] [80000000012863c0] .osc_brw_prep_request+0xd60/0x1d80 [osc]
      2014-05-12 17:24:45.147512 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b340] [800000000129931c] .osc_build_rpc+0xc8c/0x2320 [osc]
      2014-05-12 17:24:45.147814 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b4f0] [80000000012bf304] .osc_send_write_rpc+0x594/0xb40 [osc]
      2014-05-12 17:24:45.148150 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b6d0] [80000000012bff84] .osc_check_rpcs+0x6d4/0x1770 [osc]
      2014-05-12 17:24:45.148447 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b900] [80000000012c132c] .osc_io_unplug0+0x30c/0x6b0 [osc]
      2014-05-12 17:24:45.148812 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ba20] [800000000129b174] .brw_interpret+0x7c4/0x1940 [osc]
      2014-05-12 17:24:45.149257 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bb60] [8000000000edb6b4] .ptlrpc_check_set+0x3a4/0x4bb0 [ptlrpc]
      2014-05-12 17:24:45.149659 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bd20] [8000000000f2a80c] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
      2014-05-12 17:24:45.149960 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43be40] [8000000000f2acc8] .ptlrpcd+0x2b8/0x4c0 [ptlrpc]
      2014-05-12 17:24:45.150341 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bf90] [c00000000001b9a8] .kernel_thread+0x54/0x70
      

      I have noticed too that I can watch the cpu usage using top, and I will see about two CPUs that are pegged at exactly 100% CPU usage, all in the sys column. The rest of the CPUs are more evenly split among different usage columns: a little in user, most in sys and wa.

      The CPUs with the 100% usage seem to correspond to the ptlrpcd processes that trigger the soft lockup warning, which I suppose is not terribly surprising.

      We have seen this happening for well over a year, but are just getting around to look closer at it due to other recent work on the BG/Q clients.

      Attachments

        Issue Links

          Activity

            People

              green Oleg Drokin
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: