Lustre / LU-4790

CPU soft lockups (60+ seconds) in ptlrpcd threads running IOR resulting in eviction


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • None
    • None
    • 3
    • 13189

    Description

      I'm able to easily reproduce an eviction by running a simple IOR on one of our PPC64 clusters (Vulcan). It looks like this is due to CPU soft lockups of the ptlrpcd threads, in conjunction with those threads being incorrectly bound to a single CPU.

      The reproducer can be as simple as a file-per-process IOR using a 1M transfer size:

      srun -- ~/ior/src/ior -o /p/lustre/file -v -w -k -t1m -b1g -F
      

      When using a 512 compute node allocation, this will result in 128 threads on each of the 4 IO nodes performing the actual IO on behalf of the function-shipped compute nodes. So, as far as the Lustre client is concerned, 128 threads are writing to separate files with a 1M transfer size.

      The CPU soft lockups that result from this look like so:

      2014-03-19 13:38:45.928533 {RMP19Ma080953563} [mmcs]{2}.0.0: BUG: soft lockup - CPU#0 stuck for 66s! [ptlrpcd_1:3201]
      2014-03-19 13:38:45.928973 {RMP19Ma080953563} [mmcs]{2}.0.0: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      2014-03-19 13:38:45.929415 {RMP19Ma080953563} [mmcs]{2}.0.0: NIP: 80000000003515d0 LR: 80000000003517cc CTR: 000000000000007d
      2014-03-19 13:38:45.929697 {RMP19Ma080953563} [mmcs]{2}.0.0: REGS: c0000003c34d6b30 TRAP: 0901   Not tainted  (2.6.32-358.11.1.bgq.2llnl.V1R2M1.bl2.1_0.ppc64)
      2014-03-19 13:38:45.930094 {RMP19Ma080953563} [mmcs]{2}.0.0: MSR: 0000000080029000 <EE,ME,CE>  CR: 44224424  XER: 00000000
      2014-03-19 13:38:45.930449 {RMP19Ma080953563} [mmcs]{2}.0.0: TASK = c0000003c318f500[3201] 'ptlrpcd_1' THREAD: c0000003c34d4000 CPU: 0
      2014-03-19 13:38:45.930893 {RMP19Ma080953563} [mmcs]{2}.0.0: GPR00: 000000001ea5240a c0000003c34d6db0 8000000000395550 00000000000015b0 
      2014-03-19 13:38:45.931227 {RMP19Ma080953563} [mmcs]{2}.0.0: GPR04: c0000003ad9dd8e0 000000001ea8bb55 000000000003974c 0000000000000053 
      2014-03-19 13:38:45.931607 {RMP19Ma080953563} [mmcs]{2}.0.0: GPR08: 000000000000009c c0000003ad9de6c0 0000000000000029 00000000000000ff 
      2014-03-19 13:38:45.931929 {RMP19Ma080953563} [mmcs]{2}.0.0: GPR12: 0000000000001170 c0000000007c2500 
      2014-03-19 13:38:45.932301 {RMP19Ma080953563} [mmcs]{2}.0.0: NIP [80000000003515d0] .__adler32+0xb0/0x220 [libcfs]
      2014-03-19 13:38:45.932648 {RMP19Ma080953563} [mmcs]{2}.0.0: LR [80000000003517cc] .adler32_update+0x1c/0x40 [libcfs]
      2014-03-19 13:38:45.933093 {RMP19Ma080953563} [mmcs]{2}.0.0: Call Trace:
      2014-03-19 13:38:45.933443 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d6db0] [c0000003c34d6e40] 0xc0000003c34d6e40 (unreliable)
      2014-03-19 13:38:45.933807 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d6e30] [c0000000002041c4] .crypto_shash_update+0x4c/0x60
      2014-03-19 13:38:45.934089 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d6ea0] [c000000000204218] .shash_compat_update+0x40/0x80
      2014-03-19 13:38:45.934450 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d6f60] [800000000035072c] .cfs_crypto_hash_update_page+0x8c/0xb0 [libcfs]
      2014-03-19 13:38:45.934778 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d7020] [8000000001244f94] .osc_checksum_bulk+0x1c4/0x890 [osc]
      2014-03-19 13:38:45.935105 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d7170] [80000000012463c0] .osc_brw_prep_request+0xd60/0x1d80 [osc]
      2014-03-19 13:38:45.935373 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d7330] [800000000125931c] .osc_build_rpc+0xc8c/0x2320 [osc]
      2014-03-19 13:38:45.935676 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d74e0] [800000000127f2f4] .osc_send_write_rpc+0x594/0xb40 [osc]
      2014-03-19 13:38:45.935955 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d76c0] [800000000127ff74] .osc_check_rpcs+0x6d4/0x1770 [osc]
      2014-03-19 13:38:45.936251 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d78f0] [800000000128131c] .osc_io_unplug0+0x30c/0x6b0 [osc]
      2014-03-19 13:38:45.936528 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d7a10] [800000000125b174] .brw_interpret+0x7c4/0x1940 [osc]
      2014-03-19 13:38:45.936881 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d7b50] [8000000000e9a964] .ptlrpc_check_set+0x3a4/0x4d30 [ptlrpc]
      2014-03-19 13:38:45.937166 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d7d20] [8000000000ee9a9c] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
      2014-03-19 13:38:45.937456 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d7e40] [8000000000ee9f58] .ptlrpcd+0x2b8/0x4c0 [ptlrpc]
      2014-03-19 13:38:45.937733 {RMP19Ma080953563} [mmcs]{2}.0.0: [c0000003c34d7f90] [c00000000001b9a8] .kernel_thread+0x54/0x70
      2014-03-19 13:38:45.938068 {RMP19Ma080953563} [mmcs]{2}.0.0: Instruction dump:
      2014-03-19 13:38:45.938421 {RMP19Ma080953563} [mmcs]{2}.0.0: 89490000 88a90001 88c90002 89090003 88e90004 7c005214 89490005 7ca02a14 
      2014-03-19 13:38:45.938714 {RMP19Ma080953563} [mmcs]{2}.0.0: 7c005a14 89690006 7cc53214 7ca02a14 <88090008> 7d064214 7cc53214 7ce83a14 
      2014-03-19 13:38:45.939210 {RMP19Ma080953563} [mmcs]{2}.13.0: ^GMessage from syslogd@(none) at Mar 19 13:38:45 ...
      2014-03-19 13:38:45.939508 {RMP19Ma080953563} [mmcs]{2}.13.0:  kernel:BUG: soft lockup - CPU#0 stuck for 66s! [ptlrpcd_1:3201]
      

      While most of the stacks that have been dumped are in the __adler32 function, not all of them are. I think the ptlrpcd thread is processing requests just fine; it's just failing to schedule itself, and thus consuming an absurd amount of CPU.
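
      For what it's worth, here's a quick way to ballpark the per-RPC checksum cost on one of these cores. This uses zlib's adler32() from userspace as a stand-in for the kernel's libcfs adler implementation, so it's only a rough estimate, not the real in-kernel cost:

      /* Ballpark the cost of adler32 over one 1 MiB bulk on this core.
       * zlib's adler32() is just a stand-in for the kernel libcfs
       * implementation, so treat the result as a rough estimate only.
       * Build with: gcc -O2 adler_bench.c -o adler_bench -lz -lrt
       */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>
      #include <zlib.h>

      int main(void)
      {
          const size_t len = 1 << 20;     /* 1 MiB, one bulk RPC's worth of data */
          const int iters = 100;
          unsigned char *buf = malloc(len);
          struct timespec t0, t1;
          uLong sum = adler32(0L, Z_NULL, 0);
          double us;
          int i;

          if (buf == NULL)
              return 1;
          memset(buf, 0xa5, len);

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (i = 0; i < iters; i++)
              sum = adler32(sum, buf, (uInt)len);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
          printf("adler32 over 1 MiB: %.1f us per pass (sum=%lx)\n", us / iters, sum);
          free(buf);
          return 0;
      }

      With 128 writers each pushing 1M write RPCs through ptlrpcd threads stuck on CPU 0, even a modest per-RPC checksum cost would add up quickly.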

      As far as I can tell, all of the ptlrpcd threads are being pinned to CPU 0 along with the ll_ping thread. Thus, when one of the ptlrpcd threads sits on CPU 0 for so long, it starves the ll_ping thread which leads to the evictions I am seeing. The fact that all these threads are being pinned to CPU 0 is probably a bug in and of itself, but even with that behavior, a ptlrpcd thread should never sit on a core for 60+ seconds.
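
      For reference, this is how I'm double-checking the pinning from userspace; it just reads the allowed CPU mask for a given TID with sched_getaffinity(), and the same information shows up as Cpus_allowed_list in /proc/<tid>/status:

      /* Print the allowed CPU list for a given thread ID. Nothing
       * Lustre-specific; the same info is in /proc/<tid>/status as
       * Cpus_allowed_list. Build with: gcc -O2 affinity.c -o affinity
       */
      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/types.h>

      int main(int argc, char **argv)
      {
          cpu_set_t mask;
          pid_t tid;
          int cpu;

          if (argc != 2) {
              fprintf(stderr, "usage: %s <tid>\n", argv[0]);
              return 1;
          }
          tid = (pid_t)atoi(argv[1]);

          CPU_ZERO(&mask);
          if (sched_getaffinity(tid, sizeof(mask), &mask) != 0) {
              perror("sched_getaffinity");
              return 1;
          }

          printf("tid %d allowed on CPUs:", (int)tid);
          for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
              if (CPU_ISSET(cpu, &mask))
                  printf(" %d", cpu);
          printf("\n");
          return 0;
      }

      e.g. ./affinity 3201 for the ptlrpcd_1 thread in the trace above.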

      My theory is that async write requests are being queued up by the writing threads on the client node, and it's just taking the ptlrpcd thread an absurd amount of time to process all of them. Unfortunately, I can't seem to get a grip on how the client is supposed to work, so I haven't gathered any evidence to support this claim. Looking at the logs and the unstable page stats in proc, it does look like pages are being processed by the system just fine. If it wasn't for the soft lockup messages and evictions, I probably wouldn't have even noticed anything was wrong.

      Keep in mind, the client nodes on this system have 68 cores, are PPC64, and have 16K pages. The ptlrpcd threads seem to run only on core 0, but the threads submitting the write requests run on all cores.

      Can anybody give me a lead on what might be going wrong here? Is it possible that requests keep getting stuffed into the ptlrpcd thread's queue, keeping that thread running on the core without it ever scheduling? Looking at the ptlrpcd thread code, I don't see anywhere it would schedule except within the ptlrpcd() function itself. So if the request list it is processing is very long, is it conceivable that the thread would sit on a core for a very long time processing requests without ever scheduling? The client code is difficult enough to wade through that it's hard to understand exactly how all the machinery is supposed to work, let alone what might be going wrong.
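
      To make the question concrete, the kind of thing I'd expect to see is a voluntary reschedule point somewhere inside the drain loop. The sketch below is purely illustrative, not actual ptlrpc code; the queue type and helper names are made up:

      /* Illustrative only -- not actual ptlrpc code. The queue type and
       * the dequeue_work()/process_item()/queue_empty() helpers are made
       * up. The point is the cond_resched(): without a voluntary
       * reschedule somewhere in the drain loop, a long enough backlog
       * keeps the thread on the CPU, trips the soft lockup watchdog, and
       * starves anything else (like ll_ping) bound to the same core.
       */
      static int example_worker(void *arg)
      {
          struct example_queue *q = arg;

          while (!kthread_should_stop()) {
              struct example_item *item;

              while ((item = dequeue_work(q)) != NULL) {
                  process_item(item);
                  cond_resched();           /* yield between requests */
              }

              wait_event_interruptible(q->waitq,
                                       !queue_empty(q) ||
                                       kthread_should_stop());
          }
          return 0;
      }

      If ptlrpcd_check() really does drain an arbitrarily long request set with nothing like that cond_resched() in the loop, it would explain both the watchdog splats and the ll_ping starvation.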

      EDIT: I should also note that this is running our local 2.4.0-15chaos tag.

      Attachments

      Issue Links

      Activity

      People

        Assignee: bobijam Zhenyu Xu
        Reporter: prakash Prakash Surya (Inactive)
        Votes: 0
        Watchers: 8
