Lustre / LU-2263

CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes


    Description

      The IO nodes on Sequoia are going unresponsive with many CPU soft lockup messages in the logs. For example, a couple of messages from a node in this state are as follows:

      2012-11-01 23:44:47.681648 {DefaultControlEventListener} [mmcs]{161}.0.2: BUG: soft lockup - CPU#2 stuck for 67s! [ptlrpcd_56:3326]
      2012-11-01 23:44:47.720983 {DefaultControlEventListener} [mmcs]{161}.0.2: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      2012-11-01 23:44:47.760955 {DefaultControlEventListener} [mmcs]{161}.0.2: NIP: c00000000042e194 LR: 8000000003a674bc CTR: c00000000042e160
      2012-11-01 23:44:47.801187 {DefaultControlEventListener} [mmcs]{161}.0.2: REGS: c0000003ca2e7900 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
      2012-11-01 23:44:47.841287 {DefaultControlEventListener} [mmcs]{161}.0.2: MSR: 0000000080029000 <EE,ME,CE>  CR: 84284444  XER: 20000000
      2012-11-01 23:44:47.880979 {DefaultControlEventListener} [mmcs]{161}.0.2: TASK = c0000003ca1c7980[3326] 'ptlrpcd_56' THREAD: c0000003ca2e4000 CPU: 2
      2012-11-01 23:44:47.921379 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR00: 0000000080000038 c0000003ca2e7b80 c0000000006de510 c0000003ceca6278
      2012-11-01 23:44:47.960900 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
      2012-11-01 23:44:48.000948 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR08: 0000000000200200 0000000080000002 c0000003cefddc68 c00000000042e160
      2012-11-01 23:44:48.105877 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR12: 8000000003aebdb8 c000000000743f00
      2012-11-01 23:44:48.106188 {DefaultControlEventListener} [mmcs]{161}.0.2: NIP [c00000000042e194] ._spin_lock+0x34/0x44
      2012-11-01 23:44:48.121006 {DefaultControlEventListener} [mmcs]{161}.0.2: LR [8000000003a674bc] .ptlrpc_check_set+0xeac/0x4e80 [ptlrpc]
      2012-11-01 23:44:48.161100 {DefaultControlEventListener} [mmcs]{161}.0.2: Call Trace:
      2012-11-01 23:44:48.201258 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca2e7b80] [8000000003a674ac] .ptlrpc_check_set+0xe9c/0x4e80 [ptlrpc] (unreliable)
      2012-11-01 23:44:48.240907 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca2e7d20] [8000000003abd1dc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
      2012-11-01 23:44:48.280969 {DefaultControlEventListener} [mmcs]{161}.0.2: 

      The line it's stuck on:

      (gdb) l *ptlrpc_check_set+0xeac
      0x474bc is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/client.c:1666).
      

      And the source level info:

      1663                                 if (!ptlrpc_unregister_reply(req, 1))
      1664                                         continue;
      1665 
      1666                                 cfs_spin_lock(&imp->imp_lock);
      1667                                 if (ptlrpc_import_delay_req(imp, req, &status)){
      1668                                         /* put on delay list - only if we wait
      1669                                          * recovery finished - before send */
      

      Details of the next message on the console:

      BUG: soft lockup - CPU#16 stuck for 67s! [ptlrpcd_rcv:3269]
      2012-11-01 23:44:48.320953 {DefaultControlEventListener} [mmcs]{161}.0.2: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      2012-11-01 23:44:48.361045 {DefaultControlEventListener} [mmcs]{161}.0.2: NIP: c00000000042e18c LR: 8000000003a66b04 CTR: c00000000042e160
      2012-11-01 23:44:48.400942 {DefaultControlEventListener} [mmcs]{161}.0.2: REGS: c0000003ca203900 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
      2012-11-01 23:44:48.440963 {DefaultControlEventListener} [mmcs]{161}.0.2: MSR: 0000000080029000 <EE,ME,CE>  CR: 84228484  XER: 20000000
      2012-11-01 23:44:48.480905 {DefaultControlEventListener} [mmcs]{161}.0.2: TASK = c0000003c5d432a0[3269] 'ptlrpcd_rcv' THREAD: c0000003ca200000 CPU: 16
      2012-11-01 23:44:48.520898 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR00: 0000000080000038 c0000003ca203b80 c0000000006de510 c0000003ceca6278
      2012-11-01 23:44:48.560925 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
      2012-11-01 23:44:48.600920 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR08: c0000003ceca6260 0000000080000010 c0000003ce673420 c00000000042e160
      2012-11-01 23:44:48.640881 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR12: 8000000003aebdb8 c00000000074f500
      2012-11-01 23:44:48.680904 {DefaultControlEventListener} [mmcs]{161}.0.2: NIP [c00000000042e18c] ._spin_lock+0x2c/0x44
      2012-11-01 23:44:48.720885 {DefaultControlEventListener} [mmcs]{161}.0.2: LR [8000000003a66b04] .ptlrpc_check_set+0x4f4/0x4e80 [ptlrpc]
      2012-11-01 23:44:48.761372 {DefaultControlEventListener} [mmcs]{161}.0.2: Call Trace:
      2012-11-01 23:44:48.800923 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca203b80] [8000000003a66974] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc] (unreliable)
      2012-11-01 23:44:48.840892 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca203d20] [8000000003abd1dc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
      2012-11-01 23:44:48.880936 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca203e40] [8000000003abd76c] .ptlrpcd+0x35c/0x510 [ptlrpc]
      2012-11-01 23:44:48.920887 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca203f90] [c00000000001a9e0] .kernel_thread+0x54/0x70
      2012-11-01 23:44:48.961113 {DefaultControlEventListener} [mmcs]{161}.0.2: Instruction dump:
      2012-11-01 23:44:49.000886 {DefaultControlEventListener} [mmcs]{161}.0.2: 4bffffc8 38000000 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d
      2012-11-01 23:44:49.040902 {DefaultControlEventListener} [mmcs]{161}.0.2: 40c2fff0 4c00012c 2fa00000 4dfe0020 <7c210b78> 80030000 2fa00000 40defff4
      2012-11-01 23:44:49.121348 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca2e7e40] [8000000003abd76c] .ptlrpcd+0x35c/0x510 [ptlrpc]
      2012-11-01 23:44:49.160915 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca2e7f90] [c00000000001a9e0] .kernel_thread+0x54/0x70
      2012-11-01 23:44:49.201038 {DefaultControlEventListener} [mmcs]{161}.0.2: Instruction dump:
      2012-11-01 23:44:49.241229 {DefaultControlEventListener} [mmcs]{161}.0.2: 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c
      2012-11-01 23:44:49.281578 {DefaultControlEventListener} [mmcs]{161}.0.2: 2fa00000 4dfe0020 7c210b78 80030000 <2fa00000> 40defff4 7c421378 4bffffc8
      

      The line it's stuck on:

      (gdb) l *ptlrpc_check_set+0x4f4
      0x46b04 is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/client.c:1852).
      

      And the source level info:

      1849                        libcfs_nid2str(imp->imp_connection->c_peer.nid),
      1850                        lustre_msg_get_opc(req->rq_reqmsg));
      1851 
      1852                 cfs_spin_lock(&imp->imp_lock);
      1853                 /* Request already may be not on sending or delaying list. This
      1854                  * may happen in the case of marking it erroneous for the case
      1855                  * ptlrpc_import_delay_req(req, status) find it impossible to
      

      The symptoms look very similar to what we saw on the old Orion code base, which boiled down to "too many" ptlrpc threads contending with each other while interrupts were disabled. To get past that previous issue, we pulled in a patch to limit the number of threads created, thus reducing contention on the lock in question.
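
      For context, the earlier workaround amounted to capping how many worker threads get started, so fewer CPUs end up spinning on the same lock at once. Below is a minimal sketch of that idea only; the names, the cap value, and the helper are illustrative and are not the actual patch:

        /*
         * Illustrative only -- not the actual LLNL/Orion patch.  The idea is
         * to cap the number of worker threads instead of starting one per
         * online CPU, so fewer CPUs can pile up on the same spinlock at once.
         */
        #include <linux/kernel.h>
        #include <linux/kthread.h>
        #include <linux/cpumask.h>
        #include <linux/err.h>

        #define MAX_WORKER_THREADS 16   /* hypothetical cap */

        static int start_capped_workers(int (*worker_fn)(void *data))
        {
                unsigned int nthreads = min(num_online_cpus(),
                                            (unsigned int)MAX_WORKER_THREADS);
                unsigned int i;

                for (i = 0; i < nthreads; i++) {
                        struct task_struct *t;

                        /* one thread per slot, never more than the cap */
                        t = kthread_run(worker_fn, NULL, "worker_%u", i);
                        if (IS_ERR(t))
                                return PTR_ERR(t);
                }
                return 0;
        }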


          Activity


            jlevi Jodi Levi (Inactive) added a comment - Patch landed to master.

            bzzz Alex Zhuravlev added a comment - Yes... OK, let's see how it does at LLNL.

            liang Liang Zhen (Inactive) added a comment - No, I can't. Those traces look like someone forgot to unlock imp_lock, but I have already gone through all uses of imp_lock three times and can't find any such case. This is the only buggy code I have found so far.

            bzzz Alex Zhuravlev added a comment - Interesting... in the traces above, can you find one holding rq_lock?

            liang Liang Zhen (Inactive) added a comment -

            I did find a deadlock....

            ptlrpc_at_recv_early_reply() calls ptlrpc_at_set_req_timeout() while holding ptlrpc_request::rq_lock, and ptlrpc_at_set_req_timeout()->import_at_get_index() requires imp_lock. That violates Lustre's locking-order rule, which assumes rq_lock nests inside imp_lock.
            This was introduced by BZ16999.
            Patch is here: http://review.whamcloud.com/#change,4880
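
            For illustration, here is a minimal user-space sketch of that lock-order inversion. The pthread spinlocks merely stand in for imp_lock and rq_lock, and the function names are placeholders, not the real Lustre call paths:

              /*
               * AB-BA illustration of the inversion described above.  Lustre's
               * documented order is imp_lock first, then rq_lock.  Run long
               * enough and the two threads deadlock, each spinning on the lock
               * the other holds -- which is what the soft-lockup watchdog then
               * reports.  Build with: cc -pthread abba.c
               */
              #include <pthread.h>

              static pthread_spinlock_t imp_lock; /* stand-in for obd_import::imp_lock    */
              static pthread_spinlock_t rq_lock;  /* stand-in for ptlrpc_request::rq_lock */

              /* Follows the documented order: imp_lock, then rq_lock. */
              static void *check_set_like(void *arg)
              {
                      for (;;) {
                              pthread_spin_lock(&imp_lock);
                              pthread_spin_lock(&rq_lock);
                              pthread_spin_unlock(&rq_lock);
                              pthread_spin_unlock(&imp_lock);
                      }
                      return NULL;
              }

              /* The buggy path: rq_lock is already held when imp_lock is needed. */
              static void *early_reply_like(void *arg)
              {
                      for (;;) {
                              pthread_spin_lock(&rq_lock);
                              pthread_spin_lock(&imp_lock); /* reverse order: deadlocks */
                              pthread_spin_unlock(&imp_lock);
                              pthread_spin_unlock(&rq_lock);
                      }
                      return NULL;
              }

              int main(void)
              {
                      pthread_t a, b;

                      pthread_spin_init(&imp_lock, PTHREAD_PROCESS_PRIVATE);
                      pthread_spin_init(&rq_lock, PTHREAD_PROCESS_PRIVATE);
                      pthread_create(&a, NULL, check_set_like, NULL);
                      pthread_create(&b, NULL, early_reply_like, NULL);
                      pthread_join(a, NULL); /* never returns once the AB-BA hit occurs */
                      pthread_join(b, NULL);
                      return 0;
              }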

            morrone Christopher Morrone (Inactive) added a comment - And just to be clear, I believe that this WAS a deadlock situation, as detailed in this comment.

            bzzz Alex Zhuravlev added a comment - So it's roughly 1.1/128, or ~9 MB/s per OST on an average ION, and actually very few RPCs/second per import...

            morrone Christopher Morrone (Inactive) added a comment - On writes, we currently see about 850 GB/s, so with 768 IONs that is roughly 1.1 GB/s per ION (the ION is the Lustre client).

            morrone Christopher Morrone (Inactive) added a comment -

            When I had problems, I was running an ior something like this:

              ior -F -e -g -C -t 1m -b 512m -o /p/ls1/morrone/foo

            The ior was 98,304 MPI tasks in size.

            Under normal conditions, each Sequoia I/O Node (ION) handles I/O for 128 compute nodes, so a given ION will handle 128 files. We have currently left the filesystem default at 1 stripe per file, so each ION will be writing to up to 128 different OSTs at more or less the same time.

            bzzz Alex Zhuravlev added a comment -

            Christopher, Prakash, could you describe the workload and/or estimate it in RPCs/second, please?

            We usually limit the number of RPCs in flight to 8 (per import). To achieve 1 GB/s to an OST with 1 MB RPCs we'd need 1024 RPCs/second. Say we're grabbing imp_lock roughly 10 times for every processed RPC; that gives us ~10K lock acquisitions/second, but that is for the case of 1 GB/s going to a single OST.

            I think it'd be helpful to collect data on a CONFIG_LOCK_STAT-enabled kernel.

            If this is not a deadlock, then I'd consider a few possible contributors:
            1) CDEBUG() overhead while imp_lock is held
            2) cache ping-pong if OST_WRITEs go to different partitions
            3) a lot of non-committed requests making ptlrpc_free_committed() slow (which runs with imp_lock held)
            4) lack of need_resched() in ptlrpc_check_set() (though in all listed cases it was stuck in _spin_lock())
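
            For reference, a quick back-of-the-envelope restatement of the numbers in the two comments above. Only the 850 GB/s, 768 IONs, 128 OSTs per ION, 1 MB RPC size, and the rough "10 imp_lock grabs per RPC" figures come from this thread; everything else is derived:

              /* Rough numeric check of the rates discussed above; build with
               * `cc rates.c -o rates` and run.  The per-RPC lock-grab count is
               * a rough guess from the comments, not a measured value. */
              #include <stdio.h>

              int main(void)
              {
                      double fs_write_bw_gbs    = 850.0; /* aggregate write bandwidth, GB/s   */
                      double ions               = 768.0; /* Sequoia I/O nodes (Lustre clients)*/
                      double osts_per_ion       = 128.0; /* 1-stripe files, 128 CNs per ION   */
                      double rpc_size_mb        = 1.0;   /* 1 MB bulk RPCs (ior -t 1m)        */
                      double lock_grabs_per_rpc = 10.0;  /* rough guess from the comments     */

                      double per_ion_gbs  = fs_write_bw_gbs / ions;              /* ~1.1 GB/s   */
                      double per_ost_mbs  = per_ion_gbs * 1024.0 / osts_per_ion; /* ~9 MB/s     */
                      double rpcs_per_imp = per_ost_mbs / rpc_size_mb;           /* ~9 RPC/s    */
                      double rpcs_per_ion = per_ion_gbs * 1024.0 / rpc_size_mb;  /* ~1100 RPC/s */
                      double locks_per_s  = rpcs_per_ion * lock_grabs_per_rpc;   /* ~11K/s, but
                                                                                    spread over
                                                                                    128 imports */

                      printf("per ION:  %.2f GB/s\n", per_ion_gbs);
                      printf("per OST:  %.1f MB/s (~%.0f RPC/s per import)\n",
                             per_ost_mbs, rpcs_per_imp);
                      printf("per ION:  ~%.0f RPC/s across all imports\n", rpcs_per_ion);
                      printf("imp_lock: ~%.0f acquisitions/s node-wide\n", locks_per_s);
                      return 0;
              }

            This is consistent with the "very few RPCs/second per import" remark: the contention seen in the traces is hard to explain by raw RPC rate alone, which is why the deadlock explanation above fits better.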

            People

              liang Liang Zhen (Inactive)
              prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 9
