Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12564

ptlrpcd daemon sleeps while holding imp_lock spinlock

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.14.0, Lustre 2.12.7
    • Lustre 2.11.0, Lustre 2.10.8
    • None
    • 3
    • 9223372036854775807

    Description

      RCU stalls occurring in ptlrpc_check_set were tracked to a ptlrpcd daemon that was sleeping while holding an import lock spinlock.

      PID: 12920  TASK: ffff880834b47040  CPU: 7   COMMAND: "ptlrpcd_03_02"
       #0 [ffffc90005d5bb20] __schedule at ffffffff816a067c
       #1 [ffffc90005d5bba8] __cond_resched_lock at ffffffff810855ea
       #2 [ffffc90005d5bbc0] __purge_vmap_area_lazy at ffffffff811a2210
       #3 [ffffc90005d5bbf0] free_vmap_area_noflush at ffffffff811a22bd
       #4 [ffffc90005d5bc08] remove_vm_area at ffffffff811a4477
       #5 [ffffc90005d5bc28] __vunmap at ffffffff811a44b6
       #6 [ffffc90005d5bc50] vfree at ffffffff811a459e
       #7 [ffffc90005d5bc68] null_free_repbuf at ffffffffa08190d8 [ptlrpc]
       #8 [ffffc90005d5bc80] sptlrpc_cli_free_repbuf at ffffffffa080b3d0 [ptlrpc]
       #9 [ffffc90005d5bca0] __ptlrpc_req_finished at ffffffffa07cfb35 [ptlrpc]
      #10 [ffffc90005d5bcd0] ptlrpc_free_request at ffffffffa07d03ca [ptlrpc]
      #11 [ffffc90005d5bce8] ptlrpc_free_committed at ffffffffa07d2bca [ptlrpc]
      #12 [ffffc90005d5bd30] ptlrpc_check_set at ffffffffa07d5eb1 [ptlrpc]
      #13 [ffffc90005d5bdd0] ptlrpcd_check at ffffffffa08022dc [ptlrpc]
      #14 [ffffc90005d5be18] ptlrpcd at ffffffffa080284e [ptlrpc]
      #15 [ffffc90005d5bf08] kthread at ffffffff8107c167
      #16 [ffffc90005d5bf50] ret_from_fork at ffffffff8180024a
      

      Examining the code in ptlrpc_check_set()->after_reply() shows that ptlrpc_free_committed is called while holding the imp_lock.

      1408 static int after_reply(struct ptlrpc_request *req)
      1409 {
      1410         struct obd_import *imp = req->rq_import;
      .....
      1551         if (imp->imp_replayable) {
      1552                 spin_lock(&imp->imp_lock);
      1553                 /*
      1554                  * No point in adding already-committed requests to the replay
      1555                  * list, we will just remove them immediately. b=9829
      1556                  */
      1557                 if (req->rq_transno != 0 &&
      1558                     (req->rq_transno >
      1559                      lustre_msg_get_last_committed(req->rq_repmsg) ||
      1560                      req->rq_replay)) {
      1561                         /** version recovery */
      1562                         ptlrpc_save_versions(req);
      1563                         ptlrpc_retain_replayable_request(req, imp);
      1564                 } else if (req->rq_commit_cb &&
      1565                            list_empty(&req->rq_replay_list)) {
      1566                         /*
      1567                          * NB: don't call rq_commit_cb if it's already on
      1568                          * rq_replay_list, ptlrpc_free_committed() will call
      1569                          * it later, see LU-3618 for details
      1570                          */
      1571                         spin_unlock(&imp->imp_lock);
      1572                         req->rq_commit_cb(req);
      1573                         spin_lock(&imp->imp_lock);
      1574                 }
      1575
      1576                 /*
      1577                  * Replay-enabled imports return commit-status information.
      1578                  */
      1579                 committed = lustre_msg_get_last_committed(req->rq_repmsg);
      1580                 if (likely(committed > imp->imp_peer_committed_transno))
      1581                         imp->imp_peer_committed_transno = committed;
      1582
      1583                 ptlrpc_free_committed(imp);
      .....
      1599                 spin_unlock(&imp->imp_lock);
      1600         }
      1601
      1602         RETURN(rc);
      1603 }
      

      This causes a deadlock when another ptlrpcd daemon spins on a CPU trying to get the lock and the holder of the lock gets scheduled on the same CPU.

      crash_x86_64> bt -t 12918
      PID: 12918  TASK: ffff880825b3a1c0  CPU: 7   COMMAND: "ptlrpcd_03_00"
                    START: __schedule at ffffffff816a067c
        [ffffc9000536bd10] lustre_msg_set_transno at ffffffffa07e1df7 [ptlrpc]
        [ffffc9000536bd20] _raw_spin_lock at ffffffff816a5951     <--- waiting on imp_lock
        [ffffc9000536bd30] ptlrpc_check_set at ffffffffa07d5e3e [ptlrpc]
        [ffffc9000536bd88] schedule_timeout at ffffffff816a4459
        [ffffc9000536bdd0] ptlrpcd_check at ffffffffa08022dc [ptlrpc]
        [ffffc9000536be18] ptlrpcd at ffffffffa080284e [ptlrpc]
        [ffffc9000536be58] default_wake_function at ffffffff81086380
        [ffffc9000536bf08] kthread at ffffffff8107c167
        [ffffc9000536bf18] ptlrpcd at ffffffffa08024d0 [ptlrpc]
        [ffffc9000536bf20] kthread at ffffffff8107c040
        [ffffc9000536bf50] ret_from_fork at ffffffff8180024a
      
      crash_x86_64> bt -l 12918 | grep -A 1 ptlrpc_check_set
       #2 [ffffc9000536bd30] ptlrpc_check_set at ffffffffa07d5e3e [ptlrpc]
          /home/abuild/rpmbuild/BUILD/cray-lustre-2.11.0.200_cray_90_g022bd19/lustre/ptlrpc/client.c: 1474
      
      ptlrpcd_03_00 is trying to get the imp_lock; ptlrpd_03_02 holds it:
      
      CPU 7 RUNQUEUE: ffff88085f9e1f00
        CURRENT: PID: 12918  TASK: ffff880825b3a1c0  COMMAND: "ptlrpcd_03_00"
        RT PRIO_ARRAY: ffff88085f9e2140
           [no tasks queued]
        CFS RB_ROOT: ffff88085f9e1fb0
           [120] PID: 3968   TASK: ffff88050760afc0  COMMAND: "kworker/7:0"
           [120] PID: 13361  TASK: ffff88083f701080  COMMAND: "gunicorn"
           [120] PID: 32734  TASK: ffff880333e06480  COMMAND: "tail"
           [120] PID: 13384  TASK: ffff880837144140  COMMAND: "gunicorn"
           [120] PID: 15161  TASK: ffff880825ea1240  COMMAND: "sssd_nss"
           [120] PID: 14128  TASK: ffff88076005a040  COMMAND: "mongod"
           [120] PID: 10737  TASK: ffff880836f86240  COMMAND: "cron"
           [120] PID: 44     TASK: ffff88082372b000  COMMAND: "ksoftirqd/7"
           [120] PID: 12920  TASK: ffff880834b47040  COMMAND: "ptlrpcd_03_02"
      

      Attachments

        Issue Links

          Activity

            [LU-12564] ptlrpcd daemon sleeps while holding imp_lock spinlock

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40110/
            Subject: LU-12564 libcfs: Use vfree_atomic instead of vfree
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1dc31de15ff270befceb64257bd84f15ffe75000

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40110/ Subject: LU-12564 libcfs: Use vfree_atomic instead of vfree Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 1dc31de15ff270befceb64257bd84f15ffe75000
            pjones Peter Jones added a comment -

            Landed for 2.14

            pjones Peter Jones added a comment - Landed for 2.14

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40136/
            Subject: LU-12564 libcfs: Use vfree_atomic instead of vfree
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7a9c0ca690eb00a6e314322b62fb1fd3e9b31f0e

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40136/ Subject: LU-12564 libcfs: Use vfree_atomic instead of vfree Project: fs/lustre-release Branch: master Current Patch Set: Commit: 7a9c0ca690eb00a6e314322b62fb1fd3e9b31f0e

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40136
            Subject: LU-12564 libcfs: Use vfree_atomic instead of vfree
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e58c298d094cefe67a7d9625107188f2589afefe

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40136 Subject: LU-12564 libcfs: Use vfree_atomic instead of vfree Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e58c298d094cefe67a7d9625107188f2589afefe

            It seems that the sleepiness of vfree() has been hit in other places, since vfree_atomic() does exactly what Patrick suggested - putting the memory on a linked list and then freeing it in a work queue.

            adilger Andreas Dilger added a comment - It seems that the sleepiness of vfree() has been hit in other places, since vfree_atomic() does exactly what Patrick suggested - putting the memory on a linked list and then freeing it in a work queue.

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40110
            Subject: LU-12564 Use vfree_atomic instead of vfree
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 3fc8f4d0fdc3f37b3764681c2208f7a40c794922

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40110 Subject: LU-12564 Use vfree_atomic instead of vfree Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 3fc8f4d0fdc3f37b3764681c2208f7a40c794922
            gerrit Gerrit Updater added a comment - - edited

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40080
            Subject: LU-12564 ptlrpc: Do not free requests under spinlock in ptlrpc_free_committed
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 084e27911d36e6033bce616e30f8c8abcaa724d1

            gerrit Gerrit Updater added a comment - - edited Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40080 Subject: LU-12564 ptlrpc: Do not free requests under spinlock in ptlrpc_free_committed Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 084e27911d36e6033bce616e30f8c8abcaa724d1

            Ah, all right, thank you.  I think maybe - amk - that's the thing to try, and we can politely ask Neil Brown to look at it.  (A work queue feels hacky, though obviously valid, so if we can avoid it... though again another reason for asking Neil is I imagine he knows what the normal pattern is in this sort of situation.)

            "Another pattern I use is moving the free 'req' entries to a local list_head and free them in bulk outside the spinlock."

            Yeah, I considered this as the first option.  In this case, "outside the spinlock" can be several functions away, and there are various callers in to the chain who hold the spinlock for their own purposes, so it seemed impractical.  (To add to the confusion, as you probably noticed, the relevant area is sometimes called without the lock held.  So that case would be straightforward to fix, since the freeing code only needs the spinlock relatively briefly...  But it's also called by stuff that holds it over much larger swathes of code...)

            pfarrell Patrick Farrell (Inactive) added a comment - - edited Ah, all right, thank you.  I think maybe - amk - that's the thing to try, and we can politely ask Neil Brown to look at it.  (A work queue feels hacky, though obviously valid, so if we can avoid it... though again another reason for asking Neil is I imagine he knows what the normal pattern is in this sort of situation.) "Another pattern I use is moving the free 'req' entries to a local list_head and free them in bulk outside the spinlock." Yeah, I considered this as the first option.  In this case, "outside the spinlock" can be several functions away, and there are various callers in to the chain who hold the spinlock for their own purposes, so it seemed impractical.  (To add to the confusion, as you probably noticed, the relevant area is sometimes called without the lock held.  So that case would be straightforward to fix, since the freeing code only needs the spinlock relatively briefly...  But it's also called by stuff that holds it over much larger swathes of code...)
            stancheff Shaun Tancheff added a comment - - edited

            I started looking through the code for the RCU option and I got a little lost  so I couldn't confirm that RCU is a good choice there ... a similar in_atomic() like scheme would also work .. pushing to a work queue if the kvfree could sleep.

            I think an RCU-alike would also work here and sleeping in the call_rcu()/kfree_rcu() type of context should be okay.

            Another pattern I use is moving the free 'req' entries to a local list_head and free them in bulk outside the spinlock.

            stancheff Shaun Tancheff added a comment - - edited I started looking through the code for the RCU option and I got a little lost  so I couldn't confirm that RCU is a good choice there ... a similar in_atomic() like scheme would also work .. pushing to a work queue if the kvfree could sleep. I think  an RCU-alike would also work here and sleeping in the call_rcu()/kfree_rcu() type of context should be okay. Another pattern I use is moving the free 'req' entries to a local list_head and free them in bulk outside the spinlock.

            James,

            Do you have a ticket for the udev, etc, conversion, assuming that's the source of the kasprintf stuff?  Shaun's suggestion looks very useful (thank you) and we should get it associated with that.  (And also try to move the discussion of that issue there.)

            Shaun,

            Any comment on my RCU suggestion for the memory freeing issue (a little earlier in the LU, before James brought up the related kasprintf issue)?

            pfarrell Patrick Farrell (Inactive) added a comment - James, Do you have a ticket for the udev, etc, conversion, assuming that's the source of the kasprintf stuff?  Shaun's suggestion looks very useful (thank you) and we should get it associated with that.  (And also try to move the discussion of that issue there.) Shaun, Any comment on my RCU suggestion for the memory freeing issue (a little earlier in the LU, before James brought up the related kasprintf issue)?

            People

              green Oleg Drokin
              amk Ann Koehler (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: