
mdt_identity_upcall calls sleeping function under rwlock

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.9.0

    Description

      Running on RHEL7.1 with CONFIG_DEBUG_ATOMIC_SLEEP enabled caught this gem:

      Mar 27 21:12:13 centos6-16 kernel: BUG: sleeping function called from invalid context at mm/slab.c:3054
      Mar 27 21:12:13 centos6-16 kernel: in_atomic(): 1, irqs_disabled(): 0, pid: 19599, name: mdt00_001
      Mar 27 21:12:13 centos6-16 kernel: 1 lock held by mdt00_001/19599:
      Mar 27 21:12:13 centos6-16 kernel: #0:  (&cache->uc_upcall_rwlock){......}, at: [<ffffffffa0b15ad1>] mdt_identity_do_upcall+0x91/0x470 [mdt]
      Mar 27 21:12:13 centos6-16 kernel: CPU: 3 PID: 19599 Comm: mdt00_001 Tainted: GF       W  O--------------   3.10.0-debug #5
      Mar 27 21:12:13 centos6-16 kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      Mar 27 21:12:13 centos6-16 kernel: ffff8800808945c0 000000006b10abfb ffff88008b1eb808 ffffffff816ccb68
      Mar 27 21:12:13 centos6-16 kernel: ffff88008b1eb820 ffffffff810a8fd9 0000000000000000 ffff88008b1eb8b8
      Mar 27 21:12:13 centos6-16 kernel: ffffffff811bdeda ffff880074b48000 0000000000000246 0000000000000000
      Mar 27 21:12:13 centos6-16 kernel: Call Trace:
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff816ccb68>] dump_stack+0x19/0x1b
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff810a8fd9>] __might_sleep+0xe9/0x110
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff811bdeda>] __kmalloc_track_caller+0x11a/0x620
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa0b15b0b>] ? mdt_identity_do_upcall+0xcb/0x470 [mdt]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff81182f91>] kstrdup+0x31/0x60
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa0b15b0b>] mdt_identity_do_upcall+0xcb/0x470 [mdt]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa02de48f>] upcall_cache_get_entry+0x2af/0x8e0 [obdclass]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa050ef87>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa0536882>] ? __req_capsule_get+0x162/0x710 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa05116cf>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa0b16437>] mdt_identity_get+0x17/0x50 [mdt]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa0af921a>] mdt_init_ucred_reint+0x23a/0x380 [mdt]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa0ae9dcb>] mdt_reint_internal+0x24b/0x760 [mdt]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa0aea442>] mdt_intent_reint+0x162/0x400 [mdt]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa0af447a>] mdt_intent_policy+0x57a/0xbe0 [mdt]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa04c6f66>] ldlm_lock_enqueue+0x326/0x900 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa014f505>] ? cfs_hash_rw_unlock+0x15/0x20 [libcfs]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa04ed672>] ldlm_handle_enqueue0+0x502/0x1520 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa05135b0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa05670e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa056b5c5>] tgt_request_handle+0x645/0xfe0 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa051c881>] ptlrpc_server_handle_request+0x231/0xab0 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa051a3f8>] ? ptlrpc_wait_event+0xb8/0x360 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa05207d0>] ptlrpc_main+0xae0/0x1ee0 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff816d5b97>] ? _raw_spin_unlock_irq+0x27/0x50
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff816d38be>] ? __schedule+0x2fe/0x810
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffffa051fcf0>] ? ptlrpc_register_service+0xf20/0xf20 [ptlrpc]
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff8109c00a>] kthread+0xea/0xf0
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff8109bf20>] ? kthread_create_on_node+0x140/0x140
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff816df2bc>] ret_from_fork+0x7c/0xb0
      Mar 27 21:12:13 centos6-16 kernel: [<ffffffff8109bf20>] ? kthread_create_on_node+0x140/0x140
      

        Activity

          [LU-6447] mdt_identity_upcall calls sleeping function under rwlock
          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14432/
          Subject: LU-6447 mdt: mdt_identity_upcall to not block with rwlock held
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: e8273a3dd71c4e6ab5ca9de3fbfbc0f7603d6930

          pjones Peter Jones added a comment -

          Niu

          Could you please refresh Oleg's patch to address the existing concerns?

          Thanks

          Peter


          morrone Christopher Morrone (Inactive) added a comment - edited

          This seems to be implicated in long (15 second) hangs that I sometimes see when poking around the Lustre filesystem interactively. That is going to be a big issue for our hotline when 2.8 goes live, so we really need a fix. I would think this should be on the docket for fixing before 2.9 comes out.

          green Oleg Drokin added a comment -

          Yes, it's a problem on RHEL7 with the debug atomic sleep option enabled.
          In the meantime I am carrying my patch referenced above, even though it's not perfect, to reduce the amount of noise in my logs.


          dinatale2 Giuseppe Di Natale (Inactive) added a comment -

          We've started seeing a very similar call stack with the same error in a testbed we've stood up.

          green Oleg Drokin added a comment -

          This is my first stab at the problem, unless somebody has better ideas.


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/14432
          Subject: LU-6447 mdt: mdt_identity_upcall to not block with rwlock held
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 912a0861bc6fff558483a58bc29cad76c7ae4681

          People

            niu Niu Yawei (Inactive)
            green Oleg Drokin