LU-613: Lustre-Client dead-lock during binary exec() over Lustre FS


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.2.0, Lustre 2.1.1
    • Affects Version/s: Lustre 2.0.0
    • None
    • Lustre 2.0.0.1
    • 3
    • 4784

    Description

      At Tera-100 we have experienced several Lustre-Client pseudo-hang situations where console/kernel messages and stack traces started to be reported, first for one task, as follows:
      ====================================================================================
      BUG: soft lockup - CPU#16 stuck for 61s! [APPN77:41132]
      Modules linked in: iptable_filter ip_tables x_tables lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) vtune_drv(U) ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache nfs_acl auth_rpcgss sunrpc acpi_cpufreq freq_table rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma igb dca ext3 jbd sd_mod crc_t10dif ahci dm_mod [last unloaded: scsi_wait_scan]
      CPU 16:
      Modules linked in: iptable_filter ip_tables x_tables lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) vtune_drv(U) ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache nfs_acl auth_rpcgss sunrpc acpi_cpufreq freq_table rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma igb dca ext3 jbd sd_mod crc_t10dif ahci dm_mod [last unloaded: scsi_wait_scan]
      Pid: 41132, comm: APPN77 Not tainted 2.6.32-71.14.1.el6.Bull.20.x86_64 #1 bullx super-node
      RIP: 0010:[<ffffffff8146961e>] [<ffffffff8146961e>] _spin_lock+0x1e/0x30
      RSP: 0018:ffff880eef97d738 EFLAGS: 00000297
      RAX: 0000000000000032 RBX: ffff880eef97d738 RCX: 00000000c0000100
      RDX: 0000000000000031 RSI: 0000000000000292 RDI: ffff88022e0f6340
      RBP: ffffffff8100cc6e R08: ffff880eef97c000 R09: 00000000ffffffff
      R10: 0000000000000000 R11: 0000000000000001 R12: ffff880404bac000
      R13: ffff880404bac008 R14: ffffffff00000001 R15: ffff88107cf06140
      FS: 00002aaed2b2bf20(0000) GS:ffff880036700000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000fbc020 CR3: 00000003fc156000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Call Trace:
      [<ffffffffa043780d>] ? cl_object_attr_lock+0x1d/0x20 [obdclass]
      [<ffffffffa07a88d9>] ? ccc_lock_state+0x169/0x350 [lustre]
      [<ffffffffa0443870>] ? cl_lock_disclosure+0xa0/0x100 [obdclass]
      [<ffffffffa043f10f>] ? cl_lock_state_signal+0x5f/0x130 [obdclass]
      [<ffffffffa0717282>] ? lov_lock_wait+0x202/0x260 [lov]
      [<ffffffffa07a910f>] ? ccc_lock_fits_into+0x2f/0x190 [lustre]
      [<ffffffff8104da10>] ? default_wake_function+0x0/0x20
      [<ffffffffa043f264>] ? cl_lock_state_set+0x84/0x190 [obdclass]
      [<ffffffffa043f96c>] ? cl_wait_try+0x11c/0x2c0 [obdclass]
      [<ffffffffa043ec21>] ? cl_lock_mutex_tail+0x51/0x60 [obdclass]
      [<ffffffffa0440003>] ? cl_lock_mutex_get+0x93/0xd0 [obdclass]
      [<ffffffffa044033a>] ? cl_wait+0x8a/0x260 [obdclass]
      [<ffffffffa07a64e5>] ? cl_glimpse_lock+0x1a5/0x390 [lustre]
      [<ffffffffa07a6c64>] ? cl_glimpse_size+0x174/0x180 [lustre]
      [<ffffffffa07b1516>] ? cl_inode2ccc+0x36/0x400 [lustre]
      [<ffffffffa0794051>] ? ll_file_mmap+0xd1/0x260 [lustre]
      [<ffffffff8110ce89>] ? __pagevec_free+0x59/0xb0
      [<ffffffff81142802>] ? kmem_cache_alloc+0x182/0x190
      [<ffffffff8112a230>] ? mmap_region+0x410/0x5a0
      [<ffffffff8112a6fa>] ? do_mmap_pgoff+0x33a/0x380
      [<ffffffff811ab478>] ? elf_map+0x168/0x170
      [<ffffffff811acbac>] ? load_elf_binary+0x61c/0x1b10
      [<ffffffff8111f821>] ? follow_page+0x321/0x460
      [<ffffffff81124a5f>] ? __get_user_pages+0x10f/0x420
      [<ffffffff811a98bc>] ? load_misc_binary+0xac/0x3e0
      [<ffffffff8116055b>] ? search_binary_handler+0x10b/0x350
      [<ffffffff81160af1>] ? do_execve+0x251/0x310
      [<ffffffff8125a43a>] ? strncpy_from_user+0x4a/0x90
      [<ffffffff8100a5da>] ? sys_execve+0x4a/0x80
      [<ffffffff8100c5ca>] ? stub_execve+0x6a/0xc0
      ====================================================================================

      then for multiple other tasks, as follows:
      ================================================
      INFO: task APPN77:41138 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      APPN77 D ffff8800367ceaa0 0 41138 41127 0x00000000
      ffff880ea77598e8 0000000000000082 0000000000000000 000000010000000f
      ffff8810561ddb18 0000000000000082 ffff8803fc01de30 ffff880eef97d678
      ffff8810414d1020 ffff880ea7759fd8 000000000000eaa0 ffff8810414d1020
      Call Trace:
      [<ffffffff81467b9d>] schedule_timeout+0x20d/0x2e0
      [<ffffffff81468ac2>] __down+0x72/0xb0
      [<ffffffff81084361>] down+0x41/0x50
      [<ffffffffa076e652>] ll_inode_size_lock+0x42/0x130 [lustre]
      [<ffffffffa043f264>] ? cl_lock_state_set+0x84/0x190 [obdclass]
      [<ffffffffa07aa762>] ccc_prep_size+0x192/0x200 [lustre]
      [<ffffffffa07afaf4>] vvp_io_fault_start+0x104/0x490 [lustre]
      [<ffffffffa0440365>] ? cl_wait+0xb5/0x260 [obdclass]
      [<ffffffffa04449c8>] cl_io_start+0x68/0x170 [obdclass]
      [<ffffffffa0448ad0>] cl_io_loop+0x110/0x1c0 [obdclass]
      [<ffffffffa0793a76>] ll_fault0+0xb6/0x260 [lustre]
      [<ffffffffa0793e68>] ll_fault+0x48/0x160 [lustre]
      [<ffffffff81123764>] __do_fault+0x54/0x500
      [<ffffffff81123cad>] handle_pte_fault+0x9d/0xa90
      [<ffffffff8113e73a>] ? alloc_pages_current+0x9a/0x100
      [<ffffffff8103c147>] ? pte_alloc_one+0x37/0x50
      [<ffffffff8112488d>] handle_mm_fault+0x1ed/0x2b0
      [<ffffffff8146c043>] do_page_fault+0x123/0x3a0
      [<ffffffff81469ab5>] page_fault+0x25/0x30
      ================================================

      Forced crash-dumps were taken, and each time they show the same dead-lock between two tasks of the same MPI application exec()'ing/page-faulting on the same binary residing on a Lustre FS.

      The exact stacks showing the dead-lock are as follows:
      ==================================================
      crash> ps | grep APPN77
      > 41132 1 16 ffff88107cf06140 RU 0.0 96 12 APPN77
      41138 1 28 ffff8810414d0c60 UN 0.0 201880 476 APPN77
      crash> bt ffff88107cf06140 ffff8810414d0c60
      PID: 41132 TASK: ffff88107cf06140 CPU: 16 COMMAND: "APPN77"
      #0 [ffff880036707e80] crash_nmi_callback at ffffffff810267c6
      #1 [ffff880036707e90] notifier_call_chain at ffffffff8146c315
      #2 [ffff880036707ed0] atomic_notifier_call_chain at ffffffff8146c37a
      #3 [ffff880036707ee0] notify_die at ffffffff8108451e
      #4 [ffff880036707f10] do_nmi at ffffffff8146a44c
      #5 [ffff880036707f50] nmi at ffffffff81469d40
      [exception RIP: _spin_lock+30]
      RIP: ffffffff8146961e RSP: ffff880eef97d738 RFLAGS: 00000297
      RAX: 0000000000000032 RBX: ffff88087ce81d80 RCX: 00000000c0000100
      RDX: 0000000000000031 RSI: 0000000000000292 RDI: ffff88022e0f6340
      RBP: ffff880eef97d738 R8: ffff880eef97c000 R9: 00000000ffffffff
      R10: 0000000000000000 R11: 0000000000000001 R12: ffff880404bac130
      R13: ffff88087a54ec10 R14: ffff88043d476078 R15: ffff88022e0f6348
      ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
      --- <NMI exception stack> ---
      #6 [ffff880eef97d738] _spin_lock at ffffffff8146961e
      #7 [ffff880eef97d740] cl_object_attr_lock at ffffffffa043780d
      #8 [ffff880eef97d750] ccc_lock_state at ffffffffa07a88d9
      #9 [ffff880eef97d7e0] cl_lock_state_signal at ffffffffa043f10f
      #10 [ffff880eef97d840] cl_lock_state_set at ffffffffa043f264
      #11 [ffff880eef97d890] cl_wait_try at ffffffffa043f96c
      #12 [ffff880eef97d900] cl_wait at ffffffffa044033a
      #13 [ffff880eef97d960] cl_glimpse_lock at ffffffffa07a64e5
      #14 [ffff880eef97d9f0] cl_glimpse_size at ffffffffa07a6c64
      #15 [ffff880eef97da70] ll_file_mmap at ffffffffa0794051
      #16 [ffff880eef97dae0] mmap_region at ffffffff8112a230
      #17 [ffff880eef97dba0] do_mmap_pgoff at ffffffff8112a6fa
      #18 [ffff880eef97dc10] elf_map at ffffffff811ab478
      #19 [ffff880eef97dc70] load_elf_binary at ffffffff811acbac
      #20 [ffff880eef97de40] search_binary_handler at ffffffff8116055b
      #21 [ffff880eef97deb0] do_execve at ffffffff81160af1
      #22 [ffff880eef97df20] sys_execve at ffffffff8100a5da
      #23 [ffff880eef97df50] stub_execve at ffffffff8100c5ca
      RIP: 0000003b436a6db7 RSP: 00007fff4a5283e8 RFLAGS: 00000246
      RAX: 000000000000003b RBX: 0000000000fb1148 RCX: ffffffffffffffff
      RDX: 0000000000fe1dc8 RSI: 0000000000fbc568 RDI: 0000000000fbc5c8
      RBP: 0000000000fbc8e8 R8: 000000000070e420 R9: 000000000000a0ac
      R10: 0000000000000053 R11: 0000000000000246 R12: 0000000000000000
      R13: 00000000004dbc18 R14: 0000000000fe11d4 R15: 0000000000000001
      ORIG_RAX: 000000000000003b CS: 0033 SS: 002b

      PID: 41138 TASK: ffff8810414d0c60 CPU: 28 COMMAND: "APPN77"
      #0 [ffff880ea7759828] schedule at ffffffff81466e81
      #1 [ffff880ea77598f0] schedule_timeout at ffffffff81467b9d
      #2 [ffff880ea77599a0] __down at ffffffff81468ac2
      #3 [ffff880ea77599f0] down at ffffffff81084361
      #4 [ffff880ea7759a20] ll_inode_size_lock at ffffffffa076e652
      #5 [ffff880ea7759a80] ccc_prep_size at ffffffffa07aa762
      #6 [ffff880ea7759ae0] vvp_io_fault_start at ffffffffa07afaf4
      #7 [ffff880ea7759b80] cl_io_start at ffffffffa04449c8
      #8 [ffff880ea7759be0] cl_io_loop at ffffffffa0448ad0
      #9 [ffff880ea7759c50] ll_fault0 at ffffffffa0793a76
      #10 [ffff880ea7759ce0] ll_fault at ffffffffa0793e68
      #11 [ffff880ea7759d50] __do_fault at ffffffff81123764
      #12 [ffff880ea7759de0] handle_pte_fault at ffffffff81123cad
      #13 [ffff880ea7759eb0] handle_mm_fault at ffffffff8112488d
      #14 [ffff880ea7759f00] do_page_fault at ffffffff8146c043
      #15 [ffff880ea7759f50] page_fault at ffffffff81469ab5
      RIP: 0000003b42e0f9e5 RSP: 00007fff33276920 RFLAGS: 00010206
      RAX: 0000000000000000 RBX: 0000003b4301f128 RCX: 00007fc8ca4594c8
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000005824a0
      RBP: 00007fff332769e0 R8: 0000003b4301e998 R9: 0000000000000015
      R10: 00007fc8ca4582b0 R11: 0000003b4301f128 R12: 0000003b4301f128
      R13: 0000000000000001 R14: 0000000000000000 R15: 00007fff33276b70
      ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
      ==================================================

      Both locks involved are the "top" object's cl_object_header->coh_attr_guard spin-lock and the ll_inode_info->lli_size_sem mutex.

      Here is the "story":
      =====================
      Task 41132, from ccc_lock_state()->cl_isize_lock()->ll_inode_size_lock(), takes
      ll_inode_info->lli_size_sem
      then,
      from ccc_lock_state()->cl_object_attr_lock(), blocks on
      cl_object_header->coh_attr_guard

      Task 41138, from ccc_object_size_lock()->cl_object_attr_lock(), takes
      cl_object_header->coh_attr_guard
      then,
      from ccc_object_size_lock()->cl_isize_write()->ll_inode_size_lock(), blocks on
      ll_inode_info->lli_size_sem
      =====================
      So it seems that the problem comes from the fact that ccc_object_size_lock() has been called/inlined by ccc_prep_size() without the "vfslock" parameter set, and thus did not take ll_inode_info->lli_size_sem first, even though cl_isize_write() requires it to be held!
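
      To make the inversion easier to see, here is a minimal userspace sketch (plain pthreads, not Lustre code) of the two acquisition orders described above. "size_sem" and "attr_guard" are only stand-ins for ll_inode_info->lli_size_sem and cl_object_header->coh_attr_guard (in the kernel one is a semaphore and the other a spin-lock; both are mutexes here for simplicity):
      ====================================================================================
      /* Toy model of the AB/BA lock inversion; build with: gcc -pthread deadlock.c */
      #include <pthread.h>

      static pthread_mutex_t size_sem   = PTHREAD_MUTEX_INITIALIZER; /* stands in for lli_size_sem   */
      static pthread_mutex_t attr_guard = PTHREAD_MUTEX_INITIALIZER; /* stands in for coh_attr_guard */

      /* task 41132 path (ccc_lock_state): size_sem first, then attr_guard */
      static void *lock_state_path(void *arg)
      {
              for (;;) {
                      pthread_mutex_lock(&size_sem);    /* cl_isize_lock()/ll_inode_size_lock() */
                      pthread_mutex_lock(&attr_guard);  /* cl_object_attr_lock()                */
                      pthread_mutex_unlock(&attr_guard);
                      pthread_mutex_unlock(&size_sem);
              }
              return NULL;
      }

      /* task 41138 path (ccc_prep_size() with "vfslock" unset): the reverse order */
      static void *prep_size_path(void *arg)
      {
              for (;;) {
                      pthread_mutex_lock(&attr_guard);  /* ccc_object_size_lock()->cl_object_attr_lock() */
                      pthread_mutex_lock(&size_sem);    /* cl_isize_write() needs lli_size_sem           */
                      pthread_mutex_unlock(&size_sem);
                      pthread_mutex_unlock(&attr_guard);
              }
              return NULL;
      }

      int main(void)
      {
              pthread_t t1, t2;

              pthread_create(&t1, NULL, lock_state_path, NULL);
              pthread_create(&t2, NULL, prep_size_path, NULL);
              /* within moments each thread holds one lock and waits on the other:
               * the process hangs, just like the two APPN77 tasks above */
              pthread_join(t1, NULL);
              pthread_join(t2, NULL);
              return 0;
      }
      ====================================================================================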

      We may be encountering a new problem/scenario related to binary exec() over Lustre, or a regression introduced by the patches for LU-122/LU-300/bz#21122/bz#22669 already installed on-site.

      A quick (and possibly dirty) fix for this problem could be to always call ccc_prep_size() from vvp_io_fault_start() with the "vfslock" parameter set.
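
      In the toy model above, the proposed workaround amounts to forcing both paths into a single ordering; the snippet below (a drop-in replacement for prep_size_path() in the sketch) only illustrates the idea, the real change would be in how vvp_io_fault_start() calls ccc_prep_size():
      ====================================================================================
      /* task 41138 path with "vfslock" set: lli_size_sem is taken first, i.e. the
       * same order as the ccc_lock_state() path, so no circular wait can form */
      static void *prep_size_path_fixed(void *arg)
      {
              for (;;) {
                      pthread_mutex_lock(&size_sem);    /* ll_inode_size_lock() taken up front */
                      pthread_mutex_lock(&attr_guard);  /* cl_object_attr_lock()               */
                      pthread_mutex_unlock(&attr_guard);
                      pthread_mutex_unlock(&size_sem);
              }
              return NULL;
      }
      ====================================================================================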

            People

              Assignee: Jinshan Xiong (Inactive)
              Reporter: Alexandre Louvet (Inactive)
              Votes: 0
              Watchers: 8
