Lustre / LU-10904

racer hangs on umount


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Severity: 3

    Description

      racer completes all tests, but hangs while unmounting the file system. We see this problem frequently.

       

      On the client console, we see the umount command hung

      [41678.588785] umount          D 0000000000000000     0 13439  13432 0x00000000
      [41678.588786]  ffff880058147da8 ffff880058147de0 ffff880063b84cc0 ffff880058148000
      [41678.588788]  ffff880058147de0 00000001009dda10 ffff88007fc10840 0000000000000000
      [41678.588789]  ffff880058147dc0 ffffffff81611eb5 ffff88007fc10840 ffff880058147e68
      [41678.588789] Call Trace:
      [41678.588792]  [<ffffffff81611eb5>] schedule+0x35/0x80
      [41678.588794]  [<ffffffff81614c71>] schedule_timeout+0x161/0x2d0
      [41678.589164]  [<ffffffffa0d34837>] ll_kill_super+0x77/0x150 [lustre]
      [41678.589219]  [<ffffffffa0838054>] lustre_kill_super+0x34/0x40 [obdclass]
      [41678.589247]  [<ffffffff8120d4ff>] deactivate_locked_super+0x3f/0x70
      [41678.589260]  [<ffffffff812289cb>] cleanup_mnt+0x3b/0x80
      [41678.589264]  [<ffffffff8109d198>] task_work_run+0x78/0x90
      [41678.589267]  [<ffffffff8107b60f>] exit_to_usermode_loop+0x91/0xc2
      [41678.589304]  [<ffffffff81003ae5>] syscall_return_slowpath+0x85/0xa0
      [41678.589327]  [<ffffffff816160a8>] int_ret_from_sys_call+0x25/0x9f
      [41678.591515] DWARF2 unwinder stuck at int_ret_from_sys_call+0x25/0x9f
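
      When the hang is reproduced interactively, the same kind of blocked-task dump can be requested on demand. The sketch below is a hypothetical helper (not part of the Lustre test framework) that checks whether umount is stuck in the uninterruptible "D" state and then triggers a sysrq 'w' dump of all blocked tasks to the console; it uses only the standard /proc/<pid>/stat and /proc/sysrq-trigger interfaces.

      #!/usr/bin/env python3
      # Hypothetical triage helper: confirm umount is in D state and dump the
      # stacks of all blocked tasks to the console (same format as quoted above).
      import glob

      def d_state_pids(comm="umount"):
          """Return pids of processes named `comm` that are in D state."""
          pids = []
          for stat in glob.glob("/proc/[0-9]*/stat"):
              try:
                  fields = open(stat).read().split()
              except OSError:
                  continue                  # process exited while we scanned
              # /proc/<pid>/stat layout: pid (comm) state ...
              if fields[1] == f"({comm})" and fields[2] == "D":
                  pids.append(int(fields[0]))
          return pids

      if __name__ == "__main__":
          stuck = d_state_pids()
          if stuck:
              print(f"umount pid(s) in D state: {stuck}")
              # sysrq 'w' dumps all uninterruptible (blocked) tasks to the console.
              with open("/proc/sysrq-trigger", "w") as sysrq:
                  sysrq.write("w")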
      
      

       

      The stack trace for the hung umount on the client is common to all of these failures. What varies between failures is the content of the other console logs.
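
      To compare failures, the blocked-task dumps can be pulled out of each node's console log and grouped by process name. The following is a minimal sketch, assuming the dump format shown above (a task header line in "D" state followed by a Call Trace); it is a hypothetical helper, not part of the test framework.

      #!/usr/bin/env python3
      # Hypothetical helper: extract D-state task dumps from a console log and
      # group them by the name of the blocked process.
      import re
      import sys
      from collections import defaultdict

      # Task header, e.g. "[41678.588785] umount   D 0000000000000000  0 13439 ..."
      TASK_HDR = re.compile(r"^\[\s*[\d.]+\]\s+(\S+)\s+D\s")
      # Continuation lines: raw stack words, "Call Trace:", or "[<addr>] frame".
      DUMP_LINE = re.compile(r"^\[\s*[\d.]+\]\s+(Call Trace:|\[<|[0-9a-f]{16}\b)")

      def blocked_tasks(log_path):
          """Return {process name: [dump lines]} for every D-state dump in the log."""
          tasks = defaultdict(list)
          current = None
          with open(log_path, errors="replace") as log:
              for line in log:
                  hdr = TASK_HDR.match(line)
                  if hdr:
                      current = hdr.group(1)          # e.g. "umount", "jbd2/vda1-8"
                      tasks[current].append(line.rstrip())
                  elif current and DUMP_LINE.match(line):
                      tasks[current].append(line.rstrip())
                  else:
                      current = None                  # end of this task's dump
          return tasks

      if __name__ == "__main__":
          for comm, lines in blocked_tasks(sys.argv[1]).items():
              print(f"== {comm} ({len(lines)} dump lines) ==")

      Run against the client, MDS, and OSS console logs from a failed session, this shows at a glance which processes were stuck on each node.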

       

      In some cases, there is little information in the console logs as to why umount is hung. The following failures are examples of this

      https://testing.hpdd.intel.com/test_sets/1a8349fe-3780-11e8-960d-52540065bddc

      https://testing.hpdd.intel.com/test_sets/49efa63a-2e87-11e8-b74b-52540065bddc

       

       

      Other racer hangs on unmount have more information in the client and MDS console logs.

       

      On the MDS console log, we see two hung processes. The jbd2 journal thread is blocked waiting on buffer I/O, and auditd is waiting for that journal commit to complete.

      [41561.767660] jbd2/vda1-8     D ffff880036476eb0     0   263      2 0x00000000
      [41561.768587] Call Trace:
      [41561.768925]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
      [41561.769652]  [<ffffffff816b40e9>] schedule+0x29/0x70
      [41561.770284]  [<ffffffff816b1a49>] schedule_timeout+0x239/0x2c0
      [41561.771136]  [<ffffffff812fd2d0>] ? generic_make_request_checks+0x1a0/0x3a0
      [41561.772022]  [<ffffffff81063f5e>] ? kvm_clock_get_cycles+0x1e/0x20
      [41561.772875]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
      [41561.773511]  [<ffffffff816b35ed>] io_schedule_timeout+0xad/0x130
      [41561.774276]  [<ffffffff816b3688>] io_schedule+0x18/0x20
      [41561.775031]  [<ffffffff816b2071>] bit_wait_io+0x11/0x50
      [41561.775682]  [<ffffffff816b1b97>] __wait_on_bit+0x67/0x90
      [41561.776361]  [<ffffffff816b2060>] ? bit_wait+0x50/0x50
      [41561.777112]  [<ffffffff816b1c41>] out_of_line_wait_on_bit+0x81/0xb0
      [41561.777910]  [<ffffffff810b5080>] ? wake_bit_function+0x40/0x40
      [41561.778756]  [<ffffffff8123b3fa>] __wait_on_buffer+0x2a/0x30
      [41561.779893]  [<ffffffffc00b87f1>] jbd2_journal_commit_transaction+0x1781/0x19b0 [jbd2]
      [41561.780881]  [<ffffffff810c28a0>] ? finish_task_switch+0x50/0x170
      [41561.781650]  [<ffffffffc00bdac9>] kjournald2+0xc9/0x260 [jbd2]
      [41561.782385]  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
      [41561.783226]  [<ffffffffc00bda00>] ? commit_timeout+0x10/0x10 [jbd2]
      [41561.784024]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [41561.784710]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [41561.785478]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [41561.786238]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      …
      [41561.874288] auditd          D ffff88007911bf40     0   464      1 0x00000000
      [41561.875280] Call Trace:
      [41561.875598]  [<ffffffff816b40e9>] schedule+0x29/0x70
      [41561.876257]  [<ffffffffc00bd4a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
      [41561.877183]  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
      [41561.877931]  [<ffffffffc00bea92>] jbd2_complete_transaction+0x52/0xa0 [jbd2]
      [41561.879009]  [<ffffffffc00e1d52>] ext4_sync_file+0x292/0x320 [ext4]
      [41561.879805]  [<ffffffff81238477>] do_fsync+0x67/0xb0
      [41561.880476]  [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146
      [41561.881309]  [<ffffffff81238760>] SyS_fsync+0x10/0x20
      [41561.882029]  [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
      [41561.882794]  [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
      

       

      An example of this is

      https://testing.hpdd.intel.com/test_sets/28935b90-2e9f-11e8-9e0e-52540065bddc

       

      Other test hangs have additional information in the client console log. We see a couple of errors before racer completes

      [38788.150606] Lustre: lustre-MDT0000-mdc-ffff880000015000: Connection restored to 10.2.8.165@tcp (at 10.2.8.165@tcp)
      [38887.196337] LustreError: 8603:0:(statahead.c:1591:start_statahead_thread()) can't start ll_sa thread, rc: -4
      [38920.523725] Lustre: 25748:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523359053/real 1523359053]  req@ffff8800639b96c0 x1597353644599184/t0(0) o36->lustre-MDT0000-mdc-ffff88006424a000@10.2.8.165@tcp:12/10 lens 608/4768 e 0 to 1 dl 1523359107 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
      [38920.523749] Lustre: 25748:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [38920.523799] Lustre: lustre-MDT0000-mdc-ffff88006424a000: Connection to lustre-MDT0000 (at 10.2.8.165@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [38920.523800] Lustre: Skipped 2 previous similar messages
      [38920.523857] LustreError: 25748:0:(llite_lib.c:1511:ll_md_setattr()) md_setattr fails: rc = -4
      [38920.531836] Lustre: lustre-MDT0000-mdc-ffff88006424a000: Connection restored to 10.2.8.165@tcp (at 10.2.8.165@tcp)
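
      The rc values in these messages are negated Linux errno codes: rc: -4 is -EINTR (interrupted system call), which is what both the failed statahead thread start and the failed md_setattr report. A minimal decoding sketch, using only the standard Python errno module (a hypothetical helper, not part of the test framework):

      #!/usr/bin/env python3
      # Hypothetical helper: translate negative rc values from Lustre console
      # messages (negated Linux errno codes) into symbolic names.
      import errno
      import os

      def decode_rc(rc):
          """Map a console rc such as -4 to its errno name and description."""
          err = -rc
          name = errno.errorcode.get(err, f"errno {err}")
          return f"rc = {rc} ({name}: {os.strerror(err)})"

      if __name__ == "__main__":
          print(decode_rc(-4))    # rc = -4 (EINTR: Interrupted system call)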
      
      

       

      An example of this is at

      https://testing.hpdd.intel.com/test_sets/8209f4b2-3d46-11e8-8f8a-52540065bddc

       

      Other clients have different information in their console logs. Some have many of the following messages

      [10286.777823] ll_sa_21479     D ffff880078c30000     0 21498      2 0x00000080
      [10286.779544] Call Trace:
      [10286.780839]  [<ffffffff816b40e9>] schedule+0x29/0x70
      [10286.782475]  [<ffffffffc0cfa6d1>] ll_statahead_thread+0x651/0xbf0 [lustre]
      [10286.784173]  [<ffffffffc0cfa080>] ? ll_agl_thread+0x3e0/0x3e0 [lustre]
      [10286.785825]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [10286.787342]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [10286.788946]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [10286.790502]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [10286.792104] ll_agl_21479    D ffff880078c35ee0     0 21499      2 0x00000080
      [10286.793798] Call Trace:
      [10286.795131]  [<ffffffff816b40e9>] schedule+0x29/0x70
      [10286.796658]  [<ffffffffc0cf9f6b>] ll_agl_thread+0x2cb/0x3e0 [lustre]
      [10286.798302]  [<ffffffffc0cf9ca0>] ? ll_agl_trigger+0x520/0x520 [lustre]
      [10286.799953]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [10286.801459]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [10286.803059]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [10286.804593]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      
      

       

      Logs for these types of failures are at

      https://testing.hpdd.intel.com/test_sets/847d5946-3152-11e8-b6a0-52540065bddc


            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez, Inactive)
              Votes: 0
              Watchers: 1
