  Lustre / LU-13656

lfs migrate -m hangs for a few minutes at start (sometimes)


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.5
    • Labels: None
    • Environment: Lustre 2.12.5 RC1 - CentOS 7.6 3.10.0-957.27.2.el7_lustre.pl2.x86_64
    • Severity: 3

    Description

      With Lustre 2.12.5 RC1, lfs migrate -m now sometimes hangs at start. The directory being migrated is not accessible from any client for a few minutes. For us, this appears to be a regression in 2.12.5 compared to 2.12.4. Often the hang doesn't generate any trace, but this morning I saw one on the source MDT, MDT0001. The goal here is to migrate a directory from MDT0001 to MDT0003:

      lfs migrate -m 3 -v /fir/users/galvisf
      
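      For reference, this is roughly how the hang shows up; a minimal sketch (the second command is just an example of any client-side access, run from a different client node):

      # On one client: start the directory migration (same command as above)
      time lfs migrate -m 3 -v /fir/users/galvisf

      # Meanwhile, from another client: any access to the directory blocks
      # for a few minutes while the migrate is starting
      time ls -l /fir/users/galvisf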

      The first backtrace to appear is pasted below, after a few minutes of hang. I'm also attaching the kernel logs from MDT1 as fir-md1-s2_2.12.5_20200609_kern.log:

      Jun 09 09:28:01 fir-md1-s2 kernel: LNet: Service thread pid 23883 was inactive for 200.15s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 09 09:28:01 fir-md1-s2 kernel: Pid: 23883, comm: mdt01_069 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      Jun 09 09:28:01 fir-md1-s2 kernel: Call Trace:
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cefc80>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cf1dff>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cf46be>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc16cfec2>] osp_md_object_lock+0x162/0x2d0 [osp]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc15e38c4>] lod_object_lock+0xf4/0x780 [lod]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc1665bbe>] mdd_object_lock+0x3e/0xe0 [mdd]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ce401>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ce99a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14e356e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14e58d5>] mdt_reint_rename+0x2c5/0x2b90 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ee963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14cb273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14d66e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d9066a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d3344b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d36db4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffff86ec2e81>] kthread+0xd1/0xe0
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffff87577c24>] ret_from_fork_nospec_begin+0xe/0x21
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      
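      If useful, additional state could be captured on the MDS while a hang like this is in progress; a rough sketch (the output path is just an example):

      # Dump the Lustre kernel debug buffer to a file
      lctl dk /tmp/lustre-debug.$(date +%s).log

      # Dump all task stacks into the kernel log (requires sysrq to be enabled)
      echo t > /proc/sysrq-trigger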

      Then, finally, the thread completed:

      Jun 09 09:28:11 fir-md1-s2 kernel: LNet: Service thread pid 23636 completed after 210.08s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      
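      To get an idea of how often these episodes happen and how long they last, the two watchdog messages above can be matched in the MDS kernel log; a quick sketch (the time window is arbitrary):

      # List recent "inactive"/"completed" service thread messages on the MDS
      journalctl -k --since "2020-06-09 09:00" | \
          grep -E 'Service thread pid [0-9]+ (was inactive for|completed after)'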

      On the client, operations have resumed and the migration is now in progress:

      [root@fir-rbh01 robinhood]# lfs getdirstripe /fir/users/galvisf/
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
      mdtidx		 FID[seq:oid:ver]
           3		 [0x2800401af:0x132aa:0x0]		
           1		 [0x240057d88:0xa6e5:0x0]
      
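      Once things resume, the ",migrating" flag in lmv_hash_type eventually clears when the restripe finishes; a small sketch to wait for that (the polling interval is arbitrary):

      # Poll until the directory is no longer flagged as migrating
      while lfs getdirstripe /fir/users/galvisf | grep -q migrating; do
          sleep 30
      done
      echo "directory migration finished"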

      But this had an impact on production during that time: while the lfs migrate was starting, jobs for this specific user were blocked on I/O. This is why we wanted to report this problem.

      Thanks!
      Stephane

      Attachments

        fir-md1-s2_2.12.5_20200609_kern.log

          People

            Assignee: laisiyao Lai Siyao
            Reporter: sthiell Stephane Thiell
