[LU-13656] lfs migrate -m hangs a few minutes at start (sometimes) Created: 09/Jun/20  Updated: 04/Nov/20  Resolved: 04/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre 2.12.5 RC1 - CentOS 7.6 3.10.0-957.27.2.el7_lustre.pl2.x86_64


Attachments: Text File fir-md1-s2_2.12.5_20200609_kern.log    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With Lustre 2.12.5 RC1, lfs migrate -m now sometimes hangs at start. While it hangs, the directory being migrated is not accessible from any client for a few minutes. For us, this looks like a regression in 2.12.5 compared to 2.12.4. Often the hang doesn't generate any trace, but this morning I saw one on the source MDT, MDT0001. The goal here is to migrate a directory from MDT0001 to MDT0003:

lfs migrate -m 3 -v /fir/users/galvisf
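
The directory's current MDT index can be confirmed from any client before starting; just as an illustrative check (the index should match the source MDT, MDT0001 in this case):

lfs getdirstripe -m /fir/users/galvisf     # prints the directory's MDT index; should be 1 here before the migration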

The first backtrace seen is pasted below; it appeared after a few minutes of hanging. I'm also attaching the kernel log from MDT0001 as fir-md1-s2_2.12.5_20200609_kern.log:

Jun 09 09:28:01 fir-md1-s2 kernel: LNet: Service thread pid 23883 was inactive for 200.15s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 09 09:28:01 fir-md1-s2 kernel: Pid: 23883, comm: mdt01_069 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Jun 09 09:28:01 fir-md1-s2 kernel: Call Trace:
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cefc80>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cf1dff>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cf46be>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc16cfec2>] osp_md_object_lock+0x162/0x2d0 [osp]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc15e38c4>] lod_object_lock+0xf4/0x780 [lod]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc1665bbe>] mdd_object_lock+0x3e/0xe0 [mdd]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ce401>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ce99a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14e356e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14e58d5>] mdt_reint_rename+0x2c5/0x2b90 [mdt]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ee963>] mdt_reint_rec+0x83/0x210 [mdt]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14cb273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14d66e7>] mdt_reint+0x67/0x140 [mdt]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d9066a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d3344b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d36db4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffff86ec2e81>] kthread+0xd1/0xe0
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffff87577c24>] ret_from_fork_nospec_begin+0xe/0x21
Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
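
If this happens again, the stack of the blocked service thread can also be captured directly on the MDS while it is still hung; a rough sketch, using the pid reported in the console message above (the pid and output will of course differ):

cat /proc/23883/stack            # kernel stack of the stuck mdt01_069 service thread
lctl dk /tmp/lustre-debug.log    # dump the Lustre kernel debug buffer to a file as well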

Then, finally, the thread completes:

Jun 09 09:28:11 fir-md1-s2 kernel: LNet: Service thread pid 23636 completed after 210.08s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

On the client, operations have resumed and the migration is now in progress:

[root@fir-rbh01 robinhood]# lfs getdirstripe /fir/users/galvisf/
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx		 FID[seq:oid:ver]
     3		 [0x2800401af:0x132aa:0x0]		
     1		 [0x240057d88:0xa6e5:0x0]
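
Once the migration completes, the "migrating" flag should disappear from lmv_hash_type and the directory should report only MDT0003; re-running the same command is enough to verify, for example:

[root@fir-rbh01 robinhood]# lfs getdirstripe /fir/users/galvisf/     # ",migrating" should be gone when finished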

But this had an impact on production during that time: while the lfs migrate was starting, jobs for this specific user were blocked on I/O, which is why we wanted to report this problem.

Thanks!
Stephane



 Comments   
Comment by Peter Jones [ 10/Jun/20 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Stephane Thiell [ 26/Jun/20 ]

Note that I haven't seen this problem again, and we've been using lfs migrate non-stop since then. Perhaps it was just a random glitch after the 2.12.4 -> 2.12.5 upgrade. I'll report back if we see this problem again.

Comment by Andreas Dilger [ 04/Nov/20 ]

Closing this as "Cannot Reproduce" based on Stephane's last comment from 5 months ago.
