  Lustre / LU-13656

lfs migrate -m hangs for a few minutes at start (sometimes)


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.5
    • Labels: None
    • Environment: Lustre 2.12.5 RC1 - CentOS 7.6 3.10.0-957.27.2.el7_lustre.pl2.x86_64
    • Severity: 3

    Description

      With Lustre 2.12.5 RC1, lfs migrate -m now sometimes hangs at start. The directory being migrated is not accessible from any client for a few minutes. For us, this appears to be a regression in 2.12.5 compared to 2.12.4. Often the hang doesn't generate any trace, but this morning I saw one on the source MDT, MDT0001. The goal here is to migrate a directory from MDT0001 to MDT0003:

      lfs migrate -m 3 -v /fir/users/galvisf
      
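      For reference, this is roughly how the hang shows up; a minimal sketch (the second command is just an example of any client-side access, run from a different client node):

      # On one client: start the directory migration (same command as above)
      time lfs migrate -m 3 -v /fir/users/galvisf

      # Meanwhile, from another client: any access to the directory blocks
      # for a few minutes while the migrate is starting
      time ls -l /fir/users/galvisf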

      The first backtrace to appear is pasted below, after a few minutes of hang. I'm also attaching the kernel logs from MDT1 as fir-md1-s2_2.12.5_20200609_kern.log:

      Jun 09 09:28:01 fir-md1-s2 kernel: LNet: Service thread pid 23883 was inactive for 200.15s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 09 09:28:01 fir-md1-s2 kernel: Pid: 23883, comm: mdt01_069 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      Jun 09 09:28:01 fir-md1-s2 kernel: Call Trace:
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cefc80>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cf1dff>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0cf46be>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc16cfec2>] osp_md_object_lock+0x162/0x2d0 [osp]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc15e38c4>] lod_object_lock+0xf4/0x780 [lod]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc1665bbe>] mdd_object_lock+0x3e/0xe0 [mdd]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ce401>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ce99a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14e356e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14e58d5>] mdt_reint_rename+0x2c5/0x2b90 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14ee963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14cb273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc14d66e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d9066a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d3344b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffc0d36db4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffff86ec2e81>] kthread+0xd1/0xe0
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffff87577c24>] ret_from_fork_nospec_begin+0xe/0x21
      Jun 09 09:28:01 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      
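      If useful, additional state could be captured on the MDS while a hang like this is in progress; a rough sketch (the output path is just an example):

      # Dump the Lustre kernel debug buffer to a file
      lctl dk /tmp/lustre-debug.$(date +%s).log

      # Dump all task stacks into the kernel log (requires sysrq to be enabled)
      echo t > /proc/sysrq-trigger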

      Then, finally, the thread completed:

      Jun 09 09:28:11 fir-md1-s2 kernel: LNet: Service thread pid 23636 completed after 210.08s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      
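      To get an idea of how often these episodes happen and how long they last, the two watchdog messages above can be matched in the MDS kernel log; a quick sketch (the time window is arbitrary):

      # List recent "inactive"/"completed" service thread messages on the MDS
      journalctl -k --since "2020-06-09 09:00" | \
          grep -E 'Service thread pid [0-9]+ (was inactive for|completed after)'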

      On the client, operations have resumed and the migration is now in progress:

      [root@fir-rbh01 robinhood]# lfs getdirstripe /fir/users/galvisf/
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
      mdtidx		 FID[seq:oid:ver]
           3		 [0x2800401af:0x132aa:0x0]		
           1		 [0x240057d88:0xa6e5:0x0]
      
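      Once things resume, the ",migrating" flag in lmv_hash_type eventually clears when the restripe finishes; a small sketch to wait for that (the polling interval is arbitrary):

      # Poll until the directory is no longer flagged as migrating
      while lfs getdirstripe /fir/users/galvisf | grep -q migrating; do
          sleep 30
      done
      echo "directory migration finished"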

      But this had an impact on production during that time: while the lfs migrate was starting, jobs for this specific user were blocked on I/O. This is why we wanted to report this problem.

      Thanks!
      Stephane

      Attachments

        fir-md1-s2_2.12.5_20200609_kern.log

          People

            Assignee: laisiyao Lai Siyao
            Reporter: sthiell Stephane Thiell
