[LU-5514] After upgrade from 1.8.7 to 2.5.2 stack trace cfs_hash_bd_lookup_intent Created: 20/Aug/14  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Dave Bond (Inactive) Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File lustre-log.1408470549.15310.txt.7z    
Issue Links:
Related
is related to LU-4725 wrong lock ordering in rename leads t... Resolved
Epic/Theme: Lustre-2.5.2
Severity: 3
Rank (Obsolete): 15371

 Description   

We have upgraded from 1.8.7 to 2.5.2

After mounting the FS at approximately 17:29

At 18:49 we see the following

Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 15310 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: Pid: 15310, comm: mdt01_000
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel:
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: Call Trace:
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa057ffd5>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06bdeab>] lu_object_find_at+0xab/0x360 [obdclass]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff81282965>] ? _atomic_dec_and_lock+0x55/0x80
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06be19f>] lu_object_find_slice+0x1f/0x80 [obdclass]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0f99370>] mdd_object_find+0x10/0x70 [mdd]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0fa97ba>] mdd_is_parent+0xaa/0x3a0 [mdd]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0fa9c0c>] mdd_is_subdir+0x15c/0x200 [mdd]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e6d509>] mdt_reint_rename+0x1049/0x1c20 [mdt]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e48940>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0819f60>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0586f76>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa06ddc00>] ? lu_ucred+0x20/0x30 [obdclass]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e684a1>] mdt_reint_rec+0x41/0xe0 [mdt]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e4dcb3>] mdt_reint_internal+0x4c3/0x780 [mdt]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e4dfb4>] mdt_reint+0x44/0xe0 [mdt]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e5158a>] mdt_handle_common+0x52a/0x1470 [mdt]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0e8d755>] mds_regular_handle+0x15/0x20 [mdt]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0853bc5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa056b4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa057c3cf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa084b2a9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0854f2d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffffa0854440>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8109ab56>] kthread+0x96/0xa0
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel:
Aug 19 18:49:09 cs04r-sc-mds03-01 kernel: LustreError: dumping log to /tmp/lustre-log.1408470549.15310
Aug 19 18:51:19 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 15470 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 19 18:51:19 cs04r-sc-mds03-01 kernel: Pid: 15470, comm: mdt03_004



 Comments   
Comment by Dave Bond (Inactive) [ 20/Aug/14 ]

When unmounting on MDS03-01 so I can mount on MDS03-02, I see the following on MDS03-01:

Message from syslogd@cs04r-sc-mds03-01 at Aug 20 09:41:30 ...
kernel:LustreError: 15370:0:(osd_handler.c:2406:osd_object_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed:

Message from syslogd@cs04r-sc-mds03-01 at Aug 20 09:41:30 ...
kernel:LustreError: 15370:0:(osd_handler.c:2406:osd_object_destroy()) LBUG

I won't elaborate too much further at the moment as I do not want to confuse the issue.

Comment by Peter Jones [ 20/Aug/14 ]

Lai is looking into this ticket

Comment by Lai Siyao [ 20/Aug/14 ]

I don't see anything in particular in the log. There is a patch for a similar issue: http://review.whamcloud.com/#/c/10816/; however, the patch does not appear to be included in 2.5.2 — could you verify this?

That said, http://review.whamcloud.com/#/c/10816/ may not fix this issue, because that one is in the getattr context, while this one is in rename. There may be a similar error in the rename path.

Comment by Frederik Ferner (Inactive) [ 20/Aug/14 ]

Lai,

I'm not sure I fully understand your comment: did you want us (Diamond) to verify whether that patch is included in 2.5.2?

Also, after looking at the patch and the issues referenced in it, I'm still not clear on what might have happened. I assume it needs special circumstances during a rename/getattr to trigger this. Are there any more details known? Any hints on where we should start looking for a reproducer?

FYI, the file system is not yet seeing normal use as we're currently running tests before letting the users back on...

Cheers,
Frederik

Comment by Peter Jones [ 20/Aug/14 ]

Frederik

Lai is suggesting that you should apply the patch - which is missing from 2.5.2 but will be in the upcoming 2.5.3 release.

Regards

Peter

Comment by Lai Siyao [ 21/Aug/14 ]

The racer.sh test script is the reproducer for http://review.whamcloud.com/#/c/10816/; you may use it as well.

Comment by Lai Siyao [ 21/Aug/14 ]

This turns out to be a duplicate of LU-4725, and two patches are under review: http://review.whamcloud.com/#/c/10916/ and http://review.whamcloud.com/#/c/10917/.

This is the same issue as http://review.whamcloud.com/#/c/10816/, but it happens in rename, while that one is in getattr. It's a race in which the same object is looked up twice while it is concurrently destroyed. I'd suggest you apply all three patches to fully resolve the issue.
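To make the race described above concrete, here is a minimal, single-threaded sketch of the pattern. This is an illustrative model only: the names `Obj`, `lookup`, and `destroy_begin` are hypothetical stand-ins, not the real Lustre `lu_object`/`cfs_hash` code, and real Lustre serializes these paths with locks that the sketch deliberately omits.

```python
class Obj:
    """Hypothetical cached object, keyed by FID."""
    def __init__(self, fid):
        self.fid = fid
        self.dying = False   # set by destroy before the object is unhashed
        self.in_hash = True

# A toy FID-indexed object cache (stand-in for the cfs_hash table).
CACHE = {42: Obj(42)}

def lookup(fid):
    """Return the cached object if still hashed -- even if already dying."""
    o = CACHE.get(fid)
    return o if o is not None and o.in_hash else None

def destroy_begin(o):
    """First half of destroy: mark the object dying; unhashing happens later."""
    o.dying = True

# First user (e.g. the rename path's mdd_object_find) finds the object:
a = lookup(42)
# A concurrent destroy starts, marking the object dying:
destroy_begin(a)
# A second lookup still finds the same hashed object, now dying.
# This is the window in which the MDT thread can end up holding a dying
# object, the scenario behind the osd_object_destroy() assertion above.
b = lookup(42)
assert b is a and b.dying
```

The point of the sketch is that the lookup path and the destroy path are not atomic with respect to each other: between "mark dying" and "remove from hash", a second lookup can still return the object, which is what the two LU-4725 patches close off.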

Comment by Frederik Ferner (Inactive) [ 21/Aug/14 ]

Peter, Lai,

thanks for clarifying this.

We'll look into compiling Lustre 2.5.2 plus those 3 patches, but if there is an easier way for us to get the correct compiled versions, let us know.

I assume that, initially, it should be sufficient to run this patched version on the MDS, correct?

Kind regards,
Frederik

Comment by Lai Siyao [ 21/Aug/14 ]

Yes, it's sufficient with these 3 patches. Once those two patches for LU-4725 have landed in 2.5.3, you can use 2.5.3 directly, but for now you have to apply the patches yourself.

Comment by Peter Jones [ 21/Aug/14 ]

Frederik

Is rename commonly used during regular production usage or is this an issue that is more likely to hit in a stressing test run?

Peter

Comment by Frederik Ferner (Inactive) [ 21/Aug/14 ]

There are a number of processes/applications that tend to create files with a temporary extension and then rename them to the proper name, to avoid race conditions where processing starts before the data is there, so renames are not that uncommon. And considering that we first saw this with basically only ior test runs, I'd say something is doing it quite frequently...

Cheers,
Frederik

Comment by Peter Jones [ 21/Aug/14 ]

OK, fair enough then. I just did not want you to move forward with patches for something that was a rare edge case.

Generated at Sat Feb 10 01:52:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.