[LU-7488] __req_capsule_get()) ASSERTION( fmt != ((void *)(long)0x5a5a5a5a5a5a5a5a) ) failed
Created: 27/Nov/15  Updated: 01/Dec/15  Resolved: 01/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

lola
build: 2.7.63-4-gf84e06e, a7eface85ea2d2aa6198681264b082a0244855d4 + patches


Attachments: console-lola-9.log.bz2, lola-9-lbug-client-messages.txt.bz2, lustre-log.1448567094.7588.bz2, messages-lola-9.log.bz2
Issue Links:
Related
is related to LU-7490 out_tx_write_exec()) LBUG Resolved
is related to LU-7455 Tracking tickets to make DNE pass soa... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of master branch build '20151122' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151122). DNE is enabled. The MDSes are configured for active-active failover.

Sequence of events:

  • 2015-11-26 10:32 Failover resources (mdt-0,1) lola-8 --> lola-9 started
  • 2015-11-26 11:40 Failback resources (mdt-0,1) lola-9 --> lola-8 completed successfully
  • 2015-11-26 11:44 LBUG on lola-9. See the following message.
    Nov 26 11:44:53 lola-9 kernel: LustreError: 7588:0:(layout.c:1989:__req_capsule_get()) ASSERTION( fmt != ((void *)(long)0x5a5a5a5a5a5a5a5a) ) failed: 
    Nov 26 11:44:53 lola-9 kernel: LustreError: 7588:0:(layout.c:1989:__req_capsule_get()) LBUG
    Nov 26 11:44:53 lola-9 kernel: Pid: 7588, comm: mdt02_000
    Nov 26 11:44:53 lola-9 kernel: 
    Nov 26 11:44:53 lola-9 kernel: Call Trace:
    Nov 26 11:44:53 lola-9 kernel: [<ffffffffa07c1875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
    Nov 26 11:44:53 lola-9 kernel: [<ffffffffa07c1e77>] lbug_with_loc+0x47/0xb0 [libcfs]
    Nov 26 11:44:53 lola-9 kernel: [<ffffffffa0b2eed7>] __req_capsule_get+0x617/0x6e0 [ptlrpc]
    Nov 26 11:44:53 lola-9 kernel: [<ffffffffa08bb595>] ? class_handle2object+0x95/0x190 [obdclass]
    Nov 26 11:44:53 lola-9 kernel: [<ffffffffa0b2f0a8>] req_capsule_server_get+0x18/0x20 [ptlrpc]
    Nov 26 11:44:53 lola-9 kernel: [<ffffffffa0ad2f52>] ldlm_cli_enqueue_fini+0x1d2/0xe30 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0af55b4>] ? ptlrpc_set_destroy+0x414/0x570 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0ad3f71>] ldlm_cli_enqueue+0x3c1/0x870 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0ad9010>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa122c2f0>] ? mdt_remote_blocking_ast+0x0/0x210 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa14105c5>] osp_md_object_lock+0x185/0x240 [osp]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa131d557>] lod_object_lock+0x147/0x860 [lod]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa08dfa0f>] ? lu_object_find_try+0x9f/0x260 [obdclass]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa139f92b>] mdd_object_lock+0x3b/0xd0 [mdd]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa1239b2a>] mdt_remote_object_lock+0x14a/0x310 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0b05925>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0b2ea22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa1239e19>] mdt_object_lock_internal+0x129/0x2d0 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa123a081>] mdt_object_lock+0x11/0x20 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa124bb2a>] mdt_reint_create+0x6fa/0xcc0 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa08fb870>] ? lu_ucred+0x20/0x30 [obdclass]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa122b675>] ? mdt_ucred+0x15/0x20 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa12448dc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0b2ea22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffff81294a3a>] ? strlcpy+0x4a/0x60
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa1248a1d>] mdt_reint_rec+0x5d/0x200 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa123477b>] mdt_reint_internal+0x62b/0xb80 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa123516b>] mdt_reint+0x6b/0x120 [mdt]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0b70e1c>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0b18711>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
    Nov 26 11:44:54 lola-9 kernel: [<ffffffffa0b178d0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
    Nov 26 11:44:54 lola-9 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
    Nov 26 11:44:54 lola-9 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
    Nov 26 11:44:54 lola-9 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
    Nov 26 11:44:54 lola-9 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
    Nov 26 11:44:54 lola-9 kernel: 
    Nov 26 11:44:54 lola-9 kernel: LustreError: dumping log to /tmp/lustre-log.1448567094.7588
    

    Attached are the messages and console log files of the MDS (lola-9) and the debug log file named in the LBUG error message. Lustre messages from the client nodes were also extracted and attached to the ticket. No errors occurred on the OSS nodes.



 Comments   
Comment by Di Wang [ 01/Dec/15 ]

This bug is actually caused by the patch http://review.whamcloud.com/#/c/17199/. That patch should not free the request in delay_list, since doing so might cause this panic. I will update the patch in LU-7490 and close this one.

Comment by Di Wang [ 01/Dec/15 ]

Duplicate of LU-7490.

Generated at Sat Feb 10 02:09:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.