[LU-3580] Panic in ptlrpc when rerunning lustre-rsync-test/8 without remount Created: 12/Jul/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Nathaniel Clark Assignee: WC Triage
Resolution: Low Priority Votes: 0
Labels: zfs
Environment:

1 OSS (2 OSTs), 1 MDS, 1 Client (all running lustre-master build 1546), MDS and OSS using ZFS


Attachments: Text File serial-manager.txt    
Issue Links:
Related
is related to LU-3573 lustre-rsync-test test_8: @@@@@@ FAIL... Resolved
Severity: 3
Rank (Obsolete): 9068

 Description   

I was running lustre-rsync-test test_8 repeatedly without umount/remount to reproduce LU-3573, when my MDS hit an LBUG:

LustreError: 48946:0:(sec_null.c:318:null_alloc_rs()) ASSERTION( rs->rs_size >= rs_size ) failed: 
LustreError: 48946:0:(sec_null.c:318:null_alloc_rs()) LBUG
Kernel panic - not syncing: LBUG
Pid: 48946, comm: mdt00_002 Tainted: P           ---------------    2.6.32-358.11.1.el6_lustre.g3b657b6.x86_64 #1
Call Trace:
 [<ffffffff8150d8f8>] ? panic+0xa7/0x16f
 [<ffffffffa0629eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0979632>] ? null_alloc_rs+0x272/0x390 [ptlrpc]
 [<ffffffffa0967dd9>] ? sptlrpc_svc_alloc_rs+0x1d9/0x2a0 [ptlrpc]
 [<ffffffffa093d533>] ? lustre_pack_reply_v2+0x93/0x280 [ptlrpc]
 [<ffffffffa093d7ce>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
 [<ffffffffa093d921>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
 [<ffffffffa09654e3>] ? req_capsule_server_pack+0x53/0x100 [ptlrpc]
 [<ffffffffa0d37f1e>] ? mdt_get_info+0xae/0x19b0 [mdt]
 [<ffffffffa0d29fbd>] ? mdt_unpack_req_pack_rep+0x4d/0x4d0 [mdt]
 [<ffffffffa093e52c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
 [<ffffffffa0d33cf7>] ? mdt_handle_common+0x647/0x16d0 [mdt]
 [<ffffffffa0d6d155>] ? mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa094d978>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa062a54e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa063ba9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa0944d99>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa094ecfd>] ? ptlrpc_main+0xabd/0x1700 [ptlrpc]
 [<ffffffffa094e240>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff81096936>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
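
For context, the failed assertion guards a reply buffer that was preallocated and attached to the request: a reused reply state must be at least as large as the reply now being packed. Below is a minimal standalone sketch of that invariant, using simplified, hypothetical types (reply_state, prealloc_rs, and alloc_rs are illustrative names, not the real ptlrpc structures); it aborts on the same condition that fired here as ASSERTION( rs->rs_size >= rs_size ).

#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for ptlrpc's reply state: a size recorded at
 * preallocation time, followed by the reply message itself. */
struct reply_state {
        size_t rs_size;         /* total bytes preallocated */
        char   rs_msg[];        /* reply message follows */
};

/* Preallocate a reply state able to hold msgsize bytes of message. */
static struct reply_state *prealloc_rs(size_t msgsize)
{
        size_t rs_size = sizeof(struct reply_state) + msgsize;
        struct reply_state *rs = malloc(rs_size);

        assert(rs != NULL);
        rs->rs_size = rs_size;
        return rs;
}

/* Hand out a reply state for a reply of msgsize bytes.  Mirrors the
 * check in null_alloc_rs(): a preallocated buffer must cover the size
 * the caller asks for now, otherwise the invariant is violated. */
static struct reply_state *alloc_rs(struct reply_state *prealloc, size_t msgsize)
{
        size_t rs_size = sizeof(struct reply_state) + msgsize;

        if (prealloc != NULL) {
                /* The equivalent of ASSERTION(rs->rs_size >= rs_size),
                 * which fired as the LBUG above. */
                assert(prealloc->rs_size >= rs_size);
                return prealloc;
        }
        return prealloc_rs(msgsize);
}

int main(void)
{
        /* Preallocated for a small reply ... */
        struct reply_state *rs = prealloc_rs(128);

        /* ... but reused for a larger one: the assert trips, just as
         * the LBUG does when the reply being packed outgrows the
         * buffer that was sized earlier. */
        alloc_rs(rs, 4096);
        free(rs);
        return 0;
}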


 Comments   
Comment by Oleg Drokin [ 15/Jul/13 ]

Shadow, I wonder if you have an opinion on this?

There was a class of bugs in the past that you worked on, where a missing OST led to some smaller allocations and then everything came down once we realized we had more OSTs in the system.

Nathaniel, why is the OST down?

Comment by Nathaniel Clark [ 15/Jul/13 ]

Oleg, the OST wasn't down. lustre-rsync-test/8 builds a directory tree with createmany and some nested for loops for directories, and then does a lustre_rsync to a local directory (on the client). I had been running that in a loop to try to recreate the bug I was looking for when the MDT went down. It's pretty reproducible; you just have to keep the filesystem mounted between runs. I can reproduce it if you want cleaner logs.

Comment by Alexey Lyashkov [ 16/Jul/13 ]

Oleg,

This looks like a new bug in the sptlrpc code, and it is not related to the MDC<>MDT exchange.
The OSC has its own pool for requests, preallocated with messages, but it looks like some replies need more space than was set at preallocation time, or it is related to the early reply.
As I see it, lustre_pack_reply may be called more than once, first for the early reply and second for the real reply; in that case we will have a different request format and size for

rc = sptlrpc_svc_alloc_rs(req, msg_len);

Do we have a crashdump?
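
If that reading is right, the failing sequence would be two packing passes against one request, with the second pass needing a larger reply than the first pass sized. A hypothetical sketch of that sequence, again with simplified stand-ins (request, svc_alloc_rs, and pack_reply are illustrative names, and the 64/1024 sizes are arbitrary, not what mdt_get_info actually asks for):

#include <assert.h>
#include <stdlib.h>

/* Simplified request carrying one attached reply state, echoing
 * rq_reply_state in ptlrpc. */
struct reply_state {
        size_t rs_size;
};

struct request {
        struct reply_state *rq_reply_state;   /* set by the first pack */
};

/* Stand-in for sptlrpc_svc_alloc_rs(req, msg_len): reuse the attached
 * reply state if one exists, asserting it is still big enough. */
static void svc_alloc_rs(struct request *req, size_t msg_len)
{
        size_t rs_size = sizeof(struct reply_state) + msg_len;

        if (req->rq_reply_state != NULL) {
                /* Second pack reuses the buffer sized by the first;
                 * this is where the LBUG fires if msg_len grew. */
                assert(req->rq_reply_state->rs_size >= rs_size);
                return;
        }
        req->rq_reply_state = malloc(rs_size);
        assert(req->rq_reply_state != NULL);
        req->rq_reply_state->rs_size = rs_size;
}

/* Stand-in for lustre_pack_reply(): msg_len would normally come from
 * the reply format set in the request capsule. */
static void pack_reply(struct request *req, size_t msg_len)
{
        svc_alloc_rs(req, msg_len);
}

int main(void)
{
        struct request req = { NULL };

        /* First pass: a small (early) reply sizes rq_reply_state. */
        pack_reply(&req, 64);

        /* Second pass: the real reply uses a larger format, so the
         * reused reply state is too small and the assert fires. */
        pack_reply(&req, 1024);

        free(req.rq_reply_state);
        return 0;
}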

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old bug
