LU-3580: Panic in ptlrpc when rerunning lustre-rsync-test/8 without remount

Details

    • Type: Bug
    • Resolution: Low Priority
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.0
    • Environment: 1 OSS (2 OSTs), 1 MDS, 1 Client (all running lustre-master build 1546), MDS and OSS using ZFS
    • Severity: 3
    • Rank (Obsolete): 9068

    Description

      I was running lustre-rsync-test test_8 repeatedly without umount/remount to reproduce LU-3573, when my MDS hit an LBUG:

      LustreError: 48946:0:(sec_null.c:318:null_alloc_rs()) ASSERTION( rs->rs_size >= rs_size ) failed: 
      LustreError: 48946:0:(sec_null.c:318:null_alloc_rs()) LBUG
      Kernel panic - not syncing: LBUG
      Pid: 48946, comm: mdt00_002 Tainted: P           ---------------    2.6.32-358.11.1.el6_lustre.g3b657b6.x86_64 #1
      Call Trace:
       [<ffffffff8150d8f8>] ? panic+0xa7/0x16f
       [<ffffffffa0629eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa0979632>] ? null_alloc_rs+0x272/0x390 [ptlrpc]
       [<ffffffffa0967dd9>] ? sptlrpc_svc_alloc_rs+0x1d9/0x2a0 [ptlrpc]
       [<ffffffffa093d533>] ? lustre_pack_reply_v2+0x93/0x280 [ptlrpc]
       [<ffffffffa093d7ce>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
       [<ffffffffa093d921>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
       [<ffffffffa09654e3>] ? req_capsule_server_pack+0x53/0x100 [ptlrpc]
       [<ffffffffa0d37f1e>] ? mdt_get_info+0xae/0x19b0 [mdt]
       [<ffffffffa0d29fbd>] ? mdt_unpack_req_pack_rep+0x4d/0x4d0 [mdt]
       [<ffffffffa093e52c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
       [<ffffffffa0d33cf7>] ? mdt_handle_common+0x647/0x16d0 [mdt]
       [<ffffffffa0d6d155>] ? mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa094d978>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa062a54e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa063ba9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0944d99>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffff81063310>] ? default_wake_function+0x0/0x20
       [<ffffffffa094ecfd>] ? ptlrpc_main+0xabd/0x1700 [ptlrpc]
       [<ffffffffa094e240>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff81096936>] ? kthread+0x96/0xa0
       [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
       [<ffffffff810968a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
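
      The failing check compares the size recorded when the reply state buffer was allocated (rs->rs_size) against the size now required for the reply being packed. Below is a minimal standalone sketch of that bookkeeping; the struct layout, sizes, and helper names are simplified stand-ins for ptlrpc_reply_state and null_alloc_rs(), not the real Lustre code.

      #include <assert.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Simplified stand-in for the preallocated reply state; the real
       * ptlrpc_reply_state carries much more, but the size bookkeeping
       * is what matters for this assertion. */
      struct reply_state {
              size_t rs_size;   /* bytes allocated for this reply state */
              char   msg[];     /* reply message payload follows */
      };

      /* Preallocation path: the buffer is sized for the reply expected
       * at the time it is allocated. */
      static struct reply_state *prealloc_rs(size_t msgsize)
      {
              size_t rs_size = sizeof(struct reply_state) + msgsize;
              struct reply_state *rs = calloc(1, rs_size);

              if (rs != NULL)
                      rs->rs_size = rs_size;
              return rs;
      }

      /* Reuse path: a preallocated reply state is only checked, never
       * grown - this models the condition reported as
       * ASSERTION( rs->rs_size >= rs_size ) in sec_null.c. */
      static void reuse_rs(struct reply_state *rs, size_t msgsize)
      {
              size_t rs_size = sizeof(struct reply_state) + msgsize;

              assert(rs->rs_size >= rs_size);
      }

      int main(void)
      {
              struct reply_state *rs = prealloc_rs(128);

              if (rs == NULL)
                      return 1;
              reuse_rs(rs, 128);   /* fits: same size as preallocated */
              reuse_rs(rs, 512);   /* aborts: the reply grew past rs_size */
              printf("not reached\n");
              free(rs);
              return 0;
      }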
      

          Activity


            adilger Andreas Dilger added a comment -

            Close old bug

            shadow Alexey Lyashkov added a comment -

            Oleg,

            This looks like a new bug in the sptlrpc code, and is not related to the MDC<>MDT exchange.
            The OSC has its own pool for requests, preallocated with messages, but it looks like some replies need more space than was set at preallocation time, or it is related to the early reply.
            As far as I can see, lustre_pack_reply may be called more than once - first for the early reply and then for the real reply - in which case we will have a different request format and size by the time we reach:

            rc = sptlrpc_svc_alloc_rs(req, msg_len);

            Do we have a crash dump?
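
            To illustrate the scenario described above, here is a small standalone model; the msg_len() formula, the 152/1024 buffer sizes, and the helper names are purely illustrative stand-ins (not lustre_msg_size_v2() or the real reply formats). It shows how packing the reply twice with different formats yields two different lengths for the sptlrpc_svc_alloc_rs() step, so a reply state kept from the first, smaller pack would no longer satisfy rs->rs_size >= rs_size on the second pack.

            #include <stdio.h>
            #include <stddef.h>

            /* Lustre message buffers are 8-byte aligned; round up. */
            static size_t round8(size_t len)
            {
                    return (len + 7) & ~(size_t)7;
            }

            /* Illustrative model of computing a packed message length
             * from a list of buffer sizes (a stand-in for what
             * lustre_msg_size_v2() does; the 32-byte header overhead
             * here is made up). */
            static size_t msg_len(const size_t *buflens, int count)
            {
                    size_t len = 32 + 4 * (size_t)count;
                    int i;

                    for (i = 0; i < count; i++)
                            len += round8(buflens[i]);
                    return round8(len);
            }

            int main(void)
            {
                    /* First pack: a minimal early reply (one small buffer). */
                    size_t early[] = { 152 };
                    /* Second pack: the real reply also carries a value
                     * buffer - sizes are illustrative only. */
                    size_t real[] = { 152, 1024 };
                    size_t first = msg_len(early, 1);
                    size_t second = msg_len(real, 2);

                    printf("first pack: %zu bytes, second pack: %zu bytes\n",
                           first, second);
                    if (second > first)
                            printf("a reply state preallocated for the first "
                                   "pack is too small for the second\n");
                    return 0;
            }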

            utopiabound Nathaniel Clark added a comment -

            Oleg,

            The OST wasn't down. lustre-rsync-test/8 builds a directory tree with createmany and some nested for loops for directories, and then does a lustre_rsync to a local directory (on the client). I had been running that in a loop to try to recreate the bug I was looking for when the MDT went down. It's pretty reproducible; you just have to keep the filesystem mounted between runs. I can reproduce it if you want cleaner logs.
            green Oleg Drokin added a comment -

            Shadow, I wonder if you have an opinion on this?

            There was a class of bugs in the past that you worked on where a missing OST led to some smaller allocations, and then everything came down once we realized we had more OSTs in the system.

            Nathaniel, why is the OST down?


            People

              Assignee: wc-triage WC Triage
              Reporter: utopiabound Nathaniel Clark
              Votes: 0
              Watchers: 5
