[LU-3580] Panic in ptlrpc when rerunning lustre-rsync-test/8 without remount Created: 12/Jul/13 Updated: 09/Jan/20 Resolved: 09/Jan/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Nathaniel Clark | Assignee: | WC Triage |
| Resolution: | Low Priority | Votes: | 0 |
| Labels: | zfs |
| Environment: |
1 OSS (2 OSTs), 1 MDS, 1 Client (all running lustre-master build 1546); MDS and OSS using ZFS |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9068 |
| Description |
|
I was running lustre-rsync-test test_8 repeatedly without umount/remount (trying to reproduce a different bug) when the MDS hit the following panic:

LustreError: 48946:0:(sec_null.c:318:null_alloc_rs()) ASSERTION( rs->rs_size >= rs_size ) failed:
LustreError: 48946:0:(sec_null.c:318:null_alloc_rs()) LBUG
Kernel panic - not syncing: LBUG
Pid: 48946, comm: mdt00_002 Tainted: P --------------- 2.6.32-358.11.1.el6_lustre.g3b657b6.x86_64 #1
Call Trace:
 [<ffffffff8150d8f8>] ? panic+0xa7/0x16f
 [<ffffffffa0629eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0979632>] ? null_alloc_rs+0x272/0x390 [ptlrpc]
 [<ffffffffa0967dd9>] ? sptlrpc_svc_alloc_rs+0x1d9/0x2a0 [ptlrpc]
 [<ffffffffa093d533>] ? lustre_pack_reply_v2+0x93/0x280 [ptlrpc]
 [<ffffffffa093d7ce>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
 [<ffffffffa093d921>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
 [<ffffffffa09654e3>] ? req_capsule_server_pack+0x53/0x100 [ptlrpc]
 [<ffffffffa0d37f1e>] ? mdt_get_info+0xae/0x19b0 [mdt]
 [<ffffffffa0d29fbd>] ? mdt_unpack_req_pack_rep+0x4d/0x4d0 [mdt]
 [<ffffffffa093e52c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
 [<ffffffffa0d33cf7>] ? mdt_handle_common+0x647/0x16d0 [mdt]
 [<ffffffffa0d6d155>] ? mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa094d978>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa062a54e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa063ba9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa0944d99>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa094ecfd>] ? ptlrpc_main+0xabd/0x1700 [ptlrpc]
 [<ffffffffa094e240>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff81096936>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
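For reference on where the assertion sits: the LBUG comes from the reply-state allocator of the "null" sptlrpc policy. Below is a simplified, paraphrased sketch of that path (function and field names follow the Lustre 2.x source, but this is not a verbatim copy of the tree in the trace); it illustrates that the assertion means a reply state pre-allocated for this request is smaller than the reply lustre_pack_reply() is now trying to pack:

```c
/*
 * Simplified sketch of the reply-state allocation in the "null" sptlrpc
 * policy (lustre/ptlrpc/sec_null.c, 2.x era).  Paraphrased for illustration
 * only -- not a verbatim copy of the code that produced this trace.
 */
static int null_alloc_rs(struct ptlrpc_request *req, int msgsize)
{
        struct ptlrpc_reply_state *rs;
        int rs_size = sizeof(*rs) + msgsize;

        rs = req->rq_reply_state;
        if (rs != NULL) {
                /* A reply state was pre-allocated for this request earlier,
                 * sized from an earlier estimate of the reply length.  If
                 * the reply being packed now needs more room than that
                 * estimate allowed, this assertion fires -- the LBUG at
                 * sec_null.c:318 in the stack trace above. */
                LASSERT(rs->rs_size >= rs_size);
        } else {
                /* Normal case: allocate a reply state big enough for the
                 * reply that lustre_pack_reply_v2() asked for. */
                OBD_ALLOC_LARGE(rs, rs_size);
                if (rs == NULL)
                        return -ENOMEM;
                rs->rs_size = rs_size;
        }

        req->rq_reply_state = rs;
        return 0;
}
```

Read this way, the crash looks like a size-estimate mismatch rather than an allocation failure: whatever pre-allocated the reply state assumed a smaller reply than mdt_get_info() ends up needing.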
| Comments |
| Comment by Oleg Drokin [ 15/Jul/13 ] |
|
Shadow, I wonder if you have an opinion on this? There was a class of bugs in the past that you worked on where a missing OST led to some smaller allocations, and then everything came down once we realized we had more OSTs in the system. Nathaniel, why is the OST down? |
| Comment by Nathaniel Clark [ 15/Jul/13 ] |
|
Oleg, the OST wasn't down. lustre-rsync-test/8 builds a directory tree with createmany and some nested for loops for directories, and then does a lustre_rsync to a local directory (on the client). I had been running that in a loop to try to recreate the bug I was looking for when the MDT went down. It's pretty reproducible; you just have to keep the filesystem mounted between runs. I can reproduce it if you want cleaner logs. |
| Comment by Alexey Lyashkov [ 16/Jul/13 ] |
|
Oleg, this looks like a new bug in the sptlrpc code, not related to the MDC<>MDT exchange. The failing call is rc = sptlrpc_svc_alloc_rs(req, msg_len); Do we have a crashdump? |
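To spell out the layering being pointed at here: lustre_pack_reply_v2() asks the generic sptlrpc layer for a reply buffer, and sptlrpc_svc_alloc_rs() simply dispatches to the security policy the request arrived under (the null policy in this case), which is why the failure lands in sec_null.c rather than anything MDC/MDT-specific. A rough, paraphrased sketch of that dispatch (not verbatim source; error handling and the emergency reply-state pool are omitted):

```c
/*
 * Rough, paraphrased sketch of the dispatch in lustre/ptlrpc/sec.c.
 * Not verbatim source; error paths and the emergency reply-state pool
 * are omitted for brevity.
 */
int sptlrpc_svc_alloc_rs(struct ptlrpc_request *req, int msglen)
{
        struct ptlrpc_sec_policy *policy = req->rq_svc_ctx->sc_policy;

        /* Hand the reply-buffer allocation to whichever security policy
         * authenticated this request; with null security this lands in
         * null_alloc_rs(), where the ASSERTION in the trace fired. */
        return policy->sp_sops->alloc_rs(req, msglen);
}
```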
| Comment by Andreas Dilger [ 09/Jan/20 ] |
|
Close old bug |