
[LU-6089] qsd_handler.c:1139:qsd_op_adjust()) ASSERTION( qqi ) failed

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • Lustre 2.7.0
    • None
    • 3
    • 16950

    Description

      Had this crash happen on the tip of master as of yesterday while running test 132 of sanity.sh:

      Lustre: 3948:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1420636700/real 1420636700]  req@ffff8800290ba380 x1489642625567436/t0(0) o250->MGC192.168.20.154@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1420636706 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3948:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      LustreError: 39:0:(qsd_handler.c:1139:qsd_op_adjust()) ASSERTION( qqi ) failed: 
      LustreError: 39:0:(qsd_handler.c:1139:qsd_op_adjust()) LBUG
      Pid: 39, comm: kswapd0
      
      Call Trace:
      [<ffffffffa0779895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa0779e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      [<ffffffffa0f25028>] qsd_op_adjust+0x478/0x580 [lquota]
      [<ffffffffa100a597>] osd_object_delete+0x217/0x2f0 [osd_ldiskfs]
      [<ffffffffa091e0c1>] lu_object_free+0x81/0x1a0 [obdclass]
      [<ffffffffa091f167>] lu_site_purge+0x2e7/0x4e0 [obdclass]
      [<ffffffffa091f4e8>] lu_cache_shrink+0x188/0x310 [obdclass]
      [<ffffffff81138dba>] shrink_slab+0x12a/0x1a0
      [<ffffffff8113c0da>] balance_pgdat+0x59a/0x820
      [<ffffffff8113c494>] kswapd+0x134/0x3b0
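
      For illustration only, here is a minimal userspace sketch of the failure pattern the trace shows; it is not the Lustre code and every identifier in it is a made-up stand-in. A shrinker-driven purge frees a cached object whose delete path still dereferences per-type quota state (the "qqi") that teardown has already released, which is where the ASSERTION( qqi ) fires:

      /*
       * Minimal userspace model, not Lustre code: all identifiers below are
       * hypothetical stand-ins for the structures named in the trace above.
       */
      #include <assert.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct qtype_info { int qtype; };            /* stands in for the per-type "qqi" */

      struct quota_slave {
          struct qtype_info *qqi[2];               /* freed and NULLed on unmount */
      };

      struct cached_object {
          struct quota_slave *qsd;                 /* back-pointer used at delete time */
      };

      /* delete-time quota adjustment, as called from the object free path */
      static void op_adjust(struct quota_slave *qsd, int qtype)
      {
          struct qtype_info *qqi = qsd->qqi[qtype];
          assert(qqi);                             /* models the LASSERT(qqi) that fires in the trace */
          printf("adjusting quota for type %d\n", qqi->qtype);
      }

      /* unmount tears down the per-type info while objects may still be cached */
      static void quota_fini(struct quota_slave *qsd)
      {
          for (int i = 0; i < 2; i++) {
              free(qsd->qqi[i]);
              qsd->qqi[i] = NULL;
          }
      }

      /* shrinker-driven purge: frees a cached object, which calls op_adjust() */
      static void cache_shrink(struct cached_object *obj)
      {
          op_adjust(obj->qsd, 0);
      }

      int main(void)
      {
          struct quota_slave qsd = {
              .qqi = { calloc(1, sizeof(struct qtype_info)),
                       calloc(1, sizeof(struct qtype_info)) },
          };
          struct cached_object obj = { .qsd = &qsd };

          quota_fini(&qsd);                        /* unmount runs first ...            */
          cache_shrink(&obj);                      /* ... then reclaim trips the assert */
          return 0;
      }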
      

          Activity


            adilger Andreas Dilger added a comment:

            I've now hit this once while testing http://review.whamcloud.com/11258 on master instead of the 12515 patch, in a thread doing memory reclaim during an unmount operation, though it was not a thread involved in the unmount itself:

            Lustre: Failing over testfs-OST0000
            Lustre: server umount testfs-OST0000 complete
            Lustre: Failing over testfs-OST0001
            general protection fault: 0000 [#1] SMP 
            Pid: 2170, comm: java Tainted: P---------------    2.6.32-431.29.2.el6_lustre.g36cd22b.x86_6
            RIP: 0010:[<ffffffffa07a4cf6>] [<ffffffffa07a4cf6>] qsd_op_adjust+0xb6/0x580 [lquota]
            Process java (pid: 2170, threadinfo ffff8800d00ce000, task ffff880037c61540)
            Call Trace:
            osd_object_delete+0x217/0x2f0 [osd_ldiskfs]
            lu_object_free+0x81/0x1a0 [obdclass]
            lu_site_purge+0x2e7/0x4e0 [obdclass]
            lu_cache_shrink+0x188/0x310 [obdclass]
            shrink_slab+0x12a/0x1a0
            do_try_to_free_pages+0x3f7/0x610
            try_to_free_pages+0x92/0x120
            __alloc_pages_nodemask+0x47e/0x8d0
            kmem_getpages+0x62/0x170
            fallback_alloc+0x1ba/0x270
            ____cache_alloc_node+0x99/0x160
            user_path_parent+0x31/0x80
            sys_renameat+0xb8/0x3a0
            sys_rename+0x1b/0x20
            

            I've also gone back and retested the latest version of 12515 and have not hit this problem in sanity.sh as I had twice before.
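
            For illustration only, a minimal userspace sketch (not Lustre code and not a proposed patch; every name is made up) of the kind of serialization the two paths would need: the delete path re-checks the per-type quota info under a lock and pins it with a reference before use, so a concurrent teardown cannot free it out from under a thread doing reclaim:

            /*
             * Userspace model only; the locking and refcounting here are
             * illustrative, not how Lustre's quota slave actually does it.
             */
            #include <pthread.h>
            #include <stdio.h>
            #include <stdlib.h>

            struct qtype_info {
                int refcount;
            };

            struct quota_slave {
                pthread_mutex_t lock;
                struct qtype_info *qqi;      /* NULL once teardown has run */
            };

            /* delete-time adjustment: bail out quietly if teardown already ran */
            static void op_adjust_guarded(struct quota_slave *qsd)
            {
                struct qtype_info *qqi;

                pthread_mutex_lock(&qsd->lock);
                qqi = qsd->qqi;
                if (qqi == NULL) {
                    pthread_mutex_unlock(&qsd->lock);
                    printf("quota state already torn down, skipping adjustment\n");
                    return;
                }
                qqi->refcount++;             /* pin the info while it is in use */
                pthread_mutex_unlock(&qsd->lock);

                printf("adjusting quota, refcount=%d\n", qqi->refcount);

                pthread_mutex_lock(&qsd->lock);
                qqi->refcount--;
                pthread_mutex_unlock(&qsd->lock);
            }

            /* unmount-time teardown: detach under the same lock before freeing */
            static void quota_fini_guarded(struct quota_slave *qsd)
            {
                struct qtype_info *qqi;

                pthread_mutex_lock(&qsd->lock);
                qqi = qsd->qqi;
                qsd->qqi = NULL;             /* new callers now see NULL and bail */
                pthread_mutex_unlock(&qsd->lock);

                free(qqi);                   /* simplification: assumes no caller still holds a reference */
            }

            int main(void)
            {
                struct quota_slave qsd = { .qqi = calloc(1, sizeof(struct qtype_info)) };

                pthread_mutex_init(&qsd.lock, NULL);
                op_adjust_guarded(&qsd);     /* before teardown: adjustment proceeds */
                quota_fini_guarded(&qsd);    /* unmount detaches and frees the info  */
                op_adjust_guarded(&qsd);     /* after teardown: skipped, no crash    */
                pthread_mutex_destroy(&qsd.lock);
                return 0;
            }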


            adilger Andreas Dilger added a comment:

            I reverted the 12515 patch, and while I've observed the original LU-5242 problem of not being able to unmount/remount the filesystem quickly, I haven't had any crashes, whereas I previously hit this crash 2 out of 2 times in sanity.sh while the patch was applied.


            adilger Andreas Dilger added a comment:

            Hit a very similar crash again in osd_object_delete->qsd_op_adjust() when unmounting in sanity.sh test_65j, and I am again testing http://review.whamcloud.com/12515 from LU-5242. It looks like I restarted another test after my previous run that passed sanity.sh, but I don't recall for sure whether I reverted the patch at that time. I'll need to test without this patch again.


            adilger Andreas Dilger added a comment:

            This has the same symptoms as LU-5331, but I'm running the latest master (v2_6_92_0-14-g37145b3 + http://review.whamcloud.com/12515). I'll try reverting that patch to see if it fixes the problem.


            People

              Assignee: wc-triage WC Triage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: