[LU-5331] qsd_handler.c:1139:qsd_op_adjust()) ASSERTION( qqi ) failed Created: 11/Jul/14  Updated: 20/Jan/22  Resolved: 30/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6089 qsd_handler.c:1139:qsd_op_adjust()) A... Resolved
is related to LU-5749 osd-zfs: object creation may serializ... Resolved
Severity: 3
Rank (Obsolete): 14873

 Description   

Had this crash happen on the tip of master as of yesterday running test 900 of sanity.sh:

<3>[109552.952152] LustreError: 12574:0:(client.c:1079:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88001bd78be8 x1473227759636904/t0(0) o13->lustre-OST0000-osc-MDT0000@0@lo:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
<3>[109552.954088] LustreError: 12574:0:(client.c:1079:ptlrpc_import_delay_req()) Skipped 1032 previous similar messages
<4>[109556.157386] Lustre: server umount lustre-MDT0000 complete
<0>[109556.443175] LustreError: 59:0:(qsd_handler.c:1139:qsd_op_adjust()) ASSERTION( qqi ) failed: 
<0>[109556.443758] LustreError: 59:0:(qsd_handler.c:1139:qsd_op_adjust()) LBUG
<4>[109556.444090] Pid: 59, comm: kswapd0
<4>[109556.444347] 
<4>[109556.444348] Call Trace:
<4>[109556.444843]  [<ffffffffa0a9c8a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[109556.445208]  [<ffffffffa0a9cea7>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[109556.445581]  [<ffffffffa08d671c>] qsd_op_adjust+0x4cc/0x5a0 [lquota]
<4>[109556.445972]  [<ffffffff811a6c7d>] ? generic_drop_inode+0x1d/0x80
<4>[109556.446306]  [<ffffffffa09ba8af>] osd_object_delete+0x1ff/0x2d0 [osd_ldiskfs]
<4>[109556.446858]  [<ffffffffa0c507b1>] lu_object_free+0x81/0x1a0 [obdclass]
<4>[109556.447211]  [<ffffffffa0ab28e2>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
<4>[109556.447568]  [<ffffffffa0c51827>] lu_site_purge+0x2c7/0x4c0 [obdclass]
<4>[109556.447915]  [<ffffffffa0c51ba8>] lu_cache_shrink+0x188/0x310 [obdclass]
<4>[109556.448237]  [<ffffffff8113712d>] shrink_slab+0x13d/0x1c0
<4>[109556.448547]  [<ffffffff8113a58a>] balance_pgdat+0x5ba/0x830
<4>[109556.448856]  [<ffffffff81140676>] ? set_pgdat_percpu_threshold+0xa6/0xd0
<4>[109556.449198]  [<ffffffff8113a934>] kswapd+0x134/0x3b0
<4>[109556.449486]  [<ffffffff81098f90>] ? autoremove_wake_function+0x0/0x40
<4>[109556.449810]  [<ffffffff8113a800>] ? kswapd+0x0/0x3b0
<4>[109556.450097]  [<ffffffff81098c06>] kthread+0x96/0xa0
<4>[109556.450423]  [<ffffffff8100c24a>] child_rip+0xa/0x20
<4>[109556.450736]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
<4>[109556.451027]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
<4>[109556.451516] 
<0>[109556.452221] LustreError: 32249:0:(ofd_dev.c:2296:ofd_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 

tag in my tree: master-20140710
crashdump and modules: /exports/crashdumps/192.168.10.219-2014-07-10-04\:14\:17/



 Comments   
Comment by Niu Yawei (Inactive) [ 11/Jul/14 ]

This looks like a general race between the umount thread and a slab-shrinking thread.

1. A memory reclaim thread (kswapd, for example) calls lu_site_purge(nr) to purge some lu_objects;
2. The umount process calls lu_site_purge(-1) to try to purge all lu_objects; since some objects have already been removed from the lu_site by kswapd, this lu_site_purge(-1) won't process those objects;
3. The umount process continues and hits ASSERTION( atomic_read(&d->ld_ref) == 0 ), because some objects are still being processed by kswapd;
4. kswapd continues and hits ASSERTION( qqi ), because the qsd_instance has been freed by the umount process.

It looks to me that ofd_stack_fini() and mdt_stack_fini() need to be improved to address this kind of race. I don't quite understand why we always call lu_site_purge(-1) twice in ofd_stack_fini() & mdt_stack_fini().
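For illustration, here is a minimal user-space model of the serialization idea (a per-site purge mutex, which is what the eventually landed patch's lu_site::ls_purge_mutex provides, per the comment below). It uses pthreads rather than the kernel API, and every name except ls_purge_mutex is an invented stand-in; this is a sketch of the technique, not the code from http://review.whamcloud.com/11099:

/* Minimal user-space model of the umount vs. shrinker race and the
 * mutex fix.  All names are illustrative stand-ins, not Lustre code:
 * a real lu_site holds hashed lu_objects, not a counter. */
#include <pthread.h>
#include <stdio.h>

#define NR_OBJECTS 8

struct site_model {
        pthread_mutex_t ls_purge_mutex; /* serializes all purges */
        int             objects_left;   /* objects still cached */
};

/* Purge up to 'nr' objects; nr < 0 means "purge everything".
 * Holding ls_purge_mutex across removal *and* freeing means a
 * blocked purge-everything caller (umount) only proceeds once any
 * in-flight partial purge has fully freed its objects. */
static int site_purge(struct site_model *s, int nr)
{
        int freed = 0;

        pthread_mutex_lock(&s->ls_purge_mutex);
        while (s->objects_left > 0 && (nr < 0 || freed < nr)) {
                s->objects_left--;      /* "free" one object; in Lustre this
                                         * is where osd_object_delete() calls
                                         * into qsd_op_adjust() */
                freed++;
        }
        pthread_mutex_unlock(&s->ls_purge_mutex);
        return freed;
}

static void *shrinker(void *arg)
{
        site_purge(arg, 2);             /* kswapd-like partial purge */
        return NULL;
}

int main(void)
{
        struct site_model s = {
                .ls_purge_mutex = PTHREAD_MUTEX_INITIALIZER,
                .objects_left   = NR_OBJECTS,
        };
        pthread_t t;

        pthread_create(&t, NULL, shrinker, &s);

        /* umount-like full purge: serialized by ls_purge_mutex, so it
         * runs either entirely before or entirely after the shrinker's
         * batch, never interleaved with it. */
        site_purge(&s, -1);

        pthread_join(t, NULL);
        /* With the mutex, no object can be left half-freed here, which
         * is what lets the umount path assert ld_ref == 0 safely. */
        printf("objects left: %d\n", s.objects_left);
        return 0;
}

Without the mutex, the full purge can skip objects the shrinker has removed from the site but not yet finished freeing, which is exactly steps 2-4 above.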

Comment by Niu Yawei (Inactive) [ 14/Jul/14 ]

Alex/Tappro, why do we have to call lu_site_purge() twice in ofd_stack_fini() & mdt_stack_fini()? Thanks.

Comment by Niu Yawei (Inactive) [ 15/Jul/14 ]

http://review.whamcloud.com/11099

Comment by Niu Yawei (Inactive) [ 30/Jul/14 ]

patch landed for 2.6

Comment by Isaac Huang (Inactive) [ 16/Oct/14 ]

The lu_site::ls_purge_mutex added by the patch may hurt osd-zfs object creation rate, see LU-5749.
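For context on the contention concern: with the landed fix, every caller of lu_site_purge() serializes on the single per-site ls_purge_mutex. One hypothetical mitigation (a sketch only, not necessarily what LU-5749 landed) is to let memory-pressure purges back off with a trylock while the must-complete umount purge still blocks:

/* Hypothetical variant of site_purge() from the model above:
 * shrinkers skip the purge when another one is already running;
 * only the purge-everything caller (umount) waits.  Not code
 * from LU-5749. */
static int site_purge_try(struct site_model *s, int nr)
{
        int freed = 0;

        if (nr < 0)
                pthread_mutex_lock(&s->ls_purge_mutex);   /* umount: must wait */
        else if (pthread_mutex_trylock(&s->ls_purge_mutex) != 0)
                return 0;                                 /* shrinker: back off */

        while (s->objects_left > 0 && (nr < 0 || freed < nr)) {
                s->objects_left--;
                freed++;
        }
        pthread_mutex_unlock(&s->ls_purge_mutex);
        return freed;
}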
