[LU-10395] ASSERTION( osd->od_oi_table != NULL && osd->od_oi_count >= 1 ) failed Created: 15/Dec/17  Updated: 11/Jul/20  Resolved: 16/Apr/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Just had this crash in my master-next testing:

 

 
[271899.484182] Lustre: DEBUG MARKER: == replay-single test 26: |X| open(O_CREAT), unlink two, close one, replay, close one (test mds_cleanup_orphans) ====================================================================================================== 05:31:19 (1513333879)
[271900.114927] Turning device loop0 (0x700000) read-only
[271900.159562] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[271900.197289] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[271900.868045] LustreError: 29112:0:(osd_internal.h:899:osd_fid2oi()) ASSERTION( osd->od_oi_table != NULL && osd->od_oi_count >= 1 ) failed: [0xa:0x8:0x0]
[271900.870230] LustreError: 29112:0:(osd_internal.h:899:osd_fid2oi()) LBUG
[271900.870897] Pid: 29112, comm: ll_mgs_0002
[271900.871499] 
Call Trace:
[271900.874098]  [<ffffffffa02927ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[271900.874904]  [<ffffffffa029285c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[271900.875989]  [<ffffffffa0bd5390>] __osd_oi_lookup+0x2e0/0x390 [osd_ldiskfs]
[271900.876716]  [<ffffffffa0bd715a>] osd_oi_lookup+0xca/0x190 [osd_ldiskfs]
[271900.877452]  [<ffffffffa0bd3112>] osd_fid_lookup+0x4a2/0x1b50 [osd_ldiskfs]
[271900.878132]  [<ffffffff810e3224>] ? lockdep_init_map+0xc4/0x600
[271900.902777]  [<ffffffffa0bd4821>] osd_object_init+0x61/0x180 [osd_ldiskfs]
[271900.903535]  [<ffffffffa03d352f>] lu_object_alloc+0xdf/0x310 [obdclass]
[271900.904230]  [<ffffffffa03d38cc>] lu_object_find_at+0x16c/0x290 [obdclass]
[271900.904930]  [<ffffffffa03d4d88>] dt_locate_at+0x18/0xb0 [obdclass]
[271900.905594]  [<ffffffffa0399140>] llog_osd_open+0x4f0/0xf80 [obdclass]
[271900.906616]  [<ffffffffa038814a>] llog_open+0x13a/0x3b0 [obdclass]
[271900.907360]  [<ffffffffa0647953>] llog_origin_handle_read_header+0x1b3/0x630 [ptlrpc]
[271900.908617]  [<ffffffffa068da13>] tgt_llog_read_header+0x33/0xe0 [ptlrpc]
[271900.909364]  [<ffffffffa069716b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[271900.910077]  [<ffffffffa063c091>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[271900.911298]  [<ffffffffa063fe48>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[271900.911970]  [<ffffffff81706467>] ? _raw_spin_unlock_irq+0x27/0x50
[271900.912646]  [<ffffffffa063f3f0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[271900.914668]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[271900.915359]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[271900.915996]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[271900.916627]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[271900.917273] 
[271900.917861] Kernel panic - not syncing: LBUG

I have a crashdump.
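
For context, the assertion fires in osd_fid2oi() (osd_internal.h), which maps a FID to a slot in the device's OI (Object Index) table. Below is a minimal sketch of the check involved, reconstructed from the assertion text in the log; the index helper and the exact body are assumptions, not a verbatim excerpt:

/* Sketch only: shape of the lookup that trips the assertion. If the OI
 * table has already been freed (od_oi_table == NULL, od_oi_count == 0),
 * the LASSERT fires and the node LBUGs. */
static inline struct osd_oi *osd_fid2oi(struct osd_device *osd,
                                        const struct lu_fid *fid)
{
        LASSERT(osd->od_oi_table != NULL && osd->od_oi_count >= 1);

        /* map the FID onto one of od_oi_count OI files (helper name assumed) */
        return osd->od_oi_table[osd_oi_fid2idx(osd, fid)];
}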



 Comments   
Comment by Oleg Drokin [ 18/Dec/17 ]

Just had this one happen again.

Comment by Oleg Drokin [ 19/Dec/17 ]

And just once more.

Comment by nasf (Inactive) [ 19/Dec/17 ]

On which branch? Any special patch(es)?

Comment by Oleg Drokin [ 04/Jan/18 ]

Hit it again on the current master-next (= current master as of the time of this comment):

[269316.691753] Lustre: DEBUG MARKER: == replay-single test 20c: check that client eviction does not affect file content =================== 23:11:02 (1515039062)
[269316.759984] Lustre: 29811:0:(genops.c:1818:obd_export_evict_by_uuid()) lustre-MDT0000: evicting fbdcce1d-2114-114c-1c77-51be5d58048e at adminstrative request
[269317.806265] LustreError: 11-0: lustre-MDT0000-mdc-ffff8800ac76f800: operation ldlm_enqueue to node 0@lo failed: rc = -107
[269317.809371] LustreError: 29813:0:(file.c:4074:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[269319.367417] Lustre: DEBUG MARKER: == replay-single test 21: |X| open(O_CREAT), unlink touch new, replay, close (test mds_cleanup_orphans) ====================================================================================================== 23:11:04 (1515039064)
[269319.575062] Turning device loop0 (0x700000) read-only
[269319.600543] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[269319.615571] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[269320.012251] LustreError: 29273:0:(osd_internal.h:899:osd_fid2oi()) ASSERTION( osd->od_oi_table != NULL && osd->od_oi_count >= 1 ) failed: [0xa:0x2:0x0]
[269320.013895] LustreError: 29273:0:(osd_internal.h:899:osd_fid2oi()) LBUG
[269320.027745] Pid: 29273, comm: ll_mgs_0002
[269320.028226] 
Call Trace:
[269320.029360]  [<ffffffffa026e7ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[269320.030022]  [<ffffffffa026e85c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[269320.030589]  [<ffffffffa0bac790>] __osd_oi_lookup+0x2e0/0x390 [osd_ldiskfs]
[269320.031065]  [<ffffffffa0bae55a>] osd_oi_lookup+0xca/0x190 [osd_ldiskfs]
[269320.031590]  [<ffffffffa0baa422>] osd_fid_lookup+0x4a2/0x1c40 [osd_ldiskfs]
[269320.032062]  [<ffffffff810e3224>] ? lockdep_init_map+0xc4/0x600
[269320.032692]  [<ffffffffa0babc21>] osd_object_init+0x61/0x180 [osd_ldiskfs]
[269320.033438]  [<ffffffffa03afcff>] lu_object_alloc+0xdf/0x310 [obdclass]
[269320.034136]  [<ffffffffa03b009c>] lu_object_find_at+0x16c/0x290 [obdclass]
[269320.034855]  [<ffffffffa03b1558>] dt_locate_at+0x18/0xb0 [obdclass]
[269320.035565]  [<ffffffffa0377520>] llog_osd_open+0x4f0/0xf80 [obdclass]
[269320.036257]  [<ffffffffa036414a>] llog_open+0x13a/0x3b0 [obdclass]
[269320.037013]  [<ffffffffa05f5a03>] llog_origin_handle_read_header+0x1b3/0x630 [ptlrpc]
[269320.038416]  [<ffffffffa063ba53>] tgt_llog_read_header+0x33/0xe0 [ptlrpc]
[269320.039153]  [<ffffffffa06451ab>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[269320.039887]  [<ffffffffa05ea141>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[269320.059312]  [<ffffffffa05edef8>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[269320.059852]  [<ffffffff81706467>] ? _raw_spin_unlock_irq+0x27/0x50
[269320.060367]  [<ffffffffa05ed4a0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[269320.060842]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[269320.061311]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[269320.062024]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[269320.062780]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[269320.063281] 
[269320.063683] Kernel panic - not syncing: LBUG
[269320.064200] CPU: 0 PID: 29273 Comm: ll_mgs_0002 Tainted: P           OE  ------------   3.10.0-debug #2
[269320.065091] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[269320.065559]  ffffffffa028e212 00000000626be182 ffff8800a1bf78a0 ffffffff816fd3e4
[269320.066423]  ffff8800a1bf7920 ffffffff816f8c34 ffffffff00000008 ffff8800a1bf7930
[269320.067317]  ffff8800a1bf78d0 00000000626be182 00000000626be182 0000000000000001
[269320.068197] Call Trace:
[269320.068621]  [<ffffffff816fd3e4>] dump_stack+0x19/0x1b
[269320.069060]  [<ffffffff816f8c34>] panic+0xd8/0x1e7
[269320.069528]  [<ffffffffa026e874>] lbug_with_loc+0x64/0xb0 [libcfs]
[269320.070015]  [<ffffffffa0bac790>] __osd_oi_lookup+0x2e0/0x390 [osd_ldiskfs]
[269320.070547]  [<ffffffffa0bae55a>] osd_oi_lookup+0xca/0x190 [osd_ldiskfs]
[269320.071017]  [<ffffffffa0baa422>] osd_fid_lookup+0x4a2/0x1c40 [osd_ldiskfs]
[269320.071510]  [<ffffffff810e3224>] ? lockdep_init_map+0xc4/0x600
[269320.071959]  [<ffffffffa0babc21>] osd_object_init+0x61/0x180 [osd_ldiskfs]
[269320.072477]  [<ffffffffa03afcff>] lu_object_alloc+0xdf/0x310 [obdclass]
[269320.072957]  [<ffffffffa03b009c>] lu_object_find_at+0x16c/0x290 [obdclass]
[269320.073602]  [<ffffffffa03b1558>] dt_locate_at+0x18/0xb0 [obdclass]
[269320.074419]  [<ffffffffa0377520>] llog_osd_open+0x4f0/0xf80 [obdclass]
[269320.075331]  [<ffffffffa036414a>] llog_open+0x13a/0x3b0 [obdclass]
[269320.076242]  [<ffffffffa05f5a03>] llog_origin_handle_read_header+0x1b3/0x630 [ptlrpc]
[269320.077666]  [<ffffffffa063ba53>] tgt_llog_read_header+0x33/0xe0 [ptlrpc]
[269320.078474]  [<ffffffffa06451ab>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[269320.079298]  [<ffffffffa05ea141>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[269320.080761]  [<ffffffffa05edef8>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[269320.081560]  [<ffffffff81706467>] ? _raw_spin_unlock_irq+0x27/0x50
[269320.082392]  [<ffffffffa05ed4a0>] ? ptlrpc_register_service+0xeb0/0xeb0 [ptlrpc]
[269320.083844]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[269320.084600]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
[269320.085397]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[269320.087581]  [<ffffffff810a2dd0>] ? kthread_create_on_node+0x140/0x140
Comment by Oleg Drokin [ 04/Jan/18 ]

latest crash's crashdump is on onyx-68 in /exports/crashdumps/192.168.123.129-2018-01-03-23:11:10

Comment by Oleg Drokin [ 21/Jan/18 ]

Just hit this once more on current master-next.

Comment by Oleg Drokin [ 04/Feb/18 ]

Just hit this again on master-next:

[151025.517057] Lustre: DEBUG MARKER: == sanity-pfl test complete, duration 805 sec ======================================================== 18:28:23 (1517700503)
[151025.778703] Lustre: Failing over lustre-MDT0000
[151026.031080] LustreError: 13940:0:(osd_internal.h:911:osd_fid2oi()) ASSERTION( osd->od_oi_table != NULL && osd->od_oi_count >= 1 ) failed: [0xa:0x9:0x0]
[151026.033598] LustreError: 13940:0:(osd_internal.h:911:osd_fid2oi()) LBUG
[151026.034402] Pid: 13940, comm: ll_mgs_0000
[151026.035148] 
Call Trace:
[151026.036641]  [<ffffffffa01d77ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[151026.037462]  [<ffffffffa01d785c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[151026.038313]  [<ffffffffa0b52f80>] __osd_oi_lookup+0x2e0/0x390 [osd_ldiskfs]
[151026.041160]  [<ffffffffa0b54d5a>] osd_oi_lookup+0xca/0x190 [osd_ldiskfs]
[151026.041748]  [<ffffffffa0b50902>] osd_fid_lookup+0x4a2/0x1c60 [osd_ldiskfs]
[151026.042281]  [<ffffffff810e3224>] ? lockdep_init_map+0xc4/0x600
[151026.042780]  [<ffffffffa0b52121>] osd_object_init+0x61/0x110 [osd_ldiskfs]
[151026.043366]  [<ffffffffa039999f>] lu_object_alloc+0xdf/0x310 [obdclass]
[151026.043895]  [<ffffffffa0399d3c>] lu_object_find_at+0x16c/0x290 [obdclass]
[151026.046491]  [<ffffffffa039b308>] dt_locate_at+0x18/0xb0 [obdclass]
[151026.047276]  [<ffffffffa0361800>] llog_osd_open+0x4f0/0xf80 [obdclass]
[151026.047924]  [<ffffffffa034e14a>] llog_open+0x13a/0x3b0 [obdclass]
[151026.048498]  [<ffffffffa061bad3>] llog_origin_handle_read_header+0x1b3/0x630 [ptlrpc]
[151026.049479]  [<ffffffffa0663263>] tgt_llog_read_header+0x33/0xe0 [ptlrpc]
[151026.050493]  [<ffffffffa066c9bb>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[151026.051100]  [<ffffffffa0610211>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[151026.052417]  [<ffffffffa0613fc8>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[151026.053122]  [<ffffffff81706467>] ? _raw_spin_unlock_irq+0x27/0x50
[151026.053837]  [<ffffffffa0613570>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[151026.054527]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[151026.055231]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[151026.055902]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[151026.056556]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[151026.057205] 
[151026.057786] Kernel panic - not syncing: LBUG
Comment by Alex Zhuravlev [ 11/Nov/19 ]

Hit this a few times locally:

LustreError: 7249:0:(osd_internal.h:994:osd_fid2oi()) ASSERTION( osd->od_oi_table != NULL && osd->od_oi_count >= 1 )
 lbug_with_loc+0x79/0x80 [libcfs]
 ? osd_oi_delete+0x32c/0x470 [osd_ldiskfs]
 ? osd_destroy+0x1ad/0x810 [osd_ldiskfs]
 ? osd_ref_del+0x1f8/0x700 [osd_ldiskfs]
 ? local_object_unlink+0x50d/0x1020 [obdclass]
 ? nodemap_cache_find_create+0x139/0x530 [ptlrpc]
 ? nodemap_save_config_cache+0x49/0x4d0 [ptlrpc]
 ? lu_object_put+0x230/0x370 [obdclass]
 ? nodemap_config_set_active_mgc+0xd6/0x1d0 [ptlrpc]
 ? mgc_process_log+0x27a8/0x27e0 [mgc]
 ? mgc_requeue_thread+0x278/0x800 [mgc]
 ? wake_up_q+0x60/0x60
 ? kthread+0x100/0x140
 ? mgc_process_config+0x13c0/0x13c0 [mgc]

According to the dumped log, it was a race between umount and the MGC thread.

Comment by Alex Zhuravlev [ 19/Feb/20 ]

This is a race between MDT umount and normal MGS processing:

ll_mgs_0002     D    0 10966      2 0x80000000
Call Trace:
 ? __schedule+0x2ad/0xb00
 schedule+0x34/0x80
 lbug_with_loc+0x79/0x80 [libcfs]
 ? __osd_oi_lookup.isra.0+0x32f/0x3d0 [osd_ldiskfs]
 ? osd_object_init+0x290/0x2110 [osd_ldiskfs]
 ? __raw_spin_lock_init+0x28/0x50
 ? osd_object_alloc+0x117/0x3d0 [osd_ldiskfs]
 ? lu_object_start.isra.0+0x68/0x100 [obdclass]
 ? lu_object_find_at+0x317/0x8d0 [obdclass]
 ? dt_locate_at+0x13/0xa0 [obdclass]
 ? llog_osd_open+0x2b0/0xf90 [obdclass]
 ? __raw_spin_lock_init+0x28/0x50
 ? llog_open+0x307/0x410 [obdclass]
 ? llog_origin_handle_read_header+0x178/0x500 [ptlrpc]
 ? tgt_llog_read_header+0x26/0xb0 [ptlrpc]
 ? tgt_request_handle+0x3fe/0x1770 [ptlrpc]

umount          I    0 11081  11080 0x00000000
Call Trace:
 ? __schedule+0x2ad/0xb00
 schedule+0x34/0x80
 ptlrpc_stop_all_threads+0x55d/0x590 [ptlrpc]
 ? wait_woken+0x90/0x90
 ptlrpc_unregister_service+0xb6/0x1110 [ptlrpc]
 mgs_device_fini+0xa0/0x5c0 [mgs]
 class_cleanup+0x682/0xb50 [obdclass]
 class_process_config+0x153e/0x30f0 [obdclass]
 ? class_manual_cleanup+0xd1/0x670 [obdclass]
 ? class_manual_cleanup+0xd1/0x670 [obdclass]
 ? cache_alloc_debugcheck_after+0x138/0x150
 ? __kmalloc+0x20c/0x2e0
 class_manual_cleanup+0x197/0x670 [obdclass]
 server_put_super+0x1525/0x1d50 [obdclass]
 ? evict_inodes+0x138/0x180
 generic_shutdown_super+0x5f/0xf0
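
A rough sketch of the interleaving suggested by the two traces above (the timing is reconstructed for illustration, not taken from the dump):

/*
 *   umount (MDT stack cleanup)            ll_mgs_0002 (MGS service thread)
 *   --------------------------            --------------------------------
 *   class_cleanup() for the MDT           tgt_request_handle()
 *     -> OSD shutdown frees                 -> llog_origin_handle_read_header()
 *        od_oi_table, od_oi_count = 0          -> llog_osd_open()
 *                                                 -> osd_fid_lookup()
 *   mgs_device_fini()                                -> osd_fid2oi()
 *     -> ptlrpc_stop_all_threads()                      LASSERT fails -> LBUG
 *        (waits for the thread that is
 *         already stuck in lbug_with_loc)
 *
 * The MGS shares the OSD with the MDT, but its service threads are only
 * stopped later in the umount sequence, so they can still reach the OSD
 * after its OI table is gone.
 */
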
Comment by Alex Zhuravlev [ 19/Feb/20 ]

Patch https://review.whamcloud.com/37615 should help.

LU-10395 osd: stop OI at device shutdown

and not at obd_cleanup(). otherwise a race is possible:

  umount <MDT> stopping OI vs MGS accessing same OSD which

results in the assertion:
 ASSERTION( osd->od_oi_table != NULL && osd->od_oi_count >= 1 )

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: I24fccea718f2e2663166cfb0ff26571039357535
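
Conceptually, the patch defers OI teardown from the obd_cleanup()/process_config path to device shutdown. A simplified sketch of the ordering change follows; the "before" chain matches the shutdown trace posted below, while the "after" stage name is a placeholder, with only the ordering taken from the commit message:

/* Before (sketch): OI freed from the config-cleanup path, while MGS
 * service threads sharing the OSD may still be handling llog requests. */
osd_process_config(LCFG_CLEANUP)
        -> osd_shutdown()
                -> osd_oi_fini()        /* too early: races with ll_mgs_* */

/* After (sketch): OI freed only at device shutdown/fini, i.e. once no
 * request-handling thread can reach the OSD any more. */
osd_device_fini()                       /* placeholder for the later stage */
        -> osd_oi_fini()                /* safe: service threads stopped */
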
Comment by Alexander Boyko [ 19/Feb/20 ]

I have a local reproducer for this race. I've tried patch 37615, and it helps.

Comment by Alexander Boyko [ 20/Feb/20 ]

The original shutdown process frees the od_oi_table in the following trace:

[ 2833.653390] Call Trace:
[ 2833.653395]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[ 2833.653398]  [<ffffffff8108ae58>] __warn+0xd8/0x100
[ 2833.653401]  [<ffffffff8108af9d>] warn_slowpath_null+0x1d/0x20
[ 2833.653409]  [<ffffffffc0e95109>] osd_oi_fini+0x39/0x1f0 [osd_ldiskfs]
[ 2833.653418]  [<ffffffffc0eb48ec>] osd_scrub_cleanup+0x6c/0xb0 [osd_ldiskfs]
[ 2833.653437]  [<ffffffffc0e79738>] osd_shutdown+0x118/0x2e0 [osd_ldiskfs]
[ 2833.653528]  [<ffffffffc0e92a37>] osd_process_config+0x277/0x360 [osd_ldiskfs]
[ 2833.653551]  [<ffffffffc10a17cd>] lod_process_config+0x24d/0x15a0 [lod]
[ 2833.653562]  [<ffffffffc04b5fd7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 2833.653570]  [<ffffffffc0f500e6>] mdd_process_config+0x146/0x5f0 [mdd]
[ 2833.653587]  [<ffffffffc0fbd2e2>] mdt_stack_fini+0x2c2/0xca0 [mdt]
[ 2833.653596]  [<ffffffffc0fbe00b>] mdt_device_fini+0x34b/0x930 [mdt]
[ 2833.653623]  [<ffffffffc06519a8>] class_cleanup+0x9a8/0xc40 [obdclass]
[ 2833.653641]  [<ffffffffc06528cc>] class_process_config+0x65c/0x2830 [obdclass]
[ 2833.653650]  [<ffffffffc04b5fd7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 2833.653667]  [<ffffffffc0654c66>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[ 2833.653687]  [<ffffffffc0687441>] server_put_super+0xa01/0x1120 [obdclass]
[ 2833.653692]  [<ffffffff81223246>] ? evict_inodes+0xd6/0x140
[ 2833.653695]  [<ffffffff81208085>] generic_shutdown_super+0x75/0x100
[ 2833.653697]  [<ffffffff81208462>] kill_anon_super+0x12/0x20
[ 2833.653714]  [<ffffffffc0657872>] lustre_kill_super+0x32/0x50 [obdclass]
[ 2833.653717]  [<ffffffff8120881e>] deactivate_locked_super+0x4e/0x70
[ 2833.653719]  [<ffffffff81208fa6>] deactivate_super+0x46/0x60
[ 2833.653722]  [<ffffffff8122655f>] cleanup_mnt+0x3f/0x80
[ 2833.653725]  [<ffffffff812265f2>] __cleanup_mnt+0x12/0x20
[ 2833.653727]  [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[ 2833.653730]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
[ 2833.653734]  [<ffffffff816c1a5d>] int_signal+0x12/0x17
[ 2833.653735] ---[ end trace bfc327a09eefce17 ]---

Also, osd_shutdown() frees the quota-related structures; what if some quota requests are still being processed during shutdown? @Alex Zhuravlev, what do you think, is this case possible?

Comment by Alex Zhuravlev [ 20/Feb/20 ]

The OSD is able to work without quota; there is internal locking in the quota code to deal with the possible race, see qsd_fini() and qsd_op_begin().

Comment by Gerrit Updater [ 20/Feb/20 ]

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/37635
Subject: LU-10395 tests: add test_280 sanity
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 997da4924fc37125bbf5fbae2c12c186bb05a284

Comment by Alexander Boyko [ 20/Feb/20 ]

I've pushed a regression test for the issue; if it is OK, I'll rebase it onto the fix.

Comment by Alex Zhuravlev [ 20/Feb/20 ]

thanks!

Comment by Alexander Boyko [ 20/Feb/20 ]

Alex, I don't see how qsd_op_begin() pins the qsd in memory while it is in use. There are only simple checks for != NULL and qsd_started; even qsd_stopping is not checked.
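
To illustrate the concern, here is a hypothetical sketch of the check-then-use pattern being questioned; the field names come from the comment above, but the body and the enforce_quota() helper are placeholders, not the actual qsd_op_begin() code:

/* Sketch: the checks see a live qsd, but nothing takes a reference, so a
 * concurrent qsd_fini() during umount could free the structure between
 * the check and the use. */
int qsd_op_begin_sketch(struct qsd_instance *qsd)
{
        if (qsd == NULL || !qsd->qsd_started)
                return 0;                       /* check ... */
        /* <-- window: umount may run qsd_fini() here */
        return enforce_quota(qsd);              /* ... then use */
}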

Comment by Andreas Dilger [ 23/Feb/20 ]

+1 on master sanity test_208:
https://testing.whamcloud.com/test_sets/52f42644-0100-4d2d-bb11-794a4b7a1bf0

[ 8058.688742] LustreError: 12938:0:(osd_internal.h:1010:osd_fid2oi()) ASSERTION( osd->od_oi_table != NULL && osd->od_oi_count >= 1 ) failed: [0xa:0x3:0x0]
[ 8058.691039] LustreError: 12938:0:(osd_internal.h:1010:osd_fid2oi()) LBUG
[ 8058.692144] Pid: 12938, comm: ll_mgs_0002 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Wed Feb 12 09:50:45 UTC 2020
[ 8058.693915] Call Trace:
[ 8058.694388]  [<ffffffffc09b8f7c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 8058.695655]  [<ffffffffc09b902c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 8058.696793]  [<ffffffffc11159d0>] __osd_oi_lookup+0x310/0x3c0 [osd_ldiskfs]
[ 8058.698145]  [<ffffffffc1117925>] osd_oi_lookup+0x95/0x1e0 [osd_ldiskfs]
[ 8058.699407]  [<ffffffffc1112ff5>] osd_fid_lookup+0x455/0x1d60 [osd_ldiskfs]
[ 8058.700594]  [<ffffffffc1114961>] osd_object_init+0x61/0x110 [osd_ldiskfs]
[ 8058.701927]  [<ffffffffc0bdbafb>] lu_object_start.isra.31+0x8b/0x120 [obdclass]
[ 8058.703596]  [<ffffffffc0bdfba2>] lu_object_find_at+0x1b2/0x980 [obdclass]
[ 8058.704808]  [<ffffffffc0be0fcd>] dt_locate_at+0x1d/0xb0 [obdclass]
[ 8058.705983]  [<ffffffffc0ba2c4e>] llog_osd_open+0x50e/0xf30 [obdclass]
[ 8058.707231]  [<ffffffffc0b8f08f>] llog_open+0x25f/0x400 [obdclass]
[ 8058.708380]  [<ffffffffc0edb5b6>] llog_origin_handle_read_header+0x1b6/0x630 [ptlrpc]
[ 8058.710027]  [<ffffffffc0f25ca3>] tgt_llog_read_header+0x33/0xe0 [ptlrpc]
[ 8058.711367]  [<ffffffffc0f2f68a>] tgt_request_handle+0x95a/0x1610 [ptlrpc]
[ 8058.712594]  [<ffffffffc0ed1066>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 8058.714032]  [<ffffffffc0ed5464>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[ 8058.715207]  [<ffffffffa32c61f1>] kthread+0xd1/0xe0
Comment by Gerrit Updater [ 07/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37615/
Subject: LU-10395 osd: stop OI at device shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2789978e1192dbf6d90399c96b5594e0dc049cd9

Comment by Gerrit Updater [ 07/Apr/20 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38153
Subject: LU-10395 osd: stop OI at device shutdown
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a7e345f49985b29d1d6f45a6065af56340102470

Comment by Gerrit Updater [ 07/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37635/
Subject: LU-10395 tests: add test_280 sanity
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f4eeadee5ba5d4ab9d04918d8d81d18907daa831

Comment by Alex Zhuravlev [ 16/Apr/20 ]

Can we close the ticket now?

Comment by Gerrit Updater [ 11/Jul/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38153/
Subject: LU-10395 osd: stop OI at device shutdown
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: b27a323147d992b510fddcfbef8aaef508be7c87
