[LU-4913] mgc import reconnect race Created: 16/Apr/14  Updated: 31/May/16  Resolved: 23/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Critical
Reporter: Andriy Skulysh Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-5420 Failure on test suite sanity test_17m... Resolved
is related to LU-8206 ptlrpc_invalidate_import ( ASSERTION(... Closed
Severity: 3
Rank (Obsolete): 13574

 Description   

<0>[ 231.434653] LustreError: 26880:0:(import.c:286:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed:
<0>[ 231.445129] LustreError: 26880:0:(import.c:286:ptlrpc_invalidate_import()) LBUG
<4>[ 231.452651] Pid: 26880, comm: mount.lustre
<4>[ 231.456892]
<4>[ 231.456892] Call Trace:
<4>[ 231.461143] [<ffffffffa057b825>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[ 231.468255] [<ffffffffa057be27>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[ 231.474624] [<ffffffffa08a3597>] ptlrpc_invalidate_import+0x377/0x8b0 [ptlrpc]
<4>[ 231.482172] [<ffffffffa08a1a16>] ? ptlrpc_set_import_discon+0xf6/0x600 [ptlrpc]
<4>[ 231.489805] [<ffffffffa08a3b03>] ptlrpc_reconnect_import+0x33/0x1a0 [ptlrpc]
<4>[ 231.497100] [<ffffffffa05870d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4>[ 231.503869] [<ffffffffa0380aec>] mgc_set_info_async+0x32c/0xc80 [mgc]
<4>[ 231.510570] [<ffffffffa0722d75>] obd_set_info_async.clone.3+0xf5/0x3a0 [obdclass]
<4>[ 231.518377] [<ffffffffa06fe1d6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4>[ 231.525164] [<ffffffffa072b471>] lustre_start_mgc+0x12a1/0x2020 [obdclass]
<4>[ 231.532310] [<ffffffffa0724927>] ? server_kernel_mount+0x667/0xf30 [obdclass]
<4>[ 231.539761] [<ffffffffa072ece8>] lustre_fill_super+0x2d8/0xfc0 [obdclass]
<4>[ 231.546818] [<ffffffffa072ea10>] ? lustre_fill_super+0x0/0xfc0 [obdclass]
<4>[ 231.553851] [<ffffffff8117997f>] get_sb_nodev+0x5f/0xa0
<4>[ 231.559341] [<ffffffffa071f765>] lustre_get_sb+0x25/0x30 [obdclass]
<4>[ 231.565850] [<ffffffff811795db>] vfs_kern_mount+0x7b/0x1b0
<4>[ 231.571567] [<ffffffff81179782>] do_kern_mount+0x52/0x130
<4>[ 231.577210] [<ffffffff81197da2>] do_mount+0x2d2/0x8d0
<4>[ 231.582493] [<ffffffff81198430>] sys_mount+0x90/0xe0
<4>[ 231.587687] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b



 Comments   
Comment by Andriy Skulysh [ 16/Apr/14 ]

mgc import can be reconnected by pinger or ptlrpc_reconnect_import().
ptlrpc_invalidate_import() isn't protected against alteration of imp_invalid state. Import can be
reconnected by pinger which makes imp_invalid equal to false. Thus LASSERT(imp->imp_invalid) fails
in ptlrpc_invalidate_import().

Comment by Andriy Skulysh [ 16/Apr/14 ]

patch: http://review.whamcloud.com/9967

Comment by Andriy Skulysh [ 16/Apr/14 ]

A test was added to reproduce the issue. Here is dmesg log showing the failure:


<4>Lustre: DEBUG MARKER: == conf-sanity test 80: mgc import reconnect race == 16:25:18 (1397136318)
<6>LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts: 
<6>LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts: 
<4>Lustre: 9028:0:(ldlm_lib.c:1005:target_handle_connect()) MGS: connection from afc19ceb-785b-4037-43da-94710aee9fb7@0@lo t0 exp (null) cur 1397136328 last 0
<4>Lustre: lustre-MDT0000: used disk, loading
<4>Lustre: 9078:0:(debug.c:320:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
<6>LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
<6>LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
<6>Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/loop1 with recovery enabled
<4>Lustre: 9202:0:(genops.c:1498:obd_export_evict_by_uuid()) MGS: evicting afc19ceb-785b-4037-43da-94710aee9fb7 at adminstrative request
<3>LustreError: 11-0: MGC192.168.122.169@tcp: Communicating with 0@lo, operation ldlm_enqueue failed with -107.
<3>LustreError: 166-1: MGC192.168.122.169@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
<3>LustreError: 9027:0:(fail.c:124:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 sleeping for 10000ms
<4>Lustre: 8745:0:(client.c:1960:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1397136328/real 1397136328]  req@ffff88007a9c3bc0 x1465003620171816/t0(0) o8->lustre-OST0000-osc-MDT0000@0@lo:28/4 lens 368/512 e 0 to 1 dl 1397136333 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
<4>Lustre: 9136:0:(ldlm_lib.c:1005:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@0@lo t0 exp (null) cur 1397136338 last 0
<4>Lustre: lustre-OST0000: received MDS connection from 0@lo
<6>Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
<4>Lustre: 8745:0:(client.c:1960:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1397136333/real 1397136333]  req@ffff88007a9c3440 x1465003620171868/t0(0) o250->MGC192.168.122.169@tcp@0@lo:26/25 lens 368/512 e 0 to 1 dl 1397136339 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
<3>LustreError: 9027:0:(fail.c:128:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 awake
<4>Lustre: 9027:0:(ldlm_lib.c:1005:target_handle_connect()) MGS: connection from afc19ceb-785b-4037-43da-94710aee9fb7@0@lo t0 exp (null) cur 1397136343 last 0
<3>LustreError: 9029:0:(fail.c:124:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 sleeping for 10000ms
<3>LustreError: 9029:0:(fail.c:128:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 awake
<4>Lustre: MGS: Client afc19ceb-785b-4037-43da-94710aee9fb7 (at 0@lo) reconnecting
<4>Lustre: 9029:0:(ldlm_lib.c:1005:target_handle_connect()) MGS: connection from afc19ceb-785b-4037-43da-94710aee9fb7@0@lo t0 exp ffff88007a58ec00 cur 1397136353 last 1397136343
<3>LustreError: 9029:0:(obd_class.h:514:obd_set_info_async()) obd_set_info_async: dev 0 no operation
<4>Lustre: Evicted from MGS (at MGC192.168.122.169@tcp_0) after server handle changed from 0x49a8420081f448d2 to 0x49a8420081f44981
<3>LustreError: 9205:0:(fail.c:124:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 sleeping for 15000ms
<6>LDISKFS-fs (loop2): mounted filesystem with ordered data mode. Opts: 
<6>LDISKFS-fs (loop2): mounted filesystem with ordered data mode. Opts: 
<3>LustreError: 9231:0:(fail.c:124:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 sleeping for 15000ms
<3>LustreError: 9205:0:(fail.c:128:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 awake
<6>Lustre: MGC192.168.122.169@tcp: Connection restored to MGS (at 0@lo)
<3>LustreError: 9231:0:(fail.c:128:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 awake
<0>LustreError: 9231:0:(import.c:287:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: 
<0>LustreError: 9231:0:(import.c:287:ptlrpc_invalidate_import()) LBUG
<4>Pid: 9231, comm: mount.lustre
<4>
<4>Call Trace:
<4> [<ffffffffa0d9d825>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0d9de27>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa068df2f>] ptlrpc_invalidate_import+0x84f/0x8e0 [ptlrpc]
<4> [<ffffffffa068bed6>] ? ptlrpc_set_import_discon+0xf6/0x600 [ptlrpc]
<4> [<ffffffffa068dff3>] ptlrpc_reconnect_import+0x33/0x1a0 [ptlrpc]
<4> [<ffffffffa0da90f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffffa092bb3c>] mgc_set_info_async+0x32c/0xc80 [mgc]
<4> [<ffffffffa0da90f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffffa050c115>] obd_set_info_async.clone.3+0xf5/0x3a0 [obdclass]
<4> [<ffffffffa04e71f6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa05148f3>] lustre_start_mgc+0x1383/0x1fd0 [obdclass]
<4> [<ffffffffa050dd10>] ? server_kernel_mount+0x6b0/0xf30 [obdclass]
<4> [<ffffffffa0518038>] lustre_fill_super+0x2d8/0xfd0 [obdclass]
<4> [<ffffffffa0517d60>] ? lustre_fill_super+0x0/0xfd0 [obdclass]
<4> [<ffffffff81179a2f>] get_sb_nodev+0x5f/0xa0
<4> [<ffffffffa0508b05>] lustre_get_sb+0x25/0x30 [obdclass]
<4> [<ffffffff8117968b>] vfs_kern_mount+0x7b/0x1b0
<4> [<ffffffff81179832>] do_kern_mount+0x52/0x130
<4> [<ffffffff81197eb2>] do_mount+0x2d2/0x8d0
<4> [<ffffffff81198540>] sys_mount+0x90/0xe0
<4> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Comment by Peter Jones [ 23/May/14 ]

Landed for 2.6

Comment by Cory Spitz [ 15/Aug/14 ]

FYI, in LU-5420 it was reported that the fix for this bug caused that regression.

Generated at Sat Feb 10 01:46:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.