[LU-9699] osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed Created: 21/Jun/17  Updated: 24/Sep/22  Resolved: 22/Sep/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: VIKRAM BABASO JADHAV (Inactive) Assignee: VIKRAM BABASO JADHAV (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
is duplicated by LU-10828 OBD devices and exports not cleaned u... Closed
Related
is related to LU-15000 MDS crashes with (osp_dev.c:1404:osp_... Resolved
is related to LU-15151 conf-sanity test_119: mds1: ssh: Coul... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running writeconf on an MDT with an index greater than 0000 adds an "add mdc" directive to the $FSNAME-client config and an "add osp" directive to the $FSNAME-MDTXXXX configs.

However, the configs may already contain these directives. Duplicating the OSP device triggers the assertion failure shown in the log below, whereas duplicating the MDC just returns -EEXIST.
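To illustrate the difference between the two outcomes, here is a minimal Python sketch of a device table that reports a duplicate as an error code instead of crashing. This is a toy model only, not the Lustre code: the class DeviceTable, its methods, and the choice of -EALREADY for a repeated connect are assumptions made for illustration. The -17 seen in the log below is -EEXIST on Linux.

```python
import errno


class DeviceTable:
    """Toy stand-in for a table of configured devices, keyed by name."""

    def __init__(self):
        # device name -> number of connections made to it
        self.connects = {}

    def attach(self, name):
        # A duplicate attach is reported to the caller as -EEXIST.
        if name in self.connects:
            return -errno.EEXIST
        self.connects[name] = 0
        return 0

    def connect(self, name):
        # Hypothetical tolerant connect: refuse a second connection with
        # an error code instead of asserting that the count equals 1.
        if name not in self.connects:
            return -errno.ENODEV
        if self.connects[name] >= 1:
            return -errno.EALREADY
        self.connects[name] += 1
        return 0
```

With this shape, replaying the same "add osp" directive twice leaves the first device in place and lets the config handler decide what to do with the error.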

One possible solution is to check the configs for duplicates before writing to them. However, we sometimes need to change the NIDs that are part of "add mdc" and "add osp", so the existing entries cannot simply be kept as they are.

Another solution is to mark the previous entries with SKIP flags, and this patch implements that approach. After the config lock is revoked, the clients and the MDTs receive the updated log and apply its newer entries, so OSP duplication still has to be handled, but this is an issue only immediately after writeconf processing.
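The SKIP-marking idea can be sketched in Python as follows. This is a simplified model under stated assumptions, not the patch itself: the Record type, its skip field, and the functions append_directive and replay are hypothetical names, and a real config log is not a Python list.

```python
from dataclasses import dataclass


@dataclass
class Record:
    command: str        # directive, e.g. "add osp" or "add mdc"
    device: str         # device the directive creates
    nid: str            # server NID carried by the directive
    skip: bool = False  # hypothetical stand-in for the SKIP flag


def append_directive(log, command, device, nid):
    """Mark older matching directives as skipped, then append the new one."""
    for rec in log:
        if rec.command == command and rec.device == device:
            rec.skip = True
    log.append(Record(command, device, nid))


def replay(log):
    """Return the directives that a reader starting from scratch would apply."""
    return [(r.command, r.device, r.nid) for r in log if not r.skip]
```

A reader that processes the whole log from the start sees each device once, with the most recent NID. A node that already applied the older entry and then receives only the newer one still encounters the duplicate, which matches the remaining case noted above.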

[1904009.530445] LDISKFS-fs (md0): mounted filesystem with ordered data mode. quota=on. Opts:
[1904010.544738] LustreError: 11-0: snx11117-MDT0000-osp-MDT0001: Communicating with 10.9.100.10@o2ib3, operation mds_connect failed with -114.
[1904010.814980] Lustre: snx11117-MDD0001: changelog on
[1904019.177269] LustreError: 84835:0:(genops.c:345:class_newdev()) Device snx11117-MDT0002-osp-MDT0001 already exists at 7, won't add
[1904019.189880] LustreError: 84835:0:(obd_config.c:368:class_attach()) Cannot create device snx11117-MDT0002-osp-MDT0001 of type osp : -17
[1904019.202925] LustreError: 84835:0:(obd_config.c:1610:class_config_llog_handler()) MGC10.9.100.9@o2ib3: cfg command failed: rc = -17
[1904019.215616] Lustre:    cmd=cf001 0:snx11117-MDT0002-osp-MDT0001  1:osp  2:snx11117-MDT0001-mdtlov_UUID
[1904019.215617]
[1904019.228104] LustreError: 84588:0:(mgc_request.c:517:do_requeue()) failed processing log: -17
[1904036.493105] LustreError: 85373:0:(obd_config.c:464:class_setup()) Device 7 already setup (type osp)
[1904036.503095] LustreError: 85373:0:(obd_config.c:1610:class_config_llog_handler()) MGC10.9.100.9@o2ib3: cfg command failed: rc = -17
[1904036.515779] Lustre:    cmd=cf003 0:snx11117-MDT0002-osp-MDT0001  1:snx11117-MDT0002_UUID  2:10.9.100.16@o2ib3
[1904036.515780]
[1904036.528886] LustreError: 84588:0:(mgc_request.c:517:do_requeue()) failed processing log: -17
[1904044.588103] LustreError: 85579:0:(osp_dev.c:1175:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed:
[1904044.599237] LustreError: 85579:0:(osp_dev.c:1175:osp_obd_connect()) LBUG
[1904044.606507] Pid: 85579, comm: llog_process_th
[1904044.611432]
[1904044.611433] Call Trace:
[1904044.616506]  [<ffffffffa07dd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[1904044.624041]  [<ffffffffa07dde97>] lbug_with_loc+0x47/0xb0 [libcfs]
[1904044.630798]  [<ffffffffa0919f2c>] osp_obd_connect+0x3bc/0x420 [osp]
[1904044.645834]  [<ffffffffa06fb7c1>] lod_add_device+0x8d1/0x1e00 [lod]
[1904044.652674]  [<ffffffffa06f4259>] lod_process_config+0xb89/0x1720 [lod]
[1904044.667268]  [<ffffffffa099a370>] class_process_config+0x1900/0x1ac0 [obdclass]
[1904044.682914]  [<ffffffffa099b664>] class_config_llog_handler+0xa34/0x18b0 [obdclass]
[1904044.697415]  [<ffffffffa095ecf9>] llog_process_thread+0xaa9/0xe80 [obdclass]
[1904044.705421]  [<ffffffffa095f115>] llog_process_thread_daemonize+0x45/0x70 [obdclass]
[1904044.722898]  [<ffffffff8109ac66>] kthread+0x96/0xa0
[1904044.728341]  [<ffffffff8100c20a>] child_rip+0xa/0x20
[1904044.745110]
[1904044.747419] Kernel panic - not syncing: LBUG


 Comments   
Comment by Gerrit Updater [ 21/Jun/17 ]

jadhav.vikram (jadhav.vikram@seagate.com) uploaded a new patch: https://review.whamcloud.com/27753
Subject: LU-9699: ASSERTION(osp->opd_connects == 1) failed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 72a2595023498ec212f80a9aed1c02189d2b900c

Comment by VIKRAM BABASO JADHAV (Inactive) [ 13/Jul/17 ]

Patch https://review.whamcloud.com/27753 has been abandoned, so please close this ticket.

https://review.whamcloud.com/#/c/28026/ has been created under SEA-428.

Comment by Peter Jones [ 09/Mar/18 ]

It seems as if this patch has been resurrected

Comment by John Hammond [ 22/Mar/18 ]

The issue description should be updated to say how to reproduce this.

Comment by Gerrit Updater [ 14/Sep/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44912
Subject: LU-9699 osp: don't assert on OSP duplicating
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: e714e4ef02032114942c399a213d6823c5db5651

Comment by Gerrit Updater [ 22/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/27753/
Subject: LU-9699 osp: don't assert on OSP duplicating
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 98f107b53e4daa3bfaf026c379c0a9c41cb5f161

Comment by Peter Jones [ 22/Sep/21 ]

Landed for 2.15

Generated at Sat Feb 10 02:28:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.