[LU-642] LBUG in client when activating an OST which was registered as initially inactive Created: 25/Aug/11  Updated: 28/May/17  Resolved: 28/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: John Spray (Inactive) Assignee: Zhenyu Xu
Resolution: Low Priority Votes: 0
Labels: ptr

Issue Links:
Related
is related to LU-1459 Disabling OSC in file system causes m... Resolved
is related to LU-4302 If a client mounts FS with conf_param... Open
Severity: 3
Rank (Obsolete): 7751

 Description   

What we're trying to accomplish is to have an OST be inactive when it is first registered, by setting osc.active=0 with tunefs.lustre on the OST before we first mount it. I'm seeing that when I activate an OST which was initially inactive, I hit an LBUG on the client when trying to write to it.

The config is MGS+MDT+OST0+OST1. Tried with everything on one host, and with the OSTs and client on different hosts; the effect is the same.
Steps to reproduce:
1. Format all targets
2. On one of the OSTs, run tunefs.lustre --param osc.active=0
3. Start all targets (one of the OSTs is initially activated, the other initially deactivated)
4. Mount the filesystem
5. Create some files on the client mount
6. Run lctl conf_param OSTxxxx.osc.active=1 on the MGS
7. Create some more files on the client mount (some of them should be written to the newly activated OST)
-> LBUG on the client
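
The steps above, consolidated into a rough shell sketch (for illustration only; the device paths, MGS NID and mount points are placeholders I am assuming, while the fsname flintfs, the tunefs parameter and the conf_param command come from this report):

# Step 2: after formatting all targets (step 1), mark one OST inactive before its first mount
tunefs.lustre --param osc.active=0 /dev/sdc          # /dev/sdc is a placeholder for the OST device
# Step 3: start all targets (placeholder devices and mount points)
mount -t lustre /dev/sdb /mnt/mdt
mount -t lustre /dev/sdc /mnt/ost0
mount -t lustre /dev/sdd /mnt/ost1
# Steps 4-5: mount the filesystem on the client and create some files
mount -t lustre ${MGSNID}:/flintfs /mnt/lustre       # MGSNID e.g. 192.168.0.8@tcp1 as in the client console below
for i in $(seq 1 10); do dd if=/dev/zero of=/mnt/lustre/pre.$i bs=1M count=1; done
# Step 6: permanently reactivate the OST, run on the MGS
lctl conf_param flintfs-OST0000.osc.active=1
# Step 7: create more files; some should be written to the newly activated OST -> LBUG on the client
for i in $(seq 1 10); do dd if=/dev/zero of=/mnt/lustre/post.$i bs=1M count=1; done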

Using lustre-2.1.0-2.6.18_238.19.1.el5_lustre.g65156ed_gf426fb9 plus other packages from the same build, on CentOS 5.

Logs etc. to follow.



 Comments   
Comment by Robert Read (Inactive) [ 25/Aug/11 ]

This would be a good test to add to conf-sanity.sh.
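
A rough, untested sketch of what such a test could look like (the test number, the helpers such as ostdevname/createmany, and the overall structure are my assumptions based on the usual test-framework.sh conventions, not an actual patch):

test_99() {
	stopall
	formatall
	# mark ost1 inactive before it is ever mounted/registered
	do_facet ost1 "$TUNEFS --param osc.active=0 $(ostdevname 1)" ||
		error "tunefs ost1 failed"
	setupall
	mkdir -p $DIR/$tdir
	createmany -o $DIR/$tdir/pre 32 || error "create before activation failed"
	# permanently reactivate the OST from the MGS
	do_facet mgs "$LCTL conf_param $FSNAME-OST0000.osc.active=1"
	sleep 10
	createmany -o $DIR/$tdir/post 32 || error "create after activation failed"
	stopall
}
run_test 99 "activate an OST which was registered as initially inactive"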

Comment by John Spray (Inactive) [ 26/Aug/11 ]

Relevant syslog excerpts:

On the MGS/MDT when the OST is activated:
Aug 26 00:56:33 flint01 kernel: Lustre: Permanently reactivating flintfs-OST0000
Aug 26 00:56:33 flint01 kernel: Lustre: Modifying parameter flintfs-OST0000-osc.osc.active in log flintfs-client
Aug 26 00:56:33 flint01 kernel: Lustre: Skipped 1 previous similar message
Aug 26 00:56:37 flint01 kernel: Lustre: 9790:0:(import.c:526:import_select_connection()) flintfs-OST0000-osc-MDT0000: tried all connections, increasing latency to 5s
Aug 26 00:56:37 flint01 kernel: Lustre: MDS mdd_obd-flintfs-MDT0000: flintfs-OST0000_UUID now active, resetting orphans

On the initially inactive OST when it's activated:

Aug 26 00:53:19 flint03 kernel: Lustre: 15810:0:(ldlm_lib.c:877:target_handle_connect()) flintfs-OST0000: connection from flintfs-MDT0000-mdtlov_UUID@192.168.122.105@tcp t0 exp 0000000000000000 cur 1314316399 last 0
Aug 26 00:53:19 flint03 kernel: Lustre: 15810:0:(filter.c:2697:filter_connect_internal()) flintfs-OST0000: Received MDS connection for group 0
Aug 26 00:53:19 flint03 kernel: Lustre: 15810:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import flintfs-OST0000->NET_0x20000c0a87a69_UUID netid 20000: select flavor null
Aug 26 00:56:37 flint03 kernel: Lustre: 15810:0:(ldlm_lib.c:806:target_handle_connect()) flintfs-OST0000: received new MDS connection from NID 192.168.122.105@tcp, removing former export from same NID
Aug 26 00:56:37 flint03 kernel: Lustre: 15810:0:(ldlm_lib.c:877:target_handle_connect()) flintfs-OST0000: connection from flintfs-MDT0000-mdtlov_UUID@192.168.122.105@tcp t0 exp 0000000000000000 cur 1314316597 last 0
Aug 26 00:56:37 flint03 kernel: Lustre: 15810:0:(filter.c:2697:filter_connect_internal()) flintfs-OST0000: Received MDS connection for group 0
Aug 26 00:56:37 flint03 kernel: Lustre: 15810:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import flintfs-OST0000->NET_0x20000c0a87a69_UUID netid 20000: select flavor null
Aug 26 00:56:37 flint03 kernel: Lustre: flintfs-OST0000: received MDS connection from 192.168.122.105@tcp

On the client, after the OST is activated, there is just this one message before the LBUG:
Aug 26 00:56:37 flint06 kernel: LustreError: 16075:0:(lov_obd.c:485:lov_notify()) event(11) of flintfs-OST0000_UUID failed: -22

Comment by John Spray (Inactive) [ 26/Aug/11 ]

New observation: I can get this LBUG without ever trying to activate the OST, i.e. skipping steps 5 and 6 in the original report.

Here's the output from the client now that I've hooked up a serial console:

Lustre: 2429:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.0.8@tcp1->MGC192.168.0.8@tcp1_0 netid 20000: select flavor null
Lustre: MGC192.168.0.8@tcp1: Reactivating import
Lustre: setting import flintfs-OST0000_UUID INACTIVE by administrator request
Lustre: 2429:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import flintfs-MDT0000-mdc-ffff81000d807000->192.168.122.105@tcp netid 20000: select flavor null
Lustre: Client flintfs-client has started
LustreError: 2571:0:(sec.c:382:import_sec_validate_get()) import ffff810016cc9000 (NEW) with no sec
LustreError: 2571:0:(osc_lock.c:586:osc_lock_blocking()) ASSERTION(olck->ols_lock == dlmlock) failed
LustreError: 2571:0:(osc_lock.c:586:osc_lock_blocking()) LBUG
Pid: 2571, comm: dd

Call Trace:
[<ffffffff886795f1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
[<ffffffff88679b2a>] lbug_with_loc+0x7a/0xd0 [libcfs]
[<ffffffff88684de0>] cfs_tracefile_init+0x0/0x10a [libcfs]
[<ffffffff889e3fe1>] osc_ldlm_blocking_ast+0x111/0x2b0 [osc]
[<ffffffff88810840>] ldlm_handle_bl_callback+0x160/0x260 [ptlrpc]
[<ffffffff88825759>] __ptlrpc_request_bufs_pack+0x3d9/0x410 [ptlrpc]
[<ffffffff8880851b>] ldlm_interval_alloc+0x2cb/0x350 [ptlrpc]
[<ffffffff887f27ca>] ldlm_lock_decref_internal+0x3ea/0x770 [ptlrpc]
[<ffffffff8880ba60>] failed_lock_cleanup+0xc0/0x100 [ptlrpc]
[<ffffffff8880d926>] ldlm_cli_enqueue+0x346/0x730 [ptlrpc]
[<ffffffff889e4180>] osc_ldlm_completion_ast+0x0/0x1d0 [osc]
[<ffffffff889e3ed0>] osc_ldlm_blocking_ast+0x0/0x2b0 [osc]
[<ffffffff889e4350>] osc_ldlm_glimpse_ast+0x0/0x160 [osc]
[<ffffffff889e2fb0>] osc_ldlm_weigh_ast+0x0/0x2b0 [osc]
[<ffffffff889ca0b1>] osc_enqueue_base+0x451/0x650 [osc]
[<ffffffff8002e3fe>] __wake_up+0x38/0x4f
[<ffffffff889e4c27>] osc_lock_enqueue+0x777/0x8a0 [osc]
[<ffffffff889e39f0>] osc_lock_upcall+0x0/0x4e0 [osc]
[<ffffffff887774b0>] cl_enqueue_try+0x170/0x370 [obdclass]
[<ffffffff88a40d70>] lov_sublock_hold+0x30/0x200 [lov]
[<ffffffff88a42a12>] lov_lock_enqueue+0x402/0x770 [lov]
[<ffffffff88774d90>] cl_lock_state_signal+0xc0/0x120 [obdclass]
[<ffffffff88774eeb>] cl_lock_state_set+0xfb/0x170 [obdclass]
[<ffffffff887774b0>] cl_enqueue_try+0x170/0x370 [obdclass]
[<ffffffff8877467d>] cl_lock_used_mod+0x2d/0x80 [obdclass]
[<ffffffff88777ee1>] cl_enqueue_locked+0xc1/0x1c0 [obdclass]
[<ffffffff88779857>] cl_lock_request+0xb7/0x1d0 [obdclass]
[<ffffffff88a45fc2>] lov_io_lock+0xe2/0x160 [lov]
[<ffffffff8877ca0c>] cl_io_lock+0x30c/0x5b0 [obdclass]
[<ffffffff8877a650>] cl_io_iter_init+0xa0/0x150 [obdclass]
[<ffffffff8877dc27>] cl_io_loop+0x97/0x1a0 [obdclass]
[<ffffffff88a8861e>] ll_file_io_generic+0x4de/0x530 [lustre]
[<ffffffff8876da05>] cl_env_get+0x25/0x300 [obdclass]
[<ffffffff88a88850>] ll_file_writev+0x1e0/0x280 [lustre]
[<ffffffff8876db42>] cl_env_get+0x162/0x300 [obdclass]
[<ffffffff88a94d54>] ll_file_write+0x164/0x210 [lustre]
[<ffffffff801aa50f>] read_zero+0x1cc/0x225
[<ffffffff80016b48>] vfs_write+0xce/0x174
[<ffffffff80017415>] sys_write+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

Kernel panic - not syncing: LBUG

Comment by Peter Jones [ 30/Dec/11 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Jeremy Filizetti [ 19/Oct/12 ]

This bug should probably be linked to LU-1459, since it appears to be an issue with the use of a disabled OSC, at least when I experienced it.

Comment by Zhenyu Xu [ 04/Nov/12 ]

Found the root cause.

When the OSC is inactive before lov tries to connect it (as in this scenario), lov_connect() does not connect the OST device and its import is set invalid. When we activate it later, the following happens:

ptlrpc_set_import_active() sets the import valid:

00000100:00080000:0.0:1352185476.068106:0:19884:0:(recover.c:276:ptlrpc_set_import_active()) setting import lustre-OST0000_UUID VALID

ptlrpc_recover_import()
--> ptlrpc_set_import_discon() does nothing, since the import is in NEW state:

00000100:00080000:0.0:1352185476.068108:0:19884:0:(import.c:195:ptlrpc_set_import_discon()) osc: import ffff88003c258800 already not connected (conn 0, was 0): NEW

--> lov_notify() --> lov_set_osc_active() cannot set this lov target's active state (returns -EINVAL), since the target has no export yet (the connect RPC was never issued):

00020000:00000001:0.0:1352185476.068118:0:19884:0:(lov_obd.c:414:lov_set_osc_active()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)

--> ptlrpc_recover_import_no_retry() fails out with -EALREADY, since the import is in NEW state, not in the expected DISCON state:

00000100:00000001:0.0:1352185476.068206:0:19884:0:(recover.c:337:ptlrpc_recover_import_no_retry()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)

We need to supplement the obd_connect RPC if the import is still in NEW state when we activate the OSC later.
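
One way to observe this from the client side (my suggestion, assuming the usual osc proc files of a 2.x client) is to dump the OSC import state before and after the conf_param: for a target registered as initially inactive the import stays in NEW rather than moving through DISCON/CONNECTING/FULL, so the recovery path taken on activation has nothing to work with.

# on the client, before and after the conf_param is issued on the MGS
lctl get_param osc.flintfs-OST0000-osc-*.state
lctl get_param osc.flintfs-OST0000-osc-*.import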

Comment by Zhenyu Xu [ 08/Nov/12 ]

b2_1 patch tracking at http://review.whamcloud.com/4463

Patch description:
LU-642 lov: make up obd_connect for inactive OSC

When OSC is inactivated before lov tries to connect it, lov_connect()
misses the chance to connect it to OST devices even when it is
activated later.

We need to make up the connection for the initially inactive OSC when it
is activated.
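
With that change in place, activating an initially inactive OSC should make the client actually connect to the OST. A quick client-side check (standard lctl/lfs commands; the target name is taken from the original report) once the conf_param has propagated:

lfs check osts                                    # the OST should now be reported as active
lctl get_param osc.flintfs-OST0000-osc-*.state    # import should reach FULL instead of staying in NEW
lctl dl                                           # the osc device should show as UP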