Lustre / LU-642

LBUG in client when activating an OST which was registered as initially inactive

Details

    • Bug
    • Resolution: Low Priority
    • Minor
    • Lustre 2.4.0
    • Lustre 2.1.0
    • 3
    • 7751

    Description

      What we're trying to accomplish is to have an OST be inactive when it is first registered, by setting the osc.active parameter with tunefs.lustre on the OST before we first mount it. I'm seeing that when I later activate an OST which was initially inactive, I hit an LBUG on the client when trying to write to it.

      The config is MGS+MDT+OST0+OST1. Tried with all on one host, and with OSTs+client on different hosts, same effect.
      Steps to reproduce (a shell sketch of these commands follows the list):
      1. Format all targets
      2. On one of the OSTs, run tunefs.lustre --param osc.active=0
      3. Start all targets (one of the OSTs is initially activated, the other initially deactivated)
      4. Mount the filesystem
      5. Create some files on the client mount
      6. Run lctl conf_param OSTxxxx.osc.active=1 on the MGS
      7. Create some more files on the client mount (some of them should be written to the newly activated OST)
      -> LBUG on the client
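
      For concreteness, a minimal shell sketch of the steps above. Device paths, the fsname "testfs", the MGS NID, and which OST index is deactivated are placeholders for this particular setup:

      # Step 1: format all targets (combined MGS/MDT plus two OSTs)
      mkfs.lustre --fsname=testfs --mgs --mdt /dev/mdt
      mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=$MGSNID /dev/ost0
      mkfs.lustre --fsname=testfs --ost --index=1 --mgsnode=$MGSNID /dev/ost1
      # Step 2: mark one OST inactive before it is ever mounted
      tunefs.lustre --param osc.active=0 /dev/ost1
      # Step 3: start all targets
      mount -t lustre /dev/mdt /mnt/mdt
      mount -t lustre /dev/ost0 /mnt/ost0
      mount -t lustre /dev/ost1 /mnt/ost1
      # Step 4: mount the filesystem on a client
      mount -t lustre $MGSNID:/testfs /mnt/testfs
      # Step 5: create some files
      dd if=/dev/zero of=/mnt/testfs/before bs=1M count=4
      # Step 6: activate the OST (run on the MGS)
      lctl conf_param testfs-OST0001.osc.active=1
      # Step 7: create more files -> LBUG on the client
      dd if=/dev/zero of=/mnt/testfs/after bs=1M count=4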

      Using lustre-2.1.0-2.6.18_238.19.1.el5_lustre.g65156ed_gf426fb9 + other packages from the same build on CentOS5.

      Logs etc to follow.

      Attachments

        Issue Links

          Activity

            bobijam Zhenyu Xu added a comment -

            b2_1 patch tracking at http://review.whamcloud.com/4463

            Patch description:

            LU-642 lov: make up obd_connect for inactive OSC

            When the OSC is inactive before lov tries to connect to it,
            lov_connect() misses the chance to connect it to the OST device,
            even when it is activated later.

            We need to make up the connection for the initially inactive OSC
            when it is activated.
            
            bobijam Zhenyu Xu added a comment - - edited

            Found the root cause.

            When the OSC is inactive before lov tries to connect to it (as in this scenario), lov_connect() will not connect to the OST device and the import to it is marked invalid. When we activate it later, the following sequence happens:

            ptlrpc_set_import_active() sets the import valid:

            00000100:00080000:0.0:1352185476.068106:0:19884:0:(recover.c:276:ptlrpc_set_import_active()) setting import lustre-OST0000_UUID VALID

            ptlrpc_recover_import()
            --> ptlrpc_set_import_discon() does nothing, since the import is in the NEW state:

            00000100:00080000:0.0:1352185476.068108:0:19884:0:(import.c:195:ptlrpc_set_import_discon()) osc: import ffff88003c258800 already not connected (conn 0, was 0): NEW

            --> lov_notify() --> lov_set_osc_active() cannot set this lov target's active state (returns -EINVAL), since the target has no export yet (the connect RPC was never issued):

            00020000:00000001:0.0:1352185476.068118:0:19884:0:(lov_obd.c:414:lov_set_osc_active()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)

            --> ptlrpc_recover_import_no_retry() fails out with -EALREADY, since the import is in the NEW state rather than the expected DISCON state:

            00000100:00000001:0.0:1352185476.068206:0:19884:0:(recover.c:337:ptlrpc_recover_import_no_retry()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)

            We need to make up the missing obd_connect RPC if the import is still in the NEW state when we activate the OSC later.
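
            On the client this can be confirmed by looking at the OSC import state for the initially inactive target, which never leaves NEW. A quick check (sketch only; the exact lctl parameter path is from memory and may differ slightly between Lustre versions):

            # Run on the client; the target name matches this example setup.
            lctl get_param osc.*OST0000*.state
            # Expected to show something like:
            #   current_state: NEW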


            jfilizetti Jeremy Filizetti added a comment -

            This bug should probably be linked to LU-1459, since it appears to be an issue with use of a disabled OSC, at least when I experienced it.

            pjones Peter Jones added a comment -

            Bobijam

            Could you please look into this one?

            Thanks

            Peter


            john John Spray (Inactive) added a comment -

            New observation: I can get this LBUG without ever trying to activate the OST, i.e. skipping steps 5 and 6 in the original report.

            Here's the output from the client now that I've hooked up a serial console:

            Lustre: 2429:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.0.8@tcp1->MGC192.168.0.8@tcp1_0 netid 20000: select flavor null
            Lustre: MGC192.168.0.8@tcp1: Reactivating import
            Lustre: setting import flintfs-OST0000_UUID INACTIVE by administrator request
            Lustre: 2429:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import flintfs-MDT0000-mdc-ffff81000d807000->192.168.122.105@tcp netid 20000: select flavor null
            Lustre: Client flintfs-client has started
            LustreError: 2571:0:(sec.c:382:import_sec_validate_get()) import ffff810016cc9000 (NEW) with no sec
            LustreError: 2571:0:(osc_lock.c:586:osc_lock_blocking()) ASSERTION(olck->ols_lock == dlmlock) failed
            LustreError: 2571:0:(osc_lock.c:586:osc_lock_blocking()) LBUG
            Pid: 2571, comm: dd

            Call Trace:
            [<ffffffff886795f1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
            [<ffffffff88679b2a>] lbug_with_loc+0x7a/0xd0 [libcfs]
            [<ffffffff88684de0>] cfs_tracefile_init+0x0/0x10a [libcfs]
            [<ffffffff889e3fe1>] osc_ldlm_blocking_ast+0x111/0x2b0 [osc]
            [<ffffffff88810840>] ldlm_handle_bl_callback+0x160/0x260 [ptlrpc]
            [<ffffffff88825759>] __ptlrpc_request_bufs_pack+0x3d9/0x410 [ptlrpc]
            [<ffffffff8880851b>] ldlm_interval_alloc+0x2cb/0x350 [ptlrpc]
            [<ffffffff887f27ca>] ldlm_lock_decref_internal+0x3ea/0x770 [ptlrpc]
            [<ffffffff8880ba60>] failed_lock_cleanup+0xc0/0x100 [ptlrpc]
            [<ffffffff8880d926>] ldlm_cli_enqueue+0x346/0x730 [ptlrpc]
            [<ffffffff889e4180>] osc_ldlm_completion_ast+0x0/0x1d0 [osc]
            [<ffffffff889e3ed0>] osc_ldlm_blocking_ast+0x0/0x2b0 [osc]
            [<ffffffff889e4350>] osc_ldlm_glimpse_ast+0x0/0x160 [osc]
            [<ffffffff889e2fb0>] osc_ldlm_weigh_ast+0x0/0x2b0 [osc]
            [<ffffffff889ca0b1>] osc_enqueue_base+0x451/0x650 [osc]
            [<ffffffff8002e3fe>] __wake_up+0x38/0x4f
            [<ffffffff889e4c27>] osc_lock_enqueue+0x777/0x8a0 [osc]
            [<ffffffff889e39f0>] osc_lock_upcall+0x0/0x4e0 [osc]
            [<ffffffff887774b0>] cl_enqueue_try+0x170/0x370 [obdclass]
            [<ffffffff88a40d70>] lov_sublock_hold+0x30/0x200 [lov]
            [<ffffffff88a42a12>] lov_lock_enqueue+0x402/0x770 [lov]
            [<ffffffff88774d90>] cl_lock_state_signal+0xc0/0x120 [obdclass]
            [<ffffffff88774eeb>] cl_lock_state_set+0xfb/0x170 [obdclass]
            [<ffffffff887774b0>] cl_enqueue_try+0x170/0x370 [obdclass]
            [<ffffffff8877467d>] cl_lock_used_mod+0x2d/0x80 [obdclass]
            [<ffffffff88777ee1>] cl_enqueue_locked+0xc1/0x1c0 [obdclass]
            [<ffffffff88779857>] cl_lock_request+0xb7/0x1d0 [obdclass]
            [<ffffffff88a45fc2>] lov_io_lock+0xe2/0x160 [lov]
            [<ffffffff8877ca0c>] cl_io_lock+0x30c/0x5b0 [obdclass]
            [<ffffffff8877a650>] cl_io_iter_init+0xa0/0x150 [obdclass]
            [<ffffffff8877dc27>] cl_io_loop+0x97/0x1a0 [obdclass]
            [<ffffffff88a8861e>] ll_file_io_generic+0x4de/0x530 [lustre]
            [<ffffffff8876da05>] cl_env_get+0x25/0x300 [obdclass]
            [<ffffffff88a88850>] ll_file_writev+0x1e0/0x280 [lustre]
            [<ffffffff8876db42>] cl_env_get+0x162/0x300 [obdclass]
            [<ffffffff88a94d54>] ll_file_write+0x164/0x210 [lustre]
            [<ffffffff801aa50f>] read_zero+0x1cc/0x225
            [<ffffffff80016b48>] vfs_write+0xce/0x174
            [<ffffffff80017415>] sys_write+0x45/0x6e
            [<ffffffff8005d28d>] tracesys+0xd5/0xe0

            Kernel panic - not syncing: LBUG


            john John Spray (Inactive) added a comment -

            Relevant syslog excerpts:

            On the MGS/MDT when the OST is activated:
            Aug 26 00:56:33 flint01 kernel: Lustre: Permanently reactivating flintfs-OST0000
            Aug 26 00:56:33 flint01 kernel: Lustre: Modifying parameter flintfs-OST0000-osc.osc.active in log flintfs-client
            Aug 26 00:56:33 flint01 kernel: Lustre: Skipped 1 previous similar message
            Aug 26 00:56:37 flint01 kernel: Lustre: 9790:0:(import.c:526:import_select_connection()) flintfs-OST0000-osc-MDT0000: tried all connections, increasing latency to 5s
            Aug 26 00:56:37 flint01 kernel: Lustre: MDS mdd_obd-flintfs-MDT0000: flintfs-OST0000_UUID now active, resetting orphans

            On the initially inactive OST when it's activated:

            Aug 26 00:53:19 flint03 kernel: Lustre: 15810:0:(ldlm_lib.c:877:target_handle_connect()) flintfs-OST0000: connection from flintfs-MDT0000-mdtlov_UUID@192.168.122.105@tcp t0 exp 0000000000000000 cur 1314316399 last 0
            Aug 26 00:53:19 flint03 kernel: Lustre: 15810:0:(filter.c:2697:filter_connect_internal()) flintfs-OST0000: Received MDS connection for group 0
            Aug 26 00:53:19 flint03 kernel: Lustre: 15810:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import flintfs-OST0000->NET_0x20000c0a87a69_UUID netid 20000: select flavor null
            Aug 26 00:56:37 flint03 kernel: Lustre: 15810:0:(ldlm_lib.c:806:target_handle_connect()) flintfs-OST0000: received new MDS connection from NID 192.168.122.105@tcp, removing former export from same NID
            Aug 26 00:56:37 flint03 kernel: Lustre: 15810:0:(ldlm_lib.c:877:target_handle_connect()) flintfs-OST0000: connection from flintfs-MDT0000-mdtlov_UUID@192.168.122.105@tcp t0 exp 0000000000000000 cur 1314316597 last 0
            Aug 26 00:56:37 flint03 kernel: Lustre: 15810:0:(filter.c:2697:filter_connect_internal()) flintfs-OST0000: Received MDS connection for group 0
            Aug 26 00:56:37 flint03 kernel: Lustre: 15810:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import flintfs-OST0000->NET_0x20000c0a87a69_UUID netid 20000: select flavor null
            Aug 26 00:56:37 flint03 kernel: Lustre: flintfs-OST0000: received MDS connection from 192.168.122.105@tcp

            On the client after the OST is activated, there is just this one message before the LBUG:
            Aug 26 00:56:37 flint06 kernel: LustreError: 16075:0:(lov_obd.c:485:lov_notify()) event(11) of flintfs-OST0000_UUID failed: -22

            rread Robert Read added a comment -

            This would be a good test to add to conf-sanity.sh.
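
            For illustration, one possible shape for such a conf-sanity.sh test, written against the usual test-framework.sh helpers. This is a sketch only: the test number is a placeholder, it assumes at least two OSTs, and it is not necessarily the test that was eventually landed:

            test_999() {
                    reformat
                    start_mds || error "MDS start failed"
                    start_ost || error "OST0000 start failed"
                    # Mark the second OST inactive on disk before its first mount (step 2)
                    do_facet ost2 "$TUNEFS --param osc.active=0 $(ostdevname 2)" ||
                            error "tunefs failed"
                    start_ost2 || error "OST0001 start failed"
                    mount_client $MOUNT || error "client mount failed"

                    # Write with the OST still inactive
                    dd if=/dev/zero of=$MOUNT/$tfile-before bs=1M count=1 ||
                            error "write before activation failed"

                    # Reactivate the OST from the MGS, then write again (steps 6-7)
                    do_facet mgs "$LCTL conf_param $FSNAME-OST0001.osc.active=1" ||
                            error "conf_param failed"
                    sleep 10    # crude wait for the client to apply the config change
                    dd if=/dev/zero of=$MOUNT/$tfile-after bs=1M count=1 ||
                            error "write after activation failed"

                    cleanup || error "cleanup failed"
            }
            run_test 999 "activate an OST that was registered as initially inactive"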


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: john John Spray (Inactive)
              Votes: 0
              Watchers: 6
