
LBUG in client when activating an OST which was registered as initially inactive

Details

    • Type: Bug
    • Resolution: Low Priority
    • Priority: Minor
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.1.0
    • 3
    • 7751

    Description

      What we're trying to accomplish is to have an OST be inactive when it's first registered, by setting osc.active with tunefs.lustre on the OST before we first mount it. When I later activate an OST that was initially inactive, I hit an LBUG on the client as soon as it tries to write to that OST.

      The config is MGS+MDT+OST0+OST1. Tried with all on one host, and with OSTs+client on different hosts, same effect.
      Steps to reproduce:
      1. Format all targets
      2. On one of the OSTs, run tunefs.lustre --param osc.active=0
      3. Start all targets (one of the OSTs is initially activated, the other initially deactivated)
      4. Mount the filesystem
      5. Create some files on the client mount
      6. Run lctl conf_param OSTxxxx.osc.active=1 on the MGS
      7. Create some more files on the client mount (some of them should be written to the newly activated OST)
      -> LBUG on the client
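
      Steps 2 and 6 are the only non-default commands in the reproduction. As a minimal sketch, the helper below only builds and prints those two command lines; it does not run them. The function name `repro_commands` and the device path `/dev/sdb` are hypothetical. The `<fsname>-OSTxxxx` prefix with a four-digit hexadecimal index is an assumption based on the usual Lustre target naming (the logs in this ticket show `lustre-OST0000_UUID`).

      ```python
      import shlex


      def repro_commands(ost_device, fsname, ost_index):
          """Return the commands for steps 2 and 6 as argv lists (nothing is executed)."""
          # Lustre target names carry the OST index as four hex digits.
          target = f"{fsname}-OST{ost_index:04x}"
          return [
              # Step 2: on the OSS, before the OST is mounted for the first time.
              ["tunefs.lustre", "--param", "osc.active=0", ost_device],
              # Step 6: on the MGS, once files have been created on the client.
              ["lctl", "conf_param", f"{target}.osc.active=1"],
          ]


      # /dev/sdb is a placeholder device; substitute the real OST block device.
      for argv in repro_commands("/dev/sdb", "lustre", 1):
          print(shlex.join(argv))
      ```

      Printing rather than executing keeps the sketch safe to run on a machine without Lustre installed.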

      Using lustre-2.1.0-2.6.18_238.19.1.el5_lustre.g65156ed_gf426fb9 + other packages from the same build on CentOS5.

      Logs etc to follow.

          Activity

            [LU-642] LBUG in client when activating an OST which was registered as initially inactive
            adilger Andreas Dilger made changes -
            Resolution New: Low Priority [ 10100 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            bobijam Zhenyu Xu made changes -
            Link New: This issue is related to LU-4302 [ LU-4302 ]
            jlevi Jodi Levi (Inactive) made changes -
            Fix Version/s New: Lustre 2.4.0 [ 10154 ]
            pjones Peter Jones made changes -
            Labels New: ptr
            bobijam Zhenyu Xu added a comment -

            b2_1 patch tracking at http://review.whamcloud.com/4463

            patch description
            LU-642 lov: make up obd_connect for inactive OSC
            
            When an OSC is inactivated before lov tries to connect it, lov_connect()
            misses the chance to connect it to the OST device, even when the OSC is
            activated later.
            
            We need to make up the connection for the initially inactive OSC when it
            is activated.
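
            The idea in the patch description can be illustrated with a toy model. This is a sketch only, not the Lustre code or the actual patch: `ToyOsc`, `ImportState`, and the `make_up_connect` flag are hypothetical names, and the state set is reduced to the three states needed here. The -EALREADY result for the unpatched path is taken from the root-cause analysis in this ticket.

            ```python
            import errno
            from enum import Enum


            class ImportState(Enum):
                NEW = "NEW"        # connect RPC never issued
                DISCON = "DISCON"  # was connected, currently disconnected
                FULL = "FULL"      # connected


            class ToyOsc:
                """Toy stand-in for an OSC and its import (illustration only)."""

                def __init__(self, active):
                    self.active = active
                    self.state = ImportState.NEW

                def lov_connect(self):
                    # An OSC that is inactive at setup time is skipped,
                    # so its import is left in the NEW state.
                    if self.active:
                        self.state = ImportState.FULL

                def activate(self, make_up_connect):
                    """Activate the OSC; return 0 on success or a negative errno."""
                    self.active = True
                    if self.state is ImportState.NEW:
                        if not make_up_connect:
                            # Recovery expects DISCON, so a NEW import is rejected.
                            return -errno.EALREADY
                        # Patched idea: issue the connect that was skipped earlier.
                        self.state = ImportState.FULL
                        return 0
                    if self.state is ImportState.DISCON:
                        self.state = ImportState.FULL
                    return 0


            for patched in (False, True):
                osc = ToyOsc(active=False)
                osc.lov_connect()
                rc = osc.activate(make_up_connect=patched)
                print(f"patched={patched} rc={rc} state={osc.state.name}")
            ```

            Without the make-up connect the import never leaves NEW, which is the condition under which the client later fails; with it, activation brings the import to a connected state.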
            
            bobijam Zhenyu Xu added a comment - - edited

            Found the root cause.

            When the OSC is inactivated before lov tries to connect it (as in this scenario), lov_connect() does not connect to the OST device, and the import to it is marked invalid. When we activate the OSC later, the following happens:

            ptlrpc_set_import_active() sets the import valid

            00000100:00080000:0.0:1352185476.068106:0:19884:0:(recover.c:276:ptlrpc_set_import_active()) setting import lustre-OST0000_UUID VALID

            ptlrpc_recover_import()
            --> ptlrpc_set_import_discon() does nothing, since the import is in NEW state

            00000100:00080000:0.0:1352185476.068108:0:19884:0:(import.c:195:ptlrpc_set_import_discon()) osc: import ffff88003c258800 already not connected (conn 0, was 0): NEW

            --> lov_notify() -> lov_set_osc_active() cannot set this lov target's active state (returns -EINVAL), since the target has no export yet (the connect RPC was never issued)

            00020000:00000001:0.0:1352185476.068118:0:19884:0:(lov_obd.c:414:lov_set_osc_active()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)

            --> ptlrpc_recover_import_no_retry() fails with -EALREADY, since the import is in NEW state rather than the expected DISCON state

            00000100:00000001:0.0:1352185476.068206:0:19884:0:(recover.c:337:ptlrpc_recover_import_no_retry()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)

            We need to issue the missing obd_connect RPC if the import is still in NEW state when the OSC is activated later.
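
            The trace lines above print each return code in three forms: unsigned 64-bit decimal, signed decimal, and hexadecimal. The sketch below shows how those forms relate and looks up the symbolic errno name. `decode_rc` is a hypothetical helper, not part of Lustre, and errno numbering is platform-specific, so the name lookup assumes it runs on the same kind of system (Linux) that produced the log.

            ```python
            import errno


            def decode_rc(raw):
                """Reinterpret an unsigned 64-bit rc as a signed value and name its errno."""
                # Two's-complement reinterpretation of a 64-bit unsigned integer.
                signed = raw - (1 << 64) if raw >= (1 << 63) else raw
                if signed >= 0:
                    return signed, ""
                # errno.errorcode maps positive errno numbers to names on this platform.
                return signed, errno.errorcode.get(-signed, "unknown")


            # The two rc values from the trace lines above.
            for raw in (18446744073709551594, 18446744073709551502):
                signed, name = decode_rc(raw)
                print(f"{raw:#x} -> {signed} ({name})")
            ```

            On Linux, errno 22 is EINVAL and errno 114 is EALREADY, which matches the -EINVAL and -EALREADY named in the analysis.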

            bobijam Zhenyu Xu made changes -
            Status Original: Open [ 1 ] New: In Progress [ 3 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LU-1459 [ LU-1459 ]

            jfilizetti Jeremy Filizetti added a comment -

            This bug should probably be linked to LU-1459, since it appears to be an issue with use of a disabled OSC, at least when I experienced it.
            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Zhenyu Xu [ bobijam ]

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: john John Spray (Inactive)
              Votes: 0
              Watchers: 6
