[LU-6553] Recurrence of LU-5299: obd_mount_server.c:1690:osd_start()) ASSERTION( obd ) failed

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.7.0, Lustre 2.5.4
    • None
    • Combined MGT/MDT, racing multiple mount commands.
    • 3
    • 9223372036854775807

    Description

      The patch for LU-5573 (http://review.whamcloud.com/#/c/12353/), which closed LU-5299, does not cover some cases.

      Specifically, the code which enables the combined MGT/MDT to start correctly also disables the race protection for a combined MGT/MDT.

      So racing multiple mount commands on a combined MGT/MDT can still cause this problem.

      I've taken a look, and I don't see any easy way to fix this in the current context. I can provide dumps if needed, and I'll attach a log now.

      Note the attempts to start MDT0000. There are five, four of which start after the first one but before it has completed.

          Activity

            paf Patrick Farrell (Inactive) added a comment:

            Thanks, Wally!

            wang Wally Wang (Inactive) added a comment:

            Bruno,
            Here is a simple reproducer:

            1. Create and start a Lustre file system with a combined MGT/MDT.
            2. Unmount the MGT and the MDT.
            3. Run the following 'test_mount' script 5 times in parallel:

            cat test_mount
            #!/bin/bash
            mount -t lustre -o nosvc,abort_recov --verbose /dev/sdd /tmp/lustre/scratch/mgt
            mount -t lustre -o nomgs,abort_recov --verbose /dev/sdd /tmp/lustre/scratch/mdt

            for ((i=0;i<5;i++));do ./test_mount & done;
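
The one-line loop in the reproducer above starts the five scripts but does not wait for them or report which ones failed. A minimal sketch of a wrapper that does both is shown here. The function name `race_cmd` and its output format are hypothetical and not part of Lustre or of the ticket; the sketch only assumes bash.

```shell
#!/bin/bash
# Hypothetical helper (not part of Lustre): run N copies of a command
# concurrently, wait for all of them, and report how many failed.
race_cmd() {
    local n=$1; shift
    local pids=() failed=0 i pid

    for ((i = 0; i < n; i++)); do
        "$@" &                 # start each copy in the background
        pids+=("$!")           # remember its PID so we can wait on it
    done

    for pid in "${pids[@]}"; do
        # wait returns the exit status of that specific background job
        wait "$pid" || failed=$((failed + 1))
    done

    echo "launched=$n failed=$failed"
}
```

With the `test_mount` script from the comment above in the current directory, `race_cmd 5 ./test_mount` would launch the same five concurrent mounts. On an affected build this may trigger the assertion, so it is probably best run only on a disposable test system.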

            paf Patrick Farrell (Inactive) added a comment:

            Bruno -

            We don't have a specific reproducer. It turned out we were doing concurrent mounts because failover was misconfigured on an internal system.
            But I think it would be sufficient to use a bash script to spawn multiple mount commands for a particular target.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Patrick, Wally,
            Sorry I am late on this, but I am back working on a new add-on patch to fix this fully.
            Can you provide your reproducer, or at least give some details on how these concurrent mount commands are generated?

            wang Wally Wang (Inactive) added a comment:

            Hi Bruno, any progress on this one? Thanks.

            paf Patrick Farrell (Inactive) added a comment:

            Thanks, Bruno - Good luck. I couldn't find an easy way to do it, but I expect you know this code much better than I do.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Hello Patrick,
            As I am the unfortunate author of both complementary patches for LU-5299 and LU-5573, I thought I should assign this ticket to myself ...
            Thanks for the report and for the attached Lustre debug trace for the LBUG.
            Looking at the trace, I think you are right that the race is still present, but only in the case of concurrent mount/start commands where either of the nosvc/nomgs flags has been specified for a combined MDT/MGS device.
            I will try to fix this case too, in a new complementary patch ...

            People

              bfaccini Bruno Faccini (Inactive)
              paf Patrick Farrell (Inactive)