
Client LBUG in ll_file_write after filesystem expansion

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.3.0
    • Affects Versions: Lustre 2.2.0, Lustre 2.3.0
    • Environment: SL5 on clients and servers; mix of 2.1.0, 2.1.1 and 2.2.0 clients; 2.2.0 on all servers
    • 4
    • 4593

    Description

      After adding 24 OSTs to the file system, we get client LBUGs and crashes on Lustre 2.2.0. The new OSTs were seen by the clients properly after the expansion, but we now get dozens of crashes every day. The trace looks like this:

      May 18 15:18:36 <user.notice> n3-1-13.local Pid: 9127, comm: dtf3d_qdot.out
      May 18 15:18:36 <user.notice> n3-1-13.local Call Trace:
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8870c5f1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8870ca28>] lbug_with_loc+0x48/0x90 [libcfs]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88816a60>] cl_page_assume+0xa0/0x190 [obdclass]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c53198>] ll_prepare_write+0x98/0x150 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c6749b>] ll_write_begin+0xdb/0x150 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8000fe46>] generic_file_buffered_write+0x14b/0x6a9
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80016741>] __generic_file_aio_write_nolock+0x369/0x3b6
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800c9ab4>] __generic_file_write_nolock+0x8f/0xa8
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800a34ad>] autoremove_wake_function+0x0/0x2e
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8881ad4d>] cl_enqueue_try+0x23d/0x2f0 [obdclass]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80063af9>] mutex_lock+0xd/0x1d
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800c9b15>] generic_file_writev+0x48/0xa3
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c7772d>] vvp_io_write_start+0xfd/0x1b0 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8881d810>] cl_io_start+0x90/0xf0 [obdclass]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff888204d8>] cl_io_loop+0x88/0x130 [obdclass]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c3124d>] ll_file_io_generic+0x44d/0x4a0 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c31425>] ll_file_writev+0x185/0x1f0 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c3aa71>] ll_file_write+0x121/0x190 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80016b49>] vfs_write+0xce/0x174
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80017412>] sys_write+0x45/0x6e
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8005d28d>] tracesys+0xd5/0xe0
      May 18 15:18:36 <user.notice> n3-1-13.local Kernel panic - not syncing: LBUG

      The problem is hard to reproduce even though we know which binaries triggered it. So far it looks like the problem disappears after a client reboot, although a subsequent crash might simply not have happened yet. We don't have a crash-kernel dump yet. There is nothing suspicious in the server logs.

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.3


          prakash Prakash Surya (Inactive) added a comment -

          Thanks, Jinshan! I've pulled it into our Orion branch.
          jay Jinshan Xiong (Inactive) added a comment -

          A patch is pushed to: http://review.whamcloud.com/3027

          jay Jinshan Xiong (Inactive) added a comment -

          This issue was introduced by the page writeback support in commit c5361360e51de22a59d4427327bddf9fd398f352. I'll cook a patch soon.
          pjones Peter Jones added a comment -

          Jinshan

          Could you please look into this one?

          Thanks

          Peter


          prakash Prakash Surya (Inactive) added a comment -

          Jinshan, here is a dump of the backtraces for all processes running on the node at the time of the ASSERT.

          Let me know if any other information would be helpful to get out of the crash dump.

          m.magrys Marek Magrys added a comment -

          Yes, as I mentioned, this bug doesn't concern 2.1.1; that's the version we are rolling back to.

          prakash Prakash Surya (Inactive) added a comment -

          We seem to hit this running a specific test on our Grove/Sequoia filesystem. We don't see this running the same test on our 2.1.1-based test system. Perhaps it is a regression in 2.2?

          behlendorf Brian Behlendorf added a comment -

          In our case this was with kernel 2.6.32-220.17.1.2chaos.ch5.x86_64, which is RHEL 6.2 plus a few patches.
          m.magrys Marek Magrys added a comment -

          The kernel version is 2.6.18-308.4.1, the latest from SL5. As soon as we are able to reproduce the problem, I'll post the information. We have dedicated some hardware resources just to reproducing the problem with the codes that possibly cause the LBUG, but we have not been lucky yet.

          People

            Assignee:
            jay Jinshan Xiong (Inactive)
            Reporter:
            m.magrys Marek Magrys
            Votes:
            1
            Watchers:
            7
