Lustre / LU-980

incompatible error handling


Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.2.0, Lustre 2.1.2
    • Lustre 2.0.0
    • None
    • 3
    • 4751

    Description

      We hit this general protection fault on a client after some trouble with
      the MDS:

      crash> bt
      PID: 6696 TASK: ffff88184352e700 CPU: 7 COMMAND: "robinhood"
      #0 [ffff8810f44fbc10] machine_kexec at ffffffff8102e77b
      #1 [ffff8810f44fbc70] crash_kexec at ffffffff810a6cd8
      #2 [ffff8810f44fbd40] oops_end at ffffffff8146aad0
      #3 [ffff8810f44fbd70] die at ffffffff8101021b
      #4 [ffff8810f44fbda0] do_general_protection at ffffffff8146a642
      #5 [ffff8810f44fbdd0] general_protection at ffffffff81469e15
      [exception RIP: llog_cat_put+53]
      RIP: ffffffffa09f4f65 RSP: ffff8810f44fbe80 RFLAGS: 00010296
      RAX: 0000000000000000 RBX: ffff8816756d7600 RCX: 000000000000001d
      RDX: 000000000000001d RSI: ffffffff81675340 RDI: 5a5a5a5a5a5a59fa
      RBP: ffff8810f44fbec0 R8: 0000000000000000 R9: 0000000000000001
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8820517c90c0
      R13: 00000000ffffffe2 R14: ffffffffffffffe2 R15: 0000000040086694
      ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
      #6 [ffff8810f44fbec8] mdc_changelog_send_thread at ffffffffa0c8572f
      #7 [ffff8810f44fbf48] kernel_thread at ffffffff8100d1aa
      [exception RIP: child_rip]
      RIP: ffffffff8100d1a0 RSP: ffff8810f44fbf58 RFLAGS: 00000200
      RAX: 0000000000000000 RBX: ffff8818436c80f8 RCX: 0000000000000001
      RDX: 0000000000000500 RSI: ffff8816756d7600 RDI: ffffffffa0c85490
      RBP: ffff8816a7ddfa58 R8: 0000000000000246 R9: 0000000000000000
      R10: ffff8818436a6800 R11: 0000000000000000 R12: ffff880019b772c0
      R13: ffff8816756d7600 R14: 0000000000000000 R15: 0000000040086694
      ORIG_RAX: 0000000000000000 CS: 0010 SS: 0018
      bt: WARNING: possibly bogus exception frame

      with this error message shown just before it in dmesg:
      LustreError: 6696:0:(mdc_request.c:1256:mdc_changelog_send_thread()) llog_create() failed -30
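As a reading aid for the trace: the -30 in the message is a negated errno, and on Linux errno 30 is EROFS ("Read-only file system"). The same value appears in R13 (as a 32-bit quantity) and in R14 (sign-extended to 64 bits). A minimal sketch of that decoding follows; reg_to_errno and check_report_values are hypothetical helper names used only for illustration, not Lustre or crash functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: interpret the low 32 bits of a saved register
 * as a signed kernel return code. */
static int reg_to_errno(uint64_t reg)
{
    return (int)(int32_t)(uint32_t)reg;
}

/* Checks the two register values quoted in the backtrace. */
static void check_report_values(void)
{
    assert(reg_to_errno(0x00000000ffffffe2ULL) == -30); /* R13 */
    assert(reg_to_errno(0xffffffffffffffe2ULL) == -30); /* R14 */
    assert(EROFS == 30);                                /* holds on Linux */
}
```

RDI (5a5a5a5a5a5a59fa) is dominated by the repeating byte 0x5a, which is consistent with a pointer read out of memory that had already been freed and poisoned.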

      After some investigation, I found that the handle used in llog_cat_put was
      created in the following call path:
      mdc_changelog_send_thread
      llog_create
      lop->lop_create = llog_client_create
      llog_client_create

      where the way the llog_handle is returned on error does not look
      compatible with the error handling done in mdc_changelog_send_thread (llog_create
      does not do anything with the llog_handle).

      In llog_client_create, 'handle' is assigned the return value of llog_alloc_handle,
      which (if not NULL) is immediately stored in the return parameter '*res' (which
      points to llh in mdc_changelog_send_thread). Then, if something goes wrong, the code
      jumps to err_free, where 'handle' is freed and poisoned but *res is left untouched.
      llog_client_create therefore returns a non-zero value together with a pointer
      (in *res) to a freed and poisoned area.

      In mdc_changelog_send_thread, the error label (out) relies on
      the assumption that if llh is not NULL, it is valid and should be released with
      llog_cat_put. But it is not valid, since it has already been freed and poisoned
      in llog_client_create during the earlier error handling.

      A simple fix could be to move the *res = handle assignment in llog_client_create
      to just before the EXIT macro, so that *res stays untouched until the function
      completes successfully.
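The pattern described above, and the proposed fix, can be sketched in isolation. This is a simplified stand-alone model, not the actual Lustre code: struct handle, handle_free, create_buggy, create_fixed and send_thread_sketch are hypothetical stand-ins for llog_handle, the free-and-poison step at err_free, llog_client_create before and after the change, and the cleanup at the out label. The 0x5a fill and the -EROFS return are chosen only to mirror the values seen in the trace.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct llog_handle. */
struct handle {
    int refcount;
};

/* Stand-in for the free-and-poison done on the error path. */
static void handle_free(struct handle *h)
{
    memset(h, 0x5a, sizeof(*h));
    free(h);
}

/* Pattern from the report: *res is published before the operation has
 * succeeded, so a failure leaves the caller holding a dangling pointer. */
static int create_buggy(struct handle **res, int fail)
{
    struct handle *h = calloc(1, sizeof(*h));

    if (h == NULL)
        return -ENOMEM;
    *res = h;                /* published too early */
    if (fail) {
        handle_free(h);      /* *res still points at the freed area */
        return -EROFS;
    }
    return 0;
}

/* Proposed fix: publish the handle only once nothing can fail anymore. */
static int create_fixed(struct handle **res, int fail)
{
    struct handle *h = calloc(1, sizeof(*h));

    if (h == NULL)
        return -ENOMEM;
    if (fail) {
        handle_free(h);      /* *res was never written */
        return -EROFS;
    }
    *res = h;
    return 0;
}

/* Caller modelled on the 'out' label: any non-NULL handle is released.
 * This is only safe if a failed create leaves llh NULL. */
static int send_thread_sketch(int fail)
{
    struct handle *llh = NULL;
    int rc = create_fixed(&llh, fail);

    if (llh != NULL) {
        assert(rc == 0);
        handle_free(llh);
    }
    return rc;
}
```

With create_fixed, a failing call leaves llh at NULL and the cleanup is skipped. Swapping in create_buggy would make the same cleanup release an already freed handle, which matches the crash in llog_cat_put.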


            People

              hongchao.zhang Hongchao Zhang
              louveta Alexandre Louvet (Inactive)
