Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.1.3
    • Severity: 3
    • 6118

    Description

      We have 4 OSSes that crash at the same time, at umount, with the following backtrace:

      PID: 18173 TASK: ffff8803376dc040 CPU: 4 COMMAND: "umount"
      #0 [ffff8802b115f8d0] machine_kexec at ffffffff8102895b
      #1 [ffff8802b115f930] crash_kexec at ffffffff810a4622
      #2 [ffff8802b115fa00] panic at ffffffff81484657
      #3 [ffff8802b115fa80] lbug_with_loc at ffffffffa04ade5b [libcfs]
      #4 [ffff8802b115faa0] llog_recov_thread_stop at ffffffffa072e55b [ptlrpc]
      #5 [ffff8802b115fad0] llog_recov_thread_fini at ffffffffa072e593 [ptlrpc]
      #6 [ffff8802b115faf0] filter_llog_finish at ffffffffa0c7d3dd [obdfilter]
      #7 [ffff8802b115fb20] obd_llog_finish at ffffffffa057c2f8 [obdclass]
      #8 [ffff8802b115fb40] filter_precleanup at ffffffffa0c7cdaf [obdfilter]
      #9 [ffff8802b115fba0] class_cleanup at ffffffffa05a3ca7 [obdclass]
      #10 [ffff8802b115fc20] class_process_config at ffffffffa05a5feb [obdclass]
      #11 [ffff8802b115fcb0] class_manual_cleanup at ffffffffa05a6d29 [obdclass]
      #12 [ffff8802b115fd70] server_put_super at ffffffffa05b2c0c [obdclass]
      #13 [ffff8802b115fe40] generic_shutdown_super at ffffffff8116542b
      #14 [ffff8802b115fe60] kill_anon_super at ffffffff81165546
      #15 [ffff8802b115fe80] lustre_kill_super at ffffffffa05a8966 [obdclass]
      #16 [ffff8802b115fea0] deactivate_super at ffffffff811664e0
      #17 [ffff8802b115fec0] mntput_no_expire at ffffffff811826bf
      #18 [ffff8802b115fef0] sys_umount at ffffffff81183188
      #19 [ffff8802b115ff80] system_call_fastpath at ffffffff810030f2
      RIP: 00007f62ddfbdd67 RSP: 00007fffab738308 RFLAGS: 00010202
      RAX: 00000000000000a6 RBX: ffffffff810030f2 RCX: 0000000000000010
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f62deeb3bb0
      RBP: 00007f62deeb3b80 R8: 00007f62deeb3bd0 R9: 0000000000000000
      R10: 00007fffab738130 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 00007f62deeb3c10
      ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b

      This backtrace is identical to the one shown in LU-1194, which is supposed to be fixed in 2.1.3.

      The site is classified so I can't upload the crash dump, but I can export the content of some structures upon request.

      Attachments

        1. ptlrpcd.c
          32 kB
        2. recov_thread.c
          24 kB

        Activity

          [LU-2615] group of OSS crashed at umount

          Hi Hongchao,

           Sorry, maybe 'race' is not the right word to express my thought.

           At the time llcd_send() returns, the completion handler llcd_interpret() might not have been called yet, right? While the llcd is still in use by an RPC in flight, llog_recov_thread_stop() will hit an LBUG. I can't find any code in filter_llog_finish() that waits for the RPC to finish, so I guess it is possible that when llog_recov_thread_stop() is called, the RPC is still in flight. Am I right?

          Thanks
          Li Xi

           lixi Li Xi (Inactive) added a comment
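           A wait of the kind Li Xi describes as missing could look like the following userspace sketch. This is only an analogue of the pattern, not Lustre code: the lcm_sketch structure, the inflight counter, and both function names are invented for illustration.

             #include <pthread.h>

             /* Hypothetical analogue of a commit master with an in-flight count. */
             struct lcm_sketch {
                     pthread_mutex_t lock;
                     pthread_cond_t  drained;
                     int             inflight;   /* llcds handed off for async send */
             };

             /* Completion path (analogue of llcd_interpret()): one reply arrived. */
             static void complete_one(struct lcm_sketch *lcm)
             {
                     pthread_mutex_lock(&lcm->lock);
                     if (--lcm->inflight == 0)
                             pthread_cond_signal(&lcm->drained);
                     pthread_mutex_unlock(&lcm->lock);
             }

             /* Analogue of llog_recov_thread_stop(): wait for in-flight llcds
              * to drain instead of asserting that none exist. */
             static void stop_safely(struct lcm_sketch *lcm)
             {
                     pthread_mutex_lock(&lcm->lock);
                     while (lcm->inflight > 0)
                             pthread_cond_wait(&lcm->drained, &lcm->lock);
                     pthread_mutex_unlock(&lcm->lock);
                     /* only now is it safe to tear down the llcd list */
             }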

           What are the two threads involved in the race?
           Normally, llog_recov_thread_stop is only called by llog_recov_thread_fini, and llog_recov_thread_fini is called in two places:
           one is the cleanup for a failed llog_recov_thread_init call, the other is the normal cleanup phase during device cleanup
           (called in filter_llog_finish). They can't be called simultaneously.

           Could you please attach some more info about this issue? And can it be reproduced on your site?

           hongchao.zhang Hongchao Zhang added a comment

          We hit the same problem on lustre-2.1.6 too.

           After reading the code, I am wondering whether the following race could happen. Please correct me if I am wrong.

          filter_llog_finish
          --llog_recov_thread_fini
          ----llog_sync
          ------llog_obd_repl_sync
          --------llog_cancel
          ----------llog_obd_repl_cancel
          ------------llcd_push
          --------------llcd_send
          ----------------Sending async
          ----llog_recov_thread_stop
           ------LBUG, because llcd_send() is sending an llcd and llcd_interpret() has not been called, since no reply has been received yet.

          Thanks!

           lixi Li Xi (Inactive) added a comment
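           The window in the tree above can be reproduced with a small userspace analogue: the "reply" runs on another thread, nothing waits for it, and the emptiness assertion fires just as the LBUG does. All names here are invented for the demonstration; this is not the Lustre code path itself.

             #include <assert.h>
             #include <pthread.h>
             #include <unistd.h>

             static int llcd_busy;   /* analogue of an llcd still on lcm_llcds */

             /* Analogue of llcd_interpret(): runs only once the "reply" arrives. */
             static void *interpret(void *arg)
             {
                     sleep(1);                 /* reply is still in flight */
                     __atomic_store_n(&llcd_busy, 0, __ATOMIC_SEQ_CST);
                     return NULL;
             }

             int main(void)
             {
                     pthread_t reply;

                     /* llcd_send(): hand the llcd to an async "RPC" and return. */
                     llcd_busy = 1;
                     pthread_create(&reply, NULL, interpret, NULL);

                     /* llog_recov_thread_stop(): nothing waited for the reply,
                      * so the busy llcd is still visible and the assertion
                      * fires, the analogue of the LBUG. */
                     assert(__atomic_load_n(&llcd_busy, __ATOMIC_SEQ_CST) == 0);

                     pthread_join(reply, NULL);
                     return 0;
             }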

          Hi,

          I have asked people on site for the results of the tests.

          Cheers,
          Sebastien.

           sebastien.buisson Sebastien Buisson (Inactive) added a comment

          Hi, what is the output of the test? Thanks

           hongchao.zhang Hongchao Zhang added a comment

          Hi,

           Yes, it will disable the ptlrpcd thread pools (although it does not remove the patch completely), and it should still be a relevant test.

          Thanks

           hongchao.zhang Hongchao Zhang added a comment

          Hi,

           It might be difficult to get the opportunity to install packages with those 2 patches reverted at the customer site.
           Instead, could we just set ptlrpcd_bind_policy=1 and max_ptlrpcds=2 as options for the ptlrpc kernel module, so that it behaves as if the patch from ORNL-22 were not applied?
           Is that still a relevant test for you?

          Thanks,
          Sebastien.

           sebastien.buisson Sebastien Buisson (Inactive) added a comment
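           For reference, module options like these would normally go into a modprobe configuration file, along the lines of the snippet below. The file path is only an example, and whether this fully reproduces the pre-ORNL-22 behaviour is exactly the open question above.

             # /etc/modprobe.d/lustre.conf (example path)
             # Limit ptlrpcd to two bound threads, approximating the
             # behaviour before the ORNL-22 thread-pool patch.
             options ptlrpc max_ptlrpcds=2 ptlrpcd_bind_policy=1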

           The remaining "llcd" should have been sent over a ptlrpc_request, since llog_ctxt->loc_llcd == NULL, and the request could not finish, so "llcd_interpret"
           was not called to free the "llcd". Among the patches currently applied, 2 patches (ORNL-22, general ptlrpcd threads pool support; LU-1144, implement a
           NUMA-aware ptlrpcd binding policy) are related to this. Could you please help to revert the 2 patches and test? Thanks!

           hongchao.zhang Hongchao Zhang added a comment

          Hi,
           Does the kernel dump referred to in the comment at 11/Feb/13 3:56 PM still exist? If so, could you please print the content at 0xffff88021b2c2050
           as a "struct llog_canceld_ctxt"? Besides, can the console output (just the part related to Lustre) be attached here? Thanks a lot!

           hongchao.zhang Hongchao Zhang added a comment
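           In the crash(8) utility, printing a dumped address as a named structure looks like this:

             crash> struct llog_canceld_ctxt 0xffff88021b2c2050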

           > Could it be memory corruption?
           That would be unexpected. The server is fully ECC protected and, as it's an OSS, almost nothing but Linux and Lustre runs on this node.

           > Has the issue occurred again recently?
           It occurred the last 4 times we stopped Lustre on the node.

           louveta Alexandre Louvet (Inactive) added a comment

          the list "lcm_llcds" is corrupted for its value of "next" and "prev" is wrong (it's not in the address region of "struct llog_commit_master").
          Could it be memory corrupt? there is no trace of the bug yet, sorry!

          does the issue occur again recently?

           hongchao.zhang Hongchao Zhang added a comment
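           The address-range check described above can be expressed as a small helper for inspecting a dumped list head. For an empty lcm_llcds list, next and prev must both point back at the head, which lives inside the llog_commit_master itself; links outside that region mean either queued entries or corruption. The helper below is a generic sketch, not code from the dump analysis.

             #include <stdbool.h>
             #include <stddef.h>
             #include <stdint.h>

             struct list_head { struct list_head *next, *prev; };

             /* Returns true if both links of a list head fall inside the
              * [base, base + size) region of the containing object, the
              * property checked here against struct llog_commit_master. */
             static bool list_links_in_region(const struct list_head *h,
                                              const void *base, size_t size)
             {
                     uintptr_t lo = (uintptr_t)base, hi = lo + size;
                     uintptr_t n  = (uintptr_t)h->next;
                     uintptr_t p  = (uintptr_t)h->prev;

                     return n >= lo && n < hi && p >= lo && p < hi;
             }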

          People

            hongchao.zhang Hongchao Zhang
            louveta Alexandre Louvet (Inactive)
            Votes: 0
            Watchers: 8
