Details

    • Bug
    • Resolution: Done
    • Major
    • None
    • Lustre 2.1.3
    • 3
    • 6118

    Description

      We have 4 OSSes that crash at the same time, at umount, with the following backtrace:

      PID: 18173 TASK: ffff8803376dc040 CPU: 4 COMMAND: "umount"
      #0 [ffff8802b115f8d0] machine_kexec at ffffffff8102895b
      #1 [ffff8802b115f930] crash_kexec at ffffffff810a4622
      #2 [ffff8802b115fa00] panic at ffffffff81484657
      #3 [ffff8802b115fa80] lbug_with_loc at ffffffffa04ade5b [libcfs]
      #4 [ffff8802b115faa0] llog_recov_thread_stop at ffffffffa072e55b [ptlrpc]
      #5 [ffff8802b115fad0] llog_recov_thread_fini at ffffffffa072e593 [ptlrpc]
      #6 [ffff8802b115faf0] filter_llog_finish at ffffffffa0c7d3dd [obdfilter]
      #7 [ffff8802b115fb20] obd_llog_finish at ffffffffa057c2f8 [obdclass]
      #8 [ffff8802b115fb40] filter_precleanup at ffffffffa0c7cdaf [obdfilter]
      #9 [ffff8802b115fba0] class_cleanup at ffffffffa05a3ca7 [obdclass]
      #10 [ffff8802b115fc20] class_process_config at ffffffffa05a5feb [obdclass]
      #11 [ffff8802b115fcb0] class_manual_cleanup at ffffffffa05a6d29 [obdclass]
      #12 [ffff8802b115fd70] server_put_super at ffffffffa05b2c0c [obdclass]
      #13 [ffff8802b115fe40] generic_shutdown_super at ffffffff8116542b
      #14 [ffff8802b115fe60] kill_anon_super at ffffffff81165546
      #15 [ffff8802b115fe80] lustre_kill_super at ffffffffa05a8966 [obdclass]
      #16 [ffff8802b115fea0] deactivate_super at ffffffff811664e0
      #17 [ffff8802b115fec0] mntput_no_expire at ffffffff811826bf
      #18 [ffff8802b115fef0] sys_umount at ffffffff81183188
      #19 [ffff8802b115ff80] system_call_fastpath at ffffffff810030f2
      RIP: 00007f62ddfbdd67 RSP: 00007fffab738308 RFLAGS: 00010202
      RAX: 00000000000000a6 RBX: ffffffff810030f2 RCX: 0000000000000010
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f62deeb3bb0
      RBP: 00007f62deeb3b80 R8: 00007f62deeb3bd0 R9: 0000000000000000
      R10: 00007fffab738130 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 00007f62deeb3c10
      ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b

      This backtrace is identical to the one shown in LU-1194, which is supposed to be fixed in 2.1.3.

      The site is classified, so I can't upload the crash dump, but I can export the contents of some structures upon request.

      Attachments

        1. ptlrpcd.c
          32 kB
        2. recov_thread.c
          24 kB

        Activity

          [LU-2615] group of OSS crashed at umount

          What are the two threads involved in the race?
          Normally, llog_recov_thread_stop is only called by llog_recov_thread_fini, and "llog_recov_thread_stop" is called in two places:
          one is the cleanup for a failed llog_recov_thread_init call, the other is the normal cleanup phase during device cleanup
          (called in filter_llog_finish). They can't be called simultaneously.

          Could you please attach some more info about this issue? Can it be reproduced on your site?

          hongchao.zhang Hongchao Zhang added a comment

          We hit the same problem on lustre-2.1.6 too.

          After reading the code, I am wondering whether the following race could happen. Please correct me if I am wrong.

          filter_llog_finish
          --llog_recov_thread_fini
          ----llog_sync
          ------llog_obd_repl_sync
          --------llog_cancel
          ----------llog_obd_repl_cancel
          ------------llcd_push
          --------------llcd_send
          ----------------Sending async
          ----llog_recov_thread_stop
          ------LBUG, because llcd_send is sending an llcd and llcd_interpret() has not been called since no reply has been received yet.

          Thanks!

          lixi Li Xi (Inactive) added a comment

          Hi,

          I have asked people on site for the results of the tests.

          Cheers,
          Sebastien.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment

          Hi, what is the output of the test? Thanks

          hongchao.zhang Hongchao Zhang added a comment

          Hi,

          Yes, it will disable the ptlrpcd thread pools (although it does not remove the effect of the patch completely), and it should still be a relevant test.

          Thanks

          hongchao.zhang Hongchao Zhang added a comment

          Hi,

          It might be difficult to get the opportunity to install packages with those 2 patches reverted at the customer site.
          Instead, could we just set ptlrpcd_bind_policy=1 and max_ptlrpcds=2 as options for the ptlrpc kernel module, so that it behaves as if the patch from ORNL-22 were not applied?
          Is that still a relevant test for you?

          Thanks,
          Sebastien.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment
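For reference, the module options discussed above could be applied through a modprobe configuration fragment like the following (the file path is a conventional choice, not one stated in this ticket; the option names are the ones quoted in the comment):

```shell
# /etc/modprobe.d/lustre.conf (path is site-specific)
# Limit ptlrpcd to 2 threads and pin the binding policy, as suggested
# above, so the generalized thread-pool behaviour is effectively disabled.
options ptlrpc max_ptlrpcds=2 ptlrpcd_bind_policy=1
```

After the ptlrpc module is reloaded, the effective values can be checked under /sys/module/ptlrpc/parameters/.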

          The remaining "llcd" should have been sent over a ptlrpc_request, since llog_ctxt->loc_llcd == NULL, and the request could not finish, so "llcd_interpret" was not
          called to free the "llcd". Among the patches currently applied, 2 are related to this (ORNL-22, general ptlrpcd thread pool support; LU-1144, implement a NUMA-aware
          ptlrpcd binding policy). Could you please help to revert those 2 patches and test? Thanks!

          hongchao.zhang Hongchao Zhang added a comment

          Hi,
          Does the kernel dump referred to in the comment of 11/Feb/13 3:56 PM still exist? If so, could you please print the content at 0xffff88021b2c2050
          as a "struct llog_canceld_ctxt"? Besides, can the console output (just the part related to Lustre) be attached here? Thanks a lot!

          hongchao.zhang Hongchao Zhang added a comment
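Assuming the vmcore is inspected with the crash(8) utility (as the backtrace in the description suggests), the requested structure dump could look like the session below; the address is the one quoted in the comment, and the Lustre module debuginfo must be loaded first:

```shell
# inside a crash(8) session on the saved vmcore
crash> mod -s ptlrpc                               # load module debuginfo
crash> struct llog_canceld_ctxt 0xffff88021b2c2050 # print the requested llcd
```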

          > Could it be memory corruption?
          It's unexpected. The server is fully ECC protected and, as it's an OSS, almost only Linux & Lustre are running on this node.

          > has the issue occurred again recently?
          It did occur the last 4 times we stopped Lustre on the node.

          louveta Alexandre Louvet (Inactive) added a comment

          The list "lcm_llcds" is corrupted: its "next" and "prev" values are wrong (they are not in the address region of "struct llog_commit_master").
          Could it be memory corruption? There is no trace of the bug yet, sorry!

          Has the issue occurred again recently?

          hongchao.zhang Hongchao Zhang added a comment

          Hi, here are the source files requested by Hongchao.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment

          People

            hongchao.zhang Hongchao Zhang
            louveta Alexandre Louvet (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: