Details

    Description

      This issue was created by maloo for liuying <emoly.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d938f700-c59c-11e7-9c63-52540065bddc.

      The sub-test test_57 failed with the following error:

      Timeout occurred after 630 mins, last suite running was recovery-small, restarting cluster to continue tests
      
      [14587.215937] general protection fault: 0000 [#1] SMP 
      [14587.226037] CPU: 0 PID: 4274 Comm: lctl Tainted: G           OE  ------------   3.10.0-693.5.2.el7.x86_64 #1
      [14587.227003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [14587.228344] RIP: 0010:[<ffffffffc0742fa1>]  [<ffffffffc0742fa1>] lprocfs_stats_collect+0xc1/0x140 [obdclass]
      [14587.239740] Call Trace:
      [14587.240004]  [<ffffffffc0743b87>] lprocfs_stats_seq_show+0x47/0x140 [obdclass]
      [14587.240754]  [<ffffffff81226550>] seq_read+0x250/0x3b0
      [14587.241283]  [<ffffffff812702cd>] proc_reg_read+0x3d/0x80
      [14587.241831]  [<ffffffff81200b1c>] vfs_read+0x9c/0x170
      [14587.242347]  [<ffffffff812019df>] SyS_read+0x7f/0xe0
      [14587.242854]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
      

      Please provide additional information about the failure here.

      Info required for matching: recovery-small 57

          Activity

            [LU-10224] recovery-small test_57: timeout

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31197/
            Subject: LU-10224 obd: free obd_svc_stats when all users are gone
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: d9c595b530c5bc5f112698151bd1e9f4d74080cb

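            The patch subject, "free obd_svc_stats when all users are gone", suggests the stats object is given a reference-counted lifetime instead of being freed unconditionally during cleanup. Below is a minimal sketch of that pattern; the struct and helper names are hypothetical and are not taken from the actual change:

            /* Illustrative refcounting sketch (hypothetical names, not the real patch):
             * every /proc reader takes a reference on the stats object, and the object
             * is only freed once the last reference is dropped. */
            #include <linux/atomic.h>
            #include <linux/slab.h>

            struct svc_stats_example {
                    atomic_t ss_refcount;   /* one reference for the obd, plus one per open reader */
                    /* ... counter arrays, spinlock, etc. ... */
            };

            static inline void svc_stats_get(struct svc_stats_example *stats)
            {
                    atomic_inc(&stats->ss_refcount);
            }

            static inline void svc_stats_put(struct svc_stats_example *stats)
            {
                    /* Whoever drops the last reference (a proc reader finishing or the
                     * obd cleanup path) frees the memory, so a racing "lctl get_param"
                     * can no longer dereference freed stats. */
                    if (atomic_dec_and_test(&stats->ss_refcount))
                            kfree(stats);
            }
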
            bogl Bob Glossman (Inactive) added a comment - another occurrence on b2_10: https://testing.hpdd.intel.com/test_sets/91b464c2-1ce1-11e8-a10a-52540065bddc

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31197
            Subject: LU-10224 obd: free obd_svc_stats when all users are gone
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 758bb9b569c14f30c0257fbb7f4f675b6f003040

            pjones Peter Jones added a comment -

            Landed for 2.11


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30249/
            Subject: LU-10224 obd: free obd_svc_stats when all users are gone
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ffc843a0aacd78495b1cff51344aaee3e32fc2de


            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/30249
            Subject: LU-10224 obd: free obd_svc_stats when all users are gone
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1d4fc1f20bd701edb45ff3e2a26b734f7ba08e37


            bfaccini Bruno Faccini (Inactive) added a comment -

            Andreas, I believe that, according to LU-5378 and the history of Maloo test results, this is a very old issue, but one not frequent enough to have been addressed until now.

            Also, while testing my fix proposal (a first try covering only the osc lprocfs), I am unable to reproduce the crash on a single-node setup, even with a heavier reproducer than recovery-small/test_57:

            # while true ; do umount /mnt/lustre ; mount -t lustre -o rw,flock,user_xattr,lazystatfs 10.2.5.126@tcp:/lustre /mnt/lustre ; done &
            # while true ; do lctl get_param osc.*-f*.stats ; done &
            

            But I am still able to get the same crash by stressing the mdc lprocfs with this reproducer:

            # while true ; do umount /mnt/lustre ; mount -t lustre -o rw,flock,user_xattr,lazystatfs 10.2.5.126@tcp:/lustre /mnt/lustre ; done &
            # while true ; do lctl get_param mdc.*.stats ; done &
            

            and when stressing the osp lprocfs with the same kind of reproducer:

            # while true ; do umount /mnt/lustre-mds1 ; mount -t lustre -o loop /tmp/lustre-mdt1 /mnt/lustre-mds1 ; done &
            # while true ; do lctl get_param osp.*.stats ; done &
            

            So, as I suspected before, it seems my fix idea is valid, and it may also need to be applied to all the obd_device types I have already identified and listed.

            I will try to push a full patch soon.


            adilger Andreas Dilger added a comment -

            Bruno, thanks for looking into this.

            We used to have proper protection of the /proc entries, but I guess that was lost during the transition to sysfs or similar. Is there anything that landed just before this bug started appearing that might be the cause?


            bfaccini Bruno Faccini (Inactive) added a comment -

            Stacks showing the umount/lctl racy scenario in the crash-dump:

            PID: 25606  TASK: ffff88007b579fa0  CPU: 1   COMMAND: "lctl"
            bt: seek error: kernel virtual address: ffffffffffffffff  type: "cpu_online_map"
             #0 [ffff880069293b50] machine_kexec at ffffffff8105c4cb
             #1 [ffff880069293bb0] __crash_kexec at ffffffff81104a42
             #2 [ffff880069293c80] crash_kexec at ffffffff81104b30
             #3 [ffff880069293c98] oops_end at ffffffff816ad338
             #4 [ffff880069293cc0] die at ffffffff8102e97b
             #5 [ffff880069293cf0] do_general_protection at ffffffff816accbe
             #6 [ffff880069293d20] general_protection at ffffffff816ac568
                [exception RIP: lprocfs_stats_collect+193]
                RIP: ffffffffc0720a41  RSP: ffff880069293dd8  RFLAGS: 00010246
                RAX: 0000000000007c15  RBX: ffff880069293e20  RCX: 0000000000000006
                RDX: 6f2d313030305953  RSI: 6f2d313030305a53  RDI: 0000000000000006
                RBP: ffff880069293e00   R8: 0000000000000500   R9: 0000000000000100
                R10: 0000000000000000  R11: ffff880069293ca6  R12: ffff88007c15a2c0
                R13: 0000000000000020  R14: ffff880079723000  R15: ffff880068d94d00
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
             #7 [ffff880069293e08] lprocfs_stats_seq_show at ffffffffc0721627 [obdclass]
             #8 [ffff880069293e78] seq_read at ffffffff81226550
             #9 [ffff880069293ee8] proc_reg_poll at ffffffff812702cd  <<<< or proc_reg_read() ???????
            #10 [ffff880069293f08] vfs_read at ffffffff81200b1c
            #11 [ffff880069293f38] sys_read at ffffffff812019df
            #12 [ffff880069293f80] system_call_fastpath at ffffffff816b5089
                RIP: 00007f3b2efdc6f0  RSP: 00007fff00fef468  RFLAGS: 00000246
                RAX: 0000000000000000  RBX: ffffffff816b5089  RCX: ffffffffffffffff
                RDX: 0000000000001000  RSI: 0000000001d74250  RDI: 0000000000000003
                RBP: 0000000001d7425a   R8: 0000000001d72430   R9: 0000000000001000
                R10: 0000000000000063  R11: 0000000000000246  R12: 00007fff00ff0610
                R13: 0000000001d74250  R14: 0000000000000001  R15: 0000000001d72410
                ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
            
            PID: 25600  TASK: ffff88007ad72f70  CPU: 0   COMMAND: "umount"
             #0 [ffff880068ceb810] __schedule at ffffffff816a9005
             #1 [ffff880068ceb878] schedule at ffffffff816a9589
             #2 [ffff880068ceb888] schedule_timeout at ffffffff816a7099
             #3 [ffff880068ceb930] wait_for_completion at ffffffff816a993d
             #4 [ffff880068ceb990] proc_entry_rundown at ffffffff81270885
             #5 [ffff880068ceb9e0] remove_proc_subtree at ffffffff81276e52
             #6 [ffff880068ceba30] proc_remove at ffffffff81276efe
             #7 [ffff880068ceba40] lprocfs_obd_cleanup at ffffffffc071baff [obdclass]
             #8 [ffff880068ceba58] osc_precleanup at ffffffffc0ad8cd7 [osc]
             #9 [ffff880068ceba80] class_cleanup at ffffffffc072f1e2 [obdclass]
            #10 [ffff880068cebaf8] class_process_config at ffffffffc0731b26 [obdclass]
            #11 [ffff880068cebba8] class_manual_cleanup at ffffffffc0732746 [obdclass]
            #12 [ffff880068cebc50] lov_putref at ffffffffc0b43a22 [lov]
            #13 [ffff880068cebcb0] lov_disconnect at ffffffffc0b4ae02 [lov]
            #14 [ffff880068cebcd8] obd_disconnect at ffffffffc0bc24f3 [lustre]
            #15 [ffff880068cebcf8] ll_put_super at ffffffffc0bc5990 [lustre]
            #16 [ffff880068cebe38] generic_shutdown_super at ffffffff81203692
            #17 [ffff880068cebe60] kill_anon_super at ffffffff81203a62
            #18 [ffff880068cebe78] lustre_kill_super at ffffffffc0734eb5 [obdclass]
            #19 [ffff880068cebe90] deactivate_locked_super at ffffffff81203e19
            #20 [ffff880068cebeb0] deactivate_super at ffffffff81204586
            #21 [ffff880068cebec8] cleanup_mnt at ffffffff812217bf
            #22 [ffff880068cebee0] __cleanup_mnt at ffffffff81221852
            #23 [ffff880068cebef0] task_work_run at ffffffff810ad275
            #24 [ffff880068cebf30] do_notify_resume at ffffffff8102ab62
            #25 [ffff880068cebf50] int_signal at ffffffff816b533d
                RIP: 00007f403a9cc3e7  RSP: 00007ffc54bf3958  RFLAGS: 00000246
                RAX: 0000000000000000  RBX: 0000556135a43040  RCX: ffffffffffffffff
                RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000556135a445a0
                RBP: 0000556135a445a0   R8: 0000000000000008   R9: 000000000000000c
                R10: 00007ffc54bf3670  R11: 0000000000000246  R12: 00007f403b542d58
                R13: 0000000000000000  R14: 0000556135a43210  R15: 0000556135a43040
                ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
            
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            After having a look at an associated crash-dump, my guess is that there is a possible race between the umount process and a concurrent "lctl get_param osc.<OSC>.stats", which is exactly what recovery-small/test_57 is intended to detect.

            This seems to occur because osc_precleanup() calls ptlrpc_lprocfs_unregister_obd() before lprocfs_obd_cleanup(), allowing the obd->obd_svc_stats lprocfs_stats struct to be freed before concurrent users of the associated proc file and internal structs are done with it and it can safely be cleaned up. Looks like a revival of LU-106!

            I wonder if this can be fixed simply by swapping the order of ptlrpc_lprocfs_unregister_obd() and lprocfs_obd_cleanup() in osc_precleanup(), as in the sketch below (and maybe also in the other places where both routines are called in the same order, like mdc_precleanup()/lwp_device_fini()/osp_init0()/osp_device_fini()).
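
            As an illustration only, here is a rough sketch of that reordering; osc_precleanup() is shown schematically, not as the actual Lustre source (the real function performs additional teardown steps):

            /* Rough sketch of the proposed call reordering; not the actual Lustre code. */
            static int osc_precleanup(struct obd_device *obd)
            {
                    /* Racy order seen in the crash:
                     *   ptlrpc_lprocfs_unregister_obd(obd);   // frees obd->obd_svc_stats first
                     *   lprocfs_obd_cleanup(obd);             // removes the proc entries afterwards
                     *
                     * Proposed order: remove the proc entries first, so that
                     * proc_remove()/proc_entry_rundown() drains any concurrent readers
                     * before the service stats they dereference are freed.
                     */
                    lprocfs_obd_cleanup(obd);
                    ptlrpc_lprocfs_unregister_obd(obd);

                    return 0;
            }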


            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo
              Votes: 0
              Watchers: 9