Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Environment: lustre-master-ib #404
    • Severity: 3

    Description

      One MDS hung during mount during the failover process.

      soak-9 console

      [ 3961.086008] mount.lustre    D ffff8f5730291070     0  5206   5205 0x00000082
      [ 3961.093940] Call Trace:
      [ 3961.096752]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.105419]  [<ffffffff99380a09>] schedule+0x29/0x70
      [ 3961.110980]  [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
      [ 3961.117509]  [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
      [ 3961.124565]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.133226]  [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
      [ 3961.139955]  [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
      [ 3961.146222]  [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
      [ 3961.154226]  [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
      [ 3961.161271]  [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
      [ 3961.169552]  [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      [ 3961.176961]  [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
      [ 3961.184004]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.192673]  [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
      [ 3961.200110]  [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
      [ 3961.207925]  [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
      [ 3961.216133]  [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
      [ 3961.224020]  [<ffffffff98e23db6>] ? kfree+0x106/0x140
      [ 3961.229698]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.238396]  [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
      [ 3961.246314]  [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
      [ 3961.254033]  [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [ 3961.262511]  [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
      [ 3961.268390]  [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
      [ 3961.275401]  [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
      [ 3961.281064]  [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
      [ 3961.287299]  [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
      [ 3961.293070]  [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
      [ 3961.300073]  [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
      [ 3961.307276]  [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
      [ 3961.312939]  [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
      [ 3961.319665]  [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
      [ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
      [ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect
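
      For hangs like this, it helps to capture the blocked-task stacks before the node is rebooted. A minimal sketch, assuming console/root access to the hung MDS (a sysrq dump is what typically produces traces like the one above):

      # dump stacks of all uninterruptible (D-state) tasks to the console/dmesg
      echo w > /proc/sysrq-trigger
      # or grab the stack of the hung mount task directly
      cat /proc/$(pidof mount.lustre)/stack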
      
      

      Attachments

        1. lustre-log.1588133843.6068-soak-8
          124.12 MB
        2. soak-11.log-051120
          944 kB
        3. soak-9.log-20200419.gz
          184 kB
        4. trace-8
          1002 kB
        5. trace-s-11-051120
          997 kB
        6. trace-soak8
          976 kB

        Issue Links

          Activity

            [LU-13469] MDS hung during mount

            bzzz Alex Zhuravlev added a comment -

            sarah, I don't think there is any relation here. You can either modify the source or set panic_on_lbug=0 in the scripts, or in the modules conf file.
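
            For reference, a sketch of the two approaches: the lctl parameter takes effect immediately on a running node, while the modules conf file applies at module load. The conf file path here is an assumption; the libcfs_panic_on_lbug module parameter name matches Alex's comment below.

            # on a running node, takes effect immediately
            lctl set_param panic_on_lbug=0
            # persistently, applied whenever libcfs loads (file path is illustrative)
            echo 'options libcfs libcfs_panic_on_lbug=0' >> /etc/modprobe.d/lustre.conf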
            sarah Sarah Liu added a comment -

            Hi Alex,

            I am having a weird issue when setting panic_on_lbug=0 permanently on soak-8 (MGS). Here is what I did:
            1. lctl set_param -P panic_on_lbug=0
            2. umounted and remounted as ldiskfs and checked the config log; the value was set to 0
            3. mounted lustre and checked panic_on_lbug; it was still 1 and did not change

            I am not sure if this is related to the llog issue here, can you please check? Do you need any log for this? If it is unrelated, I will create a new ticket, and may need to delete the bad records and restart.

            Thanks

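
            One way to narrow down where the -P setting is being lost is to compare the live value on each server against the configuration llogs on the MGS. A sketch, assuming the MGS device is named MGS:

            # on each server after remounting lustre, check the live value
            lctl get_param panic_on_lbug
            # on the MGS, list the configuration llogs that set_param -P writes to
            lctl --device MGS llog_catlist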
            sarah Sarah Liu added a comment -

            Hi Alex, I will restart with debug enabled.


            bzzz Alex Zhuravlev added a comment -

            Would it be possible to grab full Lustre logs with libcfs_panic_on_lbug=0?
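
            A minimal sketch of collecting full debug logs, assuming enough RAM for the debug buffer (the 1024 MB size is illustrative):

            # enable all debug flags and enlarge the kernel debug buffer
            lctl set_param debug=-1
            lctl set_param debug_mb=1024
            # ... reproduce the LBUG, then dump the buffer to a file
            lctl dk > /tmp/lustre-debug.$(hostname).log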
            sarah Sarah Liu added a comment -

            The LBUG seems to be easily reproducible on soak; I have come across it 2 or 3 times so far.


            bzzz Alex Zhuravlev added a comment -

            I think this shares a root cause with LU-13195 - basically the llog got corrupted, which then leads to different symptoms.
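
            For reference, a corrupted llog can be inspected offline with the llog_reader utility; a hedged sketch, assuming an ldiskfs-based MDT (device, mountpoint, and file names are illustrative):

            # mount the MDT read-only as ldiskfs
            mount -t ldiskfs -o ro /dev/sda /mnt/mdt
            # dump a suspect update llog and look for inconsistent record indexes
            llog_reader /mnt/mdt/update_log_dir/<llog-file>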
            sarah Sarah Liu added a comment - - edited

            I restarted the test and the system crashed overnight on 3 MDSs. There are crash dumps available on spirit for soak-8/9/10; the path is:
            /scratch/dumps/soak-9.spirit.whamcloud.com/10.10.1.109-2020-04-22-01:59:08

            Last login: Tue Apr 21 12:05:20 on ttys001
            DDN11368M:~ wliu$ ssh w3liu@ssh-1.spirit.whamcloud.int
            Last login: Tue Apr 21 18:50:40 2020 from host-64-47-132-181.masergy.com
            [w3liu@ssh-1 ~]$ ssh root@soak-16
            The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
            RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
            Are you sure you want to continue connecting (yes/no)? yes
            Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
            Password: 
            Last login: Tue Apr 21 17:06:58 2020 from ssh-1.spirit.whamcloud.com
            [root@soak-16 ~]# ssh soak-9
            Last login: Tue Mar  3 23:55:21 2020 from soak-16.spirit.whamcloud.com
            [root@soak-9 ~]# ls /tmp/
            krb5ccmachine_SPIRIT.WHAMCLOUD.COM                                                      systemd-private-ba1d46467241463c9bf777e964af7eaa-ntpd.service-yRleX7
            openldap-tlsmc-certs--F2A4EB14DB359420C21982FE2480932C55B52B238FF1E9079F379137D27D13CC  systemd-private-e3a3d4c94d1741a68674eb88c50144a6-ntpd.service-5xNWmx
            [root@soak-9 ~]# exit
            logout
            Connection to soak-9 closed.
            [root@soak-16 ~]# ssh soak-8
            Last login: Mon Apr  6 19:47:28 2020 from soak-16.spirit.whamcloud.com
            [root@soak-8 ~]# ls /tmp/
            krb5ccmachine_SPIRIT.WHAMCLOUD.COM                                                      systemd-private-c062f40ef20c422c98b9375d108e2d78-ntpd.service-bpcY2l
            openldap-tlsmc-certs--283FB28B7CD85BA8C6BD6F2F1837458C8301AFC4036ADCD22B40A473F4F87DAE  systemd-private-d5bd51fa26ff4fc3aab039698e87f37f-ntpd.service-wXZA7q
            systemd-private-1faf9075d15b445eab312529a1bc7e2e-ntpd.service-oOzCQD                    systemd-private-f83975b64fc54f1498329d9f871d565d-ntpd.service-tdxy0M
            [root@soak-8 ~]# exit
            logout
            Connection to soak-8 closed.
            [root@soak-16 ~]# screen -x
            [detached from 12070.pts-0.soak-16]
            [root@soak-16 ~]# exit
            logout
            Connection to soak-16 closed.
            [w3liu@ssh-1 ~]$ ls
            aeon          ddn-681             HP259  lu-13050.stack  savelog     soak-conf        soak-homedir.tgz  soak-toolbin  tmp          toolbin.tgz
            backup-frank  download-build-bin  logs   repo            soak-5.tgz  soak-format.tgz  soak-merge        test-git      tmp-scripts
            [w3liu@ssh-1 ~]$ mkdir lu-13469
            [w3liu@ssh-1 ~]$ cd lu-13469/
            [w3liu@ssh-1 lu-13469]$ scp soak-16://scratch/logs/console/soak-9.log-20200419.tgz .
            The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
            RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
            Are you sure you want to continue connecting (yes/no)? yes
            Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
            scp: //scratch/logs/console/soak-9.log-20200419.tgz: No such file or directory
            [w3liu@ssh-1 lu-13469]$ ssh soak-16
            The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
            RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
            Are you sure you want to continue connecting (yes/no)? yes
            Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
            Last login: Mon Apr 20 21:08:14 2020 from ssh-1.spirit.whamcloud.com
            [w3liu@soak-16 ~]$ exit
            logout
            Connection to soak-16 closed.
            [w3liu@ssh-1 lu-13469]$ ssh root@soak-16
            The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
            RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
            Are you sure you want to continue connecting (yes/no)? yes
            Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
            Password: 
            Last login: Tue Apr 21 20:26:23 2020 from ssh-1.spirit.whamcloud.com
            [root@soak-16 ~]# screen -x
            [detached from 12070.pts-0.soak-16]
            [root@soak-16 ~]# exit
            logout
            Connection to soak-16 closed.
            [w3liu@ssh-1 lu-13469]$ scp root@soak-16:/scratch/logs/console/soak-9.log-20200419.tgz .
            The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
            RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
            Are you sure you want to continue connecting (yes/no)? yes
            Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
            Password: 
            scp: /scratch/logs/console/soak-9.log-20200419.tgz: No such file or directory
            [w3liu@ssh-1 lu-13469]$ scp root@soak-16:/scratch/logs/console/soak-9.log-20200419.gz .
            The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
            RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
            Are you sure you want to continue connecting (yes/no)? yes
            Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
            Password: 
            soak-9.log-20200419.gz                                                                                                          100%  184KB 183.6KB/s   00:00    
            [w3liu@ssh-1 lu-13469]$ ssh root@soak-16
            The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
            RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
            Are you sure you want to continue connecting (yes/no)? yes
            Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
            Password: 
            Last login: Tue Apr 21 20:29:06 2020 from ssh-1.spirit.whamcloud.com
            [root@soak-16 ~]# screen -x
            
            [338152.399548] Lustre: Skipped 1 previous similar message
            [338153.078244] Lustre: soaked-MDT0000: Client 333b7f48-68da-4 (at 192.168.1.120@o2ib) reconnecting
            [338153.088076] Lustre: Skipped 6 previous similar messages
            [338154.126802] Lustre: soaked-MDT0000: Client b8a5f1df-2f21-4 (at 192.168.1.119@o2ib) reconnecting
            [338154.136639] Lustre: Skipped 8 previous similar messages
            [338533.965788] LNet: 57040:0:(o2iblnd_cb.c:3394:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 1 seconds
            [338533.977364] LNet: 57040:0:(o2iblnd_cb.c:3394:kiblnd_check_conns()) Skipped 146 previous similar messages
            [338610.961038] LNetError: 57040:0:(lib-move.c:3680:lnet_handle_recovery_reply()) peer NI (192.168.1.110@o2ib) recovery failed with -110
            [338610.974469] LNetError: 57040:0:(lib-move.c:3680:lnet_handle_recovery_reply()) Skipped 122 previous similar messages
            [338753.412960] Lustre: soaked-MDT0000: Client f59a1caf-b57f-4 (at 192.168.1.130@o2ib) reconnecting
            [338753.422834] Lustre: Skipped 2 previous similar messages
            [338753.428846] Lustre: soaked-MDT0000: Connection restored to 3a61c982-16a5-4 (at 192.168.1.130@o2ib)
            [338753.438969] Lustre: Skipped 20 previous similar messages
            [338760.544124] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
            [339000.160995] Lustre: 57949:0:(mdd_device.c:1825:mdd_changelog_clear()) soaked-MDD0000: No entry for user 1
            [340067.766748] LustreError: 57125:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
            [340067.789495] LustreError: 57125:0:(out_handler.c:918:out_tx_end()) Skipped 299 previous similar messages
            [340131.819910] LustreError: 58494:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
            [340131.842686] LustreError: 58494:0:(out_handler.c:918:out_tx_end()) Skipped 1855 previous similar messages
            [340265.331826] LustreError: 57983:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
            [340265.354552] LustreError: 57983:0:(out_handler.c:918:out_tx_end()) Skipped 2651 previous similar messages
            [340419.033915] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
            [340419.048319] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
            [340424.618904] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
            [340424.633301] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 22 previous similar messages
            [340424.644781] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
            [340424.659646] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 22 previous similar messages
            [340432.347952] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
            ...
            [340475.463999] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -116
            [340475.479044] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 177 previous similar messages
            [340516.956460] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
            [340516.970860] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 294 previous similar messages
            [340516.982425] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
            [340516.997295] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 294 previous similar messages
            [340581.591455] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
            [340581.605835] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 172 previous similar messages
            [340581.617394] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
            [340581.632256] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 172 previous similar messages
            [340681.741732] LustreError: 58187:0:(llog_osd.c:616:llog_osd_write_rec()) soaked-MDT0003-osp-MDT0000: index 36369 already set in log bitmap
            [340681.755535] LustreError: 58187:0:(llog_osd.c:618:llog_osd_write_rec()) LBUG
            [340681.763417] Pid: 58187, comm: mdt_rdpg00_004 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
            [340681.775080] Call Trace:
            [340681.777921]  [<ffffffffc1041e6c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
            [340681.785341]  [<ffffffffc1041f1c>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [340681.792402]  [<ffffffffc10d34e5>] llog_osd_write_rec+0x1395/0x14b0 [obdclass]
            [340681.800519]  [<ffffffffc10c22c3>] llog_write_rec+0x273/0x540 [obdclass]
            [340681.808044]  [<ffffffffc10c79f0>] llog_cat_add_rec+0x220/0x880 [obdclass]
            [340681.815763]  [<ffffffffc10bf422>] llog_add+0x162/0x1d0 [obdclass]
            [340681.822701]  [<ffffffffc1490a01>] sub_updates_write+0x2e8/0xdd7 [ptlrpc]
            [340681.830372]  [<ffffffffc1472027>] top_trans_stop+0x827/0xbc0 [ptlrpc]
            [340681.837748]  [<ffffffffc19ab98c>] lod_trans_stop+0x25c/0x340 [lod]
            [340681.844795]  [<ffffffffc1a6623e>] mdd_trans_stop+0x2e/0x174 [mdd]
            [340681.851729]  [<ffffffffc1a59a07>] mdd_attr_set+0x6a7/0xdd0 [mdd]
            [340681.858558]  [<ffffffffc18e21b9>] mdt_mfd_close+0x6b9/0x860 [mdt]
            [340681.865523]  [<ffffffffc18e7a71>] mdt_close_internal+0x121/0x220 [mdt]
            [340681.872958]  [<ffffffffc18e7d91>] mdt_close+0x221/0x790 [mdt]
            [340681.879515]  [<ffffffffc145db9a>] tgt_request_handle+0x95a/0x1630 [ptlrpc]
            [340681.887378]  [<ffffffffc1402f76>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
            [340681.896074]  [<ffffffffc1407424>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
            [340681.903228]  [<ffffffff842c61f1>] kthread+0xd1/0xe0
            [340681.908808]  [<ffffffff8498dd37>] ret_from_fork_nospec_end+0x0/0x39
            [340681.915929]  [<ffffffffffffffff>] 0xffffffffffffffff
            [340681.921610] Kernel panic - not syncing: LBUG
            [340681.926473] CPU: 7 PID: 58187 Comm: mdt_rdpg00_004 Kdump: loaded Tainted: P           OE  ------------   3.10.0-1062.9.1.el7_lustre.x86_64 #1
            [340681.940752] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [340681.953368] Call Trace:
            [340681.956203]  [<ffffffff8497ac23>] dump_stack+0x19/0x1b
            [340681.962038]  [<ffffffff84974967>] panic+0xe8/0x21f
            [340681.967489]  [<ffffffffc1041f6b>] lbug_with_loc+0x9b/0xa0 [libcfs]
            [340681.974504]  [<ffffffffc10d34e5>] llog_osd_write_rec+0x1395/0x14b0 [obdclass]
            [340681.982585]  [<ffffffffc10c22c3>] llog_write_rec+0x273/0x540 [obdclass]
            [340681.990085]  [<ffffffffc10c79f0>] llog_cat_add_rec+0x220/0x880 [obdclass]
            [340681.997775]  [<ffffffffc10bf422>] llog_add+0x162/0x1d0 [obdclass]
            [340682.004696]  [<ffffffffc10edf19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            [340682.012803]  [<ffffffffc1490a01>] sub_updates_write+0x2e8/0xdd7 [ptlrpc]
            [340682.020422]  [<ffffffffc1475b43>] ? update_records_attr_set_pack+0xc3/0xf0 [ptlrpc]
            [340682.029124]  [<ffffffffc14906fd>] ? prepare_writing_updates.isra.12+0x139/0x14a [ptlrpc]
            [340682.038311]  [<ffffffffc1472027>] top_trans_stop+0x827/0xbc0 [ptlrpc]
            [340682.045606]  [<ffffffffc19ab98c>] lod_trans_stop+0x25c/0x340 [lod]
            [340682.052620]  [<ffffffffc1a6623e>] mdd_trans_stop+0x2e/0x174 [mdd]
            [340682.059528]  [<ffffffffc1a59a07>] mdd_attr_set+0x6a7/0xdd0 [mdd]
            [340682.066362]  [<ffffffffc13f5e97>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
            [340682.074451]  [<ffffffffc18e21b9>] mdt_mfd_close+0x6b9/0x860 [mdt]
            [340682.081377]  [<ffffffffc13f636f>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
            [340682.089646]  [<ffffffffc18e7a71>] mdt_close_internal+0x121/0x220 [mdt]
            [340682.097039]  [<ffffffffc18e7d91>] mdt_close+0x221/0x790 [mdt]
            [340682.103593]  [<ffffffffc145db9a>] tgt_request_handle+0x95a/0x1630 [ptlrpc]
            [340682.111377]  [<ffffffffc102902e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [340682.119391]  [<ffffffffc1402f76>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
            [340682.128070]  [<ffffffffc13fde2e>] ? ptlrpc_wait_event+0x12e/0x5b0 [ptlrpc]
            [340682.135852]  [<ffffffff842d3903>] ? __wake_up+0x13/0x20
            [340682.141814]  [<ffffffffc140254a>] ? ptlrpc_server_handle_req_in+0x92a/0x1100 [ptlrpc]
            [340682.150657]  [<ffffffff842c72e0>] ? wake_up_atomic_t+0x30/0x30
            [340682.157298]  [<ffffffffc1407424>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
            [340682.164440]  [<ffffffffc1406870>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
            [340682.172793]  [<ffffffff842c61f1>] kthread+0xd1/0xe0
            [340682.178335]  [<ffffffff842c6120>] ? insert_kthread_work+0x40/0x40
            [340682.185231]  [<ffffffff8498dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [340682.192614]  [<ffffffff842c6120>] ? insert_kthread_work+0x40/0x40
            
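            The crash dumps above can be opened with the crash utility; a sketch, assuming matching kernel debuginfo is installed and the dump directory contains a standard vmcore (paths are illustrative):

            crash /usr/lib/debug/lib/modules/3.10.0-1062.9.1.el7_lustre.x86_64/vmlinux \
                /scratch/dumps/soak-9.spirit.whamcloud.com/10.10.1.109-2020-04-22-01:59:08/vmcore
            # inside crash: backtrace of the panicking task, then all tasks
            crash> bt
            crash> foreach bt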

            bzzz Alex Zhuravlev added a comment -

            It would be helpful to see all available traces.
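
            A sketch of one way to capture task traces from all affected servers at once, assuming pdsh access to the soak nodes (the host list is illustrative):

            # trigger a full task-state dump on each MDS; the traces land in
            # dmesg / on the serial console, which soak captures centrally
            pdsh -w soak-[8-10] 'echo t > /proc/sysrq-trigger'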

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 7