  1. Lustre
  2. LU-12979

OSS hung during failback

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.13.0
    • b2_13-ib build #2
    • 3

    Description

      During OSS failover testing, one node of the failover pair hung during failback after the system had been running for about 35 hours.

      soak-7

      [17882.991912] Lustre: soaked-OST0006: deleting orphan objects from 0x0:1148098 to 0x0:1149825
      [17884.486957] Lustre: Failing over soaked-OST0002
      [17884.728160] Lustre: server umount soaked-OST0002 complete
      [17895.395319] Lustre: Failing over soaked-OST000a
      [17895.413977] Lustre: server umount soaked-OST000a complete
      [17896.790424] LustreError: 137-5: soaked-OST0002_UUID: not available for connect from 192.168.1.124@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [17896.810226] LustreError: Skipped 437 previous similar messages
      [17902.199718] Lustre: Failing over soaked-OST0006
      [17902.346578] Lustre: server umount soaked-OST0006 complete
      [17908.627101] Lustre: Failing over soaked-OST000e
      [17908.758771] Lustre: server umount soaked-OST000e complete
      [17923.789823] Lustre: soaked-OST0007: Export ffff9724a6452c00 already connecting from 192.168.1.118@o2ib
      [17931.042012] Lustre: soaked-OST0007: Export ffff9724d6813400 already connecting from 192.168.1.126@o2ib
      [17937.914063] Lustre: soaked-OST0007: Export ffff9728a5360800 already connecting from 192.168.1.130@o2ib
      [17940.946368] Lustre: soaked-OST0007: Export ffff9728d851d800 already connecting from 192.168.1.137@o2ib
      [17940.956765] Lustre: Skipped 1 previous similar message
      [17946.942948] Lustre: soaked-OST0007: Export ffff9728d81df400 already connecting from 192.168.1.124@o2ib
      [17973.969095] Lustre: soaked-OST0007: Export ffff9724a6452c00 already connecting from 192.168.1.118@o2ib
      [17973.979489] Lustre: Skipped 13 previous similar messages
      [17991.122858] Lustre: soaked-OST0007: Export ffff9728d851d800 already connecting from 192.168.1.137@o2ib
      [17991.133257] Lustre: Skipped 3 previous similar messages
      [18024.148290] Lustre: soaked-OST0007: Export ffff9724a6452c00 already connecting from 192.168.1.118@o2ib
      [18024.158682] Lustre: Skipped 14 previous similar messages
      [18064.712164] Lustre: ll_ost00_008: service thread pid 18052 was inactive for 200.107 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [18064.733418] Pid: 18052, comm: ll_ost00_008 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Nov 8 18:37:40 UTC 2019
      [18064.744672] Call Trace:
      [18064.747423]  [<ffffffffc091e2d5>] cv_wait_common+0x125/0x150 [spl]
      [18064.754345]  [<ffffffffc091e315>] __cv_wait+0x15/0x20 [spl]
      [18064.760577]  [<ffffffffc0c992ef>] txg_wait_synced+0xef/0x140 [zfs]
      [18064.767530]  [<ffffffffc0c4ecc5>] dmu_tx_wait+0x275/0x3c0 [zfs]
      [18064.774174]  [<ffffffffc0c4eea2>] dmu_tx_assign+0x92/0x490 [zfs]
      [18064.780910]  [<ffffffffc16c2fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
      [18064.788312]  [<ffffffffc14e8430>] tgt_server_data_update+0x3c0/0x510 [ptlrpc]
      [18064.796347]  [<ffffffffc14ea40d>] tgt_client_del+0x29d/0x6a0 [ptlrpc]
      [18064.803581]  [<ffffffffc180523c>] ofd_obd_disconnect+0x1ac/0x220 [ofd]
      [18064.810885]  [<ffffffffc144f176>] target_handle_disconnect+0xd6/0x450 [ptlrpc]
      [18064.818985]  [<ffffffffc14f0d38>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [18064.826127]  [<ffffffffc14f983a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [18064.833851]  [<ffffffffc149ba96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [18064.842439]  [<ffffffffc149f5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [18064.849482]  [<ffffffff93cc50d1>] kthread+0xd1/0xe0
      [18064.854943]  [<ffffffff9438cd37>] ret_from_fork_nospec_end+0x0/0x39
      [18064.861953]  [<ffffffffffffffff>] 0xffffffffffffffff
      [18065.736218] Lustre: ll_ost00_006: service thread pid 18039 was inactive for 200.435 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [18065.757478] Pid: 18039, comm: ll_ost00_006 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Nov 8 18:37:40 UTC 2019
      [18065.768742] Call Trace:
      [18065.771481]  [<ffffffffc091e2d5>] cv_wait_common+0x125/0x150 [spl]
      [18065.778397]  [<ffffffffc091e315>] __cv_wait+0x15/0x20 [spl]
      [18065.784636]  [<ffffffffc0c992ef>] txg_wait_synced+0xef/0x140 [zfs]
      [18065.791576]  [<ffffffffc0c4ecc5>] dmu_tx_wait+0x275/0x3c0 [zfs]
      [18065.798219]  [<ffffffffc0c4eea2>] dmu_tx_assign+0x92/0x490 [zfs]
      [18065.804957]  [<ffffffffc16c2fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
      [18065.812358]  [<ffffffffc14e8430>] tgt_server_data_update+0x3c0/0x510 [ptlrpc]
      [18065.820373]  [<ffffffffc14ea40d>] tgt_client_del+0x29d/0x6a0 [ptlrpc]
      [18065.827613]  [<ffffffffc180523c>] ofd_obd_disconnect+0x1ac/0x220 [ofd]
      [18065.834923]  [<ffffffffc144f176>] target_handle_disconnect+0xd6/0x450 [ptlrpc]
      [18065.843022]  [<ffffffffc14f0d38>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [18065.850165]  [<ffffffffc14f983a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [18065.857886]  [<ffffffffc149ba96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [18065.866478]  [<ffffffffc149f5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [18065.873511]  [<ffffffff93cc50d1>] kthread+0xd1/0xe0
      [18065.878972]  [<ffffffff9438cd37>] ret_from_fork_nospec_end+0x0/0x39
      [18065.885982]  [<ffffffffffffffff>] 0xffffffffffffffff
      ...
      
      [21128.204987] Lustre: Skipped 1 previous similar message
      [21138.287341] Lustre: soaked-OST0002: Not available for connect from 192.168.1.137@o2ib (not set up)
      [21138.297347] Lustre: Skipped 2 previous similar messages
      [21167.633849] Lustre: soaked-OST0002: Not available for connect from 192.168.1.111@o2ib (not set up)
      [21167.643858] Lustre: Skipped 9 previous similar messages
      [21195.356109] Lustre: 23774:0:(service.c:1442:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff97278de0e780 x1650366648233984/t0(0) o5->soaked-MDT0002-mdtlov_UUID@192.168.1.110@o2ib:545/0 lens 432/432 e 1 to 0 dl 1573925640 ref 2 fl Interpret:/0/0 rc 0/0 job:''
      [21195.386391] Lustre: 23774:0:(service.c:1442:ptlrpc_at_send_early_reply()) Skipped 3 previous similar messages
      [21199.340302] ptlrpc_watchdog_fire: 3 callbacks suppressed
      [21199.346241] Lustre: ll_ost00_024: service thread pid 5497 was inactive for 1200.379 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [21199.367494] Pid: 5497, comm: ll_ost00_024 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Nov 8 18:37:40 UTC 2019
      [21199.378663] Call Trace:
      [21199.381399]  [<ffffffffc1801873>] ofd_create_hdl+0xcc3/0x2100 [ofd]
      [21199.388410]  [<ffffffffc14f983a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [21199.396155]  [<ffffffffc149ba96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [21199.404748]  [<ffffffffc149f5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [21199.411790]  [<ffffffff93cc50d1>] kthread+0xd1/0xe0
      [21199.417239]  [<ffffffff9438cd37>] ret_from_fork_nospec_end+0x0/0x39
      [21199.424249]  [<ffffffffffffffff>] 0xffffffffffffffff
      [21203.436516] Pid: 18121, comm: ll_ost01_023 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Nov 8 18:37:40 UTC 2019
      [21203.447781] Call Trace:
      [21203.450514]  [<ffffffffc1801873>] ofd_create_hdl+0xcc3/0x2100 [ofd]
      [21203.457536]  [<ffffffffc14f983a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [21203.465297]  [<ffffffffc149ba96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [21203.473893]  [<ffffffffc149f5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [21203.480928]  [<ffffffff93cc50d1>] kthread+0xd1/0xe0
      [21203.486390]  [<ffffffff9438cd37>] ret_from_fork_nospec_end+0x0/0x39
      [21203.493401]  [<ffffffffffffffff>] 0xffffffffffffffff
      [21217.812639] Lustre: soaked-OST0002: Not available for connect from 192.168.1.111@o2ib (not set up)
      [21217.822646] Lustre: Skipped 23 previous similar messages
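
For triage, the watchdog reports in a console log like the one above can be summarized with a short script. This is only a rough sketch, assuming the log has been saved as plain text; `summarize`, `WATCHDOG_RE`, and `FRAME_RE` are hypothetical helper names and not part of Lustre or its tooling. It lists each "service thread ... was inactive" event and counts the innermost symbol of every dumped call trace, which helps separate threads blocked in the ZFS wait path from those stuck in `ofd_create_hdl`.

```python
import re
from collections import Counter

# Watchdog line, e.g.
# "[18064.712164] Lustre: ll_ost00_008: service thread pid 18052 was inactive for 200.107 seconds. ..."
WATCHDOG_RE = re.compile(
    r"\[\s*\d+\.\d+\] Lustre: (?P<thread>\S+): service thread pid (?P<pid>\d+) "
    r"was inactive for (?P<secs>\d+\.\d+) seconds"
)
# Stack frame, e.g. "[<ffffffffc0c992ef>] txg_wait_synced+0xef/0x140 [zfs]".
# The trailing "[<ffffffffffffffff>] 0xffffffffffffffff" entry does not match.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (?P<sym>[A-Za-z_][\w.]*)\+0x")


def summarize(log_text):
    """Return (watchdog events, Counter of the first frame of each call trace)."""
    events = []
    top_frames = Counter()
    expect_top = False  # True right after a "Call Trace:" line
    for line in log_text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m:
            events.append((m["thread"], int(m["pid"]), float(m["secs"])))
            continue
        if "Call Trace:" in line:
            expect_top = True
            continue
        f = FRAME_RE.search(line)
        if f and expect_top:
            top_frames[f["sym"]] += 1
            expect_top = False
    return events, top_frames


if __name__ == "__main__":
    import sys

    # Usage (assumed): python summarize_watchdog.py < console.log
    events, top_frames = summarize(sys.stdin.read())
    for thread, pid, secs in events:
        print(f"{thread} (pid {pid}) inactive for {secs:.0f}s")
    for sym, count in top_frames.most_common():
        print(f"{count:4d}  {sym}")
```

Traces whose watchdog line was rate-limited (see the "callbacks suppressed" message) still contribute to the frame count, because the count is keyed on "Call Trace:" rather than on the watchdog line.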
      
      

            People

              bzzz Alex Zhuravlev
              sarah Sarah Liu