
LU-11125: ofd_create_hdl() destroys_in_progress already cleared

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
    • Affects Version/s: Lustre 2.10.4
    • Environment: lustre-2.10.4_1.chaos-1.ch6.x86_64 servers
      RHEL 7.5
      DNE1 file system
    • Severity: 3

    Description

      Servers were restarted and appeared to recover normally.  They briefly appeared to be handling the same (heavy) workload from before they were powered off, then started logging the "system was overloaded" message.  The kernel then reported several stacks like this:

      INFO: task ll_ost00_007:108440 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ll_ost00_007 D ffff8ba4dc35bf40 0 108440 2 0x00000080
      Call Trace:
      [<ffffffffaad38919>] schedule_preempt_disabled+0x39/0x90
      [<ffffffffaad3654f>] __mutex_lock_slowpath+0x10f/0x250
      [<ffffffffaad357f2>] mutex_lock+0x32/0x42
      [<ffffffffc1669afb>] ofd_create_hdl+0xdcb/0x2090 [ofd]
      [<ffffffffc1322007>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
      [<ffffffffc132235f>] ? lustre_pack_reply_v2+0x14f/0x290 [ptlrpc]
      [<ffffffffc1322691>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      [<ffffffffc138653a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [<ffffffffc132db5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [<ffffffffc132b26b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
      [<ffffffffaa6d5c32>] ? default_wake_function+0x12/0x20
      [<ffffffffaa6cb01b>] ? __wake_up_common+0x5b/0x90
      [<ffffffffc1331c70>] ptlrpc_main+0xae0/0x1e90 [ptlrpc]
      [<ffffffffc1331190>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
      [<ffffffffaa6c0ad1>] kthread+0xd1/0xe0
      [<ffffffffaa6c0a00>] ? insert_kthread_work+0x40/0x40
      [<ffffffffaad44837>] ret_from_fork_nospec_begin+0x21/0x21
      [<ffffffffaa6c0a00>] ? insert_kthread_work+0x40/0x40

      And Lustre began reporting:
      LustreError: 108448:0:(ofd_dev.c:1627:ofd_create_hdl()) lquake-OST0003:[27917288460] destroys_in_progress already cleared

       

    Attachments

    Issue Links

    Activity

            [LU-11125] ofd_create_hdl() destroys_in_progress already cleared
            ofaaland Olaf Faaland added a comment -

            Hello Mikhail, have you learned anything more of interest?  Thanks.


            tappro Mikhail Pershin added a comment -

            It may still indicate a problem there. Two different operations are possible on OST restart: the MDT initiates orphan destroys, but new precreates may also be initiated by write replays. Both use the same mutex, 'os_create_lock', though one process destroys unused objects and the other creates new ones. That causes high contention on that mutex exactly at OST restart, and I am checking now whether it is possible to separate the locking for the two processes. That would relieve the load on that mutex.

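            To make the contention pattern concrete, below is a minimal standalone C sketch (not Lustre code): two threads stand in for the orphan-destroy and precreate paths, and a single mutex stands in for os_create_lock, so a long destroy stalls an otherwise independent precreate. The thread names and delays are illustrative only.

            /* sketch.c: build with `cc sketch.c -lpthread` */
            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            /* One lock guards both paths, as in the scheme described above. */
            static pthread_mutex_t create_lock = PTHREAD_MUTEX_INITIALIZER;

            static void *orphan_destroy_thread(void *arg)
            {
                    (void)arg;
                    pthread_mutex_lock(&create_lock);
                    puts("orphan destroy: holding create_lock for a long cleanup");
                    sleep(5);       /* stands in for a long orphan destroy */
                    pthread_mutex_unlock(&create_lock);
                    return NULL;
            }

            static void *precreate_thread(void *arg)
            {
                    (void)arg;
                    puts("precreate: waiting on create_lock behind the destroy");
                    pthread_mutex_lock(&create_lock);
                    puts("precreate: finally ran");
                    pthread_mutex_unlock(&create_lock);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t destroyer, creator;

                    pthread_create(&destroyer, NULL, orphan_destroy_thread, NULL);
                    sleep(1);       /* let the destroy path take the lock first */
                    pthread_create(&creator, NULL, precreate_thread, NULL);

                    pthread_join(destroyer, NULL);
                    pthread_join(creator, NULL);
                    return 0;
            }

            Splitting that single lock, or shrinking its critical sections so precreates do not wait behind a full orphan cleanup, is the kind of change described in the comment above.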

            ofaaland Olaf Faaland added a comment -

            Thank you for the explanation.

            It seems the problem is in that long orphan destroy; something was blocking it for quite a time. That could just be the result of overall high load upon the OST restart.

            It sounds like destroys_in_progress should be a CDEBUG message, since it can occur under normal operating conditions.

             

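            For reference, a hedged sketch of the kind of change being suggested: downgrading the console error to a debug-level message using the standard libcfs logging macros (CDEBUG() with the D_HA mask instead of CERROR()). The fragment below is illustrative only and is not copied from ofd_create_hdl(); the field and helper names (os_destroys_in_progress, ofd_name()) are assumptions about the OFD code and should be checked against the tree.

            /* Illustrative fragment, not the actual ofd_create_hdl() context.
             * The condition can occur in normal operation (an MDT reconnect
             * arrives while the previous orphan destroy is still running),
             * so a D_HA debug message may be more appropriate than a console
             * error. */
            if (!oseq->os_destroys_in_progress) {
                    /* was: CERROR("%s: ... destroys_in_progress already cleared\n", ...) */
                    CDEBUG(D_HA, "%s: destroys_in_progress already cleared\n",
                           ofd_name(ofd));
            }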

            That message - "destroys_in_progress already cleared" - related just to MDT->OST reconnects and may happen when MDT reconnects due to some reasons but OST is still performing orphan destroys from previous connect. I see in logs in description this:

            00002000:00020000:9.0:1530842274.090444:0:102598:0:(ofd_dev.c:1627:ofd_create_hdl()) lquake-OST0002:[28991030285] destroys_in_progress already cleared
            00000100:00000400:8.0:1530842274.090453:0:102596:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:203s); client may timeout.  req@ffff8bb9b7e93850 x1605203337942624/t0(0) o5->lquake-MDT0003-mdtlov_UUID@172.19.1.114@o2ib100:296/0 lens 432/400 e 0 to 0 dl 1530842071 ref 1 fl Complete:/0/0 rc 0/0
            

            Looks like second thread [102598] was waiting for the first [102596] to complete OST_CREATE request with orphan destroys. And that first one took quite long time. It seems the problem is in that long orphan destroy, something was blocking it for quite a time. That can be just result of overall high load upon OST restart

            ofaaland Olaf Faaland added a comment -

            I am going to attempt to re-create this without the complication of the Lustre 2.11 arm system mounting the file system.

            ofaaland Olaf Faaland added a comment -

            I was mistaken about OST0002 being rebooted a second time.  The file system was restarted just once, and the symptoms described here were seen after the servers came up and completed recovery.

             

            ofaaland Olaf Faaland added a comment - - edited

            Mike, at the time several IORs were running and they all appeared not to resume progress after recovery completed.  ltop was showing no activity on the file system, and the IORs were not finishing at the expected pace.

            Later, the problem seemed to resolve itself - all but one of the IORs completed successfully, but several took 10-20 times longer than they should have.  I'm looking into that now to get more information.

            One compute node crashed, but due to an unrelated problem we have neither the stack nor the crash dump.  We're looking into that problem as well.

             

             

             


            tappro Mikhail Pershin added a comment -

            Olaf, are there any other bad effects besides the stack dumps and LustreError messages? Do the OSTs work normally now? You mentioned that OST0002 was rebooted; was that because of this issue or a planned restart?

            pjones Peter Jones added a comment -

            Mike

            Could you please investigate?

            Thanks

            Peter

            ofaaland Olaf Faaland added a comment - - edited

            At the same time this was occurring, an aarch64 system running a Lustre 2.11 client was trying, unsuccessfully, to connect to these servers (the file system is mounted on the 2.11 nodes, but the map_on_demand setting was changed on the servers and the connection attempts are all failing).

            /tmp/lustre-log.XXX files were being created, with contents like this:

            00000800:00000100:13.0:1530842270.070848:0:114359:0:(o2iblnd_cb.c:2672:kiblnd_rejected()) 172.19.1.23@o2ib100 rejected: no listener at 987
            00000800:00000100:13.0:1530842271.070330:0:114359:0:(o2iblnd_cb.c:2646:kiblnd_check_reconnect()) 172.19.1.23@o2ib100: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
            00000800:00000100:13.0:1530842271.070337:0:114359:0:(o2iblnd_cb.c:2672:kiblnd_rejected()) 172.19.1.23@o2ib100 rejected: no listener at 987
            00000800:00000100:13.0:1530842272.070954:0:114359:0:(o2iblnd_cb.c:2646:kiblnd_check_reconnect()) 172.19.1.23@o2ib100: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
            00000800:00000100:13.0:1530842272.070957:0:114359:0:(o2iblnd_cb.c:2672:kiblnd_rejected()) 172.19.1.23@o2ib100 rejected: no listener at 987
            00000800:00000100:13.0:1530842273.071332:0:114359:0:(o2iblnd_cb.c:2646:kiblnd_check_reconnect()) 172.19.1.23@o2ib100: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
            00000800:00000100:13.0:1530842273.071336:0:114359:0:(o2iblnd_cb.c:2672:kiblnd_rejected()) 172.19.1.23@o2ib100 rejected: no listener at 987
            00000800:00000100:5.0:1530842274.071750:0:70765:0:(o2iblnd_cb.c:490:kiblnd_rx_complete()) Rx from 172.19.1.23@o2ib100 failed: 5
            00000800:00000100:13.0:1530842274.071790:0:114359:0:(o2iblnd_cb.c:2646:kiblnd_check_reconnect()) 172.19.1.23@o2ib100: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
            00000800:00000100:13.0:1530842274.071793:0:114359:0:(o2iblnd_cb.c:2672:kiblnd_rejected()) 172.19.1.23@o2ib100 rejected: no listener at 987
            00002000:00020000:9.0:1530842274.090444:0:102598:0:(ofd_dev.c:1627:ofd_create_hdl()) lquake-OST0002:[28991030285] destroys_in_progress already cleared
            00000100:00000400:8.0:1530842274.090453:0:102596:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:203s); client may timeout.  req@ffff8bb9b7e93850 x1605203337942624/t0(0) o5->lquake-MDT0003-mdtlov_UUID@172.19.1.114@o2ib100:296/0 lens 432/400 e 0 to 0 dl 1530842071 ref 1 fl Complete:/0/0 rc 0/0
            00000800:00000100:13.0:1530842275.071508:0:114359:0:(o2iblnd_cb.c:2646:kiblnd_check_reconnect()) 172.19.1.23@o2ib100: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
            00000800:00000100:13.0:1530842275.071516:0:114359:0:(o2iblnd_cb.c:2672:kiblnd_rejected()) 172.19.1.23@o2ib100 rejected: no listener at 987
            00000800:00000400:8.0:1530842276.068989:0:70764:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.19.1.23@o2ib100: 25 seconds
            00000800:00000100:13.0:1530842276.070940:0:114359:0:(o2iblnd_cb.c:2646:kiblnd_check_reconnect()) 172.19.1.23@o2ib100: don't reconnect (no need), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
            

            OST0002 and OST0003 both reported "destroys_in_progress already cleared", and both reported dumping /tmp/lustre-log files, but the node hosting OST0002 was rebooted and those log files were lost. So the content above is from a different OST, but it was also reporting the same stacks and console messages.

            ofaaland Olaf Faaland added a comment -

            Note this is not the same system as the one we reported an issue with earlier today.


            People

              Assignee: tappro Mikhail Pershin
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: