Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      It looks like commit 0f1aaad4c1b4447ee5097b8bb79a49d09eaa23c2 ("LU-9480 lnet: implement Peer Discovery") broke lolnd, as suggested by git bisect.

      This manifests as, for example, sanity test 101b hanging with the following in the logs:

      [  215.914245] Lustre: DEBUG MARKER: == sanity test 101b: check stride-io mode read-ahead ================================================= 01:32:15 (1504675935)
      [  215.985320] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000401:0x5:0x0], use llapi_layout_get_by_path()
      [  256.717500] LNet: Service thread pid 4032 was inactive for 40.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [  256.720328] Pid: 4032, comm: ll_ost_io00_002
      [  256.721561] 
      Call Trace:
      [  256.723391]  [<ffffffff81704339>] schedule+0x29/0x70
      [  256.724533]  [<ffffffff81700972>] schedule_timeout+0x162/0x2a0
      [  256.725651]  [<ffffffff810879f0>] ? process_timeout+0x0/0x10
      [  256.726859]  [<ffffffffa0534e3e>] target_bulk_io+0x4ee/0xb20 [ptlrpc]
      [  256.729276]  [<ffffffff810b7ce0>] ? default_wake_function+0x0/0x20
      [  256.730431]  [<ffffffffa05ddf08>] tgt_brw_read+0xf38/0x1870 [ptlrpc]
      [  256.731359]  [<ffffffffa01ba4a4>] ? libcfs_log_return+0x24/0x30 [libcfs]
      [  256.732387]  [<ffffffffa0579f90>] ? lustre_pack_reply_v2+0x1a0/0x2a0 [ptlrpc]
      [  256.733578]  [<ffffffffa0532800>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
      [  256.734845]  [<ffffffffa057a102>] ? lustre_pack_reply_flags+0x72/0x1f0 [ptlrpc]
      [  256.736719]  [<ffffffffa057a291>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      [  256.737931]  [<ffffffffa05dad2b>] tgt_request_handle+0x93b/0x1390 [ptlrpc]
      [  256.738981]  [<ffffffffa05853b1>] ptlrpc_server_handle_request+0x251/0xae0 [ptlrpc]
      [  256.740764]  [<ffffffffa0589168>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
      [  256.741800]  [<ffffffff81706487>] ? _raw_spin_unlock_irq+0x27/0x50
      [  256.742938]  [<ffffffffa0588710>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
      [  256.743943]  [<ffffffff810a2eda>] kthread+0xea/0xf0
      [  256.744963]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
      [  256.745913]  [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
      [  256.746933]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
      
      [  256.748798] LustreError: dumping log to /tmp/lustre-log.1504675975.4032
      [  269.494952] LustreError: 2624:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff8800720b3e00
      

      This is easy to reproduce. On a single node, run: ONLY=101 REFORMAT=yes sh sanity.sh
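
After running the reproducer, the kernel log can be scanned for the two signatures shown in the excerpt above instead of watching the console. The following is a minimal sketch and not part of the Lustre test suite: the function name has_bulk_hang is hypothetical, the patterns are copied from the log in this report and may need adjusting for other kernels or Lustre versions, and the location of sanity.sh (lustre/tests) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): reads kernel log text on stdin
# and exits 0 if it contains either signature seen in this report:
#   - the LNet "Service thread pid N was inactive for ..." watchdog message
#   - a server_bulk_callback() event with a negative status
has_bulk_hang() {
    grep -E -q 'Service thread pid [0-9]+ was inactive for|server_bulk_callback\(\)\) event type [0-9]+, status -[0-9]+'
}

# Example use on a single node, assuming the current directory is lustre/tests:
#   ONLY=101 REFORMAT=yes sh sanity.sh
#   dmesg | has_bulk_hang && echo "bulk I/O hang signature found"
```

The helper only reports whether a signature is present; it does not identify which interface or LND was involved.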

          Activity

            [LU-9949] lolnd broken
            pfarrell Patrick Farrell (Inactive) made changes -
            Link Original: This issue is related to LU-9920 [ LU-9920 ]
            pjones Peter Jones made changes -
            Link New: This issue duplicates LU-9992 [ LU-9992 ]
            pjones Peter Jones made changes -
            Fix Version/s Original: Lustre 2.12.0 [ 13495 ]
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Sounds like a duplicate 

            jgmitter Joseph Gmitter (Inactive) made changes -
            Priority Original: Blocker [ 1 ] New: Major [ 3 ]
            jgmitter Joseph Gmitter (Inactive) made changes -
            Fix Version/s New: Lustre 2.12.0 [ 13495 ]
            jhammond John Hammond made changes -
            Link New: This issue is related to LU-9992 [ LU-9992 ]
            jhammond John Hammond made changes -
            Link New: This issue is related to LU-9920 [ LU-9920 ]
            ashehata Amir Shehata (Inactive) added a comment - - edited

            According to Alex: yes, it's safe, as the page is removed from the mapping (so other threads can't find it), but it can't be reused for anything else until the last page_put() is called.

            To summarize, the issue seems to be observed only on Oleg's VM, on a single-node setup running a debug kernel. It appears that calling generic_error_remove_page(), which unmaps the pages before they are sent by the socklnd, causes the socklnd send to succeed without actually sending the pages. It is still not known exactly why that happens.

            This issue occurred after 0f1aaad4c1b4447ee5097b8bb79a49d09eaa23c2, which discovers the loopback interface when it is initially used and then, because the lolnd has no credits, always prefers the other interfaces. This issue has been resolved in LU-9992.

            Should this still be a blocker?


            ashehata Amir Shehata (Inactive) added a comment -

            One outstanding question that should be answered: Is it safe for generic_error_remove_page() to be called on the pages before they are passed to the LND?

            People

              ashehata Amir Shehata (Inactive)
              green Oleg Drokin
              Votes: 0
              Watchers: 8
