
Recurrence of LU-3020: Lustre returns EINTR during writes when SA_RESTART is set

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0
    • 3
    • 9074

    Description

      This is the same symptom as described in LU-3020: EINTR is returned instead of ERESTARTSYS during writes. It is caught by the same reproducer as LU-3020, but the cause is different.

      Since I did not hit this issue while testing the fix for LU-3020, I suspect it was introduced by a subsequent patch. We are seeing it on the 2.4 release branch.

      This issue is easy to hit without debugging enabled, and very hard to hit with debugging enabled.
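
      For readers without the attachments, the kind of check the reproducer performs can be sketched roughly as follows. This is not the attached new_eintr_test.c; the function name count_eintr_writes, the 1 ms timer interval, and the 64 KiB buffer size are all assumptions chosen for illustration. The handler is installed with SA_RESTART, so any write() that fails with EINTR points at the file system returning EINTR rather than ERESTARTSYS.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt blocking system calls. */
static void on_alarm(int sig) { (void)sig; }

/*
 * Write `nwrites` 64 KiB buffers to `path` while SIGALRM fires every
 * millisecond. Returns the number of write() calls that failed with
 * EINTR, or -1 on a setup error. Because the handler is installed with
 * SA_RESTART, a well-behaved file system should yield 0.
 */
int count_eintr_writes(const char *path, int nwrites)
{
    static char buf[1 << 16];
    struct sigaction sa;
    struct itimerval timer;
    int fd, i, eintr = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sa.sa_flags = SA_RESTART;          /* ask the kernel to restart calls */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        return -1;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    memset(buf, 'x', sizeof(buf));

    /* Arm a repeating 1 ms real-time timer (interval is an assumption). */
    memset(&timer, 0, sizeof(timer));
    timer.it_interval.tv_usec = 1000;
    timer.it_value.tv_usec = 1000;
    setitimer(ITIMER_REAL, &timer, NULL);

    for (i = 0; i < nwrites; i++) {
        if (write(fd, buf, sizeof(buf)) < 0 && errno == EINTR)
            eintr++;
    }

    /* Disarm the timer before returning. */
    memset(&timer, 0, sizeof(timer));
    setitimer(ITIMER_REAL, &timer, NULL);
    close(fd);
    return eintr;
}
```

      Pointing count_eintr_writes() at a file on the affected client mount and getting a nonzero result would correspond to the failure reported here.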

      Here is the relevant portion of the trace logs:

      00000008:00000001:4.0:1372452012.457494:0:13003:0:(osc_cache.c:2206:osc_queue_async_io()) Process entered
      00000008:00000001:4.0:1372452012.457495:0:13003:0:(osc_cache.c:543:osc_extent_release()) Process entered
      00000008:00000001:4.0:1372452012.457496:0:13003:0:(osc_cache.c:240:osc_extent_sanity_check0()) Process leaving via out (rc=0 : 0 : 0x0)
      00000008:00000001:4.0:1372452012.457498:0:13003:0:(osc_cache.c:1616:osc_makes_rpc()) Process entered
      00000008:00000001:4.0:1372452012.457499:0:13003:0:(osc_cache.c:1662:osc_makes_rpc()) Process leaving (rc=0 : 0 : 0)
      00000008:00000001:4.0:1372452012.457500:0:13003:0:(osc_cache.c:1616:osc_makes_rpc()) Process entered
      00000008:00000001:4.0:1372452012.457501:0:13003:0:(osc_cache.c:1652:osc_makes_rpc()) Process leaving (rc=0 : 0 : 0)
      00000008:00000001:4.0:1372452012.457502:0:13003:0:(osc_cache.c:575:osc_extent_release()) Process leaving (rc=0 : 0 : 0)
      00000008:00000001:4.0:1372452012.457503:0:13003:0:(osc_cache.c:1506:osc_enter_cache()) Process entered
      00000100:00000001:1.0F:1372452012.457511:0:5940:0:(ptlrpcd.c:293:ptlrpcd_check()) Process entered
      00000008:00000001:4.0:1372452012.457512:0:13003:0:(osc_cache.c:1549:osc_enter_cache()) Process leaving via out (rc=18446744073709551612 : -4 : 0xfffffffffffffffc)
      00000100:00000001:0.0F:1372452012.457512:0:5941:0:(ptlrpcd.c:293:ptlrpcd_check()) Process entered
      00000100:00000001:1.0:1372452012.457513:0:5940:0:(client.c:1486:ptlrpc_check_set()) Process entered
      00000100:00000001:1.0:1372452012.457513:0:5940:0:(client.c:1561:ptlrpc_check_set()) Process leaving via interpret (rc=0 : 0 : 0x0)
      00000008:00000001:4.0:1372452012.457514:0:13003:0:(osc_cache.c:1564:osc_enter_cache()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
      00000100:00000001:0.0:1372452012.457514:0:5941:0:(ptlrpcd.c:395:ptlrpcd_check()) Process leaving (rc=0 : 0 : 0)
      00000008:00000001:4.0:1372452012.457515:0:13003:0:(osc_cache.c:2352:osc_queue_async_io()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
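
      The rc fields in these lines show the same value three ways: unsigned decimal, signed decimal, and hexadecimal. A small sketch of that formatting, assuming a 64-bit return code and Linux's EINTR value of 4 (the helper name format_rc is made up for this example and is not a Lustre function):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a return code the way the trace lines display it:
 * unsigned decimal, signed decimal, then hexadecimal.
 * Returns the character count, as snprintf() does.
 */
int format_rc(char *out, size_t len, long long rc)
{
    return snprintf(out, len, "rc=%llu : %lld : 0x%llx",
                    (unsigned long long)rc, rc, (unsigned long long)rc);
}
```

      Passing -EINTR to this helper produces the same 18446744073709551612 : -4 : 0xfffffffffffffffc triple seen in the osc_enter_cache and osc_queue_async_io exit lines.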

      This is hit during writes, specifically during ll_commit_write.

      This is happening due to a signal arriving during the following l_wait_event call, in osc_enter_cache:
      CDEBUG(D_CACHE, "%s: sleeping for cache space @ %p for %p\n",
             cli->cl_import->imp_obd->obd_name, &ocw, oap);

      rc = l_wait_event(ocw.ocw_waitq, ocw_granted(cli, &ocw), &lwi);

      client_obd_list_lock(&cli->cl_loi_list_lock);

      /* l_wait_event is interrupted by signal */
      if (rc < 0) {
              cfs_list_del_init(&ocw.ocw_entry);
              GOTO(out, rc);
      }

      I will attach full trace logs. Search for -4 in the log to find the EINTR.

      The question is: Is it safe to return ERESTARTSYS here, instead of EINTR?

      More generally, Lustre's default behavior in l_wait_event is to return EINTR. Should we consider changing this to ERESTARTSYS and making EINTR the exceptional case? (This may be a terrible idea - I'm just floating it out of curiosity.)
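
      To make the distinction behind both questions concrete, here is a simplified user-space model of how the Linux signal-delivery path treats the two codes. It is a sketch, not kernel or Lustre code: the names ERESTARTSYS_MODEL, after_signal, and remap_wait_rc are invented for this example, other restart codes such as ERESTARTNOINTR are ignored, and remap_wait_rc only illustrates the change being asked about rather than any committed fix. The value 512 is what the kernel's include/linux/errno.h uses for ERESTARTSYS; user-space headers do not export it.

```c
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal restart code; 512 mirrors include/linux/errno.h. */
#define ERESTARTSYS_MODEL 512

enum outcome {
    RETURN_EINTR,     /* user space sees the call fail with EINTR */
    RESTART_SYSCALL,  /* the kernel transparently re-issues the call */
    RETURN_AS_IS      /* rc is passed to user space unchanged */
};

/*
 * Simplified model: what happens to a system call's return code when a
 * signal handler ran while the call was sleeping.
 */
enum outcome after_signal(int rc, bool sa_restart)
{
    if (rc == -ERESTARTSYS_MODEL)
        return sa_restart ? RESTART_SYSCALL : RETURN_EINTR;
    /* -EINTR is an ordinary error: SA_RESTART has no effect on it. */
    return RETURN_AS_IS;
}

/*
 * Hypothetical remap at the wait site: turn an interrupted wait into a
 * restartable one and leave every other return code alone.
 */
int remap_wait_rc(int rc)
{
    return rc == -EINTR ? -ERESTARTSYS_MODEL : rc;
}
```

      In this model, after_signal(-EINTR, true) yields RETURN_AS_IS, which is the behavior reported above, while after_signal(remap_wait_rc(-EINTR), true) yields RESTART_SYSCALL. Whether such a remap is safe at this particular point in osc_enter_cache is exactly the open question.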

      Attachments

        1. eintrlog.tail
          1.10 MB
        2. new_eintr_test.c
          4 kB
        3. test.sh
          0.4 kB

        Issue Links

          Activity

            [LU-3581] Recurrence of LU-3020: Lustre returns EINTR during writes when SA_RESTART is set
            pjones Peter Jones made changes -
            Assignee Original: Niu Yawei [ niu ] New: Peter Jones [ pjones ]
            Labels Original: yuc2 New: mn4 yuc2
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones made changes -
            Labels New: yuc2
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.5.0 [ 10295 ]
            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Niu Yawei [ niu ]
            paf Patrick Farrell (Inactive) made changes -
            Attachment New: new_eintr_test.c [ 13169 ]
            Attachment New: test.sh [ 13170 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-3020 [ LU-3020 ]
            jlevi Jodi Levi (Inactive) made changes -
            Priority Original: Minor [ 4 ] New: Major [ 3 ]
            paf Patrick Farrell (Inactive) created issue -

            People

              Assignee: pjones Peter Jones
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: