  Lustre / LU-13328

sanityn test 35 is broken


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Labels: None
    • Severity: 3

    Description

      It looks like the race condition is rarely hit. Here's a typical run:

      [  751.745601] Lustre: DEBUG MARKER: == sanityn test 35: -EINTR cp_ast vs. bl_ast race does not evict client ============================== 04:48:12 (1583315292)
      [  752.624986] Lustre: DEBUG MARKER: Race attempt 0
      [  754.244032] Lustre: DEBUG MARKER: Wait for 16439 16445 for 60 sec...
      [  759.855738] Lustre: lustre-OST0001-osc-ffff8800a08ce800: disconnect after 21s idle
      [  816.516229] Lustre: DEBUG MARKER: == sanityn test 36: handle ESTALE/open-unlink correctly 
      

      BUT sometimes it does hit, and in that case the thread hangs and remains hung forever, making cleanup time out:

      [  699.235637] Lustre: DEBUG MARKER: == sanityn test 35: -EINTR cp_ast vs. bl_ast race does not evict client ============================== 17:30:03 (1583274603)
      [  700.065742] Lustre: DEBUG MARKER: Race attempt 0
      [  701.154904] Lustre: *** cfs_fail_loc=317, val=2147484440***
      [  701.155724] LustreError: 17592:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 318 sleeping
      [  701.579880] Lustre: DEBUG MARKER: Wait for 17549 17565 for 60 sec...
      [  763.731071] Lustre: DEBUG MARKER: == sanityn test 36: handle ESTALE/open-unlink correctly 
      

      See how the race hit ("cfs_race id 318 sleeping"), but the sleeping thread never actually woke: there is no matching "cfs_fail_race id 318 awake" message before test 36 starts.

      We can confirm this by analyzing the core file:

      PID: 17592  TASK: ffff88009dfc8440  CPU: 2   COMMAND: "ll_sa_17565"
       #0 [ffff8800c25276e8] __schedule+0x3ee at ffffffff817d889e
       #1 [ffff8800c2527778] schedule+0x29 at ffffffff817d8ea9
       #2 [ffff8800c2527788] ldlm_lock_decref_internal+0x96e at ffffffffa05b09ce [ptlrpc]
       #3 [ffff8800c25277e8] failed_lock_cleanup+0x98 at ffffffffa05cb928 [ptlrpc]
       #4 [ffff8800c2527808] ldlm_cli_enqueue_fini+0x16c at ffffffffa05cddac [ptlrpc]
       #5 [ffff8800c25278b0] ldlm_cli_enqueue+0x441 at ffffffffa05d1ad1 [ptlrpc]
      
      crash> l *0xffffffffa05b09ce
      0xffffffffa05b09ce is in ldlm_lock_decref_internal (/home/green/git/lustre-release/libcfs/include/libcfs/libcfs_fail.h:170).
      165		if (CFS_FAIL_PRECHECK(id)) {
      166			if (unlikely(__cfs_fail_check_set(id, 0, CFS_FAIL_LOC_NOSET))) {
      167				int rc;
      168				cfs_race_state = 0;
      169				CERROR("cfs_race id %x sleeping\n", id);
      170				rc = wait_event_interruptible(cfs_race_waitq,
      171							      cfs_race_state != 0);
      172				CERROR("cfs_fail_race id %x awake: rc=%d\n", id, rc);
      173			} else {
      174				CERROR("cfs_fail_race id %x waking\n", id);
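
      For reference, the branch the listing truncates at line 174 is the waking side. Reconstructed from the same header (quoted from memory, so treat it as a sketch rather than an exact copy), the second thread to hit the same fail_loc id is supposed to do:

      	} else {
      		/* second racer: release the thread sleeping above */
      		CERROR("cfs_fail_race id %x waking\n", id);
      		cfs_race_state = 1;
      		wake_up(&cfs_race_waitq);
      	}

      In the hung run above only the "sleeping" message appears in the log, which is consistent with the waking path never being reached (or its wakeup being lost).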
      

      So in addition to fixing this, I wonder if it makes sense to clear the active race with a very visible dmesg message?
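
      A minimal sketch of that idea, assuming a hypothetical helper (cfs_race_cancel() is an invented name, not an existing libcfs interface) called from the fail_loc reset / cleanup path:

      	/* Hypothetical: if a thread is still parked in cfs_race(),
      	 * release it and complain loudly so the stale race cannot be
      	 * missed in dmesg. */
      	static inline void cfs_race_cancel(void)
      	{
      		if (waitqueue_active(&cfs_race_waitq)) {
      			CERROR("cfs_fail_race: clearing stale race, waking sleeper\n");
      			cfs_race_state = 1;
      			wake_up(&cfs_race_waitq);
      		}
      	}

      (waitqueue_active() without a lock is only a heuristic, but an unconditional wake_up() here would be harmless in the normal case, so the check could even be dropped.)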

      Here's an example of the failure case: http://testing.linuxhacker.ru:3333/lustre-reports/7130/testresults/sanityn-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Oleg Drokin (green)