Lustre / LU-14411

parallel-scale test ior_mdtest_parallel_fpp fails with 'test_ior_mdtest_parallel_fpp failed with 1'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Component/s: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      parallel-scale test_ior_mdtest_parallel_fpp fails with 'test_ior_mdtest_parallel_fpp failed with 1' for the first time during SLES 15.2 client testing at https://testing.whamcloud.com/test_sets/06599f29-3d00-47cb-9fc1-16149bb2b7ce.

      Looking at the test_suite log, we see a read error when trying to read the file(s) that were just written out:

      Results: 
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ---------- ---------  --------   --------   --------   --------   ----
      Commencing write performance test: Wed Feb 10 07:13:17 2021
      write     132.29     6144       1024.00    0.004714   0.176921   0.002869   0.181421   0   
      Verifying contents of the file(s) just written.
      Wed Feb 10 07:13:17 2021
      
      Commencing read performance test: Wed Feb 10 07:13:17 2021
      ior ERROR: read() returned EOF prematurely, errno 2, No such file or directory (aiori-POSIX.c:463)
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
      with errorcode -1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      ior ERROR: read() returned EOF prematurely, errno 2, No such file or directory (aiori-POSIX.c:463)
      [trevis-68vm2:28459] 1 more process has sent help message help-mpi-api.txt / mpi-abort
      [trevis-68vm2:28459] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
       parallel-scale test_ior_mdtest_parallel_fpp: @@@@@@ FAIL: ior failed! 1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/functions.sh:693:run_ior()
        = /usr/lib64/lustre/tests/functions.sh:1211:ior_mdtest_parallel()
        = /usr/lib64/lustre/tests/parallel-scale.sh:121:test_ior_mdtest_parallel_fpp()
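
      The "read() returned EOF prematurely" message means the POSIX backend got a zero-byte read() back before the expected transfer size was reached; the reported errno 2 (ENOENT) is presumably just whatever was already sitting in errno at that point, since a zero-byte read() does not set it. Below is a minimal sketch of that short-read detection pattern, for illustration only: it is not the actual aiori-POSIX.c code, and read_full() plus the use of stdin are made up for the example.

      /* Illustrative sketch of premature-EOF detection on read();
       * not the actual IOR aiori-POSIX.c implementation. */
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      /* Read exactly 'remaining' bytes from fd; report a premature EOF if
       * read() returns 0 before the full transfer completes. */
      static int read_full(int fd, char *buf, size_t remaining)
      {
              while (remaining > 0) {
                      ssize_t rc = read(fd, buf, remaining);

                      if (rc < 0) {
                              fprintf(stderr, "read() failed: errno %d, %s\n",
                                      errno, strerror(errno));
                              return -1;
                      }
                      if (rc == 0) {
                              /* EOF before the expected byte count: this is
                               * the "returned EOF prematurely" case. */
                              fprintf(stderr,
                                      "read() returned EOF prematurely, errno %d, %s\n",
                                      errno, strerror(errno));
                              return -1;
                      }
                      buf += rc;
                      remaining -= rc;
              }
              return 0;
      }

      int main(void)
      {
              char buf[4096];

              /* STDIN_FILENO stands in for the IOR data file descriptor. */
              return read_full(STDIN_FILENO, buf, sizeof(buf)) ? 1 : 0;
      }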
      

      Looking at the OSS () console log, we see connection and bulk I/O errors:

      [61346.392146] Lustre: DEBUG MARKER: == parallel-scale test ior_mdtest_parallel_fpp: iormdtestfpp ========================================= 07:13:16 (1612941196)
      [61347.261298] Lustre: 1639833:0:(osd_handler.c:1938:osd_trans_start()) lustre-OST0006: credits 4203 > trans_max 2592
      [61347.263180] Lustre: 1639833:0:(osd_handler.c:1938:osd_trans_start()) Skipped 1 previous similar message
      [61347.265047] Lustre: 1639833:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
      [61347.267141] Lustre: 1639833:0:(osd_handler.c:1867:osd_trans_dump_creds()) Skipped 1 previous similar message
      [61347.269921] Lustre: 1639833:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/15/0
      [61347.272025] Lustre: 1639833:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 2/4038/0, punch: 0/0/0, quota 5/149/0
      [61347.274323] Lustre: 1639833:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
      [61347.276357] Lustre: 1639833:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0
      [61347.282887] Pid: 1639833, comm: ll_ost_io00_007 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Mon Feb 1 19:16:26 UTC 2021
      [61347.284605] Call Trace TBD:
      [61347.285740] Lustre: 1639833:0:(osd_internal.h:1293:osd_trans_exec_op()) lustre-OST0006: opcode 7: before 2593 < left 4038, rollback = 7
      [61347.288836] Lustre: 1639833:0:(osd_internal.h:1293:osd_trans_exec_op()) Skipped 7 previous similar messages
      [61368.735940] LNetError: 14442:0:(socklnd.c:1696:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.6.235@tcp[2], ip 10.9.6.235:7988, with error, wanted: 565040, left: 565040, last alive is 0 secs ago
      [61368.739570] LNetError: 14442:0:(socklnd.c:1696:ksocknal_destroy_conn()) Skipped 1 previous similar message
      [61368.741407] LustreError: 14442:0:(events.c:458:server_bulk_callback()) event type 3, status -5, desc 00000000a2bffac8
      [61370.440228] Lustre: lustre-OST0000: Client faa7a436-6853-4f01-958d-932d80b4057f (at 10.9.6.235@tcp) reconnecting
      [61370.442482] Lustre: Skipped 8 previous similar messages
      [61370.768191] LustreError: 1582063:0:(ldlm_lib.c:3512:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@0000000053c6d5fb x1691283882208960/t0(0) o4->faa7a436-6853-4f01-958d-932d80b4057f@10.9.6.235@tcp:2/0 lens 488/448 e 1 to 0 dl 1612941232 ref 1 fl Interpret:H/0/0 rc 0/0 job:'IOR.532'
      [61370.780608] Lustre: lustre-OST0005: Bulk IO write error with faa7a436-6853-4f01-958d-932d80b4057f (at 10.9.6.235@tcp), client will retry: rc = -110
      [61370.783239] Lustre: Skipped 30 previous similar messages
      [61371.349902] LustreError: 1639833:0:(ldlm_lib.c:3512:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@0000000092326787 x1691283882080704/t0(0) o4->faa7a436-6853-4f01-958d-932d80b4057f@10.9.6.235@tcp:1/0 lens 488/448 e 1 to 0 dl 1612941231 ref 1 fl Interpret:H/0/0 rc 0/0 job:'IOR.532'
      [61413.453319] Lustre: ll_ost_io00_009: service thread pid 1639835 was inactive for 63.273 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [61413.457401] Pid: 1639835, comm: ll_ost_io00_009 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Mon Feb 1 19:16:26 UTC 2021
      [61413.459235] Call Trace TBD:
      [61413.459814] Pid: 1639880, comm: ll_ost_io00_049 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Mon Feb 1 19:16:26 UTC 2021
      [61413.461783] Call Trace TBD:
      [61422.467006] Lustre: lustre-OST0005: haven't heard from client faa7a436-6853-4f01-958d-932d80b4057f (at 10.9.6.235@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp 00000000b927baa9, cur 1612941272 expire 1612941242 last 1612941222
      [61422.471192] Lustre: Skipped 6 previous similar messages
      [61423.373876] LustreError: 1639833:0:(ldlm_lib.c:3506:target_bulk_io()) @@@ Eviction on bulk WRITE  req@000000004c071e2b x1691283882080704/t0(0) o4->faa7a436-6853-4f01-958d-932d80b4057f@10.9.6.235@tcp:72/0 lens 488/448 e 2 to 0 dl 1612941302 ref 1 fl Interpret:H/2/0 rc 0/0 job:'IOR.532'
      [61423.378586] LustreError: 1639833:0:(ldlm_lib.c:3506:target_bulk_io()) Skipped 2 previous similar messages
      [61426.985333] LNetError: 14442:0:(socklnd.c:1696:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.6.236@tcp[2], ip 10.9.6.236:7988, with error, wanted: 821640, left: 821640, last alive is 0 secs ago
      [61426.989100] LustreError: 14442:0:(events.c:458:server_bulk_callback()) event type 3, status -5, desc 00000000952366e4
      [61426.991359] LustreError: 1639835:0:(ldlm_lib.c:3518:target_bulk_io()) @@@ network error on bulk WRITE  req@00000000239cec28 x1691227679378688/t0(0) o4->1b5881b7-b342-4042-a9de-aa44341d5ce2@10.9.6.236@tcp:77/0 lens 488/448 e 4 to 0 dl 1612941307 ref 1 fl Interpret:H/0/0 rc 0/0 job:'IOR.532'
      [61426.997102] Lustre: lustre-OST0001: Bulk IO write error with 1b5881b7-b342-4042-a9de-aa44341d5ce2 (at 10.9.6.236@tcp), client will retry: rc = -110
      [61426.999466] Lustre: Skipped 1 previous similar message
      [61427.000506] LustreError: 14444:0:(events.c:458:server_bulk_callback()) event type 5, status -5, desc 00000000cab01c41
      [61429.285539] Lustre: lustre-OST0001: Client 1b5881b7-b342-4042-a9de-aa44341d5ce2 (at 10.9.6.236@tcp) reconnecting
      [61429.287751] Lustre: Skipped 1 previous similar message
      [61430.028552] LustreError: 1639880:0:(ldlm_lib.c:3512:target_bulk_io()) @@@ Reconnect on bulk READ  req@000000006d9cf483 x1691227687692608/t0(0) o3->1b5881b7-b342-4042-a9de-aa44341d5ce2@10.9.6.236@tcp:125/0 lens 488/440 e 1 to 0 dl 1612941355 ref 1 fl Interpret:/0/0 rc 0/0 job:'IOR.532'
      [61430.033318] Lustre: lustre-OST0005: Bulk IO read error with 1b5881b7-b342-4042-a9de-aa44341d5ce2 (at 10.9.6.236@tcp), client will retry: rc -110
      [61430.035644] Lustre: Skipped 5 previous similar messages
      [61434.134986] Lustre: ll_ost_io00_001: service thread pid 1582063 was inactive for 62.502 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [61434.138921] Lustre: Skipped 1 previous similar message
      [61434.140121] Pid: 1582063, comm: ll_ost_io00_001 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Mon Feb 1 19:16:26 UTC 2021
      [61434.141996] Call Trace TBD:
      [61434.142526] Lustre: ll_ost_io00_007: service thread pid 1639833 was inactive for 62.789 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [61456.602121] LNetError: 14442:0:(socklnd.c:1696:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.6.235@tcp[2], ip 10.9.6.235:7988, with error, wanted: 145440, left: 145440, last alive is 0 secs ago
      [61456.605796] LustreError: 14442:0:(events.c:458:server_bulk_callback()) event type 3, status -5, desc 000000008fda42c9
      [61456.607789] LustreError: 14442:0:(events.c:458:server_bulk_callback()) event type 3, status -5, desc 00000000ba124b7f
      [61456.611283] Lustre: lustre-OST0005: Bulk IO write error with faa7a436-6853-4f01-958d-932d80b4057f (at 10.9.6.235@tcp), client will retry: rc = -107
      [61456.613810] Lustre: 1582063:0:(service.c:2323:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (56/28s); client may timeout  req@000000002cff53ea x1691283882208960/t0(0) o4->faa7a436-6853-4f01-958d-932d80b4057f@10.9.6.235@tcp:48/0 lens 488/448 e 1 to 0 dl 1612941278 ref 1 fl Complete:nH/2/ffffffff rc -107/-1 job:'IOR.532'
      [61458.621806] Lustre: lustre-OST0001: Client 1b5881b7-b342-4042-a9de-aa44341d5ce2 (at 10.9.6.236@tcp) reconnecting
      [61458.622430] Lustre: lustre-OST0001: Export 0000000081139db5 already connecting from 10.9.6.236@tcp
      [61458.624124] Lustre: Skipped 5 previous similar messages
      [61458.625736] LustreError: 1639857:0:(ldlm_lib.c:3512:target_bulk_io()) @@@ Reconnect on bulk READ  req@0000000071389a9c x1691227687692608/t0(0) o3->1b5881b7-b342-4042-a9de-aa44341d5ce2@10.9.6.236@tcp:131/0 lens 488/440 e 0 to 0 dl 1612941361 ref 1 fl Interpret:/2/0 rc 0/0 job:'IOR.532'
      [61458.631523] Lustre: lustre-OST0005: Bulk IO read error with 1b5881b7-b342-4042-a9de-aa44341d5ce2 (at 10.9.6.236@tcp), client will retry: rc -110
      [61462.527529] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale test_ior_mdtest_parallel_fpp: @@@@@@ FAIL: ior failed! 1 
      [61463.429467] Lustre: DEBUG MARKER: parallel-scale test_ior_mdtest_parallel_fpp: @@@@@@ FAIL: ior failed! 1
      [61463.836102] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/at-candidate/2021-02-08/lustre-master_full_4154_1_7_cbab7d75-1786-40dc-b0b8-d58abd3f2f6e/parallel-scale.test_ior_mdtest_parallel_fpp.debug_log.$(hostname -s).1612941313.log;
      [61463.836102] 		dmesg > /autotest/at-candidate/2021-02-08/lustre-
      [61466.771031] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [61469.594652] Lustre: 1633839:0:(osd_handler.c:1938:osd_trans_start()) lustre-OST0001: credits 3207 > trans_max 2592
      [61469.596528] Lustre: 1633839:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 128/1024/0, destroy: 0/0/0
      [61469.598262] Lustre: 1633839:0:(osd_handler.c:1867:osd_trans_dump_creds()) Skipped 2 previous similar messages
      [61469.599956] Lustre: 1633839:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 0/0/0, xattr_set: 0/0/0
      [61469.600493] Lustre: 1633837:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 1/1/0, punch: 0/0/0, quota 6/6/0
      [61469.601567] Lustre: 1633839:0:(osd_handler.c:1874:osd_trans_dump_creds()) Skipped 3 previous similar messages
      [61469.603401] Lustre: 1633837:0:(osd_handler.c:1884:osd_trans_dump_creds()) Skipped 2 previous similar messages
      [61469.604994] Lustre: 1633839:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 128/2176/0, delete: 0/0/0
      [61469.606701] Lustre: 1633837:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0
      [61469.608287] Lustre: 1633839:0:(osd_handler.c:1891:osd_trans_dump_creds()) Skipped 3 previous similar messages
      [61469.609910] Lustre: 1633837:0:(osd_handler.c:1898:osd_trans_dump_creds()) Skipped 2 previous similar messages
      [61469.611805] Pid: 1633839, comm: ll_ost00_047 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Mon Feb 1 19:16:26 UTC 2021
      [61469.614807] Call Trace TBD:
      [61469.615294] Pid: 1633837, comm: ll_ost00_045 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Mon Feb 1 19:16:26 UTC 2021
      [61469.616974] Call Trace TBD:
      [61679.126009] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale test_ior_mdtest_parallel_fpp: @@@@@@ FAIL: test_ior_mdtest_parallel_fpp failed with 1 
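
      For reference when decoding the console log above, the negative status/rc values are standard Linux errno codes: -5 is EIO (the LNet bulk callback failures), -110 is ETIMEDOUT (the bulk writes/reads the client will retry), and -107 is ENOTCONN, while the client-side errno 2 is ENOENT. A standalone sketch (not part of the Lustre test framework) that prints the mapping:

      /* Print the errno names seen in the logs above; standalone sketch. */
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              const int codes[] = { EIO, ETIMEDOUT, ENOTCONN, ENOENT };
              size_t i;

              for (i = 0; i < sizeof(codes) / sizeof(codes[0]); i++)
                      printf("%3d: %s\n", codes[i], strerror(codes[i]));
              return 0;
      }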
      

            People

              Assignee: WC Triage
              Reporter: James Nunez (Inactive)
              Votes: 0
              Watchers: 2