[LU-14495] (llog_osd.c:622:llog_osd_write_rec()) LBUG

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.0
    • Labels: None
    • Environment: lustre-master-ib tag-2.14.50_115_gf55fdff
    • Severity: 3

    Description

      Soak was moved to lustre-master-ib build 537 on 3/2 and the system was reformatted. It ran for about 5 hours before one MDS node hit the following LBUG:

      [  373.349621] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      [  374.778336] Lustre: soaked-MDT0000: Will be in recovery for at least 5:00, or until 26 clients reconnect
      [  381.623999] Lustre: soaked-MDT0000: Recovery over after 0:07, of 26 clients 26 recovered and 0 were evicted.
      [  488.543482] LustreError: 4036:0:(llog_cat.c:603:llog_cat_add_rec()) llog_write_rec -116: lh=ffff96e6b6ba2200
      [  488.554466] LustreError: 4036:0:(update_trans.c:1051:top_trans_stop()) soaked-MDT0003-osp-MDT0000: write updates failed: rc = -116
      [  488.826212] LustreError: 3934:0:(llog_cat.c:752:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -116
      [  488.840617] LustreError: 3934:0:(llog_cat.c:788:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -116
      [  489.331008] LustreError: 4026:0:(llog_cat.c:603:llog_cat_add_rec()) llog_write_rec -116: lh=ffff96e6b6ba2200
      
      [  751.036178] LustreError: 3851:0:(llog_osd.c:622:llog_osd_write_rec()) LBUG
      [  751.043861] Pid: 3851, comm: mdt00_002 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Mon Feb 22 19:55:21 UTC 2021
      [  751.054838] Call Trace:
      [  751.057614] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [  751.063361] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [  751.068762] [<0>] llog_osd_write_rec+0x15f1/0x1630 [obdclass]
      [  751.075201] [<0>] llog_write_rec+0x273/0x550 [obdclass]
      [  751.081067] [<0>] llog_cat_add_rec+0x220/0x880 [obdclass]
      [  751.087111] [<0>] llog_add+0x162/0x1d0 [obdclass]
      [  751.092446] [<0>] sub_updates_write+0x2e8/0xdd9 [ptlrpc]
      [  751.098437] [<0>] top_trans_stop+0x84f/0xbf0 [ptlrpc]
      [  751.104089] [<0>] lod_trans_stop+0x25c/0x340 [lod]
      [  751.109473] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
      [  751.114735] [<0>] mdd_create+0x1441/0x1820 [mdd]
      [  751.119913] [<0>] mdo_create+0x46/0x48 [mdt]
      [  751.124695] [<0>] mdt_create+0x9c5/0xd00 [mdt]
      [  751.129670] [<0>] mdt_reint_create+0x310/0x3c0 [mdt]
      [  751.135235] [<0>] mdt_reint_rec+0x83/0x210 [mdt]
      [  751.140408] [<0>] mdt_reint_internal+0x720/0xaf0 [mdt]
      [  751.146164] [<0>] mdt_reint+0x67/0x140 [mdt]
      [  751.150984] [<0>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      [  751.157146] [<0>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [  751.164194] [<0>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      [  751.169653] [<0>] kthread+0xd1/0xe0
      [  751.173564] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [  751.179324] [<0>] 0xfffffffffffffffe
      [  751.183358] LustreError: dumping log to /tmp/lustre-log.1614665912.3851
      [  855.168034] LustreError: 4276:0:(llog_cat.c:603:llog_cat_add_rec()) llog_write_rec -116: lh=ffff96e6b6ba0600
      [  855.179033] LustreError: 4276:0:(update_trans.c:1051:top_trans_stop()) soaked-MDT0002-osp-MDT0000: write updates failed: rc = -116
      [  914.209786] LustreError: 4347:0:(llog_cat.c:603:llog_cat_add_rec()) llog_write_rec -116: lh=ffff96e6b6ba0600
      [  914.220789] LustreError: 4347:0:(update_trans.c:1051:top_trans_stop()) soaked-MDT0002-osp-MDT0000: write updates failed: rc = -116
      [  951.201186] Lustre: mdt00_004: service thread pid 4024 was inactive for 200.368 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [  951.201189] Pid: 4026, comm: mdt00_005 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Mon Feb 22 19:55:21 UTC 2021
      [  951.201193] Call Trace:
      [  951.201195] Lustre: mdt00_002: service thread pid 3851 was inactive for 200.368 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [  951.201206] Lustre: Skipped 1 previous similar message
      [  951.201291] [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
      [  951.201336] [<0>] ldlm_cli_enqueue_fini+0x9e0/0xe50 [ptlrpc]
      [  951.201380] [<0>] ldlm_cli_enqueue+0x436/0x9b0 [ptlrpc]
      [  951.201401] [<0>] osp_md_object_lock+0x165/0x2d0 [osp]
      [  951.201425] [<0>] lod_object_lock+0xde/0x760 [lod]
      [  951.201444] [<0>] mdd_object_lock+0x30/0xd0 [mdd]
      [  951.201467] [<0>] mdt_remote_object_lock_try+0x1cf/0x500 [mdt]
      [  951.201491] [<0>] mdt_object_lock_internal+0x18c/0x360 [mdt]
      [  951.201507] [<0>] mdt_object_lock+0x20/0x30 [mdt]
      [  951.201527] [<0>] mdt_create+0x6a6/0xd00 [mdt]
      [  951.201545] [<0>] mdt_reint_create+0x310/0x3c0 [mdt]
      [  951.201564] [<0>] mdt_reint_rec+0x83/0x210 [mdt]
      [  951.201585] [<0>] mdt_reint_internal+0x720/0xaf0 [mdt]
      [  951.201602] [<0>] mdt_reint+0x67/0x140 [mdt]
      [  951.201679] [<0>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      [  951.201734] [<0>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [  951.201788] [<0>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      [  951.201793] [<0>] kthread+0xd1/0xe0
      [  951.201798] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [  951.201825] [<0>] 0xfffffffffffffffe
      [  951.201830] Pid: 4107, comm: mdt01_009 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Mon Feb 22 19:55:21 UTC 2021
      [  951.201831] Call Trace:
      [  951.513260] [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
      [  951.519429] [<0>] ldlm_cli_enqueue_fini+0x9e0/0xe50 [ptlrpc]
      [  951.525777] [<0>] ldlm_cli_enqueue+0x436/0x9b0 [ptlrpc]
      [  951.531620] [<0>] osp_md_object_lock+0x165/0x2d0 [osp]
      [  951.537372] [<0>] lod_object_lock+0xde/0x760 [lod]
      [  951.542732] [<0>] mdd_object_lock+0x30/0xd0 [mdd]
      [  951.548003] [<0>] mdt_remote_object_lock_try+0x1cf/0x500 [mdt]
      [  951.554532] [<0>] mdt_object_lock_internal+0x18c/0x360 [mdt]
      [  951.560862] [<0>] mdt_object_lock+0x20/0x30 [mdt]
      [  951.566126] [<0>] mdt_create+0x6a6/0xd00 [mdt]
      [  951.571099] [<0>] mdt_reint_create+0x310/0x3c0 [mdt]
      [  951.576654] [<0>] mdt_reint_rec+0x83/0x210 [mdt]
      [  951.581821] [<0>] mdt_reint_internal+0x720/0xaf0 [mdt]
      [  951.587569] [<0>] mdt_reint+0x67/0x140 [mdt]
      [  951.592378] [<0>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      [  951.598543] [<0>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [  951.605583] [<0>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      [  951.611039] [<0>] kthread+0xd1/0xe0
      [  951.614936] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [  951.620681] [<0>] 0xfffffffffffffffe
      [  961.223773] INFO: task mdt00_002:3851 blocked for more than 120 seconds.
      [  961.231298] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  961.240053] mdt00_002       D ffff96e8fc45e300     0  3851      2 0x00000080
      [  961.247966] Call Trace:
      [  961.250732]  [<ffffffffb86daf40>] ? wake_up_state+0x20/0x20
      [  961.256987]  [<ffffffffb8d87229>] schedule+0x29/0x70
      [  961.262572]  [<ffffffffc0bed9fd>] lbug_with_loc+0x7d/0xa0 [libcfs]
      [  961.269543]  [<ffffffffc0d15691>] llog_osd_write_rec+0x15f1/0x1630 [obdclass]
      [  961.277540]  [<ffffffffc0d03463>] llog_write_rec+0x273/0x550 [obdclass]
      [  961.284986]  [<ffffffffc0d08b70>] llog_cat_add_rec+0x220/0x880 [obdclass]
      [  961.292618]  [<ffffffffc0d005c2>] llog_add+0x162/0x1d0 [obdclass]
      [  961.299470]  [<ffffffffc0d2ef19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [  961.307547]  [<ffffffffc10e15f5>] sub_updates_write+0x2e8/0xdd9 [ptlrpc]
      [  961.315098]  [<ffffffffc17265f5>] ? osp_md_attr_set+0xe5/0x1c0 [osp]
      [  961.322283]  [<ffffffffc10e12fc>] ? prepare_writing_updates.isra.11+0x139/0x14a [ptlrpc]
      [  961.331408]  [<ffffffffc10b3f0f>] top_trans_stop+0x84f/0xbf0 [ptlrpc]
      [  961.338635]  [<ffffffffc1633f93>] ? lod_attr_set+0x243/0x9d0 [lod]
      [  961.345563]  [<ffffffffc164e05e>] ? lod_sub_ref_add+0x1be/0x450 [lod]
      [  961.352776]  [<ffffffffc160ea2c>] lod_trans_stop+0x25c/0x340 [lod]
      [  961.359735]  [<ffffffffc16cd7be>] mdd_trans_stop+0x2e/0x174 [mdd]
      [  961.366548]  [<ffffffffc16afb91>] mdd_create+0x1441/0x1820 [mdd]
      [  961.373306]  [<ffffffffc151a893>] ? mdt_version_save+0x63/0xf0 [mdt]
      [  961.380431]  [<ffffffffc1565810>] mdo_create+0x46/0x48 [mdt]
      [  961.386799]  [<ffffffffc1524665>] mdt_create+0x9c5/0xd00 [mdt]
      [  961.393383]  [<ffffffffc1067ab1>] ? req_capsule_get_size+0x31/0x70 [ptlrpc]
      [  961.401186]  [<ffffffffc1524cb0>] mdt_reint_create+0x310/0x3c0 [mdt]
      [  961.408288]  [<ffffffffc1528de3>] mdt_reint_rec+0x83/0x210 [mdt]
      [  961.415022]  [<ffffffffc1500bb0>] mdt_reint_internal+0x720/0xaf0 [mdt]
      [  961.422356]  [<ffffffffc150c747>] mdt_reint+0x67/0x140 [mdt]
      [  961.428741]  [<ffffffffc10a2eda>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      [  961.436454]  [<ffffffffc1084281>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [  961.444944]  [<ffffffffc0bea02e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [  961.452877]  [<ffffffffc104d8d6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [  961.461433]  [<ffffffffb86c6d10>] ? wake_up_atomic_t+0x30/0x30
      [  961.467978]  [<ffffffffc105242c>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      [  961.475051]  [<ffffffffc10518f0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
      [  961.483347]  [<ffffffffb86c5c21>] kthread+0xd1/0xe0
      [  961.488803]  [<ffffffffb86c5b50>] ? insert_kthread_work+0x40/0x40
      [  961.495612]  [<ffffffffb8d94df7>] ret_from_fork_nospec_begin+0x21/0x21
      [  961.502904]  [<ffffffffb86c5b50>] ? insert_kthread_work+0x40/0x40
      ......
      [ 1652.388142] LustreError: 4349:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1614666513, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff96e8e2faaf40/0x9d008a4e71cca5ef lrc: 3/1,0 mode: --/PR res: [0x200000408:0x4:0x0].0x0 bits 0x13/0x0 rrc: 23 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 4349 timeout: 0 lvb_type: 0
      [ 1653.117989] Lustre: 4053:0:(service.c:1434:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff96ed4d3b8d80 x1693073188179200/t8590852393(0) o36->dad5519f-3c27-4aba-baad-7ec78af8ef07@192.168.1.140@o2ib:414/0 lens 552/448 e 1 to 0 dl 1614666819 ref 2 fl Interpret:/0/0 rc 0/0 job:''
      [ 1653.150120] Lustre: 4053:0:(service.c:1434:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      [ 1659.403233] Lustre: soaked-MDT0000: Client dad5519f-3c27-4aba-baad-7ec78af8ef07 (at 192.168.1.140@o2ib) reconnecting
      [ 1662.749484] LustreError: 4161:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1614666524, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff96e8e7908480/0x9d008a4e71cd5398 lrc: 3/1,0 mode: --/PR res: [0x200000408:0x4:0x0].0x0 bits 0x13/0x0 rrc: 23 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 4161 timeout: 0 lvb_type: 0
      [ 1715.578839] ptlrpc_watchdog_fire: 4 callbacks suppressed
      [ 1715.578854] Lustre: mdt01_011: service thread pid 4276 was inactive for 657.231 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 1715.578857] Lustre: Skipped 2 previous similar messages
      [ 1715.606956] Lustre: mdt01_004: service thread pid 4022 was inactive for 657.012 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 1715.627844] Pid: 4022, comm: mdt01_004 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Mon Feb 22 19:55:21 UTC 2021
      [ 1715.638826] Call Trace:
      [ 1715.641695] [<0>] ldlm_completion_ast+0x635/0x9d0 [ptlrpc]
      [ 1715.647872] [<0>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
      [ 1715.654376] [<0>] mdt_object_local_lock+0x51f/0xb50 [mdt]
      [ 1715.660423] [<0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [ 1715.666687] [<0>] mdt_getattr_name_lock+0xa7a/0x1ef0 [mdt]
      [ 1715.672831] [<0>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [ 1715.678594] [<0>] mdt_intent_opc+0x1ba/0xb40 [mdt]
      [ 1715.683959] [<0>] mdt_intent_policy+0x1a1/0x360 [mdt]
      [ 1715.689661] [<0>] ldlm_lock_enqueue+0x3c5/0xa90 [ptlrpc]
      [ 1715.695653] [<0>] ldlm_handle_enqueue0+0xa66/0x1620 [ptlrpc]
      [ 1715.702065] [<0>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [ 1715.707385] [<0>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      [ 1715.713576] [<0>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [ 1715.720624] [<0>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      [ 1715.726099] [<0>] kthread+0xd1/0xe0
      [ 1715.730023] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [ 1715.735803] [<0>] 0xfffffffffffffffe
      
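      For reference, the repeated rc = -116 in the llog messages above corresponds to -ESTALE ("Stale file handle") on Linux. A quick shell one-liner to confirm the errno name (not part of the original logs):

      python3 -c 'import errno, os; print(errno.errorcode[116], os.strerror(116))'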

      Then, after the system was rebooted, the node hit a hardware issue:

      [   17.065581] sd 0:0:1:3: [sdj] Attached SCSI disk
      [   17.070444] sd 0:0:0:2: [sde] Attached SCSI disk
      [   17.072246] sd 0:0:1:2: [sdi] 6986547200 512-byte logical blocks: (3.57 TB/3.25 TiB)
      [   17.107708] sd 0:0:1:2: [sdi] Write Protect is off
      [   17.125880] sd 0:0:1:2: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
      [   17.209669] Dev sdi: unable to read RDB block 0
      [   17.209687]  sdi: unable to read partition table
      [   17.310369] sd 0:0:1:2: [sdi] Attached SCSI disk
      [   17.400189] sd 1:0:0:0: [sda] Write Protect is off
      [   17.405605] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      [   17.436028]  sda: sda1 sda2
      [   17.439935] sd 1:0:0:0: [sda] Attached SCSI disk
      [   17.577436] sd 0:0:1:1: [sdh] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
      [   17.587772] sd 0:0:1:1: [sdh] tag#1 Sense Key : Hardware Error [current] 
      [   17.595372] sd 0:0:1:1: [sdh] tag#1 <<vendor>>ASC=0x84 ASCQ=0x0 
      [   17.602089] sd 0:0:1:1: [sdh] tag#1 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
      [   17.612011] blk_update_request: critical target error, dev sdh, sector 0
      [   17.619501] Buffer I/O error on dev sdh, logical block 0, async page read
      [   18.127431] sd 0:0:1:1: [sdh] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
      [   18.137759] sd 0:0:1:1: [sdh] tag#0 Sense Key : Hardware Error [current] 
      [   18.145349] sd 0:0:1:1: [sdh] tag#0 <<vendor>>ASC=0x84 ASCQ=0x0 
      [   18.152071] sd 0:0:1:1: [sdh] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
      [   18.161978] blk_update_request: critical target error, dev sdh, sector 0
      [   18.169468] Buffer I/O error on dev sdh, logical block 0, async page read
      [   18.677551] sd 0:0:1:1: [sdh] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
      [   18.687850] sd 0:0:1:1: [sdh] tag#0 Sense Key : Hardware Error [current] 
      [   18.695435] sd 0:0:1:1: [sdh] tag#0 <<vendor>>ASC=0x84 ASCQ=0x0 
      [   18.702154] sd 0:0:1:1: [sdh] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
      [   18.712076] blk_update_request: critical target error, dev sdh, sector 0
      [   18.719565] Buffer I/O error on dev sdh, logical block 0, async page read
      [   19.244098] sd 0:0:1:1: [sdh] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
      

          Activity

            adilger Andreas Dilger added a comment -

            # Setting offset is working, but creating directory
            # directly utilising -D option still fails (see INTL-201)
            lfs setdirstripe -c $c_count -i $s_offset -D $dirname > /dev/null 2>&1
            

            OK, so it looks like remote+striped directories are used at least for the top-level job directories, which is good. It also sets a default striped directory, so that should be exercising the DNE code quite well.

            I can't access INTL-201 (that project no longer exists in our Jira), but I suspect that is the same as LU-15856 "'lfs setdirstripe -D ... <dir>' should create directory if missing".

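            For context, a minimal illustration of the two lfs setdirstripe operations the script performs (the path below is hypothetical; per the LU-15856 description above, -D only sets the default layout and does not create the directory itself):

            # create a new directory striped across 2 MDTs, starting at MDT0001
            lfs setdirstripe -c 2 -i 1 /mnt/soaked/testdir
            # set a default directory layout on the now-existing directory;
            # new subdirectories created under it inherit this layout
            lfs setdirstripe -D -c 2 /mnt/soaked/testdir
            mkdir /mnt/soaked/testdir/subdir
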
            sarah Sarah Liu added a comment -

            Hi Andreas, this is what I found

            #!/bin/bash
            
            dirname=%s/%s
            mkdir -p $dirname
            nrdt=`lfs osts $dirname | awk '/-OST/{count += 1} END{print count}'`
            nrmd=`lfs mdts $dirname | awk '/-MDT/{count += 1} END{print count}'`
            
            dirname=${dirname}/$SLURM_JOB_ID
            # striped directory, if supported
            # This is a temporary hack to circumvent the
            # problems described in INTL-201
            c_count=$((RANDOM %% (nrmd + 1)))
            s_offset=$((RANDOM %% nrmd))
            # Setting offset is working, but creating directory
            # directly utilising -D option still fails (see INTL-201)
            lfs setdirstripe -c $c_count -i $s_offset $dirname > /dev/null 2>&1
            lfs setdirstripe -c $c_count -i $s_offset -D $dirname > /dev/null 2>&1
            # Temporary commented out till INTL-201 is landed and/or include into build
            ##lfs setdirstripe -c $((RANDOM %% (nrmd + 1))) -i $((RANDOM %% nrmd)) $dirname > /dev/null 2>&1
            if [ ! -d $dirname ]; then
                    # fallback to remote directory
                    lfs setdirstripe -i $s_offset $dirname > /dev/null 2>&1
                    if [ ! -d $dirname ]; then
                            # fallback to striped directory
                            lfs setdirstripe -c $c_count $dirname > /dev/null 2>&1
                            if [ ! -d $dirname ]; then
                                    # fallback to regular directory
                                    mkdir $dirname
                            fi
                    fi
            fi
            
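            The %s and %% sequences above suggest this script is emitted from a template (presumably by the soak harness), which substitutes the paths and escapes % as %%. A directly runnable sketch of the stripe-count/offset selection, with a hypothetical mount point filled in, would be:

            #!/bin/bash
            # hypothetical mount point and job directory; the harness substitutes the real paths
            testdir=/mnt/soaked/job_dirs/$SLURM_JOB_ID
            nrmd=$(lfs mdts /mnt/soaked | awk '/-MDT/{count += 1} END{print count}')
            c_count=$((RANDOM % (nrmd + 1)))   # stripe count: 0..nrmd, as in the original script
            s_offset=$((RANDOM % nrmd))        # starting MDT index: 0..nrmd-1
            lfs setdirstripe -c "$c_count" -i "$s_offset" "$testdir"
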
            pjones Peter Jones added a comment -

            It seems that the inclusion of the above-mentioned patches has eliminated this issue.

            sarah Sarah Liu added a comment -

            Quick update: soak has been running well over the weekend; this issue has not been seen so far with patch #47226 (on top of the fix for LU-15761).


            adilger Andreas Dilger added a comment -

            Sarah, I've rebased 47226 to the tip of master so that it also includes the 47185 patch.

            pjones Peter Jones added a comment -

            Sarah

            The second patch has landed, so you will only need v2 of the first patch (or the build based on it).

            Peter

            sarah Sarah Liu added a comment -

            I will load soak with them today. Thank you.

            bzzz Alex Zhuravlev added a comment - - edited

            Sarah, can you please re-run that testing on soak with https://review.whamcloud.com/47226 and https://review.whamcloud.com/47185?
            This is not a fix, rather a workaround to decrease the likelihood.

            pjones Peter Jones added a comment -

            Hit during testing of the LU-15761 patch.

            pjones Peter Jones added a comment -

            This has not been seen for a long time and is believed to have been fixed by changes tracked under other tickets.


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 8
