Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.14.0, Lustre 2.12.7, Lustre 2.15.0, Lustre 2.15.3, Lustre 2.15.4
-
None
-
3
-
9223372036854775807
Description
obdfilter-survey test_2b hangs starting on 02 OCT 2020 with logs at https://testing.whamcloud.com/test_sets/258c8761-9d71-4d82-8a5e-1c473e5a089c. So far, this looks like it only hangs for ZFS.
Looking at the suite_log for the failure at https://testing.whamcloud.com/test_sessions/296ca0ce-3daa-42aa-a92d-cdbdb7af75da, the last test run is for netdisk
Unable to detect ip address for host: '' + NETTYPE=tcp thrlo=4 nobjhi=1 thrhi=4 size=1024 case=netdisk rslt_loc=/tmp targets="10.9.7.117:lustre-OST0000 10.9.7.117:lustre-OST0001 10.9.7.117:lustre-OST0002 10.9.7.117:lustre-OST0003 10.9.7.117:lustre-OST0004 10.9.7.117:lustre-OST0005 10.9.7.117:lustre-OST0006" /usr/bin/obdfilter-survey Fri Dec 25 12:00:08 UTC 2020 Obdfilter-survey for case=netdisk from trevis-201vm1.trevis.whamcloud.com ost 7 sz 7340032K rsz 1024K obj 7 thr 28 write 5.28 [ 0.00, 8.15] rewrite 4.88 [ 0.00, 4.00] read
We see some errors in the console logs. For the client1 (vm1) we see an error message from osc_init_grant()
[22622.025189] Lustre: DEBUG MARKER: == obdfilter-survey test 2b: Stripe F/S over the Network, async journal ============================== 12:00:01 (1608897601) [25456.252361] LustreError: 200837:0:(osc_request.c:1043:osc_init_grant()) lustre-OST0003_osc: granted 368640 but already consumed 1048576 [26276.964898] sysrq: SysRq : Changing Loglevel [26276.965724] sysrq: Loglevel set to 8 [26276.967609] sysrq: SysRq : Show State [26276.968269] task PC stack pid father [26276.969210] systemd S 0 1 0 0x00000000 [26276.970088] Call Trace: [26276.970548] __schedule+0x2a6/0x700 [26276.971139] schedule+0x38/0xa0 [26276.971677] schedule_hrtimeout_range_clock+0xf7/0x110 [26276.972514] ? ep_scan_ready_list.constprop.22+0x20c/0x230 [26276.973406] ep_poll+0x3c8/0x410 [26276.973953] ? wake_up_q+0x80/0x80 [26276.974534] do_epoll_wait+0xb0/0xd0 [26276.975131] __x64_sys_epoll_wait+0x1a/0x20 [26276.975821] do_syscall_64+0x5b/0x1a0 [26276.976447] entry_SYSCALL_64_after_hwframe+0x65/0xca [26276.977339] RIP: 0033:0x7f023c889257 [26276.977937] Code: Bad RIP value. [26276.978495] RSP: 002b:00007ffc57bbe450 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8 [26276.979678] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f023c889257 [26276.980796] RDX: 0000000000000056 RSI: 00007ffc57bbe490 RDI: 0000000000000004 [26276.981913] RBP: 00007ffc57bbe490 R08: 0000000000000000 R09: 7465677261742e79 [26276.983033] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000056 [26276.984152] R13: 00000000ffffffff R14: 00007ffc57bbe490 R15: 0000000000000001 [26276.985270] kthreadd S 0 2 0 0x80004000 [26276.986143] Call Trace: [26276.986580] __schedule+0x2a6/0x700 [26276.987166] schedule+0x38/0xa0 [26276.987703] kthreadd+0x2e8/0x300 [26276.988275] ? kthread_create_on_cpu+0xa0/0xa0 [26276.989001] ret_from_fork+0x35/0x40 [26276.989601] rcu_gp I 0 3 2 0x80004000 [26276.990486] Call Trace: [26276.990916] __schedule+0x2a6/0x700 [26276.991509] schedule+0x38/0xa0 [26276.992048] rescuer_thread+0x29e/0x340 [26276.992686] ? process_one_work+0x360/0x360
In the OSS (vm3) console log, we see some grant errors
[22642.753377] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param obdfilter.lustre-*.sync_journal=0 [22643.253588] LustreError: 366155:0:(tgt_grant.c:919:tgt_grant_alloc()) lustre-OST0000: client lustre-OST0000_osc_UUID/00000000089e081d requesting > max (2147483647), 34359738368 [22643.256053] LustreError: 366155:0:(tgt_grant.c:919:tgt_grant_alloc()) Skipped 6 previous similar messages [26311.031755] sysrq: SysRq : Changing Loglevel [26311.032550] sysrq: Loglevel set to 8 [26311.034388] sysrq: SysRq : Show State [26311.034963] task PC stack pid father [26311.035888] systemd S 0 1 0 0x00000000 [26311.036726] Call Trace: [26311.037144] __schedule+0x2a6/0x700 [26311.037701] schedule+0x38/0xa0 [26311.038200] schedule_hrtimeout_range_clock+0xf7/0x110 [26311.038999] ? ep_scan_ready_list.constprop.22+0x20c/0x230 [26311.039846] ep_poll+0x3c8/0x410 [26311.040381] ? wake_up_q+0x80/0x80 [26311.040903] do_epoll_wait+0xb0/0xd0 [26311.041468] __x64_sys_epoll_wait+0x1a/0x20 [26311.042113] do_syscall_64+0x5b/0x1a0 [26311.042695] entry_SYSCALL_64_after_hwframe+0x65/0xca [26311.043491] RIP: 0033:0x7ffbb0799257 [26311.044053] Code: Bad RIP value. [26311.044562] RSP: 002b:00007fffda2f4c30 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8 [26311.045720] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ffbb0799257 [26311.046806] RDX: 0000000000000057 RSI: 00007fffda2f4c70 RDI: 0000000000000004 [26311.047893] RBP: 00007fffda2f4c70 R08: 0000000000000000 R09: 7465677261742e79 [26311.048978] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000057 [26311.050066] R13: 00000000ffffffff R14: 00007fffda2f4c70 R15: 0000000000000001 [26311.051152] kthreadd S 0 2 0 0x80004000 [26311.052001] Call Trace:
We’ve seen this hang in the past at
https://testing.whamcloud.com/test_sets/c037a9f9-459a-41c2-9eea-ea54cb91eb3b
Attachments
Issue Links
- is duplicated by
-
LU-15756 obdfilter-survey test_2b:osc_request.c:1037:osc_init_grant()) lustre-..._osc: granted 364544 but already consumed 1048576
-
- Resolved
-
- mentioned in
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...