[LU-4045] Overflow in tracking declares for index, Created: 02/Oct/13  Updated: 03/Nov/17  Resolved: 16/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Alexey Lyashkov Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 10856

 Description   

Is this expected?

Lustre: DEBUG MARKER: == sanity test 233: checking that OBF of the FS root succeeds == 14:55:00 (1380624900)
Lustre: DEBUG MARKER: == sanity test 234: xattr cache should not crash on ENOMEM == 14:55:00 (1380624900)
Lustre: DEBUG MARKER: sanity test_234: @@@@@@ FAIL: touch failed
Lustre: DEBUG MARKER: == sanity test 235: LU-1715: flock deadlock detection does not work properly == 14:55:00 (1380624900)
Lustre: DEBUG MARKER: sanity test_235: @@@@@@ FAIL: error executing flock_deadlock /mnt/lustre/f.sanity.235
Lustre: DEBUG MARKER: == sanity test 236: Layout swap on open unlinked file == 14:55:00 (1380624900)
Lustre: DEBUG MARKER: sanity test_236: @@@@@@ FAIL: cannot setstripe on '/mnt/lustre/d0.sanity/d236/f1': rc = 28
Lustre: DEBUG MARKER: == sanity test 900: umount should not race with any mgc requeue thread == 14:55:00 (1380624900)
LustreError: 11-0: lustre-MDT0000-lwp-OST0001: Communicating with 0@lo, operation obd_ping failed with -107.
Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
Lustre: Skipped 1 previous similar message
LustreError: Skipped 4 previous similar messages
LustreError: 17372:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8801d1537610 x1447689357275628/t0(0) o13->lustre-OST0000-osc-MDT0000@0@lo:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Lustre: server umount lustre-MDT0000 complete
Lustre: 28485:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1380624907/real 1380624907]  req@ffff8801f2f64108 x1447689357275644/t0(0) o251->MGC192.168.69.5@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1380624913 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 28485:0:(client.c:1897:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LNet: Removed LNI 192.168.69.5@tcp
LNet: HW CPU cores: 8, npartitions: 2
alg: No test for crc32 (crc32-table)
alg: No test for adler32 (adler32-zlib)
alg: No test for crc32 (crc32-pclmul)
Lustre: Lustre: Build Version: 2.4.93-gbe84477-CHANGED-2.6.32-358.18.1-lustre
LNet: Added LNI 192.168.69.5@tcp [8/256/0/180]
LNet: Accept secure, port 988
Lustre: Echo OBD driver; http://www.lustre.org/
LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: lustre-MDT0000: used disk, loading
LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. quota=on. Opts: 
LustreError: 29613:0:(osd_io.c:1065:osd_ldiskfs_write_record()) loop1: error reading offset 0 (block 0): rc = -28
LustreError: 29613:0:(llog_osd.c:156:llog_osd_write_blob()) lustre-OST0000-osd: error writing log record: rc = -28
Lustre: 29613:0:(osd_internal.h:978:osd_trans_exec_op()) lustre-OST0000-osd: Overflow in tracking declares for index, rb = 0
Pid: 29613, comm: mount.lustre

Call Trace:
 [<ffffffffa09198c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0c85908>] osd_trans_exec_op+0x1f8/0x2e0 [osd_ldiskfs]
 [<ffffffffa0c96ec3>] osd_attr_set+0xe3/0x540 [osd_ldiskfs]
 [<ffffffffa0ca3cb5>] ? osd_punch+0x1d5/0x620 [osd_ldiskfs]
 [<ffffffffa0b17d91>] llog_osd_write_blob+0x211/0x850 [obdclass]
 [<ffffffffa0b1ba84>] llog_osd_write_rec+0x814/0x13d0 [obdclass]
 [<ffffffffa0b3d99e>] ? lu_context_init+0x4e/0x240 [obdclass]
 [<ffffffffa0b3d99e>] ? lu_context_init+0x4e/0x240 [obdclass]
 [<ffffffffa0aef3d8>] llog_write_rec+0xc8/0x290 [obdclass]
 [<ffffffffa0af0ab8>] ? llog_write+0x288/0x420 [obdclass]
 [<ffffffffa0af0add>] llog_write+0x2ad/0x420 [obdclass]
 [<ffffffffa0af0c74>] llog_copy_handler+0x24/0x30 [obdclass]
 [<ffffffffa0af1ce6>] llog_process_thread+0x866/0xce0 [obdclass]
 [<ffffffffa0af0c50>] ? llog_copy_handler+0x0/0x30 [obdclass]
 [<ffffffffa0af383a>] llog_process_or_fork+0x12a/0x550 [obdclass]
 [<ffffffffa0af4062>] llog_backup+0x3e2/0x510 [obdclass]
 [<ffffffff8129b390>] ? sprintf+0x40/0x50
 [<ffffffffa073046f>] mgc_process_log+0x10af/0x1770 [mgc]
 [<ffffffffa0730d33>] ? do_config_log_add+0x203/0x580 [mgc]
 [<ffffffffa0732153>] mgc_process_config+0x653/0x11e0 [mgc]
 [<ffffffffa0b32eb6>] lustre_process_log+0x256/0x990 [obdclass]
 [<ffffffffa0929871>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0b6150b>] ? server_register_mount+0x54b/0x820 [obdclass]
 [<ffffffffa0b63917>] server_start_targets+0x357/0x1aa0 [obdclass]
 [<ffffffffa0b36355>] ? lustre_start_mgc+0x465/0x1dc0 [obdclass]
 [<ffffffffa0c8d13b>] ? osd_obd_connect+0x22b/0x270 [osd_ldiskfs]
 [<ffffffffa0b2e7d0>] ? class_config_llog_handler+0x0/0x1830 [obdclass]
 [<ffffffffa0b69aac>] server_fill_super+0xbac/0x1630 [obdclass]
 [<ffffffffa0b37e78>] lustre_fill_super+0x1c8/0x520 [obdclass]
 [<ffffffffa0b37cb0>] ? lustre_fill_super+0x0/0x520 [obdclass]
 [<ffffffff81199cbf>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa0b30105>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8119927b>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff81199422>] do_kern_mount+0x52/0x130
 [<ffffffff811b9ca2>] do_mount+0x2d2/0x8c0
 [<ffffffff811ba320>] sys_mount+0x90/0xe0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

LustreError: 29613:0:(osd_io.c:1065:osd_ldiskfs_write_record()) loop1: error reading offset 0 (block 0): rc = -28
LustreError: 29613:0:(llog_osd.c:156:llog_osd_write_blob()) lustre-OST0000-osd: error writing log record: rc = -28


 Comments   
Comment by Ian Costello (Inactive) [ 15/Jan/14 ]

We are also seeing these stack traces in the logs for MDSes recently upgraded to 2.4.2.

Comment by Ian Costello (Inactive) [ 15/Jan/14 ]

Actually, the stack traces we are seeing are not the same, but the message preceding them is. Here is an example of the issue (if necessary, I will create a separate bug to track it):

2014-01-14T17:43:16+11:00 gerbil5 kernel: Lustre: 1013:0:(osd_internal.h:956:osd_trans_exec_op()) short-MDT0000-osd: Overflow in tracking declares for index, rb = 2
2014-01-14T17:43:16+11:00 gerbil5 kernel: Lustre: 1013:0:(osd_internal.h:956:osd_trans_exec_op()) Skipped 3 previous similar messages
2014-01-14T17:43:16+11:00 gerbil5 kernel: Pid: 1013, comm: mdt01_047
2014-01-14T17:43:16+11:00 gerbil5 kernel:
2014-01-14T17:43:16+11:00 gerbil5 kernel: Call Trace:
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0401895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0d08368>] osd_trans_exec_op+0x1f8/0x2e0 [osd_ldiskfs]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0d11778>] osd_xattr_set+0x278/0x390 [osd_ldiskfs]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0ede440>] lod_xattr_set+0xc0/0x810 [lod]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0ba76e7>] mdo_xattr_set.clone.0+0x47/0xd0 [mdd]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0ba79ba>] mdd_acl_chmod+0x24a/0x290 [mdd]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0b8b434>] mdd_attr_set_internal+0x174/0x230 [mdd]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0b8e6ea>] mdd_attr_set+0x107a/0x1390 [mdd]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0e2d738>] mdt_attr_set+0x268/0x560 [mdt]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0e2dfed>] mdt_reint_setattr+0x5bd/0xcf0 [mdt]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa075d12e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-01-14T17:43:16+11:00 gerbil5 kernel: [<ffffffffa0e27911>] mdt_reint_rec+0x41/0xe0 [mdt]

Comment by Mark Mansk [ 03/Jun/14 ]

We're seeing this test fail at Cray now with 2.6. Any status on it?

Comment by Gerrit Updater [ 24/Apr/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/14566
Subject: LU-4045 mdd: use same attr to declare setattr
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 46d729410f02ae8c59803eb9f707eb1a44dfa057

Comment by Hongchao Zhang [ 24/Apr/15 ]

The patch is tracked at http://review.whamcloud.com/#/c/14566/

Comment by Gerrit Updater [ 16/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14566/
Subject: LU-4045 mdd: use same attr to declare setattr
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0158031bc32d1719ec05d12d1ceb03e4ecb66a97

Comment by Peter Jones [ 16/Jul/15 ]

Landed for 2.8

Generated at Sat Feb 10 01:39:10 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.