Thursday, February 24, 2011 8:17 PM
I'll cut to the chase with my question. See the latter part of this post if you want the background.
- Why does MS Access use so much space per data element?
- Is there a way to reduce physical element size? (Besides Unicode compression)
I'm putting data in an MS Access database, and don't want to waste space. Thus I define numeric fields according the most sensible size -- 4-byte INTEGER ("long integer"), 2-byte SMALLINT ("integer"), etc. I avoid using the mega 17-byte DECIMAL if at all possible. I was curious as to the actual physical effect this had, so I did some experiments, and was very suprised by the results. Notably:
- All numeric type fields have an overhead of about 12 bytes per element. As in, it takes 16 bytes to store a 4-byte integer, or 29 bytes to store a 17-byte decimal.
- Text fields have about 16 bytes of overhead, and memo fields have 23 bytes.
- 2-byte integers, 1-byte fields, and even BIT ("Yes/No") fields take the same amount of space as a 4-byte integer! Which is to say 16-17 bytes total.
Does anyone know why it takes so much space? This is a killer for big databases. Notably, it appears completely useless to downsize integer fields (say 4-byte to a 1-byte TINYINT), and even allowing a standard 4-byte bucket, 75% of their footprint is...what? Metadata? Ether?
Is there a way to reduce physical element size? From testing I can see that, yes, using Unicode compression on text and memo fields cuts the content data size in half. But it still leaves the 16 byte overhead even for a 1-characted text field or 23 for a memo. Let alone that bloated integer field element size.
I'm trying to assess optimization issues in downloading data from SQL server and other sources to MS Access. I'm using VB.NET code to perform this, but my question is not about the code.
Basically, the DBAs for the SQL server database I frequently query defined almost all number fields using the NUMERIC field type -- equivalent to the DECIMAL type in Jet / MS Access. So many fields that could easily be a 2-byte SMALLINT ( Integer in the Access UI ) are defined as NUMERIC(5,0).
Perhaps SQL server compacts numeric fields to only the precision required, but Jet definitely does not. Theoretically a decimal field takes a whopping 17 bytes, compared to 4 bytes for a "long integer", 2 for an "integer" / SMALLINT, or 1 for a "byte" / TINYINT. My programs make use of the very nifty DataSchemaTable and DataAdapter features in the .NET framework so that I can preserve strong data typing as much as possible.
But with the lazy NUMERIC(x,0) fields I convert them to a smaller integer type. And I can see that does save over 12 bytes per element--reasonably close to the theoritical 13-byte difference between an integer and decimal field. But I'm still left with all that mysterious extra space as described above. And again, reducing a field from a 4-byte integer to a 2-byte SMALLINT, 1-byte TINYINT, or even 1-bit boolean saves no space at all.
So, am I missing something? Should I skip Access and start using DataAdapters to digest flat files? And what's up with all that bloat?
Thanks for any insight or intput,
Thursday, February 24, 2011 8:27 PM
This is from Access 2003 Help (Sorry about formatting - just copied/pasted)
Text Use for text or combinations of text and numbers, such as addresses, or for numbers that do not require calculations, such as phone numbers, part numbers, or postal codes. Stores up to 255 characters. The FieldSize property controls the maximum number of characters that can be entered. Memo Use for lengthy text and numbers, such as notes or descriptions. Stores up to 63,999 characters.
Number Use for data to be included in mathematical calculations, except calculations involving money (use Currency type). Stores 1, 2, 4, or 8 bytes; stores 16 bytes for Replication ID (GUID). The FieldSize property defines the specific Number type.
Date/Time Use for dates and times. Stores 8 bytes. Currency Use for currency values and to prevent rounding off during calculations. Stores 8 bytes.
AutoNumber Use for unique sequential (incrementing by 1) or random numbers that are automatically inserted when a record is added. Stores 4 bytes; stores 16 bytes for Replication ID (GUID).
Yes/No Use for data that can be only one of two possible values, such as Yes/No, True/False, On/Off. Null (Null: A value you can enter in a field or use in expressions or queries to indicate missing or unknown data. In Visual Basic, the Null keyword indicates a Null value. Some fields, such as primary key fields, can't contain a Null value.) values are not allowed. Stores 1 bit.
OLE Object Use for OLE objects (such as Microsoft Word documents, Microsoft Excel spreadsheets, pictures, sounds, or other binary data) that were created in other programs using the OLE (OLE: A program-integration technology that you can use to share information between programs. All Office programs support OLE, so you can share information through linked and embedded objects.) protocol. Stores up to 1 gigabyte (limited by disk space).
Hyperlink Use for hyperlinks (hyperlink: Colored and underlined text or a graphic that you click to go to a file, a location in a file, a Web page on the World Wide Web, or a Web page on an intranet. Hyperlinks can also go to newsgroups and to Gopher, Telnet, and FTP sites.). A hyperlink can be a UNC path (universal naming convention (UNC): A naming convention for files that provides a machine-independent means of locating the file. Rather than specifying a drive letter and path, a UNC name uses the syntax \\server\share\path\filename.) or a URL (Uniform Resource Locator (URL): An address that specifies a protocol (such as HTTP or FTP) and a location of an object, document, World Wide Web page, or other destination on the Internet or an intranet, for example: http://www.microsoft.com/.). Stores up to 2048 characters. Lookup Wizard Use to create a field that allows you to choose a value from another table or from a list of values using a combo box—-choosing this option in the data type list starts a wizard to define this for you. Requires the same storage size as the primary key (primary key: One or more fields (columns) whose values uniquely identify each record in a table. A primary key cannot allow Null values and must always have a unique index. A primary key is used to relate a table to foreign keys in other tables.) that corresponds to the Lookup field (Lookup field: A field, used on a form or report in an Access database, that either displays a list of values retrieved from a table or query, or stores a static set of values.)—-typically 4 bytes.
See Also Free trial of Microsoft Office 2010
- Edited by -suzyQ Thursday, February 24, 2011 8:36 PM help formatting a little
Thursday, February 24, 2011 10:09 PMSorry, I know I didn't answer your questions. There is a header in the file that stores number of records, indexes, and other stuff, but I didn't think it grew significantly to the number records that are in the database so I guess this is a question for someone with more knowledge than I.
Thursday, February 24, 2011 10:23 PMModerator
I would doubt your experimentation, or the interpretation you made of it.
If even Boolean were to have an overhead as you described, a table with 200 Boolean will exceed the maximum record length. But is it doable to make such a table (make a table with 10 Booleans, make a query which bring that table 20 times, transform that query into a make table query, and run the query to make the table with 200 Booleans).
Friday, February 25, 2011 2:33 PMModeratorTake note that if the table is created from within Access Table Designer, Access may decide to add some index on your fields (because their name ends by "id", as example), and those indexes will take space. Check the List of Indexes to see if that is not your case, ie, if you don't have unwanted indexes.
Friday, February 25, 2011 3:05 PMHere's a thought also - when you created your tables and the rows in them there might have been intermediary steps that you took that caused bloat. Try compact/repair and see if that changes anything.
Friday, February 25, 2011 3:27 PM
I appreciate your input suzyQ and Vanderghast. You rightly question my methodology. I did however take all of these things into account. I didn't include all the details of how I did my experiment in my first post for the sake of length, but it looks like I had better post them now! I'll probably split this into several responses...
To try to determine the physical field sizes of different number and text data types in a Jet (MS Access) database, I created a series of databases which were identical in the following attributes:
- Each database contained exactly 1 table with 1 field, called ID.
- *I did NOT index this field. * Yes, I know MS Access sometimes creates indexes by default. If this happened I removed the indexes. I did other experiments where I did include an index, yet others where I further specified the field as a primary key. Naturally adding an index increased the db size significantly, and designating the field as PK increased the size more. For all the data presented here, however, no indexes were present.
- The single table in each database contained exactly 1,000,000 rows. Yes, 1 million.
- For every single observation I compacted the respective database multiple times and then closed the file before recording the final (compacted) size.
- Although the physical size of the metadata was minimal compared to the physical size of the data, I still subtracted the size of the metadata (the size of the database with no records) before dividing for the per-element data size. Below are the "constants"
for each data type tested:
Size With No Data (0 rows) Data Type AccessUI Size Type Size Jet ADO DDL (KB) [No table - new database] 268 Yes/No BIT 280 Number Byte TINYINT 280 Number Integer SMALLINT 280 Number Long Integer INTEGER 280 Number Decimal (3,0) DECIMAL(3,0) 280 Number Decimal (5,0) DECIMAL(5,0) 280 Number Decimal (10,0) DECIMAL(10,0) 280 Number Decimal (18,0) DECIMAL(18,0) 280 Number Decimal (28,0) DECIMAL(28,0) 280 Text 50 CHAR(50) 284 Text 100 CHAR(100) 284 Text 150 CHAR(150) 284 Text 200 CHAR(200) 284 Memo MEMO 284
Friday, February 25, 2011 4:32 PM
Test Cases - Number Fields
For the number types, I also tested each type with different amounts of variance in the values in the column. In English: For the databases in the columns labeled "unique", each value in the table was different. It was a list of 1 to 1,000,000. I didn't add a constraint on the table, I just inserted unique values. For 16bit, each value was between 0 and 32,767. For 8 bit, 0 - 255. For 0, every single value was 0. I did this because I wondered if it would take more room to store 1,000,000 unique values vs. 1,000,000 zeros. The answer is no, it doesn't make a difference.
The field size in bytes is calculated from the file size using this formula:
[field size] = (([file size] - 280) * 1024) / 1,000,000)
The 280 is to subtract the size of the metadata (in KB). 1024 is to convert from KB to bytes. 1,000,000 is obviously the number of elements. The result is striking: all integer types (including the bit / boolean / yesno field) take 16 bytes per element. In databases with 1,000,000 records, the file sizes are precisely the same.
Of course files are not stored in less that byte chunks; each element takes 16 bytes, not 16.08. I presume the extra is from small amounts of incremental meta data.
Table with 1,000,000 rows Data File Size (KB) Field Size (Bytes) Type Unique 16Bit 8Bit 0 Unique 16Bit 8Bit 0 Bit 15,988 16.08 Byte 15,988 15,988 16.08 16.08 Small 15,988 15,988 15,988 16.08 16.08 16.08 Long 15,988 15,988 15,988 15,988 16.08 16.08 16.08 16.08 Dec(3,0) 27,888 27,888 28.27 28.27 Dec(5,0) 27,888 27,888 27,888 28.27 28.27 28.27 Dec(10,0) 27,888 27,888 27,888 27,888 28.27 28.27 28.27 28.27 Dec(18,0) 27,888 27,888 27,888 27,888 28.27 28.27 28.27 28.27 Dec(28,0) 27,888 27,888 27,888 27,888 28.27 28.27 28.27 28.27
So, now you see! And my questions remain:
- Why does it take 16 bytes to store 4-byte integer?
- Why does it take just as much room to store a bit, byte, or Int16 as it does for an 4-byte integer?
- Is there any way to reduce this overhead?
Friday, February 25, 2011 6:28 PMModerator
"Each database contained exactly 1 table with 1 field, called ID. "
If you do it with the Access table designer (rather than, say, with a DDL statement), you then ALSO create an index, which can then explain the overhead. Try by calling the field MyField, intead (or create the table with DDL statements). Note that it is ACCESS, the application, not JET, the database engine, which adds the index, for you, when you use its graphical table designer.
Friday, February 25, 2011 7:20 PMModeratorThere is also a second possibility for that overhead. Under MS SQL Server, if you don't define a primary key, MS SQL Server adds one, for you, but hidden to you. I don't know if Jet does the same thing, or not, but a way to be sure is to supply the table with a primary key field, explicitly, "compute" the space so used on the hard disk (or other measure, if appropriate), then, add a single Boolean field, and re-examine the size. The difference between the second result, and the first one, will be a better candidate about the required size for that Boolean field. And again, be sure the Boolean field is NOT INDEXED, since any index has to be stored in the database too.
Friday, February 25, 2011 7:41 PMIs it also not the case that Access works on a Page basis, with a maximum number of records stored per page? Just because a record has only a single integer field in it doesn't mean that it won't take just as much space as, perhaps, a record with multiple integer fields, since you may run into the maximum records per page in both cases.
Doug Steele, Microsoft Access MVP
http://www.AccessMVP.com/djsteele (no e-mails, please!)
Co-author Access Solutions — Tips, Tricks, and Secrets from Microsoft Access MVPs (ISBN 978-0-470-59168-0)
Tuesday, March 01, 2011 1:03 PM
Vanderghast and Doug,
Good thoughts! I'm heads down in a project right now, but I hope to test these items out later this week.
Wednesday, March 02, 2011 2:39 PMPerhaps is thus just wild speculation then?Have you tried the same with SQL server?There are also additional factors here such as the disk sector size etc. So, access might be only adding a few bytes, but the os can ONLY grab chunks of the disk drive in certain size increments and that is what you see.And, while you do a compact + repair, excessive bloat is often seen if you have row locking turned on. Access fakes row locking of records by expanding a single record to fill a whole page (4000 bytes) of which the JET data engine then can in fact lock.Heck, trying deleting the whole column, and then C + R and see what you get back! The measured results might be telling you what is the smallest sector size the os can deal with.I suspect that SQL server and other systems you will see even more overhead since they require transaction logs in addition to an typical frame based or page type of architecture.I not sure if you are seeing anything new that I seen or known about in the last 20+ years of desktop database systems. So how this works is not really much news here.Page based systems (Access, SQL server, Oracle etc.) do have some weakness that you will see. I mean if we have some data in the pages like this:I can now add 6 more records and see ZERO growth. We now have:So, now I adding records and the file is not growing!Then, I just add two more records, and if data frame size is 4000 chars, then that is a cost of 8000 bytes. We get:The above also explains why a freshly compacted file grows like crazy, but then settles down to a dull roar since over time there will be many buckets that have extra room, and adding of records will not likely case a whole new bucket/frame/page to be added.Older PC systems like dBase were fixed length for each record added and were quite linear in growth. However these systems also did not work well with multi user systems and did not support variable length text columns for the most part.Albert D. Kallal (Access MVP)Edmonton, Alberta Canada
- Marked As Answer by Bruce SongModerator Wednesday, March 09, 2011 11:20 AM
Wednesday, March 02, 2011 3:45 PMPerhaps SQL server compacts numeric fields to only the precision required
Perhaps is thus just wild speculation then?
Yes, it is just wild speculation :D. I was trying to give the DBA the benefit of the doubt that he would not be wasting enormous amounts of space.
Albert, thanks for the additional input. I've added this to my list of experimental to-dos. I don't have any more results to post yet but I will post when I do.
Wednesday, March 02, 2011 4:59 PM
I know that you are trying to understand this - secretly more complex than expected - question with indexes being removed, but I would submit the following two observations: 1) in any Real-Life(tm) system with 1 million data rows, indexes are ALWAYS used (therefore even your simplified data model fails to be able to predict real-world results); and 2) since the (rushmore?) indexing technology MS obtained many years ago - and has used in MS Access and elsewhere since - remains a proprietary technology that MS has never published the "inner workings" of, then your ability to accurately predict the real-world size of target database files is hampered even further.
I was forced to develop a (flawed but useful for my purposes) estimate of the index size/growth for a project recently, and the complexities of estimating MS Access database file sizes are compounded by not only the # of data rows and index fields, but they also vary by the uniqueness of the data populating those indexed fields. This frustrates the usability of even %-age size-based growth estimation models, depending upon the particulars of your situation.
...but hey, if MS Access really were the "childs play"/"play toy" that its ignorant detractors imagine it to be, then there would be no need of any MS Access MVPs in the world, would there? (and there are quite a few MS Access MVPs out there...)
Mark Burns, MCAD, MCP
Sr. Microsoft Access Analyst/Developer
Manager LinkedIn.Com community: Professional Microsoft Access Developers Network (PMADN)
- Marked As Answer by Bruce SongModerator Wednesday, March 09, 2011 11:20 AM
Thursday, March 03, 2011 1:50 PM
Thanks Mark! A most insightful post. I may still try to do some more experiments, but your comments, along with others from Doug, Albert, and Vanderghast, have pretty well convinced me that my controlled experiment was rather too controlled, or limited, to provide a meaningful prediction about space usage in a large MS Access database.
The reason I am working with perhaps sloppy-appearing, no-PK-defined tables: I'm using MS Access as a local cache for complicated SQL processing for very specific data analysis. My queries change constantly, call from many data sources, and often involve mutiple steps collating and transforming intermediate tables.
MS Access is a great tool for high-speed data collection and on-the-fly scripting. But the 2GB size limit sometimes looms large in my processing. Sometimes I'm pulling a hundred records, sometimes it could be 10 million. Most of the time I don't know how much data I'll get until I run the query.
Even if I could convince our IT department to give some high-skilled fraud specialists free reign in a production-performance SQL Server (or DB2, or Teradata) database, there's no way a network-attached db provider could touch the IO speed of a database on a local hard disk. Of course there's SQL Server Express edition, but that doesn't have the integrating scripting flexibility of VBA, nor is it portable for sharing data and scripts / VBA programs with my colleagues.
In short, MS Access really is the best tool for the job. I just worry about space sometimes! Hence this discussion.
Thursday, March 03, 2011 4:07 PMI should very much say that I appreciate you sharing some of your numbers here. I have not time this week, but my interest was certainly was raised by your post. I shall try and look at this further on the weekend and see if I can dig up anything of additional interest on this matter.--
Albert D. Kallal (Access MVP)
Edmonton, Alberta Canada
Friday, March 04, 2011 5:54 PM
Here's one "real" example:
I just pulled a result set the ended up with 76,204 records. The fields consist of a INTEGER, DATETIME, CHAR(20), and then 51 DECIMAL fields. Compacted, the database is 76,488 KB. Incidentally this is very close to the nominal size with no meta data:
Theoretical size of one record: 919 bytes
Type Bytes Qty Total INTEGER 4 1 4 DATETIME 8 1 8 CHAR(20) 40 1 40 DECIMAL 17 51 867 Total 54 919
Average physical size of record: 1,028 bytes. This computes to 11.9% additional size over nominal.
Total (KB) 76,488 × 1,024 Total (B) 78,323,712 ÷ 76,204 Per Record 1,028
None of the DECIMAL fields actually need 17 bytes of precision, so I created a second table with a duplicate definition, except that the 51 DECIMAL fields were defined as FLOAT (that's a Double in VBA / Access UI terms). When filled and compacted, the resulting database was 38,388 KB. The stats are now 460 bytesnominal data size and 516 bytes average physical size: 12.0% additional size over nominal.
Type Bytes Qty Total INTEGER 4 1 4 DATETIME 8 1 8 CHAR(20) 40 1 40 FLOAT 8 51 408 Total 54 460 Total (KB) 38,388 × 1,024 Total (B) 39,309,312 ÷ 76,204 Per Record 516
In this case, total overhead runs about 12% over nominal, and changing data types from a 17-bit DECIMAL to a 8-bit FLOAT really does save 9 bits per elements (actually 9 * 1.12 = 10 bits). With my 51 fields of this type, changing the data type cut the database size in half.
- Edited by Joshua Honig Friday, March 04, 2011 6:25 PM Fixed formatting
Saturday, March 05, 2011 3:04 PMModerator
Assuming the database page is 2K (2048 bytes) and KNOWING that a record has to be in ONE page (cannot store half of the record in one page, and the other half in another page), 2048 / 460 = 4 records + 208 blanks left over which cannot be used: 4* 460 + 208 = 2048.
That is, you use 2048 Bytes to effectively store 1840 Bytes, 11.3% more than the strict theorical minimum.