Unique composite identifier for efficient querying of ACL entries at startup

  • Question

  • Hello Everyone,

    I bet you are surprised by the title of this topic, but I'll try to explain the reason for my headache, and if you could point me in some direction, I would be really thankful. The situation is that we have a project built on SenseNet Security. It is an open-source security system that lets users control access to resources, implementing the usual Access Control List (ACL) pattern with inheritance, minimal membership and group management for inheritable security checks, and so on.

    Its default storage provider uses MS SQL Server and stores the entities using the following schema:

    [schema diagram: ACL]

    As you can see, all identifiers are represented as simple 32-bit integers, including the EntityIds, which are the entities of our application. This DB design assumes that you have globally unique int keys for all kinds of entities, without a discriminator field to distinguish one entity from another based on its type, similar to EF TPT inheritance. It gets even more complicated if you want to include a resource-based discriminator as well, e.g. a ProjectId, to isolate entries on a per-project basis.

    To solve this issue, I have started to refactor all of this, replacing the int Ids in the C# code with a readonly struct EntityId wrapper containing the 3 properties we need for our project: ProjectId, TypeId, LocalId. The question is: how should I store them without degrading the performance of the data fetched into memory (or even into Redis later) at application startup? (The SenseNet Security system populates all ACL entries into memory on init.)

    My first idea was to use surrogate keys with a "lookup" table mapping the int entity Ids to these 3 fields. At first it seemed a good idea, but when I measured performance I saw some degradation with 10 million ACL entries because of the extra joins. So, this was the original query:

    SELECT * FROM [EFEntries]

    ... and this is the new one (compact version of the EF Core generated SQL):

    SELECT a.*, a0.*, a1.*, a2.* FROM [EFEntries] AS [a]
    INNER JOIN [EFObjectIds] AS [a0] ON [a].[Id] = [a0].[Id]
    INNER JOIN [EFObjectIds] AS [a1] ON [a].[OwnerId] = [a1].[Id] -- MemberId (User | Group) is also an EntityId
    LEFT JOIN [EFObjectIds] AS [a2] ON [a].[ParentId] = [a2].[Id]

    The difference in execution time between the two queries is about 20 seconds with the mentioned amount of data. Okay, then a new idea came to mind: what would happen if I used a fixed-size binary(10) to represent the 3 numbers as one in the EFEntityId field? Of course, I can do:

    public readonly struct ObjectId : IEquatable<ObjectId>
    {
        public ObjectId(int storeId, short typeId, int localId)
            => (StoreId, TypeId, LocalId) = (storeId, typeId, localId);

        public int StoreId { get; }   // 32 bits - 4 bytes
        public short TypeId { get; }  // 16 bits - 2 bytes
        public int LocalId { get; }   // 32 bits - 4 bytes

        public bool Equals(ObjectId other)
            => StoreId == other.StoreId && TypeId == other.TypeId && LocalId == other.LocalId;
    }
    
    public sealed class EFEntry
    {
        public ObjectId Id { get; set; }
        public ObjectId OwnerId { get; set; }
        public ObjectId? ParentId { get; set; }
        public bool LocalOnly { get; set; }
        public ulong AllowBits { get; set; }
        public ulong DenyBits { get; set; }
    }

    Using a simple EF Core value converter, I can transform the keys back and forth:

    private static byte[] ConvertTo(ObjectId value)
    {
        Span<byte> span = stackalloc byte[10];
        Span<byte> span2 = span.Slice(4, 2);
        Span<byte> span3 = span.Slice(6, 4);
    
        BitConverter.TryWriteBytes(span, value.StoreId);
        BitConverter.TryWriteBytes(span2, value.TypeId);
        BitConverter.TryWriteBytes(span3, value.LocalId);
    
        return span.ToArray();
    }
    
    private static ObjectId ConvertBack(byte[] bytes)
    {
       return new ObjectId
       (
           storeId: BitConverter.ToInt32(bytes, 0),
           typeId: BitConverter.ToInt16(bytes, 4),
           localId: BitConverter.ToInt32(bytes, 6)
       );
    }
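    Just to sanity-check the converter pair, here is a short stand-alone round trip (a value tuple stands in for the ObjectId struct so the sketch is self-contained):

```csharp
using System;

// Pack (StoreId, TypeId, LocalId) into 10 bytes and unpack again,
// mirroring the ConvertTo/ConvertBack pair above.
static byte[] Pack((int StoreId, short TypeId, int LocalId) id)
{
    Span<byte> span = stackalloc byte[10];
    BitConverter.TryWriteBytes(span, id.StoreId);
    BitConverter.TryWriteBytes(span.Slice(4, 2), id.TypeId);
    BitConverter.TryWriteBytes(span.Slice(6, 4), id.LocalId);
    return span.ToArray();
}

static (int StoreId, short TypeId, int LocalId) Unpack(byte[] bytes)
    => (BitConverter.ToInt32(bytes, 0),
        BitConverter.ToInt16(bytes, 4),
        BitConverter.ToInt32(bytes, 6));

var id = (StoreId: 7, TypeId: (short)3, LocalId: 42);
var bytes = Pack(id);
Console.WriteLine(bytes.Length);        // 10
Console.WriteLine(Unpack(bytes) == id); // True - the round trip is lossless
```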

    The question is: does this have any serious negative performance effect on the MS SQL DB? As I said, the security system pre-populates data from the tables and keeps everything in memory, so this pre-fetch of the entries should be very fast, without extra JOINs.

    It is also important that the size of the data kept in memory is small/compact. This is why we would not like to use a 128-bit (16-byte) Guid. 10 bytes of memory is enough for us to identify all of our entities; moreover, it can even provide detailed info about the relationships between permission entries and our entities, unlike a plain Guid... but I'm really curious what a SQL expert thinks about this idea, because I would not like to move on blindly.

    Thank you for your help in advance,

    Regards,
    János Janka
    https://github.com/janosjanka




    Saturday, April 4, 2020 3:09 PM

All replies

  • These design questions are always difficult, because in a forum post you can only tell so much. I can give one answer now, but if you were to hire me as a consultant and we sat down for a whole day looking at this in detail, I might say something completely different.

    I have a couple of reflections:

    1) I'm surprised that the join query takes 20 seconds longer. I would expect some overhead compared to a single table scan, but not that much. I think there is reason to look into some tuning here, and also to reduce the output. There is zero reason to return all columns from all tables, since some are the same by necessity.

    2) I believe there are some optimizations for int (and bigint?) keys in SQL Server. Thus, binary(10) may not perform equally well, although there is no inherent reason that it would have to be that way. (Compare string keys, which are more complex with non-binary collations.) Then again, from what you say, I don't think this is something that will matter anyway.

    3) The problem with your design is: what if there is a need to find rows for a certain type or project id across all entities? Then you need to scan all the binary(10) values to find these rows. If type_id is a single column, you can index it and find them quickly. (You could use computed columns to work around this, though.)
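    For instance, something along these lines (just a sketch, assuming the big-endian binary(10) layout you described with the 2-byte type id at offset 5; the table and column names are made up):

```sql
-- Expose the packed type id as a persisted computed column and index it.
-- SUBSTRING works on binary columns, so no conversion round trip is needed.
ALTER TABLE dbo.EFEntries ADD TypeId AS
    CAST(SUBSTRING(EFEntityId, 5, 2) AS binary(2)) PERSISTED;

CREATE INDEX IX_EFEntries_TypeId ON dbo.EFEntries (TypeId);
```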

    4) In the multi-table scenario, I assume that the child tables hold some information. With your binary(10), would there be a single table? Would the information per type then be repeated? Such redundancy often causes problems and breakdowns.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Saturday, April 4, 2020 7:01 PM
  • Thank you for your answer.

    1) Maybe the parent left join and the repeated keys can be eliminated, but it is still slower by about 10 seconds on 5 million entries than a simple SELECT without the extra table lookups. In addition, I have to compose the data in the application by hand if I go with this solution.

    SELECT
      [a].[Id],
      [a].[OwnerId],
      [a].[ParentId],
      [a].[LocalOnly],
      [a].[AllowBits],
      [a].[DenyBits],
    
      [i].[StoreId] AS EntryStoreId,
      [i].[TypeId]  AS EntryTypeId,
      [i].[LocalId] AS EntryLocalId,
    
      [o].[StoreId] AS OwnerStoreId,
      [o].[TypeId]  AS OwnerTypeId,
      [o].[LocalId] AS OwnerLocalId
    
    FROM [AcEntries] AS [a]
      INNER JOIN [AcObjects] AS [i] ON [a].[Id]       = [i].[Id]
      INNER JOIN [AcObjects] AS [o] ON [a].[OwnerId]  = [o].[Id]

    I can see that the Left Outer and Inner Join Hash Match operators, used for the Owner and Parent joins, are the most expensive ones. Removing the parent JOIN improves performance by about 6 seconds; removing the owner join gains about 5 more. The first join is fast because it uses a Merge Join.

    2) This is what I would really like to know. What would I win and what would I lose if I went with this fixed-size binary representation for the primary key? More precisely, I know what I would win, but what would I lose...

    3) As you mentioned, I cannot simply filter by project and/or type. That is correct, but I guess this is not a goal, as it has never worked so far anyway. If I used a Guid instead of this custom type, I would get the same limitation, with the extra cost of a bigger data size per field in memory.



    Saturday, April 4, 2020 10:39 PM
  • Honestly, the surrogate key based on an Id-map table does not seem like the best idea to me. The model above would look like this:

    ... and the EF Core model types:

    public sealed class ObjectId
    {
        public int Id { get; set; }
        public int StoreId { get; set; }
        public byte TypeId { get; set; }
        public int LocalId { get; set; }
    }
    
    public sealed class AcEntity
    {
        public ObjectId Oid { get; set; } = default!;
        public ObjectId OwnerOid { get; set; } = default!;
        public ObjectId? ParentOid { get; set; }
        public bool IsInherited { get; set; }
    }
    
    public sealed class AcEntry
    {
        public ObjectId Oid { get; set; } = default!;
        public int EntryType { get; set; }
        public ObjectId IdentityOid { get; set; } = default!;
        public bool LocalOnly { get; set; }
        public ulong AllowBits { get; set; }
        public ulong DenyBits { get; set; }
    }


    If I kept these 3 things in one field, I could achieve much better performance. At least, I guess. The question would rather be: a 16-byte Guid or a 10-byte binary.



    Sunday, April 5, 2020 9:09 AM
  • It is possible that creating an indexed view could cut down the load time. However, you cannot have outer joins in indexed views as I recall, so you would need to handle that. Possibly you could have the clustered index on the owner id to make that a merge join. You should also specify the NOEXPAND hint in the query to ensure that the view is actually used. (Without the hint, the optimizer will only consider it if you are running Enterprise Edition.)
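    In outline, it could look like this (a sketch based on the query you posted, untested, and assuming (Id, OwnerId) is unique in AcEntries; otherwise the index needs more key columns):

```sql
-- Indexed view covering the two inner joins; the outer join to the
-- parent cannot be part of the view and must be handled separately.
CREATE VIEW dbo.EntryObjects WITH SCHEMABINDING AS
SELECT a.Id, a.OwnerId, a.LocalOnly, a.AllowBits, a.DenyBits,
       i.StoreId AS EntryStoreId, i.TypeId AS EntryTypeId, i.LocalId AS EntryLocalId,
       o.StoreId AS OwnerStoreId, o.TypeId AS OwnerTypeId, o.LocalId AS OwnerLocalId
FROM dbo.AcEntries a
INNER JOIN dbo.AcObjects i ON a.Id = i.Id
INNER JOIN dbo.AcObjects o ON a.OwnerId = o.Id;
GO
CREATE UNIQUE CLUSTERED INDEX IX_EntryObjects ON dbo.EntryObjects (Id, OwnerId);
GO
-- NOEXPAND makes the optimizer use the indexed view on any edition:
SELECT * FROM dbo.EntryObjects WITH (NOEXPAND);
```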

    Indexed views incur quite some overhead for updates. It is possible that in your case you can take that hit - it all depends on how updates are performed in the system.

    As for the question of binary(10) vs. GUID, binary(10) is certainly the better alternative. It's smaller, and you can still extract the ids to computed columns if you need to. But whether it actually is the best design, I cannot say; I know too little about your system to tell.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, April 5, 2020 10:16 AM
  • If I kept these 3 things in one field, I could achieve much better performance. At least, I think.

    How would the model look in that case?


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, April 5, 2020 10:20 AM
  • It is almost the same, except for the ObjectId, which is a value type (struct) without an Id property:

    internal readonly struct ObjectId : IEquatable<ObjectId>
    {
        public ObjectId(int storeId, short typeId, int localId)
            => (StoreId, TypeId, LocalId) = (storeId, typeId, localId);

        public int StoreId { get; }   // 32 bits - 4 bytes
        public short TypeId { get; }  // 16 bits - 2 bytes
        public int LocalId { get; }   // 32 bits - 4 bytes

        public bool Equals(ObjectId other)
            => StoreId == other.StoreId && TypeId == other.TypeId && LocalId == other.LocalId;
    }


    And I have an EF ValueConverter<ObjectId, byte[]> that does the conversion automatically:

    private static byte[] ConvertTo(ObjectId value)
    {
        Span<byte> span = stackalloc byte[10];
        Span<byte> span2 = span.Slice(4, 2);
        Span<byte> span3 = span.Slice(6, 4);
    
        BitConverter.TryWriteBytes(span, value.StoreId);
        BitConverter.TryWriteBytes(span2, value.TypeId);
        BitConverter.TryWriteBytes(span3, value.LocalId);
    
        return span.ToArray();
    }
    
    private static ObjectId ConvertBack(byte[] bytes)
    {
        return new ObjectId
        (
            storeId: BitConverter.ToInt32(bytes, 0),
            typeId: BitConverter.ToInt16(bytes, 4),
            localId: BitConverter.ToInt32(bytes, 6)
        );
    }

    Using custom converters is very simple with EF Core:

    modelBuilder
        .Entity<AcEntry>()
        .Property(p => p.OwnerOid)
        .HasConversion(new ObjectIdByteConverter()) // ObjectId Converter.
        .HasColumnName("OwnerId")
        .HasColumnType("binary(10)")
        .IsRequired(true);


    However, I do not know which byte ordering I should use to store the data in an efficient (index-friendly) way: little- or big-endian?

    // Write bits to the array using Big-Endian ordering.
    Span<byte> span = stackalloc byte[10];
    Span<byte> span2 = span.Slice(4, 2);
    Span<byte> span3 = span.Slice(6, 4);
    BinaryPrimitives.WriteInt32BigEndian(span, 5);
    BinaryPrimitives.WriteInt16BigEndian(span2, 5);
    BinaryPrimitives.WriteInt32BigEndian(span3, 100);
    var bytes = span.ToArray();
    
    // Read bits from the array using Big-Endian ordering.
    var sbytes = bytes.AsSpan();
    var value1 = BinaryPrimitives.ReadInt32BigEndian(sbytes);
    var value2 = BinaryPrimitives.ReadInt16BigEndian(sbytes.Slice(4, 2));
    var value3 = BinaryPrimitives.ReadInt32BigEndian(sbytes.Slice(6, 4));
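    To see which ordering is index-friendly at all: SQL Server compares binary values byte by byte from the left, exactly like SequenceCompareTo does below, so only big-endian keeps the numeric order of non-negative ids. A small stand-alone check:

```csharp
using System;
using System.Buffers.Binary;

// Encode an int with the chosen byte order.
static byte[] Key(int value, bool bigEndian)
{
    var bytes = new byte[4];
    if (bigEndian) BinaryPrimitives.WriteInt32BigEndian(bytes, value);
    else BinaryPrimitives.WriteInt32LittleEndian(bytes, value);
    return bytes;
}

// Left-to-right byte comparison, as a binary(n) index key would be compared.
int cmpBig    = Key(1, true).AsSpan().SequenceCompareTo(Key(256, true));
int cmpLittle = Key(1, false).AsSpan().SequenceCompareTo(Key(256, false));

Console.WriteLine(cmpBig < 0);    // True:  1 sorts before 256
Console.WriteLine(cmpLittle < 0); // False: 256 would sort before 1
```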




    Sunday, April 5, 2020 12:35 PM
  • I was more thinking of the database design. How you design things in the client is not within the scope of this forum. :-)


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, April 5, 2020 1:09 PM
  • Oh, sorry. As I said, it seems very simple to me. Both Entity and Identity IDs would be fixed binary fields describing a concrete object in our application's business domain. We can consider this security system a 3rd-party library that does not have any knowledge of our business domain model. Anyway, this is the situation today as well. So, I would just make it possible to identify a specific entity in the system as easily as possible, without having to rewrite everything from scratch.

    Returning to the query issue, maybe an extra table could fill this gap, one where the composite binary Ids are mapped to a detailed (deconstructed) format containing these 3 fields, but I do not think it is required.

    But the byte order is still a question for me: would big- or little-endian be better for SQL indexes in this case?


    Sunday, April 5, 2020 1:26 PM
  • But the byte order is still a question for me: would big- or little-endian be better for SQL indexes in this case?

    As long as you are going to read it all into memory, that does not really matter. But then there are all the operations you have not told me about.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, April 5, 2020 3:31 PM
  • Checking the implementation and running SQL Profiler, it seems to me that they do not use navigation (relationship) properties to load data at all. So the load operations seem very lightweight, avoiding JOINs. Of course, there are other operations that need to do more; for instance, here is a batch delete:

    DECLARE @EntityIdTable TABLE (EntityId int)
    WITH EntityCTE AS
    (
       SELECT Id, ParentId FROM EFEntities WHERE Id = @EntityId
       UNION ALL
       SELECT E.Id, E.ParentId FROM EFEntities E INNER JOIN EntityCTE ON E.ParentId = EntityCTE.Id
    )
    INSERT INTO @EntityIdTable
    SELECT Id FROM EntityCTE
    
    DELETE E1 FROM EFEntries E1 INNER JOIN @EntityIdTable E2 ON E2.EntityId = E1.EFEntityId
    DELETE E1 FROM EFEntities E1 INNER JOIN @EntityIdTable E2 ON E2.EntityId = E1.Id


    Here you can find the inline SQL scripts if you are interested: SecurityStorage.cs

    But I cannot really tell how much of an effect the JOINs have on performance...

    Using MS SQL Profiler, I can see that it executes 4 simple queries to populate the data from each table at startup (there are no JOINs at all):

    SELECT TOP(2) [e].[Id], [e].[Value]
    FROM ( SELECT 1 AS Id, COUNT(1) AS Value FROM EFEntities ) AS [e]
    
    SELECT [e].[Id], [e].[OwnerId] AS [nullableOwnerId], [e].[ParentId] AS [nullableParentId], [e].[IsInherited]
    FROM [EFEntities] AS [e]
    
    SELECT [e].[EFEntityId], [e].[EntryType], [e].[IdentityId], [e].[LocalOnly], [e].[AllowBits], [e].[DenyBits]
    FROM [EFEntries] AS [e]
    
    SELECT [e].[GroupId], [e].[MemberId], [e].[IsUser]
    FROM [EFMemberships] AS [e]

    Sunday, April 5, 2020 3:57 PM
  • I decided that there are too many things here for me to piece together within the amount of effort I am prepared to spend on a forum question.

    I looked at the code in SecurityStorage.cs, and there was more than one place I would change if I were in charge. But to be fair, they follow best practice with parameterised statements to some extent. We have seen worse.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, April 5, 2020 5:38 PM
  • Thanks for your effort and help. I really appreciate it, and I totally agree with you that solving this issue is not your task, especially if nobody pays for it :-)

    The truth is that I can also see a lot of "bottlenecks" in that code base, not just in the SQL, including:

    • A waste of memory doing extra mappings. A bunch of flattened data is mapped into entities and then into yet another type, wasting heap memory and CPU time.
    • The change tracker is enabled on the EF context. There are no queryable .AsNoTracking() calls to stop tracking a ton of entities, and it is not configured at the global level either:
                context.ChangeTracker.LazyLoadingEnabled = false;
                context.ChangeTracker.AutoDetectChangesEnabled = false;
                context.ChangeTracker.QueryTrackingBehavior = QueryTrackingBehavior.NoTracking;
    • Each method call creates a new DbContext and disposes it.
    • I would rather put all these inline SQL scripts into stored procedures. The database installation scripts are there, so I do not understand why these scripts live in the code. This is laziness.

    We have to learn to cook with what we have :-)

    Sunday, April 5, 2020 7:10 PM