IoCallDriver() returns status c0000001 when IRP_MN_START_DEVICE is forwarded to the PCIe bus driver

  • Question

  • Hi, 

      Does anyone have suggestions on how to troubleshoot this issue? This is a WDM filter driver that has worked without a glitch for several years; it sits on top of the PCI Express bus driver and serves a proprietary half-length PCIe board, a single-function device that uses BAR memory and one MSI interrupt.

    This is on Windows Server 2012 R2, though it would likely be the same on any other x64 version of Windows. All was good until we got a BIOS/UEFI upgrade from the chassis vendor. The PCIe analyzer doesn't reveal any issues, and there is no PCI error reporting of any sort. Furthermore, it works like a charm on Linux on the same system.

    It's not specific to this one system; it affects any system of this kind with the exact same BIOS/FW version. We took it as far as we could with the vendor, but without a root-cause analysis of the HAL failure there's not a whole lot more that can be done at that level.

    AddDevice succeeds, and the next IRP along, IRP_MN_FILTER_RESOURCE_REQUIREMENTS, shows no specific issues; the CmResourceTypeMemory/Private/Interrupt resource descriptors don't seem to be wrong, nor do we change them. The IRP is forwarded to the bus driver with ForwardAndWait, IoCallDriver() succeeds, and the completion routine is invoked without issues.

    Next is IRP_MN_START_DEVICE, which we forward with ForwardAndWait without any changes whatsoever. IoCallDriver(pdo, Irp) fails with status c0000001. The completion routine is invoked and the event is signaled. Irp->IoStatus.Status also has the same value.

    0xC0000001 -> STATUS_UNSUCCESSFUL
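To see why c0000001 fails NT_SUCCESS() while STATUS_PENDING does not, here is a minimal user-mode sketch that decodes the NTSTATUS fields. The typedef and macros are local stand-ins that copy the WDK values so the example builds outside the WDK; the helper function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins copying the WDK definitions (ntdef.h / ntstatus.h)
 * so the example builds in user mode. */
typedef int32_t NTSTATUS;
#define NT_SUCCESS(s)       (((NTSTATUS)(s)) >= 0)
#define STATUS_SUCCESS      ((NTSTATUS)0x00000000)
#define STATUS_PENDING      ((NTSTATUS)0x00000103)
#define STATUS_UNSUCCESSFUL ((NTSTATUS)0xC0000001u)

/* NTSTATUS layout: bits 31:30 severity (0 success, 1 informational,
 * 2 warning, 3 error), bit 29 customer flag, bit 28 reserved,
 * bits 27:16 facility, bits 15:0 code. */
unsigned nt_severity(NTSTATUS s) { return ((uint32_t)s >> 30) & 0x3u; }
unsigned nt_facility(NTSTATUS s) { return ((uint32_t)s >> 16) & 0xFFFu; }
unsigned nt_code(NTSTATUS s)     { return (uint32_t)s & 0xFFFFu; }
```

So 0xC0000001 is severity 3 (error), facility 0, code 1. Note that STATUS_PENDING is itself a success code, so NT_SUCCESS() on IoCallDriver's return value alone cannot tell "done" from "still in progress".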

    After that the device is removed automatically: we get an IRP_MN_REMOVE_DEVICE and that's the end of it. Device Manager shows it with a yellow bang, and we can see that no memory or IRQ has been assigned to it.

    Examining the device object in WinDbg with !devobj/!devstack isn't much help either.

    There are no errors leading up to the START_DEVICE IRP failure. It seems the HAL doesn't like something and thus fails to assign resources, but beyond that, are there any further WDM-driver troubleshooting steps that would give some visibility into the nature of the failure?

    btw, there's absolutely nothing relevant in the Event Viewer that we can tell.

    Pablo



    Pablo De Paulis

    Tuesday, October 13, 2015 1:46 PM

Answers

  • To close the loop on this thread, it turned out the problem was that the BIOS was incorrectly configuring the Max Payload Size on the bridges to be less than what the devices on the downstream side of the bridges were configured with. The PCIe bus driver does not support this configuration (which also violates the PCIe spec, although there is an implementation note that seems to permit it), so it would not provide hardware resources to the device and the upstream bridge.

    Currently, the PCIe bus driver will not try to rebalance the resources to fix the problem, so such devices are deconfigured.
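Brian's diagnosis can be checked from config space. A minimal sketch, assuming you have already read the PCIe Device Control register (offset 0x08 within each component's PCI Express capability) for every component on the path from the root port down to the endpoint, for example from a Linux config-space dump. The function names are hypothetical, and the check covers only the rule described here (a bridge configured with a smaller Max Payload Size than a component below it), not every MPS rule in the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Max_Payload_Size is bits 7:5 of the PCIe Device Control register.
 * Encodings 000b..101b mean 128..4096 bytes; 110b and 111b are reserved.
 * Returns 0 for a reserved encoding. */
unsigned mps_bytes(uint16_t devctl)
{
    unsigned field = (devctl >> 5) & 0x7u;
    return field <= 5 ? 128u << field : 0u;
}

/* devctl[0] is the component nearest the root port, devctl[n-1] the endpoint.
 * Returns the index of the first component whose MPS is larger than its
 * upstream neighbour's (or whose encoding is reserved), or -1 if none. */
int first_mps_violation(const uint16_t *devctl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned mine = mps_bytes(devctl[i]);
        if (mine == 0)
            return (int)i;                      /* reserved encoding */
        if (i > 0 && mine > mps_bytes(devctl[i - 1]))
            return (int)i;                      /* downstream exceeds upstream */
    }
    return -1;
}
```

For a path of root port (256 bytes), XIO bridge (128 bytes) and endpoint (256 bytes), the function returns 2: the endpoint is configured for a larger payload than the bridge above it.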

     -Brian


    Azius Developer Training www.azius.com Windows device driver, internals, security, & forensics training and consulting. Blog at www.azius.com/blog

    Friday, November 13, 2015 6:37 PM
    Moderator

All replies

  • Well, it's a function driver that sits on top of the PCI bus driver.

    So !devnode shows something like this (heavily trimmed):

    lkd> !devnode ffffe00000977010
    DevNode 0xffffe00000977010 for PDO 0xffffe00000959730

    ...

    lkd> !devobj ffffe00000959730
    Device object (ffffe00000959730) is for:
     NTPNP_PCI0100 \Driver\pci DriverObject ffffe0000074a7f0


    Pablo De Paulis

    Tuesday, October 13, 2015 4:59 PM
  • I would set a breakpoint in your IRP_MN_START_DEVICE handler, and use the !cmreslist command to examine the resources in the IRP's stack location Parameters.StartDevice.AllocatedResourcesTranslated. If the resources look OK, then I'd turn on the PCI driver's ETW logger (look here for info on how to do that). Using LOGMAN QUERY PROVIDERS, I see that the PCI driver's logger is:
    Microsoft-Windows-PCI                    {1A9443D4-B099-44D6-8EB1-829B9C2FE290}

     -Brian


    Tuesday, October 13, 2015 6:08 PM
    Moderator
  • Brian, thanks very much!

    OK, found the GUID name "Microsoft-Windows-PCI".

    Now, in the ETW logger, do I choose "Performance" or "System" for the Trace Session (from the GUI)? I suppose the latter; in that case, is this going to give me any insight into the START_DEVICE IRP? I ask because I've used ETW with tracelog and xperf in the past to trace ISR and DPC performance, but I wasn't aware it would give me the kind of data I'm looking for.

    In that case, would I use xperf to read it?

    Pavel, we don't do any hackery with resources when handling IRP_MN_FILTER_RESOURCE_REQUIREMENTS. In fact, we just forward it to the PCI bus driver; I recently hacked it a bit to dump the list to make sure, and indeed the resources are fine, so we don't alter anything. Again, keep in mind this all worked fine until a BIOS update caused the issue.



    Pablo De Paulis

    Tuesday, October 13, 2015 11:46 PM
  • It should give you information about any errors it has internally that is causing it to return STATUS_UNSUCCESSFUL. Was the CM_RESOURCE_LIST in the START_DEVICE IRP OK?

     -Brian


    Wednesday, October 14, 2015 12:49 AM
    Moderator
  • I was able to dump the CM_RESOURCE_LIST; it would appear a CmResourceTypePort got added even though the device doesn't use I/O ports. Furthermore, that type is not in the PIO_RESOURCE_DESCRIPTOR list in IRP_MN_FILTER_RESOURCE_REQUIREMENTS.

    We think that is the source of the failure. What's interesting is that the CmResourceTypePort resource->u.Port.Start.u.LowPart matches the (correct) CmResourceTypeMemory, so there's obviously some confusion somewhere: we get both, but only the latter is used. CmResourceTypeInterrupt seems alright.

    What I'm struggling with is the ETW logger for the provider:

    Microsoft-Windows-PCI                    {1A9443D4-B099-44D6-8EB1-829B9C2FE290}

    I got the .etl, but it can't be decoded without the TMF file. It would appear that has to come from MSFT, unless it's the system.tmf that apparently ships with the DDK, which I haven't been able to locate yet.

    Any idea how to decode the ETW Microsoft-Windows-PCI logger output?


    Pablo De Paulis

    Wednesday, October 14, 2015 6:10 PM
  • You don't need the TMF file, because that is a manifest-based logger. Here is information on decoding.

    Is there a filter driver that might be messing up the resources?

     -Brian


    Wednesday, October 14, 2015 6:20 PM
    Moderator
  • I was able to dump the CM_RESOURCE_LIST; it would appear a CmResourceTypePort got added while the device doesn't use I/O Ports


    This is very strange. Can you dump the device config space in linux to verify that it does not have a Port type BAR?
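A minimal sketch of the check suggested here: classify the six BARs of a type-0 header from a raw config-space dump (on Linux, for example, the bytes in /sys/bus/pci/devices/<BDF>/config). Bit 0 of a BAR distinguishes I/O from memory; for memory BARs, bits 2:1 give the type, and a 64-bit BAR uses the following BAR slot for its upper address bits. Function and enum names are hypothetical, and this only reads the current contents; whether a BAR is implemented at all is determined by the sizing probe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum bar_kind { BAR_IO, BAR_MEM32, BAR_MEM64, BAR_MEM_RESERVED };

/* Little-endian 32-bit read from a config-space dump. */
static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Classify BAR `index` (0..5) of a type-0 header, at offset 0x10 + 4*index. */
enum bar_kind classify_bar(const uint8_t *cfg, unsigned index)
{
    uint32_t bar = cfg_read32(cfg, 0x10 + 4u * index);
    if (bar & 0x1u)
        return BAR_IO;                  /* bit 0 set: I/O space BAR */
    switch ((bar >> 1) & 0x3u) {        /* bits 2:1: memory type */
    case 0x0: return BAR_MEM32;
    case 0x2: return BAR_MEM64;
    default:  return BAR_MEM_RESERVED;
    }
}

/* Count I/O BARs, skipping the upper half of each 64-bit memory BAR so it
 * is not misread as a BAR of its own. */
unsigned count_io_bars(const uint8_t *cfg)
{
    unsigned n = 0;
    for (unsigned i = 0; i < 6; i++) {
        enum bar_kind k = classify_bar(cfg, i);
        if (k == BAR_IO)
            n++;
        else if (k == BAR_MEM64)
            i++;                        /* next slot holds address bits 63:32 */
    }
    return n;
}
```

If count_io_bars() returns 0 for the board, the I/O-port descriptor could not have come from the device's own BARs.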

    Wednesday, October 14, 2015 9:09 PM
  • Pavel, Brian, my apologies! I'm embarrassed to say it was a missing break statement! I didn't dump the CM_RESOURCE_LIST with remote WinDbg; rather, I added instrumentation and used DbgView on the target. Specifically, I moved an existing function that gathered the resource list after a successful START_DEVICE IRP further up, so it dumps the list regardless. Unbeknownst to me it was buggy, but normally the prints are disabled, so no one noticed.

    I then got the remote debugging session going, so I'm going through that now and expect something else soon, but the system is being shared with the HW folks doing their thing.

    NOTE: As it turns out, bypassing a transparent switch in the HW "solves" the problem and the board is assigned resources. Obviously it's not a solution, since the switch is there for a reason.

    OK, that said, I'm still having trouble with the ETW Microsoft-Windows-PCI logger. No matter what I do, I only get one event.

    NOTE: MSFT Professional Support is unable to help at this level.

    So, e.g., creating an event collector set:

    logman create trace trace_pci2 -p "Microsoft-Windows-PCI" 0x8000000000000001 0xFF -nb 16 16 -bs 64 -o pci2.etl -ct System -max 20 -ets

    I can disable and re-enable the device in Device Manager to "reproduce" the problem. Nonetheless, there's always one and only one event (as seen by tracerpt), like:

    <RenderingInfo Culture="en-US">
      <Opcode>Header</Opcode>
      <Provider>MSNT_SystemTrace</Provider>
      <EventName xmlns="http://schemas.microsoft.com/win/2004/08/events/trace">EventTrace</EventName>
    </RenderingInfo>
    <ExtendedTracingInfo xmlns="http://schemas.microsoft.com/win/2004/08/events/trace">
      <EventGuid>{68fdd900-4a3e-11d1-84f4-0000f80464e3}</EventGuid>

    I even tried running a trace without an .etl file, but again there's nothing going on there as far as I can tell. Any suggestions on the logman command-line options?


    Pablo De Paulis

    Thursday, October 15, 2015 2:36 PM
  • It's actually a TI XIO PCIe-to-PCIe bridge; it provides access to the PCI Express bus, and behind it we have two devices that also have a PCIe interface, one being the CPU and the other an FPGA.

    We have a strapping option to bypass the XIO, and with it bypassed, START_DEVICE doesn't fail. However, as I said, it's not a solution for this particular product (it would be for a different incarnation of it). We are also looking at errata, etc.

    As I said, there are no errors reported upstream, nor anything out of place in any of the PCIe transactions as seen by the analyzer. However, it's obvious the HAL doesn't like something when it does the bus scanning, and fails. That's why I am very interested in getting the events and/or errors from "Microsoft-Windows-PCI".


    Pablo De Paulis

    Thursday, October 15, 2015 7:28 PM
  • Unfortunately, the manifest-based loggers don't provide a huge amount of debugging info - especially compared to the ETW trace messages, which you need the private PDB files to decode. I had hoped some of the error messages might be present in the public log messages.

     -Brian


    Thursday, October 15, 2015 7:56 PM
    Moderator
  • Sorry, we got stuck because the serial port suddenly died.

    OK, so I have the IRP and the current stack location. The trouble is how to obtain the address of the CM_RESOURCE_LIST to use with !cmreslist.

    There's not much in the MSFT docs that I can tell, other than that one must have its address. It's not obvious how to get it from the stack location address.

    Is !arbiter of help here? 


    Pablo De Paulis

    Monday, October 19, 2015 4:59 PM
  • OK, actually the way to do this is to set a breakpoint at driver load:

     sxe ld <driver>.sys  <== ... your driver name whatever-it-is

    When the target reboots (.reboot), it hits that breakpoint; then add a new breakpoint in your IRP_MN_... dispatch routine, in this particular case in the IRP_MN_START_DEVICE handler, before the IRP is sent down to the PCI driver.

    At that point, dump the (PCI) devnode for the device, e.g.:

    !devnode 0 1 <driver>

    Maybe I'm missing something, but it appears !cmreslist is already embedded in the above command, so I didn't need to call it explicitly.

    From what I can tell, an undesired descriptor was added to the CM_RESOURCE_LIST (shown below). Note that the device requires an MSI interrupt and the 3rd descriptor has the proper flag; the 4th, superfluous one doesn't, so I have to assume it's a legacy one. The 3rd (and the 1st and 2nd) do appear in IRP_MN_FILTER_RESOURCE_REQUIREMENTS (though I'll have to reconfirm this), but the 4th seems to have been added. Either way, in all likelihood this is the root of the problem.

          Preferred Descriptor 2 - Interrupt (0x2) Device Exclusive (0x1)
            Flags (0x03) - LATCHED MESSAGE
            0xfffffffe - 0xfffffffe
          Alternative Descriptor 3 - Interrupt (0x2) Shared (0x3)
            Flags (0000) - LEVEL_SENSITIVE
            0x0 - 0xffffffff
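The Flags lines in this output come down to two bits: 0x03 prints as LATCHED MESSAGE and 0000 as LEVEL_SENSITIVE. A tiny sketch with local constants that are assumed to mirror CM_RESOURCE_INTERRUPT_LATCHED (0x1) and CM_RESOURCE_INTERRUPT_MESSAGE (0x2) from wdm.h (check them against your WDK headers), showing why Descriptor 3 reads as a legacy, line-based interrupt. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match CM_RESOURCE_INTERRUPT_LATCHED / _MESSAGE in wdm.h. */
#define INT_FLAG_LATCHED 0x0001
#define INT_FLAG_MESSAGE 0x0002

/* MSI/MSI-X descriptor: the MESSAGE bit is set. */
bool is_message_signaled(uint16_t flags) { return (flags & INT_FLAG_MESSAGE) != 0; }

/* Level-sensitive means the LATCHED bit is clear (the value 0 prints as
 * LEVEL_SENSITIVE), which is what a shared legacy INTx line looks like. */
bool is_level_sensitive(uint16_t flags)  { return (flags & INT_FLAG_LATCHED) == 0; }
```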

    We'll have to compare without the XIO device.


    Pablo De Paulis

    Monday, October 19, 2015 11:02 PM
  • Unless there is a filter in the DevStack, the resources must have been reported by the device to the PCI bus driver. If you want to see exactly what the device is reporting, catch the IRP_MN_QUERY_RESOURCE_REQUIREMENTS IRP on the way UP the DevStack. You may also see this in the registry at HKLM\System\CCS\Enum\PCI\<hardware-id>\LogConf; double-click on the config structure.

     -Brian


    Monday, October 19, 2015 11:42 PM
    Moderator
  • Aha thanks Brian! 

    I was aware of the Interrupt Management key, but had no idea what LogConf was... it's great to know this.

    Yeah, indeed I see the Alternative Descriptor in there as well. 

    OK, but my understanding is that IRP_MN_QUERY_RESOURCE_REQUIREMENTS is only handled by a bus driver, which we aren't, so we never handle it, and I don't even know whether we get it.

    At any rate, I should see it, and it should give me an opportunity to catch it and nuke it in IRP_MN_FILTER_RESOURCE_REQUIREMENTS. Although I didn't see it in the debug prints last week, I might have missed it, since it's convoluted to do this with prints. It's a lot better to do it with the kernel debugger, which is what I'll do.



    Pablo De Paulis

    Tuesday, October 20, 2015 12:02 AM
  • Yes, it is handled by the bus driver, but the bus driver gets the device's hardware resource requirements from the device itself, via the BARs.
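For reference, the standard PCI sizing probe by which a BAR reports how much address space it needs: software writes all ones to the BAR, reads it back, masks off the low flag bits, and takes the two's complement. A minimal sketch for a 32-bit memory BAR only (64-bit BARs combine two probes, and I/O BARs use a different flag mask); the function name is hypothetical and this is not the PCI bus driver's code.

```c
#include <assert.h>
#include <stdint.h>

/* Size of a 32-bit memory BAR, given the value read back after writing
 * 0xFFFFFFFF to it. Returns 0 if the BAR is not implemented. */
uint32_t mem_bar_size_from_probe(uint32_t readback)
{
    uint32_t base_bits = readback & ~0xFu;  /* bits 3:0 are type/prefetch flags */
    if (base_bits == 0)
        return 0;                           /* hardwired to zero: no BAR */
    return ~base_bits + 1u;                 /* lowest writable bit = size */
}
```

A read-back of 0xFFF00000, for instance, means a 1 MB (0x100000-byte) window, the same size as the memory entry in the CmResList posted later in this thread.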

    Remember, when removing resources from IRP_MN_FILTER_RESOURCE_REQUIREMENTS, do so on the IRP's way UP the DevStack, and do not re-order the list.
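A minimal user-mode sketch of "remove, but don't re-order", using a hypothetical, simplified descriptor struct rather than the real IO_RESOURCE_DESCRIPTOR. The constants are assumed to mirror CmResourceTypeInterrupt/CmResourceTypeMemory and IO_RESOURCE_PREFERRED/IO_RESOURCE_ALTERNATIVE in wdm.h (check them against your WDK headers). A real filter would also have to fix up the list's Count and the overall ListSize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirrors of the WDK constants. */
#define RES_TYPE_INTERRUPT  2     /* CmResourceTypeInterrupt */
#define RES_TYPE_MEMORY     3     /* CmResourceTypeMemory */
#define RES_OPT_PREFERRED   0x01  /* IO_RESOURCE_PREFERRED */
#define RES_OPT_ALTERNATIVE 0x08  /* IO_RESOURCE_ALTERNATIVE */

/* Simplified stand-in: only the fields needed to show the filtering. */
struct res_desc {
    uint8_t option;
    uint8_t type;
};

/* Drop every descriptor of `type` whose Option has any bit of `option` set,
 * compacting in place so the survivors keep their relative order.
 * Returns the new descriptor count. */
size_t remove_descriptors(struct res_desc *list, size_t count,
                          uint8_t type, uint8_t option)
{
    size_t out = 0;
    for (size_t in = 0; in < count; in++) {
        if (list[in].type == type && (list[in].option & option))
            continue;                   /* skip the unwanted descriptor */
        list[out++] = list[in];         /* stable compaction, no re-ordering */
    }
    return out;
}
```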

     -Brian


    Tuesday, October 20, 2015 12:08 AM
    Moderator
  • OK, actually it looks like there are supposed to be two different entries under LogConf (I'm sure you knew that! ;-) ):

    LogConf\BootConfig — from IRP_MN_QUERY_RESOURCES
    LogConf\BasicConfigVector — from IRP_MN_QUERY_RESOURCE_REQUIREMENTS

    On a working system the latter (the one we discussed yesterday) also has the Alternative Interrupt entry, so that was a surprise; however, the big difference is that the former is not even present on the failing system.

    So, does that mean the former IRP didn't make it to the bus driver? Or that it failed in some fashion?

    I suppose I could call IoGetDeviceProperty with DevicePropertyBootConfiguration and dump it a priori. Other than that, I don't think there's any way to influence the former from SW.

    btw, we see some odd PCI scanning of the XIO device at boot up, so it's possible it's confusing things and we end up in this situation.


    Pablo De Paulis

    Tuesday, October 20, 2015 1:49 PM
  • It means that the enumeration sequence didn't get far enough for the PnP manager to write out the value to the registry or the bus driver didn't return the resources. It seems likely that your XIO device may be causing the problem.

     -Brian


    Tuesday, October 20, 2015 6:08 PM
    Moderator
  • What we found out is that LogConf\BootConfig is not always there; I tried a 3rd machine (though with 2008 R2) and didn't see it, yet the board started.

    OK, that said, I made a (gross) mistake yesterday in that I gave you a print of the IoResList rather than the CmResList. As it turns out, the latter is clean as a whistle. I added a breakpoint at IRP_MN_QUERY_CAPABILITIES and I only see the IoResList there (with the Alternative legacy IRQ); then, when we get IRP_MN_START_DEVICE, we get the CmResList. I'll copy/paste it for you:

       CmResourceList at 0xffffc00000510810  Version 1.1  Interface 0x5  Bus #0x7
        Entry 0 - Memory (0x3) Device Exclusive (0x1)
          Flags (0x80) - 
          Range starts at 0x0000000092d00000 for 0x100000 bytes
        Entry 1 - DevicePrivate (0x81) Device Exclusive (0x1)
          Flags (0000) - 
          Data - {0x00000001, 0000000000, 0000000000}
        Entry 2 - Interrupt (0x2) Device Exclusive (0x1)
          Flags (0x03) - LATCHED MESSAGE 
          Message Count 1, Vector 0xffffffc8, Group 0, Affinity 0

    It looks very clean; nothing is missing (the DevicePrivate is a bit of a mystery, but it's also present on working systems).

    The very interesting piece of this is that I tracked it down to the IoCallDriver(Ext->lowerDeviceObject, Irp); call. On return, the CmResList is still there, with the same flags and no error in the Flags. However, once the completion routine signals the KEVENT, Irp->IoStatus.Status (which we return) has the infamous

    c0000001

    which makes the NT_SUCCESS() check fail.

    Note that I wasn't able to get it from WinDbg; for some reason it can't access that memory. However, I know it from the DbgView prints.

    Even with that error, the CmResList is still untouched. Obviously, after we pass the status back on IRP completion (without attempting to start the device), the stack unwinds and the CmResList is emptied, so by the end of all this it's gone.

    Now, we have a textbook FwdAndWait() function that is used for all of the IRPs, and we always return Irp->IoStatus.Status, so there should be no difference here, should there?

    But the point of this experiment is to show that from the PnP manager's standpoint (which fills in the CmResList, as I understand it), all resources were properly filtered and assigned, as you can see. However, when we pass them to the PCI bus driver, it fails with that error code.

    Remember that this only happens with the XIO transparent chip. That is the odd thing here.

    Do you know if at the IRP_MN_START_DEVICE there are any PCIe transactions (like probing the resources on the bus/slot?), or is it all internal to the PCI bus driver? 


    Pablo De Paulis

    Tuesday, October 20, 2015 10:33 PM
  • As long as the status is anything other than STATUS_PENDING, then it is a requirement that the return status from a dispatch routine MUST be the same status specified when the IRP is completed (and on its way up the DevStack). So, if the return status of the IoCallDriver call is not pending, then the resources in the IRP are valid.
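The rule above can be written as a small predicate. A sketch with local stand-ins for NTSTATUS and the status codes (values copied from the WDK); the function names are hypothetical. If the lower driver honours the rule, the status a forward-and-wait helper reports is the same whether or not it checks for STATUS_PENDING before waiting, which points back at the driver below rather than at the wait logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the WDK type and status values. */
typedef int32_t NTSTATUS;
#define STATUS_SUCCESS      ((NTSTATUS)0x00000000)
#define STATUS_PENDING      ((NTSTATUS)0x00000103)
#define STATUS_UNSUCCESSFUL ((NTSTATUS)0xC0000001u)

/* Status a forward-and-wait helper should report, given IoCallDriver's return
 * value and the Irp->IoStatus.Status seen after the completion routine ran. */
NTSTATUS forwarded_status(NTSTATUS call_status, NTSTATUS iosb_status)
{
    return call_status == STATUS_PENDING ? iosb_status : call_status;
}

/* The contract: unless the dispatch routine returned STATUS_PENDING, its
 * return value must equal the status the IRP was completed with. */
bool status_contract_holds(NTSTATUS call_status, NTSTATUS iosb_status)
{
    return call_status == STATUS_PENDING || call_status == iosb_status;
}
```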

    The START_DEVICE IRP is processed from the bottom of the DevStack upwards, and it is the responsibility of each driver in the stack to ensure that drivers above it are able to access the device. I haven't looked at the PCI bus driver sources in a long time, so I don't remember exactly what it does when it sees a START_DEVICE IRP, but in theory, it should be talking to the root port and assigning the hardware resources to it.

    At this point, I suspect that you might either have a corruption or a sync error somewhere in your driver, or your hardware is doing something unexpected.

     -Brian


    Tuesday, October 20, 2015 11:57 PM
    Moderator
  • thanks a lot!

    Well, for now I hacked the code a bit to ignore the FwdAndWait "error", and everything worked fine! I don't know if the device is operational, since we've modified the HW a bit; however, the resources got assigned, and all looks great.

    Looking at FwdAndWait a bit more, it looks like a latent bug; it goes something like this:

        status = IoCallDriver(Ext->lowerDeviceObject, Irp);

        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);

        return Irp->IoStatus.Status;
    }

    So it doesn't check for (status == STATUS_PENDING) before waiting on the KEVENT; it waits regardless. I suspect the timing changed here, and now the IRP completes without having to wait, and thus we see the error, but this code has been like this for a long time.

    OK, we were debating this internally: our view is that the START_DEVICE IRP should not need to assign the resources; I would think they were already assigned by the PnP manager, since at the onset of the IRP I see the CmResList populated. If that is the case, it makes a lot of sense that the IRP is completed synchronously in IoCallDriver(), I think.

    Maybe it's the streetlight effect, and it still makes little sense that this would change with and without the XIO, but it's a breakthrough for sure!


    Pablo De Paulis

    Wednesday, October 21, 2015 1:00 AM
  • Yes, your code isn't quite right. You want something like this:

    //+
    // Prepare to send the IRP down, and catch it on the way back up
    //-

    KeInitializeEvent (&completion_event, SynchronizationEvent, FALSE);
    IoCopyCurrentIrpStackLocationToNext (Irp);
    IoSetCompletionRoutine (Irp, PNP_fdo$$resume_waiting_thread, &completion_event, TRUE, TRUE, TRUE);

    //+
    // Pass the IRP down to the next device in the DevNode to allow filters and/or the
    // PDO to modify the allocated resources.
    //-

    if ((status = IoCallDriver (dev_extension->below_us, Irp)) == STATUS_PENDING)
        {

        //+
        // The device below us returned STATUS_PENDING, which means that we have to wait for it to
        // complete the IRP
        //-

        if (!NT_SUCCESS (status = KeWaitForSingleObject (&completion_event, Executive, KernelMode, FALSE, NULL)))
            {

            //+
            // Something weird happened
            //-

            _DBG_ERROR (("KeWaitForSingleObject returned non-success status %08x\n", status));
            }

        //+
        // Get the IRP's completion status
        //-

        status = Irp->IoStatus.Status;
        }

    //+
    // The IRP is now on its way up the stack. If everything went well below us, it is OK to
    // process the IRP, otherwise just leave the error status in the IRP and complete the IRP
    //-

    if (NT_SUCCESS (status))
        {
        // ... (the rest of the START_DEVICE processing was not included in the post)
        }
    
     -Brian


    Wednesday, October 21, 2015 1:04 AM
    Moderator
  • Good point Pavel. Yes... that is fine.

    Still the IoCallDriver() seems to be returning an error; so if I do this:

    ==========================================================

        status = IoCallDriver(
                ((DEVICE_EXTENSION*)fdo->DeviceExtension)->pLowerDeviceObject, pIrp);

        if (status == STATUS_PENDING)
        {
            // Wait for completion
            KeWaitForSingleObject(&WaitEvent, Executive, KernelMode, FALSE, NULL );

            return pIrp->IoStatus.Status;
        }

        return status;

    ========================================================

    The "status" from IoCallDriver() seems to be an error value, since it fails the NT_SUCCESS() check. And the documentation says:

    IoCallDriver returns the NTSTATUS value that a lower driver set in the I/O status block for the given request, or STATUS_PENDING if the request was queued for additional processing.

    btw, the Completion routine just calls KeSetEvent() and returns STATUS_MORE_PROCESSING_REQUIRED


    Pablo De Paulis

    Wednesday, October 21, 2015 3:04 PM
  • Quick update: we found that memory decoding for the device's BAR was disabled (Memory Space Enable, bit 1 of the Command register at config offset 0x04).
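For reference, the Command register bits involved, per the PCI Local Bus specification: Memory Space Enable is bit 1 (mask 0x2) and Bus Master Enable is bit 2 (mask 0x4); a function cannot issue MSI writes with Bus Master Enable clear. A minimal sketch with hypothetical helper names; the example values 0x0546 and 0x0544 are the Cmd[...] values that !pci prints for the XIO ports later in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Command register (config offset 0x04) bits. */
#define PCI_CMD_IO_SPACE       (1u << 0)
#define PCI_CMD_MEMORY_SPACE   (1u << 1)   /* Memory Space Enable */
#define PCI_CMD_BUS_MASTER     (1u << 2)   /* needed for DMA and MSI writes */
#define PCI_CMD_PARITY_ERR_RSP (1u << 6)
#define PCI_CMD_SERR_ENABLE    (1u << 8)
#define PCI_CMD_INTX_DISABLE   (1u << 10)

bool mem_decode_enabled(uint16_t cmd) { return (cmd & PCI_CMD_MEMORY_SPACE) != 0; }
bool bus_master_enabled(uint16_t cmd) { return (cmd & PCI_CMD_BUS_MASTER) != 0; }

/* Read-modify-write: turn memory decoding on, leave every other bit alone. */
uint16_t with_mem_decode(uint16_t cmd) { return (uint16_t)(cmd | PCI_CMD_MEMORY_SPACE); }
```

Decoded this way, 0x0546 has memory decoding, bus mastering, parity response, SERR# and INTx disable set, while 0x0544 is the same value with memory decoding off.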

    Enabling it in AddDevice did succeed, and the BAR ends up enabled by login time; however, IoCallDriver() still fails on the START_DEVICE IRP, so I had to keep the hack to let it continue doing its thing.

    Note that the BAR address(es) appear to be functional, at least at first glance.

    However, at this point the device is not fully functional: it can be seen by the OAM configuration, but it fails when downloading its firmware. Likely some other bit in the config space register(s) is disabled, so we have to poke around a bit more.

    We speculate that one thing leads to the other... once we are able to keep the PCI bus driver from failing the START_DEVICE IRP, we are relatively certain all will be good.


    Pablo De Paulis

    Wednesday, October 21, 2015 11:06 PM
  • We've had some interaction with HP indeed; I'll get back to that point later.

    Enabling the BARs (and the interrupt) is exactly what the bus driver should do upon successful completion of START_DEVICE.  

    Yes, it looks like you hit the nail on the head. We set up a PCIe analyzer on a good machine from another vendor (SM), also high end, with the same board, the same Win 2k12 R2, etc. We break at FwdAndWait (i.e. before IoCallDriver() to the PCI bus driver) and again after it returns.

    1. On the HP obviously we get the C...01 error,
    2. On the SM obviously we don't get any error in the IoCallDriver()
    3. On the HP there are NO (zero, zip) PCIe transactions between before and after
    4. On the SM there are a lot of PCIe transactions, and indeed we can see accessing the MSI Capability Register Set which makes sense (more below).

    So what we tried was not only enabling BAR access in the Command register (offset 0x4) but also, since we know the location of the MSI control register, enabling MSI (bit 0); we do both in AddDevice. It still doesn't help.

    The reason is most likely that the Message Address Register (Dword 1 of the MSI Capability register set) is not set up at all, i.e. it's all zeros on the HP machine.

    So, enabling the MSI doesn't help since there would be no address to route it to. 

    On the SM we can see this sequence of accesses to the MSI Capability registers when we forward START_DEVICE to the PDO (bus driver):

    1. Disable MSI in header (Dword 0)
    2. Set the Low address word (Dword 1)
    3. Set the High address word (Dword 2)
    4. Put (16-bit) data in Data register (Dword 3)
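The register layout behind those four steps, as a minimal sketch over a 256-byte config-space dump: walk the capability list from offset 0x34 to the MSI capability (ID 0x05), then read Message Control, Message Address and Message Data. For a 64-bit-capable function (Message Control bit 7) the data register sits at offset 12 instead of 8, which matches the Dword 3 placement in the list above. Names are hypothetical; obtaining the dump is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS_CAP_LIST 0x10   /* Status register bit 4 */
#define PCI_CAP_PTR         0x34
#define PCI_CAP_ID_MSI      0x05

static uint16_t rd16(const uint8_t *cfg, size_t off)
{ return (uint16_t)(cfg[off] | (cfg[off + 1] << 8)); }

static uint32_t rd32(const uint8_t *cfg, size_t off)
{ return (uint32_t)rd16(cfg, off) | ((uint32_t)rd16(cfg, off + 2) << 16); }

/* Offset of capability `id`, or 0 if absent. The guard stops a corrupt,
 * looping list from spinning forever. */
size_t find_cap(const uint8_t cfg[256], uint8_t id)
{
    if (!(rd16(cfg, 0x06) & PCI_STATUS_CAP_LIST))
        return 0;
    size_t off = cfg[PCI_CAP_PTR] & 0xFC;
    for (int guard = 0; off != 0 && guard < 48; guard++) {
        if (cfg[off] == id)
            return off;
        off = cfg[off + 1] & 0xFC;   /* next-capability pointer */
    }
    return 0;
}

struct msi_info {
    int enabled;        /* Message Control bit 0 */
    int is64;           /* Message Control bit 7: 64-bit address capable */
    uint64_t address;   /* Message Address (+ Upper Address if is64) */
    uint16_t data;      /* Message Data */
};

/* Dword 0 holds ID, next pointer and Message Control (upper 16 bits). */
struct msi_info read_msi(const uint8_t cfg[256], size_t cap)
{
    struct msi_info m;
    uint16_t ctrl = rd16(cfg, cap + 2);
    m.enabled = ctrl & 0x1;
    m.is64 = (ctrl >> 7) & 0x1;
    m.address = rd32(cfg, cap + 4);
    if (m.is64) {
        m.address |= (uint64_t)rd32(cfg, cap + 8) << 32;
        m.data = rd16(cfg, cap + 12);
    } else {
        m.data = rd16(cfg, cap + 8);
    }
    return m;
}
```

An all-zero address with MSI disabled, as observed on the HP machine, shows up here as address == 0 and enabled == 0.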

    So, one question I have for you is: where is the address information supposed to come from? How does the PDO know what address to assign to this MSI? Is that something in the driver stack or in the START_DEVICE IRP?

    One suspicion is that the PDO fails without even trying because the MSI address wasn't assigned... if that is supposed to be passed to it from the stack and/or the IRP.


    Pablo De Paulis

    Thursday, October 22, 2015 10:40 PM
  • The hardware resources are apportioned and managed by an undocumented component called the Arbiter. No, there is no interface for you, nor is there a way for you to influence it other than via the IRP_MN_FILTER_RESOURCE_REQUIREMENTS IRP.

    It seems like there is an issue with either the HP BIOS/UEFI or how the PCIe bridge is wired, configured, or managed. I ran into an issue similar to this with Compaq, 10-15 years ago.

    I would proceed like this: ensure that the IRP_MN_FILTER_RESOURCE_REQUIREMENTS IRP only contains exactly what your device needs, and if that doesn't work then you have the bus traces you can take back to HP to help them trace the problem.

     -Brian


    Friday, October 23, 2015 12:41 AM
    Moderator
  • You both have been of great assistance!

    I will do that, Brian, though I'm on the fence about it. Meanwhile, we continued looking at the XIO configuration with the analyzer and the PCI extensions in WinDbg.

    !arbiter caught a possible conflict, though it might be a red herring, so I'd like to ask your opinion here. I tried this after a couple of reboots, both at START_DEVICE and after all IRPs had completed. I see this 'C' flag in the memory range used by the XIO upstream port (and thus our device). For instance:

    Root bridge:

    02:0  8086:2f04.02  Cmd[0546:.mb.ps]  Sts[0010:c....]  Intel PCI-PCI Bridge 0->0x5-0x9

    XIO:

    PCI Segment 0 Bus 0x5
    00:0  104c:8232.02  Cmd[0546:.mb.ps]  Sts[0010:c....]  TI PCI-PCI Bridge 0x5->0x6-0x9
    PCI Segment 0 Bus 0x6
    00:0  104c:8233.02  Cmd[0546:.mb.ps]  Sts[0010:c....]  TI PCI-PCI Bridge 0x6->0x7-0x7
    01:0  104c:8233.02  Cmd[0546:.mb.ps]  Sts[0010:c....]  TI PCI-PCI Bridge 0x6->0x8-0x8
    02:0  104c:8233.02  Cmd[0544:..b.ps]  Sts[0010:c....]  TI PCI-PCI Bridge 0x6->0x9-0x9

    Arbiter snippet, root:

        DEVNODE ffffe000007e8010 (PCI\VEN_8086&DEV_2F04&SUBSYS_21EA103C&REV_02\3&11583659&0&10)
          Memory Arbiter "PCI Memory (b=5)" at ffffe000009642a0
            Allocated ranges:
              0000000000000000 - 0000000092ffffff
                0000000000000000 - 0000000092ffffff  C    00000000 <Not on bus>
                0000000080000000 - 000000008fffffff  CB   00000000 <Not on bus>
              0000000093000000 - 00000000939fffff       ffffe0000095d610  (pci)

    Arbiter snippet, XIO upstream/downstream port to our device:

          DEVNODE ffffe0000095c980 (PCI\VEN_104C&DEV_8232&SUBSYS_00000000&REV_02\4&257301f0&0&0010)
            Memory Arbiter "PCI Memory (b=6)" at ffffe000009926d0
              Allocated ranges:
                0000000000000000 - 0000000092ffffff
                  0000000000000000 - 0000000092ffffff  C    00000000 <Not on bus>
                  0000000080000000 - 000000008fffffff  CB   00000000 <Not on bus>
                0000000093000000 - 00000000938fffff       ffffe0000098c880  (pci)
                0000000093900000 - 00000000939fffff       ffffe0000098d680  (pci)

            DEVNODE ffffe0000098ad30 (PCI\VEN_104C&DEV_8233&SUBSYS_00000000&REV_02\5&5eba8d5&0&000010)
              Memory Arbiter "PCI Memory (b=7)" at ffffe0000098c6d0
                Allocated ranges:
                  0000000000000000 - 00000000938fffff
                    0000000000000000 - 00000000938fffff  C    00000000 <Not on bus>
                    0000000080000000 - 000000008fffffff  CB   00000000 <Not on bus>
                  0000000093900000 - 00000000939fffff       ffffe00000985880  (dlgcmpd)

    Arbiter snippet, "conflicting" bridge:

        DEVNODE ffffe00000959010 (PCI\VEN_8086&DEV_8D18&SUBSYS_8030103C&REV_D5\3&11583659&0&E4)
          Memory Arbiter "PCI Memory (b=2)" at ffffe000009966a0
            Allocated ranges:
              0000000000000000 - 00000000939fffff
                0000000000000000 - 00000000939fffff  C    00000000 <Not on bus>

    btw, the vendor ID is for an Intel bridge on bus 0:

    1c:4  8086:8d18.d5  Cmd[0546:.mb.ps]  Sts[0010:c....]  Intel PCI-PCI Bridge 0->0x2-0x2

    So, according to the !arbiter documentation, the 'C' there means a conflict, and the end of the range matches the end of the XIO range, but the beginning of the range being zero makes it look like a mistake or something. Our device is also not behind the 8d18 on bus 0 anyway, so I don't know why it even cares to report a "conflict".
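The range arithmetic itself is easy to check. A sketch using the inclusive ranges printed in the !arbiter snippets (the struct and function names are hypothetical): the b=2 entry does overlap the device's window, and the XIO window contains that window. This only does the arithmetic; it says nothing about what the arbiter's 'C' flag means internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, as !arbiter prints it ("start - end"). */
struct range { uint64_t start, end; };

/* Two inclusive ranges overlap if each starts no later than the other ends. */
bool ranges_overlap(struct range a, struct range b)
{
    return a.start <= b.end && b.start <= a.end;
}

bool range_contains(struct range outer, struct range inner)
{
    return outer.start <= inner.start && inner.end <= outer.end;
}
```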

    Is this something we should pursue in some other fashion, or just ignore? If we do pursue it, I don't know of any extensions other than !pci and !arbiter to get more data. For instance, !pci 101 yields a very thorough output.

    One of the things we noticed that differs in this respect is that the MSI address (MsgAddr/MsgAddrHi) has a peculiar characteristic on the DL380: each XIO port (upstream and three downstream) is assigned a different address. Funny thing is, our device(s) don't get one assigned at all.

    On the SM, the XIO MsgAddr values are all the same, and the same as our device(s)'. However, the data values are all different.

    So that leads more to an MSI issue...

    Thanks!


    Pablo De Paulis

    Friday, October 23, 2015 8:50 PM
  • Debugger extensions can be a great aid in debugging, but when potential hardware issues are involved they don't always provide well thought out error messages or behave properly. Also, most of the device-specific debugger extensions are written for the bus/class driver writer - not end users, so, again, they don't always handle error conditions well. I'd have to wade through the !arbiter sources (and probably some of the PCI and Arbiter, as well) to know under what circumstance it would display the Conflict flag.

    That being said, the !arbiter output appears to be reasonable, especially because it adds credence to the idea that the PCIe hierarchy is not being programmed/wired properly, since it only happens on certain vendors' machines. Are both machines using the same chipset?

     -Brian


    Friday, October 23, 2015 9:05 PM
    Moderator
  • According to the QuickSpecs, the HP ProLiant DL380 Gen9 has the Intel® C610 Series Chipset. Device Manager doesn't show much of this, though.

    On the SM, at least from Device Manager, I see Intel® C610/X99 Series Chipset PCI Express Root...

    btw, not sure if you caught the 2nd part of my question about the MSI interrupt address...

    ...(MsgAddr/MsgAddrHi) have a peculiar characteristics on the DL 380 in that each XIO (upstream and three downstream) is assigned a different address. Funny thing our device(s) don't get any assigned.

    On the SM, the MsgAddr values are always the same; they are always populated for the three XIO ports and our two devices (one downstream port is not wired). At any rate, they all have the same address and different (16-bit) data.


    Pablo De Paulis

    Friday, October 23, 2015 10:41 PM
  • I'd have to dig into the sources to figure out what it means. I haven't looked at the PCI bus driver in almost two years. Clearly, there is something wrong with how it is being configured, but I'm at the limit of what I remember about the PCI bus driver. Sorry.

    If you want to hire me for more in-depth analysis, contact me outside the forum.

     -Brian


    Friday, October 23, 2015 10:49 PM
    Moderator
  • Thanks again Brian!

    I just did thru LinkedIn and joined that group too.

    Cheers...


    Pablo De Paulis

    Friday, October 23, 2015 11:13 PM