azureml-fe pods failing to start

  • Question

  • Hi, 

    In my Azure ML workspace I've attached an AKS cluster as an inference cluster, and it has been working for months. Suddenly the azureml-fe pods started to fail. Drilling down, it's the mdsd container that is failing, apparently due to a certificate error. From the mdsd log:

    HTTP request sent, awaiting response... 200 OK
    Length: 12954 (13K) [application/x-pkcs12]
    Saving to: 'mdsautokey-cert.pfx'

         0K .......... ..                                         100% 55.0M=0s

    2020-05-11 14:00:42 (55.0 MB/s) - 'mdsautokey-cert.pfx' saved [12954/12954]

    connection information downloaded
    140049914062488:error:0D0680A8:asn1 encoding routines:ASN1_CHECK_TLEN:wrong tag:tasn_dec.c:1217:
    140049914062488:error:0D07803A:asn1 encoding routines:ASN1_ITEM_EX_D2I:nested asn1 error:tasn_dec.c:386:Type=PKCS12
    139748841191064:error:0D0680A8:asn1 encoding routines:ASN1_CHECK_TLEN:wrong tag:tasn_dec.c:1217:
    139748841191064:error:0D07803A:asn1 encoding routines:ASN1_ITEM_EX_D2I:nested asn1 error:tasn_dec.c:386:Type=PKCS12
    Starting MDSD agent
    ------Dumping ERRORS-------
    2020-05-11T14:00:43.0293730Z: Error: initializing certificate failed: ReadCertFromFile(): failed to read certificate file: '/etc/mdsd.d/gcs_cert.pem'
    2020-05-11T14:00:43.0293990Z: Error: GcsMgr::Initialize failed. Abort MdsdConfig::Initialize().
    mdsd encountering errors
    Killing container with exit code 1

    This was working fine until yesterday. What could be the problem? Has the certificate expired?
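
The "wrong tag" lines in the log mean that OpenSSL did not find the ASN.1 tag it expected while parsing the downloaded file as PKCS12, so it may help to check whether the file is binary PKCS#12 at all. What follows is only a minimal sketch, assuming you can copy mdsautokey-cert.pfx out of the container. It checks that the data starts with an ASN.1 SEQUENCE tag (0x30) and a consistent length field, which a PKCS#12 file should have. The function names are hypothetical and not part of any Azure ML tooling, and passing this check does not prove the certificate is valid or unexpired.

```python
from pathlib import Path

# ASN.1 universal, constructed SEQUENCE: the outermost tag of a PKCS#12 (PFX) structure.
SEQUENCE_TAG = 0x30


def looks_like_asn1_sequence(data: bytes) -> bool:
    """Return True if data starts with a SEQUENCE tag and a consistent length field."""
    if len(data) < 2 or data[0] != SEQUENCE_TAG:
        return False
    first_len = data[1]
    if first_len < 0x80:
        # Short form: this byte is the content length.
        return len(data) >= 2 + first_len
    if first_len == 0x80:
        # BER indefinite length: the content size cannot be checked here.
        return True
    # Long form: the low 7 bits give the number of length bytes that follow.
    num_len_bytes = first_len & 0x7F
    if len(data) < 2 + num_len_bytes:
        return False
    declared = int.from_bytes(data[2:2 + num_len_bytes], "big")
    return len(data) >= 2 + num_len_bytes + declared


def check_file(path: str) -> bool:
    """Hypothetical helper: run the check on a file copied out of the container."""
    return looks_like_asn1_sequence(Path(path).read_bytes())
```

If check_file("mdsautokey-cert.pfx") returns False, the download is probably not a raw PKCS#12 file (for example, it could be text or an error page), which would be consistent with the ASN1_CHECK_TLEN errors above. If it returns True, the cause lies elsewhere.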

    The mdsd image being used by the azureml-fe deployment is the following: mcr.microsoft.com/azureml/aml-mdsd-external:realtime-fe-20191212.1

    Note: in another AKS cluster, with a more recent AML workspace attached, everything is working as expected. In that cluster the mdsd image is also more recent: mcr.microsoft.com/azureml/aml-mdsd-external:realtime-fe-20200109.1

    Thanks for any help,

    Ricardo Santos
    Altitude Software

    Monday, May 11, 2020 5:30 PM

All replies

  • Hello Ricardo,

    We have reported this issue internally to our product team and are awaiting their response. We will update this thread as soon as we hear back.

    -Rohit

    Wednesday, May 13, 2020 12:27 PM