HDInsight - Spark: Installing extra Python dependencies using Conda hangs on: "Collecting package metadata (repodata.json)"

  • Question

  • We are running an HDInsight 4.0 cluster on Azure. It's a test setup with just 1 worker node (8 CPUs, 56 GB RAM) and 2 head nodes (4 CPUs, 25 GB RAM each). We need to add a few specific Python dependencies and, preferably, upgrade the whole environment from Python 3.5 to Python 3.7.

    Every document I've read about installing dependencies so far has the same structure:

    • Use a script action (bash)
    • Use conda to install the required packages: e.g. /usr/bin/anaconda/bin/conda install -n py35 -y dataclasses 

    Or you use conda-forge to install different packages: /usr/bin/anaconda/bin/conda install -n py35 -y -c conda-forge tika

    Every single conda command I execute that needs to fetch something from the internet (metadata, searching for packages, ...), whether run through a script action or by SSH-ing into the head node and running it directly there, just _does not complete_.

    In the case of the install, it hangs on "Collecting package metadata (repodata.json)".

    The longest I've waited so far is 1 hour for a single install command on a VM with 4 cores and 25 GB of RAM. Is it supposed to take this long, or even longer? I have a hard time believing that a simple package install should take that much time.

    Or is something else not working correctly?

    Thanks in advance.

    Thursday, September 26, 2019 10:16 AM
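
One way to turn an open-ended hang like the one described above into a clear, repeatable failure is to run the command under a time limit. The following is only a minimal sketch: `run_with_limit` is a hypothetical helper name, the 300-second limit in the example is an arbitrary assumption, and it relies on the GNU coreutils `timeout` command being available on the node.

```shell
#!/usr/bin/env bash
# Sketch: run a command under a time limit so a hang shows up as a failure.
# run_with_limit is a hypothetical helper, not part of conda or HDInsight.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
    local limit="$1"
    shift
    local status=0
    timeout "$limit" "$@" || status=$?
    # GNU timeout exits with status 124 when it had to stop the command.
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example with the install from the question. The -vv flag asks conda for
# more detailed logging, which may show the request it is waiting on:
# run_with_limit 300 /usr/bin/anaconda/bin/conda install -n py35 -y -vv dataclasses
```

An exit status of 124 then means the command was still running when the limit expired, which is easier to capture in a script action log than a session that never returns.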

All replies

  • Hello Tom Pauwaert, and thank you for your inquiry. Since you have tried multiple methods and all of them hang on the same step for the same module (assuming this isn't the first module), it sounds like something might be wrong with the package repo. Have you tried fetching from a mirror, or installing outside of HDInsight to test whether it completes at all?
    Thursday, September 26, 2019 11:29 PM
  • Hi Martin,

    Thanks for the response. I have now explicitly tested this:

    • I set up a VM (4 cores, 14 GB RAM) and installed Anaconda, then created a new environment using conda create -n py37 python=3.7 -c conda-forge tika ... (some other packages).
      This command completed in 15 seconds, with about 6 seconds spent on the repodata.json step.
    • I then set up another HDInsight cluster, connected to the head node, and attempted the same. The command again had not completed 1 hour after starting it. The head node had 4 cores and ~25 GB RAM.

    So I don't think the issue is the package repo itself. Every time I've tried this, I've created a new HDInsight cluster in a new resource group, so there shouldn't have been any interference from outside services.

    Friday, September 27, 2019 10:28 AM
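
To make the comparison above repeatable on both machines, the same command can be timed with a small wrapper. This is a sketch under assumptions: `elapsed_seconds` is a hypothetical helper name, and the example uses `conda search`, which also has to fetch repodata, so that nothing gets installed while measuring.

```shell
#!/usr/bin/env bash
# Sketch: print how many whole seconds a command took to run.
# elapsed_seconds is a hypothetical helper name.
# Usage: elapsed_seconds COMMAND [ARGS...]
elapsed_seconds() {
    local start end
    start=$(date +%s)
    # The command's own output is discarded; only the duration is reported.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $(( end - start ))
}

# Example: run this on the plain VM and on the HDInsight head node and compare.
# elapsed_seconds conda search -c conda-forge tika
```

Because the wrapper waits for the command to finish, it is only useful on the head node together with some time limit; otherwise it hangs exactly as the original command does.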
  • Thank you for the feedback.  I will reach out internally.
    Friday, September 27, 2019 5:31 PM
  • Hello, I got a response.  This seems to be a known issue with Conda 4.7.11, but not 4.5.12.
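
To see whether a node is running the version named above, the output of `conda --version` can be checked. The sketch below assumes that output has the usual `conda X.Y.Z` form; both function names are hypothetical, and the check covers only the single version mentioned in this thread.

```shell
#!/usr/bin/env bash
# Sketch: extract the version number from `conda --version` style output,
# e.g. "conda 4.7.11" -> "4.7.11". Both helper names are hypothetical.
conda_version_number() {
    printf '%s\n' "$1" | awk '{ print $2 }'
}

# Succeeds only for the version reported as affected in this thread.
is_reported_affected() {
    [ "$(conda_version_number "$1")" = "4.7.11" ]
}

# Example on a head node (path taken from the question). The 2>&1 is there
# in case this conda build prints its version to stderr:
# is_reported_affected "$(/usr/bin/anaconda/bin/conda --version 2>&1)" \
#     && echo "this node runs conda 4.7.11"
```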

    Here is a script which will get you a fixed version:

    rm /usr/bin/anaconda/lib/python2.7/site-packages/conda/core/subdir*
    wget https://gregorysfixes.blob.core.windows.net/public/subdir_data.py -P /usr/bin/anaconda/lib/python2.7/site-packages/conda/core/

    If you are not comfortable with running that, I can tell you which lines of code to change.

    Tuesday, October 1, 2019 1:23 AM
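
The script above deletes the existing `subdir*` files before downloading the replacement. A more cautious variant, shown here only as a sketch, moves them to a backup directory first so the original code can be restored. `backup_subdir_files` is a hypothetical helper name and the backup location in the example is an assumption; the conda path and the download URL are the ones given in the reply.

```shell
#!/usr/bin/env bash
# Sketch: move conda's subdir* files aside instead of deleting them.
# backup_subdir_files is a hypothetical helper name.
# Usage: backup_subdir_files CONDA_CORE_DIR BACKUP_DIR
backup_subdir_files() {
    local core_dir="$1"
    local backup_dir="$2"
    local f
    mkdir -p "$backup_dir" || return 1
    for f in "$core_dir"/subdir*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        mv "$f" "$backup_dir"/ || return 1
    done
}

# Example with the path from the reply; /tmp/conda-core-backup is an assumption:
# backup_subdir_files /usr/bin/anaconda/lib/python2.7/site-packages/conda/core /tmp/conda-core-backup
# wget https://gregorysfixes.blob.core.windows.net/public/subdir_data.py \
#     -P /usr/bin/anaconda/lib/python2.7/site-packages/conda/core/
```

Moving the backed-up files back into the `core` directory undoes the change if the replacement file does not behave as expected.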
  • Did this help you?
    Wednesday, October 2, 2019 6:17 PM
  • Hi Martin,

    Thanks for getting back to me with the script. I haven't had a chance to test it yet, as we temporarily switched to Databricks instead. I have this thread bookmarked, so when I get the chance to test it I will report back!

    Best Regards,

    Tom

    Monday, October 7, 2019 7:50 PM