question about installing packages on an HDInsight Spark cluster (Linux)

  • Question

  • Hi all,

I am trying to set up a Spark cluster so that our ML team can use ADF (Azure Data Factory) to submit Spark activities. I have run into some configuration issues and hope some of you can provide insights.

When I run a PySpark script (against Python 3), it errors out at the import statements:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    #from azure.datalake.store import core, lib, multithread
    import statsmodels
    #import statsmodels.api as sm
    import statsmodels.formula.api as sm
    from IPython.display import display

    The log message is: 

Note that it checks against the Python 2.7 folder "/usr/bin/anaconda/lib/python2.7/site-packages/statsmodels/formula/api.py".

I have installed both pandas and statsmodels through script actions using the following statements:

    pip3 install pandas==0.20.3 --user
    pip3 install statsmodels==0.9.0 --user
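A possible cause of the mismatch above: `pip3 install --user` places packages under the invoking user's site directory, while the Spark job resolves imports against the interpreter's own site-packages (here, the Python 2.7 Anaconda root). A sketch of a script action that instead installs directly into the cluster's Python 3.5 conda environment, assuming the standard HDInsight Anaconda layout (`/usr/bin/anaconda/envs/py35`) mentioned later in this thread:

```bash
#!/usr/bin/env bash
# Hedged sketch, not the official procedure: install packages into the
# py35 conda environment so that jobs running under that interpreter
# can import them. The env path is an assumption based on the standard
# HDInsight Anaconda layout.
/usr/bin/anaconda/envs/py35/bin/pip install pandas==0.20.3
/usr/bin/anaconda/envs/py35/bin/pip install statsmodels==0.9.0
```

A script action like this runs as root on every requested node, so the `--user` flag is not needed.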

I also tried to SSH into the cluster and install packages using conda, but the following statements hang forever:

    So my questions are:

1: If packages must be installed using conda, what extra steps are needed to get past the "collecting package metadata" step?

2: How do I configure the cluster so that it runs in a Python 3 environment? Meaning import statements resolve to python3.5/site-packages for installed packages.

3: Since our ML team members use Jupyter notebooks for testing and development, is there a way to configure the Spark cluster to use the PySpark3 kernel by default to execute Spark scripts?
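For question 2, one quick way to confirm which interpreter a submitted job actually runs under is to print `sys.executable` at the top of the script; if it points at the Python 2.7 Anaconda root rather than a py35 env, imports will keep resolving against the 2.7 site-packages. A minimal sketch (plain Python, no cluster-specific assumptions):

```python
import sys

# The interpreter the script is running under. On HDInsight this reveals
# whether Spark picked the Python 2.7 Anaconda root or the py35 env.
print(sys.executable)

# The site-packages directories that imports will resolve against.
for path in sys.path:
    if "site-packages" in path:
        print(path)
```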

    Thanks and any feedback is highly appreciated.

    David

    Friday, September 6, 2019 5:49 PM

Answers

  • Hi all, 

I will provide the solution for others' reference:

In the ADF Spark activity, in the Spark config section, set:

key: spark.pyspark.python

value: /usr/bin/anaconda/envs/py35/bin/python3
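In ADF pipeline JSON, this setting corresponds to the `sparkConfig` property of the Spark activity's `typeProperties`. A sketch (the activity name and file paths here are placeholders, and the surrounding property names are assumptions based on the standard ADF HDInsightSpark activity schema):

```json
{
    "name": "SparkActivity",
    "type": "HDInsightSpark",
    "typeProperties": {
        "rootPath": "adfspark",
        "entryFilePath": "script.py",
        "sparkConfig": {
            "spark.pyspark.python": "/usr/bin/anaconda/envs/py35/bin/python3"
        }
    }
}
```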

    • Marked as answer by davidyang2000 Monday, September 30, 2019 11:40 PM
    Monday, September 30, 2019 11:40 PM

All replies

  • Hi David,

This is a known issue with the Python 3 environment.

For more details, you may refer to the SO thread which addresses a similar issue.

You can use a script action to install external Python packages for Jupyter notebooks in Apache Spark clusters on HDInsight. Script actions let you configure the cluster to use external, community-contributed Python packages that are not included out of the box.

For more details, refer to "Customize Azure HDInsight clusters by using script actions".

    Hope this helps.

    Monday, September 9, 2019 9:40 AM
    Moderator
  • Hi David,

Just checking in to see if the above answer helped. If this answers your query, click "Mark as Answer" and up-vote it. If you have any further queries, do let us know.

    Wednesday, September 11, 2019 9:53 AM
    Moderator
  • Hi David,

Following up to see if the above suggestion was helpful. If you have any further queries, do let us know.

    Friday, September 13, 2019 9:53 AM
    Moderator
  • (Accepted answer by davidyang2000, Monday, September 30, 2019 11:40 PM — see the Answers section above.)
  • Hi David,

    Glad to know that your issue has been resolved. Thanks for sharing the solution, which may benefit other community members reading this thread.

    Tuesday, October 1, 2019 4:36 AM
    Moderator