How to use DocumentDB with HDP?

  • Question

  • I'm about to run my own Hadoop cluster using HDP on Azure, and I would like to source data from DocumentDB into it. I noticed there is an option to connect HDInsight to DocumentDB, but I don't know how to connect to DocumentDB from my own Hadoop cluster.
    Friday, May 15, 2015 5:19 PM

Answers

  • The DocumentDB Hadoop connector (https://github.com/Azure/azure-documentdb-hadoop) can be used with any Hadoop installation, not just HDInsight. Please let us know if you have any questions or problems using this.
    Friday, May 15, 2015 8:23 PM

All replies

  • Thanks. So my next question is how throttling might negatively affect a Hadoop job whose source is DocumentDB. That is, if a job wants to read many records at once, might it get throttled?
    Friday, May 15, 2015 9:37 PM
  • As with any application, a Hadoop job can use the full provisioned throughput of the collection (S1, S2, or S3). Throttling is just the mechanism that enforces this limit, so it has no adverse effect other than spreading requests out over a longer period of time.

    The DocumentDB Hadoop connector automatically performs backoff and retry when throttled. 

    Friday, May 15, 2015 11:36 PM
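
To illustrate what "backoff and retry when throttled" generally looks like, here is a minimal sketch in plain Java. It is not the connector's actual code. `ThrottledException`, `BackoffRetry`, and all parameter names are hypothetical; the exception stands in for a "request rate too large" (HTTP 429) response that carries a server-suggested wait time.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for a throttled (HTTP 429) response.
// This is NOT a class from the DocumentDB SDK or the Hadoop connector.
class ThrottledException extends Exception {
    final long retryAfterMillis; // server-suggested wait, 0 if unknown

    ThrottledException(long retryAfterMillis) {
        super("request rate too large");
        this.retryAfterMillis = retryAfterMillis;
    }
}

class BackoffRetry {
    // Exponential delay for a 0-based attempt number, capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // cap the shift to avoid overflow
        return Math.min(delay, maxMillis);
    }

    // Runs the call; while it reports throttling, sleeps and tries again,
    // giving up after maxRetries retries.
    static <T> T callWithRetry(Callable<T> call, int maxRetries,
                               long baseMillis, long maxMillis) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (ThrottledException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                // Wait at least as long as the server asked for.
                long wait = Math.max(e.retryAfterMillis,
                                     backoffMillis(attempt, baseMillis, maxMillis));
                Thread.sleep(wait);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request: throttled twice, then succeeds.
        String result = callWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new ThrottledException(5);
            }
            return "ok";
        }, 5, 1, 50);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

The effect matches the description above: a throttled request is not lost, it just completes later.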
  • Thanks. Let's assume that a DocumentDB collection is the source for a Hadoop job and also for a few other applications. When the Hadoop job gets a chunk of data that consumes all the Request Units, the other applications get throttled.
    Friday, May 15, 2015 11:41 PM
  • You're right. You can budget the amount of RUs for the Hadoop job by tuning the number of inserts per batch and the wait time between calls.
    Sunday, May 17, 2015 1:12 AM
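
The budgeting arithmetic behind this can be sketched as follows. If one batch costs roughly C request units and the job is allowed B RU per second, consecutive batches need to be at least C / B seconds apart. The class and method names below are hypothetical, and the numbers (a job allowed 400 RU/s, for example 40% of a collection provisioned at 1000 RU/s) are assumptions for illustration only. The RU cost per batch or per document has to be measured for your own data.

```java
class RuBudget {
    // Minimum pause between batches, in milliseconds, so that the job
    // stays within its share of the collection's throughput.
    //   ruPerBatch:     estimated request units one batch consumes
    //   jobRuPerSecond: RU/s set aside for the Hadoop job
    static long minIntervalMillis(double ruPerBatch, double jobRuPerSecond) {
        if (ruPerBatch < 0 || jobRuPerSecond <= 0) {
            throw new IllegalArgumentException("RU values must be positive");
        }
        return (long) Math.ceil(ruPerBatch / jobRuPerSecond * 1000.0);
    }

    // Largest number of documents per batch that fits the budget when
    // batches are sent once every intervalMillis.
    static int maxDocsPerBatch(double ruPerDoc, double jobRuPerSecond,
                               long intervalMillis) {
        if (ruPerDoc <= 0 || jobRuPerSecond <= 0 || intervalMillis < 0) {
            throw new IllegalArgumentException("invalid budget parameters");
        }
        return (int) Math.floor(jobRuPerSecond * intervalMillis / 1000.0 / ruPerDoc);
    }

    public static void main(String[] args) {
        double jobRuPerSecond = 400.0; // assumed share given to the Hadoop job

        // A batch estimated at 100 RU needs at least this pause before the next one.
        System.out.println(minIntervalMillis(100.0, jobRuPerSecond) + " ms between batches");

        // With one call every 500 ms and about 5 RU per document:
        System.out.println(maxDocsPerBatch(5.0, jobRuPerSecond, 500) + " documents per batch");
    }
}
```

Whatever throughput is not set aside for the job stays available to the other applications that share the collection.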
  • Thanks. I downloaded the DocumentDB API and wrote a sample Hadoop job. However, it fails with a 'ClassNotFound' error. Do I have to copy the DocumentDB jar files to the name/data nodes? If so, do you know which directory they belong in?

    Monday, May 25, 2015 9:30 PM
  • I added a directory called 'lib' and put the jar files there, but it didn't help. I also tried adding the jar files with the -libjar option, but that didn't help either.
    Monday, May 25, 2015 9:53 PM