Unusual slowness for my Cosmos DB loads

  • Question

  • Hi all, 

    I noticed that some of my runs recently started to behave very oddly. I'm using Spark and the Cassandra connector to load data into Cosmos DB, and the run for one of my tables used to take 18-20 minutes; now it's taking 70-80 minutes with the same data and no change to the configuration. Here is the configuration I'm using for the connector:

    spark.cassandra.output.batch.size.rows=100
    spark.cassandra.connection.connections_per_executor_max=25
    spark.cassandra.output.concurrent.writes=500
    spark.cassandra.concurrent.reads=512
    spark.cassandra.output.batch.grouping.buffer.size=2000
    spark.cassandra.connection.keep_alive_ms=600000
    spark.cassandra.output.throughput_mb_per_sec=500

    And here is the Spark application configuration:

    driver memory = 12G
    executor memory = 6G
    number of executors = 24

    The data contains around 57 million rows, and I have 500K RU/s provisioned.
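
    For context, here is a rough sketch of how these settings are applied when building the Spark session (Scala). The account name, credentials, and keyspace/table names below are placeholders, not my real values:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      // Cosmos DB Cassandra API endpoint and credentials (placeholders)
      .set("spark.cassandra.connection.host", "<account>.cassandra.cosmosdb.azure.com")
      .set("spark.cassandra.connection.port", "10350")
      .set("spark.cassandra.connection.ssl.enabled", "true")
      .set("spark.cassandra.auth.username", "<account>")
      .set("spark.cassandra.auth.password", "<primary-key>")
      // Connector settings listed above
      .set("spark.cassandra.output.batch.size.rows", "100")
      .set("spark.cassandra.connection.connections_per_executor_max", "25")
      .set("spark.cassandra.output.concurrent.writes", "500")
      .set("spark.cassandra.concurrent.reads", "512")
      .set("spark.cassandra.output.batch.grouping.buffer.size", "2000")
      .set("spark.cassandra.connection.keep_alive_ms", "600000")
      .set("spark.cassandra.output.throughput_mb_per_sec", "500")

    val spark = SparkSession.builder()
      .appName("cosmos-cassandra-load")
      .config(conf)
      .getOrCreate()

    // The write itself looks roughly like this (keyspace/table are placeholders):
    // someDataFrame.write
    //   .format("org.apache.spark.sql.cassandra")
    //   .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
    //   .mode("append")
    //   .save()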

    I see this very often in my executor logs:

    20/05/13 17:50:43 WARN RequestHandler: Host xxxx.cassandra.cosmosdb.azure.com/xx.xx.xxx.xxx:10350 is overloaded.
    20/05/13 17:50:44 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBoundStatement@20441aaa
    com.datastax.driver.core.exceptions.OperationTimedOutException: [xxxx.cassandra.cosmosdb.azure.com/xx.xx.xx.xxx:10350] Timed out waiting for server response
            at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:772)
            at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
            at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:663)
            at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:738)
            at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:466)
            at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
            at java.lang.Thread.run(Thread.java:748)
    20/05/13 17:50:44 WARN RequestHandler: Host xxxxx.cassandra.cosmosdb.azure.com/xx.xx.xxx.xxx:10350 is overloaded.


    But I don't think these errors are new. They might have an impact on performance, but I was assuming they have been there from the beginning due to the high load on the DB. Is this kind of degradation normal? Can someone help, please?

    Thanks,

    Wednesday, May 13, 2020 8:53 PM

All replies

  • Hi Rih_AB,

    Thank you for bringing this to our attention. For the Spark connector throughput configuration parameters, you should use the suggested values (see Write Tuning Parameters for additional clarification):

    spark.cassandra.output.batch.size.rows (default: None): Number of rows per single batch. The default is 'auto', which means the connector will adjust the number of rows based on the amount of data in each row.

    I also suggest you enable the following for a one-time run to gather information about the behavior of your current implementation:

    spark.cassandra.output.metrics (default: true): Sets whether to record connector-specific metrics on write.

    I am hoping that changing batch.size.rows to auto and enabling output.metrics will provide visibility into the throughput your setup is currently capable of. From there, if you are seeing issues that need a deeper investigation, we can pursue that investigation.
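
    As a minimal sketch (Scala), assuming your job builds its session from a SparkConf, the two changes would look something like this; everything else in your configuration stays the same:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Only the two settings below are the suggested changes; the endpoint,
    // credentials, and remaining tuning parameters stay as in your current job.
    val conf = new SparkConf()
      // Let the connector size batches automatically based on the data in each row
      .set("spark.cassandra.output.batch.size.rows", "auto")
      // Record connector-specific metrics on write for this one-time diagnostic run
      .set("spark.cassandra.output.metrics", "true")

    val spark = SparkSession.builder()
      .appName("cosmos-cassandra-load-diagnostics")
      .config(conf)
      .getOrCreate()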

    Regards,

    Mike

    Sunday, May 17, 2020 8:06 PM
  • Hi RiH_AB,

    Did you manage to find a solution to your issue? Please let us know if you have any additional questions.

    Regards,

    Mike

    Tuesday, May 26, 2020 2:11 AM