Hi all,
I noticed that some of my runs recently started to behave very oddly. I'm using Spark with the Cassandra connector to load data into Cosmos DB; the run for one of my tables used to take 18-20 min, and now it takes 70-80 min with the same data and no change to the configuration. Here is the configuration I'm using for the connector:
spark.cassandra.output.batch.size.rows=100
spark.cassandra.connection.connections_per_executor_max=25
spark.cassandra.output.concurrent.writes=500
spark.cassandra.concurrent.reads=512
spark.cassandra.output.batch.grouping.buffer.size=2000
spark.cassandra.connection.keep_alive_ms=600000
spark.cassandra.output.throughput_mb_per_sec=500
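For context, this is roughly how those settings are wired into the session (a simplified Scala sketch; the host and port are the ones from the logs below, the ssl line is my assumption of the usual Cosmos DB Cassandra API setup, and auth/credentials are omitted):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cosmos-cassandra-load")
  // Endpoint as it appears in the executor logs (anonymized)
  .config("spark.cassandra.connection.host", "xxxx.cassandra.cosmosdb.azure.com")
  .config("spark.cassandra.connection.port", "10350")
  // Assumption: Cosmos DB Cassandra API connections go over SSL; auth settings omitted here
  .config("spark.cassandra.connection.ssl.enabled", "true")
  // Connector settings exactly as listed above
  .config("spark.cassandra.output.batch.size.rows", "100")
  .config("spark.cassandra.connection.connections_per_executor_max", "25")
  .config("spark.cassandra.output.concurrent.writes", "500")
  .config("spark.cassandra.concurrent.reads", "512")
  .config("spark.cassandra.output.batch.grouping.buffer.size", "2000")
  .config("spark.cassandra.connection.keep_alive_ms", "600000")
  .config("spark.cassandra.output.throughput_mb_per_sec", "500")
  .getOrCreate()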
And here is the spark application configuration
driver memory = 12G
executor memory = 6G
number of executors = 24
The data contains around 57 million rows, and I'm using 500K RU/s.
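The load itself is essentially a plain DataFrame write through the connector (sketch; "mykeyspace" and "mytable" are placeholders for the real names):

// df holds the ~57 million rows to be written
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "mykeyspace", "table" -> "mytable"))
  .mode("append")
  .save()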
I see this very often in my executor logs:
20/05/13 17:50:43 WARN RequestHandler: Host xxxx.cassandra.cosmosdb.azure.com/xx.xx.xxx.xxx:10350 is overloaded.
20/05/13 17:50:44 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBoundStatement@20441aaa
com.datastax.driver.core.exceptions.OperationTimedOutException: [xxxx.cassandra.cosmosdb.azure.com/xx.xx.xx.xxx:10350] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:772)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:663)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:738)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:466)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
20/05/13 17:50:44 WARN RequestHandler: Host xxxxx.cassandra.cosmosdb.azure.com/xx.xx.xxx.xxx:10350 is overloaded.
But I don't think these errors are new. They might have an impact on performance, but I assumed they had been there from the beginning due to the high load on the DB. Is this kind of degradation normal? Can someone help, please?
Thanks,