mapPartitions vs mapInPandas
Before Spark 3.0, getting good performance from vectorized operations generally meant repartitioning the dataset and then calling mapPartitions.
The major drawback was the cost of the repartitioning itself, since it forces a shuffle of the DataFrame.
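
As a rough illustration of the repartition-then-mapPartitions pattern described above, here is a minimal sketch. The column names (`id`, `value`), the function names `add_doubled` and `run_demo`, and the partition count are all hypothetical, chosen only for the example. The partition function gathers its rows into a single pandas DataFrame so the arithmetic runs as one vectorized operation.

```python
from typing import Iterable, Iterator, Tuple

import pandas as pd


def add_doubled(rows: Iterable) -> Iterator[Tuple[int, float, float]]:
    """Partition-level function intended for rdd.mapPartitions (hypothetical).

    Collects every row of the partition into one pandas DataFrame, applies a
    vectorized operation, and yields plain tuples back to Spark.
    """
    pdf = pd.DataFrame(
        [(r["id"], r["value"]) for r in rows], columns=["id", "value"]
    )
    if pdf.empty:
        return  # empty partition: nothing to emit
    pdf["doubled"] = pdf["value"] * 2.0  # one vectorized op per partition
    yield from pdf.itertuples(index=False, name=None)


def run_demo(num_partitions: int = 4) -> None:
    """Run the pattern on a tiny local DataFrame (assumes pyspark is installed)."""
    # Imported here so add_doubled can be exercised without a Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("mapPartitions-sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame([(i, float(i)) for i in range(10)], ["id", "value"])

    # repartition() is the step that triggers the shuffle.
    result = (
        df.repartition(num_partitions)
        .rdd.mapPartitions(add_doubled)
        .toDF(["id", "value", "doubled"])
    )
    result.show()
    spark.stop()
```

Calling `run_demo()` runs the whole pipeline locally. Note that each partition is materialized in memory as a pandas DataFrame, so partitions have to be small enough to fit, which is one reason the explicit repartition step was commonly needed.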
With Spark 3.0+, if your underlying function is …