PySpark Structured Streaming

PySpark Structured Streaming is a stream processing engine built on top of Spark SQL. It enables processing real-time data streams in a fault-tolerant and scalable manner. Structured Streaming lets you write streaming queries using the same DataFrame and SQL APIs you would use for batch processing, making it easier to move from batch to streaming workloads. This series works through a variety of PySpark streaming examples.

PySpark structured streaming with applyInPandasWithState worked example

The series so far has covered the background of PySpark Structured Streaming, the motivation for using applyInPandasWithState, and a notebook for generating streaming files. In part 3 of this tutorial on applyInPandasWithState, the CSV files will be streamed, the data will be grouped by flight id, and custom logic will maintain the […]
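As a preview of the part 3 logic: applyInPandasWithState calls a user function once per group per micro-batch, handing it the group's rows as a pandas DataFrame plus a state handle that survives between batches. The sketch below imitates that contract with plain pandas; the column names (`flight_id`, `altitude`) and the dict standing in for Spark's GroupState handle are illustrative assumptions, not the tutorial's actual schema.

```python
import pandas as pd

# Hypothetical per-flight state store; in real Spark code this role is
# played by the GroupState handle passed into applyInPandasWithState.
state = {}

def update_flight(flight_id, batch_df):
    """Fold one micro-batch of rows for one flight into running state."""
    prev = state.get(flight_id, {"rows_seen": 0, "max_altitude": float("-inf")})
    prev["rows_seen"] += len(batch_df)
    prev["max_altitude"] = max(prev["max_altitude"], batch_df["altitude"].max())
    state[flight_id] = prev
    # Emit the current aggregate, analogous to the DataFrames a real
    # applyInPandasWithState function would yield for the output stream.
    return pd.DataFrame([{"flight_id": flight_id, **prev}])

# Two micro-batches arriving over time for the same flight.
batch1 = pd.DataFrame({"flight_id": ["BA1", "BA1"], "altitude": [10000, 12000]})
batch2 = pd.DataFrame({"flight_id": ["BA1"], "altitude": [11000]})

for batch in (batch1, batch2):
    for fid, grp in batch.groupby("flight_id"):
        out = update_flight(fid, grp)

print(state["BA1"])  # state carried across batches: 3 rows seen, max 12000
```

The key point the sketch shows is that the per-group function sees only one batch of rows at a time, so any cross-batch aggregate must live in the state handle rather than in the DataFrame.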

Supercharge PySpark streaming with applyInPandasWithState

This tutorial covers a complete worked example of streaming data in PySpark using the applyInPandasWithState function together with foreachBatch. Spark Structured Streaming does not always provide the needed tools out of the box, and applyInPandasWithState and foreachBatch allow the streaming behaviour to be customised. The example uses the scenario of streaming data from […]
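To show where foreachBatch fits in: Spark invokes a user callback once per micro-batch, passing the batch's DataFrame and its id, which lets you run arbitrary sink logic such as an upsert. The sketch below imitates that callback contract with pandas DataFrames standing in for micro-batches and a dict standing in for the sink; the `id`/`value` column names are illustrative assumptions.

```python
import pandas as pd

# Target "sink": a dict keyed by record id, imitating an upsert target
# such as a database table a real foreachBatch callback would write to.
sink = {}

def process_batch(batch_df, batch_id):
    """What a foreachBatch callback typically does: upsert each row by key."""
    for row in batch_df.to_dict("records"):
        sink[row["id"]] = row["value"]  # last write wins

# Micro-batches as they would arrive from the stream, in order.
process_batch(pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}), batch_id=0)
process_batch(pd.DataFrame({"id": [2, 3], "value": ["B", "c"]}), batch_id=1)

print(sink)  # id 2 was updated by the second batch
```

In real Spark code the same callback would be wired up with `stream.writeStream.foreachBatch(process_batch)`, which is the hook that lets batch-only sinks participate in a streaming query.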
