Posts – Page 2

Java WatchService

This post is a tutorial with several moving pieces. It covers the following:

  • Java WatchService
  • Spring Boot
  • Initialization-on-demand holder idiom
  • Managing concurrency
  • RxJava
  • Lombok (because why type more?)

The example is a Spring Boot REST service that exposes CSV file records from a directory. In …
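
Of the pieces listed above, the initialization-on-demand holder idiom is the easiest to show in isolation. The following is a minimal sketch, not the code from the post: the class name `DirectoryIndex` and the `CREATED` counter are hypothetical and exist only to make the lazy, once-only construction observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical singleton used only to illustrate the idiom.
final class DirectoryIndex {
    // Hypothetical counter: records how many times the constructor has run.
    static final AtomicInteger CREATED = new AtomicInteger();

    private DirectoryIndex() {
        CREATED.incrementAndGet();
    }

    // The nested class is not initialized until getInstance() first touches it.
    // The JVM runs class initialization once and in a thread-safe way, so no
    // explicit locking is needed.
    private static final class Holder {
        private static final DirectoryIndex INSTANCE = new DirectoryIndex();
    }

    static DirectoryIndex getInstance() {
        return Holder.INSTANCE;
    }
}
```

Calling `DirectoryIndex.getInstance()` from any number of threads returns the same object, and the constructor runs only when the first call happens rather than at application start-up.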

Reactively Streaming CSV using RxJava

RxJava is an extremely useful streaming framework. Here is an example application that uses it to process RESTful calls to both Uber and Lyft in parallel (RT_UBER_NYC_TAXI). In this post, however, I will cover how you can reactively stream and process a CSV file.

Firstly, you can create a Flowable of …

Spark Scaling to large datasets

In this post, I will share a few quick tips for scaling your Spark applications to larger datasets without requiring large executor memory.

  • Increase shuffle partitions: The default number of shuffle partitions (spark.sql.shuffle.partitions) is 200; for larger datasets, you are better off with a larger number of shuffle partitions. This helps in many ways …

Improving Quality of Text Extraction

I have been working on ML projects that require image preprocessing and text extraction. Improving the quality of text extraction takes several preprocessing steps, which are listed below. We use OpenCV for the preprocessing and tesseract-ocr for text extraction.

Image preprocessing

  • Rescaling …

Removing Projection Column Ambiguity in Spark

Column ambiguity is quite common when you join two tables. It becomes an unnecessary hassle when you want to select all the columns from both tables while discarding the duplicate columns. The problem is especially difficult to handle if you have wide tables, where you would want …