Ramandeep Singh author // All Things Technical // Blog about technological musings

Improving Quality of Text Extraction

I have been working on ML projects that require image preprocessing and text extraction. Improving the quality of text extraction takes several preprocessing steps, which are listed below. We use OpenCV for the preprocessing and tesseract-ocr for the text extraction.

Image preprocessing

Rescaling …

in Machine Learning · Sat 01 September 2018

Removing Projection Column Ambiguity in Spark

Column ambiguity is quite common when you join two tables. It becomes an unnecessary hassle when you want to select all the columns from both tables while discarding the duplicate columns. The problem is especially difficult to handle if you have wide tables, where you would want …

in spark · Thu 12 April 2018

Efficient Spark Dataframe Transforms

If you are working with Spark, you will most likely have to write transforms on dataframes. The Dataframe API exposes the obvious method df.withColumn(col_name,col_expression) for adding a column with a specified expression. Now, since dataframes are immutable, we are getting a newly …

in spark · Sun 18 March 2018

Writing Generic UDFs in Spark

Apache Spark offers the ability to write Generic UDFs. However, for an idiomatic implementation, there are a couple of things that one needs to keep in mind.

You should return a subtype of Option, because Spark automatically treats None as null and is able to extract the value from Some …

in spark · Wed 24 January 2018

Testing Spark Dataframes

Testing Spark Dataframe transforms is essential and can be accomplished in a more reusable manner. The way I generally accomplish that is to

Read the expected and test Dataframes, and
Invoke the desired transform, and
Calculate the difference between dataframes. The only caveat in calculating the difference is that in …

in spark · Wed 17 January 2018

Posts by 'Ramandeep Singh' – Page 3