Posts – Page 4

Parallel Orchestration of Spark ETL Processing

I have been working a lot with Spark and Scala. I really like Scala as a language because of its numerous advantages over Java, the foremost being that type classes and default method arguments do wonders for keeping an API simple. Also, idiomatic Scala code uses higher-order functions …

Apache Zeppelin Notebooks

Apache Zeppelin provides a web UI where you can iteratively build Spark scripts in Scala, Python, etc. (with autocomplete support), run Spark SQL queries against Hive or other stores, and visualize the results of a query or of Spark DataFrames. This is somewhat akin to what IPython notebooks do for Python …

Machine learning with Apache Spark, Scala and Hive

Apache Spark has an advanced DAG execution engine and supports in-memory computation. In-memory computation combined with DAG execution leads to far better performance than running MapReduce jobs. In this post, I will show an example of using linear regression with Apache Spark. The dataset is NYC-Yellow …

Migrating to Google Sign-In with Android

Google has recently deprecated Google+ Sign-In and the process of obtaining OAuth access tokens via the GoogleAuthUtil.getToken API. They now recommend a single entry point via the new Google Sign-In API. The major reasons for doing so are that (1) it enhances the user experience and (2) it improves security; more here …

Impala vs Hive vs RDBMS

Hive or Impala?

Hive and Impala both support SQL operations, but Impala's performance is far superior to Hive's. Although Hive queries are now also significantly faster with the Spark SQL engine and the use of HiveContext, Impala still performs better. The reason that …