CheatSheet | spark


1. Introduction

  • 1.1 Terminologies

    • Job : A piece of code that reads some input (HDFS or local), performs some computation and writes output.

    • Stages : Jobs are divided into stages. Stages are classified as map or reduce stages. Stages are split along computational boundaries; not all computations can be completed in a single stage, so a job runs over many stages.

    • Tasks : Each stage has some tasks. One task is executed on one partition of data on one executor (machine).

    • DAG : DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.

    • Executor : The process responsible for executing a task.

    • Master : The machine on which the Driver program runs.

    • Slave : The machine on which the Executor program runs.

  • 1.2 Spark Components

    • Spark Driver

      • separate process to execute user applications

      • creates a SparkContext to schedule job execution and negotiate with the cluster manager (see the sketch after this list)

    • Executors

      • run tasks scheduled by driver

      • store computation results in memory, on disk or off-heap

      • interact with storage systems

    • Cluster Manager

      • Apache Mesos : a general cluster manager that can also run Hadoop MapReduce and service applications.

      • Hadoop YARN : the resource manager in Hadoop.

      • Spark Standalone : a simple cluster manager included with Spark that makes it easy to set up a cluster
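
    • Example (driver + cluster manager) : a minimal PySpark sketch of how a driver program wires these components together. The master URL is an assumption (local[*] here); on a real cluster it would point at Standalone, YARN, or Mesos.

        from pyspark.sql import SparkSession

        # The driver process creates a SparkSession, which wraps the SparkContext.
        # The master URL tells Spark which cluster manager to negotiate with:
        #   "local[*]"          -> local threads, no cluster manager
        #   "spark://host:7077" -> Spark Standalone
        #   "yarn"              -> Hadoop YARN
        #   "mesos://host:5050" -> Apache Mesos
        spark = (
            SparkSession.builder
            .appName("component-demo")
            .master("local[*]")                      # assumption: local run for illustration
            .config("spark.executor.memory", "1g")   # illustrative executor setting
            .getOrCreate()
        )

        sc = spark.sparkContext   # the SparkContext created by the driver
        print(sc.master, sc.applicationId)

        spark.stop()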

  • 1.3 How Spark Works?

    • Spark has a small code base, and the system is divided into various layers.

    • Each layer has its own responsibilities, and the layers are independent of each other:

      • Interpreter : The first layer is the interpreter. Spark uses a Scala interpreter with some modifications. As you enter your code in the Spark console (creating RDDs and applying operators), Spark builds an operator graph.

      • DAG Scheduler : When the user runs an action (like collect), the graph is submitted to the DAG scheduler, which divides the operator graph into (map and reduce) stages. A stage consists of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages (see the sketch after this list).

      • Task Scheduler : The stages are passed on to the task scheduler, which launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
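
      • Example : a hedged PySpark illustration of this flow. The transformations only build the operator graph; the collect action hands the DAG to the scheduler, which pipelines the narrow map-side operators into one stage and starts a new stage at the reduceByKey shuffle.

          from pyspark.sql import SparkSession

          spark = SparkSession.builder.appName("dag-demo").master("local[*]").getOrCreate()
          sc = spark.sparkContext

          # Transformations: nothing runs yet, Spark only builds the operator graph (DAG).
          counts = (
              sc.parallelize(["spark dag demo", "spark stages demo"])
                .flatMap(lambda line: line.split())   # narrow: pipelined into one stage
                .map(lambda w: (w, 1))                # narrow: same stage
                .reduceByKey(lambda a, b: a + b)      # wide: shuffle => stage boundary
          )

          # Action: the graph is submitted to the DAG scheduler, split into stages,
          # and the resulting tasks are launched via the task scheduler.
          print(counts.collect())

          spark.stop()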

  • 1.4 Practical Overview

    • write spark-scala code

    • build jar with sbt (build.sbt)

      • input : Scala sources and build.sbt

      • output : target/scala/*.jar



2. Install Spark Locally

  • TBD


3. Cheat Sheet

3.1 Initialization & Configuration
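
  • A minimal PySpark initialization sketch; every configuration value below is a placeholder, not a recommendation.

      from pyspark import SparkConf
      from pyspark.sql import SparkSession

      # Build an explicit configuration (illustrative values only).
      conf = (
          SparkConf()
          .setAppName("cheatsheet")
          .setMaster("local[*]")
          .set("spark.sql.shuffle.partitions", "8")
      )

      spark = SparkSession.builder.config(conf=conf).getOrCreate()

      # Inspect the effective configuration.
      print(spark.sparkContext.getConf().getAll())

      spark.stop()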


3.2 Dataframe

  • Explore Dataframe

  • Modifying Dataframe

  • Read and Write
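
  • Example : a short hedged sketch touching each bullet above; the table, column names, and output path are hypothetical.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("df-demo").master("local[*]").getOrCreate()

      # Write : persist a tiny hypothetical table as Parquet
      people = spark.createDataFrame([(1, "alice", 29), (2, "bob", 17)], ["id", "name", "age"])
      people.write.mode("overwrite").parquet("people.parquet")

      # Read
      df = spark.read.parquet("people.parquet")

      # Explore
      df.printSchema()
      df.show()
      df.describe("age").show()

      # Modify
      adults = (
          df.withColumn("age_plus_one", F.col("age") + 1)
            .withColumnRenamed("name", "full_name")
            .filter(F.col("age") >= 18)
      )
      adults.show()

      spark.stop()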


3.3 Join

  • Inner join : returns only the rows that have a match under the join condition (the on expression is true) on both sides, so anything that is not common to both tables is removed
  • Outer join : includes rows of one table even when there is no matching row found in the other table, filling the missing side with nulls
  • Left join : keeps all rows of the left table, regardless of whether there is a match in the right table. When an id match is found in the right table its columns are returned, null otherwise
  • Right join : performs the same task as the left join, but keeps all rows of the right table (see the sketch below)
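
  • Example : a hedged illustration of the four join types on two tiny hypothetical DataFrames keyed by id.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("join-demo").master("local[*]").getOrCreate()

      left = spark.createDataFrame([(1, "alice"), (2, "bob"), (3, "carol")], ["id", "name"])
      right = spark.createDataFrame([(2, "sales"), (3, "hr"), (4, "it")], ["id", "dept"])

      left.join(right, on="id", how="inner").show()   # only ids 2 and 3
      left.join(right, on="id", how="outer").show()   # ids 1-4, nulls where no match
      left.join(right, on="id", how="left").show()    # ids 1-3, dept is null for id 1
      left.join(right, on="id", how="right").show()   # ids 2-4, name is null for id 4

      spark.stop()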

3.4 SQL functions

  • TBD

3.5 User Defined Function

  • TBD

3.6 Others

  • TBD


4. Submit Pyspark

  • TBD


5. Logging

  • TBD


6. Pushdown Filter (Predicate Pushdown)

  • Pushdown Filtering
    • General workflow : read all the data from the file system, then filter it in memory
    • Pushdown workflow : push the filter down to the data source, so only the necessary data is read from the file system in the first place (see the sketch below)
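
    • Example : a hedged sketch of checking pushdown against a Parquet source; the path is hypothetical, and the PushedFilters entry in the physical plan is what confirms that the predicate reached the data source.

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.appName("pushdown-demo").master("local[*]").getOrCreate()

        # Write a small Parquet file so the example is self-contained.
        spark.range(0, 1000).write.mode("overwrite").parquet("numbers.parquet")

        # With a columnar source like Parquet, Spark pushes the predicate down,
        # so only the needed data is read instead of filtering everything in memory.
        df = spark.read.parquet("numbers.parquet").filter(F.col("id") > 990)

        # The physical plan should list the predicate under "PushedFilters".
        df.explain()
        df.show()

        spark.stop()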


References

  • TBD