RKCP Information Technology




Cassandra

Spark and Scala

This training helps participants learn: an introduction to Big Data and NoSQL databases, the Cassandra data model, Cassandra modelling and architecture, the Cassandra API, the CQL shell (cqlsh), Cassandra administration, and Cassandra analytics and search clusters.

Overview

Apache Cassandra is a second-generation distributed database originally open-sourced by Facebook. Its write-optimized, shared-nothing architecture results in excellent performance and scalability.

Cassandra moves away from the master-slave model to a peer-to-peer model. There is no single master; instead, every node can coordinate reads and writes. This makes both reads and writes highly scalable and lets nodes keep serving requests even during network partitions.


Objective

Apache Cassandra is an open-source, second-generation distributed NoSQL database. It is an excellent choice when high availability and scalability are required. Cassandra supports replication across multiple data centers and offers tunable consistency, making both the write and read paths highly scalable.

This Apache Cassandra training provides an overview of the following:

  • Fundamentals of Big Data and NoSQL databases
  • Cassandra and the features it provides
  • The Cassandra architecture and its data model
  • Installing, configuring, and monitoring Cassandra
  • The Hadoop ecosystem of products around Cassandra


Prerequisites

  • Knowledge of any SQL database is preferred
  • Knowledge of Java is preferred (not mandatory)

Syllabus

  • 1. Why Spark?
    • Evolution of Distributed systems
    • Challenges with existing distributed systems
    • Need for a new generation
    • Hardware/software evolution in the last decade
    • Spark History
    • Unification in Spark
    • Spark ecosystem vs Hadoop
    • Spark with Hadoop
    • Who is using Spark?
  • 2. Scala Basics
    • Required for Spark
  • 3. Spark Architecture
    • RDD
    • Immutability
    • Laziness
    • Type inference
    • Cacheable
    • Spark on cluster management frameworks
    • Spark task distribution
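The RDD properties above (immutability, laziness, caching) can be previewed with lazy views on plain Scala collections, which follow the same evaluation model. This is a minimal sketch using only the standard library, not Spark itself: in real Spark code the equivalent map/filter pipeline would be distributed across the cluster and triggered by an action such as collect().

```scala
object LazinessDemo extends App {
  // A view builds a lazy pipeline, much as RDD transformations do:
  // nothing is evaluated when map/filter are declared.
  var evaluated = 0
  val pipeline = (1 to 10).view
    .map { n => evaluated += 1; n * 2 } // transformation: lazy
    .filter(_ > 10)                     // transformation: lazy

  assert(evaluated == 0) // nothing has run yet

  // Forcing the view plays the role of an RDD action (e.g. collect()).
  val result = pipeline.toList
  assert(result == List(12, 14, 16, 18, 20))
  assert(evaluated == 10) // each source element was mapped exactly once
  println(result)
}
```

The source range is never modified (immutability), and re-forcing the view would recompute the pipeline from scratch, which is exactly why Spark adds caching for reused results.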
  • 4. Spark SQL DDL
    • Case classes
    • Inferred schema
    • Parquet files
    • JSON
    • Schema RDD
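Spark SQL infers a schema by reflecting over a case class: field names become column names and field types become column types. The case class mechanics can be sketched in plain Scala without a SparkSession (the Person class and its fields here are illustrative assumptions, not from the course material):

```scala
// In real Spark code you would pass a Seq[Person] to createDataFrame,
// and Spark SQL would infer the (name: String, age: Int) schema from it.
case class Person(name: String, age: Int)

object SchemaDemo extends App {
  val people = Seq(Person("Ada", 36), Person("Grace", 45))

  // Pattern matching gives typed, named access to each "row".
  val names = people.collect { case Person(n, _) => n }
  assert(names == Seq("Ada", "Grace"))

  // productElementNames (Scala 2.13+) exposes the field names that
  // schema inference sees, in declaration order.
  assert(people.head.productElementNames.toList == List("name", "age"))
  println(names.mkString(", "))
}
```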
  • 5. Spark SQL DML
    • Projection
    • Condition
    • groupBy
    • joins
    • partitioning
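The DML operations above have direct analogues on plain Scala collections, which is a convenient way to preview their semantics before running them on a DataFrame. A small sketch with made-up (department, salary) rows:

```scala
object DmlDemo extends App {
  // Tuples standing in for rows of a table.
  val rows = Seq(("eng", 100), ("eng", 120), ("sales", 90))

  // Projection: select only the salary "column".
  val salaries = rows.map(_._2)

  // Condition: the equivalent of WHERE salary > 95.
  val wellPaid = rows.filter(_._2 > 95)

  // groupBy + aggregation: total salary per department.
  val totals = rows.groupBy(_._1).map { case (dept, rs) => dept -> rs.map(_._2).sum }

  assert(salaries == Seq(100, 120, 90))
  assert(wellPaid == Seq(("eng", 100), ("eng", 120)))
  assert(totals == Map("eng" -> 220, "sales" -> 90))
  println(totals)
}
```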
  • 6. Spark SQL JDBC
    • Metastore
    • JDBC driver
    • JDBC statement
    • Result Set
  • 7. Extending Spark SQL
    • User defined functions
    • User defined aggregate function
  • 8. Spark SQL in Streaming
    • Querying DStreams
    • DStream joins
  • 9. Spark installation
    • Local
    • Spark on YARN
    • Standalone
    • Spark on Mesos
  • 10. Caching and Lineage
    • RDD caching
    • Fault recovery
  • 11. Spark Streaming Architecture
    • DStreams
    • DStream vs RDD
    • Receivers
    • Batch vs Streaming
  • 12. Combining batch and Streaming
    • Foreach
    • Transform
    • Joins
  • 13. Persist and Caching
    • Saving DStream
    • Caching DStream
  • 14. Window Operations
    • window
    • countByWindow
    • reduceByWindow
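Windowed operations slide a fixed-length window over the stream. Scala's `sliding` on ordinary collections mimics the idea (window length 3, slide 1 here); note that real DStream windows are expressed in units of time, not element counts:

```scala
object WindowDemo extends App {
  val stream = List(1, 2, 3, 4, 5)

  // Analogue of window(): each element of `windows` is one window's contents.
  val windows = stream.sliding(3).toList
  assert(windows == List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5)))

  // Analogue of countByWindow(): the element count per window.
  assert(windows.map(_.size) == List(3, 3, 3))

  // Analogue of reduceByWindow(_ + _): reduce each window to one value.
  assert(windows.map(_.sum) == List(6, 9, 12))
  println(windows.map(_.sum))
}
```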
  • 15. Deploying Spark Streaming
    • Clustering
    • Check pointing
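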
    • Driver fallback
  • 16. Spark API Hands on
    • RDD operations
    • Key-value pair RDD
    • Map Reduce
    • Double RDD
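A classic hands-on exercise for key-value pair RDDs is word count, which has the MapReduce shape listed above. The same pipeline can be sketched on plain Scala collections; in real Spark code you would swap the Seq for an RDD and the groupBy/sum for reduceByKey(_ + _):

```scala
object WordCountDemo extends App {
  val lines = Seq("spark and scala", "spark streaming")

  // Map phase: split lines into (word, 1) pairs, the key-value RDD shape.
  val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))

  // Reduce phase: sum the counts per key, as reduceByKey would on an RDD.
  val counts = pairs.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2).sum }

  assert(counts == Map("spark" -> 2, "and" -> 1, "scala" -> 1, "streaming" -> 1))
  println(counts)
}
```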
  • 17. Advanced operations
    • Aggregate
    • Fold
    • mapPartitions
    • glom
    • Broadcast variables
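Of the operations above, fold is also available on plain Scala collections with the same contract Spark borrows: the zero element must have the element type. Spark's aggregate relaxes this so the accumulator type can differ; foldLeft shows that idea in one pass (the sum-and-count example is illustrative):

```scala
object AggregateDemo extends App {
  val nums = List(1, 2, 3, 4)

  // fold: the zero must match the element type (here Int).
  assert(nums.fold(0)(_ + _) == 10)

  // Spark's aggregate lets the accumulator differ from the element type;
  // foldLeft sketches the same contract: build (sum, count) in one pass.
  // (In Spark a second combine operator merges per-partition accumulators.)
  val (sum, count) = nums.foldLeft((0, 0)) { case ((s, c), n) => (s + n, c + 1) }
  assert(sum == 10 && count == 4)
  println(s"mean = ${sum.toDouble / count}")
}
```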
  • 18. Integration with HDFS
    • Introduction to HDFS
    • HDFS architecture
    • Using HDFS
  • 19. Input Streams
    • Socket
    • HDFS
    • Twitter
    • Kafka
  • 20. Streaming API Hands-on
    • DStream creation
    • Transformations
    • Stateful operations
  • 21. Checkpointing
    • Recoverable computations
    • Error handling