Advanced Big Data Testing (Duration – 45 hrs)
Big Data refers to large, complex, and diverse data sets (structured, semi-structured, and unstructured) that cannot be processed using traditional data processing methods. The importance of big data lies in its ability to provide valuable insights, enhance decision-making, and drive innovation.
Benefits of Big Data:
Cost Savings: Big data tools like Apache Hadoop and Spark can bring cost-saving benefits to businesses when they have to store large amounts of data.
Time-Saving: Real-time, in-memory analytics helps companies collect and process data from various sources quickly.
Market Understanding: Big data analysis helps businesses gain a better understanding of market conditions.
Social Media Listening: Companies can perform sentiment analysis using big data tools, which gives them a better understanding of customer needs and preferences.
Customer Acquisition and Retention: Big data can help businesses identify potential customers and retain existing ones by providing personalized experiences.
Innovation and Product Development: Big data can drive innovation by providing insights into customer behavior, preferences, and needs, helping businesses develop new products and services that meet those needs.
In conclusion, big data is important because it enables businesses to make informed decisions based on insights derived from large and complex data sets. It has the potential to revolutionize how businesses operate across various industries, from healthcare to finance to marketing.
Hadoop is a framework written in Java that uses a large cluster of commodity hardware to store and process big data. Hadoop is built on the MapReduce programming model introduced by Google. Today many big-brand companies use Hadoop to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of four components (a small HDFS access sketch follows the list below):
- MapReduce
- HDFS (Hadoop Distributed File System)
- YARN (Yet Another Resource Negotiator)
- Common Utilities or Hadoop Common
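To make the HDFS component concrete, here is a minimal sketch of how a client program can list files in HDFS using the Hadoop FileSystem API, written in Scala (the same language used in the Spark and Scala module later in this course). The NameNode address and directory path are placeholders for illustration only.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsListSketch {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Hypothetical NameNode address; on a real cluster this comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020")
        val fs = FileSystem.get(conf)

        // List the files stored under a (hypothetical) user directory
        fs.listStatus(new Path("/user/tester"))
          .foreach(status => println(s"${status.getPath} (${status.getLen} bytes)"))
      }
    }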
Detailed Technical Inputs
Big Data Basics: –
- Introduction to Big Data & Big Data challenges
- Limitations of DWH & solutions offered by Big Data architecture
- Differences between Hadoop 1 and Hadoop 2 features.
- Different Hadoop jobs available in the Big Data world.
- Types of Big Data based on the variety of sources.
- Hadoop & its components.
- Usage of Big Data and its analysis in current real-world scenarios.
- Ex: E-commerce, Social Media (Twitter, Facebook, Instagram), Healthcare, etc.
Hadoop Ecosystem and its Architecture: –
- Hadoop Ecosystem
- Complete Hadoop cluster architecture based on the NameNode and slave nodes.
- Rack awareness architecture, etc.
- Hadoop 2.x Core Components (five)
- Functionality of each daemon in the Hadoop architecture.
- Hadoop Storage: HDFS (Hadoop Distributed File System)
- Hadoop Processing: MapReduce Framework
- Different Hadoop Distributions
- Hadoop 2.x Cluster Architecture
- Federation and High Availability Architecture
- Typical Production Hadoop Cluster and Hadoop Cluster Modes
- Common Hadoop Shell Commands
- Hadoop 2.x Configuration Files
- MapReduce w.r.t. YARN
- YARN Components / YARN Architecture
- YARN MapReduce Application Execution Flow
- YARN Workflow discussions based on different pipelines
- Anatomy of a MapReduce Program
- Input Splits, and the relation between Input Splits and HDFS Blocks
- MapReduce: Mapper and Reducer (see the sketch below)
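Hadoop MapReduce jobs are usually written in Java, but the mapper/reducer idea can be sketched compactly with Spark in Scala (the language used later in this course). The word-count example below is only an illustration; the HDFS input path is a hypothetical placeholder.

    import org.apache.spark.sql.SparkSession

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WordCountSketch").getOrCreate()
        val sc = spark.sparkContext

        // "Map" phase: emit (word, 1) for every word in each input split
        val pairs = sc.textFile("hdfs:///user/tester/input.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // "Reduce" phase: sum the counts for each key (word)
        val counts = pairs.reduceByKey(_ + _)

        counts.take(20).foreach(println)
        spark.stop()
      }
    }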
Microsoft Azure Complete Steps – From a Tester's Perspective
- Introduction to MS Azure.
- Azure Databricks services used for deployment and parameter setting – most important (a small read sketch follows this list)
- Library creation
- Client Tools – Databricks services, Azure Data Lake Storage (ADLS) / Azure Data Factory (ADF)
- Power Center Components – Author, Monitor
- Creating a Pipeline
- Creating a Trigger
- Running a pipeline to do ELT.
- Running a Trigger for scheduling.
- Tracking and monitoring a pipeline while running
- Failed pipeline RCA – how to trace the actual error from the error log.
- Sources vs Targets – based on a real-time architecture.
- Working with Relational Targets and Flat file Targets
- Transformations – Active and Passive Transformations (ETL approach)
- Aggregator, Expression, Filter, Sorter, Lookup, Sequence Generator, Joiner, Router
- Insert and Update Strategy based on SCD and type of loads.
- Monitor
- Monitoring, debugging errors and log validations (e.g., error logs, session logs, pipeline logs)
- Complete ELT process descriptions based on practicals in MS Azure.
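As a hedged illustration of the Databricks and ADLS pieces above, the Scala notebook cell below reads a source file from a hypothetical ADLS Gen2 path, with run parameters passed in through notebook widgets (for example, from a Data Factory pipeline). All account, container, folder, and parameter names are placeholders, not the actual project values.

    // Databricks notebook cell (Scala); `spark` and `dbutils` are predefined in Databricks
    val env      = dbutils.widgets.get("env")        // hypothetical pipeline parameter
    val loadDate = dbutils.widgets.get("load_date")  // hypothetical pipeline parameter

    // Hypothetical ADLS Gen2 location: abfss://<container>@<storage-account>.dfs.core.windows.net/<path>
    val sourcePath = s"abfss://raw@mystorageacct.dfs.core.windows.net/sales/$loadDate/"

    val sourceDf = spark.read
      .option("header", "true")
      .csv(sourcePath)

    println(s"[$env] rows landed for $loadDate: ${sourceDf.count()}")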
ETL testing knowledge useful in Big Data testing –
- Slowly Changing Dimensions [SCD-I, SCD-II and SCD-III] & their advantages and disadvantages
- Different types of data loadings – Full Load, Incremental Load and History Load
- Transformations – Active and Passive Transformations
- Aggregator, Expression, Filter, Sorter, Lookup, Sequence Generator, Joiner, Router
- Insert and Update Strategy based on SCD and type of loads.
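As a hedged example of testing an SCD Type-II load, the Scala/Spark SQL sketch below checks that each business key has exactly one current record in the target dimension. The table and column names (dwh.dim_customer, customer_id, is_current) are assumptions for illustration, not the actual project schema.

    import org.apache.spark.sql.SparkSession

    object Scd2CurrentRowCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Scd2CurrentRowCheck")
          .enableHiveSupport()
          .getOrCreate()

        // SCD Type-II expectation: only one "current" version per business key
        val violations = spark.sql(
          """SELECT customer_id, COUNT(*) AS current_versions
            |FROM dwh.dim_customer
            |WHERE is_current = 'Y'
            |GROUP BY customer_id
            |HAVING COUNT(*) > 1""".stripMargin)

        // A passing test returns zero rows
        assert(violations.count() == 0,
          "SCD-II violation: more than one current record found for some keys")
        spark.stop()
      }
    }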
Hive: –
- Introduction to Apache Hive
- Hive Architecture and Components; Hive Metastore.
- Limitations of Hive; Comparison with Traditional Databases
- Hive Data Types and Data Models
- Hive Partitioning, Hive Bucketing, Hive Tables (Managed Tables and External Tables)
- Importing Data, Querying Data & Managing Outputs, Hive Scripts
- How Hive is helpful in reading data from HDFS.
- How Hive is helpful in reading data from the local file system.
- Validation of scenarios of all ETL transformations with HiveQL scripts (see the sketch after this list).
- Differences between ETL and Big Data projects.
- Much more that is part of Big Data is covered.
- Any of the Big Data tools like MS Azure, Cloudera, etc.
- Both Hive connections from the Linux environment and Hive with a front end such as DbVisualizer can be explained.
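As a hedged sketch of validating an aggregator-style transformation with HiveQL (referenced above), the snippet below runs reconciliation queries through spark.sql from a Scala shell started with Hive support; the same HiveQL can also be run directly in the Hive shell or beeline. Table and column names (staging.sales, dwh.sales_summary, region, amount) are placeholders.

    // Run from spark-shell (Scala) with Hive support, so `spark` is predefined
    val sourceAgg = spark.sql(
      "SELECT region, SUM(amount) AS total_amount FROM staging.sales GROUP BY region")
    val targetAgg = spark.sql(
      "SELECT region, total_amount FROM dwh.sales_summary")

    // Rows present on one side but not the other point to a transformation defect
    val mismatches = sourceAgg.exceptAll(targetAgg)
      .union(targetAgg.exceptAll(sourceAgg))

    if (mismatches.count() == 0) println("Aggregation matches between staging and target")
    else mismatches.show(20, truncate = false)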
Spark and Scala: –
- Configuration and token.
- Basic Spark and Scala commands to read data from HDFS.
- How to write Scala scripts using Spark SQL.
- Validation of scenarios of all ETL transformations with Scala scripts.
- Testing strategies like data completeness tests, data transformation tests, data quality checks, etc., using Scala scripts (see the sketch after this list).
- File handling (.parquet files) with spark.sql
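The sketch below (referenced above) shows a simple data-completeness check in Scala: a hypothetical source CSV feed is compared with the target parquet output by row count and by a left-join lookup for missing keys. The paths, the load date, and the order_id column are illustrative assumptions only.

    // spark-shell (Scala) sketch; paths and column names are placeholders
    val source = spark.read.option("header", "true").csv("hdfs:///data/raw/orders/2024-01-01/")
    val target = spark.read.parquet("hdfs:///data/curated/orders/load_date=2024-01-01/")

    // 1. Completeness: row counts on both sides
    println(s"source=${source.count()} target=${target.count()}")

    // 2. Keys present in the source but missing from the target
    source.createOrReplaceTempView("src")
    target.createOrReplaceTempView("tgt")
    spark.sql(
      """SELECT s.order_id
        |FROM src s LEFT JOIN tgt t ON s.order_id = t.order_id
        |WHERE t.order_id IS NULL""".stripMargin).show(20, truncate = false)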
UNIX:
Multiple file handling commands used for complex file handling in the Hadoop architecture as part of code deployment, config file validation, data comparison among compressed HDFS files, etc.
- File Operations (Listing, View, Copy, Rename, Delete, Move, Create)
- File Operation Commands – ls (ls -lrt / ls -ltr), cat, cp, rm, mv, touch
- Directory Operations (Listing, Rename, Delete, Move, Create)
- Directory Operation Commands – cd, pwd, mkdir, rmdir
- Permissions using the "chmod" command [rwx]
- Search Commands – find, locate, grep (grep -i <keyword> filename)
- Pipes and Filters
- wc (count of records, words, etc.)
- Other useful commands for day-to-day use – more, sort, tail, head
- vi editor, script running (./<script name>)
- Complete project discussion with one relevant project that I have already worked on as a Hadoop tester.
Test Management and Requirement Understanding –
- ELT STLC & testers' roles and responsibilities on a day-to-day basis.
- Understanding the Big Data Test plan & Test Strategy based on actual practical examples.
- HP ALM – Test case writing and upload, Defect logging, defect linking and tracking.
- Required complex SQL queries to extract data from files, Unix, and other real-time project practical exposures.
Note – As this is a very vast subject, many more scenarios need to be discussed during the testing study.
- Sample Project (1 hr)
- Real-time mapping sheet, test case writing based on the requirements.
- Interview Questions & Answers (1 hr)
- Mock interviews.
- Support in Resume preparation.
- Complete support for getting a JOB by referrals.