Advanced Big Data Testing (Duration – 45 hrs)
Big Data refers to large, complex, and diverse data sets (structured, semi-structured, and unstructured) that cannot be processed using traditional data processing methods. The importance of big data lies in its ability to provide valuable insights, enhance decision-making, and drive innovation.
Benefits of Big Data:
Cost Savings: Big data tools like Apache Hadoop and Spark can bring cost-saving benefits to businesses when they have to store large amounts of data.
Time-Saving: Real-time, in-memory analytics helps companies collect and process data from various sources quickly.
Market Understanding: Big data analysis helps businesses gain a better understanding of market conditions.
Social Media Listening: Companies can perform sentiment analysis using big data tools, which gives them a better understanding of customer needs and preferences.
Customer Acquisition and Retention: Big data can help businesses identify potential customers and retain existing ones by providing personalized experiences.
Innovation and Product Development: Big data can drive innovation by providing insights into customer behavior, preferences, and needs, helping businesses develop new products and services that meet those needs.
In conclusion, big data is important because it enables businesses to make informed decisions based on insights derived from large and complex data sets. It has the potential to revolutionize how businesses operate across various industries, from healthcare to finance to marketing.
Hadoop is a framework written in Java that uses a large cluster of commodity hardware to store and process big data. Hadoop is built on the MapReduce programming model introduced by Google. Today many big-brand companies use Hadoop to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of four components (a small HDFS access sketch follows the list below):
- MapReduce
- HDFS (Hadoop Distributed File System)
- YARN (Yet Another Resource Negotiator)
- Common Utilities or Hadoop Common
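To make the HDFS component concrete, here is a minimal sketch of how a client program can list files in HDFS using the Hadoop FileSystem API, written in Scala (the same language used in the Spark and Scala module later in this course). The NameNode address and directory path are placeholders for illustration only.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsListSketch {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Hypothetical NameNode address; on a real cluster this comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020")
        val fs = FileSystem.get(conf)

        // List the files stored under a (hypothetical) user directory
        fs.listStatus(new Path("/user/tester"))
          .foreach(status => println(s"${status.getPath} (${status.getLen} bytes)"))
      }
    }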
Detailed Technical Inputs
Big Data Basics: –
- Introduction to Big Data & Big Data challenges
- Limitations of DWH & solutions offered by Big Data architecture
- Differences between Hadoop 1 and Hadoop 2 features.
- Different Hadoop jobs available in the Big Data world.
- Types of Big Data based on the variety of sources.
- Hadoop & its components.
- Usage of Big Data and its analysis in current real-world scenarios.
- Ex: E-commerce, Social Media (Twitter, Facebook, Instagram), Healthcare, etc.
Hadoop Ecosystem and its Architecture: –
- Hadoop Ecosystem
- Complete Hadoop cluster architecture based on the NameNode and slave nodes.
- Rack awareness architecture, etc.
- Hadoop 2.x Core Components (five)
- Functionality of each daemon in the Hadoop architecture.
- Hadoop Storage: HDFS (Hadoop Distributed File System)
- Hadoop Processing: MapReduce Framework
- Different Hadoop Distributions
- Hadoop 2.x Cluster Architecture
- Federation and High Availability Architecture
- Typical Production Hadoop Cluster and Hadoop Cluster Modes
- Common Hadoop Shell Commands
- Hadoop 2.x Configuration Files
- MapReduce w.r.t. YARN
- YARN Components / YARN Architecture
- YARN MapReduce Application Execution Flow
- YARN Workflow discussions based on different pipelines
- Anatomy of a MapReduce Program
- Input Splits, and the relation between Input Splits and HDFS Blocks
- MapReduce: Mapper and Reducer (see the sketch below)
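Hadoop MapReduce jobs are usually written in Java, but the mapper/reducer idea can be sketched compactly with Spark in Scala (the language used later in this course). The word-count example below is only an illustration; the HDFS input path is a hypothetical placeholder.

    import org.apache.spark.sql.SparkSession

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WordCountSketch").getOrCreate()
        val sc = spark.sparkContext

        // "Map" phase: emit (word, 1) for every word in each input split
        val pairs = sc.textFile("hdfs:///user/tester/input.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // "Reduce" phase: sum the counts for each key (word)
        val counts = pairs.reduceByKey(_ + _)

        counts.take(20).foreach(println)
        spark.stop()
      }
    }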
Microsoft Azure Complete Steps – From a Tester's Perspective
- Introduction to MS Azure.
- Azure Databricks services used for deployment and parameter setting – most important (a small read sketch follows this list)
- Library creation
- Client Tools – Databricks services, Azure Data Lake Storage (ADLS) / Azure Data Factory (ADF)
- Power Center Components – Author, Monitor
- Creating a Pipeline
- Creating a Trigger
- Running a pipeline to do ELT.
- Running a Trigger for scheduling.
- Tracking and monitoring a pipeline while running
- Failed pipeline RCA – how to trace the actual error from the error log.
- Sources vs Targets – based on a real-time architecture.
- Working with Relational Targets and Flat file Targets
- Transformations – Active and Passive Transformations (ETL approach)
- Aggregator, Expression, Filter, Sorter, Lookup, Sequence Generator, Joiner, Router
- Insert and Update Strategy based on SCD and type of loads.
- Monitor
- Monitoring, debugging errors and log validations (e.g., error logs, session logs, pipeline logs)
- Complete ELT process descriptions based on practicals in MS Azure.
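As a hedged illustration of the Databricks and ADLS pieces above, the Scala notebook cell below reads a source file from a hypothetical ADLS Gen2 path, with run parameters passed in through notebook widgets (for example, from a Data Factory pipeline). All account, container, folder, and parameter names are placeholders, not the actual project values.

    // Databricks notebook cell (Scala); `spark` and `dbutils` are predefined in Databricks
    val env      = dbutils.widgets.get("env")        // hypothetical pipeline parameter
    val loadDate = dbutils.widgets.get("load_date")  // hypothetical pipeline parameter

    // Hypothetical ADLS Gen2 location: abfss://<container>@<storage-account>.dfs.core.windows.net/<path>
    val sourcePath = s"abfss://raw@mystorageacct.dfs.core.windows.net/sales/$loadDate/"

    val sourceDf = spark.read
      .option("header", "true")
      .csv(sourcePath)

    println(s"[$env] rows landed for $loadDate: ${sourceDf.count()}")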
ETL testing knowledge useful in Big Data testing –
- Slowly Changing Dimensions [SCD-I, SCD-II and SCD-III] & their advantages and disadvantages
- Different types of data loadings – Full Load, Incremental Load and History Load
- Transformations – Active and Passive Transformations
- Aggregator, Expression, Filter, Sorter, Lookup, Sequence Generator, Joiner, Router
- Insert and Update Strategy based on SCD and type of loads.
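As a hedged example of testing an SCD Type-II load, the Scala/Spark SQL sketch below checks that each business key has exactly one current record in the target dimension. The table and column names (dwh.dim_customer, customer_id, is_current) are assumptions for illustration, not the actual project schema.

    import org.apache.spark.sql.SparkSession

    object Scd2CurrentRowCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Scd2CurrentRowCheck")
          .enableHiveSupport()
          .getOrCreate()

        // SCD Type-II expectation: only one "current" version per business key
        val violations = spark.sql(
          """SELECT customer_id, COUNT(*) AS current_versions
            |FROM dwh.dim_customer
            |WHERE is_current = 'Y'
            |GROUP BY customer_id
            |HAVING COUNT(*) > 1""".stripMargin)

        // A passing test returns zero rows
        assert(violations.count() == 0,
          "SCD-II violation: more than one current record found for some keys")
        spark.stop()
      }
    }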
Hive: –
- Introduction to Apache Hive
- Hive Architecture and Components; Hive Metastore.
- Limitations of Hive; Comparison with Traditional Databases
- Hive Data Types and Data Models
- Hive Partitioning, Hive Bucketing, Hive Tables (Managed Tables and External Tables)
- Importing Data, Querying Data & Managing Outputs, Hive Scripts
- How Hive is helpful in reading data from HDFS.
- How Hive is helpful in reading data from the local file system.
- Validation of scenarios of all ETL transformations with HiveQL scripts (see the sketch after this list).
- Differences between ETL and Big Data projects.
- Much more that is part of Big Data is covered.
- Any of the Big Data tools like MS Azure, Cloudera, etc.
- Both Hive connections from the Linux environment and Hive with a front end such as DbVisualizer can be explained.
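As a hedged sketch of validating an aggregator-style transformation with HiveQL (referenced above), the snippet below runs reconciliation queries through spark.sql from a Scala shell started with Hive support; the same HiveQL can also be run directly in the Hive shell or beeline. Table and column names (staging.sales, dwh.sales_summary, region, amount) are placeholders.

    // Run from spark-shell (Scala) with Hive support, so `spark` is predefined
    val sourceAgg = spark.sql(
      "SELECT region, SUM(amount) AS total_amount FROM staging.sales GROUP BY region")
    val targetAgg = spark.sql(
      "SELECT region, total_amount FROM dwh.sales_summary")

    // Rows present on one side but not the other point to a transformation defect
    val mismatches = sourceAgg.exceptAll(targetAgg)
      .union(targetAgg.exceptAll(sourceAgg))

    if (mismatches.count() == 0) println("Aggregation matches between staging and target")
    else mismatches.show(20, truncate = false)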
Spark and Scala: –
- Configuration and token.
- Basic Spark and Scala commands to read data from HDFS.
- How to write Scala scripts using Spark SQL.
- Validation of scenarios of all ETL transformations with Scala scripts.
- Testing strategies like data completeness tests, data transformation tests, data quality checks, etc., using Scala scripts (see the sketch after this list).
- File handling (.parquet files) with spark.sql
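The sketch below (referenced above) shows a simple data-completeness check in Scala: a hypothetical source CSV feed is compared with the target parquet output by row count and by a left-join lookup for missing keys. The paths, the load date, and the order_id column are illustrative assumptions only.

    // spark-shell (Scala) sketch; paths and column names are placeholders
    val source = spark.read.option("header", "true").csv("hdfs:///data/raw/orders/2024-01-01/")
    val target = spark.read.parquet("hdfs:///data/curated/orders/load_date=2024-01-01/")

    // 1. Completeness: row counts on both sides
    println(s"source=${source.count()} target=${target.count()}")

    // 2. Keys present in the source but missing from the target
    source.createOrReplaceTempView("src")
    target.createOrReplaceTempView("tgt")
    spark.sql(
      """SELECT s.order_id
        |FROM src s LEFT JOIN tgt t ON s.order_id = t.order_id
        |WHERE t.order_id IS NULL""".stripMargin).show(20, truncate = false)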
UNIX:
Multiple file handling commands used for complex file handling in the Hadoop architecture as part of code deployment, config file validation, data comparison among compressed HDFS files, etc.
- File Operations (Listing, View, Copy, Rename, Delete, Move, Create)
- File Operation Commands – ls (ls -lrt / ls -ltr), cat, cp, rm, mv, touch
- Directory Operations (Listing, Rename, Delete, Move, Create)
- Directory Operation Commands – cd, pwd, mkdir, rmdir
- Permissions using the "chmod" command [rwx]
- Search Commands – find, locate, grep (grep -i <keyword> filename)
- Pipes and Filters
- wc (count of records, words, etc.)
- Other useful commands for day-to-day use – more, sort, tail, head
- vi editor, script running (./<script name>)
- Complete project discussion with one relevant project that I have already worked on as a Hadoop tester.
Test Management and Requirement Understanding –
- ELT STLC & testers' roles and responsibilities on a day-to-day basis.
- Understanding the Big Data Test plan & Test Strategy based on actual practical examples.
- HP ALM – Test case writing and upload, Defect logging, defect linking and tracking.
- Required complex SQL queries to extract data from files, Unix, and other real-time project practical exposures.
Note – As this is a very vast subject, many more scenarios need to be discussed during the testing study.
- Sample Project (1 hr)
- Real-time mapping sheet, test case writing based on the requirements.
- Interview Questions & Answers (1 hr)
- Mock interviews.
- Support in Resume preparation.
- Complete support for getting a JOB by referrals.