Some of the frequently asked interview questions for Hadoop developers are:
(1)What is the difference between Secondary NameNode, Checkpoint NameNode and Backup Node? (Secondary NameNode is a poorly named component of Hadoop.)
(2)What are the side data distribution techniques?
(3)What is shuffling in MapReduce?
(4)What is partitioning?
(5)Can we change the file cached by Distributed Cache?
(6)What if the JobTracker machine is down?
(7)Can we deploy the JobTracker on a node other than the NameNode?
(8)What are the four modules that make up the Apache Hadoop framework?
(9)Which modes can Hadoop be run in? List a few features of each mode.
(10)Where are Hadoop's configuration files located?
(11)List Hadoop's three configuration files.
(12)What are “slaves” and “masters” in Hadoop?
(13)How many datanodes can run on a single Hadoop cluster?
(14)What is job tracker in Hadoop?
(15)How many job tracker processes can run on a single Hadoop cluster?
(16)What sorts of actions does the job tracker process perform?
(17)How does job tracker schedule a job for the task tracker?
(18)What does the mapred.job.tracker command do?
(19)What is “PID”?
(20)What is “jps”?
(21)Is there another way to check whether Namenode is working?
(22)How would you restart Namenode?
(23)What is “fsck”?
(24)What is a “map” in Hadoop?
(25)What is a “reducer” in Hadoop?
(26)What are the parameters of mappers and reducers?
(27)Is it possible to rename the output file, and if so, how?
(28)List the network requirements for using Hadoop.
(29)Which port does SSH work on?
(30)What is streaming in Hadoop?
(31)What is the difference between Input Split and an HDFS Block?
(32)What does the file hadoop-metrics.properties do?
(33)Name the most common Input Formats defined in Hadoop? Which one is default?
(34)What is the difference between TextInputFormat and KeyValueInputFormat class?
(35)What is InputSplit in Hadoop?
(36)How is the splitting of a file invoked in the Hadoop framework?
(37)Consider this scenario in an M/R system:
- HDFS block size is 64 MB
- Input format is FileInputFormat
- We have 3 files of size 64 KB, 65 MB and 127 MB
(38)How many input splits will be made by Hadoop framework?
(39)What is the purpose of RecordReader in Hadoop?
(39)After the Map phase finishes, the Hadoop framework does “Partitioning, Shuffle and sort”. Explain what happens in this phase?
(40)If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
(41)What is JobTracker?
(42)What are some typical functions of Job Tracker?
(43)What is TaskTracker?
(44)What is the relationship between Jobs and Tasks in Hadoop?
(46)Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
(47)Hadoop achieves parallelism by dividing tasks across many nodes, so it is possible for a few slow nodes to rate-limit and slow down the rest of the program. What mechanism does Hadoop provide to combat this?
(48)How does speculative execution work in Hadoop?
(49)Using command line in Linux, how will you
- See all jobs running in the Hadoop cluster
- Kill a job?
(50)What is Hadoop Streaming?
(51)What is the characteristic of the streaming API that makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk, etc.?
(52)What is Distributed Cache in Hadoop?
(53)Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to the Hadoop job?
(54)Is it possible to have Hadoop job output in multiple directories? If yes, how?
(55)What will a Hadoop job do if you try to run it with an output directory that is already present? Will it
- Overwrite it
- Warn you and continue
- Throw an exception and exit
(56)How can you set an arbitrary number of mappers to be created for a job in Hadoop?
(57)How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
(58)How will you write a custom partitioner for a Hadoop job?
(59)How did you debug your Hadoop code?
(60)What is BIG DATA?
(61)Can you give some examples of Big Data?
(62)Can you give a detailed overview about the Big Data being generated by Facebook?
(63)According to IBM, what are the three characteristics of Big Data?
(64)How Big is ‘Big Data’?
(65)How is the analysis of Big Data useful for organizations?
(66)Who are ‘Data Scientists’?
(67)What are some of the characteristics of Hadoop framework?
(68)Give a brief overview of Hadoop history.
(69)Give examples of some companies that are using the Hadoop architecture.
(70)What is the basic difference between traditional RDBMS and Hadoop?
(71)What is structured and unstructured data?
(72)What are the core components of Hadoop?
(73)What is HDFS?
(74)What are the key features of HDFS?
(75)What is Fault Tolerance?
(76)Replication causes data redundancy, so why is it pursued in HDFS?
(77)Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
(78)What is throughput? How does HDFS get a good throughput?
(79)What is streaming access?
(80)What is commodity hardware? Does commodity hardware include RAM?
(81)What is metadata?
(82)Why do we use HDFS for applications having large data sets and not when there are a lot of small files?
(83)What is a daemon?
(84)Is the Namenode machine the same as the datanode machine in terms of hardware?
(85)What is a heartbeat in HDFS?
(86)Are Namenode and job tracker on the same host?
(87)What is a ‘block’ in HDFS?
(88)What are the benefits of block transfer?
(89)If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?
(90)How is indexing done in HDFS?
(91)If a data node is full, how is it identified?
(92)If datanodes increase, then do we need to upgrade Namenode?
(93)Are job tracker and task trackers present in separate machines?
(94)When we send data to a node, do we allow settling-in time before sending more data to that node?
(95)Does hadoop always require digital data to process?
(96)On what basis does the Namenode decide which datanode to write to?
(97)Doesn’t Google have its very own version of DFS?
(98)Who is a 'user' in HDFS?
(99)Is the client the end user in HDFS?
(100)What is the communication channel between client and namenode/datanode?
(101)What is a rack?
(102)On what basis will data be stored on a rack?
(103)Do we need to place the 2nd and 3rd replicas in rack 2 only?
(104)What if rack 2 and the datanode fail?
(105)What is a Secondary Namenode? Is it a substitute to the Namenode?
(106)What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
(107)What is ‘Key value pair’ in HDFS?
(108)What is the difference between MapReduce engine and HDFS cluster?
(109)Is map like a pointer?
(110)Do we require two servers for the Namenode and the datanodes?
(111)Why are the number of splits equal to the number of maps?
(112)Is a job split into maps?
(113)Which are the two types of ‘writes’ in HDFS?
(114)Why is 'reading' done in parallel but 'writing' is not in HDFS?
(115)Can Hadoop be compared to a NoSQL database like Cassandra?
(116)How can I install Cloudera VM in my system?
(117)What is a Task Tracker in Hadoop? How many instances of Task Tracker run on a Hadoop cluster?
(118)What are the four basic parameters of a mapper?
(119)What is the input type/format in MapReduce by default?
(120)Can we do online transactions (OLTP) using Hadoop?
(121)Explain how HDFS communicates with Linux native file system
(122)What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
(123)What is the InputFormat?
(124)What is the InputSplit in MapReduce?
(125)What are IdentityMapper and IdentityReducer in MapReduce?
(126)How JobTracker schedules a task?
(127)When are the reducers started in a MapReduce job?
(128)On what concept does the Hadoop framework work?
(129)What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
(130)What other technologies have you used in the Hadoop stack?
(131)How does the NameNode handle data node failures?
(132)How many Daemon processes run on a Hadoop system?
(133)What is the configuration of a typical slave node on a Hadoop cluster?
(134)How many JVMs run on a slave node?
(135)How will you make changes to the default configuration files?
(136)Can I set the number of reducers to zero?
(137)What is the default port that the JobTracker listens on?
(138)How to resolve an "unable to read options file" error while importing data from MySQL to HDFS?
(139)What problems have you faced while working on Hadoop code?
(140)How would you modify that solution to only count the number of unique words in all the documents?
(141)What is the difference between a Hadoop and Relational Database and Nosql?
(142)How the HDFS Blocks are replicated?
(143)What is a Task instance in Hadoop? Where does it run?
(144)What is the meaning of replication factor?
(145)If reducers do not start before all mappers finish, then why does the progress of a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducers' progress percentage displayed when the mappers are not finished yet?
(146)How does the Client communicate with HDFS?
(147)Which object can be used to get the progress of a particular job?
(148)What is the next step after the Mapper or MapTask?
(149)What are the default configuration files used in Hadoop?
(150)Does the MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job, can a reducer communicate with another reducer?
(151)What is the HDFS block size? How is it different from the traditional file system block size?
(152)What is SPOF (Single Point of Failure)?
(153)Where do you specify the Mapper implementation?
(154)What is a NameNode? How many instances of NameNode run on a Hadoop cluster?
(155)Explain the core methods of the Reducer.
(156)What is the Hadoop framework?
(157)Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?
(158)How would you tackle counting words in several text documents?
(159)How does the master-slave architecture work in Hadoop?
(160)How would you tackle calculating the number of unique visitors for each hour by mining a huge Apache log? You can use post processing on the output of the MapReduce job.
(161)How did you debug your Hadoop code?
(162)How will you write a custom partitioner for a Hadoop job?
(163)How can you add arbitrary key-value pairs in your mapper?
(164)What is a datanode?
(165)What are combiners? When should I use a combiner in my MapReduce Job?
(166)How is the Mapper instantiated in a running job?
(167)Which interfaces need to be implemented to create a Mapper and Reducer for Hadoop?
(168)What happens if you don't override the Mapper methods and keep them as they are?
(169)What does a Hadoop application look like, i.e. what are its basic components?
(170)What is the meaning of speculative execution in Hadoop? Why is it important?
(170)What are the restrictions on the key and value classes?
(171)Explain the WordCount implementation via the Hadoop framework.
(172)What does the Mapper do?
(173)What is MapReduce?
(174)Explain the Reducer's Sort phase.
(175)What are the primary phases of the Reducer?
(176)Explain the Reducer's reduce phase?
(177)Explain the shuffle?
(178)What happens if the number of reducers is 0?
(179)How many Reducers should be configured?
(180)What are the Writable and WritableComparable interfaces?
(181)What is the Hadoop MapReduce API contract for a key and value Class?
(182)Where is the Mapper output (intermediate key-value data) stored?
(183)What is the difference between HDFS and NAS?
(184)What is Distributed Cache in Hadoop?
(185)Have you ever used Counters in Hadoop? Give an example scenario.
(186)Can we write a MapReduce program in a programming language other than Java? How?
(187)What alternate way does HDFS provide to recover data in case a Namenode, without backup, fails and cannot be recovered?
(188)What is the use of Context object?
(189)What is the Reducer used for?
(190)What is the use of the Combiner?
(191)Explain the input and output data formats of the Hadoop framework.
(192)What are compute and storage nodes?
(193)What is a namenode?
(194)How does the Mapper's run() method work?
(195)What is the default replication factor in HDFS?
(196)Is it possible for a job to have 0 reducers?
(197)How many maps are there in a particular job?
(198)How many instances of JobTracker can run on a Hadoop cluster?
(199)How can we control which reducer a particular key goes to?
(200)What is the typical size of an HDFS block?
(201)What do you understand about Object Oriented Programming (OOP)? Use Java examples.
(202)What are the main differences between versions 1.5 and version 1.6 of Java?
(203)Describe what happens to a MapReduce job from submission to output?
(204)What mechanism does the Hadoop framework provide to synchronize changes made in the Distributed Cache during runtime of the application?
(205)Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for any reason?
(206)Did you ever run into a lopsided job that resulted in an out-of-memory error? If yes, how did you handle it?
(207)What is HDFS? How is it different from traditional file systems?
(208)What is the benefit of Distributed Cache? Why can't we just have the file in HDFS and have the application read it?
(209)How JobTracker schedules a task?
(210)How many Daemon processes run on a Hadoop system?
(211)What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?
(213)What is the difference between HDFS and NAS?
(214)How does the NameNode handle data node failures?
(215)Does MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job can a reducer communicate with another reducer?
(216)Where is the Mapper output (intermediate key-value data) stored?
(217)What are combiners? When should I use a combiner in my MapReduce job?
(218)What are IdentityMapper and IdentityReducer in MapReduce?
(219)When are the reducers started in a MapReduce job?
(220)If reducers do not start before all mappers finish, then why does the progress of a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducers' progress percentage displayed when the mappers are not finished yet?
(221)What is HDFS Block size? How is it different from traditional file system block size?
(222)How does the Client communicate with HDFS?
(223)What is NoSQL?
(224)We already have SQL, so why NoSQL?
(225)What is the difference between SQL and NoSQL?
(226)Does NoSQL follow the relational DB model?
(227)Why would NoSQL be better than using a SQL Database? And how much better is it?
(228)What do you understand by Standalone (or local) mode?
(229)What is Pseudo-distributed mode?
(230)What does /var/hadoop/pids do?
(231)Pig for Hadoop - Give some points?
(232)Hive for Hadoop - Give some points?
(233)File permissions in HDFS?
(234)What is ODBC and JDBC connectivity in Hive?
(235)What is Derby database?
(236)What is Schema on Read and Schema on Write?
(237)What infrastructure do we need to process 100 TB of data using Hadoop?
(238)What are internal and external tables in Hive?
(239)What is the small file problem in Hadoop?
(240)How does a client read/write data in HDFS?
(241)What should be the ideal replication factor in Hadoop?
(242)What is the optimal block size in HDFS?
(243)Explain metadata in the Namenode.
(244)How to enable the recycle bin or trash in Hadoop?
(245)What is the difference between int and IntWritable?
(246)How to change the replication factor (for the cases below):
(247)In MapReduce, why does the map write its output to local disk instead of HDFS?
(248)Rack awareness of Namenode
(250)What is bucketing in Hive?
(251)What is Clustering in Hive?
(252)What type of data should we put in the Distributed Cache? When should we put data in it? How much volume should we put in?
(253)What is Distributed Cache?
(254)What is the Partitioner in Hadoop? Where does it run, mapper or reducer?
(255)What are the new and old MapReduce APIs used while writing a MapReduce program? Explain how they work.
(256)How to write a Custom Key Class?
(257)What is the utility of using Writable Comparable (Custom Class) in Map Reduce code?
(258)What are Input Format, Input Split & Record Reader and what they do?
(259)Why do we use IntWritable instead of int? Why do we use LongWritable instead of long?
(260)How to enable Recycle bin in Hadoop?
(261)If data is present in HDFS and RF is defined, then how can we change Replication Factor?
(262)How can we change the replication factor when data is on the fly?
(262)How to resolve: mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory //hadoop/inpdata. Name node is in safe mode.
(263)What does Hadoop do in Safe Mode?
(264)What should be the ideal replication factor in a Hadoop Cluster?
(265)What is the heartbeat in Hadoop?
(266)What are the considerations when doing hardware planning for the Master in a Hadoop architecture?
(267)When should a Hadoop archive be created?
(268)What factors are considered before deciding the block size?
(269)In which location does the Name Node store its metadata, and why?
(270)Should we use RAID in Hadoop or not?
(271)How are blocks distributed among all data nodes for a particular chunk of data?
(272)How to enable Trash/Recycle Bin in Hadoop?
(273)What is a Hadoop archive?
(274)How to create a Hadoop archive?
(275)How can we take Hadoop out of Safe Mode?
(276)What is safe mode in Hadoop?
(277)Why is MapReduce output written to local disk?
(278)When does Hadoop enter Safe Mode?
(279)Why is the data node block size in HDFS 64 MB?
(280)What is "Non DFS Used"?
(281)Virtual Box & Ubuntu Installation
(282)What is Rack awareness?
(283)On what basis does the name node distribute blocks across the data nodes?
(284)What is Output Format in Hadoop?
(285)How to write data in Hbase using flume?
(286)What is difference between memory channel and file channel in flume?
(287)How to create a table in Hive for a JSON input file?
(288)What is speculative execution in Hadoop?
(289)What is a Record Reader in Hadoop?
(290)How to resolve the following error while running a query in Hive: "Error in metadata: Cannot validate serde"?
(291)What is the difference between internal and external tables in Hive?
(292)What is Bucketing and Clustering in Hive?
(293)How to enable/configure the compression of map output data in Hadoop?
(294)What is InputFormat in Hadoop?
(295)How to configure Hadoop to reuse the JVM for mappers?
(296)What is the difference between a split and a block in Hadoop?
(297)What is an Input Split in Hadoop?
(298)How can one write a custom record reader?
(299)What is balancer? How to run a cluster balancing utility?
(300)What is the version-id mismatch error in Hadoop?
(301)How to handle bad records during parsing?
(302)What are the identity mapper and reducer? In which cases can we use them?
(303)What are reduce-only jobs?
(304)What is crontab? Explain with suitable example.
(305)Safe-mode exceptions
(306)What is the meaning of the term "non-DFS used" in the Hadoop web console?
(307)What is an AMI?
(308)Can we submit a MapReduce job from a slave node?
(309)How to resolve the small file problem in HDFS?
(310)How to overwrite an existing output file during execution of mapreduce jobs?
(311)What is difference between reducer and combiner?
(311)What do you understand by node redundancy, and does it exist in a Hadoop cluster?
(312)How do you proceed to write your first MapReduce program?
(313)How to change the replication factor of files already stored in HDFS?
(314)How to resolve "IOException: Cannot create directory" while formatting the namenode in Hadoop?
(315)How can one set a space quota on a Hadoop (HDFS) directory?
(316)How can one increase replication factor to a desired value in Hadoop?
1. Are there any problems which can only be solved by MapReduce and cannot be solved by Pig? In which kinds of scenarios will MR jobs be more useful than Pig?
2. How can we change the split size if our commodity hardware has less storage space?
3. What is the difference between an HDFS Block and an Input Split?
4. How can we check whether the Namenode is working or not?
5. Why do we need password-less SSH in a Fully Distributed environment?
6. Some details about SSH communication between the Masters and the Slaves.
7. Why is Replication pursued in HDFS in spite of its data redundancy?
Difference between map-side and reduce-side join?
Difference between static and dynamic partitioning?
What is safe mode?
How to avoid select * kind of queries in Hive?
What are sequence files? What are map files?
There are 3 input files. Write an MR program for word count such that the output is in 3 different files corresponding to the respective word counts of the 3 input files.
8. How can the number of mappers be controlled?
Different configuration files in Hadoop?
Different modes of execution?
Explain JVM profiling.
Load balancing in an HDFS cluster.
Difference between partitioning and bucketing?
Difference between managed and external tables?
Explain how performance tuning is done in Hive.
Explain about MRUnit.
Command for moving data from one cluster to another cluster?
Difference between RC and ORC file formats?
How to check the schema of a table in Hive?
What is metadata? Where is it stored in Hive?
For a Hadoop developer, the questions most often asked during an interview are:
1. What is shuffling in MapReduce?
2. Difference between an HDFS block and a split?
3. What are the map files in Hadoop?
4. What is the use of the .pagination class?
5. What are the core components of Hadoop?
1.What is Apache Spark?
Spark is a fast, easy-to-use and flexible data processing framework. It has an advanced execution engine supporting cyclic data flow and in-memory computing. Spark can run on Hadoop, standalone or in the cloud and is capable of accessing diverse data sources including HDFS, HBase, Cassandra and others.
2.Explain key features of Spark.
It allows integration with Hadoop and files included in HDFS.
Spark has an interactive language shell, as it has an independent interpreter for Scala (the language in which Spark is written).
Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing.
3.Define RDD.
RDD is the acronym for Resilient Distributed Datasets – a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two types of RDD:
Parallelized Collections: the existing collections running in parallel with one another.
Hadoop datasets: they perform a function on each file record in HDFS or another storage system.
4.What does a Spark Engine do?
The Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.
5.Define Partitions.
As the name suggests, a partition is a smaller and logical division of data, similar to a "split" in MapReduce. Partitioning is the process of deriving logical units of data to speed up processing. Everything in Spark is a partitioned RDD.
6.What operations does an RDD support?
Transformations
Actions
7.What do you understand by Transformations in Spark?
Transformations are functions applied on an RDD, resulting in another RDD. They do not execute until an action occurs. map() and filter() are examples of transformations, where the former applies the function passed to it on each element of the RDD and results in another RDD. filter() creates a new RDD by selecting elements from the current RDD that pass the function argument. (A minimal sketch of both appears below.)
8.Define Actions.
An action helps in bringing back the data from an RDD to the local machine. An action's execution is the result of all previously created transformations. reduce() is an action that applies the function passed to it again and again until one value is left. The take() action takes values from the RDD back to the local node.
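The following is a minimal PySpark sketch (not from the original post) illustrating the transformation/action distinction described above; it assumes sc is an existing SparkContext, as in the other examples on this page.

rdd = sc.parallelize([1, 2, 3, 4, 5])

squared = rdd.map(lambda x: x * x)            # transformation: lazy, nothing runs yet
evens = squared.filter(lambda x: x % 2 == 0)  # another lazy transformation

total = evens.reduce(lambda a, b: a + b)      # action: triggers execution on the cluster
first_two = evens.take(2)                     # action: brings values back to the driver
print(total, first_two)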
9.Define functions of SparkCore.
Serving as the base engine, SparkCore performs various important functions like memory management, monitoring jobs, fault tolerance, job scheduling and interaction with storage systems.
10.What is RDD Lineage?
Spark does not support data replication in memory and thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how to build itself from other datasets.
11.What is Spark Driver?
Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark Master. The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.
12.What is Hive on Spark?
Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
13.Name commonly-used Spark Ecosystems.
Spark SQL (Shark) - for SQL developers
Spark Streaming - for processing live data streams
GraphX - for generating and computing graphs
MLlib - Machine Learning Algorithms
SparkR - to promote R programming in the Spark engine.
14.Define Spark Streaming.
Spark supports stream processing – an extension to the Spark API allowing stream processing of live data streams. Data from different sources like Flume and HDFS is streamed and finally processed to file systems, live dashboards and databases. It is similar to batch processing in that the input data is divided into streams like batches. (A minimal sketch appears after question 16 below.)
15.What is GraphX?
Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.
16.What does MLlib do?
MLlib is the scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
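A minimal PySpark Streaming sketch (not from the original post) of the DStream idea described in question 14; it assumes sc is an existing SparkContext, and the socket host and port are placeholders.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                       # 10-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)      # placeholder live source

counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))     # per-batch word counts
counts.pprint()                                      # print each batch's counts

ssc.start()               # start receiving and processing the stream
ssc.awaitTermination()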
17.What is Spark SQL?
Spark SQL, better known as Shark, is a novel module introduced in Spark to work with structured data and perform structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports an altogether different RDD called SchemaRDD, composed of row objects and schema objects defining the data type of each column in the row. It is similar to a table in a relational database.
18.What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far. (A minimal sketch appears after question 22 below.)
19.What file systems does Spark support?
• Hadoop Distributed File System (HDFS)
• Local File system
• S3
20.What is Yarn?
Similar to Hadoop, Yarn is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on Yarn necessitates a binary distribution of Spark that is built with Yarn support.
21.List the functions of Spark SQL.
Spark SQL is capable of:
• Loading data from a variety of structured sources
• Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
• Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more
22.What are the benefits of Spark over MapReduce?
Due to the availability of in-memory processing, Spark implements the processing around 10-100x faster than Hadoop MapReduce. MapReduce makes use of persistent storage for its data processing tasks.
Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks from the same core, like batch processing, streaming, machine learning and interactive SQL queries. However, Hadoop only supports batch processing.
Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, while there is no iterative computing implemented by Hadoop.
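A minimal sketch (not from the original post) showing Spark SQL writing, reading and querying a Parquet file, as discussed in questions 17-18. It assumes the Spark 1.x-style SQLContext API that matches this post's era, that sc is an existing SparkContext, and that the HDFS path is a placeholder.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Build a tiny DataFrame, write it out as Parquet, then read it back.
df = sqlContext.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.parquet("hdfs://namenode:9000/tmp/people.parquet")       # placeholder path

people = sqlContext.read.parquet("hdfs://namenode:9000/tmp/people.parquet")
people.registerTempTable("people")                                # expose as a SQL table
sqlContext.sql("SELECT name FROM people WHERE id = 1").show()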
23.Is there any benefit of learning MapReduce, then? Yes, MapReduce is a paradigm used by many big data tools including Spark as well. It is extremely relevant to use MapReduce when the data grows bigger and bigger. Most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.
24.What is Spark Executor?
When SparkContext connects to a cluster manager, it acquires Executors on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.
25.Name types of Cluster Managers in Spark.
The Spark framework supports three major types of Cluster Managers:
Standalone: a basic manager to set up a cluster
Apache Mesos: a generalized/commonly-used cluster manager that also runs Hadoop MapReduce and other applications
Yarn: responsible for resource management in Hadoop
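A hedged illustration (not from the original post) of how each of the three cluster managers above is typically selected when submitting an application with spark-submit; the host names and the script name my_app.py are placeholders.

# Standalone cluster manager
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos
spark-submit --master mesos://mesos-master:5050 my_app.py

# YARN (cluster deploy mode)
spark-submit --master yarn --deploy-mode cluster my_app.py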
26.What do you understand by worker node?
Worker node refers to any node that can run the application code in a cluster.
27.What is PageRank?
A unique feature and algorithm in GraphX, PageRank is the measure of each vertex in the graph. For instance, an edge from u to v represents endorsement of v's importance by u. In simple terms, if a user at Instagram is followed massively, it will rank high on that platform.
28.Do you need to install Spark on all nodes of a Yarn cluster while running Spark on Yarn?
No, because Spark runs on top of Yarn.
29.Illustrate some demerits of using Spark.
Since Spark utilizes more storage space compared to Hadoop and MapReduce, there may arise certain problems. Developers need to be careful while running their applications in Spark. Instead of running everything on a single node, the work must be distributed over multiple clusters.
30.How to create an RDD?
Spark provides two methods to create an RDD:
• By parallelizing a collection in your Driver program. This makes use of SparkContext's 'parallelize' method:
val data = Array(2,4,6,8,10)
val distData = sc.parallelize(data)
• By loading an external dataset from external storage like HDFS, HBase or a shared file system.
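For completeness, a minimal PySpark sketch (not from the original post) of the same two RDD-creation routes; it assumes sc is an existing SparkContext and the HDFS path is a placeholder.

data = [2, 4, 6, 8, 10]
distData = sc.parallelize(data)                                    # from an in-memory collection

distFile = sc.textFile("hdfs://namenode:9000/path/to/data.txt")    # from external storage
print(distData.count(), distFile.count())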
Interview Questions & Answers on Apache Spark [Part 2]

Q1: Say I have a huge list of numbers in an RDD (say myrdd). And I wrote the following code to compute the average:

def myAvg(x, y):
    return (x+y)/2.0;
avg = myrdd.reduce(myAvg);

What is wrong with it? And how would you correct it?
Ans: The average function is not commutative and associative; I would simply sum it and then divide by count.

def sum(x, y):
    return x+y;
total = myrdd.reduce(sum);
avg = total / myrdd.count();

The only problem with the above code is that the total might become very big and thus overflow. So, I would rather divide each number by count and then sum in the following way.

cnt = myrdd.count();
def divideByCnt(x):
    return x/cnt;
myrdd1 = myrdd.map(divideByCnt);
avg = myrdd1.reduce(sum);
Q2: Say I have a huge list of numbers in a file in HDFS. Each line has one number. And I want to compute the square root of the sum of squares of these numbers. How would you do it?
Ans:
# We would first load the file as an RDD from HDFS on Spark
numsAsText = sc.textFile("hdfs://namenode:9000//kayan/mynumbersfile.txt");

# Define the function to compute the squares
def toSqInt(str):
    v = int(str);
    return v*v;

# Run the function on the Spark RDD as a transformation
nums = numsAsText.map(toSqInt);

# Run the summation as a reduce action (two-argument sum, as defined in Q1)
def sum(x, y):
    return x+y;
total = nums.reduce(sum);

# Finally compute the square root, for which we need to import math.
import math;
print math.sqrt(total);
Q3: Is the following approach correct? Is sqrtOfSumOfSq a valid reducer?

numsAsText = sc.textFile("hdfs://namenode:9000//kalyan/mynumbersfile.txt");
def toInt(str):
    return int(str);
nums = numsAsText.map(toInt);

import math;
def sqrtOfSumOfSq(x, y):
    return math.sqrt(x*x+y*y);
total = nums.reduce(sqrtOfSumOfSq);
print total;

Ans: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer.
Q4: Could you compare the pros and cons of your approach (in Question 2 above) and my approach (in Question 3 above)?
Ans: You are doing the square and square root as part of the reduce action, while I am squaring in map() and summing in reduce() in my approach. My approach will be faster because in your case the reducer code is heavy, as it is calling math.sqrt(), and reducer code is generally executed approximately n-1 times over the elements of the Spark RDD. The only downside of my approach is that there is a huge chance of integer overflow because I am computing the sum of squares as part of map.
Q5: If you have to compute the total counts of each of the unique words on Spark, how would you go about it?
Ans:
# This will load bigtextfile.txt as an RDD in Spark
lines = sc.textFile("hdfs://namenode:9000//kalyan/bigtextfile.txt");

# Define a function that can break each line into words
def toWords(line):
    return line.split();

# Run the toWords function on each element of the RDD on Spark as a flatMap transformation.
# We use flatMap instead of map because our function returns multiple values.
words = lines.flatMap(toWords);
# Convert each word into a (key, value) pair. Here the key will be the word itself and the value will be 1.
def toTuple(word):
    return (word, 1);
wordsTuple = words.map(toTuple);
# Now we can easily do the reduceByKey() operation.
def sum(x, y): return x+y;
counts = wordsTuple.reduceByKey(sum)
# Now, print the counts.
print counts.collect()

Q6: In a very huge text file, you want to just check if a particular keyword exists. How would you do this using Spark?
Ans:
lines = sc.textFile("hdfs://namenode:9000//kalyan/bigtextfile.txt");

def isFound(line):
    if line.find("mykeyword") > -1:
        return 1;
    return 0;

foundBits = lines.map(isFound);
total = foundBits.reduce(lambda x, y: x + y);

if total > 0:
    print "FOUND";
else:
    print "NOT FOUND";
Q7: Can you improve the performance of the code in the previous answer?
Ans: Yes. The search does not stop even after the word we are looking for has been found. Our map code would keep executing on all the nodes, which is very inefficient.

We could utilize accumulators to report whether the word has been found or not and then stop the job. Something along these lines:

import thread, threading
from time import sleep

result = "Not Set"
lock = threading.Lock()
accum = sc.accumulator(0)

def map_func(line):
    # introduce delay to emulate the slowness
    sleep(1);
    if line.find("Adventures") > -1:
        accum.add(1);
        return 1;
    return 0;

def start_job():
    global result
    try:
        sc.setJobGroup("job_to_cancel", "some description")
        lines = sc.textFile("hdfs://namenode:9000//kalyan/wordcount/input/big.txt");
        result = lines.map(map_func);
        result.take(1);
    except Exception as e:
        result = "Cancelled"
    lock.release()

def stop_job():
    while accum.value < 3:
        sleep(1);
    sc.cancelJobGroup("job_to_cancel")

supress = lock.acquire()
supress = thread.start_new_thread(start_job, tuple())
supress = thread.start_new_thread(stop_job, tuple())
supress = lock.acquire()
Interview Questions & Answers on Apache Spark [Part 1]
Q1: When do you use Apache Spark? OR What are the benefits of Spark over MapReduce?
Ans: Spark is really fast. As per their claims, it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. It aptly utilizes RAM to produce faster results.
In the map reduce paradigm, you write many Map-Reduce tasks and then tie these tasks together using Oozie/shell scripts. This mechanism is very time consuming and the map-reduce tasks have heavy latency. And quite often, translating the output of one MR job into the input of another MR job might require writing another code because Oozie may not suffice.
In Spark, you can basically do everything using a single application / console (pyspark or scala console) and get the results immediately. Switching between "running something on a cluster" and "doing something locally" is fairly easy and straightforward. This also leads to less context switching for the developer and more productivity. Spark is kind of equal to MapReduce and Oozie put together.
Q2: Is there any point in learning MapReduce, then?
Ans: Yes. For the following reasons:
MapReduce is a paradigm used by many big data tools including Spark. So, understanding the MapReduce paradigm and how to convert a problem into a series of MR tasks is very important.
When the data grows beyond what can fit into the memory on your cluster, the Hadoop MapReduce paradigm is still very relevant.
Almost every other tool such as Hive or Pig converts its query into MapReduce phases. If you understand MapReduce then you will be able to optimize your queries better.
Q3: When running Spark on Yarn, do I need to install Spark on all nodes of the Yarn Cluster?
Ans: Since Spark runs on top of Yarn, it utilizes Yarn for the execution of its commands over the cluster's nodes. So, you just have to install Spark on one node.
Q4: What are the downsides of Spark?
Ans: Spark utilizes memory. The developer has to be careful. A casual developer might make the following mistakes:
She may end up running everything on the local node instead of distributing work over to the cluster.
She might hit some webservice too many times by way of using multiple clusters.
The first problem is well tackled by the Hadoop MapReduce paradigm, as it ensures that the data your code is churning is fairly small at a point of time, so you can't make the mistake of trying to handle the whole data on a single node.
The second mistake is possible in Map-Reduce too. While writing Map-Reduce, one may hit a service from inside map() or reduce() too many times. This overloading of a service is also possible while using Spark.
Q5: What is an RDD?
Ans:
The full form of RDD is Resilient Distributed Dataset. It is a representation of data located on a network which is:
Immutable - You can operate on the RDD to produce another RDD but you can't alter it.
Partitioned / Parallel - The data located in an RDD is operated on in parallel. Any operation on an RDD is done using multiple nodes.
Resilient - If one of the nodes hosting a partition fails, another node takes its data.
RDD provides two kinds of operations: Transformations and Actions.
Q6: What are Transformations?
Ans: Transformations are the functions that are applied on an RDD (resilient distributed dataset). A transformation results in another RDD. A transformation is not executed until an action follows. Examples of transformations are:
map() - applies the function passed to it on each element of the RDD, resulting in a new RDD.
filter() - creates a new RDD by picking the elements from the current RDD which pass the function argument.
Q7: What are Actions?
Ans: An action brings back the data from the RDD to the local machine. Execution of an action results in all the previously created transformations being executed. Examples of actions are:
reduce() - executes the function passed to it again and again until only one value is left. The function should take two arguments and return one value.
take() - takes values back to the local node from the RDD.
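A minimal PySpark sketch (not from the original post) pairing one transformation with one action as described above; it assumes sc is an existing SparkContext and the HDFS path is a placeholder.

lines = sc.textFile("hdfs://namenode:9000/path/to/logfile.txt")
errors = lines.filter(lambda line: "ERROR" in line)   # transformation: lazy, builds a new RDD
print(errors.take(5))                                 # action: pulls the first 5 matches to the driver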
Hadoop Developer Interview Questions
Explain how Hadoop is different from other parallel computing solutions.
What are the modes Hadoop can run in?
What will a Hadoop job do if developers try to run it with an output directory that is already present?
How can you debug your Hadoop code?
Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed due to any reason? (Open Ended Question)
Give some examples of companies that are using Hadoop architecture extensively.
Hadoop Interview Questions
If you want to analyze 100TB of data, what is the best architecture for that?
Explain about the functioning of Master-Slave architecture in Hadoop.
What is distributed cache and what are its benefits?
What are the points to consider when moving from an Oracle database to Hadoop clusters?
How would you decide the correct size and number of nodes in a Hadoop cluster?
How do you benchmark your Hadoop Cluster with Hadoop tools?
Hadoop Interview Questions on HDFS
Explain the major difference between an HDFS block and an InputSplit.
Does HDFS make block boundaries between records?
What is streaming access?
What do you mean by “Heartbeat” in HDFS?
If there are 10 HDFS blocks to be copied from one machine to another, but the other machine can copy only 7.5 blocks, is there a possibility for the blocks to be broken down during the time of replication?
What is Speculative execution in Hadoop?
What is WebDAV in Hadoop?
What is fault tolerance in HDFS?
How are HDFS blocks replicated?
Which command is used to do a file system check in HDFS?
Explain about the different types of “writes” in HDFS.
Hadoop MapReduce Interview Questions
What is a NameNode and what is a DataNode?
What is Shuffling in MapReduce?
Why would a Hadoop developer develop a Map Reduce by disabling the reduce step?
What is the functionality of Task Tracker and Job Tracker in Hadoop? How many instances of a Task Tracker and Job Tracker can be run on a single Hadoop Cluster?
How does NameNode tackle DataNode failures?
What is InputFormat in Hadoop?
What is the purpose of RecordReader in Hadoop?
What is InputSplit in MapReduce?
In Hadoop, if a custom partitioner is not defined, how is data partitioned before it is sent to the reducer?
What is replication factor in Hadoop and what is the default replication factor level Hadoop comes with?
What is SequenceFile in Hadoop? Explain its importance.
If you are the user of a MapReduce framework, what are the configuration parameters you need to specify?
Explain about the different parameters of the mapper and reducer functions.
How can you set a random number of mappers and reducers for a Hadoop job?
How many Daemon processes run on a Hadoop System?
What happens if the number of reducers is 0?
What is meant by Map-side and Reduce-side join in Hadoop?
How can the NameNode be restarted?
Hadoop attains parallelism by isolating the tasks across various nodes; it is possible for some of the slow nodes to rate-limit the rest of the program and slow down the program. What method does Hadoop provide to combat this?
What is the significance of the conf.setMapper class?
What are combiners and when are these used in a MapReduce job?
How does a DataNode know the location of the NameNode in a Hadoop cluster?
How can you check whether the NameNode is working or not?
Pig Interview Questions
When doing a join in Hadoop, you notice that one reducer is running for a very long time. How will you address this problem in Pig?
Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kind of scenarios will MR jobs be more useful than PIG?
Give an example scenario on the usage of counters.
Hive Interview Questions
Explain the difference between ORDER BY and SORT BY in Hive.
Differentiate between HiveQL and SQL.
Gartner predicted that "Big Data movement will generate 4.4 million new IT jobs by end of 2015 and Hadoop will be in most advanced analytics products by 2015." With the increasing demand for Hadoop for Big Data related issues, the prediction by Gartner is ringing true. During March 2014, there were approximately 17,000 Hadoop Developer jobs posted online. As of April 4th, 2015, there are about 50,000 job openings for Hadoop Developers across the world, with close to 25,000 openings in the US alone. Of the 3000 Hadoop students that we have trained so far, the most popular blog article request was one on Hadoop interview questions.
There are 4 steps which you must take if you are trying to get a job in emerging technology domains: carefully outline the roles and responsibilities; make your resume highlight the required core skills; document each and every step of your efforts; purposefully network.
With more than 30,000 open Hadoop developer jobs, professionals must familiarize themselves with each and every component of the Hadoop ecosystem to make sure that they have a deep understanding of what Hadoop is, so that they can form an effective approach to a given big data problem. With the help of Hadoop instructors, we have put together a detailed list of the latest Hadoop interview questions based on the different components of the Hadoop ecosystem such as MapReduce, Hive, HBase, Pig, YARN, Flume, Sqoop, HDFS, etc.
Hadoop Basic Interview Questions
What is Big Data?
Any data that cannot be stored in a traditional RDBMS is termed Big Data. As we know, most of the data that we use today has been generated in the past 20 years, and this data is mostly unstructured or semi-structured in nature. More than the volume of the data, it is the nature of the data that defines whether it is considered Big Data or not.
What do the four V's of Big Data denote?
IBM has a nice, simple explanation for the four critical features of big data: a) Volume – Scale of data b) Velocity – Analysis of streaming data c) Variety – Different forms of data d) Veracity – Uncertainty of data
Hadoop HDFS Interview Questions
What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default size of a block in HDFS is 64MB.
Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: NameNode is at the heart of the HDFS file system; it manages the metadata, i.e. the data of the files is not stored on the NameNode but rather it has the directory tree of all the files present in the HDFS file system on a hadoop cluster. NameNode uses two files for the namespace:
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of changes that have been made to the namespace since the checkpoint.
Checkpoint Node: Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as that of the NameNode's directory. The Checkpoint node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then again updated back to the active NameNode.
BackupNode: Backup Node also provides check pointing functionality like that of the checkpoint node, but it also maintains its up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
MapReduce Interview Questions
Explain the usage of Context Object.
Context Object is used to help the mapper interact with other Hadoop systems. Context Object can be used for updating counters, to report the progress and to provide any application-level status updates. ContextObject has the configuration details for the job and also interfaces that help it in generating the output.
What are the core methods of a Reducer?
The 3 core methods of a reducer are:
1) setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc. Function definition: public void setup(context)
2) reduce() – It is the heart of the reducer, called once per key with the associated reduce task. Function definition: public void reduce(Key, Value, context)
3) cleanup() – This method is called only once at the end of the reduce task for clearing all the temporary files. Function definition: public void cleanup(context)
Hadoop HBase Interview Questions
When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has:
1) A variable schema
2) Data stored in the form of collections
3) A need for key-based access to data while retrieving.
Key components of HBase are:
Region - This component contains the memory data store and HFile.
Region Server - This monitors the Region.
HBase Master - It is responsible for monitoring the region server.
Zookeeper - It takes care of the coordination between the HBase Master component and the client.
Catalog Tables - The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
Hadoop Sqoop Interview Questions
Explain about some important Sqoop commands other than import and export.
Create Job (--create): Here we are creating a job with the name myjob, which can import the table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database to an HDFS file.
$ sqoop job --create myjob \
  --import \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee --m 1
List Jobs (--list): The '--list' argument is used to list the saved jobs. The following command is used to list the saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show): The '--show' argument is used to inspect particular jobs and their details. The following command is used to inspect a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec): The '--exec' option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob
Hadoop Flume Interview Questions
Explain about the core components of Flume.
The core components of Flume are:
Event - The single log entry or unit of data that is transported.
Source - This is the component through which data enters Flume workflows.
Sink - It is responsible for transporting data to the desired destination.
Channel - It is the duct between the Sink and the Source.
Agent - Any JVM that runs Flume.
Client - The component that transmits the event to the source that operates with the agent.
Hadoop Zookeeper Interview Questions
Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without Zookeeper because if Zookeeper is down, Kafka cannot serve client requests.
Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
Pig Interview Questions
What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
Does Pig support multi-line commands?
Yes.
Hive Interview Questions
What is a Hive Metastore?
Hive Metastore is a central repository that stores metadata in an external database.
Are multiline comments supported in Hive?
No.
Hadoop YARN Interview Questions
What are the stable versions of Hadoop?
Release 2.7.1 (stable), Release 2.4.1, Release 1.2.1 (stable)
What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.
What is Big Data?
Any data that cannot be stored in a traditional RDBMS is termed Big Data. As we know, most of the data that we use today has been generated in the past 20 years, and this data is mostly unstructured or semi-structured in nature. More than the volume of the data, it is the nature of the data that defines whether it is considered Big Data or not. Here is an interesting and explanatory visual on "What is Big Data?"
What do the four V's of Big Data denote?
IBM has a nice, simple explanation for the four critical features of big data: a) Volume – Scale of data b) Velocity – Analysis of streaming data c) Variety – Different forms of data d) Veracity – Uncertainty of data. Here is an explanatory video on the four V's of Big Data.
How does big data analysis help businesses increase their revenue? Give an example.
Big data analysis is helping businesses differentiate themselves – for example Walmart, the world's largest retailer in 2014 in terms of revenue, is using big data analytics to increase its sales through better predictive analytics, providing customized recommendations and launching new products based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales for $1 billion in incremental revenue. There are many more companies like Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase, Bank of America, etc. using big data analytics to boost their revenue. Here is an interesting video that explains how various industries are leveraging big data analysis to increase their revenue.
Name some companies that use Hadoop.
Yahoo (one of the biggest users and more than 80% code contributor to Hadoop), Facebook, Netflix, Amazon, Adobe, eBay, Hulu, Spotify, Rubikloud, Twitter.
Differentiate between Structured and Unstructured data.
Data which can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, can be referred to as Structured Data. Data which can be stored only partially in traditional database systems, for example data in XML records, can be referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured data is referred to as unstructured data. Facebook updates, Tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.
On what concept does the Hadoop framework work?
The Hadoop framework works on the following two core components:
1) HDFS – Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks and it operates on the Master-Slave Architecture.
2) Hadoop MapReduce – This is a Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop jobs perform two separate tasks - a map job and a reduce job. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job is executed.
Here is a visual that clearly explains the HDFS and Hadoop MapReduce concepts.
What are the main components of a Hadoop Application?
Hadoop applications have a wide range of technologies that provide great advantage in solving complex business problems. Core components of a Hadoop application are:
Hadoop Common, HDFS, Hadoop MapReduce, YARN
Data Access Components are - Pig and Hive
Data Storage Component is - HBase
Data Integration Components are - Apache Flume, Sqoop, Chukwa
Data Management and Monitoring Components are - Ambari, Oozie and Zookeeper
Data Serialization Components are - Thrift and Avro
Data Intelligence Components are - Apache Mahout and Drill
What is Hadoop streaming?
The Hadoop distribution has a generic application programming interface for writing Map and Reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell scripts or executables as the Mapper or Reducer (a minimal sketch appears after the questions below).
What is the best hardware configuration to run Hadoop?
The best configuration for executing Hadoop jobs is dual core machines or dual processors with 4GB or 8GB RAM that use ECC memory. Hadoop highly benefits from using ECC memory, though it is not low-end. ECC memory is recommended for running Hadoop because most Hadoop users have experienced various checksum errors by using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.
What are the most commonly defined input formats in Hadoop?
The most common Input Formats defined in Hadoop are:
Text Input Format - This is the default input format defined in Hadoop.
Key Value Input Format - This input format is used for plain text files wherein the files are broken down into lines.
Sequence File Input Format - This input format is used for reading files in sequence.
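A minimal, hedged sketch (not from the original article) of the Hadoop Streaming idea mentioned above: a Python mapper and reducer for word count, run through the streaming jar. The jar location and HDFS paths are assumptions and vary by distribution.

# mapper.py - reads lines from stdin and emits tab-separated (word, 1) pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - sums the counts for each word (streaming sorts the keys before the reducer)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Run it with the streaming jar (the path below is an assumption):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hduser/input -output /user/hduser/output \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py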
We have further categorized Big Data Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 1,2,4,5,6,7,8,9
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 3,8,9,10
Hadoop HDFS Interview Questions and Answers
What is a block and block scanner in HDFS?
Block - the minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default size of a block in HDFS is 64MB.
Block Scanner - a block scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block scanners use a throttling mechanism to reserve disk bandwidth on the DataNode.
Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: the NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a hadoop cluster. The NameNode uses two files for the namespace:
fsimage file - it keeps track of the latest checkpoint of the namespace.
edits file - it is a log of the changes that have been made to the namespace since the last checkpoint.
Checkpoint Node: the Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. The Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
BackupNode: the Backup Node also provides checkpointing functionality like the Checkpoint Node, but it additionally maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware includes RAM because there are specific services that need to be executed in RAM. Hadoop can be run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.
What is the port number for NameNode, Task Tracker and Job Tracker?
NameNode 50070
Job Tracker 50030
Task Tracker 50060
Explain about the process of inter cluster data copying.
HDFS provides a distributed data copying facility through DistCP, which copies data from a source to a destination. When this copying is performed between two different hadoop clusters it is referred to as inter cluster data copying. DistCP requires both the source and destination to have a compatible or the same version of hadoop.
How can you overwrite the replication factors in HDFS?
The replication factor in HDFS can be modified or overwritten in 2 ways-
1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the below command-
$hadoop fs -setrep -w 2 /my/test_file (test_file is the filename whose replication factor will be set to 2)
2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the below command-
$hadoop fs -setrep -w 5 /my/test_dir (test_dir is the name of the directory; all the files in this directory will have a replication factor of 5)
Explain the difference between NAS and HDFS.
NAS runs on a single machine and thus there is no chance of data redundancy, whereas HDFS runs on a cluster of machines and there is data redundancy because of the replication protocol. NAS stores data on dedicated hardware whereas in HDFS all the data blocks are distributed across the local drives of the machines. In NAS, data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce as the computation is moved to the data.
Explain what happens if, during the PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value 3.
The replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times the blocks are replicated to ensure high data availability. For every block that is stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, then there will be a single copy of the data. Under these circumstances, if the DataNode holding that block crashes, the only copy of the data will be lost.
What is the process to change files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in a file or multiple writers; files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
Explain about the indexing process in HDFS.
The indexing process in HDFS depends on the block size. HDFS stores the last part of the data, which further points to the address where the next part of the data chunk is stored.
What is rack awareness and on what basis is data stored in a rack?
All the data nodes put together form a storage area, i.e. the physical location of the data nodes is referred to as a rack in HDFS. The rack information, i.e. the rack id of each data node, is acquired by the NameNode. The process of selecting closer data nodes based on the rack information is known as rack awareness. The contents of the file are divided into data blocks as soon as the client is ready to load the file into the hadoop cluster. After consulting the NameNode, the client allocates 3 data nodes for each data block. For each data block, there are 2 copies in one rack and the third copy in another rack. This is generally referred to as the Replica Placement Policy.
We have further categorized Hadoop HDFS Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3,7,9,10,11
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,2,4,5,6,7,8
Hadoop MapReduce Interview Questions and Answers
Explain the usage of the Context object.
The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress and providing any application-level status updates. The Context object also holds the configuration details for the job and the interfaces through which the mapper emits its output.
What are the core methods of a Reducer?
The 3 core methods of a reducer are -
1) setup() - this method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc. Function definition - public void setup(Context context)
2) reduce() - this is the heart of the reducer and is called once per key with the associated list of values. Function definition - public void reduce(Key key, Iterable<Value> values, Context context)
3) cleanup() - this method is called only once at the end of the reduce task, for example to clear temporary files. Function definition - public void cleanup(Context context)
Explain about the partitioning, shuffle and sort phases.
Shuffle phase - once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducers is referred to as shuffling.
Sort phase - Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.
Partitioning phase - the process that determines which intermediate keys and values will be received by each reducer instance is referred to as partitioning. The destination partition is the same for any given key irrespective of the mapper instance that generated it.
How to write a custom partitioner for a Hadoop MapReduce job?
Steps to write a custom partitioner for a Hadoop MapReduce job:
1) A new class must be created that extends the pre-defined Partitioner class.
2) The getPartition method of the Partitioner class must be overridden.
3) The custom partitioner can be added to the job as a config file in the wrapper that runs Hadoop MapReduce, or it can be added to the job by calling the set partitioner method on the job (a minimal sketch appears at the end of this section).
What is the relationship between a Job and a Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
Is it important for Hadoop MapReduce jobs to be written in Java?
It is not necessary to write Hadoop MapReduce jobs in Java; users can write MapReduce jobs in any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop Streaming API.
What is the process of changing the split size if there is limited storage space on commodity hardware?
If there is limited storage space on commodity hardware, the split size can be changed by implementing the "Custom Splitter". The call to the Custom Splitter can be made from the main method.
What are the primary phases of a Reducer?
The 3 primary phases of a reducer are - 1) Shuffle 2) Sort 3) Reduce
What is a TaskInstance?
The actual hadoop MapReduce tasks that run on each slave node are referred to as task instances. Every task instance has its own JVM process; by default, a new JVM process is spawned for every task instance.
Can reducers communicate with each other?
Reducers always run in isolation and they can never communicate with each other, as per the Hadoop MapReduce programming paradigm.
We have further categorized Hadoop MapReduce Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,5,6
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,3,4,7,8,9,10
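As referenced in the custom partitioner steps above, here is a minimal, hedged sketch; the class name FirstLetterPartitioner and its routing rule are illustrative only, not part of the question set.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 1: extend the pre-defined Partitioner class.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    // Step 2: override getPartition.
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route each key by its first character; the same key always
        // lands in the same partition, whichever mapper produced it.
        String k = key.toString();
        int firstChar = k.isEmpty() ? 0 : k.charAt(0);
        return (firstChar & Integer.MAX_VALUE) % numPartitions;
    }
}

// Step 3 (in the driver/wrapper): register the partitioner on the job, e.g.
// job.setPartitionerClass(FirstLetterPartitioner.class);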
Hadoop HBase Interview Questions and Answers
When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has -
1) A variable schema
2) Data stored in the form of collections
3) A need for key-based access to data while retrieving.
Key components of HBase are -
Region - this component contains the memory data store and the HFile.
Region Server - this monitors the regions.
HBase Master - it is responsible for monitoring the region servers.
ZooKeeper - it takes care of the coordination between the HBase Master component and the client.
Catalog Tables - the two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
What are the different operational commands in HBase at record level and table level?
Record-level operational commands in HBase are - put, get, increment, scan and delete.
Table-level operational commands in HBase are - describe, list, drop, disable and alter.
What is a Row Key?
Every row in an HBase table has a unique identifier known as the RowKey. It is used for grouping cells logically and it ensures that all cells that have the same RowKey are co-located on the same server. The RowKey is internally regarded as a byte array.
Explain the difference between the RDBMS data model and the HBase data model.
RDBMS is a schema-based database whereas HBase has a schema-less data model.
RDBMS does not have support for in-built partitioning whereas HBase has automated partitioning. RDBMS stores normalized data whereas HBase stores de-normalized data.
Explain about the different catalog tables in HBase.
The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
What are column families? What happens if you alter the block size of a ColumnFamily on an already populated database?
The logical division of data is represented through a key known as a column family. Column families form the basic unit of physical storage on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size whereas new data that comes in takes the new block size. When compaction takes place, the old data takes the new block size so that the existing data is read correctly.
Explain the difference between HBase and Hive.
HBase and Hive are completely different hadoop-based technologies - Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs, whereas HBase supports 4 primary operations - put, get, scan and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
Explain the process of row deletion in HBase.
On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells; rather, the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
What are the different types of tombstone markers in HBase for deletion?
There are 3 different types of tombstone markers in HBase for deletion-
1) Family Delete Marker - this marker marks all the columns for a column family.
2) Version Delete Marker - this marker marks a single version of a column.
3) Column Delete Marker - this marker marks all the versions of a column.
Explain about HLog and WAL in HBase.
All edits in the HStore are stored in the HLog. Every region server has one HLog. The HLog contains entries for edits of all regions performed by a particular region server. WAL stands for Write Ahead Log, the file to which all HLog edits are written immediately. WAL edits remain in memory until the flush period in the case of deferred log flush.
We have further categorized Hadoop HBase Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos-1,2,4,5,7
Hadoop Interview Questions and Answers for Experienced - Q.Nos-2,3,6,8,9,10
Hadoop Sqoop Interview Questions and Answers
Explain about some important Sqoop commands other than import and export.
Create Job (--create): here we are creating a job with the name myjob, which can import table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database to an HDFS file.
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list): the '--list' argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show): the '--show' argument is used to inspect or verify particular jobs and their details. The following command and sample output are used to verify a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec): the '--exec' option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob
How can Sqoop be used in a Java program?
The Sqoop jar should be included in the classpath of the Java code. After this, the Sqoop.runTool() method must be invoked. The necessary parameters should be passed to Sqoop programmatically, just as on the command line.
What is the process to perform an incremental data load in Sqoop?
The process of performing an incremental data load in Sqoop is to synchronize the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be loaded through the incremental load command in Sqoop. The incremental load can be performed by using the Sqoop import command or by loading the data into Hive without overwriting it. The different attributes that need to be specified during an incremental load in Sqoop are-
1) Mode (incremental) - the mode defines how Sqoop will determine what the new rows are. The mode can have the value Append or Last Modified.
2) Col (check-column) - this attribute specifies the column that should be examined to find out the rows to be imported.
3) Value (last-value) - this denotes the maximum value of the check column from the previous import operation.
Is it possible to do an incremental import using Sqoop?
Yes, Sqoop supports two types of incremental imports-
1) Append
2) Last Modified
To import only new rows, Append should be used in the import command; to import new rows and also pick up updated rows, Last Modified should be used in the import command.
What is the standard location or path for Hadoop Sqoop scripts?
/usr/bin/Hadoop Sqoop
How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as follows-
sqoop list-tables --connect jdbc:mysql://localhost/;
How are large objects handled in Sqoop?
Sqoop provides the capability to store large-sized data in a single field based on the type of data. Sqoop supports the ability to store-
1) CLOBs - Character Large Objects
2) BLOBs - Binary Large Objects
Large objects in Sqoop are handled by importing the large objects into a file referred to as a "LobFile", i.e. a Large Object File. The LobFile has the ability to store records of huge size; each record in the LobFile is a large object.
Can free-form SQL queries be used with the Sqoop import command? If yes, then how can they be used?
Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.
Differentiate between Sqoop and DistCP.
The DistCP utility can be used to transfer data between clusters, whereas Sqoop can be used to transfer data only between Hadoop and an RDBMS.
What are the limitations of importing RDBMS tables into HCatalog directly?
There is an option to import RDBMS tables into HCatalog directly by making use of the --hcatalog-database option with the --hcatalog-table option, but the limitation is that several arguments like --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.
We have further categorized Hadoop Sqoop Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 4,5,6,9
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,2,3,6,7,8,10
Hadoop Flume Interview Questions and Answers
Explain about the core components of Flume.
The core components of Flume are -
Event - the single log entry or unit of data that is transported.
Source - this is the component through which data enters Flume workflows.
Sink - it is responsible for transporting data to the desired destination.
Channel - it is the duct between the Sink and the Source.
Agent - any JVM that runs Flume.
Client - the component that transmits events to the source that operates with the agent.
Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end-to-end reliability because of its transactional approach to data flow.
How can Flume be used with HBase?
Apache Flume can be used with HBase using one of the two HBase sinks -
HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the novel HBase IPC that was introduced in HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than the HBase sink as it can easily make non-blocking calls to HBase.
Working of the HBaseSink - in HBaseSink, a Flume event is converted into HBase increments or puts. The serializer implements the HBaseEventSerializer, which is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume event into HBase increments and puts to be sent to the HBase cluster.
Working of the AsyncHBaseSink - AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. The sink invokes the setEvent method and then makes calls to the getIncrements and getActions methods, similar to the HBase sink.
When the sink stops, the cleanUp method is called by the serializer.
Explain about the different channel types in Flume. Which channel type is faster?
The 3 different built-in channel types available in Flume are-
MEMORY Channel - events are read from the source into memory and passed to the sink.
JDBC Channel - the JDBC channel stores the events in an embedded Derby database.
FILE Channel - the file channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
The MEMORY channel is the fastest channel among the three; however, it carries the risk of data loss. The channel that you choose completely depends on the nature of the big data application and the value of each event.
Which is the reliable channel in Flume to ensure that there is no data loss?
The FILE channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.
Explain about the replication and multiplexing selectors in Flume.
Channel selectors are used to handle multiple channels. Based on the Flume header value, an event can be written to a single channel or to multiple channels. If a channel selector is not specified for the source, then by default it is the replicating selector. Using the replicating selector, the same event is written to all the channels in the source's channel list. The multiplexing channel selector is used when the application has to send different events to different channels.
How can a multi-hop agent be set up in Flume?
The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume.
Does Apache Flume provide support for third-party plug-ins?
Yes, Apache Flume has a plug-in based architecture, so it can load data from external sources and transfer it to external destinations; most data analysts make use of this.
Is it possible to leverage real-time analysis on the big data collected by Flume directly? If yes, then explain how.
Data from Flume can be extracted, transformed and loaded in real time into Apache Solr servers using MorphlineSolrSink.
Differentiate between FileSink and FileRollSink.
The major difference between HDFS FileSink and FileRollSink is that the HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS) whereas the File Roll Sink stores the events in the local file system.
Hadoop Flume Interview Questions and Answers for Freshers - Q.Nos- 1,2,4,5,6,10
Hadoop Flume Interview Questions and Answers for Experienced- Q.Nos- 3,7,8,9
Hadoop Zookeeper Interview Questions and Answers
Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without ZooKeeper because if ZooKeeper is down, Kafka cannot serve client requests.
Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
What is the role of Zookeeper in HBase architecture?
In HBase architecture, ZooKeeper is the monitoring server that provides different services like tracking server failures and network partitions, maintaining the configuration information, establishing communication between the clients and region servers, and providing ephemeral nodes to identify the available servers in the cluster.
Explain about ZooKeeper in Kafka.
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. ZooKeeper is used by Kafka to store various configurations and use them across the hadoop cluster in a distributed manner. To achieve distributed-ness, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot connect directly to Kafka by bypassing ZooKeeper because if ZooKeeper is down it will not be able to serve the client requests.
Explain how Zookeeper works.
ZooKeeper is referred to as the King of Coordination, and distributed applications use ZooKeeper to store and facilitate important configuration information updates. ZooKeeper works by coordinating the processes of distributed applications. ZooKeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is known as an ensemble and persisted data is distributed between multiple nodes. 3 or more independent servers collectively form a ZooKeeper cluster and elect a master. A client connects to any one of the specific servers and migrates if that node fails. The ensemble of ZooKeeper nodes is alive as long as the majority of nodes are working. The master node in ZooKeeper is dynamically selected by consensus within the ensemble, so if the master node fails then the role of the master migrates to another node, which is selected dynamically. Writes are linear and reads are concurrent in ZooKeeper.
List some examples of Zookeeper use cases.
Found by Elastic uses ZooKeeper comprehensively for resource allocation, leader election, high-priority notifications and discovery. The entire service of Found is built up of various systems that read from and write to ZooKeeper.
Apache Kafka, which depends on ZooKeeper, is used by LinkedIn.
Storm, which relies on ZooKeeper, is used by popular companies like Groupon and Twitter.
How to use the Apache Zookeeper command line interface?
ZooKeeper has a command line client for interactive use. The command line interface of ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a hierarchy of Znodes, where each znode can contain data just like a file. Each znode can also have children, just like directories in the UNIX file system. The zookeeper-client command is used to launch the command line client. If the initial prompt is hidden by the log messages after entering the command, users can just hit ENTER to view the prompt.
What are the different types of Znodes?
There are 2 types of Znodes, namely Ephemeral and Sequential Znodes. Znodes that get destroyed as soon as the client that created them disconnects are referred to as Ephemeral Znodes. A Sequential Znode is one in which a sequential number is chosen by the ZooKeeper ensemble and appended to the name the client assigns to the znode.
What are watches?
Client disconnection can be a troublesome problem, especially when we need to keep track of the state of Znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can be set on a Znode to trigger an event whenever it is removed or altered, or whenever new children are created below it.
What problems can be addressed by using Zookeeper?
In the development of distributed systems, creating your own protocols for coordinating the hadoop cluster results in failure and frustration for the developers. The architecture of a distributed system can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in making the hadoop cluster fast, reliable and scalable. To address all such problems, Apache ZooKeeper can be used as a coordination service to write correct distributed applications without having to reinvent the wheel from the beginning.
Hadoop ZooKeeper Interview Questions and Answers for Freshers - Q.Nos- 1,2,8,9
Hadoop ZooKeeper Interview Questions and Answers for Experienced- Q.Nos-3,4,5,6,7,10
Hadoop Pig Interview Questions and Answers
What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
Does Pig support multi-line commands?
Yes.
What are the different modes of execution in Apache Pig?
Apache Pig runs in 2 modes - one is the "Pig (Local Mode) Command Mode" and the other is the "Hadoop MapReduce (Java) Command Mode". Local Mode requires access to only a single machine, where all the files are installed and executed on the local host, whereas MapReduce mode requires access to the Hadoop cluster.
Explain the need for MapReduce while programming in Apache Pig.
Apache Pig programs are written in a query language known as Pig Latin, which is similar to the SQL query language. To execute a query, there is a need for an execution engine. The Pig engine converts the queries into MapReduce jobs, and thus MapReduce acts as the execution engine needed to run the programs.
Explain about co-group in Pig.
The COGROUP operator in Pig is used to work with multiple tuples. The COGROUP operator is applied to statements that contain or involve two or more relations. The COGROUP operator can be applied to up to 127 relations at a time. When using the COGROUP operator on two tables at once, Pig first groups both tables and after that joins the two tables on the grouped columns.
Explain about the BloomMapFile.
BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a quick membership test for keys using dynamic Bloom filters.
Differentiate between Hadoop MapReduce and Pig.
Pig provides a higher level of abstraction whereas MapReduce provides a low level of abstraction. MapReduce requires developers to write more lines of code when compared to Apache Pig. The Pig coding approach is comparatively slower than a fully tuned MapReduce coding approach.
Read more in detail - http://www.dezyre.com/article/-mapreduce-vs-pig-vs-hive/163
What is the usage of the foreach operation in Pig scripts?
The FOREACH operation in Apache Pig is used to apply a transformation to each element in the data bag, so that the respective action is performed to generate new data items.
Syntax - FOREACH data_bagname GENERATE exp1, exp2
Explain about the different complex data types in Pig.
Apache Pig supports 3 complex data types-
Maps - these are key-value stores joined together using #.
Tuples - similar to a row in a table, where different items are separated by a comma. Tuples can have multiple attributes.
Bags - unordered collections of tuples. A bag allows multiple duplicate tuples.
What does Flatten do in Pig?
Sometimes there is data in a tuple or a bag, and if we want to remove the level of nesting from that data, then the Flatten modifier in Pig can be used. Flatten un-nests bags and tuples. For tuples, the Flatten operator will substitute the fields of a tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
We have further categorized Hadoop Pig Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos-1,2,4,7,9
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 3,5,6,8,10
Hadoop Hive Interview Questions and Answers
What is a Hive Metastore?
The Hive Metastore is a central repository that stores metadata in an external database.
Are multi-line comments supported in Hive?
No.
What is ObjectInspector functionality?
ObjectInspector is used to analyze the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in multiple formats.
Hadoop Hive Interview Questions and Answers for Freshers- Q.Nos-1,2,3
Hadoop YARN Interview Questions and Answers
1) What are the stable versions of Hadoop?
Release 2.7.1 (stable)
Release 2.4.1
Release 1.2.1 (stable)
2) What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.
Is YARN a replacement of Hadoop MapReduce?
YARN is not a replacement of Hadoop MapReduce but is a more powerful and efficient technology that supports MapReduce and is also referred to as Hadoop 2.0 or MapReduce 2.
We have further categorized Hadoop YARN Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1
Hadoop Interview Questions – Answers Needed
Interview Questions on Hadoop Hive
1) Explain about the different types of joins in Hive.
2) How can you configure remote metastore mode in Hive?
3) Explain about the SMB join in Hive.
4) Is it possible to change the default location of managed tables in Hive, and if so, how?
5) How does data transfer happen from Hive to HDFS?
6) How can you connect an application, if you run Hive as a server?
7) What does the overwrite keyword denote in the Hive load statement?
8) What is SerDe in Hive? How can you write your own custom SerDe?
9) In the case of embedded Hive, can the same metastore be used by multiple users?
Hadoop YARN Interview Questions
1) What are the additional benefits YARN brings to Hadoop?
2) How can native libraries be included in YARN jobs?
3) Explain the differences between Hadoop 1.x and Hadoop 2.x, or
4) Explain the difference between MapReduce1 and MapReduce 2/YARN.
5) What are the modules that constitute the Apache Hadoop 2.0 framework?
6) What are the core changes in Hadoop 2.0?
7) How is the distance between two nodes defined in Hadoop?
8) Differentiate between NFS, Hadoop NameNode and JournalNode.
We hope that these Hadoop Interview Questions and Answers have pre-charged you for your next Hadoop interview. Get the ball rolling and answer the unanswered questions in the comments below. Please do! It's all part of our shared mission to ease Hadoop interviews for all prospective Hadoopers. We invite you to get involved.
What is Hadoop MapReduce?
The Hadoop MapReduce framework is used for processing large data sets in parallel across a hadoop cluster. Data analysis uses a two-step map and reduce process.
How does Hadoop MapReduce work?
In MapReduce, taking a word count as an example, the map phase counts the words in each document, while the reduce phase aggregates the data per document across the entire collection. During the map phase, the input data is divided into splits for analysis by map tasks running in parallel across the Hadoop framework.
Explain what shuffling is in MapReduce.
The process by which the system performs the sort and transfers the map outputs to the reducer as inputs is known as the shuffle.
Explain what Distributed Cache is in the MapReduce framework.
Distributed Cache is an important feature provided by the MapReduce framework. When you want to share files across all nodes in a Hadoop cluster, DistributedCache is used. The files could be executable jar files or simple properties files (see the sketch below).
Explain what NameNode is in Hadoop.
The NameNode in Hadoop is the node where Hadoop stores all the file location information for HDFS (Hadoop Distributed File System). In other words, the NameNode is the centrepiece of an HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the cluster or multiple machines.
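As a hedged sketch of the Distributed Cache usage described above (the HDFS path and job name are illustrative), a file can be registered on a job with the newer org.apache.hadoop.mapreduce API like this:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");
        // The framework copies this HDFS file to every slave node before tasks start.
        job.addCacheFile(new URI("/apps/lookup/config.properties"));
        // Inside a Mapper or Reducer, context.getCacheFiles() returns the cached URIs.
    }
}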
Explain what JobTracker is in Hadoop. What are the actions followed by Hadoop?
In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process. Hadoop performs the following actions:
- The client application submits jobs to the JobTracker.
- The JobTracker communicates with the NameNode to determine the data location.
- The JobTracker locates TaskTracker nodes near the data or with available slots.
- It submits the work to the chosen TaskTracker nodes.
- When a task fails, the JobTracker is notified and decides what to do next.
- The TaskTracker nodes are monitored by the JobTracker.
Explain what heartbeat is in HDFS.
A heartbeat is a signal used between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the signal, it is assumed that there is some issue with the DataNode or TaskTracker.
Explain what a combiner is and when you should use a combiner in a MapReduce job.
Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across to the reducers. If the operation performed is commutative and associative, you can use your reducer code as a combiner. The execution of the combiner is not guaranteed in Hadoop.
Explain what happens when a DataNode fails.
When a DataNode fails:
- The JobTracker and NameNode detect the failure.
- All tasks on the failed node are re-scheduled.
- The NameNode replicates the user's data to another node.
Explain what Speculative Execution is.
During Speculative Execution in Hadoop, a certain number of duplicate tasks are launched. Using Speculative Execution, multiple copies of the same map or reduce task can be executed on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate of that task on another node. The copy of the task that finishes first is retained, and the copies that do not finish first are killed.
Explain what the basic parameters of a Mapper are.
The basic parameters of a Mapper are LongWritable and Text (the input key and value) and Text and IntWritable (the output key and value).
Explain what the function of the MapReduce partitioner is.
The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps even distribution of the map output over the reducers.
Explain what the difference is between an Input Split and an HDFS Block.
The logical division of data is known as a split, while the physical division of data is known as an HDFS block.
Explain what happens in TextInputFormat.
In TextInputFormat, each line in the text file is a record. The value is the content of the line while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.
Mention what the main configuration parameters are that need to be specified to run a MapReduce job.
The user of the MapReduce framework needs to specify (see the driver sketch below):
- The job's input location in the distributed file system
- The job's output location in the distributed file system
- The input format
- The output format
- The class containing the map function
- The class containing the reduce function
- The JAR file containing the mapper, reducer and driver classes
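Here is the driver sketch referenced above, wiring up these configuration parameters. It assumes mapper and reducer classes like the WordCountMapper and WordCountReducer sketched earlier, and the input and output paths are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);            // JAR containing mapper, reducer and driver

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // job input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // job output location in HDFS

        job.setInputFormatClass(TextInputFormat.class);      // input format
        job.setOutputFormatClass(TextOutputFormat.class);    // output format

        job.setMapperClass(WordCountMapper.class);           // class containing the map function
        job.setReducerClass(WordCountReducer.class);         // class containing the reduce function

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}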
Explain what WebDAV is in Hadoop.
WebDAV is a set of extensions to HTTP for editing and updating files. On most operating systems WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
Explain what Sqoop is in Hadoop.
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be transferred from an RDBMS like MySQL or Oracle into HDFS, as well as exported from HDFS files to an RDBMS.
Explain how the JobTracker schedules a task.
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to make sure that the JobTracker is active and functioning. The message also informs the JobTracker about the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
Explain what SequenceFileInputFormat is.
SequenceFileInputFormat is used for reading files in sequence. It is a specific compressed binary file format which is optimized for passing data from the output of one MapReduce job to the input of some other MapReduce job.
Explain what conf.setMapperClass does.
conf.setMapperClass sets the mapper class and all the things related to the map job, such as reading the data and generating a key-value pair out of the mapper.
With the help of our top Hadoop instructors we've put together a comprehensive list of questions to help you get through your first Hadoop interview. We've made sure that the most probable questions asked during interviews are covered in this list. If you want to learn more, check out our new courses in Hadoop!
Q1. Name the most common Input Formats defined in Hadoop? Which one is default?
The most common Input Formats defined in Hadoop are:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the Hadoop default.
Q2. What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: It reads lines of text files and provides the offset of the line as the key to the Mapper and the actual line as the value to the mapper.
KeyValueInputFormat: Reads text files and parses lines into key, value pairs. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value to the mapper.
Q3. What is InputSplit in Hadoop?
When a Hadoop job is run, it splits input files into chunks and assigns each split to a mapper to process. This is called an InputSplit.
Q4. How is the splitting of files invoked in the Hadoop framework?
It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (like FileInputFormat) defined by the user.
Q5. Consider case scenario: In M/R system,
- HDFS block size is 64 MB
- Input format is FileInputFormat
- We have 3 files of size 64K, 65Mb and 127Mb
How many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows:
- 1 split for the 64K file
- 2 splits for the 65MB file
- 2 splits for the 127MB file
Q6. What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the Input Format.
Q7. After the Map phase finishes, the Hadoop framework does "Partitioning, Shuffle and sort". Explain what happens in this phase?
Partitioning: It is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.
Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
Q8. If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.
Q9. What is a Combiner?
The Combiner is a 'mini-reduce' process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
Q10. What is JobTracker?
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
Q11. What are some typical functions of Job Tracker?
The following are some typical tasks of JobTracker:
- Accepts jobs from clients.
- It talks to the NameNode to determine the location of the data.
- It locates TaskTracker nodes with available slots at or near the data.
- It submits the work to the chosen TaskTracker nodes and monitors the progress of each task by receiving heartbeat signals from the TaskTracker.
Q12. What is TaskTracker?
TaskTracker is a node in the cluster that accepts tasks - like Map, Reduce and Shuffle operations - from a JobTracker.
Q13. What is the relationship between Jobs and Tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.
Q14. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other TaskTracker, and only if the task fails more than four times (the default setting, which can be changed) will it kill the job.
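As a hedged follow-up to Q14: the four-attempt limit is configurable. Assuming the Hadoop 2.x property names mapreduce.map.maxattempts and mapreduce.reduce.maxattempts, a driver could raise it like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow each map or reduce task to be retried up to 6 times
        // before the whole job is failed (the default is 4).
        conf.setInt("mapreduce.map.maxattempts", 6);
        conf.setInt("mapreduce.reduce.maxattempts", 6);
        Job job = Job.getInstance(conf, "retry-config-demo");
    }
}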
Q15. Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow down the program. What mechanism does Hadoop provide to combat this?
Speculative Execution.
Q16. How does speculative execution work in Hadoop?
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.
Q17. Using the command line in Linux, how will you
- See all jobs running in the Hadoop cluster
- Kill a job?
hadoop job -list
hadoop job -kill jobID
Q18. What is Hadoop Streaming?
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
Q19. What is the characteristic of the streaming API that makes it flexible to run MapReduce jobs in languages like Perl, Ruby, Awk etc.?
Hadoop Streaming allows the use of arbitrary programs for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
Q20. What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
Q21. What is the benefit of Distributed Cache? Why can't we just have the file in HDFS and have the application read it?
This is because the Distributed Cache is much faster. It copies the file to all trackers at the start of the job. Now if the TaskTracker runs 10 or 100 Mappers or Reducers, it will use the same copy from the Distributed Cache. On the other hand, if you put code in the MR job to read the file from HDFS directly, then
every Mapper will try to access it from HDFS; hence if a TaskTracker runs 100 map tasks, it will try to read this file 100 times from HDFS. Also, HDFS is not very efficient when used like this.
Q22. What mechanism does the Hadoop framework provide to synchronise changes made in the Distributed Cache during runtime of the application?
This is a tricky question. There is no such mechanism. The Distributed Cache by design is read-only during the time of job execution.
Q23. Have you ever used Counters in Hadoop? Give us an example scenario.
Anybody who claims to have worked on a Hadoop project is expected to have used counters (a minimal counter sketch appears at the end of this list).
Q24. Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to the Hadoop job?
Yes, the input format class provides methods to add multiple directories as input to a Hadoop job.
Q25. Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.
Q26. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it
- Overwrite it
- Warn you and continue
- Throw an exception and exit
The Hadoop job will throw an exception and exit.
Q27. How can you set an arbitrary number of mappers to be created for a job in Hadoop?
You cannot set it.
Q28. How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
You can either do it programmatically by using the setNumReduceTasks method on the JobConf class or set it up as a configuration setting.
Q29. How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner you will have to do at minimum the following three things:
- Create a new class that extends the Partitioner class.
- Override the getPartition method.
- In the wrapper that runs MapReduce, either add the custom partitioner to the job programmatically using the setPartitionerClass method, or add the custom partitioner to the job as a config file (if your wrapper reads from a config file or Oozie).
Q30. How did you debug your Hadoop code?
There can be several ways of doing this, but the most common ways are:
- By using counters.
- The web interface provided by the Hadoop framework.
Q31. Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed due to any reason?
It is an open-ended question, but most candidates, if they have written a production job, should talk about some type of alert mechanism, like an email being sent or their monitoring system sending an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, since unexpected data can very easily break the job.
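To close, here is a minimal, hedged sketch of the counter-based debugging mentioned in Q23 and Q30; the counter group and name, and the mapper class itself, are illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            // Counters are aggregated by the framework and shown in the job's web UI,
            // which makes them a cheap way to spot unexpected or malformed records.
            context.getCounter("DataQuality", "EMPTY_LINES").increment(1);
            return;
        }
        context.write(value, new LongWritable(1));
    }
}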