1. Introduction
In recent years, technology has had a major impact on applications and on data processing, and organizations have begun to give more importance to data and to invest more in its collection and management. Big Data has also opened a new era and brought new technologies that allow the analysis of data types such as text and voice, which exist in huge volumes on the Internet and in other digital structures. The evolution of data is spectacular: in the past, data volumes were measured in bytes, whereas today companies work with volumes at the level of petabytes. Experts from the National Climatic Data Center in Asheville estimated that storing all the data that exists in the world would require at least 1,200 exabytes, although it is impossible to pin down a precise figure. These sizes may not mean much to people who do not work directly with big data, but the volume of data is enormous and it is difficult to grasp what such numbers mean.
V. Mayer-Schönberger and K. Cukier noted in [24] that “there is no good way to think about what this size of data means”, which underlines that big data is in continuous evolution and that its future will be glorious. This paper presents an analysis of big data ingestion and a study that evidences the functionality and performance of the most widely used tools. We first introduce some concepts about data ingestion and the importance of choosing the right approach to process big data, then give a short description of the tools included in the analysis, offering some information about the Hadoop ecosystem.
We then review three existing Apache ingestion tools, NiFi, Flume and Kafka, in the context of big data processing, examining the differences between them and the strengths of each. We aim to provide systematic information about the main functionality of these three Apache tools used in the data ingestion process and to show in detail how they can be combined to improve results for particular requirements, comparing them in our research from the perspectives of performance, functionality and complexity. Our analysis shows that each of the three tools has its own strengths, but no single tool addresses all of a customer's requirements; combining tools is the answer to that problem. We examine and recommend possible combinations based on customers' needs.
The paper analyses the main characteristics of data ingestion tools. It provides key information about the typical issues of data ingestion and about the reasons why we
chose the three Apache ingestion tools instead of others. As a preliminary step, it is important to identify the common characteristics of the tools; after that, the analysis results are presented. We decided to examine Apache tools because Apache is very well known among developers and it is the most widely used web server software, running on 67% of web servers worldwide. According to [3], “The name 'Apache' was chosen from respect for the Native American Indian tribe of Apache, well-known for their superior skills in warfare strategy and their inexhaustible endurance.”
The paper has the following structure. Section 2 introduces the concept of data ingestion in big data and its necessity, including a short description of the Hadoop ecosystem and a short paragraph with information about each tool. Section 3 contains our research on the tools, analysing the main characteristics of NiFi, Flume and Kafka, offering solid arguments for why this paper uses them instead of other tools, and an analysis of their functionalities and performance. Section 4 presents the results of our research on the functionalities and performance of the tools, with a detailed explanation of each result. Section 5, the conclusion, summarizes the real importance of the information found in this paper, its scope, and our final results.
2. Data ingestion and Hadoop ecosystem
2.1 Data ingestion concept
According to [9], “in typical ingestion scenarios, you have multiple data sources to process. As the number of data sources increases, the processing starts to become complicated”. For a long time, data storage did not need additional tools to process the volume of data because the quantity was insignificant, but in recent years, with the emergence of the big data concept, this began to be a problem. As mentioned in the introduction, this paper analyses a process for obtaining and importing data for storage in a database or for immediate use, called “data ingestion”. According to [27], the term “ingestion” means the consumption of a substance by an organism; in our paper, the substance is represented by the data and the organism can be, for example, a database where the data are stored.
The data ingestion layer represents the initial step for data coming from different sources, the step where they are categorized and prioritized, and it is arguably the toughest job in a big data process. N. S. Gill [26] mentioned that “Big Data Ingestion involves connecting to various data sources, extracting the data, and detecting the changed data”. For unfamiliar readers, data ingestion can be explained as moving data (structured or unstructured) from its origin into a system where it can easily be analyzed and stored. The next paragraphs of this paper provide important information about data ingestion from different perspectives: we show the necessity of data ingestion, note the challenges encountered in data ingestion, and offer information about its parameters and key principles.
To complete the data ingestion process, it is necessary to use a tool capable of handling the following key principles: network bandwidth, unreliable networks, choosing the right data format, and streaming data.
2.1.1 The necessity of data ingestion
In many big data situations, the structure of the data source is not known, and if companies use common data ingestion methods it is difficult to manipulate the data. For companies, data ingestion represents an important strategy, helping them to retain customers and increase profitability.
The main advantages that demonstrate the
necessity of data ingestion are the following:
- Increased productivity. It takes a lot of time for companies to analyze and move data from different sources, but with data ingestion the process is easier and the time saved can be used for something else.
- Ingestion of data in batches or in real time. In batch mode, data are stored at periodic intervals of time.
- Data are automatically organized and structured, even if they come in different big data formats or protocols.
2.1.2 Data ingestion challenges
The variety and volume of data sources are continuously expanding. Extracting data from these sources can be extremely challenging for users, considering the required time and resources. The main issues in data ingestion are the following:
- Different formats in the data sources
- Applications and data sources evolving rapidly
- Data capture and detection being time-consuming
- Validation of the ingested data
- Data compression and transformation before ingestion
2.1.3 Data ingestion parameters
The main ingestion parameters used in the
comparison are the following:
- Data velocity: the speed at which data from different sources, such as human interaction, social media and networks, can be processed.
- Data size: because data ingestion works with huge volumes of data generated from multiple sources, the size of the data directly affects processing time.
- Data frequency: this parameter defines how data are processed: in real time or in batches.
- Data format: every company chooses a different format for its data, and the data ingestion process needs to adapt to every situation.
2.2 Hadoop ecosystem
The Apache Hadoop ecosystem is an essential supporting structure for processing and storing large amounts of data (Big Data). The Apache Hadoop ecosystem grows continuously and consists of multiple projects and tools with valuable features and benefits that provide capabilities for loading, transferring, streaming, indexing, messaging, querying and many others. Hadoop contains two main elements: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a file system designed for storing data and providing streaming, parallel access to large amounts of data (up to hundreds of terabytes), with storage distributed over a cluster of nodes. MapReduce is a processing model for large datasets. As the name suggests, it is composed of two steps. The initial step, Map, establishes a process for each single key of the records to be processed (key-value type). The final step, Reduce, aggregates the results, processing the data output from the map phase according to the required operator or function and producing a set of key-value pairs for each single key.
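To illustrate the two steps, the sketch below shows the classic word-count job written against the standard Hadoop MapReduce Java API; the class names, input/output paths and job settings are illustrative assumptions, not something taken from the paper.

// Minimal word-count sketch: the Map step emits (word, 1) pairs,
// the Reduce step sums the counts per key.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit a (word, 1) pair for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: sum the counts emitted for each key.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}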
Swizec Teller notes in [22] that these two projects can be configured in combination with other projects into a Hadoop cluster. A cluster can have hundreds or thousands of nodes, which can be difficult to configure manually, so a Hadoop cluster needs tools that configure systems and data easily and effectively. HDFS and MapReduce clients might be executed on separate servers, named Hadoop clients, and security is the main reason for physically separating Hadoop nodes from Hadoop clients. If we decide to install clients on the same servers as Hadoop, we have to provide a high level of access security for every user; logically and physically separating them simplifies the needed configuration steps. There are many sub-projects (managed mostly by Apache and contributed by independent organizations) designed for maintenance and monitoring that integrate very well with Hadoop and let us concentrate on developing data ingestion rather than monitoring it. Three of the commonly used tools for data ingestion in Hadoop are described rigorously in the following sections.
2.3 Kafka
Apache Kafka is a distributed, high-throughput messaging system, a publish-subscribe environment that provides high availability. With a single broker handling hundreds of MB of reads and writes per second from several clients, Kafka is a very
fast system. Messages are replicated across the cluster and stored on disk. According to [7], “Kafka can be used for stream processing, web site activity tracking, metrics collection and monitoring, and log aggregation”. “It is a paradox that a key feature of Kafka is its small number of features. It has far fewer features and is much less configurable than Flume”, noted Ellen Friedman and Ted Dunning in [8]. Flume is described in the next section of this paper. They also observed that “Kafka is similar to Flume in that it streams messages, but Kafka is designed for a different purpose. While Flume is designed to stream messages to a sink such as HDFS or HBase, Kafka is designed for messages to be consumed by several applications”. The principal elements of the Kafka architecture are the Producer, the Broker, the Consumer and the Topic. Topics are feeds of messages: producers send messages to topics, and consumers subscribe to those topics and consume the messages from them. Topics are partitioned, a key is attached to each message, and each partition behaves like a log.
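To illustrate this producer/topic/consumer model, the following is a minimal sketch of a producer written against the standard Kafka Java client; the broker address, topic name, key and value are illustrative assumptions.

// Minimal Kafka producer sketch: publishes one keyed record to a topic.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // broker(s) to connect to (illustrative)
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // The key determines the partition the record is appended to.
      producer.send(new ProducerRecord<>("ingestion-events", "sensor-42", "temperature=21.3"));
    }
  }
}

A consumer application would subscribe to the same topic with the KafkaConsumer class and poll for records, which is what allows several applications to consume the same stream independently.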
2.4 Flume
“Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data” is the definition found on the official Apache Flume website. This paper analyses the newest versions of the tools; the latest stable version of Flume is 1.8.0, the eleventh release of the Apache project, which offers users a stable product, compatible with older versions of Flume (the 1.x code line) and ready for production. The tool is made to ingest and collect huge volumes of data from multiple sources into the Hadoop Distributed File System (HDFS); the most commonly processed types of data are sensor and machine data, social media data, maps, astronomy, aerospace, application logs and geo-location data. One specific example of using Flume is the logging of manufacturing operations: a log is generated on every run of the product, when it comes off the line, producing a file with information about the respective run. In a day the product runs thousands of times and generates a large volume of data stored in log files; using Flume, the data can be streamed into an analysis tool, as shown in the image below, and then stored in HDFS.
In general, Flume allows users to ingest data from multiple sources and of different sizes into Hadoop for later analysis, scales horizontally to ingest data, guarantees that data are delivered through transactions between agents, insulates the system when the incoming data rate exceeds the rate at which it can be written, and is more tightly integrated with the Hadoop ecosystem than Kafka or NiFi.
J. Kim and B. Bengfort note in [11] that data flows through Flume like a pathway that carries it from origin to destination. Data, or events, are moved from source to destination through a sequence of hops; this is handled by the Flume agent (a JVM process), which consists of three important components: the source, the channel and the sink, as we can see in the image below.
The source is the part of the agent where data are received from data generators and transferred to the channels as Flume events. The channel can work with different sources and sinks and represents a bridge between them: it receives the data from the source and buffers them until they are consumed by the sinks. The sink is the final component of the Flume agent, where the data from the channels are consumed and sent to the destination (in this step the data are stored in HDFS). The biggest disadvantage of using Flume is that data can be lost very easily; for example, if the user chooses the memory channel for high throughput and the agent node goes down, the data will be lost. A minimal agent configuration illustrating these three components is sketched below.
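As a minimal sketch, assuming a syslog TCP source, a memory channel and an HDFS sink (the agent and component names are our own), a Flume agent wiring the three components together could be configured as follows, using standard properties from the Flume user guide:

# Illustrative Flume agent configuration (agent1 is an assumed agent name)
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: listen for syslog-style events on a TCP port
agent1.sources.src1.type = syslogtcp
agent1.sources.src1.host = 0.0.0.0
agent1.sources.src1.port = 5140
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer (fast, but events are lost if the agent dies)
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: write the buffered events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1

Choosing a file channel instead of the memory channel trades some throughput for durability, which mitigates the data-loss risk mentioned above.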
2.5 NiFi
Some systems generate data and other systems consume it; Apache NiFi was developed to automate this flow. In [16], Apache NiFi is defined as “a data flow
management system that comes with a web User Interface that helps to build data flows in real time; it supports flow-based programming, and the graph programming includes processors and connectors, instead of nodes and edges”. The user connects processors together with connectors, thereby defining how the data are to be manipulated. A strong feature is NiFi's capability of ingesting any kind of data, using ingestion methodologies suited to that particular data; the output is equally customizable. T. John and P. Misra observed in [23] that “The Apache NiFi website states Apache NiFi as - An easy to use, powerful, and reliable system to process and distribute data”. We consider it a good alternative to Apache Flume, having a vast set of features and an easy-to-use web user interface. It is easy to set up and highly customizable.
3. The research methods
This section presents our analysis of the main characteristics and functionalities of Flume, NiFi and Kafka, in order to justify their choice as the subject of this paper. Our analysis is based on comparing them from different perspectives such as performance, functionality and necessity. First, when we searched for information about data ingestion tools, we found more than 30 different tools used by companies in this process and had to choose the most suitable of them for our analysis. A company's choice of data ingestion tool depends on multiple factors such as target, transformations (simple or complex), data source, performance and necessity, so we used the same criteria in our research.
Next, we searched for articles and opinions containing key words such as “data ingestion tool” and “first option for a data ingestion tool”, and obtained the list of the most widely used tools for data ingestion. The final decision was based on [18], where, in a top of 18 data ingestion tools, Flume is in second position, followed by Apache Kafka and Apache NiFi, with Amazon Kinesis in first place. Based on this ranking, we decided that our paper would analyze three tools which represent a main choice for companies and users. We preferred to use the Apache tools mentioned in this paper because one of their advantages is the possibility of combining them for a better result. Compared with other tools in the data ingestion area, the tools analyzed in this paper have some special characteristics, such as reliability guarantees offering zero data loss; with large volumes of data, Apache Kafka and Flume provide scalable, reliable and high-performance systems. Our decision was based on criteria which convinced us that, if put in a hypothetical situation of choosing a tool for data ingestion, we would use one of them.
The last and perhaps most important criterion that helped us decide on this choice was the information from articles and books such as [16], [17] and [21] about the use of these tools in well-known applications. We find that the main criteria when a company wants to choose a tool for data ingestion are: the speed at which data can be ingested, platform support offering the ability to connect with data stores, the ability to scale the framework to work with large datasets, and the ability to extract and access data from sources without impacting their performance or their ability to execute transactions; we used the same criteria in our choice of NiFi, Kafka and Flume.
3.1 Functionalities of analyzed tools
An important part of the study is the analysis of the tools' functionalities. In this subsection we present the functionalities used for the tool comparison. We analyzed the following indicators: reliability, system requirements, limits of the tool, stream ingest and processing, guaranteed delivery and data type. In the following paragraphs we justify our choice of these indicators; the results are presented in the results section, where we assess each functionality for each tool using a standard scale with three values: completely implemented functionality, partially implemented functionality and not implemented functionality.
According to [25], “data collection and transportation should be reliable with minimum data loss”. In our research, reliability was assessed as the ability to deliver and log events in situations where one of the tools experiences a software crash, scarce memory or limited bandwidth and data loss may occur. We consider the guaranteed delivery functionality very important because the user needs a guarantee that the data are complete after the ingestion process. The limits of the tools were included in our analysis to expose their disadvantages.
We note that, from this point of view, a perfect tool does not exist, and for companies a better solution can be to combine them. Another important functionality used in our investigation was data type, because it is important for the user to know what types of data can be ingested with the chosen tool.
3.2 Performance measurements of the
selected tools
Performance testing with big data includes two main actions: data ingestion and throughput, which verify how fast the system can consume data from various data sources, and data processing, which involves verifying the speed with which MapReduce jobs or queries are executed. We include performance measurements in our research because we consider that the user needs information about speed, the number of files processed per minute, and the file sizes acceptable to the tool. Our analysis is based on performance measurement studies published on the tools' official pages and on opinions found in articles from different users of Kafka, NiFi or Flume. In this paper, we used the following performance indicators in our research: speed, number of files processed per second, scalability and message durability.
For businesses, it can be challenging to ingest big data at a reasonable speed and to process it efficiently in order to maintain a competitive advantage, so the speed indicator needs to be included in our analysis. Another indicator included in our research was the number of files processed per second; we note that this number differed between the tools because of the size of the files. Scalability was included in our research based on what Cory Isaacson said in [5]: “When managing a successful expanding application, the ability to scale becomes a critical need. Whether you are introducing the latest new game, a highly popular mobile application, or an online analytics service, it is important to be able to accommodate rapid growth in traffic and data volume to keep your users happy.” We consider this an important indicator because data ingestion tools work with huge volumes of big data, and scalability can help in this process. Message durability was included in our research to make sure that even if the tool dies, the task is not lost. When one of the tools quits or crashes, the information about queues and messages is forgotten; to make sure that messages are not lost, it is necessary to mark both the messages and the queues as durable. In the results section we assess the performance indicators for each tool using a standard scale with three values: high, medium and low.
Below, we present a performance measurement study for Flume, published on Flume's official page, which we used in our research because it is the most explicit study found on this subject that evidences Flume's performance in the big data ingestion process. The test was made by M. Percy in [15], who used the following setup: the Flume agent was run in a single JVM on its own physical machine, and a separate client machine was used to generate load in syslog format against the Flume box. The data were stored by Flume onto a 9-node HDFS cluster configured on separate hardware; virtual machines were not used in this test. In terms of hardware specifications, the CPU was an Intel Xeon L5630, 2 x quad-core with Hyper-Threading @ 2133 MHz (8 physical cores), with 48 GB of memory and a 64-bit SuSE Linux operating system. For the Flume configuration, Mike Percy used Java version 1.6.0u26, one agent, a Memory Channel as the channel, and
an HDFSEventSink with Avro event serialization and snappy compression of the serialized data as the sink. The event size used was 300 bytes. According to the obtained results, Flume is capable of sustaining an approximate average performance of 70,000 events/second on a single machine, without data loss during the test.
4. The research results
4.1 The functionality comparison results
Based on the results obtained in our research, we compare the tools from the functionality point of view for a better analysis (see Table 1). According to the Wikipedia definition, Apache Flume “is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data”; our research confirms that this functionality is implemented and that Flume has a failover mechanism that can move the data flows to a new agent without external intervention. We consider that the best choice for the reliability functionality is Kafka, because in the event of a single point of failure the data remain available, in contrast to Flume, where the user cannot access events until the disk is recovered. On the other hand, NiFi is reliable by definition, but in the real world it is better to combine this tool with Kafka in order to use Kafka's reliable data stream storage.
Flume and NiFi represent the main choices for guaranteed data delivery, in comparison with Kafka, where delivery is not fully guaranteed. Depending on the protocol, NiFi supports guaranteed delivery, with the mention that it supports at-most-once or at-least-once semantics, while Flume guarantees the delivery of events using a transactional approach. All of the tools show a limitation for the “limits of the tool” functionality, because no single tool exists that addresses all requirements or does everything. In our analysis we obtained a list of limits for every tool. For Flume, we found that when the Kafka Channel is used the possibility of data loss appears, that data are limited to kilobyte-sized events, and that data replication does not exist. Kafka has the same limitation as Flume regarding the size of data (KB); it has a fixed protocol, format and schema, so custom code is often needed. Finally, NiFi does not support data replication and is not an option for CEP or windowed computations. The data types supported by Flume are represented by the SequenceFile, DataStream and CompressedStream file formats; Kafka accepts data types such as JSON, POJOs or Java beans and, as the fastest option, byte arrays; NiFi uses its own data object (the FlowFile). The conclusion for this functionality is that a tool capable of processing all types of data does not exist, and the user's decision depends on their needs.
We consider that NiFi is the best option from the point of view of system requirements because it can run on a laptop or be clustered across enterprise-class servers, the hardware and memory needed depend on the size of the data, it can run on Windows, Linux, Unix or Mac OS, it supports all major browsers, and it requires Java 8 or newer. On the other hand, Flume can run only on Linux, requires Java 8 or newer, requires sufficient disk space for sinks and channels, and its agents need read/write permissions on the relevant directories. Kafka requires machines with a lot of memory and can run on Unix or Windows, also requiring Java 8 or newer. With Hadoop, Flume can be used to transfer, collect and aggregate streaming events because it is a distributed system with a simple, flexible and intuitive programming model based on streaming data flows; our analysis shows that Flume maintains a master list of ongoing data flows. In Kafka, messages are put into topics, which are split into partitions that are replicated across the nodes in the cluster. Kafka provides high-throughput persistent messaging that is scalable and allows parallel data loads into Hadoop. NiFi provides real-time control and, when run in a cluster, makes the movement of data between source and destination easier to manage. In conclusion, from the point of view of functionality, according to our analysis we consider that NiFi represents the best solution to use in a company.
Table 1. Functionality tools comparison

Functionality                | Flume                  | NiFi                   | Kafka                  | Recommended tool
Reliability                  | Partially implemented  | Partially implemented  | Completely implemented | Kafka
Guaranteed delivery          | Completely implemented | Completely implemented | Partially implemented  | Flume and NiFi
Data type                    | Partially implemented  | Partially implemented  | Partially implemented  | Flume, NiFi and Kafka
System requirements          | Partially implemented  | Completely implemented | Partially implemented  | NiFi
Stream ingest and processing | Partially implemented  | Completely implemented | Partially implemented  | Flume, NiFi and Kafka
Limits of the tool           | Partially implemented  | Partially implemented  | Partially implemented  | Flume, NiFi and Kafka
4.2 The performance comparison results
Table 2 summarizes the performance indicators for Kafka, Flume and NiFi based on the results of our analysis. To determine the ratings of the tools in our research, we relied on the functionality results, where we noted that Flume and Kafka have a limit on data size (KB), so from this point of view Flume and Kafka are the best options for the number of files processed per second, because the files are small. The speed of the tools was rated as medium for all of them because this indicator has no standard limit; it depends on the needs of the user, and if the user wants to increase it, tuning procedures can be applied. According to [14], “Kafka is a general purpose publish-subscribe model messaging system, which offers strong durability, scalability and fault-tolerance support”, so we consider this tool the best option for scalability in comparison with NiFi and Flume, which are distributed, reliable and available systems. Kafka is very scalable, and one of its key benefits is that adding a large number of consumers can be made in an