Steve Hoffman has 30 years of software development experience and holds
a B.S. in computer engineering from the University of Illinois Urbana-Champaign
and an M.S. in computer science from DePaul University. He is currently
a Principal Engineer at Orbitz Worldwide.
More information on Steve can be found at http://bit.ly/bacoboy or on
Twitter @bacoboy.
This is Steve's first book.
Apache Flume: Distributed Log Collection for Hadoop
Hadoop is a great open source tool for sifting tons of unstructured data into something
manageable, so that your business can gain better insight into your customers' needs.
It is cheap (can be mostly free), scales horizontally as long as you have space and
power in your data center, and can handle problems your traditional data warehouse
would be crushed under. That said, a little-known secret is that your Hadoop cluster
requires you to feed it with data; otherwise, you just have a very expensive heat
generator. You will quickly find, once you get past the “playing around” phase
with Hadoop, that you will need a tool to automatically feed data into your cluster.
In the past, you had to come up with a solution for this problem, but no more! Flume
started as a project out of Cloudera when their integration engineers had to keep
writing tools over and over again for their customers to import data automatically.
Today the project lives with the Apache Foundation, is under active development,
and boasts users who have been using it in their production environments for years.
In this book I hope to get you up and running quickly with an architectural overview
of Flume and a quick start guide. After that we’ll deep-dive into the details on many
of the more useful Flume components, including the very important File Channel
for persistence of in-flight data records and the HDFS Sink for buffering and writing
data into HDFS, the Hadoop Distributed File System. Since Flume comes with
a wide variety of modules, chances are that the only tool you’ll need to get started
is a text editor for the configuration file.
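To give a flavor of what that configuration file looks like, here is a minimal sketch of a single-node Flume agent wired as source → channel → sink. It uses the built-in netcat source, memory channel, and logger sink for illustration; the names `agent`, `r1`, `c1`, and `k1` are arbitrary labels chosen for this example:

```properties
# Name the components of this agent
agent.sources = r1
agent.channels = c1
agent.sinks = k1

# Source: listen for lines of text on a local TCP port
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444
agent.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# Sink: log events to the console (swap in an HDFS sink for real pipelines)
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1
```

Note the asymmetry in the wiring: a source can fan out to multiple channels (`channels`, plural), while a sink drains exactly one (`channel`, singular).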
By the end of the book, you should know enough to build out a highly available,
fault tolerant, streaming data pipeline feeding your Hadoop cluster.