Big Data Processing Algorithms as a Trump Card in the New De Beers of IT (Part 2)

(The beginning of the story is here)


Subject: Too hasty to be correct

Goddamn, my boy, this is my fault! Hadoop is undoubtedly a great pre-computation tool. However, I didn't expect you to rush into the most overexposed Big Data technology without weighing all the pros and cons. Of course, the application you are going to build on top of Hadoop may well meet your client's requirements. But something tells me your decision was made offhand (at the very least, the brief time that has passed since your last message hints at some prematurity). Which paradigms other than MapReduce did you take into consideration? Did you ask yourselves about the latency of Hadoop-based solutions? Are you sure your client is ready to stare at a static screen for 1.0 – 1.5 minutes while waiting for your app's response? Besides, as far as I know, your client's dark data amounts to a dozen terabytes at the moment (which looks appropriate for Hadoop, right?). But what about the pace of new data input? Will your client have a continuous influx of dark data?
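For context, the MapReduce paradigm the old man keeps pressing on can be sketched in a few lines of plain Python. This is a toy word count, not a Hadoop job; the phase names and sample documents are illustrative only:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in a document
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big deal", "big data"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 3, 'data': 2, 'deal': 1}
```

The latency question in the letter follows directly from this shape: a real cluster must finish the whole map and shuffle before any reduce output appears, which is where those minute-long waits come from.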

Since your reputation as a professional software developer is at stake, another brainstorming session over such questions is never redundant.


Subject: Hadoop. Why not?  

Dear Mr. Rootlord,

Although it was a swift decision, as you rightly noted, I could hardly call it groundless. Believe me, we are quite aware of the problems many software development companies frequently face while moving toward Big Data processing. In fact, we outsource developers all live in each other's pockets :). Many start, for example, with Apache Hive. When something goes wrong, they shift to Spark SQL or Impala. Some of them get so frustrated with Big Data that they go back to their good old RDBMS. Sometimes it seems to me that certain "professionals" have created a number of toys just for other "professionals" to play with. But business does not accept such an approach. Business is looking for diamonds in the end, not for "extractors" and "diamond cutters" that are only capable of playing with opportunities.

Big data crown

There are also HBase, Spark MLlib, Mahout, Cassandra, and many others. However, it would be silly to require our DevOps engineers to be conversant with all of these technologies. Besides, HDFS is a classic cluster file system that allows us to rely on Hadoop (together with Spark when latency is crucial). On top of everything else, our current customer does not need real-time Big Data processing with an instant response from our application (batch processing would apparently satisfy him). The only aspect that remains uncertain is the "influx of input data" – we are still trying to figure out its exact volume. I may note that your gut feeling does not fail you when you emphasize the importance of that critical issue. What else could your intuition advise us with regard to our "diamond mining" attempts?
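The batch-versus-real-time distinction mentioned above is easy to make concrete: a batch job processes everything accumulated so far in one scheduled pass instead of answering each incoming event instantly. A minimal sketch in plain Python, with entirely hypothetical sales records standing in for the customer's data:

```python
from datetime import date

# Hypothetical records accumulated over a couple of days
records = [
    {"day": date(2019, 10, 1), "amount": 120.0},
    {"day": date(2019, 10, 1), "amount": 80.5},
    {"day": date(2019, 10, 2), "amount": 42.0},
]

def run_batch(records):
    # A batch job: one pass over everything collected so far,
    # run on a schedule (e.g. nightly), not per incoming event
    totals = {}
    for rec in records:
        totals[rec["day"]] = totals.get(rec["day"], 0.0) + rec["amount"]
    return totals

print(run_batch(records))
```

A real-time system would instead update the running total the moment each record arrives; if the customer only ever looks at yesterday's totals, that extra machinery buys nothing.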


Subject: Debunking some myths

My dear friend, I'm glad I was wrong about your decision, which seemed ill-advised at first glance. It turns out you are quite well-versed in the Big Data processing subject (even better than I could have expected :). However, I'm going to throw out some ideas in order to shake your confidence in the decision you've made (I hope you can accept it as my "gut feeling's" contribution to the correctness of your approach).

Please consider the following seemingly obvious facts that turn out to be myths upon close examination:

  1. Marketers actively promote Hadoop as a free-of-charge solution. And officially it looks that way. However, a customer has to hire expensive specialists to maintain the "custom builds". In many cases, instead of being busy with data processing, those specialists have to continuously patch holes in green, raw software. In practice, quite serious money has to be spent on specialists along the way, nullifying any profit from the "free of charge" approach;

  2. Hadoop can run on cheap hardware. It seems to. However, a client will need appropriately powerful servers to perform fast and reliable processing; desktop solutions can hardly fit. According to Cloudera, the system requirements suggest having at least two 4–16-core CPUs, 12–24 JBOD disks, 64–512 GB of RAM, and a 10 Gbit network. That does not look like cheap hardware;

  3. And the most enduring myth concerns not Hadoop itself but the Big Data concept in general, and the so-called "unstructured data" in particular. While tables are commonly accepted as "structured data", XML, YAML, and JSON belong to the "semi-structured" kind, having a less evident structure than tables, but a structure nonetheless. Another favorite bogey of Big Data pundits is logs. However, logs can quite appropriately be structured into tables without using MapReduce. Messages, music tracks, video files, and many other formats are structured (or at least can be structured) one way or another. Very rare cases such as genomes or huge document archives do constitute "truly unstructured data", of course. But they lie beyond business analytics in most cases (like the rocks containing no rough diamonds from our metaphor).
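The point about logs in item 3 can be made concrete: a typical access-log line carries an obvious implicit schema, and a single regular expression turns it into a table row, no MapReduce required. A minimal sketch in Python; the log line and its format are a hypothetical example:

```python
import re

# A hypothetical web-server access-log line
log_line = ('192.168.0.7 - - [08/Oct/2019:14:02:51 +0000] '
            '"GET /index.html HTTP/1.1" 200 5316')

# One regular expression with named groups recovers the implicit schema
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

def structure(line):
    # Turn a "semi-structured" log line into a structured table row
    match = LOG_PATTERN.match(line)
    row = match.groupdict()
    row["status"] = int(row["status"])
    row["size"] = int(row["size"])
    return row

row = structure(log_line)
print(row["ip"], row["status"], row["size"])  # 192.168.0.7 200 5316
```

Rows like this load straight into any RDBMS table, which is exactly why "logs are unstructured data" deserves its place on the myth list.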

Don’t get me wrong, my boy, I am not trying to dissuade you from Hadoop. However, you should realize that Hadoop is a huge crawler-mounted mining shovel designed for opencast diamond mining (to follow our De Beers analogy). Are you sure you need such a big enterprise solution while neither IoT nor artificial intelligence is involved? What if a smaller hydraulic excavator would fit? I suppose Kudu, Vertica, Greenplum, Teradata, and even Postgres-XL are worth considering in your case. Besides, many outsourcing app makers use cloud services such as Redshift and BigQuery when it comes to small and middle-sized companies and startups. Please think it over again, my friend; I just don't want you to overplay your hand.

Data storage



Subject: Spades & barrows  

Dear Mr. Rootlord,

I can honestly say your last message planted a seed of doubt in me. The more I think about our current project, the more it looks to me like Hadoop is excessive for us right now. In any case, we are not going back to any manual data processing – spades and barrows are obviously not sufficient for real diamond mining. The challenge lies in maintaining a proper balance between effort and results, as well as between our customers' expenses and the value we can add to their business with our Big Data processing (you are absolutely right in saying that our reputation as a software development company is at stake). I clearly realize that marketers overhype the very term Big Data, making both customers and software vendors wade through dense buzzwords. Good thing there's always an opportunity to find relevant use cases and success stories from colleagues sharing their practical experience on the Internet. And of course, Indeema thanks you a lot for the great De Beers metaphor, in which the diamond mining process makes Big Data processing clearer and closer to solid reality.

