What if... The Avengers weren't a collection of the most powerful superheroes the world has ever known, and instead were a group of data scientists? That sounds like fun, right?
What if... I were to take some of my time and write something along those lines? You'd read it, right?
For those of us familiar with comic books from our childhood (or... not so long ago), The Avengers are Marvel Comics' premier superhero ensemble, bringing together otherwise independent and capable individuals of the superhuman variety to battle Earth's most dangerous villains, earning our gratitude and idolization forever. For the uninitiated, The Avengers are not actually a set crew of superheroes, with their membership being rather fluid and having changed dramatically over the years.
But what if, for fun (obviously), we matched up hero attributes and personalities with the super tools of the data science trade? Just as our heroes have their strengths, weaknesses, and preferences, so do data scientists. How would the individual Avengers' characteristics translate to the world of analytics?
With that in mind, I have taken some liberties with putting together my own Avengers iteration and ascribing their personality traits to their envisioned super data scientist equivalents.
Incidentally, What If... is actually the name of a long-running, if sporadic, series of comics in the Marvel Universe, which pursues non-canonical and generally "fun" storylines that fall far outside of the norm and challenge Marvel's status quo. This novel and unorthodox approach to storytelling actually seems like a pretty good fit for data science. Plus, as they run out of story ideas, maybe a future issue will imagine some of our heroes as data-oriented professionals.
So here they are, Earth's mightiest analysts. Data Avengers... Assemble!
Strong. Virtuous. A true leader. Captain America is clearly an executive. I envision Steve Rogers - his real name - to be the capable Chief Data Officer of Avengers, Inc. He may not be hands-on any longer, but he came up in the hardscrabble world of data munging, so he gets the everyday struggle the data scientists working under him go through.
Or, at least that's what he tells them.
Hulk smash... data!
More certain am I of this than I am that Marvel is clearly superior to DC: Hulk is definitely that data scientist who would try to solve every problem using MapReduce. Think about it: MapReduce smashes problems down into smaller pieces and then uses brute strength to process them further. Hulk maps whatever is in his way and reduces the rubble.
It's a perfect match. When Hulk is calm and away from data, assuming his Bruce Banner form, he's very insightful and sees value in all of the things, but when challenged by the task at hand he reverts to what he knows best.
Of course, MapReduce can't solve everything. But who is going to tell that to Hulk?
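For anyone who has never watched Hulk at work, here is a minimal sketch of the idea in plain Python, using the classic word-count example. It is a single-machine toy, not Hadoop or any real framework, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample text are my own made-up illustrations.

```python
from collections import defaultdict


def map_phase(chunk):
    """Smash one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key so each word's counts sit together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce each word's list of counts to a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def map_reduce(chunks):
    """Run map over every chunk, then shuffle, then reduce."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))


# Hypothetical input: each string stands in for a piece of a larger dataset.
chunks = ["Hulk smash data", "Hulk smash more data", "Banner ponders"]
counts = map_reduce(chunks)
print(counts)
```

In a real cluster the map and reduce steps would run in parallel across many machines, which is where the brute strength comes from. Here everything runs in one process, just to show the shape of the approach.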
The Inundated Iron Man
Given the impressive scientific, academic, and entrepreneurial background of Tony Stark (Iron Man's alter ego), I would think that Iron Man would prefer the most flexible tool in all of data science. As such, Iron Man is a Python kinda guy.
Munge data? Run some analysis? Whip up a classifier using libraries? Implement some neural nets from scratch? Build and scale a production-ready system? Python can really do all of this. Tony loves a prototype, too. Go ahead, name another programming language preferred by data scientists that can do all of this. No, that other one can't.
Of course, if Iron Man only developed production systems then he would probably grab a book on C++, but he prides himself on being a Tony of All Trades, and so he understands the value of Python.
Thor, Son of Spark
Hulk may be beastly strong, but Thor is godly strong. He also knows that no single algorithmic approach is capable of solving all of his problems. But he understands the power of a single framework running atop arguably the strongest data processing engine in the universe as a central piece in his problem-solving approach. Also, he is very focused when he works, and never allows anyone to pick up his keyboard.
Thor may be an Asgardian by birth, but he is Apache by choice, and relies on Spark daily.
The web-slinger has been both an Avenger and close confidant of the crew at different times. Peter Parker, his alter ego, is a studious young man with the gift of high intelligence. His scientific mind wants to solve problems of consequence, and is not interested in the petty, necessary practicalities which facilitate this ambition.
Your friendly neighborhood Spider-Man is an analyst's analyst. He isn't really concerned with building production systems or implementing his own algorithms, and as such software is merely a tool for him to use to gain insight and solve problems. R is his tool of choice, since it is built for exactly what he needs, nothing more. He doesn't mind that learning R is fraught with confusion, since he does not come from a computer science background and is unencumbered by the knowledge of how other programming languages are implemented.
Also, he's smug and a bit of a smart ass, so people dislike him. He's more of a backroom, away-from-the-client sort of data scientist.
Doctor Strange Approach
In the comics, Dr. Stephen Strange, MD, becomes Sorcerer Supreme, guardian of the entire universe. He uses magic to mystify his foes, confounding them while solving problems. In this sense, it seems that Dr. Strange would be an advocate of black-box algorithms. Better yet, as Sorcerer Supreme, and the most powerful entity in the Marvel Universe, not really understood by anyone else, perhaps he would use only black-box algorithms.
Black boxes are his first choice for everything. Iris dataset? Neural nets! Weather dataset? Random Forests! When he uses ensemble methods he prefers stacking. He will entertain the idea of a Support Vector Machine, but only at very high dimensionality.
When the other Data Avengers have hit a wall and don't know how to get their heads around an issue, they turn to his mystic unknowable algorithms for solutions.
Vision is an android created by... well, it doesn't matter. His role as a Data Avenger is to perform automated machine learning to help the others. Vision takes a hybrid Bayesian and genetic algorithm approach to feature selection and model building, performing training and testing on vast numbers of models in parallel in order to come up with the most accurate results and help point the other team members in the right direction.
The really important point here is that Vision has not supplanted the other team members. He is complementary to the more fleshy data scientists, and has not aimed to take over their profession and leave them all unemployed. Wink, wink.
There's no "I" in Vision.
J.A.R.V.I.S. (Just A Rather Very Intelligent System) is the Data Avengers' proprietary rip-off of IBM's Watson. They have decided to test the waters of cognitive computing, and implemented such a system from scratch.
Along with using it in-house for their own clients' needs, they have also implemented a publicly accessible API, available by subscription, which is the Data Avengers' primary source of income these days. The API economy was foreseen by Tony Stark years ago, and cash in he did.
Bonus: Fantastic Four of Data Preparation
As a slight to Reed Richards & Co., the Data Avengers keep this handy checklist to make sure new recruits know how to approach every task:
- Understand the problem domain and questions being asked
- Investigate the data
- Clean, prepare, and transform the data as required
- Approach problem solving from within a well-defined framework
Given the above, I can't help but think there is something here about putting together complementary and effective data science teams in real life. I'll leave it up to you to judge for yourself.
All comic personalities mentioned herein, and images used, are the sole and exclusive property of Marvel Comics.