
Basics you wanted to know about big data, machine learning and artificial intelligence, but were afraid to ask

Over the past few months, the field of artificial intelligence has been exploding. A lot of people I meet here in the Bay Area talk about it constantly and try to come up with different use cases for it. It is increasingly clear that artificial intelligence will be a major toolset of the future. I believe it will eventually exceed the status of a toolset and find an evolutionary path of its own.

But the more conversations I have about this, the more definitions I hear for the different buzzwords. What is artificial intelligence? Is it the same as machine learning? Some people throw around terms like Natural Language Processing (NLP). What is that? Most predictive analytics companies claim to be using some form of artificial intelligence. Are they really all using cutting-edge technologies? If not, what are they using? And how does it help or hurt them when competing with other companies that are, in fact, using some of the cutting-edge tools?

Over the next few blog posts, we plan to illuminate the key differences between what people are doing, how to think about machine learning and AI in your product, and how to prepare your company to be competitive in the future that is inevitable.

But first, some definitions. Keep in mind that these blog posts are written from the point of view of practitioners and not researchers (although we work hand in glove with researchers). Thus, we won’t get super technical about any of these items. There are people far smarter and far more articulate who have done an excellent job of demystifying the science behind all of these concepts. We will do a blog post compiling some of our favorite resources very soon. For now, we will focus on the practical aspects of the field and how company executives should be thinking about the best ways to use data to put their companies on the far end of the competitive spectrum.



Ok, enough chatter. On to some loosey-goosey definitions, along with a recap of some of the basics:

What is big data?
So, you have heard the term big data and understand that it is a large amount of data that could be structured or unstructured. As you know, it is important because there are meanings, patterns and predictive behavior hidden in that large swath of data. However, the traditional computational and data processing techniques that we all grew up studying just don’t solve the problem of understanding the meaning behind such large amounts of data. First, the data needs to be stored across hundreds (or thousands) of servers. Then, it has to be presented in a format in which it can be analyzed. Traditional techniques of analyzing all of the data in one go just don’t work at this scale. This is the main problem that traditional analysts have: they can’t hold and analyze the data like they did in the past. Along with the proliferation of the cloud, newer big data techniques make it much easier to wrangle data of this size. Which brings us to the next question:

How do we make sense of all this data? 
To make sense of the data, we first have to present it in a format that an algorithm can consume. The next part is tweaking those algorithms to get the desired understanding. Machine learning is one of the techniques that can help uncover the patterns in the data without an analyst starting from a specific viewpoint. Actually, machine learning techniques have been around for decades (yes, decades). But in 2012, there was a major breakthrough: a neural network achieved a phenomenal result in recognizing objects in photographs in the ImageNet competition. The technique that the researchers used is known as deep learning. Researchers, and then practitioners, all over the world rejoiced, and felt that this was the new silver bullet to solve the world’s data analysis problems. Coupled with the fact that everyone was generating vast amounts of data, researchers felt more confident that this technique, combined with big data, could find hidden meanings that were more difficult to find in decades past. It looks like their excitement was well placed. Great progress has been made in this area, and the progress continues to surprise even the most ardent fans of the techniques.

So, machine learning lets computers find meanings in data?
In short, yes. But that’s a very broad definition. More specifically, machine learning refers to the idea of letting these new algorithms and techniques find meaning in data without starting from an analyst’s viewpoint. Let me give you an example. With data analysis, a typical analyst will come up with theories on how the data could be related and then validate those theories. Most of the time, the hypothesis proves incorrect, but not without giving the analyst more information with which to come up with a new hypothesis. Machine learning techniques turn this approach on its head. By letting machines discover patterns in the data, they can find highly complex relationships that even the best mathematicians would struggle to model by hand. Exactly how they do this is the subject of another blog post, where we will cover basic concepts like supervised learning and unsupervised learning, and when each one makes sense. For now, let’s keep in mind that machine learning techniques are more powerful and try to uncover patterns which the machine learning theorist or practitioner need not be aware of before the process begins.

Ok, I get it. Can machine learning be applied to ‘small data’?
Yes. A large amount of data is not necessary for the techniques to be successful. The simple way to think about it is to ask whether the data contains enough information and structure to make some sense. For example, a list of 100 houses in a zip code with prices and square footage will give one a very good idea of how to price a new house given its square footage. However, if the data only contained house prices and the number of windows in each house, then that’s not a good indicator. The best way to think about it is that if a human can be trained to make some sense of the data without relying on other knowledge, then a machine can probably do so as well.
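To make the house example concrete, here is a minimal sketch of fitting a straight line to price versus square footage. The four data points, the `fit_line` and `predict_price` names, and the choice of ordinary least squares are all assumptions made for illustration; a real list of 100 houses would be noisier, and a straight line is only one of many possible models.

```python
# Hypothetical data: (square_feet, price_usd) pairs invented for illustration.
houses = [(1000, 200_000), (1500, 290_000), (2000, 410_000), (2500, 500_000)]


def fit_line(points):
    """Ordinary least squares fit of price = slope * sqft + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Covariance of sqft and price, and variance of sqft (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(sqft, slope, intercept):
    """Estimate the price of a house from its square footage."""
    return slope * sqft + intercept


slope, intercept = fit_line(houses)
estimate = predict_price(1800, slope, intercept)
print(f"Estimated price for 1800 sq ft: ${estimate:,.0f}")
```

Swapping square footage for the number of windows would run just as happily, but the fitted line would explain very little of the price, which is the point of the windows example.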

So, what is this artificial intelligence?
Artificial intelligence is the most difficult one to define. I tried to read the definition on Wikipedia, and it gave me a headache. Everyone defines it differently, but in general it refers to the idea of computers and algorithms doing things that were earlier considered the domain of humans. For example, understanding complex voice commands, sentences and phrases was considered near impossible about a decade ago, and yet computers are now able to do just that. Similarly, reading and understanding handwritten signs, or making sense of the landscape while driving a car, are things that seem fantastic for a machine to be able to do. Ultimately, under the covers, it is a matter of getting a lot of information from various sources (multiple cameras and all kinds of sensors) and correlating it in a manner similar to how we make sense of the data. Hence the term ‘artificial intelligence’: there is a lot more complex “solving” and “learning” happening. Also, it sounds cool!

I hope the above gives you some sense of the world of machine learning and artificial intelligence. Over the next few posts, we will go a little deeper into each topic, while keeping in mind that our target audience is industry executives who should be prepared for the changes that are already occurring in their industries.

If you have any questions, feel free to email me or find me on LinkedIn. I’d love to hear from you if I can help you or your team with machine learning.

Why don’t online reviews work as well as they are supposed to?


The internet was primarily designed, and has evolved, to solve information problems. The internet cannot yet deliver experiences, except where the experience itself consists entirely of information. The internet can only deliver information for the senses of hearing and sight, through pictures, text and video. Thus, any experience that consists wholly of stimuli to these senses can be delivered, such as games. However, the internet cannot deliver information for the senses of touch, taste and smell. Thus, for us to experience stimuli to those senses, we must experience them in the ‘real’ world. All the internet can do is deliver information about those stimuli through text, and therefore, understanding. What it really means is that we receive an account of what the stimuli will consist of, and in our minds we try to experience it. This is exactly what online reviews are, and why they are becoming increasingly popular.

Every day I come across different websites with a 5-star rating for reviews of something or the other. Even the most popular sites with reviews (Yelp, YouTube, etc.) do not provide a whole lot of value from the reviews, even as more and more people add reviews. I am not saying that reviews are completely meaningless, only that they do not completely encapsulate the information that they are supposed to.

However, the value of the reviews is measured by how accurately we feel those sensations that we expect to feel, when we actually do go and have that experience. Any time there is a gap between the expectation that we form in our mind versus the experience we have, the value of that information becomes suspect, and the source of that information gets discredited. In this context, the value of information can really be measured by this experience – expectation gap. It is worthwhile to note here that if that gap is positive (meaning that we end up having a better experience than expected), we are pleasantly surprised, while a negative gap induces disappointment.
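The gap described above can be written down as a simple difference. The sketch below is only an illustration: it assumes that both the expectation and the experience can be expressed on the same star scale, and the function names `expectation_gap` and `reaction` are made up for this example.

```python
def expectation_gap(expected_stars, experienced_stars):
    """Experience minus expectation; positive means better than expected."""
    return experienced_stars - expected_stars


def reaction(gap):
    """Map the sign of the gap to the reaction described in the text."""
    if gap > 0:
        return "pleasantly surprised"
    if gap < 0:
        return "disappointed"
    return "as expected"


# Hypothetical case: reviews led us to expect 4 stars, the visit felt like 3.
gap = expectation_gap(expected_stars=4.0, experienced_stars=3.0)
print(gap, reaction(gap))
```

The larger the absolute value of the gap, the less the review told us about what we would actually experience.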

Of course, this brings us to another problem, which is how to measure this difference between expectation and experience. Theoretically, the experience is captured in the description of the review, thereby contributing to the expectation. However, not all people are the same, and though the expectation from the same piece of information might be different for different people, what is more troubling is that the experience of different people varies a lot as well. An inherent assumption in the review model is that all reviewers are equal and that the set of reviewers is large enough to statistically represent the vagaries of human nature accurately. Thus, each reviewer gets an equal amount of voting power, while votes get averaged over many reviewers.

The upshot of this is that even though the amount of information contained in different reviews is different, it gets averaged over reviews to provide a more or less consistent amount of average information, which is enough to form a sort of personalized average expectation in the mind. For the present state of the internet, this is considered a fair system, mostly because of the lack of a better automated and scalable system. This is the reason, however, why the average rating for most “average” items (be it restaurants on Yelp, or videos on YouTube) tends to converge to ~3.5-4. The outliers are the superb (4.5-5) and the horrible (<3 stars). For the express purpose of classifying the item into one of these 3 buckets, the current review system is fine. But there is no real benefit from having the granular system of 5 stars, as the discrepancy between reviews is great. That is the reason we see a lot of websites today switching to the easier and simpler vote up/down system.
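Here is a rough sketch of the averaging and the three buckets described above. The thresholds for superb (4.5 and up) and horrible (below 3) come from the text; treating everything in between as “average” is an assumption, as are the function names and the sample ratings.

```python
from statistics import mean


def average_rating(stars):
    """Every reviewer gets equal voting power: a plain average of the stars."""
    return mean(stars)


def bucket(avg):
    """Collapse an average star rating into one of three coarse buckets."""
    if avg >= 4.5:
        return "superb"
    if avg < 3.0:
        return "horrible"
    return "average"  # assumption: anything from 3.0 up to 4.5


# Hypothetical star ratings for three items.
items = {
    "diner": [5, 4, 3, 4, 2],
    "bistro": [5, 5, 4, 5],
    "food cart": [1, 2, 3],
}
for name, stars in items.items():
    print(name, round(average_rating(stars), 2), bucket(average_rating(stars)))
```

Note how quite different sets of individual ratings can land in the same bucket, which is why the extra granularity of 5 stars adds little once the votes are averaged.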

However, the truth still remains that the amount of information contained in each review is different, based upon the prior experiences and nature of the person writing it. If we are able to capture this difference in a meaningful way, then the amount of information carried by a single review will greatly increase.