Thursday, September 17, 2015

Machine Learning and Kaggle: Their Demo from a Conference

This is me liveblogging a conference lecture by Ben Hamner (video here: https://www.youtube.com/watch?v=9Zag7uhjdYo):
  • There seem to be a lot of people from India on the board as well.
  • Meh. Machine Learning for optimizing petroleum exploration. Wow.
  • Grading essays automatically. The task asks for a single grade, but the input is something that, to the ordinary person, is deeply subjective.
    • Human Level Performance!
  • I've always been fascinated by the prospect of diving headlong into one project, without having to worry about anything else, and THEN improvise my way around. Kaggle seems to be one avenue where I can pick up a project for 80 days. 
  • Toxicity of compounds. Biology. Reminds me, the more we concentrate ourselves on social data, the more we go away from the bulk of biological data.
  • Also, toxicity of a compound might not be an answer we expect from a machine learning algorithm, because toxicity is ultimately the effect the compound has on the host.
    • BUT, it reminds us that if we feed in as many features as possible and treat it as a supervised learning problem, we get very close to the feature sets that are relevant to the toxicity mechanisms.
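A minimal sketch of that idea, using scikit-learn on synthetic stand-in data (real toxicity work would use measured chemical descriptors and assay labels; everything below is hypothetical):

```python
# Sketch: treating toxicity prediction as supervised learning.
# The features are synthetic stand-ins for chemical descriptors
# (molecular weight, logP, etc.); real data would come from assays.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 500 "compounds", 20 numeric descriptors, binary toxic/non-toxic label
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Feature importances hint at which descriptors matter -- the
# "feature sets relevant to the toxicity mechanisms" mentioned above.
importances = model.feature_importances_
print("test accuracy:", model.score(X_test, y_test))
```

The point is not the classifier choice but that the model surfaces which features drive the prediction, without being told any chemistry.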
How do Machine Learning Competitions Work?
  • The basic setup: a training set to learn from, and a held-out test set to score against.
  • The leaderboard serves the purpose of motivating everyone to try and improve their individual rankings, which effectively means that the entire community submits one more time.
    • In turn, this leads to the entire solution space being explored.
    • The frontier of machine learning is reached pretty soon for easy competitions.
    • For tougher challenges, the frontier is reached a bit later.
  • For short-answer grading across prompts, the best entry hand-tuned its approach for each prompt.
    • The second entry, which came in very close, used the same approach for all prompts.
  • Interestingly, startups have come out of such competitions.
    • Others have used what they learnt over here to develop solutions for edX as well.
  • Incumbent vendors did not necessarily win; some vendors who participated went on to contract with the winners.
  • Interestingly, this NYT piece on deep learning: http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html?_r=0
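The competition mechanics described above (hidden test labels, scored submissions, a ranked leaderboard) can be sketched in a few lines. The names and predictions here are invented purely for illustration:

```python
# Sketch of how a leaderboard works: the host holds the true test
# labels privately; each submission is scored against them and ranked.
hidden_labels = [1, 0, 1, 1, 0, 1, 0, 0]   # known only to the host

submissions = {                             # hypothetical competitors
    "alice": [1, 0, 1, 0, 0, 1, 0, 0],
    "bob":   [1, 1, 1, 1, 0, 0, 0, 1],
    "carol": [1, 0, 1, 1, 0, 1, 0, 0],
}

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

leaderboard = sorted(
    ((name, accuracy(pred, hidden_labels))
     for name, pred in submissions.items()),
    key=lambda item: item[1], reverse=True)

for rank, (name, score) in enumerate(leaderboard, start=1):
    print(rank, name, round(score, 3))
```

Every new submission re-runs this scoring, which is exactly why a public leaderboard nudges the whole community to submit "one more time".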


This just serves as a reminder that domain-specific knowledge might hinder the machine. The takeaway seems to be that our focus has to be on building the machine; we cannot feed the machine our own understanding.

At the same time, when the top three methods were ensembled into one, the prediction results were far better. What does this mean? I presume that domain-specific knowledge is very useful at a particular place in the data processing pipeline.

Looking across the competitions holistically

  • Boruta seems to be very useful for narrowing down feature selection.
  • The Porter stemming algorithm seems to do a good job of reducing words to their stems.
  • Model ensembling yields marginal but significant gains.
  • Anything related to computer vision? Deep learning. Three libraries are available as well.
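On stemming: the real Porter algorithm (available, for example, as `nltk.stem.PorterStemmer`) applies several ordered rule phases. The toy suffix-stripper below only illustrates the basic idea and is in no way the full algorithm:

```python
# Toy suffix-stripping stemmer -- a drastically simplified sketch of
# what Porter stemming does. The real algorithm has ordered rule
# phases and "measure" conditions that this ignores entirely.
SUFFIXES = ["ational", "ization", "fulness",
            "ing", "ness", "est", "es", "ed", "ly", "s"]

def toy_stem(word):
    for suffix in SUFFIXES:          # longest/most specific first
        # only strip if a reasonable stem (>= 3 chars) remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["grading", "competitions", "stems", "ranked"]:
    print(w, "->", toy_stem(w))
```

Collapsing "grading", "graded", and "grades" onto one stem is what lets essay-grading features count them as the same word.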

I am skipping the problems with competitions, primarily because I did not understand many of them.


Oh yes, one big advantage is that the Kaggle network exposes many exotic problem domains to a wide variety of experts. Higgs Boson! Epilepsy!

An Awkward question:
  1. Apparently, there are several tutorials for the Titanic challenge on Kaggle. One person took the most basic R solution from the YouTube results and ended up ranked 219th. His question was: what was wrong with the 1,200 people behind him?
  2. Reminds me that the bulk of people who enter something are generally at the same level. Once you cross that plateau, you are talking business.


Wednesday, August 26, 2015

Starting with An Intro to Algorithms

There were a few reasons to get started with a Udacity course. The simplest of the reasons was that Udacity courses are straightforward, short, and easy to grok.

So, what is it? Does this mean we are shying away from challenges?

No. We are in this for the long haul, considering how many times we have restarted learning and relearning CS.

So, what is it? Why are we looking at Udacity?

We are looking at Udacity, because
  • If you want to learn the fundamentals of CS, and want to do so at MIT OCW CS level, then it makes sense to have a soft start. This is similar to how Indian students learn the same topics in Physics and Chemistry thrice: in secondary school, in higher secondary, and in their first two years of college. Two years of the same topics, done three times.
  • The goal right now is not strength but endurance. MIT OCW CS is a pretty huge goal, but what if you had an easy curriculum? Would you show up for Udacity day after day, every day? That is the point: showing up every day is the challenge. Do we have that level of self-discipline?
So, there you go. Dream as big as you want to, but show up every day. Can we do that?