Competition Instructions
Iceberg Competition Instructions
Competition Introduction
For the next few weeks, in lieu of our own internal competitions, you have the opportunity to participate in a public Kaggle competition. Your performance on this competition will be counted in our club rankings.
The competition we have chosen is the Statoil/C-CORE Iceberg Classifier Challenge. Essentially, the goal is to classify ships vs. icebergs given images. As in any of our previous competitions, you aim to achieve the highest accuracy and place at the top of the leaderboard. The procedure for participating is also similar to our in-class competitions: download the training and testing data, train on the training data, generate a submission file from the testing data, and upload the submission file for grading. The private leaderboard results determine final rankings and are made available after the final deadline.
Since this competition is difficult and long, you are allowed to work individually or in teams of 2.
Differences between Public and Private Kaggle Competitions
There are a few differences between a public Kaggle competition and our private classroom competitions:
- Anyone can enter a public Kaggle competition (obviously). There are currently 2,000+ teams in the Iceberg competition.
- There are monetary prizes! They range from $10,000 to >$1,000,000. The iceberg competition in particular has prizes of $25,000 for 1st, $15,000 for 2nd, and $10,000 for 3rd.
One might ask: why are such large sums of money being given out for classifying ships vs. icebergs? Well, this competition is sponsored by a shipping company. Distinguishing an iceberg from another ship more accurately could prevent a collision and save them millions of dollars. However, they clearly don't want to hire data scientists themselves, so they are essentially crowdsourcing the solution through Kaggle: they take the winner's model and use it for their business.
Some more differences between a public Kaggle competition and our in-class competitions:
- Data size tends to be much larger (sometimes on the order of 1 TB). We partially chose this competition due to the low data size (~1 GB).
- Even with this relatively low data size for a public competition, you will need a complex model. If the answer were trivial and could be achieved simply, they wouldn't be offering $50,000.
- Because you will need a complex model, you will need a GPU to run your models. You can begin writing your code now, and we will try to get more machines with GPUs running in the syslab soon. The machine "infosphere" currently has the only high-performance GPUs in the syslab.
- Rather than viewing our website for information about the data and competition procedures, everything is directly on the competition site itself. The competition link is below.
- Public competitions last months, not one week. This competition ends on January 23rd, 2018. Do not delay; begin as soon as possible. Everything takes far longer than expected to complete, especially when working with large amounts of data.
- People discuss solutions and post code in the Discussion tab. Of course, the competition leaders aren't going to give their winning solutions to the world while the competition is ongoing, but you can often find a half-decent model available. I recommend viewing and understanding what people have done and made available.
Tips
Consider all the techniques we have recently covered: Convolutional networks, transfer learning, Inception and ResNet, image preprocessing, image normalization, data augmentation, etc. Some may be useful, others will not.
If you choose to use PyTorch, which is faster and lower-level but has less documentation, this tutorial on transfer learning may be helpful when starting out. If you use Keras, this tutorial may be useful for beginners. If you're stuck and can't figure out how to do something, Google it or check the documentation.
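To make the Keras route concrete, here is a minimal sketch of a small convolutional model trained with data augmentation. This is an illustration, not a competitive solution: the array names X and y are placeholders for however you load the data, and the layer sizes and augmentation settings are assumptions you would tune yourself.

```python
# Minimal sketch of a CNN with data augmentation in Keras.
# X: numpy array of shape (n_samples, height, width, channels); y: 0/1 labels.
# Both names are hypothetical placeholders for your loaded competition data.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=X.shape[1:]),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),   # binary output: ship vs. iceberg
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Flips and small rotations are a plausible (but unverified) choice here,
# since these images have no canonical orientation.
datagen = ImageDataGenerator(rotation_range=15,
                             horizontal_flip=True,
                             vertical_flip=True)
model.fit_generator(datagen.flow(X, y, batch_size=32),
                    steps_per_epoch=len(X) // 32,
                    epochs=10)
```

Transfer learning follows the same general pattern, with a pretrained convolutional base (e.g. from keras.applications) swapped in for the Conv2D stack.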
Final words
I will leave you with these final words: public competitions are very hard. Don't be discouraged if you don't do well initially. Training, retraining, and tweaking your models are essential for your success. If you do well, say in the top 5%, it is extremely impressive. I'm not aware of any high school student who has won significant money from a Kaggle competition. Think about it: just as you will be working on this in your free time, so are PhDs, graduate students, and data scientists.
Good Luck!
Competition Link
Convolutional Neural Networks Competition Instructions
Each predicted keypoint is specified by an (x,y) real-valued pair in the space of pixel indices. There are 15 keypoints, which represent the following elements of the face:
left_eye_center, right_eye_center, left_eye_inner_corner, left_eye_outer_corner, right_eye_inner_corner, right_eye_outer_corner, left_eyebrow_inner_end, left_eyebrow_outer_end, right_eyebrow_inner_end, right_eyebrow_outer_end, nose_tip, mouth_left_corner, mouth_right_corner, mouth_center_top_lip, mouth_center_bottom_lip
Left and right here refers to the point of view of the subject.
The input image is given in the last field of the data files, and consists of a list of pixels (ordered by row), as integers in [0, 255]. The images are 96x96 pixels.
Data files
- train.csv: list of 5000 training images. Each row contains the (x,y) coordinates for the 15 keypoints and the image data as a row-ordered list of pixels. The first row is the header, which associates each column with its feature. There are 31 values in each row: the first 30 are the keypoint coordinates (x value for feature 1, y value for feature 1, x value for feature 2, etc.), and the 31st is the image. The first 30 should be the outputs of your network, and the 31st (the image) should be the input.
- test.csv: list of 2049 test images. Each row contains an ImageId and the image data as a row-ordered list of pixels.
- samplesubmission.csv: list of keypoints to predict. Each row has an Id and a value. The Id corresponds to the image and feature in the format "ImageId.FeatureId", where FeatureId is based on the order in which the features are sorted in the train file. For example, the first feature is left_eye_center_x, so to predict left_eye_center_x for image 1, the first row would be "1.1 34.555". A sketch of generating this file follows this list.
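To make the "ImageId.FeatureId" format concrete, here is a hypothetical sketch of writing the submission file. It assumes preds is a numpy array of shape (2049, 30) holding your 30 predicted values per test image, in the same order as the keypoint columns in train.csv; check samplesubmission.csv for the exact header and separator before submitting.

```python
# `preds` is a placeholder: preds[i - 1, j - 1] is the predicted value for
# FeatureId j of ImageId i (both Ids start at 1, per the tips below).
with open('submission.csv', 'w') as f:
    f.write('Id,value\n')   # assumed header; verify against samplesubmission.csv
    for image_id in range(1, preds.shape[0] + 1):
        for feature_id in range(1, preds.shape[1] + 1):
            f.write('%d.%d,%f\n' % (image_id, feature_id,
                                    preds[image_id - 1, feature_id - 1]))
```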
Helpful tips
- Sample code
- Use Keras. Documentation available here. The first half of this tutorial may come in handy. Our introductory lecture is available here.
- Start with a standard neural network before moving on to a convolutional one. When you do use a convolutional network, make sure to reshape your input to 96x96 instead of 1x9216 as it is right now. This is best done using numpy (see the sketch after this list).
- Use a linear activation in the final layer, because the challenge is regression, not classification. There should be 30 output nodes.
- The Ids start at 1 not 0. Don't mix this up.
- Don't forget your header in the submission file.
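Putting the last few tips together, here is a minimal Keras sketch of the reshape and of a network with a linear, 30-node output layer. X_flat and y are hypothetical names for the parsed training inputs (shape (n, 9216)) and keypoint targets (shape (n, 30)); the architecture itself is only a starting point.

```python
# Minimal sketch: numpy reshape plus a small CNN with a linear output layer.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# 1x9216 rows -> 96x96 single-channel images (X_flat is a placeholder name).
X = X_flat.reshape(-1, 96, 96, 1)

model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(96, 96, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(30, activation='linear'),  # 30 real outputs: (x, y) for 15 keypoints
])
model.compile(optimizer='adam', loss='mean_squared_error')  # regression loss
model.fit(X, y, epochs=10, batch_size=32)
```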
To better understand the format, open the files in Excel. Feel free to ask any clarifying format questions to the officers.
Neural Networks Competition Instructions
11/01/17 - Your job is to write the code to create a neural network, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify images of handwritten digits. The data we are using is from the famous MNIST dataset. Your neural network is supposed to classify which digit (0, 1, 2, 3, 4, 5, 6, 7, 8, or 9) the image represents, and the inputs are the 784 values that make up each 28x28 image.
The training data looks like this:
label, pixel 11, pixel 12, pixel 13, pixel 14, etc. (784 pixel values)
label, pixel 21, pixel 22, pixel 23, pixel 24, etc. (784 pixel values)
label, pixel 31, pixel 32, pixel 33, pixel 34, etc. (784 pixel values)
etc. (60,000 lines)
The testing data looks like this:
id, pixel 11, pixel 12, pixel 13, pixel 14, etc. (784 pixel values)
id, pixel 21, pixel 22, pixel 23, pixel 24, etc. (784 pixel values)
id, pixel 31, pixel 32, pixel 33, pixel 34, etc. (784 pixel values)
etc. (10,000 lines)
where pixel ij is the pixel value in the ith row and jth column. Each image is 28x28. Each pixel value ranges from 0 (black) to 255 (white). The MNIST dataset is black and white, which is why each pixel value is a single value instead of an (R,G,B) triple.
Your end goal is to create a file which looks like:
id, solution
1, predicted_label
2, predicted_label
3, predicted_label
4, predicted_label
etc. (10,000 lines)
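For reference, here is a minimal sketch of writing this file, assuming preds is a numpy array of 10,000 predicted digit labels ordered by id (a placeholder name for your network's output):

```python
# `preds` is a placeholder: preds[i] is the predicted digit for test id i + 1.
with open('submission.csv', 'w') as f:
    f.write('id, solution\n')
    for i, label in enumerate(preds, start=1):
        f.write('%d, %d\n' % (i, label))
```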
All standard competition rules apply. You are only allowed to use the numpy library. We highly recommend you use the library for vectors (bias vectors, etc.) and matrices (weights, partial-derivative matrices).
We've written a small shell. The shell has a network class, and each network is made up of a list of layers (which are a separate class). Each layer is designed to have its own vectors and matrices (biases and weights, etc.). You don't have to structure your network this way by any means, or have a layer class at all. Most of the time when people write neural networks from scratch, they only have a single network class.
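To illustrate the layer-based structure (not the shell's actual class names or signatures, which may differ), here is a rough numpy sketch with sigmoid activations. Only the forward pass is shown; backpropagation is left to you.

```python
# Rough sketch of a layer-based network in numpy (forward pass only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Layer:
    def __init__(self, n_in, n_out):
        # Small random weights and zero biases, stored as numpy arrays.
        self.weights = np.random.randn(n_out, n_in) * 0.01
        self.biases = np.zeros((n_out, 1))

    def forward(self, x):
        self.inputs = x                          # cache input for backprop
        return sigmoid(self.weights @ x + self.biases)

class Network:
    def __init__(self, sizes):
        # e.g. sizes = [784, 30, 10] for MNIST: 784 inputs, 10 output classes.
        self.layers = [Layer(a, b) for a, b in zip(sizes, sizes[1:])]

    def predict(self, x):
        # x is a column vector of shape (784, 1); returns the predicted digit.
        for layer in self.layers:
            x = layer.forward(x)
        return np.argmax(x)
```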
The competition ends in two weeks, at 11:59:59 p.m. on 11/14/17, since our next meeting is 11/15. Also, since writing a neural network from scratch is an involved process, the competition will be worth double in our rankings.
Support Vector Machine Competition Instructions
10/04/17 - Your job is to write the code to create an SVM, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify survival of passengers on the Titanic. The data we are using are from actual passengers on the ship. Your SVM is supposed to classify whether a passenger survived [RIP (0) or Survived (1)], based on 11 different metrics (features).
The purpose of this contest is not to test your ability to write an SVM. Instead, we are using this opportunity to test two abilities:
- Your ability to learn how to use Scikit-Learn
- Your ability to work with real-world data
The second is far more important (and difficult) than the first. With that in mind, the training data now looks like this:
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
etc. (636 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
etc. (255 lines)
Your end goal is to create a file which looks like:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (255 lines)
That is, for every set of features on line N of the testing data, your submission file should have a line "N, predicted_class".
There are missing data points. Some features are not useful. The difficult part of this contest is formatting the data given, determining which features are useful, which ones should be trained on, and how to deal with the missing data.
You are allowed to (and should) use Scikit-Learn to create your SVM. Scikit-Learn has detailed instructions on how to write an SVM using the library here. This is the trivial portion of the competition.
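As a starting point, here is a minimal scikit-learn sketch. It assumes you have already parsed the data into numpy arrays X_train, y_train, and X_test containing only the features you chose to keep, with missing entries parsed as np.nan; the mean-imputation step and the SVC parameters are simple defaults, not recommendations.

```python
# Minimal sketch: impute missing values, then fit an SVM with scikit-learn.
import numpy as np
from sklearn.svm import SVC

# One simple way to handle missing data: replace NaNs with the column mean.
col_means = np.nanmean(X_train, axis=0)
for col in range(X_train.shape[1]):
    X_train[np.isnan(X_train[:, col]), col] = col_means[col]
    X_test[np.isnan(X_test[:, col]), col] = col_means[col]

clf = SVC(kernel='rbf', C=1.0)   # default-ish parameters; tune these
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```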
We highly recommend you open both the training and testing csv files in a program like Excel, which will help you modify columns of data and perform calculations quickly.
The Data I/O code from the decision tree lecture may still be useful.
Standard competition rules apply, except for Rule #1, since we are using Scikit-Learn.
Random Forests Competition Instructions
9/27/17 - Your job is to write the code to create a random forest, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify survival of passengers on the Titanic. The data we are using are from actual passengers on the ship. Your random forest is supposed to classify whether a passenger survived [RIP (0) or Survived (1)], based on 7 different metrics (features). The training data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
etc. (500 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
etc. (214 lines)
Your end goal is to create a file which looks like:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (214 lines)
That is, for every set of features on line N of the testing data, your submission file should have a line "N, predicted_class".
Standard competition instructions and rules apply.
In case you care, the features correspond to:
pclass - Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
sex - Sex: 0 = Male, 1 = Female
Age - Age in years
sibsp - # of siblings / spouses aboard the Titanic
parch - # of parents / children aboard the Titanic
fare - passenger fare
embarked - Port of Embarkation: 0 = Southampton, 1 = Cherbourg, 2 = Queenstown
survival - Survival: 0 = No, 1 = Yes
Shell code is available here. The Data I/O code from the decision tree lecture will still be useful, but will need to be adapted for this data.
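If you are unsure how to turn trees into a forest, the core idea is bagging: train each tree on a bootstrap sample of the training data and take a majority vote at prediction time. Here is a rough numpy sketch; DecisionTree is a placeholder for your own class (e.g. adapted from last week's competition) with fit(X, y) and predict(x) methods.

```python
# Rough sketch of bagging decision trees into a random forest.
import numpy as np

class RandomForest:
    def __init__(self, n_trees=50):
        self.n_trees = n_trees
        self.trees = []

    def fit(self, X, y):
        # X and y are assumed to be numpy arrays so fancy indexing works.
        n = len(X)
        for _ in range(self.n_trees):
            idx = np.random.randint(0, n, size=n)  # bootstrap: sample with replacement
            tree = DecisionTree()                  # placeholder for your own class
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)

    def predict(self, x):
        # Majority vote across all trees.
        votes = [tree.predict(x) for tree in self.trees]
        return max(set(votes), key=votes.count)
```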
2017-2018 First Competition Instructions
The Data
9/20/17 - Welcome to the first contest of the year! Your job is to write the code to create a decision tree, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify breast cancer data. The data we are using is actual data from breast cancer patients. Your decision tree is supposed to classify whether a tumor is benign (0) or malignant (1), based on 9 different metrics (features). The training data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
etc. (533 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
etc. (150 lines)
Your end goal is to create a file which looks like:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (150 lines)
That is, for every set of features on line N of the testing data, your submission file should have a line "N, predicted_class".
Shell code is available on the lecture schedule page. The Data I/O code from the decision tree lecture will still be useful, but will need to be adapted for this data.
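If you are unsure where to start with the tree itself, one common splitting criterion is Gini impurity: pick the feature and threshold whose split minimizes the weighted impurity of the two resulting groups. A small numpy sketch of that criterion (entropy is an equally valid choice):

```python
# Gini impurity, a common splitting criterion for decision trees.
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D numpy array of 0/1 class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)              # fraction of class 1
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def split_impurity(feature_values, labels, threshold):
    """Weighted impurity after splitting on feature_value <= threshold."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```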
How to participate
Our contests will be held on Kaggle, using Kaggle InClass. This allows us to upload data and competition instructions, as well as impose submission deadlines. It also ranks submissions automatically! To participate:
- Create a Kaggle account by clicking "sign up" in the top right.
- Click on this link (the competition link).
- Download the training and testing data.
- Download the I/O Code.
- Write your algorithm and train it on the training data.
- Then, test it on the testing data, creating a submission file with your predictions in the format shown in the sample submission file.
- Upload your submission file and see your results!
- Tweak your code, repeating steps 5-7 to improve your accuracy and move up the leaderboard.
Some More Rules:
- Use Python. Everything we do this year is in Python. Don't use any library other than numpy. For this contest only, if you have no Python experience, you may use whatever language you are comfortable with.
- The Competition ends at 11:59:00 PM next Monday, 9/25.
- The leaderboard on Kaggle that you can see is the Public Leaderboard, which is your accuracy for 50% of the testing data. Your final rankings will be based on the Private leaderboard, which is based on the other 50% of the testing data and will become public as soon as the competition ends. This is to prevent you from just writing a decision tree that overfits the testing data, which defeats the purpose.
Standard Rules and Procedures
These instructions are common to almost every competition, so we're only listing them once.
To Participate:
- Create a Kaggle account if you don't already have one by clicking "sign up" in the top right.
- Click on the competition link, which will be posted in the lecture table on this page.
- Download the training and testing data.
- Download any shell code from the website.
- Write your algorithm and train it on the training data.
- Then, test it on the testing data, creating a submission file with your predictions in the format shown in the sample submission file.
- Upload your submission file and see your results!
- Tweak your code, repeating steps 5-7 to improve your accuracy and move up the leaderboard.
Rules:
- Use Python. Everything we do is in Python this year. Unless otherwise specified, do not use any ML library other than numpy.
- The Competition ends at 11:59:00 PM the following Tuesday.
- The leaderboard on Kaggle that you can see is the public leaderboard, which is your accuracy for some percentage of the testing data. Your final rankings will be based on the private leaderboard, which is based on the other part of the testing data and will become public as soon as the competition ends. This is to prevent you from just writing an algorithm to overfit the testing data, which defeats the purpose of the competition.