TJ Machine Learning Club
Making AI more accessible
Competition Instructions
Beginner Series Competitions
Convolutional Neural Networks Competition Instructions
12/21/18 - Your job is to write a CNN that predicts the location of facial features in an image. Each predicted keypoint is specified by an (x,y) real-valued pair in the space of pixel indices. There are 15 keypoints, which represent the following elements of the face:
left_eye_center, right_eye_center, left_eye_inner_corner, left_eye_outer_corner, right_eye_inner_corner, right_eye_outer_corner, left_eyebrow_inner_end, left_eyebrow_outer_end, right_eyebrow_inner_end, right_eyebrow_outer_end, nose_tip, mouth_left_corner, mouth_right_corner, mouth_center_top_lip, mouth_center_bottom_lip
Left and right here refer to the point of view of the subject.
The input image is given in the last field of the data files, and consists of a list of pixels (ordered by row), as integers in [0, 255]. The images are 96x96 pixels.
Data files
- train.csv: list of 5000 training images. Each row contains the (x,y) coordinates for 15 keypoints, followed by the image data as a row-ordered list of pixels. The first row is a header associating each column with its feature. There are 31 values in each row: the first 30 are the keypoint coordinates (x value for feature 1, y value for feature 1, etc.) and the 31st is the image. The first 30 should be the outputs of your network and the 31st (the image) should be the input.
- test.csv: list of 2049 test images. Each row contains an ImageId and the image data as a row-ordered list of pixels.
- samplesubmission.csv: list of keypoints to predict. Each row has an Id and a value. The Id identifies the image and feature in the format "ImageId.FeatureId", where FeatureId is based on the order in which the features appear in the train file. For example, the first feature is left_eye_center_x, so to predict left_eye_center_x for image 1, the first row would be "1.1 34.555".
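The Id format above can be generated with a simple nested loop. Here is a minimal sketch of writing such a submission file, assuming your predictions are in a NumPy array of shape (num_images, 30); the header name and the placeholder array are assumptions, so check samplesubmission.csv for the exact column names.

```python
import numpy as np

# placeholder predictions: one row per test image, 30 keypoint values each
predictions = np.random.rand(2049, 30) * 96

with open("submission.csv", "w") as f:
    f.write("Id,Predicted\n")  # assumed header; verify against samplesubmission.csv
    for image_id in range(1, predictions.shape[0] + 1):      # Ids start at 1
        for feature_id in range(1, 31):                      # 30 features per image
            value = predictions[image_id - 1, feature_id - 1]
            f.write(f"{image_id}.{feature_id},{value}\n")
```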
Helpful tips
- Sample code
- Use Keras. Documentation available here. The first half of this tutorial may come in handy. Our introductory lecture on Keras is available here.
- Start with a standard neural network before moving on to a convolutional one. When you do use a convolutional network, make sure to reshape your input to 96x96 instead of the 1x9216 it is right now. This is best done using numpy.
- Use a linear activation in the final layer because the challenge is regression not classification. There should be 30 output nodes.
- The Ids start at 1 not 0. Don't mix this up.
- Don't forget your header in the submission file.
- To better understand the format, open the files in Excel.
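Putting the reshape and linear-output tips together, here is a rough Keras sketch. The layer sizes are arbitrary placeholders, and the `X`/`y` arrays are random stand-ins for the loaded training data, so treat this as a starting-point shape check rather than a tuned model.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

X = np.random.rand(10, 9216)     # stand-in for the loaded pixel rows
y = np.random.rand(10, 30) * 96  # stand-in for the 30 keypoint targets

# reshape flat 1x9216 rows into 96x96 single-channel images for the CNN
X = X.reshape(-1, 96, 96, 1)

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(96, 96, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(30, activation="linear"),  # 30 outputs, linear because this is regression
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=1, verbose=0)
```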
The competition ends Friday night at 11:59:59 p.m. on 1/11/19. Feel free to ask the officers any clarifying format questions.
Neural Networks Competition Instructions
11/25/18 - Your job is to write the code to create a neural network, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify images of handwritten digits. The data we are using is from the famous MNIST dataset. Your neural network is supposed to classify which digit (0 through 9) the image represents, and the inputs are the 784 pixel values that make up the 28x28 images.
The training data looks like this:
label, pixel 11, pixel 12, pixel 13, pixel 14, etc. (784 pixel values)
label, pixel 21, pixel 22, pixel 23, pixel 24, etc. (784 pixel values)
label, pixel 31, pixel 32, pixel 33, pixel 34, etc. (784 pixel values)
etc. (60,000 lines)
The testing data looks like this:
id, pixel 11, pixel 12, pixel 13, pixel 14, etc. (784 pixel values)
id, pixel 21, pixel 22, pixel 23, pixel 24, etc. (784 pixel values)
id, pixel 31, pixel 32, pixel 33, pixel 34, etc. (784 pixel values)
etc. (10,000 lines)
where pixel ij is the pixel value in the ith row and jth column. Each image is 28x28. Each pixel value ranges from 0 (black) to 255 (white). The MNIST dataset is grayscale, which is why each pixel value is a single value instead of an (R,G,B) triple.
Your end goal is to create a file which looks like:
id, solution
1, predicted_label
2, predicted_label
3, predicted_label
4, predicted_label
etc. (10,000 lines)
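The data I/O for the formats above might be sketched like this. The tiny generated CSVs stand in for the real train.csv and test.csv (assumed to have no header row, matching the layout shown), and the zero predictions are a placeholder for your network's output.

```python
import numpy as np

# create tiny placeholder files so the sketch runs end-to-end;
# replace these with the real train.csv and test.csv
rng = np.random.default_rng(0)
np.savetxt("train.csv",
           np.hstack([rng.integers(0, 10, (5, 1)), rng.integers(0, 256, (5, 784))]),
           delimiter=",", fmt="%d")
np.savetxt("test.csv",
           np.hstack([np.arange(1, 4).reshape(-1, 1), rng.integers(0, 256, (3, 784))]),
           delimiter=",", fmt="%d")

train = np.loadtxt("train.csv", delimiter=",")
labels = train[:, 0].astype(int)   # first column is the label
pixels = train[:, 1:] / 255.0      # scale pixel values into [0, 1]

test = np.loadtxt("test.csv", delimiter=",")
ids = test[:, 0].astype(int)       # first column is the id
test_pixels = test[:, 1:] / 255.0

# ... train your network here; `predicted` stands in for its outputs ...
predicted = np.zeros(len(ids), dtype=int)

with open("submission.csv", "w") as f:
    f.write("id, solution\n")
    for i, p in zip(ids, predicted):
        f.write(f"{i}, {p}\n")
```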
As always, only use Python for this competition. The only external library allowed for this competition is NumPy. We highly recommend you use the library for vectors (bias vectors, etc.) and matrices (weights, partial-derivative matrices).
We've written a small shell. The shell has a network class, and each network is made up of a list of layers (which are a separate class). Each layer is designed to have its own vectors and matrices (biases and weights, etc.). You don't have to structure your network this way by any means, or have a layer class at all. Most of the time when people write neural networks from scratch, they only have a single network class.
Some important tips for the competition:
- Weight initialization - We didn't really cover this, but there are many ways to choose your starting weights. In general, we suggest drawing them from a Gaussian distribution, scaled to roughly (-1, 1).
- One hot encoding - Since you have 10 output nodes, make sure you convert your expected y value to 10 values. For example, 3 would become (0 0 0 1 0 0 0 0 0 0) for the network's output, where each node indicates whether a specific digit is present. One hot encoding is used as a representation for categorical variables, which in this case are the digits. Check this post out if you want to read more into it.
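Both tips are a few lines of NumPy. This sketch shows one way to do each; the weight-matrix shape is an arbitrary example, and the exact Gaussian scale is a tunable choice.

```python
import numpy as np

# weights drawn from a Gaussian with mean 0; the scale (here 1) is up to you
weights = np.random.normal(0, 1, size=(784, 30))

def one_hot(label, num_classes=10):
    """Convert a digit label into a length-10 indicator vector."""
    encoded = np.zeros(num_classes)
    encoded[label] = 1
    return encoded

print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```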
The competition ends in a week and a half, at 11:59:59 p.m. on 12/5/18. Also, since writing a neural network from scratch is harder than previous competitions, this competition will be worth double in our rankings.
Support Vector Machine Competition Instructions
10/26/18 - Your job is to write the code to create an SVM, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify survival of passengers on the Titanic. The data we are using are from actual passengers on the ship. Your SVM is supposed to classify whether a passenger survived [RIP (0) or Survived (1)], based on 11 different metrics (features).
The purpose of this contest is not to test your ability to write an SVM. Instead, we are using this opportunity to test two abilities:
- Your ability to learn how to use Scikit-Learn
- Your ability to work with real-world data
The second is far more important (and difficult) than the first. With that in mind, the training data now looks like this:
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
etc. (636 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
etc. (255 lines)
Your end goal is to create a file which looks like:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (255 lines)
For every set of features on line N of the testing data, you should have a line "N, predicted_class" in your submission file.
There are missing data points. Some features are not useful. The difficult part of this contest is formatting the data given, determining which features are useful, which ones should be trained on, and how to deal with the missing data.
You are allowed to (and should) use Scikit-Learn to create your SVM. Scikit-Learn has detailed instructions on how to write an SVM using the library here. This is the trivial portion of the competition.
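One common way to handle the missing values and the SVM in one place is a Scikit-Learn pipeline. This is a hedged sketch, not the required approach: the tiny arrays stand in for whichever columns you decide to keep, and the imputation strategy and kernel are choices you should experiment with.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# placeholder data: a few rows of numeric features, with np.nan marking gaps
X_train = np.array([[3, 22, 7.25], [1, 38, np.nan], [3, 26, 7.9], [1, np.nan, 53.1]])
y_train = np.array([0, 1, 0, 1])          # placeholder survival labels
X_test = np.array([[2, 30, 13.0]])

model = make_pipeline(
    SimpleImputer(strategy="median"),     # fill missing entries with column medians
    StandardScaler(),                     # SVMs are sensitive to feature scale
    SVC(kernel="rbf"),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)       # one 0/1 prediction per test row
```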
We highly recommend you open both the training and testing csv files in a program like Excel, which will help you modify columns of data and perform calculations quickly.
The Data I/O code from the decision tree competition may still be useful.
As always, only use Python for this competition. The only external libraries allowed for this competition are Scikit-Learn and NumPy. This competition will end at 11:59 PM next Tuesday, October 30th. If it's your first time competing, check out how to participate under the Decision Trees Competition link on this page.
Random Forests Competition Instructions
10/15/18 - Your job is to write the code to create a random forest, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify survival of passengers on the Titanic. The data we are using are from actual passengers on the ship. Your random forest is supposed to classify whether a passenger survived [RIP (0) or Survived (1)], based on 7 different metrics (features). The training data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
etc. (500 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
etc. (214 lines)
Your end goal is to create a file which looks like:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (214 lines)
For every set of features on line N of the testing data, you should have a line "N, predicted_class" in your submission file.
Standard competition instructions and rules apply from the decision trees competition.
In case you care, the features correspond to:
pclass - Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
sex - Sex: 0 = Male, 1 = Female
Age - Age in years
sibsp - # of siblings / spouses aboard the Titanic
parch - # of parents / children aboard the Titanic
fare - passenger fare
embarked - Port of Embarkation: 0 = Southampton, 1 = Cherbourg, 2 = Queenstown
survival - Survival: 0 = No, 1 = Yes
Shell code is available here. The Data I/O code from the decision tree lecture will still be useful, but will need to be adapted for this data.
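If you want named access to the columns listed above, a small parsing helper can map each CSV row onto the feature names. This is just an I/O sketch under the assumption that the training file is a plain CSV with the seven features followed by the survival label, in the order given.

```python
# feature names in the column order described above
feature_names = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]

def parse_train_row(row):
    """Split one training row into a feature dict and a survival label."""
    values = [float(v) for v in row.strip().split(",")]
    features = dict(zip(feature_names, values[:7]))
    return features, int(values[7])

# example row: 3rd class, male, age 22, 1 sibling, fare 7.25, Southampton, died
features, survived = parse_train_row("3,0,22,1,0,7.25,0,0")
```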
Decision Trees Competition
The Data
9/27/18 - Welcome to the first contest of the year! Your job is to write the code to create a decision tree, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify breast cancer data. The data we are using is actual data from breast cancer patients. Your decision tree is supposed to classify the type of breast cancer they have (benign (0) or malignant (1)), based on 9 different metrics (features). The training data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
etc. (533 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
etc. (150 lines)
Your end goal is to create a file which looks like:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (150 lines)
For every set of features on line N of the testing data, you should have a line "N, predicted_class" in your submission file.
Basic shell code is available here. The Data I/O code from the decision tree lecture will still be useful, but will need to be adapted for this data.
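A core piece of any decision tree is a way to score how mixed a group of labels is, so that candidate splits can be compared. Here is a sketch of one common choice, Gini impurity; using it (versus, say, entropy) is a design decision left to you.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / len(labels)
    return 1.0 - np.sum(proportions ** 2)

print(gini([0, 0, 1, 1]))  # 0.5 -- an even two-class split is maximally impure
print(gini([0, 0, 0, 0]))  # 0.0 -- a pure node
```

A split is good when the weighted impurity of the two child groups is lower than the impurity of the parent.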
How to participate
Our contests will be held on Kaggle, using Kaggle InClass. This allows us to upload data and competition instructions, as well as impose submission deadlines. It also ranks submissions automatically! To participate:
- Create a Kaggle account by clicking "sign up" in the top right.
- Click on this link (the competition link).
- Download the training and testing data.
- Download the I/O Code.
- Write your algorithm and train it on the training data.
- Then, test it on the testing data, creating a submission file with the predicted ground truth in the format shown in the sample submission file.
- Upload your submission file and see your results!
- Tweak your code, repeating steps 5-7 to improve your accuracy and move up the leaderboard.
Some More Guidelines:
- We recommend that you use Python. Everything we do is in Python this year, and it's the main language for machine learning in general. Don't use any external libraries, other than numpy and pandas. Both of these packages are very useful for data processing and manipulation, so it's a good idea to learn them if you can. If you're unfamiliar with Python, for this competition, you can use whichever language you want to, but in the future it will be much more difficult. If it's your first time using Python, check out these links for a jumpstart: Google's Python course, LearnPython.org, and The Python Guru.
- The competition ends at 11:59:00 PM, October 9th. For this competition, we're giving you an extra week beyond what we normally do, so be sure to try it out!
- The leaderboard on Kaggle that you can see is the Public Leaderboard, which is your accuracy for 50% of the testing data. Your final rankings will be based on the Private leaderboard, which is based on the other 50% of the testing data and will become public as soon as the competition ends. This is to prevent you from just writing a decision tree that overfits the testing data, which defeats the purpose.
Advanced Series Competitions
This year, in collaboration with professors at Johns Hopkins, we will hold a series of long-term competitions for club members to participate in. Based on the rankings for these competitions, we will select interested members for machine learning internships with these professors during the coming summer. Because this competition ends very early in the year, when we select the final members for the internship we will take into account that new members have not had sufficient experience to perform well. New members are welcome to participate, but this competition is primarily for advanced, returning members. Please see the individual competition instructions below.
RSNA Pneumonia Competition Instructions
Competition Introduction
Our advanced members can compete in the RSNA Pneumonia competition. Detailed rules and instructions can be found on the Kaggle competition page. In short, the goal is to predict whether a patient has pneumonia based on images of the patient's chest radiograph and, if the image indicates pneumonia, to predict where the lung opacities (the visual signs of pneumonia) are on the radiograph.
There are two stages to the competition.
Stage 1: Until 10/5
Members submit models individually
Stage 2: Until 10/15
Top 4 members merge into 1 team and submit a collective model to the competition.
Members may start creating/submitting models now. More detailed instructions to come. Competition specifics and strategies will be covered in the meeting on 9/26, but it is recommended that you get started beforehand.
Competition Link