Meet the winners of TheMathCompany’s Data Scientist Hiring Hackathon
The MATHCO.thon ended on July 19, 2021. The Data Scientist Hiring Hackathon was followed by an SQL-based quiz for all shortlisted participants. Based on this double-hurdle format, TheMathCompany announced the three winners of the cash prizes. Here, we take a look at their personal journeys, their approaches to the solution, and their experience on MachineHack.
First Prize – Sai Deepak
Sai Deepak holds an undergraduate degree in Production Engineering. His most recent project used IoT data to study air pollutants and automobile exhaust.
After graduating, he took a data science course at Great Learning and then joined a telecommunications company as a data analyst. Sai Deepak aspires to move from a data analyst role to a data scientist role.
Sai Deepak performed EDA on all of his features and combined high-cardinality columns into fewer columns. He built baseline models, ran several experiments on the output variable, and settled on a logarithmic transformation, which he also applied to two of the predictor variables: Mileage and Levy. Levy was particularly tricky due to missing data; its imputation was done by trial and error. He also incorporated the “ID” column as a feature. He discovered that outliers had an impact on the model’s output, so, to achieve robustness, he applied a quantile transformation to the data, which improved the model’s results. He used categorical encoding and one-hot encoding techniques to convert the categorical data.
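The preprocessing steps above can be sketched as follows. This is a minimal illustration, not his actual code: the toy DataFrame and column values are made up, and the median imputation for Levy stands in for his trial-and-error approach.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Toy data standing in for the car dataset; values are illustrative.
df = pd.DataFrame({
    "Mileage": [12000.0, 450000.0, 30000.0, 8000.0],
    "Levy":    [500.0, np.nan, 1200.0, 800.0],
    "Fuel":    ["Petrol", "Diesel", "Petrol", "Hybrid"],
    "Price":   [9000, 1500, 15000, 22000],
})

# Log-transform skewed numeric predictors (log1p handles zeros safely).
df["Mileage_log"] = np.log1p(df["Mileage"])

# Impute the missing Levy values; the median is one simple choice.
df["Levy"] = df["Levy"].fillna(df["Levy"].median())

# Quantile transformation maps features onto a uniform distribution,
# which blunts the influence of outliers.
qt = QuantileTransformer(n_quantiles=4, output_distribution="uniform")
df[["Mileage_q", "Levy_q"]] = qt.fit_transform(df[["Mileage", "Levy"]])

# One-hot encode the categorical columns.
df = pd.get_dummies(df, columns=["Fuel"])
```

The same log transformation is typically applied to the target (`Price`) as well, with predictions mapped back via `np.expm1`.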
Sai Deepak used math functions like the square root, arithmetic operators, and KMeans clustering to create new features. He tried out various models such as tree-based regressors, SVC, and stacking. The LightGBM, XGBoost, Random Forest, and CatBoost regressors contributed significantly to the prediction. He used a blending technique to combine the results of all the models.
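A KMeans cluster label as an engineered feature, plus a simple blend of model predictions, might look like this. The sketch below uses synthetic data and two scikit-learn regressors as stand-ins for the gradient-boosting libraries he actually used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data in place of the car-price dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Append the KMeans cluster id as an extra engineered feature.
km = KMeans(n_clusters=4, n_init=10, random_state=0)
X = np.column_stack([X, km.fit_predict(X)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Blending: average the predictions of independently trained models.
models = [RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0)]
preds = [m.fit(X_tr, y_tr).predict(X_te) for m in models]
blended = np.mean(preds, axis=0)
```

A simple average is the most basic blend; weighted averages tuned on a validation set are a common refinement.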
This is Sai Deepak’s second hackathon at MachineHack. He said MachineHack is the only platform where data science problems are available at the Basic, Intermediate, and Advanced levels. “MachineHack is the best platform for learning data science,” he said. The bootcamp videos helped him prepare for the hackathon.
Discover his solution HERE.
Second Prize – Akash Gupta
Akash Gupta graduated with a B.Tech in Computer Science from AKGEC: Ajay Kumar Garg Engineering College, Ghaziabad in 2020. In his second semester, he enrolled in Andrew Ng’s course on Machine Learning. He has also enrolled in courses such as IBM Data Science Professional course, Mathematics for Machine Learning by Imperial College London, Deep Learning Specialization, etc. on Coursera.
He began to actively participate in ML hackathons from his fourth semester. His data science accomplishments include Grandmaster at MachineHack (AIM), Kaggle Expert, 7 gold medals at Dockship, and top-3 positions in over 40 ML hackathons.
To start, he performed EDA on the data and noted the distribution of the price in the training set. He looked at the top 10 car prices in the given data set and detected huge price differences across the quartiles. He removed some training rows that weren’t helpful to the model and trimmed the outliers to make the model robust.
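Quartile-based outlier trimming of the kind described can be sketched with the standard IQR rule. The price values below are made up for illustration; his actual thresholds are not stated in the article.

```python
import pandas as pd

# Illustrative price column with one extreme value.
prices = pd.Series([4000, 5200, 6100, 7000, 8000, 9500, 11000, 250000])

# IQR rule: drop rows far outside the interquartile range.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
mask = prices.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = prices[mask]
```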
Feature engineering included aggregating various features correlated with mileage, along with statistical features such as the quartiles, mean, and norm.
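Group-level statistical features of that kind are typically built with a groupby-transform. The column names below are illustrative, not taken from his solution.

```python
import pandas as pd

df = pd.DataFrame({
    "Manufacturer": ["Toyota", "Toyota", "BMW", "BMW"],
    "Mileage":      [30000, 90000, 45000, 120000],
})

# Broadcast group-level statistics (mean, quartiles) back as new features.
grp = df.groupby("Manufacturer")["Mileage"]
df["Mileage_grp_mean"] = grp.transform("mean")
df["Mileage_grp_q75"] = grp.transform(lambda s: s.quantile(0.75))
```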
For modeling, he first converted the price with “np.log1p”, followed by the train-test split. The LightGBM algorithm gave the best performance. He also experimented with the boosting types (GBDT, DART) and different random states. He obtained the final prediction using 10-fold cross-validation.
“MachineHack is one of the top organizers of data science hackathons and knowledge portals in India,” Akash said. He has been a MachineHacker since the platform’s inception, and he likes the new user interface. Features like Practice and Boot Camp helped him prepare for the MATHCO.thon.
Discover his solution HERE.
Third Prize – N Sai Sandeep
Sai Sandeep turned to data science after watching a video from Google’s 2018 I/O conference. He then enrolled in an Applied AI course and completed an internship at AppliedAI, where he built a chatbot using a sequence-to-sequence model in the PyTorch framework. He went on to work full-time at the same organization, guiding students through their projects and developing course content. Currently, he works as a Data Scientist at Sutherland Global.
Sai Sandeep spent 95% of his time analyzing and preparing the data for the model and the rest building models. He said the EDA and data preprocessing steps are the important ones, and that AutoML tools for building models work exceptionally well in hackathons. Here’s his step-by-step approach:
1) Univariate and multivariate analysis: EDA on all independent features and the target column. Columns like mileage, tax, engine volume, etc. had missing values or additional strings attached to them (formatted as categorical instead of continuous). Multivariate analysis was performed using Plotly to understand the relationship of the independent variables with the target variable.
He noticed strong manufacturer-level correlations for a few variables, such as between missing values and price, and between median car prices and year of production.
2) Data preprocessing and imputation: based on the insights above, columns such as Engine Volume, Levy, Doors, and Mileage were transformed. Missing values were imputed based on other correlated features (grouping by a column and taking the median).
3) Target variable and metric analysis: the target column “Price” was skewed and contained many extreme values, so it was given a logarithmic transformation.
4) Feature engineering and feature importance: combinations such as with and without the raw features (features before preprocessing), the ID column, etc. were tested to find the feature configuration that gave optimal results. Categorical features were transformed using one-hot encoding and label encoding, while numerical features were scaled. An Extra-Trees regressor was trained to drop unwanted features.
5) Model tuning using AutoGluon: AutoGluon gave better results compared to other packages. However, applying AutoGluon directly did not help, as the stacked models were overfitting. After going through the source code and logs, he discovered a data leak involving the Random Forest model, so RF was excluded from the model list and AutoGluon was trained to produce a three-level stacked model. The final model was a level-3 weighted ensemble with base models including the LightGBM, Extra-Trees, and CatBoost algorithms.
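The pipeline above (group-wise imputation, a log-transformed target, Extra-Trees feature selection, and a stacked ensemble with Random Forest left out) can be sketched with plain scikit-learn. This is an illustrative stand-in, not his AutoGluon configuration: the synthetic data, column names, and the choice of base models are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Manufacturer": rng.choice(["Toyota", "BMW", "Kia"], n),
    "Engine_Volume": rng.choice([1.6, 2.0, np.nan], n),
    "Mileage": rng.integers(5_000, 200_000, n).astype(float),
})
df["Price"] = 30_000 - 0.1 * df["Mileage"] + rng.normal(0, 1_000, n)

# Step 2: impute missing values with the per-group median of a correlated key.
df["Engine_Volume"] = (df.groupby("Manufacturer")["Engine_Volume"]
                         .transform(lambda s: s.fillna(s.median())))

# Step 3: log-transform the skewed target.
y = np.log1p(df["Price"])

# Step 4: one-hot encode categoricals, then rank features with Extra-Trees.
X = pd.get_dummies(df.drop(columns="Price"), columns=["Manufacturer"])
et = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
keep = X.columns[np.argsort(et.feature_importances_)[::-1][:4]]

# Step 5: a stacked ensemble, with Random Forest deliberately excluded.
stack = StackingRegressor(
    estimators=[("gbdt", GradientBoostingRegressor(random_state=0)),
                ("et", ExtraTreesRegressor(n_estimators=100, random_state=0))],
    final_estimator=Ridge(),
)
stack.fit(X[keep], y)
preds = np.expm1(stack.predict(X[keep]))  # back to the price scale
```

In AutoGluon itself, the equivalent levers are the stacking depth and the excluded-model list passed to the tabular predictor’s fit call, rather than a hand-built stack like this one.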
“MachineHack has been an incredible platform for machine learning enthusiasts. I registered for the first time to participate in the Great India Hiring Hackathon. Although I didn’t perform well at the time, I applied my learning in the next Car Price Predictions challenge and made it into the top 3. In addition, the discussion board was very useful and the user interface was fast and clutter-free,” he said.
Discover his solution HERE.