MLOps Part 2 - Feature Engineering and Training

Previously, we set up the main skeleton of our training pipeline using MLflow Projects and implemented a download step component. Now let’s continue building the training pipeline by developing the feature engineering and training parts. For the sake of simplicity, we are going to implement only bare-minimum feature engineering for our model, because we want to focus our work on MLOps. It is entirely possible to develop a more rigorous feature engineering step that results in much better model performance. ...

September 14, 2021 · Junda

MLOps Part 1 - Intro to MLflow Project and Setting-up Our First Component

MLflow is a very nice tool for handling our MLOps needs. It covers several important MLOps features, namely a tracking server, a model registry, and source code packaging. Here we are going to focus on MLflow Projects, the source code packaging feature that helps us develop a reproducible machine learning pipeline. MLflow Projects lets us run source code in a consistent way by encapsulating the runtime environment together with the source code, so that we can develop our code on macOS and have it run on Linux with the same reproducible result, if we need to. ...

August 2, 2021 · Junda

Intuition to Recommender System for Implicit Feedback Dataset

I have been tinkering with recommender systems at work for a few months now in order to gain a deeper understanding of how the model works, how the training process learns from observation data, and how to make recommendations from a learned model. This post is an overview of what I’ve learnt and is the first of several parts. It relies heavily on the paper by Yifan Hu, Yehuda Koren, and Chris Volinsky titled “Collaborative Filtering for Implicit Feedback Datasets”. The theory laid out in the paper has been incorporated into several open source tools for building recommender systems, perhaps most prominently Apache Spark’s ALS package. ...

May 17, 2021 · Junda

Setting Up Unit Test for Your Apache Spark Job Using Scalatest

By nature, machine learning models that run in production need to deal with… well… data, presumably lots of it. Sometimes, among all the data our models need to deal with, there will be bad records. In that case, machine learning models tend to either stop processing data immediately or continue processing and produce smelly results. Both outcomes are bad. Unit testing ML models gives us as developers extra confidence to put models in production, by giving the isolated modules in an ML pipeline a way to face various edge cases and handle them accordingly. ...

November 28, 2020 · Junda

Understanding the Data - Exploring CO2 Emissions, Internet Usage, GDP per Capita, and Oil Consumption between Countries

One often overlooked aspect of a data analysis project is keeping track of the data that we are working on. Our data will evolve during the course of the project: sometimes new variables are introduced, sometimes we redefine an old variable, and sometimes we drop a variable that is deemed no longer relevant. Whatever the reason, it makes good sense to keep track of the changes in our data. This is where a code book comes in handy. A code book is simply a document where we put information about our data. At the very least, we want to keep track of our variable names, their descriptions, and their units of measurement. ...

June 7, 2020 · Junda

Starting Analytics Project - What is the Connection Between CO2 Emissions and Internet Usage Across Countries

Every analytics project MUST start from a question. I have always been curious about the explosive growth of the Internet, especially as someone who built his career on enabling wider adoption of the Internet in fields that traditionally did not rely on it. I want to know whether Internet usage has a bad effect on CO2 emissions, one of the variables most strongly linked to global warming. The Internet, and the digital age at large, has mainly been viewed as a net positive. It powers education and the economy. It supports our health and well-being. It also connects individuals to their communities and their loved ones. However, as I got older, I kept finding myself contemplating whether the overwhelming positives overshadow a potentially serious downside. For me, environmental impact is one area that I believe will grow in urgency as we keep witnessing the effects of a changing climate on our everyday lives. For this analytics project, I want to know not only whether every country produces more CO2 emissions with higher Internet usage, but also whether there are anomalies out there, where a country has managed to power its Internet growth from sustainable sources. ...

June 6, 2020 · Junda