ODSC West 2020: Notes on ModelOps and MLOps talks

Clare Huang
4 min read · Oct 30, 2020

--

There is always too much to learn at a conference and too little time to write a summary and reflection on the talks I listened to. I shall start doing it bit by bit from today.

I am attending the Open Data Science Conference (ODSC) West this week (Oct 27–30, 2020) and would like to share what I learned from the following two talks:

(1) Are We Ready for the Era of Analytics Heterogeneity? Maybe… but the Data Says No by Marinela Profi, and

(2) MLOps in DL Model Development by Anna Petrovicheva (CTO of OpenCV.ai).

Talk (1) discusses one of the most needed skills in the current field of analytics, namely ModelOps, and how developed models can become assets of a company. The speaker also suggests some ways for newbies to pick up this skill.

Talk (2) focuses on a specific aspect of the model life cycle orchestration mentioned in (1), namely, MLOps (and how it differs from DevOps). Doing this properly would enable reproducible machine learning.

Talk #1: Are We Ready for the Era of Analytics Heterogeneity? Maybe… but the Data Says No

This talk is given by Marinela Profi, who is a Global Strategist for AI, with a focus on ModelOps solutions.

Marinela starts by discussing what is needed in the field of analytics. Stakeholders expect analytics to make predictions, make decisions, and also take action to address a business question.

In order to do the work, data scientists write code to implement the procedures. Nowadays, analytics is not limited to a single tool: one can build deep learning models not only with TensorFlow, but also with PyTorch, Caffe, BigDL, etc. This phenomenon is what Marinela refers to as analytics heterogeneity.

It has gradually become a problem that many machine learning models are constructed but are not reusable (e.g. due to differences in the coding environment). According to Gartner’s report, 60% of models are not put into production. In order to turn machine learning (ML) models into corporate assets, ModelOps is important to ensure the models are reusable and reproducible.

ModelOps includes the following components:

1. Post-development and deployment: how the models are deployed into an execution environment (e.g. cloud)

2. Monitoring and management: how the execution of the analytics processes is monitored

3. Governance & quality control: how the models comply with ethics, regulations and privacy

4. Integration and reusability: how can we compare and reuse different models (MLOps, which will be focused on in Talk #2)

5. Composite integration: how the models fit into workflows with other procedures not related to the models themselves (e.g. optimization)

6. Collaboration between talents: how practitioners communicate about their findings and models

Marinela suggests 3 keys that would help data scientists stand out in the field:

1. Learn about ModelOps

ModelOps is a valuable skill in demand. Beginners shall aim at learning how to master model lifecycle orchestration.

2. Look at Analysts Reports

Reports from research institutes such as Gartner, Forrester, and IDC provide valuable insights on what is in demand in the field.

3. Never stop growing

Keep acquiring new skills and keep up with developments in the field. Keep learning and sharing as much as possible.

Talk #2: MLOps in DL Model Development

This talk is given by Anna Petrovicheva, who is the CTO of OpenCV.ai.

As data scientists, we often come across the issue that we cannot reproduce the exact results from certain papers. In order to train and deploy models in a reproducible manner, one shall adopt the practices of DevOps: continuous integration (CI) and continuous delivery (CD).

However, data scientists cannot simply adopt the procedures of classical DevOps to solve the problem. Why is MLOps a lot trickier?

The reason is that, on top of (1) codebase maintenance (which DevOps handles), for MLOps one also has to manage (2) the dataset (~3 TB) used to train, validate and test models, and (3) the ML model weights (~1 GB) produced in model training. For each model, there is a specific version of the code and a specific version of the training data that produced it. MLOps shall keep track of all of these.

To keep track of the dataset, one shall practice data versioning and store the dataset version along with the model. Available tools (for versioning locally stored data) include Git-LFS (Git Large File Storage) and DVC.
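
The idea of storing the dataset version together with the model can be illustrated without any particular tool. Below is a minimal sketch (my own, not from the talk; the directory and file names are hypothetical) that fingerprints the training data and saves the fingerprint next to the model artifact; for datasets in the terabyte range one would use a dedicated tool such as DVC instead.

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash the relative path and contents of every file under data_dir."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(data_dir)).encode())
            digest.update(path.read_bytes())  # fine for a toy dataset, not ~3 TB
    return digest.hexdigest()

# Hypothetical paths: record the dataset version next to the trained model
# so that the model can always be traced back to the data that produced it.
Path("models").mkdir(exist_ok=True)
metadata = {
    "dataset_dir": "data/train",
    "dataset_sha256": dataset_fingerprint("data/train"),
}
Path("models/model_metadata.json").write_text(json.dumps(metadata, indent=2))
```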

When deciding which model to use, one shall do a fair comparison by varying only one aspect at a time, i.e., one shall compare among all of these (see the sketch after the two lists below)

  • Model A
  • Model A + certain features
  • Model A + certain training scheme
  • Model A + certain features + certain training scheme

instead of

  • Model A
  • Model B + certain features
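
One way to make such a controlled comparison explicit is to enumerate experiment configurations that each differ from a shared baseline in a known way. This is a hypothetical sketch, not from the talk:

```python
# Every run shares the Model A baseline and toggles factors explicitly,
# so any change in accuracy can be attributed to a known difference.
baseline = {"model": "A", "extra_features": False, "training_scheme": "default"}

experiments = [
    baseline,
    {**baseline, "extra_features": True},
    {**baseline, "training_scheme": "improved"},
    {**baseline, "extra_features": True, "training_scheme": "improved"},
]

for config in experiments:
    print(config)  # in practice: launch one training run per configuration
```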

To ensure reproducibility, each experiment shall correspond to a specific git commit. Each commit message shall include the specifications of the model configuration (e.g. input size, number of iterations) and the dataset version. The model can be named after the commit hash so that its specification can be easily traced.
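
A minimal sketch of this convention, assuming a git repository and a Python training script (the file names and config fields below are my own illustration, not from the talk):

```python
import json
import subprocess
from pathlib import Path

def current_commit_hash() -> str:
    """Short hash of the commit the experiment was run from."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

# Hypothetical configuration; in the talk, these details also go into the
# commit message so the experiment can be traced from either direction.
config = {"input_size": 224, "iterations": 90000, "dataset_version": "v3"}

commit = current_commit_hash()
out_dir = Path("models")
out_dir.mkdir(exist_ok=True)
(out_dir / f"config_{commit}.json").write_text(json.dumps(config, indent=2))
# e.g. torch.save(model.state_dict(), out_dir / f"model_{commit}.pt") after training
```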

Data scientists shall also adopt good engineering practices such as integration tests for the training process, as well as accuracy tests for validation and checks on inference speed. To ensure the reproducibility of the tests, one shall fix all random seeds involved in the process.
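
A common seed-fixing snippet looks roughly like the following (assuming NumPy and PyTorch are in use; other frameworks have their own seeds, so treat this as a sketch rather than a complete recipe):

```python
import os
import random

import numpy as np
import torch

def fix_seeds(seed: int = 42) -> None:
    """Pin the random number generators that typically affect a training run."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for determinism in cuDNN convolution algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_seeds(42)
```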

The execution environment can take the form of a virtual environment (simpler) or a Docker container (more complex).

Some feedback on Talk #2

The data versioning tools suggested by Anna only work for local files. In large companies, data are stored in warehouses instead, and when the data is ingested/updated is often not decided by the data science team. In that case, how shall data versioning be done?

I believe this problem can be alleviated by recording the timestamp at which data is updated in the warehouse. For example, updated data for the same individual can be stored in a partition with a newer timestamp. Also, the sampling of datasets (into train/validation/test) shall be done in a deterministic manner. All of these can be stored as part of the model configuration in the git commit message as suggested above.
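
For the deterministic sampling, one option (my own sketch, not from the talk) is to hash a stable record identifier, so that the same individual always lands in the same split no matter when the warehouse partition was loaded:

```python
import hashlib

def assign_split(record_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a stable record ID to a train/validation/test split."""
    bucket = int(hashlib.sha256(record_id.encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"

# The same (hypothetical) ID always gets the same answer, run after run.
print(assign_split("customer_00123"))
```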

--

Clare Huang

Data Scientist | Climate Scientist | Musician | Writer | A Hongkonger living in the US | Hoping to leave behind a little trace of what once existed. | Tech Blog: csyhuang.github.io.