Home » Machine Learning Primer for Clinicians » Recent Articles:

A Machine Learning Primer for Clinicians–Part 5

Alexander Scarlat, MD is a physician and data scientist, board-certified in anesthesiology with a degree in computer sciences. He has a keen interest in machine learning applications in healthcare. He welcomes feedback on this series at drscarlat@gmail.com.


Previous articles:

  1. An Introduction to Machine Learning
  2. Supervised Learning
  3. Unsupervised Learning
  4. How to Properly Feed Data to a ML Model

How Does a Machine Actually Learn?

Most ML models will have the following components:

  • Weights – units that contain the model parameters and are modified with each new learning experience. a.k.a train epoch.
  • Metric – a measure (accuracy, mean square error) of the distance between the model prediction and the true value of that epoch.
  • Loss or Cost Function – used to update the weights with each train epoch according to the calculated metric.
  • Optimizer – algorithm overseeing the loss function so the model will find the global minimum in a reasonable time frame, basically preventing the model from wondering all over the loss function hyperspace.

The learning process or model training is done in epochs. With each epoch, the model is exposed to a batch of samples. 

Each epoch has two steps:

  1. Forward propagation of the input. The input features undergo math calculations with all the model weights and the model predicts an output.
  2. Back propagation of the errors. The model prediction is compared to the real output. This metric is used by the loss function and its master – the optimizer algorithm – to update all the weights according to the last epoch performance.


Consider a model, in this case a neural network (NN) that tries to predict LOS using two features: age and BMI. We have a table with 100 samples / instances / rows and three columns: age, BMI, and LOS.

  • Task: using age and BMI, predict the LOS.
  • Input: age and BMI,
  • Output: LOS.
  • Performance: mean square error (MSE),  the squared difference between predicted and true value of LOS.


Forward Propagation of the Input

Input is being fed one batch at a time. In our example, let’s assume the batch size is equal to one sample (instance).

In the above case, one instance enters the model at the two left red dots: one for age and one for BMI (I stands for Input). The model weights are initialized as very small, random numbers near zero.

All the input features of one instance interact simultaneously through a complex mathematical transformation with all the model parameters (weights are denoted with H from hidden). These interactions are then summarized as the model LOS prediction at the rightmost green dot – output.

Note the numbers on the diagram above and the color of the lines as weights are being modified according to the last train epoch performance. Blue = positive feedback vs. black = negative feedback.

The predicted value of LOS will be far off initially as the weights have been randomly initialized, but the model improves iteratively as it is exposed to more experiences. The difference between the predicted and true value is calculated as the model metric.

Back Propagation of the Error

The model optimizer updates all the weights simultaneously, according to the last metric and loss function results. The weights are slightly modified with each sample the model sees – the cost function is providing the necessary feedback from the metric that measures the distance between the recent prediction vs. the true LOS value. The optimizer basically searches for the global minimum of the loss function

This process is now repeated with the next instance (sample or batch) and so on. The model learns with each and every experience until it is trained on the whole dataset.

The cost function below shows how the model approaches the minimum with each iteration / epoch.


From “Coding with Data” by Tamas Szilagyi.

Once the training has ended, the model has a set of weights that have been exposed to 100 samples of age and BMI. These weights have been iteratively modified during the forward propagation of the input and the back propagation of errors. Now, when faced with a new, never before seen instance of age and BMI, the model can predict the LOS based on previous experiences.

Unsupervised Learning

Just because there is no output (labels) in unsupervised learning doesn’t mean the model is not constrained by a loss / cost function. In the clustering algorithm from the article on Unsupervised Learning , for example, its cost function was the distance between each point and its cluster centroid, and the model optimizer tried to minimize this function with each iteration.

Loss / Cost Function vs. Features

We can chart the loss function (Z) vs. the input features: age (X) and BMI (Y) and follow the model as it performs a gradient descent on a nice, convex cost function that has only one (global) minimum:


Sometimes two features can present a more complex landscape of a loss function, one with many local minima, saddles and the one, much sought after, global minimum:


From “Intro to optimization in deep learning” – PaperSpace Blog.

Here is a comparison of several ML model optimizers, competing to escape a saddle point on a loss function, in order to get to the optimizers’ nirvana – the global minimum. Some optimizers are using a technique called momentum, which simulates a ball accumulating physical momentum as it goes down hill.  Getting stuck on a saddle in hyperspace is not a good thing for a model / optimizer, as the poor red Stochastic Gradient Descent (SGD) optimizer may be able to tell, if it will ever escape.


From “Behavior of adaptive learning rate algorithms at a saddle point – NVIDIA blog.

Just to give you an idea of how complex a loss / cost function landscape can be, below is the loss function of VGG-56 – a known image analysis model trained on a set of several million images. This specific model loss function has as X – Y axes the two main principal components of all the features of an image. Z axis is the cost function.

The interesting landscape below is where VGG-56 has to navigate and find the global minimum – not just any minimum, but the lowest of them. Not a trivial task.


From “Intro to optimization in deep learning” – PaperSpace Blog.

Compressing many dimensions an image usually has, into only two (X-Y) – while minimizing the loss of variance – is usually a job performed by principal component analysis (PCA),  a type of insupervised ML algorithm. That’s another aspect of ML – models that can help us visualize stuff which was unimaginable only a couple of years ago, such as the 3D map of the cost function of an image analysis algorithm.

Next Article

Artificial Neural Networks Exposed

A Machine Learning Primer for Clinicians–Part 4

Alexander Scarlat, MD is a physician and data scientist, board-certified in anesthesiology with a degree in computer sciences. He has a keen interest in machine learning applications in healthcare. He welcomes feedback on this series at drscarlat@gmail.com.


Previous articles:

  1. An Introduction to Machine Learning
  2. Supervised Learning
  3. Unsupervised Learning

How to Properly Feed Data to a ML Model

While in the previous articles I’ve tried to give you an idea about what AI / ML models can do for us, in this article, I’ll sketch what we must do for the machines before asking them to perform magic. Specifically, the data preparation before it can be fed into a ML model. 

For the moment, assume the raw data is arranged in a table with samples as rows and features as columns. These raw features / columns may contain free text; categorical text; discrete data such as ethnicity; integers like heart rate, floating point numbers like 12.58 as well as ICD, DRG, CPT codes; images; voice recordings; videos; waveforms, etc.

What are the dietary restrictions of an artificial intelligence agent? ML models love their diet to consist of only floating point numbers, preferably small values, centered and scaled /normalized around their means +/- their standard deviations.

No Relational Data

If we have a relational database management system (RDBMS),  we must first flatten the one-to-many relationships and summarize them, so one sample or instance fed into the model is truly a good representative summary of that instance. For example, one patient may have many hemoglobin lab results, so we need to decide what to feed the ML model — the minimum Hb, maximum, Hb averaged daily, only abnormal Hb results, number of abnormal results per day? 

No Missing Values

There can be no missing values, as it is similar to swallowing air while eating. 0 and n/a are not considered missing values. Null is definitely a missing value.

The most common methods of imputing missing values are:

  • Numbers – the mean, median, 0, etc.
  • Categorical data – the most frequent value, n/a or 0

No Text

We all know by now that the genetic code is made of raw text with only four letters (A,C,T,G). Before you run to feed your ML model some raw DNA data and ask it questions about the meaning of life, remember that one cannot feed a ML model raw text. Not unless you want to see an AI entity burp and barf.

There are various methods to transform words or characters into numbers. All of them start with a process of tokenization, in which a larger unit of language is broken into smaller tokens. Usually it suffices to break a document into words and stop there:

  • Document into sentences.
  • Sentence into words.
  • Sentence into n-grams, word structures that try to maintain the same semantic meaning (three-word n-grams will assume that chronic atrial fibrillation, atrial chronic fibrillation, fibrillation atrial chronic are all the same concept).
  • Words into characters.

Once the text is tokenized, there are two main approaches of text-to-numbers transformations so text will become more palatable to the ML model: 


One Hot Encode (right side of the above figure)

Using a dictionary of the 20,000 most commonly used words in the English language, we create a large table with 20,000 columns. Each word becomes a row of 20,000 columns. The word “cat” in the above figure is encoded as: 0,1,0,0,…. 20,000 columns, all 0’s except one column with 1. One Hot Encoder – as only one column gets the 1, all the others get 0.

This a widely used, simple transformation which has several limitations: 

  • The table created will be mostly sparse, as most of the values will be 0 across a row. Sparse tables with high dimensionality (20,000) have their own issues, which may cause a severe indigestion to a ML model, named the Curse of Dimensionality (see below).
  • In addition, one cannot represent the order of the words in a sentence with a One Hot Encoder.
  • In many cases, such as sentiment analysis of a document, it seems the order of the words doesn’t really matter.

Words like “superb,” “perfectly” vs. “awful,” “horrible” pretty much give away the document sentiment, disregarding where exactly in the document they actually appear.


From “Does sentiment analysis work? A tidy analysis of Yelp reviews” by David Robinson.

On the other hand, one can think about a medical document in which a term is negated, such as “no signs of meningitis.” In a model where the order of the words is not important, one can foresee a problem with the algorithm not truly understanding the meaning of the negation at the beginning of the sentence. 

The semantic relationship between the words mother-father, king-queen, France-Paris, and Starbucks-coffee will be missed by such an encoding process.

Plurals such as child-children will be missed by the One Hot Encoder and will be considered as unrelated terms.

Word Embedding / Vectorization

A different approach is to encode words into multi-dimensional arrays of floating point numbers (tensors) that are either learned on the fly for a specific job or using an existing pre-trained model such as word2vec, which is offered by Google and trained mostly on Google news. 

Basically a ML model will try to figure the best word vectors — as related to a specific context — and then encode the data to tensors (numbers) in many dimensions so another model may use it down the pipeline.

This approach does not use a fixed dictionary with the top 20,000 most-used words in the English language. It will learn the vectors from the specific context of the documents being fed and create its own multi-dimensional tensors “dictionary.” 

An Argentinian start-up generates legal papers without lawyers and suggests a ruling, which in 33 out of 33 cases has been accepted by a human judge.

Word vectorization is context sensitive. A great set of vectorized legal words (like the Argentinian start-up may have used) will fail when presented with medical terms and vice versa.

In the figure above, I’ve used many colors, instead of 0 and 1, in each cell of the word embedding example to give an idea about 256 dimensions and their capability to store information in a much denser format. Please do not try to feed colors directly to a ML model as it may void your warranty.

Consider an example where words are vectors in two dimensions (not 256). Each word is an arrow starting at 0,0 and ending on some X,Y coordinates.


From Deep Learning Cookbook by Douwe Osinga.

The interesting part about words as vectors is that we can now visualize, in a limited 2D space, how the conceptual distance between the terms man-woman is being translated by the word vectorization algorithm into a physical geometrical distance, which is quite similar to the distance between the terms king-queen. If in only two dimensions the algorithm can generalize from man-woman to king-queen, what can it learn about more complex semantic relationships and hundreds of dimensions?

We can ask such a ML model interesting questions and get answers that are already beyond human level performance:

  • Q: Paris is to France as Berlin is to? A: Germany.
  • Q: Starbucks is to coffee as Apple is to? A: IPhone.
  • Q: What are the capitals of all the European countries? A: UK-London, France-Paris, Romania-Bucharest, etc.
  • Q: What are the three products IBM is most related to? A: DB2, WebSphere Portal, Tamino_XML_Server.

The above are real examples using a a model trained on Google news.

One can train a ML model with relevant vectorized medical text and see if it can answer questions like:

  • Q: Acute pulmonary edema is to CHF as ketoacidosis is to? A: diabetes.
  • Q: What are the three complications a cochlear implant is related to? A: flap necrosis, improper electrode placement, facial nerve problems.
  • Q: Who are the two most experienced surgeons in my home town for a TKR? A: Jekyll, Hyde.

Word vectorization allows other ML models to deal with text (as tensors) — models that do care about the order of the words, algorithms that deal with time sequences, which I will detail in the next articles.

Discrete Categories

Consider a drop-down with the following mutually exclusive drugs:

  1. Viadur
  2. Viagra
  3. Vibramycin
  4. Vicodin

As the above text seems already encoded (Vicodin=4), you may be tempted to eliminate the text and leave the numbers as the encoded values for these drugs. That’s not a good idea. The algorithm will erroneously deduce there is a conceptual similarity between the above drugs just because of their similar range of numbers. After all, two and three are really close from a machine’s perspective, especially if it is a 20,000-drug list. 

The list of drugs being ordered alphabetically by their brand names doesn’t imply there is any conceptual or pharmacological relationship between Viagra and Vibramycin.

Mutually exclusive categories are transformed to numbers with the One Hot Encoder technique detailed above. The result will be a table with the columns: Viadur, Viagra, Vibramycin, Viocodin (similar to the words tokenized above: “the,” “cat,” etc.) Each instance (row) will have one and only one of the above columns encoded with a 1, while all the others will be encoded to 0. In this arrangement, the algorithm is not induced into error and the model will not find conceptual relationships where there are none.


When an algorithm is comparing numerical values such as creatinine=3.8, age=1, heparin=5,000, the ML model will give a disproportionate importance and incorrect interpretation to the heparin parameter, just because heparin has a high raw value when compared to all the other numbers. 

One of the most common solutions is to normalize each column:

  • Calculate the mean and standard deviation
  • Replace the raw values with the new normalized ones

When normalized, the algorithm will correctly interpret the creatinine and the age of the patient to be the important, deviant from the average kind of features in this sample, while the heparin will be regarded as normal.

Curse of Dimensionality

If you have a table with 10,000 features (columns),  you may think that’s great as it is feature-rich. But if this table has fewer than 10,000 samples (examples), you should expect ML models that would vehemently refuse to digest your data set or just produce really weird outputs.

This is called the curse of dimensionality. As the number of dimensions increases, the “volume” of the hyperspace created increases much faster, to a point where the data available becomes sparse. That interferes with achieving any statistical significance on any metric and will also prevent a ML model from finding clusters since the data is too sparse.

Preferably the number of samples should be at least three orders of magnitude larger than the number of features. A 10,000-column table had be better garnished by at least 10,000 rows (samples).


After all the effort invested in the data preparation above, what kind of tensors can we offer now as food for thought to a machine ?

  • 2D – table: samples, features
  • 3D – time sequences: samples, features, time
  • 4D – images: samples, height, width, RGB (color)
  • 5D – videos: samples, frames, height, width, RGB (color)

Note that samples is the first dimension in all cases.


Hopefully this article will cause no indigestion to any human or artificial entity.

Next Article

How Does a Machine Actually Learn?


A Machine Learning Primer for Clinicians–Part 3

Alexander Scarlat, MD is a physician and data scientist, board-certified in anesthesiology with a degree in computer sciences. He has a keen interest in machine learning applications in healthcare. He welcomes feedback on this series at drscarlat@gmail.com.


Previous articles:

  1. An Introduction to Machine Learning
  2. Supervised Learning

Unsupervised Learning

In the previous article, we defined unsupervised machine learning as the type of algorithm used to draw inferences from input data without having a clue about the output expected. There are no labels such as patient outcome, diagnosis, LOS, etc. to provide a feedback mechanism during the model training process.

In this article, I’ll focus on the two most common models of unsupervised learning: clustering and anomaly detection.

Unsupervised Clustering

Note: do not confuse this with with classification, which is a supervised learning model introduced in the last article.

As a motivating factor, consider the following image from Wikipedia:


The above is a heat map that details the influence of a set of parameters on the expression (production) of a set of genes. Red means increased expression and green means reduced expression. A clustering model has organized the information in a heat map plus the hierarchical clustering on top and on the right sides of the diagram above. 

There are two types of clustering models:

  • Models that need to be told a priori the number of groups / clusters we’re looking for
  • Models that will find the optimal number of clusters

Consider a simple dataset:


Problem definition:

  • Task: identify the four clusters in Dataset1.
  • Input: sets of X and Y and the number of groups (four in the above example).
  • Performance metric: total sum of the squared distances of each point in a cluster from its centroid (the center of the cluster) location.

The model initializes four centroids, usually at a random location. The centroids are then moved according to a cost function that the model tries to minimize at each iteration. The cost function is the total sum of the squared distance of each point in the cluster from its centroid. The process is repeated iteratively until there is little or no improvement in the cost function.

In the animation below you can see how the centroids – white X’s – are moving towards the centers of their clusters in parallel to the decreasing cost function on the right.


While doing great on Dataset1, the same model fails miserably on Dataset2, so pick your clustering ML model wisely by exposing the model to diverse experiences / datasets:


Clustering models that don’t need to know a priori the number of centroids (groups) will have the following problem definition:

  • Task: identify the clusters in Dataset1 with the lowest cost function.
  • Input: sets of X and Y (there are NO number of groups / centroids).
  • Performance metric: same as above.

The model below initializes randomly many centroids and then works through an algorithm that tells it how to consolidate together other neighboring centroids to reduce the number of groups to the overall lowest cost function.


From “Clustering with SciKit” by David Sheehan

3D Clustering

While the above example had as input two dimensions (features) X and Y, the following gene expression in a population has three dimensions: X, Y, and Z. The mission definition for such a clustering ML model is the same as above, except the input has now three features: X, Y, and Z.


The animated graphic is at www.arthrogenica.com

Unsupervised Anomaly Detection

As a motivating factor, consider the new criteria for early identification of patients at risk for sepsis or septic shock, qSOFA 2018. The three main rules:

  • Glasgow Coma Scale (GCS) < 15
  • Respiratory rate (RR) >= 22
  • Systolic blood pressure (BP) <=100 mmHg

Let’s focus on two parameters, RR and BP, and a patient who presents with:

  • RR = 21
  • BP = 102

A rule-based engine with only two rules will miss this patient, as it doesn’t sound the alarm per the above qSOFA definition. Not if the rule was written with AND and not it had OR between the conditions. Can a ML model do better? Would you define the above two parameters, when taken together, as an anomaly ? 

Before I explain how a machine can detect anomalies unsupervised by humans, a quick reminder from Gauss (born 1777) about his eponymous distribution.

One-Variable Gaussian Distribution


You may remember from statistics that the above bell-shaped normal Gaussian distribution can accurately describe many phenomena around us. The mean on the above X axis is zero and then there are several standard deviations around the mean (from -3 to +3). The Y axis defines the probability of X. Each point on the chart has a probability of occurrence: the red dot on the right can be defined as an anomaly with a probability of ~ 1 percent. The dot on the left side has a probability of ~ 18 percent,  so most probably it’s not an anomaly. 

The sum (integral) of a Gaussian probability distribution is one, or 100 percent. Thus even an event right on top of our chart has a probability of only 40 percent. Given a point on the X axis and using the Gaussian distribution, we can easily predict the probability of that event happening.

Two-Variable Gaussian Distribution

Back to the patient that exhibits RR = 21 and BP=102 and the decision whether this patient is in for a septic shock adventure or not. There are two variables: X and Y, and a new problem definition:

  • Task: automatically identify instances as anomalies if they are beyond a given threshold. Let’s set the anomaly threshold at three percent.
  • Input: sets of X and Y and the threshold to be considered an anomaly (three percent).
  • Performance metric: number of correct vs. incorrect classifications with a test set, with known anomalies (more about unbalanced classes in next articles).

The following 3D peaks chart has X (RR), Y (BP), and the Gaussian probability as Z axis. Each point on the X-Y plane has a probability associated with it on the Z axis. Usually a peaks chart has an accompanying contour map  in which the 3D is flattened to 2D, with the color still expressing the probability.

Note the elongated, oblong shape of both the peaks chart and the contour map underneath it. This is the crucial fact: the shape of the Gaussian distribution of X and Y  is not a circle (which we may have naively assumed), it’s elliptical. On the peaks chart, there is a red dot with its corresponding red dot on the contour map below. The elliptic shape of our probability distribution of X and Y helps visualizing the following:

  • Each parameter, when considered separately on its own probability distribution, is within its normal limits.
  • Both parameters, when taken together, are definitely abnormal, an anomaly with a probability of ~ 0.8 percent (0.008 on the Z axis), much smaller than the three percent threshold wee set above.


Unsupervised anomaly detection should be considered when:

  • The number of normal instances is much larger than the number of anomalies. We just don’t have enough samples of labeled anomalies to use with a supervised model.
  • There may be unforeseen instances and combinations of parameters that when considered together are abnormal. Remember that a supervised model cannot predict or detect instances never seen during training. Unsupervised anomaly detection models can deal with the unforeseen circumstances by using a function from the 1800s.

Scale the above two-parameter model to one that considers hundreds to thousands of patient parameters, together and at the same time, and you have an unsupervised anomaly detection ML model to prevent patients deterioration while being monitored in a clinical environment. 

The fascinating part about ML algorithms is that we can easily scale a model to thousands of dimensions while having, at the same time, a severe human limitation to visualize more than 5D (see previous article on how a 4D / 5D problem may look).

Next Article

How to Properly Feed Data to a ML Model

A Machine Learning Primer for Clinicians–Part 2

Alexander Scarlat, MD is a physician and data scientist, board-certified in anesthesiology with a degree in computer sciences. He has a keen interest in machine learning applications in healthcare. He welcomes feedback on this series at drscarlat@gmail.com.


To recap from Part 1, the difference between traditional statistical models and ML models is their approach to a problem:


We feed Statistics as an Input and some Rules. It provides an Output.

With a ML model, we have two steps:

  • We feed ML an Input and the Output and the ML model learns the Rules. This learning phase is also called training or model fit.
  • Then ML uses these rule to predict the Output.

In an increasing number of fields, we realize that these rules learned by the machine are much better than the rules we humans can come up with.

Supervised vs. Unsupervised Learning

With the above in mind, the difference between supervised and unsupervised is simple.

  • Supervised learning. We know the labels of each input instance. We have the Output (discharged home, $85,300 cost to patient, 12 days in ICU, 18 percent chance of being readmitted within 30 days). Regression to a continuous variable (number) and Classification to two or more classes are the main subcategories of supervised learning.
  • Unsupervised learning. We do not have the Output. Actually, we may have no idea what the Output even looks like. Note that in this case there’s no teacher (in the form of Output) and no rules around to tell the ML model what’s correct and what’s not correct during learning. Clustering, Anomaly Detection, and Primary Component Analysis are the main subcategories of unsupervised learning.
  • Other. Models working in parallel, one on top of another in ensembles, some models supervising other models which in turn are unsupervised, some of these models getting into a  “generative adversarial relationship” with other models (not my terminology). A veritable zoo, an exciting ecosystem of ML model architectures that is growing fast as people experiment with new ideas and existing tools.


Supervised Learning – Regression

With regression problems, the Output is continuous — a number such the LOS (number of hours or days the patient was in hospital) or the number of days in ICU, cost to patient, days until next readmission, etc. It is also called regression to continuous arbitrary values.

Let’s take a quick look at the Input (a thorough discussion about how to properly feed data to a baby ML model will follow in one of the next articles).

At this stage, think about the Input as a table. Rows are samples and columns are features.

If we have a single column or feature called Age and the output label is LOS, then it may be either a linear regression or a  polynomial (non-linear) regression.

Linear Regression

Reusing Tom Mitchell definition of a machine learning algorithm:

” A computer program is said to learn from Experience (E) with respect to some class of tasks (T) and performance (P) – IF its performance at tasks in T, as measure by P, improves with experience (E)”

  • The Task is to find the best straight line between the scatter points below – LOS vs. age – so that, in turn, can be used later for prediction of new instances.
  • The Experience is all the X and Y data points we have.
  • The Performance will be measured as the distance between the model prediction and the real value.

The X axis is age, the Y axis is LOS (disregard the scale for the sake of discussion):


Monitoring the model learning process, we can see how it approaches the best line with the given data (X and Y) while going through an iterative process. This process will be detailed later in this series:


From “Linear Regression the Easier Way,” by Sagar Sharma.  

Polynomial (Non-Linear) Regression

What happens when the relationship between our (single) variable age and output LOS is not linear?

  • The Task is to find the best line — obviously not a straight line — to describe the (non-linear) polynomial relationship between X (Age) and Y (LOS).
  • Experience and Performance stay the same as in the previous example.


As above, monitoring the model learning process as it approximates the data (Experience) during the fit process:


From “R-english Freakonometrics” by Arthur Carpentier.

Life is a bit more complicated than one feature (age) when predicting LOS, so we’d like see what happens with two features — age and BMI. How do they contribute to LOS when taken together?

  • The Task is to predict the Z axis (vertical) with a set of X and Y (horizontal plane).
  • The Experience. The model now has two features: age and BMI. The output stays the same: LOS.
  • Performance is measured as above.

Find the following peaks chart — a function — similar to the linear line or the polynomial line we found above.

This time the function defines the input as a plane (age on X and BMI on Y) and the output LOS on Z axis.

Knowing a specific age as X and BMI as Y and using such a function / peaks 3D chart, one can predict LOS as Z:


A peaks chart usually has a contour chart accompanying, visualizing the relationships between X, Y, and Z (note that besides X and Y, color is considered a dimension, too, Z on a contour chart). The contour chart of the 3D chart above:


From the MathWorks manual

How does a 4D problem look?

If we add the time as another dimension to a 3D like the one above, it becomes a 4D chart (disregard the axes and title):


From “Doing Magic and Analyzing Time Series with R,” by Peter Laurinec.

Supervised Learning – Classification

When the output is discrete rather than continuous, the problem is one of classification.

Note: classification problems are also called logistic regression. That’s a misnomer and just causes confusion.

  • Binary classification, such as dead or alive.
  • Multi-class classification, such as discharged home, transferred to another facility, discharged to nursing facility, died, etc.

One can take a regression problem such as LOS and make it a classification problem using several buckets or classes: LOS between zero and four days, LOS between five and eight days, LOS greater than nine days, etc.

Binary Classification with Two Variables

  • The Task is to find the best straight line that separates the blue and red dots, the decision boundary between the two classes.
  • The Experience. Given the input of the X and Y coordinate of each dot,  predict the output as the color of the dot — blue or red.
  • Performance is the accuracy of the prediction. Note that just by chance, with no ML involved, the accuracy of guessing is expected to be around 50 percent in this of type of binary classification, as the blue and the red classes are well balanced.

In visualizing the data, it seems there may be a relatively straight line to separate the dots:


The model learning iteratively the best separating straight line. This model is linearly constrained when searching for a decision boundary:


From “Classification” by Davi Frossard.

Binary Classification with a Non-Linear Separation

A bit more complex dots separation exercise, when the separation line is obviously non-linear.

The Task, Experience, and Performance remain the same:


The ML model learning the best decision boundary, while not being constrained to a linear solution:


From gfycat.

Multi-Class Classification

Multi-class classification is actually an extension of the simple binary classification. It’s called the “One vs. All” technique. 

Consider three groups: A, B, and C. If we know how to do a Binary classification (see above), then we can calculate probabilities for:

  • A vs. all the others (B and C)
  • B vs. all the others (A and C)
  • C vs. all the others (A and B)

More about Classification models in the next articles.

Binary Classification with Three Variables

While the last example had two variables (X and Y) with one output (color of the dot), the next one has three input variables (X, Y, and Z) and the same output: color of the dot.

  • The Task is to find the best hyperplane shape (also known as rules / function) to separate the blue and red dots.
  • Experience has now three input variables (X, Y, and Z) and one output label (the dot color).
  • Performance remains the same.


From Ammon Washburn, data scientist. Click here to see the animated 3D picture.

How can we visualize a problem with 5,000 dimensions?

Unfortunately, we cannot visualize more than 4-5 dimensions. The above 4D chart (3D plus time) on a map with multiple locations — charts running in parallel, over time — I guess that would be considered a 5D visualization, having the geo location as the fifth dimension. One can imagine how difficult it would be to actually visualize, absorb, and digest the information and just monitor such a (limited) 5D problem for a couple of days.

Alas, if it’s difficult for us to visualize and monitor a 5D problem,  how can we expect to learn from each and every experience of such a complex system and improve our performance, in real time, on a prediction task? 

How about a problem with 10,000 features, the task of predicting one out of 467 DRG classifications for a specific patient with an error less than 8 percent? 

In this series, we’ll tackle problems with many features and many dimensions while visualizing additional monitors — the ML learning curves — which in my opinion are as beautiful and informative as the charts above, even as they are only 2D.

Other ML Architectures

On an artsy note, as it is related to the third group of ML models (Other) in my chart above, are models that are not purely supervised or unsupervised.

In 2015, Gatys et al. published a paper on a model that separates style from content in a painting. The generative model learns an artist’s style and then predicts the style effects learned on any arbitrary image. The left upper corner is an image in Europe and then each cell is the same image rendered in one of several famous painters’ styles:


Scoff you may, but this week, the first piece of art generated by AI goes for sale at Christie’s.

Generative ML models have been trained to generate text in a Shakespearean style or the Bible style by feeding ML models with all Shakespeare or the whole Bible text. There are initiatives I’ll report about where a ML model learns the style of a physician or group of physicians and then creates admission / discharge notes accordingly. This is the ultimate dream come true for any young and tired physician or resident.

In Closing

I’d like to end this article on a philosophical note.

Humans recently started feeding ML models the actual Linux and Python programming languages (the relevant documentation, manuals, Q&A forums, etc.) 

As expected, the machines started writing computer code on their own. 

The computer software written by machines cannot be compiled nor executed and it will not actually run.

Yet …

I’ll leave you with this intriguing, philosophical, recursive thought in mind — computers writing their own software — until the next article in the series: Unsupervised Learning.

A Machine Learning Primer for Clinicians–Part 1

Alexander Scarlat, MD is a physician and data scientist, board-certified in anesthesiology with a degree in computer sciences. He has a keen interest in machine learning applications in healthcare. He welcomes feedback on this series at drscarlat@gmail.com.


AI State of the Art in 2018

Near human or super-human performance:

  • Image classification
  • Speech recognition
  • Handwriting transcription
  • Machine translation
  • Text-to-speech conversion
  • Autonomous driving
  • Digital assistants capable of conversation
  • Go and chess games
  • Music, picture, and text generation

Considering all the above — AI/ML (machine learning), predictive analytics, computer vision, text and speech analysis — you may wonder:

How can a machine possibly learn?!

As a physician with a degree in CS and curious about ML, I took the ML Stanford/Coursera course by Andrew Ng. It was a painful, but at the same time an immensely pleasurable educational experience. Painful because of the non-trivial math involved. Immensely pleasurable because I’ve finally understood how a machine actually learns.

If you are a clinician who is interested in AI / ML but short on math / programming skills or time, I will try to clarify in a series of short articles — under the gracious auspices of HIStalk — what I have learned from my short personal journey in ML. You can check some of my ML projects here.

I promise that no math or programming are required.

Rules-Based Systems

The ancient predecessors of ML are rules-based systems. They are easy to explain to humans:

  • IF the blood pressure is between normal and normal +/- 25 percent
  • AND the heart rate is between normal and normal + 27 percent
  • AND the urinary output is between normal and normal – 43 percent
  • AND / OR etc.
  • THEN consider septic shock as part of the differential diagnosis.

The problem with these systems is that they are time-consuming, error-prone, difficult and expensive to build and test, and do not perform well in real life.

Rules-based systems also do not adapt to new situations that the model has never seen.

Even when rules-based systems predict something, it is based on a human-derived rule, on a human’s (limited?) understanding of the problem and how well that human represented the restrictions in the rules-based system.

One can argue about the statistical validation that is behind each and every parameter in the above short example rule. You can imagine what will happen with a truly big, complex system with thousands or millions of rules.

Rules-based systems are founded on a delicate and very brittle process that doesn’t scale well to complex medical problems.

ML Definitions

Two definitions of machine learning are widely used:

  • “The field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel).
  • “A computer program is said to learn from experience (E) with respect to some class of tasks (T) and performance measure (P) — IF its performance at tasks in T, as measured by P, improves with experience E.” (Tom Mitchell).

Rules-based systems, therefore are by definition NOT an ML model. They are explicitly programmed according to some fixed, hard-wired set of finite rules

With any model, ML or not ML, or any other common sense approach to a task, one MUST measure the model performance. How good are the model predictions when compared to real life? The distance between the model predictions and the real-life data is being measured with a metric, such as accuracy or mean-squared error.

A true ML model MUST learn with each and every new experience and improve its performance with each learning step while using an optimization and loss function to calibrate its own model weights. Monitoring and fine-tuning the learning process is an important part of training a ML algorithm.

What’s the Difference Between ML and Statistics?

While ML and statistics share a similar background and use similar terms and basic methods like accuracy, precision, recall, etc., there is a heated debate about the differences between the two. The best answer I found is the one in Francois Chollet’s excellent book “Deep Learning with Python.”

Imagine Data going into a black box, which uses Rules (classical statistical programming) and then predicts Answers:


One provides Statistics, the Data, and the Rules. Statistics will predict the Answers.

ML takes a different approach:


One provides ML, the Data, and the Answers. ML will return the Rules learned.

The last figure depicts only the training / learning phase of ML known as FIT – the model fits to the experiences learned – while learning the Rules.

Then one can use these machine learned Rules with new Data, never seen by the model before, to PREDICT Answers.

Fit / Predict are the basics of a ML model life cycle: the model learns (or train / fit) and then it predicts on new Data.

Why is ML Better than Traditional Statistics for Some Tasks?

There are numerous examples where there are no statistical models available: on-line multi-lingual translation, voice recognition, or identifying melanoma in a series of photos better than a group of dermatologists. All are ML-based models, with some theoretical foundations in Statistics.

ML has a higher capacity to represent complex, multi-dimensional problems. A model, be it statistical or ML, has inherent, limited, problem-representation capabilities. Think about predicting inpatient mortality based on only one parameter, such as age. Such a model will quickly achieve a certain performance ceiling it cannot possibly overpass, as it is limited in its capabilities to represent the true complexity involved in this type of prediction. The mortality prediction problem is much more complicated than considering only age.

On the other hand, a model that takes into consideration 10,000 parameters when predicting mortality (diagnosis at admission, procedures, lab, imaging, pathology results, medications, consultations, etc.) has a theoretically much higher capacity to better represent the problem complexity, the numerous interrelations that may exist within the data, non-linear, complex relation and such. ML deals with multivariate, multi-dimensional complex issues better than statistics.

ML model predictions are not bound by the human understanding of a problem or the human decision to use a specific model in a specific situation. One can test 20 ML models with thousands of dimensions each on the same problem and pick the top five to create an ensemble of models. Using this architecture allows several mediocre-performing models to achieve a genius level just by combining their individual, non-stellar predictions. While it will be difficult for a human to understand the reasoning of such an ensemble of models, it may still outperform and beat humans by a large margin. Statistics was never meant to deal with this kind of challenge.

Statistical models do not scale well to the billions of rows of data currently available and used for analysis.

Statistical models can’t work when there are no Rules. ML models can – it’s called unsupervised learning. For example, segment a patient population into five groups. Which groups, you may ask? What are their specifications (a.k.a. Rules)? ML magic: even as we don’t know the specs of these five groups, an algorithm can still segment the patient population and then tell us about these five groups’ specs.

Why the Recent Increased Interest in AI/ML?

The recent increased interest in AI/ML is attributed to several factors:

  • Improved algorithms derived from the last decade progress in the math foundations of ML
  • Better hardware, specifically designed for ML needs, based on the video gaming industry GPU (graphic processor unit)
  • Huge quantity of data available as a playground for nascent data scientists
  • Capability to transfer learning and reuse models trained by others for different purposes
  • Most importantly,  having all the above as free, open source while being supported by a great users’ community

Articles Structure

I plan this structure for upcoming articles in this series:

  • The task or problem to solve
  • Model(s) usually employed for this type of problem
  • How the model learns (fit) and predicts
  • A baseline, sanity check metric against which model is trying to improve
  • Model(s) performance on task
  • Applications in medicine, existing and tentative speculations on how the model can be applied in medicine

In Upcoming Articles

  • Supervised vs. unsupervised ML
  • How to prep the data before feeding the model
  • Anatomy of a ML algorithm
  • How a machine actually learns
  • Controlling the learning process
  • Measuring a ML model performance
  • Regression to arbitrary values with single and multiple variables (e.g. LOS, costs)
  • Classification to binary classes (yes/no) and to multiple classes (discharged home, discharged to rehab, died in hospital, transferred to ICU, etc.
  • Anomaly detection: multiple parameters, each one may be within normal range (temperature, saturation, heart rate, lactic acid, leucocytes, urinary output, etc.) , but taken together, a certain combination may be detected as abnormal – predict patients in risk of deterioration vs. those ready for discharge
  • Recommender system: next clinical step (lab, imaging, med, procedure) to consider in order to reach the best outcome (LOS, mortality, costs, readmission rate)
  • Computer vision: melanoma detection in photos, lung cancer / pneumonia detection in chest X-ray and CT scan images, and histopathology slides classification to diagnosis
  • Time sequence classification and prediction – predicting mortality or LOS hourly, with a model that considers the order of the sequence of events, not just the events themselves

Subscribe to Updates



Text Ads

Report News and Rumors

No title

Anonymous online form
Rumor line: 801.HIT.NEWS



Founding Sponsors


Platinum Sponsors

























































Gold Sponsors

















Reader Comments

  • The Trip: Love your new Get Involved section. Don't you know that I will be contacting you shortly!...
  • Vaporware?: No such thing as bad PR, i guess? If I was Cerner (or CommonWell ... is there a difference anymore?) I would not wan...
  • Ross Martin: File under HIMSS19 Advice... Here's a taste of the steady diet of marketing and life wisdom I've been sampling from g...
  • Lazlo Hollyfeld: Why would anyone assume a random poster on Seeking Alpha has much credibility when they write an overview of a publicly-...
  • Kyle: Sounds like CCF didn't employ very good "Rizk management"....
  • Lazlo Hollyfeld: Well the DoD just spent another $100M, now well over $6B, on an audit of its spending that was years overdue, failed aby...
  • Vaporware?: I thought Travis Dalton did a good job not insulting our intelligence and not lying... "hard lessons," "should have done...
  • DrJay: Thank you on the follow up and explaining what Black book says the process is. From my own personal experiences be...
  • Oh Really, Harold?: I just get red flags all over the place from this: * how does 2,194 round to 3,000? * why use two decimal places on KP...
  • jp: I've tried for months to have any contact with Black Book to better understand how they determine which services we offe...

Sponsor Quick Links