Motivation
You have just trained a gradient boosting machine (GBM) and a random forest (RF) classifier on your data using Stata’s new h2oml command suite. Your GBM model achieves 87% accuracy on the testing data, and your RF model, 85%. It looks as if GBM is the preferred classifier, right? Not so fast.
Why accuracy alone isn’t enough
Accuracy, area under the curve, and root mean squared error are popular metrics, but they provide only point estimates. These numbers reflect how well a model performed on one specific testing sample, but they don’t account for the variability that can arise from sample to sample. In other words, they don’t answer this key question: Will the difference in performance between these methods hold at the population level, or could it have occurred by chance only in this particular testing dataset? Read more…
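The post's key question — whether an observed accuracy gap could be sampling noise — can be illustrated with a quick paired-bootstrap sketch. This is a hedged illustration in Python with synthetic prediction results (the accuracy figures mirror the 87% vs. 85% above, but the data and the bootstrap approach are stand-ins, not the post's h2oml workflow):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic test-set results: 1 = correct prediction, 0 = incorrect,
# for the same 500 test observations scored by both models.
n = 500
gbm_correct = rng.binomial(1, 0.87, n)  # GBM at roughly 87% accuracy
rf_correct = rng.binomial(1, 0.85, n)   # RF at roughly 85% accuracy

# Paired bootstrap: resample the SAME test indices for both models,
# then recompute the accuracy difference on each resample.
diffs = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    diffs.append(gbm_correct[idx].mean() - rf_correct[idx].mean())
diffs = np.array(diffs)

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")
# If the interval covers 0, a 2-point gap on one test set
# is consistent with pure sampling variability.
```

On a test set of this size, intervals like this often do cover zero, which is exactly why a single point estimate of accuracy is not enough to declare a winner.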
I am excited to let you be the second to know that Stata 19 is now available. Statalist is always the first to know!
Highlights include
And more. Visit stata.com/new-in-stata for all the details. You can also visit stata.com/help.cgi?whatsnew18to19 for the nitty-gritty on every single change from Stata 18 to Stata 19.
Those of you with StataNow already received some of these features along the way in updates to StataNow. And, those of you with StataNow are eligible for an automatic upgrade to StataNow 19. Watch your inbox for an email from us with instructions on how to request your upgrade.
Machine learning, deep learning, and artificial intelligence are a collection of algorithms used to identify patterns in data. These algorithms have exotic-sounding names like “random forests”, “neural networks”, and “spectral clustering”. In this post, I will show you how to use one of these algorithms, called a “support vector machine” (SVM). I don’t have space to explain an SVM in detail, but I will provide some references for further reading at the end. I am going to give you a brief introduction and show you how to implement an SVM with Python.
Our goal is to use an SVM to differentiate between people who are likely to have diabetes and those who are not. We will use age and HbA1c level to differentiate between people with and without diabetes. Age is measured in years, and HbA1c is a blood test that measures glucose control. The graph below displays diabetics with red dots and nondiabetics with blue dots. An SVM model predicts that older people with higher levels of HbA1c in the red-shaded area of the graph are more likely to have diabetes. Younger people with lower HbA1c levels in the blue-shaded area are less likely to have diabetes. Read more…
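A classifier like the one described can be sketched in a few lines with scikit-learn. This is a minimal illustration on synthetic age/HbA1c data (the post's actual dataset and exact model settings are not shown here, so the decision rule below is an invented stand-in):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic stand-in for the post's data: age in years, HbA1c in percent.
n = 200
age = rng.uniform(20, 80, n)
hba1c = rng.uniform(4.0, 10.0, n)
# Toy labeling rule: diabetes more likely with older age and higher HbA1c.
diabetes = ((hba1c > 6.5) & (age > 40)).astype(int)

X = np.column_stack([age, hba1c])

# Radial-basis-function SVM, as is common for a nonlinear boundary
# like the shaded regions in the post's graph.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, diabetes)

# Predict for a 65-year-old with HbA1c of 8.0 and a 25-year-old at 5.0.
print(clf.predict([[65, 8.0], [25, 5.0]]))
```

In practice you would rescale the features (age and HbA1c live on very different ranges) and evaluate on held-out data rather than the training sample.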
Why use lasso to do inference about coefficients in high-dimensional models?
High-dimensional models, which have too many potential covariates for the sample size at hand, are increasingly common in applied research. The lasso, discussed in the previous post, can be used to estimate the coefficients of interest in such a model. This post discusses the Stata 16 commands that perform this estimation and inference. Read more…
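To make "too many potential covariates for the sample size" concrete, here is a hedged sketch of the lasso's selection behavior using scikit-learn rather than the Stata 16 commands the post covers. The data are simulated, and the penalty value is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# High-dimensional setup: more potential covariates (p) than observations (n).
n, p = 100, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # only 3 covariates truly matter
y = X @ beta + rng.standard_normal(n)

# The lasso's L1 penalty shrinks most coefficients exactly to zero,
# which is what makes estimation with p > n tractable.
model = Lasso(alpha=0.1)
model.fit(X, y)

selected = np.flatnonzero(model.coef_)
print(f"nonzero coefficients: {len(selected)} of {p}")
```

Note that naively reading off these selected coefficients does not give valid standard errors; that is precisely the inference problem the post's Stata 16 commands are designed to handle.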