<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Stata Blog</title>
	<atom:link href="http://blog.stata.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.stata.com</link>
	<description>Not Elsewhere Classified</description>
	<lastBuildDate>Sun, 09 Jun 2013 22:07:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Stata 13 ships June 24</title>
		<link>http://blog.stata.com/2013/06/09/stata-13-ships-june-24/</link>
		<comments>http://blog.stata.com/2013/06/09/stata-13-ships-june-24/#comments</comments>
		<pubDate>Sun, 09 Jun 2013 22:07:41 +0000</pubDate>
		<dc:creator>The Stata Blog Team</dc:creator>
				<category><![CDATA[New Products]]></category>
		<category><![CDATA[BLOBs]]></category>
		<category><![CDATA[effect sizes]]></category>
		<category><![CDATA[endogenous treatment effects]]></category>
		<category><![CDATA[forecasts]]></category>
		<category><![CDATA[generalized SEM]]></category>
		<category><![CDATA[Java plugins]]></category>
		<category><![CDATA[long strings]]></category>
		<category><![CDATA[multilevel mixed-effects]]></category>
		<category><![CDATA[power]]></category>
		<category><![CDATA[Project Manager]]></category>
		<category><![CDATA[random-effects panel data]]></category>
		<category><![CDATA[sample size]]></category>
		<category><![CDATA[treatment effects]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1424</guid>
		<description><![CDATA[There&#8217;s a new release of Stata. You can order it now, it starts shipping on June 24, and you can find out about it at www.stata.com/stata13/. Well, we sure haven&#8217;t made that sound exciting when, in fact, Stata 13 is a big &#8212; we mean really BIG &#8212; release, and we really do want to [...]]]></description>
				<content:encoded><![CDATA[<p>There&#8217;s a new release of Stata. You can <a href="http://www.stata.com/order/">order</a> it now, it starts shipping on June 24, and you can find out about it at <a href="http://www.stata.com/stata13/">www.stata.com/stata13/</a>.</p>
<p>Well, we sure haven&#8217;t made that sound exciting when, in fact, Stata 13 is a big &#8212; we mean really BIG &#8212; release, and we really do want to tell you about it.</p>
<p>Rather than summarizing, however, we&#8217;ll send you to <a href="http://www.stata.com/stata13/">the website</a>, which in addition to the standard marketing materials, has technical sheets, demonstrations, and even videos of the new features.</p>
<p>And all 11,000 pages of the manuals are now online.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2013/06/09/stata-13-ships-june-24/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Update on the Stata YouTube Channel</title>
		<link>http://blog.stata.com/2013/03/18/update-on-the-stata-youtube-channel/</link>
		<comments>http://blog.stata.com/2013/03/18/update-on-the-stata-youtube-channel/#comments</comments>
		<pubDate>Mon, 18 Mar 2013 17:13:02 +0000</pubDate>
		<dc:creator>Chuck Huber, Senior Statistician</dc:creator>
				<category><![CDATA[Resources]]></category>
		<category><![CDATA[videos]]></category>
		<category><![CDATA[YouTube]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1397</guid>
		<description><![CDATA[What is it about round numbers that compels us to pause and reflect? We celebrate 20-year school reunions, 25-year wedding anniversaries, 50th birthdays and other similar milestones. I don&#8217;t know the answer but the Stata YouTube Channel recently passed several milestones &#8211; more than 1500 subscribers, over 50,000 video views and it was launched six [...]]]></description>
				<content:encoded><![CDATA[<p>What is it about round numbers that compels us to pause and reflect?  We celebrate 20-year school reunions, 25-year wedding anniversaries, 50th birthdays and other similar milestones.  I don&#8217;t know the answer but the <a href="http://www.youtube.com/statacorp">Stata YouTube Channel</a> recently passed several milestones &#8211; more than 1500 subscribers, over 50,000 video views and it was launched six months ago.  We felt the need for a small celebration to mark the occasion, and I thought that I would give you a brief update.</p>
<p>I could tell you about re-recording the original 24 videos with a larger font to make them easier to read.  I could tell you about the hardware and software that we use to record them including our experiments with various condenser and dynamic microphones.  I could share quotes from some of the nice messages we&#8217;ve received.  But I think it would be more fun to talk about&#8230;.you!</p>
<p>YouTube collects data about the number of views each video receives as well as summary data about who, what, when, where, and how you are watching them.  There is no need to be concerned about your privacy; there are no personal identifiers of any kind associated with these data.  But the summary data are interesting, and I thought it might be fun to share some of the data with you.</p>
<h2>Who&#8217;s watching?</h2>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/03/Figure1.png"><img src="http://blog.stata.com/wp-content/uploads/2013/03/Figure1-300x225.png" alt="Figure1" width="300" height="225" class="aligncenter size-medium wp-image-1398" /></a></p>
<p>Figure 1 shows the age distribution of Stata YouTube Channel viewers.  If you have ever attended a <a href="http://www.stata.com/meeting/">Stata Conference</a>, you will not be surprised by this graph&#8230;until you notice the age group at the bottom.  I would not have guessed that 13-17 year olds are watching our videos.  Perhaps they saw Stata in the movie &#8220;Moneyball&#8221; with Brad Pitt and wanted to learn more.  Or maybe they were influenced by the <a href="http://www.stata.com/giftshop/trex-adult-shirt/">latest fashion craze sweeping the youth of the world</a>.</p>
<h2>What are you watching?</h2>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/03/Figure2.png"><img src="http://blog.stata.com/wp-content/uploads/2013/03/Figure2-300x225.png" alt="Figure2" width="300" height="225" class="aligncenter size-medium wp-image-1400" /></a> </p>
<p>We have posted more than 50 videos over a wide range of topics.  Figure 2 shows the total number of views for the ten most popular videos.  The more popular of the ten are about broad topics.  These broader videos are mostly older and have thus had time to accumulate more views. </p>
<p>Even so, these videos receive more views per day currently than do the special topic videos that have been posted more recently.  This supports my belief that Stata YouTube Channel viewers tend to be relatively new Stata users who want to learn about general topics, and that means more generic videos in the future.  So you and your two post-docs will just have to read the manual if you want to learn how to fit asymmetric power ARCH models with outer-product gradient standard errors.</p>
<h2>When are you watching?</h2>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/03/Figure3.png"><img src="http://blog.stata.com/wp-content/uploads/2013/03/Figure3-300x225.png" alt="Figure3" width="300" height="225" class="aligncenter size-medium wp-image-1401" /></a></p>
<p>We usually post new videos on Tuesday mornings which might lead you to believe that the peak viewing day would also be Tuesday.  Figure 3, however, shows us that the average number of views per day (vpd) is higher on Wednesdays at 420 vpd and in fact peaks on Thursdays at 430 vpd before declining Friday through Sunday.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/03/Figure4.png"><img src="http://blog.stata.com/wp-content/uploads/2013/03/Figure4-300x225.png" alt="Figure4" width="300" height="225" class="aligncenter size-medium wp-image-1402" /></a></p>
<p>Figure 4 also shows us that late September may not have been the best time to launch the Stata YouTube Channel.  Our early momentum in September and October slowed during the November and December holiday seasons.  We were, however, pleased to see that 49 of you spent New Year&#8217;s Eve watching our videos.  Perhaps next year we&#8217;ll prepare something more festive just for you!</p>
<h2>Where are you watching?</h2>
<p>What do the Czech Republic, Pakistan, Uganda, Madagascar, the United Kingdom, the Bahamas, the United States, Montenegro, and Italy have in common?  Correct!  They are all countries in which you are watching our videos.  They are also locations depicted in one of my favorite action films but I&#8217;ll leave that to the trivia buffs.  I think the most exciting information that we found in our data is that the Stata YouTube Channel is being viewed in 164 countries!</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/03/Figure5.png"><img src="http://blog.stata.com/wp-content/uploads/2013/03/Figure5-300x225.png" alt="Figure5" width="300" height="225" class="aligncenter size-medium wp-image-1403" /></a></p>
<p>You might not be surprised to learn that roughly half of the people watching the videos live in the United States, the United Kingdom, or Canada.  The results may be unexpected when we consider the &#8220;view rate&#8221; defined as the number of views per 100,000 residents.  Figure 5 shows the top 20 countries ranked by view rate for countries with at least four million residents.  Denmark had the highest view rate which was nearly twice the rate of Norway which had the second highest view rate.  The view rate in Denmark was more than three times the rate in the US and the UK.</p>
<h2>How are you watching?</h2>
<p>You might think that I would not have anything to report about &#8220;how&#8221; you are watching the videos, but it turns out that 5.2% of you are watching on mobile devices.  Perhaps this explains the 13-17 year old demographic or the 49 people watching on New Year&#8217;s Eve.  Or maybe we are helping you pass the time in the dentist&#8217;s office waiting room.</p>
<h2>Final thoughts</h2>
<p>Six months isn&#8217;t much of a milestone.  We Stata folk will use any excuse to break out the cake and ice cream.  Even so, the Stata YouTube Channel began as an experiment and often experiments do not work out as we would like.  This experiment has exceeded our expectations and, as a result, we have started taking requests for videos on our <a href="https://www.facebook.com/StataCorp">Facebook page</a> and we&#8217;ll be adding more videos every week.  So thanks for watching and stay tuned!</p>
<p>Now if you will excuse me, I&#8217;m going to get some cake and ice cream.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2013/03/18/update-on-the-stata-youtube-channel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multilevel linear models in Stata, part 2: Longitudinal data</title>
		<link>http://blog.stata.com/2013/02/18/multilevel-linear-models-in-stata-part-2-longitudinal-data/</link>
		<comments>http://blog.stata.com/2013/02/18/multilevel-linear-models-in-stata-part-2-longitudinal-data/#comments</comments>
		<pubDate>Mon, 18 Feb 2013 17:49:56 +0000</pubDate>
		<dc:creator>Chuck Huber, Senior Statistician</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[longitudinal data]]></category>
		<category><![CDATA[multilevel models]]></category>
		<category><![CDATA[xtmixed]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1380</guid>
		<description><![CDATA[In my last posting, I introduced you to the concepts of hierarchical or &#8220;multilevel&#8221; data. In today&#8217;s blog, I&#8217;d like to show you how to use multilevel modeling techniques to analyse longitudinal data with Stata&#8217;s xtmixed command. Last time, we noticed that our data had two features. First, we noticed that the means within each [...]]]></description>
				<content:encoded><![CDATA[<p>In my <a href="http://blog.stata.com/2013/02/04/multilevel-linear-models-in-stata-part-1-components-of-variance/">last posting</a>, I introduced you to the concepts of hierarchical or &#8220;multilevel&#8221; data. In today&#8217;s blog, I&#8217;d like to show you how to use multilevel modeling techniques to analyse longitudinal data with Stata&#8217;s <b>xtmixed</b> command.</p>
<p>Last time, we noticed that our data had two features. First, we noticed that the means within each level of the hierarchy were different from each other and we incorporated that into our data analysis by fitting a &#8220;variance component&#8221; model using Stata&#8217;s <b>xtmixed</b> command.</p>
<p>The second feature that we noticed is that repeated measurement of GSP showed an upward trend. We&#8217;ll pick up where we left off last time and stick to the concepts again and you can refer to the references at the end to learn more about the details.</p>
<h2>The videos</h2>
<p>Stata has a very friendly dialog box that can assist you in building multilevel models. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata&#8217;s YouTube Channel:</p>
<p><a href="http://www.youtube.com/watch?v=rUWT_EWV6QI">Introduction to multilevel linear models in Stata, part 2: Longitudinal data</a></p>
<h2>Longitudinal data</h2>
<p>I&#8217;m often asked by beginning data analysts &#8211; &#8220;What&#8217;s the difference between longitudinal data and time-series data? Aren&#8217;t they the same thing?&#8221;.</p>
<p>The confusion is understandable &#8212; both types of data involve some measurement of time. But the answer is no, they are not the same thing.</p>
<p>Univariate time series data typically arise from the collection of many data points over time from a single source, such as from a person, country, financial instrument, etc.</p>
<p>Longitudinal data typically arise from collecting a few observations over time from many sources, such as a few blood pressure measurements from many people.</p>
<p>There are some multivariate time series that blur this distinction but a rule of thumb for distinguishing between the two is that time series have more repeated observations than subjects while longitudinal data have more subjects than repeated observations.</p>
<p>Because our GSP data from <a href="http://blog.stata.com/2013/02/04/multilevel-linear-models-in-stata-part-1-components-of-variance/">last time</a> involve 17 measurements from 48 states (more sources than measurements), we will treat them as longitudinal data.</p>
<p>GSP Data: <a href="http://www.stata-press.com/data/r12/productivity.dta">http://www.stata-press.com/data/r12/productivity.dta</a></p>
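<p>The rule of thumb above can be checked directly in the data. The following is a minimal sketch, assuming the dataset loads from the URL given above; the variable names <b>onestate</b> and <b>n_years</b> are made up for this illustration and are not part of the dataset.</p>
<pre>use http://www.stata-press.com/data/r12/productivity.dta, clear
* tag one observation per state, then count the states
egen onestate = tag(state)
count if onestate
* number of repeated measurements within each state
bysort state: generate n_years = _N
tabulate n_years if onestate</pre>
<p>If the data match the description above, the count reports 48 states and the table shows 17 measurements for each of them, so there are more sources than measurements.</p>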
<h2>Random intercept models</h2>
<p>As I mentioned last time, repeated observations on a group of individuals can be conceptualized as multilevel data and modeled just as any other multilevel data. We left off last time with a variance component model for GSP (Gross State Product, logged) and noted that our model assumed a constant GSP over time while the data showed a clear upward trend.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/Graph3.png"><img class="aligncenter size-medium wp-image-1381" alt="Graph3" src="http://blog.stata.com/wp-content/uploads/2013/02/Graph3-300x225.png" width="300" height="225" /></a></p>
<p>If we consider a single observation and think about our model, nothing in the fixed or random part of the models is a function of time.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/Slide15.png"><img class="aligncenter size-medium wp-image-1382" alt="Slide15" src="http://blog.stata.com/wp-content/uploads/2013/02/Slide15-300x225.png" width="300" height="225" /></a></p>
<p>Let&#8217;s begin by adding the variable year to the fixed part of our model.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/Slide16.png"><img class="aligncenter size-medium wp-image-1383" alt="Slide16" src="http://blog.stata.com/wp-content/uploads/2013/02/Slide16-300x225.png" width="300" height="225" /></a></p>
<p>As we expected, our grand mean has become a linear regression which more accurately reflects the change over time in GSP. What might be unexpected is that each state&#8217;s and region&#8217;s mean has changed as well and now has the same slope as the regression line. This is because none of the random components of our model are a function of time. Let&#8217;s fit this model with the <b>xtmixed</b> command:</p>
<pre>. xtmixed gsp year, || region: || state:

------------------------------------------------------------------------------
         gsp |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |   .0274903   .0005247    52.39   0.000     .0264618    .0285188
       _cons |  -43.71617   1.067718   -40.94   0.000    -45.80886   -41.62348
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
region: Identity             |
                   sd(_cons) |   .6615238   .2038949      .3615664    1.210327
-----------------------------+------------------------------------------------
state: Identity              |
                   sd(_cons) |   .7805107   .0885788      .6248525    .9749452
-----------------------------+------------------------------------------------
                sd(Residual) |   .0734343   .0018737      .0698522    .0772001
------------------------------------------------------------------------------</pre>
<p>The fixed part of our model now displays an estimate of the intercept (_cons = -43.7) and the slope (year = 0.027). Let&#8217;s graph the model for Region 7 and see if it fits the data better than the variance component model.</p>
<pre>predict GrandMean, xb
label var GrandMean "GrandMean"
predict RegionEffect, reffects level(region)
predict StateEffect, reffects level(state)
gen RegionMean = GrandMean + RegionEffect
gen StateMean = GrandMean + RegionEffect + StateEffect

twoway  (line GrandMean year, lcolor(black) lwidth(thick))      ///
        (line RegionMean year, lcolor(blue) lwidth(medthick))   ///
        (line StateMean year, lcolor(green) connect(ascending)) ///
        (scatter gsp year, mcolor(red) msize(medsmall))         ///
        if region ==7,                                          ///
        ytitle(log(Gross State Product), margin(medsmall))      ///
        legend(cols(4) size(small))                             ///
        title("Multilevel Model of GSP for Region 7", size(medsmall))</pre>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/Graph4.png"><img class="aligncenter size-medium wp-image-1384" alt="Graph4" src="http://blog.stata.com/wp-content/uploads/2013/02/Graph4-300x225.png" width="300" height="225" /></a></p>
<p>That looks like a much better fit than our variance-components model from last time. Perhaps I should leave well enough alone, but I can&#8217;t help noticing that the slopes of the green lines for each state don&#8217;t fit as well as they could. The top green line fits nicely but the second from the top looks like it slopes upward more than is necessary. That&#8217;s the best fit we can achieve if the regression lines are forced to be parallel to each other. But what if the lines were not forced to be parallel? What if we could fit a &#8220;mini-regression model&#8221; for each state within the context of my overall multilevel model? Well, good news: we can!</p>
<h2>Random slope models</h2>
<p>By introducing the variable year to the fixed part of the model, we turned our grand mean into a regression line. Next I&#8217;d like to incorporate the variable year into the random part of the model. By introducing a fourth random component that is a function of time, I am effectively estimating a separate regression line within each state.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/Slide19.png"><img class="aligncenter size-medium wp-image-1385" alt="Slide19" src="http://blog.stata.com/wp-content/uploads/2013/02/Slide19-300x225.png" width="300" height="225" /></a></p>
<p>Notice that the size of the new, brown deviation u<sub>1ij.</sub> is a function of time. If the observation were one year to the left, u<sub>1ij.</sub> would be smaller and if the observation were one year to the right, u<sub>1ij.</sub> would be larger.</p>
<p>It is common to &#8220;center&#8221; the time variable before fitting these kinds of models. Explaining why is for another day. The quick answer is that, at some point during the fitting of the model, Stata will have to compute the equivalent of the inverse of the square of year. For the year 1986 this turns out to be 2.535e-07. That&#8217;s a fairly small number and if we multiply it by another small number&#8230;well, you get the idea. By centering year (e.g. cyear = year &#8211; 1978), we get a more reasonable number for 1986 (about 0.016). (Hint: If you have problems with your model converging and you have large values for time, try centering them. It won&#8217;t always help, but it might.)</p>
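<p>The arithmetic behind that quick answer can be checked with Stata&#8217;s <b>display</b> command. This is only a sketch of the scaling issue; it does not reproduce what <b>xtmixed</b> computes internally.</p>
<pre>* inverse of the square of the raw year
display 1/1986^2
* inverse of the square of the centered year (1986 - 1978 = 8)
display 1/(1986 - 1978)^2</pre>
<p>The first value is about 2.5e-07 and the second is 1/64, or about 0.016, which is several orders of magnitude larger.</p>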
<p>So let&#8217;s center our year variable by subtracting 1978 and fit a model that includes a random slope.</p>
<pre>gen cyear = year - 1978
xtmixed gsp cyear, || region: || state: cyear, cov(indep)</pre>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/Slide21.png"><img class="aligncenter size-medium wp-image-1386" alt="Slide21" src="http://blog.stata.com/wp-content/uploads/2013/02/Slide21-300x225.png" width="300" height="225" /></a></p>
<p>I&#8217;ve color-coded the output so that we can match each part of the output back to the model and the graph. The fixed part of the model appears in the top table and it looks like any other simple linear regression model. The random part of the model is definitely more complicated. If you get lost, look back at the graphic of the deviations and remind yourself that we have simply partitioned the deviation of each observation into four components. If we did this for every observation, the standard deviations in our output would, roughly speaking, summarize the typical size of each kind of deviation.</p>
<p>Let&#8217;s look at a graph of our new &#8220;random slope&#8221; model for Region 7 and see how well it fits our data.</p>
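<pre>* drop the predictions left over from the random-intercept model, if any
capture drop GrandMean RegionEffect StateEffect RegionMean StateMean
predict GrandMean, xb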
label var GrandMean "GrandMean"
predict RegionEffect, reffects level(region)
predict StateEffect_year StateEffect_cons, reffects level(state)

gen RegionMean = GrandMean + RegionEffect
gen StateMean_cons = GrandMean + RegionEffect + StateEffect_cons
gen StateMean_year = GrandMean + RegionEffect + StateEffect_cons + ///
                     (cyear*StateEffect_year)

twoway  (line GrandMean cyear, lcolor(black) lwidth(thick))             ///
        (line RegionMean cyear, lcolor(blue) lwidth(medthick))          ///
        (line StateMean_cons cyear, lcolor(green) connect(ascending))   ///
        (line StateMean_year cyear, lcolor(brown) connect(ascending))   ///
        (scatter gsp cyear, mcolor(red) msize(medsmall))                ///
        if region ==7,                                                  ///
        ytitle(log(Gross State Product), margin(medsmall))              ///
        legend(cols(3) size(small))                                     ///
        title("Multilevel Model of GSP for Region 7", size(medsmall))</pre>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/Graph6.png"><img class="aligncenter size-medium wp-image-1387" alt="Graph6" src="http://blog.stata.com/wp-content/uploads/2013/02/Graph6-300x225.png" width="300" height="225" /></a></p>
<p>The top brown line fits the data slightly better, but the brown line below it (second from the top) is a much better fit. Mission accomplished!</p>
<h2>Where do we go from here?</h2>
<p>I hope I have been able to convince you that multilevel modeling is easy using Stata&#8217;s <b>xtmixed</b> command and that this is a tool that you will want to add to your kit. I would love to say something like &#8220;And that&#8217;s all there is to it. Go forth and build models!&#8221;, but I would be remiss if I didn&#8217;t point out that I have glossed over many critical topics.</p>
<p>In our GSP example, we would still like to consider the impact of other independent variables. I haven&#8217;t mentioned choice of estimation methods (ML or REML in the case of <b>xtmixed</b>). I&#8217;ve assessed the fit of our models by looking at graphs, an approach that is important but incomplete. We haven&#8217;t thought about hypothesis testing. Oh, and all the usual residual diagnostics for linear regression such as checking for outliers, influential observations, heteroskedasticity and normality still apply&#8230;.times four! But now that you understand the concepts and some of the mechanics, it shouldn&#8217;t be difficult to fill in the details. If you&#8217;d like to learn more, check out the links below.</p>
<p>I hope this was helpful&#8230;thanks for stopping by.</p>
<h2>For more information</h2>
<p>If you&#8217;d like to learn more about modeling multilevel and longitudinal data, check out</p>
<p><a href="http://www.stata.com/bookstore/multilevel-longitudinal-modeling-stata/">Multilevel and Longitudinal Modeling Using Stata, Third Edition</a><br />
Volume I: Continuous Responses<br />
Volume II: Categorical Responses, Counts, and Survival<br />
by Sophia Rabe-Hesketh and Anders Skrondal</p>
<p>or sign up for our popular public training course <a href="http://www.stata.com/training/multilevel-mixed-models-using-stata/">Multilevel/Mixed Models Using Stata</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2013/02/18/multilevel-linear-models-in-stata-part-2-longitudinal-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Multilevel linear models in Stata, part 1: Components of variance</title>
		<link>http://blog.stata.com/2013/02/04/multilevel-linear-models-in-stata-part-1-components-of-variance/</link>
		<comments>http://blog.stata.com/2013/02/04/multilevel-linear-models-in-stata-part-1-components-of-variance/#comments</comments>
		<pubDate>Mon, 04 Feb 2013 22:49:28 +0000</pubDate>
		<dc:creator>Chuck Huber, Senior Statistician</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[multilevel models]]></category>
		<category><![CDATA[variance components]]></category>
		<category><![CDATA[xtmixed]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1358</guid>
		<description><![CDATA[In the last 15-20 years multilevel modeling has evolved from a specialty area of statistical research into a standard analytical tool used by many applied researchers. Stata has a lot of multilevel modeling capabilities. I want to show you how easy it is to fit multilevel models in Stata. Along the way, we&#8217;ll unavoidably introduce [...]]]></description>
				<content:encoded><![CDATA[<p>In the last 15-20 years multilevel modeling has evolved from a specialty area of statistical research into a standard analytical tool used by many applied researchers.</p>
<p>Stata has a lot of multilevel modeling capabilities.</p>
<p>I want to show you how easy it is to fit multilevel models in Stata. Along the way, we&#8217;ll unavoidably introduce some of the jargon of multilevel modeling.</p>
<p>I&#8217;m going to focus on concepts and ignore many of the details that would be part of a formal data analysis. I&#8217;ll give you some suggestions for learning more at the end of the blog.</p>
<h2>The videos</h2>
<p>Stata has a friendly dialog box that can assist you in building multilevel models. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata&#8217;s YouTube Channel:</p>
<p><a href="http://www.youtube.com/watch?v=KALxDwwqX1A">Introduction to multilevel linear models in Stata, part 1: The <b>xtmixed</b> command</a></p>
<h2>Multilevel data</h2>
<p>Multilevel data are characterized by a hierarchical structure. A classic example is children nested within classrooms and classrooms nested within schools. The test scores of students within the same classroom may be correlated due to exposure to the same teacher or textbook. Likewise, the average test scores of classes might be correlated within a school due to the similar socioeconomic level of the students.</p>
<p>You may have run across datasets with these kinds of structures in your own work. For our example, I would like to use a dataset that has both longitudinal and classical hierarchical features. You can access this dataset from within Stata by typing the following command:</p>
<p style="padding-left: 15px;"><b>use http://www.stata-press.com/data/r12/productivity.dta</b></p>
<p>We are going to build a model of gross state product for 48 states in the USA measured annually from 1970 to 1986. The states have been grouped into nine regions based on their economic similarity. For distributional reasons, we will be modeling the logarithm of annual Gross State Product (GSP) but in the interest of readability, I will simply refer to the dependent variable as GSP.</p>
<pre>. describe gsp year state region

              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
gsp             float  %9.0g                  log(gross state product)
year            int    %9.0g                  years 1970-1986
state           byte   %9.0g                  states 1-48
region          byte   %9.0g                  regions 1-9</pre>
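<p>Before graphing, it may help to confirm how the states are nested within regions. The following is a small sketch, assuming the dataset above is in memory; the variable name <b>pickone</b> is made up for this illustration.</p>
<pre>* tag one observation per state within each region
egen pickone = tag(region state)
* number of states in each of the nine regions
tabulate region if pickone</pre>
<p>The frequencies in the table should add up to the 48 states described above.</p>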
<p>Let&#8217;s look at a graph of these data to see what we&#8217;re working with.</p>
<pre>twoway (line gsp year, connect(ascending)), ///
        by(region, title("log(Gross State Product) by Region", size(medsmall)))</pre>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/graph1.png" rel="attachment wp-att-1359"><img class="aligncenter size-medium wp-image-1359" alt="graph1" src="http://blog.stata.com/wp-content/uploads/2013/02/graph1-300x225.png" width="300" height="225" /></a></p>
<p>Each line represents the trajectory of a state&#8217;s (log) GSP over the years 1970 to 1986. The first thing I notice is that the groups of lines are different in each of the nine regions. Some groups of lines seem higher and some groups seem lower. The second thing that I notice is that the slopes of the lines are not the same. I&#8217;d like to incorporate those attributes of the data into my model.</p>
<h2>Components of variance</h2>
<p>Let&#8217;s tackle the vertical differences in the groups of lines first. If we think about the hierarchical structure of these data, I have repeated observations nested within states which are in turn nested within regions. I used color to keep track of the data hierarchy.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide2.png" rel="attachment wp-att-1361"><img class="aligncenter size-medium wp-image-1361" alt="slide2" src="http://blog.stata.com/wp-content/uploads/2013/02/slide2-300x225.png" width="300" height="225" /></a></p>
<p>We could compute the mean GSP within each state and note that the observations within each state vary about their state mean.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide3.png"><img class="aligncenter size-medium wp-image-1364" alt="slide3" src="http://blog.stata.com/wp-content/uploads/2013/02/slide3-300x225.png" width="300" height="225" /></a></p>
<p>Likewise, we could compute the mean GSP within each region and note that the state means vary about their regional mean.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide4.png"><img class="aligncenter size-medium wp-image-1365" alt="slide4" src="http://blog.stata.com/wp-content/uploads/2013/02/slide4-300x225.png" width="300" height="225" /></a></p>
<p>We could also compute a grand mean and note that the regional means vary about the grand mean.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide5.png"><img class="aligncenter size-medium wp-image-1366" alt="slide5" src="http://blog.stata.com/wp-content/uploads/2013/02/slide5-300x225.png" width="300" height="225" /></a></p>
<p>Next, let&#8217;s introduce some notation to help us keep track of our multilevel structure. In the jargon of multilevel modelling, the repeated measurements of GSP are described as &#8220;level 1&#8221;, the states are referred to as &#8220;level 2&#8221; and the regions are &#8220;level 3&#8221;. I can add a three-part subscript to each observation to keep track of its place in the hierarchy.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide7.png"><img class="aligncenter size-medium wp-image-1367" alt="slide7" src="http://blog.stata.com/wp-content/uploads/2013/02/slide7-300x225.png" width="300" height="225" /></a></p>
<p>Now let&#8217;s think about our model. The simplest regression model is the intercept-only model which is equivalent to the sample mean. The sample mean is the &#8220;fixed&#8221; part of the model and the difference between the observation and the mean is the residual or &#8220;random&#8221; part of the model. Econometricians often prefer the term &#8220;disturbance&#8221;. I&#8217;m going to use the symbol μ to denote the fixed part of the model. μ could represent something as simple as the sample mean or it could represent a collection of independent variables and their parameters.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide8.png"><img class="aligncenter size-medium wp-image-1368" alt="slide8" src="http://blog.stata.com/wp-content/uploads/2013/02/slide8-300x225.png" width="300" height="225" /></a></p>
<p>Each observation can then be described in terms of its deviation from the fixed part of the model.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide9.png"><img class="aligncenter size-medium wp-image-1369" alt="slide9" src="http://blog.stata.com/wp-content/uploads/2013/02/slide9-300x225.png" width="300" height="225" /></a></p>
<p>If we computed this deviation of each observation, we could estimate the variability of those deviations. Let&#8217;s try that for our data using Stata&#8217;s <b>xtmixed</b> command to fit the model:</p>
<pre>. xtmixed gsp

Mixed-effects ML regression                     Number of obs      =       816

                                                Wald chi2(0)       =         .
Log likelihood = -1174.4175                     Prob &gt; chi2        =         .

------------------------------------------------------------------------------
         gsp |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   10.50885   .0357249   294.16   0.000     10.43883    10.57887
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |   1.020506   .0252613      .9721766    1.071238
------------------------------------------------------------------------------</pre>
<p>The top table in the output shows the fixed part of the model which looks like any other regression output from Stata, and the bottom table displays the random part of the model. Let&#8217;s look at a graph of our model along with the raw data and interpret our results.</p>
<pre>predict GrandMean, xb
label var GrandMean "GrandMean"
twoway  (line GrandMean year, lcolor(black) lwidth(thick))              ///
        (scatter gsp year, mcolor(red) msize(tiny)),                    ///
        ytitle(log(Gross State Product), margin(medsmall))              ///
        legend(cols(4) size(small))                                     ///
        title("GSP for 1970-1986 by Region", size(medsmall))</pre>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/graph1b.png"><img class="aligncenter size-medium wp-image-1370" alt="graph1b" src="http://blog.stata.com/wp-content/uploads/2013/02/graph1b-300x225.png" width="300" height="225" /></a></p>
<p>The thick black line in the center of the graph is the estimate of _cons, which is an estimate of the fixed part of the model for GSP. In this simple model, _cons is the sample mean, which is equal to 10.51. In the &#8220;Random-effects Parameters&#8221; section of the output, sd(Residual) is roughly the typical vertical distance between each observation (the red dots) and the fixed part of the model (the black line). In this model, sd(Residual) is an estimate of the standard deviation of GSP, which equals 1.02.</p>
<p>At this point you may be thinking to yourself &#8211; &#8220;That&#8217;s not very interesting &#8211; I could have done that with Stata&#8217;s <b>summarize</b> command&#8221;. And you would be correct.</p>
<pre>. summ gsp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         gsp |       816    10.50885    1.021132    8.37885   13.04882</pre>
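<p>The two standard deviations differ in the fourth decimal place (1.020506 versus 1.021132). That is expected: <b>summarize</b> divides the sum of squared deviations by n-1, whereas the maximum likelihood estimate from <b>xtmixed</b> divides by n. A quick check, offered as a sketch that relies only on the results <b>summarize</b> leaves behind in r(sd) and r(N):</p>
<pre>quietly summarize gsp
* rescale the n-1 standard deviation to the ML (divide-by-n) version
display r(sd) * sqrt((r(N) - 1) / r(N))</pre>
<p>With 816 observations, the result should be approximately 1.0205, in line with sd(Residual) above.</p>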
<p>But here&#8217;s where it does become interesting. Let&#8217;s make the random part of the model more complex to account for the hierarchical structure of the data. Consider a single observation, y<sub>ijk</sub> and take another look at its residual.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide11.png"><img class="aligncenter size-medium wp-image-1371" alt="slide11" src="http://blog.stata.com/wp-content/uploads/2013/02/slide11-300x225.png" width="300" height="225" /></a></p>
<p>The observation deviates from its state mean by an amount that we will denote e<sub>ijk</sub>. The observation&#8217;s state mean deviates from the regional mean by an amount that we will denote u<sub>ij.</sub>, and the observation&#8217;s regional mean deviates from the fixed part of the model, μ, by an amount that we will denote u<sub>i..</sub>. We have partitioned the observation&#8217;s residual into three parts, aka &#8220;components&#8221;, that describe its magnitude relative to the state, region and grand means. If we calculated this set of residuals for each observation, we could estimate the variability of those residuals and make distributional assumptions about them.</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/slide12.png"><img class="aligncenter size-medium wp-image-1372" alt="slide12" src="http://blog.stata.com/wp-content/uploads/2013/02/slide12-300x225.png" width="300" height="225" /></a></p>
<p>These kinds of models are often called &#8220;variance component&#8221; models because they estimate the variability accounted for by each level of the hierarchy. We can estimate a variance component model for GSP using Stata&#8217;s <b>xtmixed</b> command:</p>
<pre>xtmixed gsp, || region: || state:

------------------------------------------------------------------------------
         gsp |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   10.65961   .2503806    42.57   0.000     10.16887    11.15035
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
region: Identity             |   
                   sd(_cons) |   .6615227   .2038944       .361566    1.210325
-----------------------------+------------------------------------------------
state: Identity              |   
                   sd(_cons) |   .7797837   .0886614      .6240114    .9744415
-----------------------------+------------------------------------------------
                sd(Residual) |   .1570457   .0040071       .149385    .1650992
------------------------------------------------------------------------------</pre>
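<p>The fixed part of the model, _cons, is still an estimate of the grand mean, although its value (10.66) now differs from the raw sample mean (10.51) because the model accounts for the grouping of observations within states and regions. There are now three parameter estimates in the bottom table labeled &#8220;Random-effects Parameters&#8221;. Each quantifies the typical deviation at one level of the hierarchy.</p>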
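<p>Written out with the notation introduced above, the model is y<sub>ijk</sub> = μ + u<sub>i..</sub> + u<sub>ij.</sub> + e<sub>ijk</sub>. The output reports standard deviations, so squaring them gives the variance components. Under the model&#8217;s assumption that the three components are independent, dividing each by their sum gives the share of the total variance attributed to each level. The following is a rough sketch that simply copies the three estimates from the table above; the scalar names are arbitrary.</p>
<pre>* variance components = squared standard deviations from the output
scalar v_region = .6615227^2
scalar v_state  = .7797837^2
scalar v_resid  = .1570457^2
scalar v_total  = v_region + v_state + v_resid

* share of the total variance at each level
display "region:   " %5.3f (v_region / v_total)
display "state:    " %5.3f (v_state  / v_total)
display "residual: " %5.3f (v_resid  / v_total)</pre>
<p>By this arithmetic, roughly 41% of the variance lies between regions, 57% between states within regions, and about 2% within states over time.</p>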
<p>Let&#8217;s graph the predictions from our model and see how well they fit the data.</p>
<pre>* drop the GrandMean created for the intercept-only model, if it exists
capture drop GrandMean
predict GrandMean, xb
label var GrandMean "GrandMean"
predict RegionEffect, reffects level(region)
predict StateEffect, reffects level(state)
gen RegionMean = GrandMean + RegionEffect
gen StateMean = GrandMean + RegionEffect + StateEffect

twoway  (line GrandMean year, lcolor(black) lwidth(thick))      ///
        (line RegionMean year, lcolor(blue) lwidth(medthick))   ///
        (line StateMean year, lcolor(green) connect(ascending)) ///
        (scatter gsp year, mcolor(red) msize(tiny)),            ///
        ytitle(log(Gross State Product), margin(medsmall))      ///
        legend(cols(4) size(small))                             ///
        by(region, title("Multilevel Model of GSP by Region", size(medsmall)))</pre>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/graph2.png"><img class="aligncenter size-medium wp-image-1374" alt="graph2" src="http://blog.stata.com/wp-content/uploads/2013/02/graph2-300x225.png" width="300" height="225" /></a></p>
<p>Wow &#8211; that&#8217;s a nice graph if I do say so myself. It would be impressive for a report or publication, but it&#8217;s a little tough to read with all nine regions displayed at once. Let&#8217;s take a closer look at Region 7 instead.</p>
<pre>twoway  (line GrandMean year, lcolor(black) lwidth(thick))      ///
        (line RegionMean year, lcolor(blue) lwidth(medthick))   ///
        (line StateMean year, lcolor(green) connect(ascending)) ///
        (scatter gsp year, mcolor(red) msize(medsmall))         ///
        if region ==7,                                          ///
        ytitle(log(Gross State Product), margin(medsmall))      ///
        legend(cols(4) size(small))                             ///
        title("Multilevel Model of GSP for Region 7", size(medsmall))</pre>
<p><a href="http://blog.stata.com/wp-content/uploads/2013/02/graph3.png"><img class="aligncenter size-medium wp-image-1375" alt="graph3" src="http://blog.stata.com/wp-content/uploads/2013/02/graph3-300x225.png" width="300" height="225" /></a></p>
<p>The red dots are the observations of GSP for each state within Region 7. The green lines are the estimated mean GSP within each State and the blue line is the estimated mean GSP within Region 7. The thick black line in the center is the overall grand mean for all nine regions. The model appears to fit the data fairly well but I can&#8217;t help noticing that the red dots seem to have an upward slant to them. Our model predicts that GSP is constant within each state and region from 1970 to 1986 when clearly the data show an upward trend.</p>
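<p>One way to see that trend directly, sketched here on the assumption that the variance-components model above is still the active estimation result, is to plot the level-1 residuals against year. Resid is just an illustrative variable name.</p>
<pre>* residuals = observed gsp minus the fitted values that include the random effects
predict Resid, residuals
twoway  (scatter Resid year, mcolor(red) msize(small))          ///
        if region == 7,                                         ///
        yline(0)                                                ///
        title("Residuals for Region 7", size(medsmall))</pre>
<p>If the model captured the time pattern, the points would scatter evenly around zero in every year.</p>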
<p>So we&#8217;ve tackled the first feature of our data. We&#8217;ve successfully incorporated the basic hierarchical structure into our model by fitting a variance-components model using Stata&#8217;s <b>xtmixed</b> command. But our graph tells us that we aren&#8217;t finished yet.</p>
<p>Next time we&#8217;ll tackle the second feature of our data &#8212; the longitudinal nature of the observations.</p>
<ul>For more information</ul>
<p>If you&#8217;d like to learn more about modelling multilevel and longitudinal data, check out</p>
<p><a href="http://www.stata.com/bookstore/multilevel-longitudinal-modeling-stata/">Multilevel and Longitudinal Modeling Using Stata, Third Edition</a><br />
Volume I: Continuous Responses<br />
Volume II: Categorical Responses, Counts, and Survival<br />
by Sophia Rabe-Hesketh and Anders Skrondal</p>
<p>or sign up for our popular public training course &#8220;<a href="http://www.stata.com/training/multilevel-mixed-models-using-stata/">Multilevel/Mixed Models Using Stata</a>&#8220;.</p>
<p>There&#8217;s a course coming up in Washington, DC on February 7-8, 2013.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2013/02/04/multilevel-linear-models-in-stata-part-1-components-of-variance/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Using Stata&#8217;s random-number generators, part 4, details</title>
		<link>http://blog.stata.com/2012/10/24/using-statas-random-number-generators-part-4-details/</link>
		<comments>http://blog.stata.com/2012/10/24/using-statas-random-number-generators-part-4-details/#comments</comments>
		<pubDate>Wed, 24 Oct 2012 16:40:12 +0000</pubDate>
		<dc:creator>William Gould, President</dc:creator>
				<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[binary]]></category>
		<category><![CDATA[numerical analysis]]></category>
		<category><![CDATA[random numbers]]></category>
		<category><![CDATA[runiform()]]></category>
		<category><![CDATA[seed]]></category>
		<category><![CDATA[Statalist]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1337</guid>
		<description><![CDATA[For those interested in how pseudo random number generators work, I just wrote something on Statalist which you can see in the Statalist archives by clicking the link even if you do not subscribe: http://www.stata.com/statalist/archive/2012-10/msg01129.html To remind you, I&#8217;ve been writing about how to use random-number generators in parts 1, 2, and 3, and I [...]]]></description>
				<content:encoded><![CDATA[<p>For those interested in <b>how pseudo random number generators work</b>, I just wrote something on Statalist which you can see in the Statalist archives by clicking the link even if you do not subscribe:</p>
<p style="padding-left: 30px;"><b><a href="http://www.stata.com/statalist/archive/2012-10/msg01129.html">http://www.stata.com/statalist/archive/2012-10/msg01129.html</a></b></p>
<p>To remind you, I&#8217;ve been writing about how to <i>use</i> random-number generators in parts <b><a href="http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/">1</a></b>, <b><a href="http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/">2</a></b>, and <b><a href="http://blog.stata.com/2012/08/29/using-statas-random-number-generators-part-3-drawing-with-replacement/">3</a></b>, and I still have one more posting I want to write on the subject.  What I just wrote on Statalist, however, is about how random-number generators work, and I think you will find it interesting.</p>
<p>To find out more about Statalist, see</p>
<p style="padding-left:30px;"><b><a href="http://blog.stata.com/2010/11/08/statalist/">Statalist</a></b></p>
<p style="padding-left:30px;"><b><a href="http://blog.stata.com/2010/12/14/how-to-successfully-ask-a-question-on-statalist/">How to successfully ask a question on Statalist</a></b></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2012/10/24/using-statas-random-number-generators-part-4-details/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Stata’s SEM features to model the Beck Depression Inventory</title>
		<link>http://blog.stata.com/2012/10/17/using-statas-sem-features-to-model-the-beck-depression-inventory/</link>
		<comments>http://blog.stata.com/2012/10/17/using-statas-sem-features-to-model-the-beck-depression-inventory/#comments</comments>
		<pubDate>Wed, 17 Oct 2012 22:17:40 +0000</pubDate>
		<dc:creator>Chuck Huber, Senior Statistician</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[psychology]]></category>
		<category><![CDATA[psychometric]]></category>
		<category><![CDATA[SEM]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1293</guid>
		<description><![CDATA[I just got back from the 2012 Stata Conference in San Diego where I gave a talk on Psychometric Analysis Using Stata and from the 2012 American Psychological Association Meeting in Orlando. Stata&#8217;s structural equation modeling (SEM) builder was popular at both meetings and I wanted to show you how easy it is to use. [...]]]></description>
				<content:encoded><![CDATA[<p>I just got back from the <a href="http://www.stata.com/meeting/sandiego12/">2012 Stata Conference</a> in San Diego where I gave a talk on <a href="http://stata.com/meeting/sandiego12/materials/sd12_huber.pdf">Psychometric Analysis Using Stata</a> and from the 2012 American Psychological Association Meeting in Orlando. Stata&#8217;s structural equation modeling (SEM) builder was popular at both meetings and I wanted to show you how easy it is to use. If you are not familiar with the basics of SEM, please refer to the references at the end of the blog. My goal is simply to show you how to use the SEM builder assuming that you already know something about SEM. If you would like to view a video demonstration of the SEM builder, please click the play button below:</p>
<p><iframe src="http://www.youtube.com/embed/Xj0gBlqwYHI" height="315" width="560" allowfullscreen="" frameborder="0"></iframe></p>
<p>The data used here and for the silly examples in my talk were simulated to resemble one of the most commonly used measures of depression: the Beck Depression Inventory (BDI). If you find these data too silly or not relevant to your own research, you could instead imagine it being a set of questions to measure mathematical ability, the ability to use a statistical package, or whatever you wanted.</p>
<p><b>The Beck Depression Inventory</b></p>
<p>Originally published by Aaron Beck and colleagues in 1961, the BDI marked an important change in the conceptualization of depression from a psychoanalytic perspective to a cognitive/behavioral perspective. It was also a landmark in the measurement of depression shifting from lengthy, expensive interviews with a psychiatrist to a brief, inexpensive questionnaire that could be scored and quantified. The original inventory consisted of 21 questions each allowing ordinal responses of increasing symptom severity from 0-3. The sum of the responses could then be used to classify a respondent&#8217;s depressive symptoms as none, mild, moderate or severe. Many studies have demonstrated that the BDI has good psychometric properties such as high test-retest reliability and the scores correlate well with the assessments of psychiatrists and psychologists. The 21 questions can also be grouped into two subscales. The affective scale includes questions like &#8220;I feel sad&#8221; and &#8220;I feel like a failure&#8221; that quantify emotional symptoms of depression. The somatic or physical scale includes questions like &#8220;I have lost my appetite&#8221; and &#8220;I have trouble sleeping&#8221; that quantify physical symptoms of depression. Since its original publication, the BDI has undergone two revisions in response to the American Psychiatric Association&#8217;s (APA) Diagnostic and Statistical Manuals (DSM) and the BDI-II remains very popular.</p>
<p><b>The Stata Depression Inventory</b></p>
<p>Since the BDI is a copyrighted psychometric instrument, I created a fictitious instrument called the &#8220;Stata Depression Inventory&#8221;. It consists of 20 questions each beginning with the phrase &#8220;My statistical software makes me&#8230;&#8221;. The individual questions are listed in the variable labels below.</p>
<pre>. describe qu1-qu20

variable  storage  display    value
 name       type   format     label      variable label
------------------------------------------------------------------------------
qu1         byte   %16.0g     response   ...feel sad
qu2         byte   %16.0g     response   ...feel pessimistic about the future
qu3         byte   %16.0g     response   ...feel like a failure
qu4         byte   %16.0g     response   ...feel dissatisfied
qu5         byte   %16.0g     response   ...feel guilty or unworthy
qu6         byte   %16.0g     response   ...feel that I am being punished
qu7         byte   %16.0g     response   ...feel disappointed in myself
qu8         byte   %16.0g     response   ...feel am very critical of myself
qu9         byte   %16.0g     response   ...feel like harming myself
qu10        byte   %16.0g     response   ...feel like crying more than usual
qu11        byte   %16.0g     response   ...become annoyed or irritated easily
qu12        byte   %16.0g     response   ...have lost interest in other people
qu13        byte   %16.0g     qu13_t1    ...have trouble making decisions
qu14        byte   %16.0g     qu14_t1    ...feel unattractive
qu15        byte   %16.0g     qu15_t1    ...feel like not working
qu16        byte   %16.0g     qu16_t1    ...have trouble sleeping
qu17        byte   %16.0g     qu17_t1    ...feel tired or fatigued
qu18        byte   %16.0g     qu18_t1    ...makes my appetite lower than usual
qu19        byte   %16.0g     qu19_t1    ...concerned about my health
qu20        byte   %16.0g     qu20_t1    ...experience decreased libido</pre>
<p>The responses consist of a 5-point Likert scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree). Questions 1-10 form the affective scale of the inventory and questions 11-20 form the physical scale. Data were simulated for 1000 imaginary people and included demographic variables such as age, sex and race. The responses can be summarized succinctly in a matrix of bar graphs:</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure1_BarGraphs.png"><img class="aligncenter size-full wp-image-1296" title="Figure1_BarGraphs" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure1_BarGraphs.png" width="738" height="537" /></a></p>
<p><b>Classical statistical analysis</b></p>
<p>The beginning of a classical statistical analysis of these data might consist of summing the responses for questions 1-10 and referring to them as the &#8220;Affective Depression Score&#8221; and summing questions 11-20 and referring to them as the &#8220;Physical Depression Score&#8221;.</p>
<pre>egen Affective = rowtotal(qu1-qu10)
label var Affective "Affective Depression Score"
egen Physical = rowtotal(qu11-qu20)
label var Physical "Physical Depression Score"</pre>
<p>We could be more sophisticated and use principal components to create the affective and physical depression score:</p>
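<pre>* alternative to the sum scores above: drop them first if they exist
capture drop Affective
capture drop Physical
pca qu1-qu20, components(2)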
predict Affective Physical
label var Affective "Affective Depression Score"
label var Physical "Physical Depression Score"</pre>
<p>We could then ask questions such as &#8220;Are there differences in affective and physical depression scores by sex?&#8221; and test these hypotheses using multivariate statistics such as Hotelling&#8217;s T-squared statistic. The problem with this analysis strategy is that it treats the depression scores as though they were measured without error and can lead to inaccurate p-values for our test statistics.</p>
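<p>As a sketch of that test, assuming the two scores created above and a two-category grouping variable named sex (the name used later in this post):</p>
<pre>* two-group Hotelling's T-squared test on both scores at once
hotelling Affective Physical, by(sex)</pre>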
<p><b>Structural equation modeling</b></p>
<p>Structural equation modeling (SEM) is an ideal way to analyze data where the outcome of interest is a scale or scales derived from a set of measured variables. The affective and physical scores are treated as latent variables in the model, resulting in more accurate p-values and, best of all, these models are very easy to fit using Stata! We begin by selecting the SEM builder from the Statistics menu:</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure2_MenuPhoto.png"><img class="aligncenter size-full wp-image-1298" title="Figure2_MenuPhoto" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure2_MenuPhoto.png" width="675" height="667" /></a></p>
<p>In the SEM builder, we can select the &#8220;Add Measurement Component&#8221; icon:</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure3_AddMeasurementComponentIcon.png"><img class="aligncenter size-full wp-image-1299" title="Figure3_AddMeasurementComponentIcon" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure3_AddMeasurementComponentIcon.png" width="334" height="339" /></a></p>
<p>which will open the following dialog box:</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure4_MeasurementComponentDialogBox1.png"><img class="aligncenter size-full wp-image-1300" title="Figure4_MeasurementComponentDialogBox1" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure4_MeasurementComponentDialogBox1.png" width="417" height="378" /></a></p>
<p>In the box labeled &#8220;Latent Variable Name&#8221; we can type &#8220;Affective&#8221; (red arrow below) and we can select the variables qu1-qu10 in the &#8220;Measured variables&#8221; box (blue arrow below).</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure5_MeasurementComponentDialogBox2.png"><img class="aligncenter size-full wp-image-1301" title="Figure5_MeasurementComponentDialogBox2" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure5_MeasurementComponentDialogBox2.png" width="423" height="384" /></a></p>
<p>When we click &#8220;OK&#8221;, the affective measurement component appears in the builder:</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure6_AffectiveComponent.png"><img class="aligncenter size-full wp-image-1302" title="Figure6_AffectiveComponent" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure6_AffectiveComponent.png" width="676" height="247" /></a></p>
<p>We can repeat this process to create a measurement component for our physical depression scale (images not shown). We can also allow for covariance/correlation between our affective and physical depression scales using the &#8220;Add Covariance&#8221; icon on the toolbar (red arrow below).</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure7_AddCovarianceTool.png"><img class="aligncenter size-full wp-image-1303" title="Figure7_AddCovarianceTool" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure7_AddCovarianceTool.png" width="676" height="393" /></a></p>
<p>I&#8217;ll omit the intermediate steps to build the full model shown below but it&#8217;s easy to use the &#8220;Add Observed Variable&#8221; and &#8220;Add Path&#8221; icons to create the full model:</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure8_FullModel.png"><img class="aligncenter size-full wp-image-1304" title="Figure8_FullModel" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure8_FullModel.png" width="676" height="426" /></a></p>
<p>Now we&#8217;re ready to estimate the parameters for our model. To do this, we click the &#8220;Estimate&#8221; icon on the toolbar (duh!):</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure9_EstimateButton.png"><img class="aligncenter size-full wp-image-1305" title="Figure9_EstimateButton" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure9_EstimateButton.png" width="518" height="105" /></a></p>
<p>And the following dialog box appears:</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure10_EstimationDialog.png"><img class="aligncenter size-full wp-image-1306" title="Figure10_EstimationDialog" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure10_EstimationDialog.png" width="533" height="583" /></a></p>
<p>Let&#8217;s ignore the estimation options for now and use the default settings. Click &#8220;OK&#8221; and the parameter estimates will appear in the diagram:</p>
<p><a href="http://blog.stata.com/wp-content/uploads/2012/10/Figure11_FullModelEstimated.png"><img class="aligncenter size-full wp-image-1307" title="Figure11_FullModelEstimated" alt="" src="http://blog.stata.com/wp-content/uploads/2012/10/Figure11_FullModelEstimated.png" width="676" height="421" /></a></p>
<p>Some of the parameter estimates are difficult to read in this form but it is easy to rearrange the placement and formatting of the estimates to make them easier to read.</p>
<p>If we look at Stata&#8217;s output window and scroll up, we&#8217;ll see that the SEM Builder automatically generated the command for our model:</p>
<pre>sem (Affective -&gt; qu1) (Affective -&gt; qu2) (Affective -&gt; qu3)
    (Affective -&gt; qu4) (Affective -&gt; qu5) (Affective -&gt; qu6)
    (Affective -&gt; qu7) (Affective -&gt; qu8) (Affective -&gt; qu9)
    (Affective -&gt; qu10) (Physical -&gt; qu11) (Physical -&gt; qu12)
    (Physical -&gt; qu13) (Physical -&gt; qu14) (Physical -&gt; qu15)
    (Physical -&gt; qu16) (Physical -&gt; qu17) (Physical -&gt; qu18)
    (Physical -&gt; qu19) (Physical -&gt; qu20) (sex -&gt; Affective)
    (sex -&gt; Physical), latent(Affective Physical) cov(e.Physical*e.Affective)</pre>
<p>We can gather terms and abbreviate some things to make the command much easier to read:</p>
<pre>sem (Affective -&gt; qu1-qu10) ///
    (Physical -&gt; qu11-qu20) /// 
    (sex -&gt; Affective Physical) ///
    , latent(Affective Physical ) ///
    cov( e.Physical*e.Affective)</pre>
<p>We could then calculate a Wald statistic to test the null hypothesis that there is no association between sex and our affective and physical depression scales.</p>
<pre>test sex

 ( 1)  [Affective]sex = 0
 ( 2)  [Physical]sex = 0

           chi2(  2) =    2.51
         Prob &gt; chi2 =    0.2854</pre>
<p><b>Final thoughts</b><br />
This is an admittedly oversimplified example – we haven&#8217;t considered the fit of the model or considered any alternative models. We have only included one dichotomous independent variable. We might prefer to use a likelihood ratio test or a score test. Those are all very important issues and should not be ignored in a proper data analysis. But my goal was to demonstrate how easy it is to use Stata&#8217;s SEM builder to model data such as those arising from the Beck Depression Inventory. Incidentally, if these data were collected using a complex survey design, it would not be difficult to incorporate the sampling structure and sample weights into the analysis. Missing data can be handled easily as well using Full Information Maximum Likelihood (FIML) but those are topics for another day.</p>
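<p>As a starting point for two of those follow-ups, here is a minimal sketch of checking model fit and of a likelihood-ratio test for sex. It assumes the dataset linked below loads as-is and does not already contain variables named Affective or Physical; the stored-estimate names full and nosex are arbitrary.</p>
<pre>use http://stata.com/meeting/sandiego12/materials/Huber_2012SanDiego.dta, clear

* full model, as fit above
sem (Affective -&gt; qu1-qu10) ///
    (Physical -&gt; qu11-qu20) ///
    (sex -&gt; Affective Physical) ///
    , latent(Affective Physical) ///
    cov(e.Physical*e.Affective)
estat gof, stats(all)
estimates store full

* same model with both paths from sex constrained to zero
sem (Affective -&gt; qu1-qu10) ///
    (Physical -&gt; qu11-qu20) ///
    (Affective &lt;- sex@0) ///
    (Physical &lt;- sex@0) ///
    , latent(Affective Physical) ///
    cov(e.Physical*e.Affective)
estimates store nosex

* likelihood-ratio test of the two constraints
lrtest full nosex</pre>
<p>Constraining the paths to zero, rather than dropping sex from the command, keeps the same set of observed variables in both models so that the two likelihoods are comparable. For data with missing values, <b>sem</b>&#8217;s method(mlmv) option requests the FIML-style estimator mentioned above.</p>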
<p>If you would like to view the slides from my talk, download the data used in this example or view a video demonstration of Stata’s SEM builder using these data, please use the links below. For the dataset, you can also type <b>use</b> followed by the URL for the data to load it directly into Stata.</p>
<p>Slides:<br />
<a href="http://stata.com/meeting/sandiego12/materials/sd12_huber.pdf">http://stata.com/meeting/sandiego12/materials/sd12_huber.pdf</a></p>
<p>Data:<br />
<a href="http://stata.com/meeting/sandiego12/materials/Huber_2012SanDiego.dta">http://stata.com/meeting/sandiego12/materials/Huber_2012SanDiego.dta</a></p>
<p>YouTube video demonstration:<br />
<a href="http://www.youtube.com/watch?v=Xj0gBlqwYHI&amp;feature=plcp">http://www.youtube.com/watch?v=Xj0gBlqwYHI</a></p>
<p><b>References</b></p>
<p>Beck AT, Ward CH, Mendelson M, Mock J, Erbaugh J (June 1961). An inventory for measuring depression. Arch. Gen. Psychiatry 4 (6): 561–71.</p>
<p>Beck AT, Ward C, Mendelson M (1961). Beck Depression Inventory (BDI). Arch Gen Psychiatry 4 (6): 561–571</p>
<p>Beck AT, Steer RA, Ball R, Ranieri W (December 1996). Comparison of Beck Depression Inventories -IA and -II in psychiatric outpatients. Journal of Personality Assessment 67 (3): 588–97</p>
<p>Bollen, KA. (1989). Structural Equations With Latent Variables. New York, NY: John Wiley and Sons</p>
<p>Kline, RB (2011). Principles and Practice of Structural Equation Modeling. New York, NY: Guilford Press</p>
<p>Raykov, T &amp; Marcoulides, GA (2006). A First Course in Structural Equation Modeling. Mahwah, NJ: Lawrence Erlbaum</p>
<p>Schumacker, RE &amp; Lomax, RG (2012) A Beginner&#8217;s Guide to Structural Equation Modeling, 3rd Ed. New York, NY: Routledge</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2012/10/17/using-statas-sem-features-to-model-the-beck-depression-inventory/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Stata YouTube channel announced!</title>
		<link>http://blog.stata.com/2012/09/26/stata-youtube-channel-announced/</link>
		<comments>http://blog.stata.com/2012/09/26/stata-youtube-channel-announced/#comments</comments>
		<pubDate>Wed, 26 Sep 2012 19:56:18 +0000</pubDate>
		<dc:creator>Chuck Huber, Senior Statistician</dc:creator>
				<category><![CDATA[Company]]></category>
		<category><![CDATA[tutorials]]></category>
		<category><![CDATA[videos]]></category>
		<category><![CDATA[YouTube]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1277</guid>
		<description><![CDATA[StataCorp now provides free tutorial videos on StataCorp&#8217;s YouTube channel, http://www.youtube.com/user/statacorp There are 24 videos providing 1 hour 51 minutes of instructional entertainment: Stata Quick Tour (5:47) Stata Quick Help (2:47) Stata PDF Documentation (6:37) Stata One-sample t-test (3:43) Stata t-test for Two Independent Samples (5:09) Stata t-test for Two Paired Samples (4:42) Stata Simple [...]]]></description>
				<content:encoded><![CDATA[<p>StataCorp now provides free tutorial videos on StataCorp&#8217;s YouTube channel,</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/user/statacorp">http://www.youtube.com/user/statacorp</a></p>
<p>There are 24 videos providing 1 hour 51 minutes of instructional entertainment:</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=L8iIj_8lhRc">Stata Quick Tour</a> (5:47)<br />
<a href="http://www.youtube.com/watch?v=UpXNMeTzmuI">Stata Quick Help</a> (2:47)<br />
<a href="http://www.youtube.com/watch?v=KPHxC-HyrMk">Stata PDF Documentation</a> (6:37)</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=HwzCyqW-0dc">Stata One-sample t-test</a> (3:43)<br />
<a href="http://www.youtube.com/watch?v=by4c3h3WXQc">Stata t-test for Two Independent Samples</a> (5:09)<br />
<a href="http://www.youtube.com/watch?v=GiDSnufmZgI">Stata t-test for Two Paired Samples</a> (4:42)</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=HafqFSB9x70">Stata Simple Linear Regression</a> (5:33)</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=Xj0gBlqwYHI">Stata SEM Builder</a> (8:09)<br />
<a href="http://www.youtube.com/watch?v=XEFGGkFRdD4">Stata One-way ANOVA</a> (5:15)<br />
<a href="http://www.youtube.com/watch?v=3g1Yj7Vd0mE">Stata Two-way ANOVA</a> (5:57)</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=o7ko844ff-g">Stata Pearson’s Correlation Coefficient</a> (3:29)<br />
<a href="http://www.youtube.com/watch?v=DBsMPZqJj-o">Stata Pearson’s Chi2 and Fisher’s Exact Test</a> (3:16)</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=y6dngL80xuo">Stata Box Plots</a> (4:04)<br />
<a href="http://www.youtube.com/watch?v=GhVGpe3lb3E">Stata Basic Scatterplots</a> (5:19)<br />
<a href="http://www.youtube.com/watch?v=jNjAdtQwW6M">Stata Bar Graphs</a> (4:15)<br />
<a href="http://www.youtube.com/watch?v=nPqNZVToGx8">Stata Histograms</a> (4:50)<br />
<a href="http://www.youtube.com/watch?v=T_skwxG4sTk">Stata Pie Charts</a> (5:32)</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=kKFbnEWwa2s">Stata Descriptive Statistics</a> (5:49)</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=3WpMRtTNZsw">Stata Tables and Crosstabulations</a> (7:20)<br />
<a href="http://www.youtube.com/watch?v=Dzg6AMSt10w">Stata Combining Crosstabs and Descriptives</a> (5:58)</p>
<p style="padding-left: 30px;"><a href="http://www.youtube.com/watch?v=3zp2byhr2GI">Stata Converting Data to Stata with Stat/Transfer</a> (2:47)<br />
<a href="http://www.youtube.com/watch?v=N5ZFgzN2_7c">Stata Import Excel Data</a> (1:33)<br />
<a href="http://www.youtube.com/watch?v=iCvZ9pvPy-8">Stata Excel Copy/Paste</a> (1:16)<br />
<a href="http://www.youtube.com/watch?v=_qb-qEkd-_c">Stata Example Data Included with Stata</a> (2:14)</p>
<p>And more are forthcoming.</p>
<p>&nbsp;</p>
<p><b>The inside story</b></p>
<p>Alright, that&#8217;s the official announcement.</p>
<p>Last Friday, 21 September 2012, was an exciting day here at StataCorp. After a couple of years of &#8220;wouldn&#8217;t it be cool if&#8221;, and a couple of months of &#8220;we&#8217;re almost there&#8221;, Stata&#8217;s YouTube channel was finally ready for prime time.</p>
<p>Stata&#8217;s YouTube Channel was the brainchild of Karen Strope, StataCorp&#8217;s Director of Marketing, but I had something to do with it, too. Well, maybe more than something, but I&#8217;m a modest guy. Anyway, I thought it sounded like fun and recorded a few prototype videos. Annette Fett, StataCorp&#8217;s Graphic Designer, added the cool splash-screen and after a few experiments, we soon had 24 Blu-ray resolution videos. We&#8217;ve kicked off with videos covering topics such as a tour of Stata&#8217;s interface, how to create basic graphs, how to conduct many common statistical analyses, and more.</p>
<p>My personal favorite is the video entitled <i><a href="http://www.youtube.com/watch?v=YFPls-OpWZw">Combining Crosstabs and Descriptives</a></i> because it&#8217;s relevant to nearly all Stata users and works well as a video demonstration.</p>
<p><iframe src="http://www.youtube.com/embed/Dzg6AMSt10w" height="315" width="560" allowfullscreen="" frameborder="0"></iframe></p>
<p><b>Videos about Stata &#8211; isn&#8217;t that like dancing about architecture?</b></p>
<p>Stata has over 9,000 pages of documentation included in PDF format, a built-in Help system, a collection of books on general and special topics published by <i><a href="http://www.stata-press.com">Stata Press</a></i>, and an extensive collection of dialog boxes that make even the most complex graphs and analyses easy to perform.</p>
<p>So aren&#8217;t the videos, ahh, unnecessary?</p>
<p>The problem is, it&#8217;s cumbersome to describe how to use all of Stata&#8217;s features, especially dialog boxes, in a manual, even when you have 9,000 pages, and 9,000 pages tries even the most dedicated user&#8217;s patience.</p>
<p>In a 3-7 minute video, we can show you how to create complicated graphs or a sophisticated structural equation model.</p>
<p>We have three audiences in mind.</p>
<ol>
<li>Videos for non-Stata users, whom we call future Stata users; videos intended to provide a loosely guided tour of Stata&#8217;s features.</li>
<li>Videos for new Stata users, such as the person who might simply want to know &#8220;How do I calculate a two-way ANOVA in Stata?&#8221; or &#8220;How do I create a Pie Chart?&#8221;. These videos will get them up and running quickly and painlessly.</li>
<li>Videos for experienced Stata users who want to learn new tips and tricks.</li>
</ol>
<p>There&#8217;s actually a fourth group that&#8217;s of interest, too: experienced Stata users teaching statistics or data analysis classes, who don&#8217;t want to spend valuable class time showing their students how to use Stata. They can refer their students to the relevant videos as homework and thus free class time for the teaching of statistics.</p>
<p><b>Request for comments</b></p>
<p>One of the fun things about working at StataCorp is that management doesn&#8217;t much use the word &#8220;no&#8221;. New ideas are more often met with the phrase, &#8220;well, let&#8217;s try it and see what happens&#8221;. So I&#8217;m trying this. My plan is to add a couple of videos to the channel every week or two as time permits. I have a list of topics I&#8217;d like to cover including things like multiple imputation, survey analysis, mixed models, Stata&#8217;s &#8220;immediate&#8221; commands (tabi, ttesti, csi, cci, etc&#8230;), and more examples using the SEM Builder.</p>
<p>However, I will take requests. If you have a suggested topic for a future video, leave a comment.</p>
<p>I&#8217;d like to keep the videos brief, between 3 and 7 minutes, so please don&#8217;t request feature-length films like &#8220;How to do survival analysis in Stata&#8221;. Similarly, topics that are only interesting to you and your two post-docs such as &#8220;Please describe the difference between the Laplacian Approximation and Adaptive Gauss-Hermite Quadrature in the <b>xtmepoisson</b> command&#8221; are not likely to see the light of day. But I am very interested in your ideas for small, bite-sized topics that will work in a video format.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2012/09/26/stata-youtube-channel-announced/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Using Stata&#8217;s random-number generators, part 3, drawing with replacement</title>
		<link>http://blog.stata.com/2012/08/29/using-statas-random-number-generators-part-3-drawing-with-replacement/</link>
		<comments>http://blog.stata.com/2012/08/29/using-statas-random-number-generators-part-3-drawing-with-replacement/#comments</comments>
		<pubDate>Wed, 29 Aug 2012 18:08:11 +0000</pubDate>
		<dc:creator>William Gould, President</dc:creator>
				<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[random numbers]]></category>
		<category><![CDATA[runiform()]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1256</guid>
		<description><![CDATA[The topic for today is drawing random samples with replacement. If you haven&#8217;t read part 1 and part 2 of this series on random numbers, do so. In the series we&#8217;ve discussed that Stata&#8217;s runiform() function produces random numbers over the range [0,1). To produce such random numbers, type . generate double u = runiform() [...]]]></description>
				<content:encoded><![CDATA[<p>The topic for today is drawing random samples with replacement.  If you haven&#8217;t read <a href="http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/">part 1</a> and <a href="http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/">part 2</a> of this series on random numbers, do so.  In the series we&#8217;ve discussed that</p>
<ol>
<li> Stata&#8217;s <b>runiform()</b> function produces random numbers over the range [0,1).  To produce such random numbers, type
<pre>
. generate double u = runiform()
</pre>
<p>&nbsp;</p>
<li> To produce <i>continuous</i> random numbers over [<i>a</i>,<i>b</i>), type
<pre>
. generate double u = (<i>b</i>-<i>a</i>)*runiform() + <i>a</i>
</pre>
<p>&nbsp;</p>
<li> To produce <i>integer</i> random numbers over [<i>a</i>,<i>b</i>], type
<pre>
. generate ui = floor((<i>b</i>-<i>a</i>+1)*runiform() + <i>a</i>)
</pre>
<p>If <i>b</i> &gt; 16,777,216, type</p>
<pre>
. generate long ui = floor((<i>b</i>-<i>a</i>+1)*runiform() + <i>a</i>)
</pre>
<p>&nbsp;</p>
<li> To place observations in random order &#8212; to shuffle observations &#8212; type
<pre>
. set seed <i>#</i>
. generate double u = runiform()
. sort u
</pre>
<p>&nbsp;</p>
<li> To draw <i>without</i> replacement a random sample of <i>n</i> observations from a dataset of <i>N</i> observations, type
<pre>
. set seed <i>#</i>
. sort <i>variables_that_put_dta_in_unique_order</i>
. generate double u = runiform()
. sort u
. keep in 1/<i>n</i>
</pre>
<p>If <i>N</i>&gt;1,000, generate two random variables u1 and u2 in place of u, and substitute <b>sort u1 u2</b> for <b>sort u</b>.</p>
<p>&nbsp;</p>
<li> To draw <i>without</i> replacement a <i>P</i>-percent random sample, type
<pre>
. set seed <i>#</i>
. keep if runiform() &lt;= <i>P</i>/100
</pre>
</ol>
<p>I&#8217;ve glossed over details, but the above is the gist of it.</p>
<p>Today I&#8217;m going to tell you</p>
<ol>
<li> To draw a random sample of size <i>n</i> <i>with</i> replacement from a dataset of size <i>N</i>, type
<pre>
. set seed <i>#</i>
. drop _all
. set obs <i>n</i>
. generate long obsno = floor(<i>N</i>*runiform()+1)
. sort obsno
. save obsnos_to_draw

. use <i>your_dataset</i>, clear
. generate long obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(match) nogen
</pre>
<p>&nbsp;</p>
<li> You need to set the random-number seed only if you care about reproducibility.  I&#8217;ll also mention that if <i>N</i> &le; 16,777,216, it is not necessary to specify that new variable obsno be stored as <b>long</b>; the default <b>float</b> will be sufficient.
<p>&nbsp;<br />
The above solution works whether <i>n</i>&lt;<i>N</i>, <i>n</i>=<i>N</i>, or <i>n</i>&gt;<i>N</i>.</p>
</ol>
<p>&nbsp;</p>
<p><b>Drawing samples with replacement</b></p>
<p>The solution to sampling with replacement <i>n</i> observations from a dataset of size <i>N</i> is</p>
<ol>
<li> Draw <i>n</i> observation numbers 1, &#8230;, <i>N</i> with replacement.  For instance, if <i>N</i>=4 and <i>n</i>=3, we might draw observation numbers 1, 3, and 3.
<p></p>
<li> Select those observations from the dataset of interest.  For instance, select observations 1, 3, and 3.
</ol>
<p>As previously discussed in <a href="http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/">part 1</a>, to generate random integers drawn with replacement over the range [<i>a</i>, <i>b</i>], use the formula</p>
<p style="padding-left:30px;">
<b>generate</b> <i>varname</i> <b>=</b> <b>floor((</b><i>b</i><b>-</b><i>a</i><b>+1)*runiform() +</b> <i>a</i><b>)</b>
</p>
<p>In this case, we want <i>a</i>=1 and <i>b</i>=<i>N</i>, and the formula reduces to,</p>
<p style="padding-left:30px;">
<b>generate</b> <i>varname</i> <b>=</b> <b>floor(</b><i>N</i><b>*runiform() + 1)</b>
</p>
<p>So the first half of our solution could read</p>
<pre>
. drop _all
. set obs <i>n</i>
. generate obsno = floor(<i>N</i>*runiform() + 1)
</pre>
<p>Now we are merely left with the problem of selecting those observations from our dataset, which we can do using <b>merge</b> by typing</p>
<pre>
. sort obsno
. save obsnos_to_draw
. use <i>dataset_of_interest</i>, clear
. generate obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(match) nogen
</pre>
<p>Let&#8217;s do an example.  In <a href="http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/">part 2</a> of this series, I had a dataset with observations corresponding to playing cards:</p>
<pre>
. <b>use cards</b>

. <b>list in 1/5</b>

     +-------------+
     | rank   suit |
     |-------------|
  1. |  Ace   Club |
  2. |    2   Club |
  3. |    3   Club |
  4. |    4   Club |
  5. |    5   Club |
     +-------------+
</pre>
<p>There are 52 observations in the dataset; I&#8217;m showing you just the first five.  Let&#8217;s draw 10 cards from the deck, but with replacement.</p>
<p>The first step is to draw the observation numbers.  We have <i>N</i>=52 cards in the deck, and we want to draw <i>n</i>=10, so we generate 10 random integers from the integers [1, 52]:</p>
<pre>
. <b>drop _all</b>

. <b>set obs 10</b>                            // we want <i>n</i>=10
obs was 0, now 10

. <b>gen obsno = floor(52*runiform()+1)</b>    // we draw from <i>N</i>=52


. <b>list obsno</b>                            // let's see what we have

     +-------+
     | obsno |
     |-------|
  1. |    42 |
  2. |    52 |
  3. |    16 |
  4. |     9 |
  5. |    40 |
     |-------|
  6. |    11 |
  7. |    34 |
  8. |    20 |
  9. |    49 |
 10. |    42 |
     +-------+</pre>
<p>If you look carefully at the list, you will see that observation number 42 repeats.  It will be easier to see the duplicate if we sort the list, </p>
<pre>
. <b>sort obsno</b>
. <b>list</b>

     +-------+
     | obsno |
     |-------|
  1. |     9 |
  2. |    11 |
  3. |    16 |
  4. |    20 |
  5. |    34 |
     |-------|
  6. |    40 |
  7. |    42 |     &lt;- <i>Obs. 42 repeats</i>
  8. |    42 |     &lt;- <i>See?</i>
  9. |    49 |
 10. |    52 |
     +-------+</pre>
<p>An observation didn&#8217;t have to repeat, but it&#8217;s not surprising that one did because in drawing <i>n</i>=10 from <i>N</i>=52, we would expect one or more repeated cards about 60% of the time.</p>
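<p>That 60% figure can be checked with a one-line calculation. As a sketch (the expression is an illustration, not part of the session above): the probability that all 10 draws are distinct is 52!/(42!&times;52<sup>10</sup>), so the probability of at least one repeat is one minus that. Working in logs with <b>lnfactorial()</b> keeps the intermediate values manageable:</p>
<pre>
. <b>display 1 - exp(lnfactorial(52) - lnfactorial(42) - 10*ln(52))</b>
</pre>
<p>The result should be close to 0.60.</p>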
<p>Anyway, we now know which cards we want, namely cards 9, 11, 16, 20, 34, 40, 42, 42 (again), 49, and 52.</p>
<p>The final step is to select those observations from cards.dta.  The way to do that is to perform a one-to-many merge of cards.dta with the list above and keep the matches.  Before we can do that, however, we must (1) save the list of observation numbers as a dataset, (2) load cards.dta, and (3) add a variable called <b>obsno</b> to it.  Then we will be able to perform the merge. So let&#8217;s get that out of the way,</p>
<pre>
. <b>save obsnos_to_draw</b>                // 1. save the list above
file obsnos_to_draw.dta saved

. <b>use cards</b>                          // 2. load cards.dta

. <b>gen obsno = _n</b>                     // 3.  Add variable obsno to it</pre>
<p>Now we can perform the merge:</p>
<pre>
. <b>merge 1:m obsno using obsnos_to_draw, keep(matched) nogen</b>

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                10
    -----------------------------------------
</pre>
<p>I&#8217;ll list the result, but let me first briefly explain the command</p>
<p style="padding-left:30px;">
<b>merge 1:m obsno using obsnos_to_draw, keep(matched) nogen</b></p>
<p style="padding-left:60px;"><b>merge</b> &#8230;, we are performing the <b>merge</b> command,
</p>
<p style="padding-left:60px;"> &#8230; <b>1:m</b> &#8230;, the merge is one-to-many,
</p>
<p style="padding-left:60px;">
&#8230; <b>using obsnos_to_draw</b> &#8230;, we merge data in memory with obsnos_to_draw.dta,
</p>
<p style="padding-left:60px;">
&#8230;<b>, keep(matched)</b> &#8230;, we keep observations that appear in both datasets,
</p>
<p style="padding-left:60px;">
&#8230; <b>nogen</b>, do not add variable _merge to the resulting dataset; _merge reports the source of the resulting observations; we said <b>keep(matched)</b> so we know each came from both sources.
</p>
<p>And here is the result:</p>
<pre>
. list

     +-------------------------+
     |  rank      suit   obsno |
     |-------------------------|
  1. |     8      Club       9 |
  2. |  Jack      Club      11 |
  3. |   Ace     Spade      16 |
  4. |     2   Diamond      20 |
  5. |     6     Spade      34 |
     |-------------------------|
  6. |     8     Spade      40 |
  7. |     9     Heart      42 |   &lt;- <i>Obs. 42 is here ...</i>
  8. | Queen     Spade      49 |
  9. |  King     Spade      52 |
 10. |     9     Heart      42 |   &lt;- <i>and here</i>
     +-------------------------+
</pre>
<p>We drew 10 cards &#8212; those are the observation numbers on the left.  Variable obsno in our dataset records the original observation (card) number and really, we no longer need the variable.  Anyway, obsno==42 appears twice, in real observations 7 and 10, and thus we drew the 9 of Hearts twice.</p>
<p>&nbsp;</p>
<p><b>What could go wrong?</b></p>
<p>Not much can go wrong, it turns out.  At this point, our generic solution is</p>
<pre>
. drop _all
. set obs <i>n</i>
. generate obsno = floor(<i>N</i>*runiform()+1)
. sort obsno
. save obsnos_to_draw

. use <i>our_dataset</i>
. gen obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(matched) nogen
</pre>
<p>If you study this code, there are two lines that might cause problems,</p>
<pre>
. generate obsno = floor(<i>N</i>*runiform()+1)
</pre>
<p>and</p>
<pre>
. generate obsno = _n
</pre>
<p>When you are looking for problems and see a <b>generate</b> or <b>replace</b>, think about rounding.</p>
<p>Let&#8217;s look at the right-hand side first.  Both calculations produce integers over the range [1, <i>N</i>].  <b>generate</b> performs all calculations in <b>double</b> and the largest integer that can be stored without rounding is 9,007,199,254,740,992 (see previous blog post on <a href="http://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/">precision</a>).  Stata allows datasets up to 2,147,483,646 observations, so we can be sure that <i>N</i> is less than the maximum precise-integer double.  There are no rounding issues on the right-hand side.</p>
<p>Next let&#8217;s look at the left-hand side.  Variable obsno is being stored as a <b>float</b> because we did not instruct otherwise.  The largest integer value that can be stored without rounding as a float (also covered in previous blog post on <a href="http://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/">precision</a>) is 16,777,216, and that is less than Stata&#8217;s 2,147,483,646 maximum observations.  When <i>N</i> exceeds 16,777,216, the solution is to store obsno as a <b>long</b>.  We could remember to use <b>long</b> on the rare occasion when dealing with such large datasets, but I&#8217;m going to change the generic solution to use <b>long</b>s in all cases, even when it&#8217;s unnecessary.</p>
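<p>A quick way to see the float limit, as a sketch using the <b>float()</b> function, which rounds its argument to float precision:</p>
<pre>
. display %12.0f float(16777216)
. display %12.0f float(16777217)
</pre>
<p>Both lines should display 16777216, because 16,777,217 cannot be represented exactly as a float.</p>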
<p>What else could go wrong?  Well, we tried an example with <i>n</i>&lt;<i>N</i> and that seemed to work.  We should now try examples with <i>n</i>=<i>N</i> and <i>n</i>&gt;<i>N</i> to verify there&#8217;s no hidden bug or assumption in our code.  I&#8217;ve tried examples of both and the code works fine.</p>
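<p>Here is a sketch of one such check for the <i>n</i>&gt;<i>N</i> case, assuming cards.dta with its 52 observations is in the current directory.  The seed value and the two <b>assert</b> lines are illustrative additions, and <b>replace</b> is added to <b>save</b> only so that the sketch can be rerun:</p>
<pre>
. set seed 12345
. drop _all
. set obs 100                                  // n=100 exceeds N=52
. generate long obsno = floor(52*runiform()+1)
. sort obsno
. save obsnos_to_draw, replace

. use cards, clear
. generate long obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(match) nogen
. assert _N == 100                             // all n draws are present
. assert inrange(obsno, 1, 52)                 // every draw is a valid card
</pre>
<p>Changing 100 to 52 in both places gives the corresponding check for <i>n</i>=<i>N</i>.</p>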
<p>&nbsp;</p>
<p><b>We&#8217;re done for today</b></p>
<p>That&#8217;s it. Drawing samples with replacement turns out to be easy, and that shouldn&#8217;t surprise us because we have a random-number generator that draws with replacement.</p>
<p>We could complicate the discussion and consider solutions that would run a bit more efficiently when <i>n</i>=<i>N</i>, which is of special interest in statistics because it is a key ingredient in bootstrapping, but we will not.  The above solution works fine in the <i>n</i>=<i>N</i> case, and I always advise researchers to favor simple-even-if-slower solutions because they will probably save you time.  Writing complicated code takes longer than writing simple code, and testing complicated code takes even longer.  I know because that&#8217;s what we do at StataCorp. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2012/08/29/using-statas-random-number-generators-part-3-drawing-with-replacement/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Using Stata&#8217;s random-number generators, part 2, drawing without replacement</title>
		<link>http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/</link>
		<comments>http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/#comments</comments>
		<pubDate>Fri, 03 Aug 2012 15:20:18 +0000</pubDate>
		<dc:creator>William Gould, President</dc:creator>
				<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[random numbers]]></category>
		<category><![CDATA[runiform()]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1234</guid>
		<description><![CDATA[Last time I told you that Stata&#8217;s runiform() function generates rectangularly (uniformly) distributed random numbers over [0, 1), from 0 to nearly 1, and to be precise, over [0, 0.999999999767169356]. And I gave you two formulas, To generate continuous random numbers between a and b, use generate double u = (b-a)*runiform() + a The random [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/">Last time</a> I told you that Stata&#8217;s <b>runiform()</b> function generates rectangularly (uniformly) distributed random numbers over [0, 1), from 0 to nearly 1, and to be precise, over [0, 0.999999999767169356].  And I gave you two formulas,</p>
<ol>
<li> To generate <i>continuous</i> random numbers between <i>a</i> and <i>b</i>, use
<p></p>
<p style="padding-left:30px">
<b>generate double u =</b> <b>(</b><i>b</i><b>-</b><i>a</i><b>)*runiform() +</b> <i>a</i>
</p>
<p>The random numbers will not actually be between <i>a</i> and <i>b</i>: they will be between <i>a</i> and nearly <i>b</i>, but the top will be so close to <i>b</i>, namely <i>a</i> + 0.999999999767169356*(<i>b</i>-<i>a</i>), that it will not matter.</p>
<li>To generate <i>integer</i> random numbers between <i>a</i> and <i>b</i>, use
<p></p>
<p style="padding-left:30px">
<b>generate ui =</b> <b>floor((</b><i>b</i><b>-</b><i>a</i><b>+1)*runiform() +</b> <i>a</i><b>)</b>
</p>
</ol>
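<p>As a worked instance of the second formula, here is a sketch that simulates 1,000 rolls of a six-sided die, so <i>a</i>=1 and <i>b</i>=6.  The seed, the number of observations, and the variable name are arbitrary choices for illustration:</p>
<pre>
. drop _all
. set obs 1000
. set seed 12345
. generate die = floor((6-1+1)*runiform() + 1)
. assert inrange(die, 1, 6)
. tabulate die
</pre>
<p>Each face should appear in roughly one-sixth of the observations.</p>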
<p>I also mentioned that <b>runiform()</b> can solve a variety of problems, including</p>
<ul>
<li>shuffling data (putting observations in random order),
<li>drawing random samples without replacement (there&#8217;s a minor detail we&#8217;ll have to discuss because <b>runiform()</b> itself produces values drawn <i>with</i> replacement),
<li>drawing random samples with replacement (which is easier to do than most people realize),
<li>drawing stratified random samples (with or without replacement),
<li>manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).
</ul>
<p>Today we will cover shuffling and drawing random samples <i>without</i> replacement &#8212; the first two topics on the list &#8212; and we will leave drawing random samples with replacement for next time.  I&#8217;m going to tell you</p>
<ol>
<li>To place observations in random order &#8212; to shuffle the observations &#8212; type
<pre>
. generate double u = runiform()
. sort u
</pre>
<li>To draw without replacement a random sample of <i>n</i> observations from a dataset of <i>N</i> observations, type
<pre>
. set seed <i>#</i>
. generate double u = runiform()
. sort u
. keep in 1/<i>n</i>
</pre>
<p>I will tell you that there are good statistical reasons for setting the random-number seed even if you don&#8217;t care about reproducibility.</p>
<p>If you do care about reproducibility, I will mention (1) that you need to use <b>sort</b> to put the original data in a known, reproducible order, before you generate the random variate <i>u</i>, and I will explain (2) a subtle issue that leads us to use different code for <i>N</i>&le;1,000 and <i>N</i>&gt;1,000.  The code for <i>N</i>&le;1,000 is</p>
<pre>
. set seed <i>#</i>
. sort <i>variables_that_put_data_in_unique_order</i>
. generate double u = runiform()
. sort u
. keep in 1/<i>n</i>
</pre>
<p>and the code for <i>N</i>&gt;1,000 is</p>
<pre>
. set seed <i>#</i>
. sort <i>variables_that_put_data_in_unique_order</i>
. generate double u1 = runiform()
. generate double u2 = runiform()
. sort u1 u2
. keep in 1/<i>n</i>
</pre>
<p>You can use the <i>N</i>&gt;1,000 code for the  <i>N</i>&le;1,000 case.</p>
<li> To draw without replacement a <i>P</i>-percent random sample, type
<pre>
. set seed <i>#</i>
. keep if runiform() &lt;= <i>P</i>/100
</pre>
<p>There&#8217;s no issue in this case when <i>N</i> is large.</p>
</ol>
<p>As I mentioned, we&#8217;ll discuss drawing random samples <i>with replacement</i> next time.  Today, the topic is random samples <i>without replacement</i>.  Let&#8217;s start.</p>
<p>&nbsp;<br />
<b>Shuffling data</b></p>
<p>I have a deck of 52 cards, in order, the first four of which are</p>
<pre>
. list in 1/4

     +-------------+
     | rank   suit |
     |-------------|
  1. |  Ace   Club |
  2. |    2   Club |
  3. |    3   Club |
  4. |    4   Club |
     +-------------+
</pre>
<p>Well, actually I just have a Stata dataset with observations corresponding to playing cards.  To shuffle the deck &#8212; to place the observations in random order &#8212; type</p>
<pre>
. generate double u = runiform()

. sort u
</pre>
<p>Having done that, here&#8217;s your hand,</p>
<pre>
. list in 1/5

     +----------------------------+
     |  rank      suit          u |
     |----------------------------|
  1. | Queen      Club   .0445188 |
  2. |     5   Diamond   .0580662 |
  3. |     7      Club   .0610638 |
  4. |  King     Heart   .0907986 |
  5. |     6     Spade   .0981878 |
     +----------------------------+
</pre>
<p>and here&#8217;s mine:</p>
<pre>
. list in 6/10

     +---------------------------+
     | rank      suit          u |
     |---------------------------|
  6. |    8   Diamond   .1024369 |
  7. |    5      Club   .1086679 |
  8. |    8     Spade   .1091783 |
  9. |    2     Spade   .1180158 |
 10. |  Ace      Club   .1369841 |
     +---------------------------+
</pre>
<p>All I did was generate random numbers &#8212; one per observation (card) &#8212; and then place the observations in ascending order of the random values.  Doing that is equivalent to shuffling the deck.  I used <b>runiform()</b> random numbers, meaning rectangularly distributed random numbers over [0, 1), but since I&#8217;m only exploiting the random-numbers&#8217; ordinal properties, I could have used random numbers from any continuous distribution.</p>
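<p>For instance, here is a sketch of the same shuffle using standard normal draws from <b>rnormal()</b> instead; the variable name <b>z</b> is arbitrary:</p>
<pre>
. generate double z = rnormal()

. sort z
</pre>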
<p>This simple, elegant, and obvious solution to shuffling data will play an important part in the solution to drawing observations without replacement.  I have already more than hinted at the solution when I showed you your hand and mine.</p>
<p>&nbsp;<br />
<b>Drawing n observations without replacement</b></p>
<p>Drawing without replacement is exactly the same problem as dealing cards.  The solution to the physical card problem is to shuffle the cards and then draw the top cards.  The solution to randomly selecting <i>n</i> from <i>N</i> observations is to put the <i>N</i> observations in random order and keep the first <i>n</i> of them.</p>
<pre>
. use cards, clear

. generate double u = runiform()

. sort u

. keep in 1/5
(47 observations deleted)

. list

     +---------------------------+
     | rank      suit          u |
     |---------------------------|
  1. |  Ace   Diamond   .0064866 |
  2. |    6     Heart   .0087578 |
  3. | King     Spade    .014819 |
  4. |    3     Spade   .0955155 |
  5. | King   Diamond   .1007262 |
     +---------------------------+

. drop u
</pre>
<p>&nbsp;<br />
<b>Reproducibility</b></p>
<p>You might later want to reproduce the analysis, meaning you do not want to draw another random sample, but you want to draw the <i>same</i> random sample.  Perhaps you informally distributed some preliminary results and, of course, then discovered a mistake.  You want to redistribute updated results and show that your mistake didn&#8217;t change results by much, and to drive home the point, you want to use the same samples as you used previously.</p>
<p>Part of the solution is to set the random-number seed.  You might type</p>
<pre>
. set seed 49983882

. use cards, clear

. generate double u = runiform()

. sort u

. keep in 1/5
</pre>
<p>See <b>help set seed</b> in Stata.  As a quick review, when you set the random-number seed, you set Stata&#8217;s random-number generator into a fixed, reproducible state, which is to say, the sequence of random numbers that <b>runiform()</b> produces is a function of the seed.  Set the seed today to the same value as yesterday, and <b>runiform()</b> will produce the same sequence of random numbers today as it did yesterday.  Thus, after setting the seed, if you repeat today exactly what you did yesterday, you will obtain the same results.</p>
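<p>A small sketch makes the point.  The variable names are arbitrary, and the <b>assert</b> is there only to confirm that the two sequences agree:</p>
<pre>
. drop _all
. set obs 5
. set seed 49983882
. generate double u1 = runiform()
. set seed 49983882
. generate double u2 = runiform()
. assert u1 == u2
</pre>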
<p>So imagine that you set the random-number seed today to the value you set it to yesterday and you repeat the above commands.  Even so, you might not get the same results!  You will not get the same results if the observations in cards.dta are in a different order today than they were yesterday.  Setting the seed merely ensures that if yesterday the smallest value of <b>u</b> was in observation 23, it will again be in observation 23 today (and it will be the same value).  If yesterday, however, observation 23 was the 6 of Clubs, and today it&#8217;s the 7 of Hearts, then today you will select the 7 of Hearts in place of the 6 of Clubs.</p>
<p>So make sure the data are in the same order.  One way to do that is to put the dataset in a known order before generating the random values on which you will sort.  For instance,</p>
<pre>
. set seed 49983882

. use cards, clear

. sort rank suit

. generate double u = runiform()

. sort u

. keep in 1/5
</pre>
<p>An even better solution would add the line</p>
<pre>
. by rank suit: assert _N==1
</pre>
<p>just before the <b>generate</b>.  That line would check whether sorting on variables rank and suit uniquely orders the observations.</p>
<p>With cards.dta, you can argue that the <b>assert</b> is unnecessary, but not because you know each rank-suit combination occurs once.  You have only my assurances about that.  I recommend you never trust anyone&#8217;s assurances about data.  In this case, however, you can argue that the <b>assert</b> is unnecessary because we sorted on <i>all the variables in the dataset</i> and thus uniqueness is not required.  Pretend there are two Aces of Clubs in the deck.  Would it matter that the first card was Ace of Clubs followed by Ace of Clubs as opposed to being the other way around?  Of course it would not; the two states are indistinguishable.</p>
<p>So let&#8217;s assume there is another variable in the dataset, say whether there was a grease spot on the back of the card.  Yesterday, after sorting, the ordering might have been,</p>
<pre>
     +---------------------------------+
     | rank   suit   grease          u |
     |---------------------------------|
  1. |  Ace   Club      yes   .6012949 |
  2. |  Ace   Club       no   .1859054 |
     +---------------------------------+
</pre>
<p>and today,</p>
<pre>
     +---------------------------------+
     | rank   suit   grease          u |
     |---------------------------------|
  1. |  Ace   Club       no   .6012949 |
  2. |  Ace   Club      yes   .1859054 |
     +---------------------------------+
</pre>
<p>If yesterday you selected the Ace of Clubs without grease, today you would select the Ace of Clubs with grease.</p>
<p>My recommendation is (1) sort on whatever variables put the data into a unique order, and then verify that, or (2) sort on all the variables in the dataset and then don&#8217;t worry whether the order is unique.</p>
<p>&nbsp;<br />
<b>Ensuring a random ordering</b></p>
<p>Included in our reproducible solution but omitted from our base solution was setting the random-number seed,</p>
<pre>
. set seed 49983882
</pre>
<p>Setting the seed is important even if you don&#8217;t care about reproducibility.  Each time you launch Stata, Stata sets the same random-number seed, namely 123456789, and that means that <b>runiform()</b> generates the same sequence of random numbers, and that means that if you generated all your random samples right after launching Stata, you would always select the same observations, at least holding <i>N</i> constant.</p>
<p>So set the seed, but don&#8217;t set it too often.  You set the seed once per problem.  If I wanted to draw 10,000 random samples from the same data, I could code:</p>
<pre>
  use <i>dataset</i>, clear
  set seed 1702213                   // set the seed once, outside the loop
  sort <i>variables_that_put_data_in_unique_order</i>
  preserve
  forvalues i=1(1)10000 {
          generate double u = runiform()
          sort u
          keep in 1/<i>n</i>
          drop u
          save sample`i', replace
          restore, preserve          // bring back the full data for the next draw
  }
</pre>
<p>In the example I save each sample in a file.  In real life, I seldom (never) save the samples; I perform whatever analysis on the samples I need and save the results, which I usually append into a single dataset.  I don&#8217;t need to save the individual samples because I can recreate them.</p>
<p>&nbsp;<br />
<b>And the result still might not be reproducible &#8230;</b></p>
<p><b>runiform()</b> draws random numbers <i>with</i> replacement.  It is thus possible that two or more observations could have the same random values associated with them.  Well yes, you&#8217;re thinking, I see that it&#8217;s possible, but surely it&#8217;s so unlikely that it just doesn&#8217;t happen.  But it does happen:</p>
<pre>
. clear all

. set obs 100000
obs was 0, now 100000

. generate double u = runiform() 

. by u, sort: assert _N==1
1 contradiction in 99999 by-groups
assertion is false      
r(9); 
</pre>
<p>In the 100,000-observation dataset I just created, I got a duplicate!  By the way, I didn&#8217;t have to look hard for such an example; I got it the first time I tried.</p>
<p>I have three things I want to tell you: </p>
<ol>
<li>Duplicates happen more often than you might guess.
<li>Do not panic about the duplicates.  Because of how Stata is written, duplicates do not lower the quality of the sample selected.  I&#8217;ll explain.
<li>Duplicates do interfere with reproducibility, however, and there is an easy way around that problem.
</ol>
<p>Let&#8217;s start with the chances of observing duplicates.  I mentioned in passing <a href="http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/">last time</a> that <b>runiform()</b> is a 32-bit random-number generator.  That means <b>runiform()</b> can return any of 2<sup>32</sup> values.  Those values are, in order,  </p>
<pre>
          0  =  0
      1/2<sup>32</sup>  =  2.32830643654e-10
      2/2<sup>32</sup>  =  4.65661287308e-10 
      3/2<sup>32</sup>  =  6.98491930962e-10
             .
             .
             .
 (2<sup>32</sup>-2)/2<sup>32</sup>  =  0.9999999995343387
 (2<sup>32</sup>-1)/2<sup>32</sup>  =  0.9999999997671694
</pre>
<p>So what are the chances that, in <i>N</i> draws with replacement from an urn containing these 2<sup>32</sup> values, all values are distinct?  The first draw can be any of the 2<sup>32</sup> values, the second must be one of the remaining 2<sup>32</sup>-1, and so on, so the probability <i>p</i> that all values are distinct is</p>
<pre>
      2<sup>32</sup> * (2<sup>32</sup>-1) * ... * (2<sup>32</sup>-N+1)
p  =  --------------------------------
                  (2<sup>32</sup>)<sup>N</sup>
</pre>
<p>Here are the results for various values of <i>N</i>.  <i>p</i> is the probability that all values are unique, and 1-<i>p</i> is the probability of observing one or more repeated values.</p>
<pre>
------------------------------------
      N         p            1-p
------------------------------------
     50   0.999999715    0.000000285
    500   0.999970955    0.000029045
  1,000   0.999883707    0.000116293
  5,000   0.997094436    0.002905564
 10,000   0.988427154    0.011572846
 50,000   0.747490440    0.252509560
100,000   0.312187988    0.687812012
200,000   0.009498117    0.990501883
300,000   0.000028161    0.999971839
400,000   0.000000008    0.999999992
500,000   0.000000000    1.000000000
------------------------------------
</pre>
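<p>For readers who want to check figures like these outside Stata, here is a minimal Python sketch.  It assumes only what is stated above, namely that each draw is equally likely to be any of 2<sup>32</sup> values, and it computes <i>p</i> as the product of (1 - <i>i</i>/2<sup>32</sup>) for <i>i</i> = 0, 1, ..., <i>N</i>-1.  The function name <b>prob_all_distinct</b> is made up for this illustration.  The values it prints should be close to those in the table, though not necessarily identical in the last digits.</p>

```python
import math

def prob_all_distinct(n_draws, bits=32):
    """Probability that n_draws values, drawn with replacement from
    2**bits equally likely values, are all different."""
    m = 2 ** bits
    # p = (1 - 0/m) * (1 - 1/m) * ... * (1 - (n_draws - 1)/m).
    # Summing logs keeps the long product numerically stable.
    log_p = sum(math.log1p(-i / m) for i in range(n_draws))
    return math.exp(log_p)

for n in (50, 1_000, 10_000, 100_000):
    p = prob_all_distinct(n)
    print(f"{n:>7,}   {p:.9f}   {1 - p:.9f}")
```

<p>Passing a different <b>bits</b> value shows how quickly the chance of duplicates rises when fewer distinct values are available.</p>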
<p>In shuffling cards we generated <i>N</i>=52 random values.  The probability of a repeated value is tiny, about 3 in 10 million.  In datasets of <i>N</i>=10,000, I expect to see repeated values 1% of the time.  In datasets of <i>N</i>=50,000, I expect to see repeated values 25% of the time.  By <i>N</i>=100,000, I expect to see repeated values more often than not.  By <i>N</i>=500,000, I expect to see repeated values in virtually all sequences.</p>
<p>Even so, I promised you that this problem does not affect the randomness of the ordering.  It does not because of how Stata&#8217;s <b>sort</b> command is written.  Remember the basic solution,</p>
<pre> 
. use <i>dataset</i>, clear

. generate double u = runiform()

. sort u

. keep in 1/<i>n</i>
</pre>
<p>Did you know <b>sort</b> has its own, private random-number generator built into it?  It does, and <b>sort</b> uses its random-number generator to determine the order of tied observations.  In the manuals we at StataCorp are fond of writing, &#8220;the ties will be ordered randomly&#8221; and a few sophisticated users probably took that to mean, &#8220;the ties will be ordered in a way that we at StataCorp do not know and even though they might be ordered in a way that will cause a bias in the subsequent analysis, because we don&#8217;t know, we&#8217;ll ignore the possibility.&#8221; But we meant it when we wrote that the ties will be ordered randomly; we know that because we put a random-number generator into <b>sort</b> to ensure the result.  And that is why I can now write that repeated values of the <b>runiform()</b> function cause a reproducibility issue, but not a statistical issue.</p>
<p>The solution to the reproducibility issue is to draw two random numbers and use the random-number pair to order the observations:</p>
<pre> 
. use <i>dataset</i>, clear

. sort <i>varnames</i>

. set seed <i>#</i>

. generate double u1 = runiform()

. generate double u2 = runiform()

. sort u1 u2

. keep in 1/<i>n</i>
</pre>
<p>You might wonder if we would ever need three random numbers.  It is very unlikely.  <i>p</i>, the probability of no problem, equals 1 to at least 5 digits for <i>N</i>=500,000.  Of course, the chances of duplication are always nonzero.  If you are concerned about this problem, you could add an <b>assert</b> to the code to verify that the two random numbers together do uniquely identify the observations:</p>
<pre>
. use <i>dataset</i>, clear

. sort <i>varnames</i>

. set seed <i>#</i>

. generate double u1 = runiform()

. generate double u2 = runiform()

. sort u1 u2

. by u1 u2: assert _N==1            <i>// added line</i>

. keep in 1/<i>n</i>
</pre>
<p>I do not believe that doing that is necessary.</p>
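<p>The recipe above (known order, one seed, a pair of random keys, optional uniqueness check, keep the first <i>n</i>) is not specific to Stata.  Here is a rough Python sketch of the same steps.  It is an analogy under stated assumptions, not a reimplementation: Python&#8217;s generator is not <b>runiform()</b>, so the cards selected will differ from Stata&#8217;s, and the names <b>draw_without_replacement</b>, <b>RANKS</b>, <b>SUITS</b>, and <b>deck</b> are invented for the example.  The keys are built from 32-bit integers only to mimic the 32-bit resolution discussed above.</p>

```python
import random

def draw_without_replacement(records, n, seed):
    """Pick n of the records at random, reproducibly for a given seed."""
    # Put the data in a known order first.  Sorting on every field means
    # any ties are between identical records, so they cannot matter.
    ordered = sorted(records)
    rng = random.Random(seed)              # seed once per problem
    # Two 32-bit uniforms per record, standing in for u1 and u2.
    keys = [(rng.getrandbits(32) / 2**32, rng.getrandbits(32) / 2**32)
            for _ in ordered]
    # Counterpart of the optional assert: every key pair is unique.
    assert len(set(keys)) == len(keys), "duplicate (u1, u2) pair"
    # Sort on (u1, u2) and keep the first n records.
    shuffled = sorted(zip(keys, ordered))
    return [record for _, record in shuffled[:n]]

RANKS = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King"]
SUITS = ["Club", "Diamond", "Heart", "Spade"]
deck = [(rank, suit) for rank in RANKS for suit in SUITS]

print(draw_without_replacement(deck, 5, seed=49983882))
```

<p>Because the records are sorted before the keys are attached, the same seed gives the same five cards no matter what order the deck arrives in.</p>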
<p>&nbsp;<br />
<b>Is using doubles necessary?</b></p>
<p>In the generation of random numbers in all of the above, note that I am storing them as doubles.  For the reproducibility issue, that is important.  As I mentioned in <a href="http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/">part 1</a>, the 32-bit random numbers that <b>runiform()</b> produces will be rounded if forced into 23-bit <b>float</b>s.</p>
<p>Above I gave you a table of probabilities <i>p</i> that, in creating</p>
<pre>
. generate double u = runiform()
</pre>
<p>the values of <b>u</b> would be distinct.  Here is what would happen if you instead stored <b>u</b> as a <b>float</b>:</p>
<pre>
                               u stored as
          ---------------------------------------------------------
          -------- double ----------     --------- float ----------
      N         p            1-p               p           1-p
-------------------------------------------------------------------
     50   0.999999715    0.000000285     0.999853979    0.000146021
    500   0.999970955    0.000029045     0.985238383    0.014761617
  1,000   0.999883707    0.000116293     0.942190868    0.057809132
  5,000   0.997094436    0.002905564     0.225346930    0.774653070
 10,000   0.988427154    0.011572846     0.002574145    0.997425855
 50,000   0.747490440    0.252509560     0.000000000    1.000000000
100,000   0.312187988    0.687812012     0.000000000    1.000000000
200,000   0.009498117    0.990501883     0.000000000    1.000000000
300,000   0.000028161    0.999971839     0.000000000    1.000000000
400,000   0.000000008    0.999999992     0.000000000    1.000000000
500,000   0.000000000    1.000000000     0.000000000    1.000000000
-------------------------------------------------------------------
</pre>
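<p>To see why storage type matters so much, it can help to watch the rounding happen.  The Python sketch below takes 1,000 consecutive values from the 32-bit grid <i>k</i>/2<sup>32</sup> just above 0.75, all distinct as doubles, and rounds each to IEEE single precision.  It assumes that a Stata <b>float</b> behaves like an IEEE 32-bit float; the helper name <b>as_float32</b> is made up for the illustration.</p>

```python
import struct

def as_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

GRID = 2 ** 32                 # 32-bit uniforms have the form k / 2**32
start = 3 * 2 ** 30            # start / GRID is exactly 0.75

# 1,000 consecutive, distinct 32-bit uniforms just above 0.75 ...
doubles = [k / GRID for k in range(start, start + 1000)]
# ... and the same values after being forced into 32-bit floats.
floats = [as_float32(u) for u in doubles]

print(len(set(doubles)))       # distinct values when kept as doubles
print(len(set(floats)))        # distinct values after rounding to float

# The largest value on the grid rounds all the way up to 1 as a float.
print(as_float32((GRID - 1) / GRID) == 1.0)
```

<p>The thousand distinct doubles collapse onto a handful of floats, which is the mechanism behind the much larger duplicate probabilities in the float columns.</p>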
<p>&nbsp;<br />
<b>Drawing without replacement P-percent random samples</b></p>
<p>We have discussed drawing without replacement <i>n</i> observations from <i>N</i> observations.  The number of observations selected has been fixed.  Say instead we wanted to draw a 10% random sample, meaning that we independently allow each observation to have a 10% chance of appearing in our sample.  In that case, the final number of observations is <i>expected</i> to be 0.1*<i>N</i>, but it may (and probably will) vary from that.  The basic solution for drawing a 10% random sample is</p>
<pre>
. keep if runiform() &lt;= 0.10
</pre>
<p>and the basic solution for drawing a <i>P</i>% random sample is</p>
<pre>
. keep if runiform() &lt;= <i>P</i>/100
</pre>
<p>It is unlikely to matter whether you code &lt;= or &lt; in the comparison.  As you now know, <b>runiform()</b> produces values drawn from 2<sup>32</sup> possible values, and thus the chance of equality is 2<sup>-32</sup> or roughly 0.000000000232830644.  If you want a <i>P</i>% sample, however, theory says you should code &lt;=.</p>
<p>If you care about reproducibility, you should expand the basic solution to read,</p>
<pre>. set seed <i>#</i>

. use <i>data</i>, clear 

. sort <i>variables_that_put_data_in_unique_order</i>

. keep if runiform() &lt;= <i>P</i>/100
</pre>
<p>Below I draw a 10% sample from cards.dta:</p>
<pre>. set seed 838   

. use cards, clear

. sort rank suit

. keep if runiform() &lt;= 10/100
(46 observations deleted)

. list

     +-----------------+
     |  rank      suit |
     |-----------------|
  1. |     2   Diamond |
  2. |     2     Heart |
  3. |     3      Club |
  4. |     5     Heart |
  5. |  Jack   Diamond |
     |-----------------|
  6. | Queen     Spade |
     +-----------------+
</pre>
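<p>The same idea, one independent uniform draw per observation compared against <i>P</i>/100, can be sketched in Python.  As before this is an analogy rather than Stata&#8217;s generator, so the cards kept will not match the listing above, and the name <b>percent_sample</b> is invented for the example.</p>

```python
import random

def percent_sample(records, percent, seed):
    """Keep each record independently with probability percent/100."""
    ordered = sorted(records)              # known order before drawing
    rng = random.Random(seed)              # seed once
    cutoff = percent / 100
    kept = []
    for record in ordered:
        u = rng.getrandbits(32) / 2**32    # one 32-bit uniform per record
        if cutoff >= u:                    # keep when u is at or below P/100
            kept.append(record)
    return kept

deck = [(rank, suit) for rank in range(1, 14)
        for suit in ("Club", "Diamond", "Heart", "Spade")]
sample = percent_sample(deck, 10, seed=838)
print(len(sample), sample)
```

<p>The number of records kept varies from seed to seed; only its expected value is 0.1*<i>N</i>.</p>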
<p>&nbsp;<br />
<b>We&#8217;re not done, but we&#8217;re done for today</b></p>
<p>In part 3 of this series I will discuss drawing random samples <i>with</i> replacement.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Using Stata&#8217;s random-number generators, part 1</title>
		<link>http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/</link>
		<comments>http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/#comments</comments>
		<pubDate>Wed, 18 Jul 2012 18:56:59 +0000</pubDate>
		<dc:creator>William Gould, President</dc:creator>
				<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[binary]]></category>
		<category><![CDATA[numerical analysis]]></category>
		<category><![CDATA[random numbers]]></category>
		<category><![CDATA[runiform()]]></category>

		<guid isPermaLink="false">http://blog.stata.com/?p=1181</guid>
		<description><![CDATA[I want to start a series on using Stata&#8217;s random-number function. Stata in fact has ten random-number functions: runiform() generates rectangularly (uniformly) distributed random number over [0,1). rbeta(a, b) generates beta-distribution beta(a, b) random numbers. rbinomial(n, p) generates binomial(n, p) random numbers, where n is the number of trials and p the probability of a [...]]]></description>
				<content:encoded><![CDATA[<p>I want to start a series on using Stata&#8217;s random-number function.  Stata in fact has ten random-number functions:</p>
<ol>
<li><b>runiform()</b> generates rectangularly (uniformly) distributed random numbers over [0,1).
<li><b>rbeta(</b><i>a</i><b>,</b> <i>b</i><b>)</b> generates beta-distribution beta(<i>a</i>, <i>b</i>) random numbers.
<li><b>rbinomial(</b><i>n</i><b>,</b> <i>p</i><b>)</b> generates binomial(<i>n</i>, <i>p</i>) random numbers, where <i>n</i> is the number of trials and <i>p</i> the probability of a success.
<li><b>rchi2(</b><i>df</i><b>)</b> generates <i>&chi;</i><sup>2</sup> random numbers with <i>df</i> degrees of freedom.
<li><b>rgamma(</b><i>a</i><b>,</b> <i>b</i><b>)</b> generates &Gamma;(<i>a</i>,  <i>b</i>) random numbers, where <i>a</i> is the shape parameter and <i>b</i>, the scale parameter.
<li><b>rhypergeometric(</b><i>N</i><b>,</b> <i>K</i><b>,</b> <i>n</i><b>)</b> generates hypergeometric random numbers, where <i>N</i> is the population size, <i>K</i> is the number of elements in the population having the attribute of interest, and <i>n</i> is the sample size.
<li><b>rnbinomial(</b><i>n</i><b>,</b> <i>p</i><b>)</b> generates negative binomial -- the number of failures before the <i>n</i>th success -- random numbers, where <i>p</i> is the probability of a success.  (<i>n</i> can also be noninteger.)
<li><b>rnormal(</b><i>&mu;</i><b>,</b> <i>&sigma;</i><b>)</b> generates Gaussian normal random numbers.
<li><b>rpoisson(</b><i>m</i><b>)</b> generates Poisson(<i>m</i>) random numbers.
<li><b>rt(</b><i>df</i><b>)</b> generates Student's t(<i>df</i>) random numbers.
</ol>
<p>You already know that these random-number generators do not really produce random numbers; they produce pseudo-random numbers.  This series is not about that, so we'll be relaxed about calling them random-number generators. </p>
<p>You should already know that you can set the random-number seed before using the generators.  That is not required but it is recommended.  You set the seed not to obtain better random numbers, but to obtain reproducible random numbers.  In fact, setting the seed too often can actually <i>reduce</i> the quality of the random numbers!  If you don't know that, then read <b>help set seed</b> in Stata.  I should probably pull out the part about setting the seed too often, expand it, and turn it into a blog entry.  Anyway, this series is not about that either.</p>
<p>This series is about the use of random-number generators to solve problems, just as most users usually use them.  The series will provide <i>practical</i> advice.  I'll stay away from describing how they work internally, although long-time readers know that I won't keep the promise.  At least I'll try to make sure that any technical details are things you really need to know.  As a result, I probably won't even get to write once that if this is the kind of thing that interests you, StataCorp would be delighted to have you join our development staff. </p>
<p>&nbsp;</p>
<p><b>runiform(), generating uniformly distributed random numbers</b></p>
<p>Mostly I'm going to write about <b>runiform()</b> because <b>runiform()</b> can solve such a variety of problems.  <b>runiform()</b> can be used for </p>
<ul>
<li>shuffling data (putting observations in random order),
<li>drawing random samples without replacement (there's a minor detail we'll have to discuss because <b>runiform()</b> itself produces values drawn <i>with</i> replacement),
<li>drawing random samples with replacement (which is easier to do than most people realize),
<li>drawing stratified random samples (with or without replacement),
<li>manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).
</ul>
<p><b>runiform()</b> generates uniformly, a.k.a. rectangularly distributed, random numbers over the interval, I quote from the manual, "0 to nearly 1".</p>
<p>Nearly 1?  "Why not all the way to 1?" you should be asking.  "And what exactly do you mean by nearly 1?"</p>
<p>The answer is that the generator is more useful if it omits 1 from the interval, and so we shaved just a little off.  <b>runiform()</b> produces random numbers over [0, 0.999999999767169356]. </p>
<p>Here are two useful formulas you should commit to memory.</p>
<ol>
<li>If you want to generate <i>continuous</i> random numbers between <i>a</i> and <i>b</i>, use
<p style="padding-left:30px;">
<b>generate double u = </b><b>(</b><i>b</i><b>-</b><i>a</i><b>)*runiform() +</b> <i>a</i></p>
<p>The random numbers will not actually be between <i>a</i> and <i>b</i>; they will be between <i>a</i> and nearly <i>b</i>, but the top will be so close to <i>b</i>, namely <i>a</i> + 0.999999999767169356*(<i>b</i>-<i>a</i>), that it will not matter. </p>
<p>Remember to store continuous random values as <b>double</b>s.</p>
<li>If you want to generate <i>integer</i> random numbers between <i>a</i> and <i>b</i>, use
<p style="padding-left:30px;"><b>generate ui = floor((</b><i>b</i><b>-</b><i>a</i><b>+1)*runiform() +</b> <i>a</i><b>)</b></p>
<p>In particular, do not even consider using the formula for continuous values but rounded to integers, which is to say, <b>round(u)</b> = <b>round((</b><i>b</i><b>-</b><i>a</i><b>)*runiform() +</b> <i>a</i><b>)</b>. If you use that formula, and if <i>b</i>-<i>a</i>&gt;1, then <i>a</i> and <i>b</i> will be underrepresented by 50% each in the samples you generate!</p>
<p>I stored <b>ui</b> as a default <b>float</b>, so I am assuming that -16,777,216 &le; <i>a</i> &lt; <i>b</i> &le; 16,777,216. If you have integers outside of that range, however, store as a <b>long</b> or <b>double</b>.
</ol>
<p>I&#8217;m going to spend the rest of this blog entry explaining the above. </p>
<p>First, I want to show you how I got the two formulas and why you must use the second formula for generating integer uniform deviates. </p>
<p>Then I want to explain why we shaved a little from the top of <b>runiform()</b>, namely (1) while it wouldn&#8217;t matter for formula 1, it made formula 2 a little easier, (2) the code would run more quickly, (3) we could more easily prove that we had implemented the random-number generator correctly, and (4) anyone digging deeper into our random numbers would not be misled into thinking they had more than 32 bits of resolution.  That last point will be important in a future blog entry.</p>
<p>&nbsp;</p>
<p><b>Continuous uniforms over [a, b)</b></p>
<p><b>runiform()</b> produces random numbers over [0, 1).  It therefore obviously follows that <b>(</b><i>b</i><b>-</b><i>a</i><b>)*runiform()+</b><i>a</i> produces numbers over [<i>a</i>, <i>b</i>).  Substitute 0 for <b>runiform()</b> and the lower limit is obtained.  Substitute 1 for <b>runiform()</b> and the upper limit is obtained. </p>
<p>I can tell you that in fact, <b>runiform()</b> produces random numbers over [0, (2<sup>32</sup>-1)/2<sup>32</sup>].  </p>
<p>Thus <b>(</b><i>b</i><b>-</b><i>a</i><b>)*runiform()+</b><i>a</i> produces random numbers over [<i>a</i>, <i>a</i> + ((2<sup>32</sup>-1)/2<sup>32</sup>)*(<i>b</i>-<i>a</i>)].  </p>
<p>(2<sup>32</sup>-1)/2<sup>32</sup> approximately equals 0.999999999767169356 and exactly equals 1.fffffffeX-01 if you will allow me to use <b>%21x</b> format, which Stata understands and which you can understand if you see <a href="http://blog.stata.com/?s=%2521x#section9">my previous blog posting on precision</a>.</p>
<p>Thus, if you are concerned about results being in the interval [<i>a</i>, <i>b</i>) rather than [<i>a</i>, <i>b</i>], you can rescale <b>runiform()</b> so that its largest value becomes exactly 1 by using the formula </p>
<p style="padding-left:30px;"><b>generate double u = (</b><i>b</i><b>-</b><i>a</i><b>)*(runiform() / 1.fffffffeX-01) +</b> <i>a</i></p>
<p>There are seven f&#8217;s followed by e in the hexadecimal constant.  Alternatively, you could type</p>
<p style="padding-left:30px;"><b>generate double u = (</b><i>b</i><b>-</b><i>a</i><b>)*(runiform() / ((2^32-1)/2^32)) +</b> <i>a</i></p>
<p>but dividing by 1.fffffffeX-01 is less typing so I&#8217;d type that.  Actually I wouldn&#8217;t type either one; the small difference between values lying in [<i>a</i>, <i>b</i>) or [<i>a</i>, <i>b</i>] is unimportant. </p>
<p>&nbsp;</p>
<p><b>Integer uniforms over [a, b]</b></p>
<p>Whether we produce real, continuous random numbers over [<i>a</i>, <i>b</i>) or [<i>a</i>, <i>b</i>] may be unimportant, but if we want to draw random integers, the distinction is important.</p>
<p><b>runiform()</b> produces continuous results over [0, 1).  </p>
<p><b>(</b><i>b</i><b>-</b><i>a</i><b>)*runiform()+</b><i>a</i> produces continuous results over [<i>a</i>, <i>b</i>).</p>
<p>To produce integer results, we might round continuous results over segments of the number line:</p>
<pre>
           a    a+.5  a+1  a+1.5  a+2  a+2.5       b-1.5  b-1  b-.5    b
real line  +-----+-----+-----+-----+-----+-----------+-----+-----+-----+
int  line  |<-a->|<---a+1--->|<---a+2--->|           |<---b-1--->|<-b->|
</pre>
<p>In the diagram above, think of the numbers being produced by the continuous formula <i>u</i><b>=(</b><i>b</i><b>-</b><i>a</i><b>)*runiform()+</b><i>a</i> as being arrayed along the real line.  Then imagine rounding those values, say by using Stata's <b>round(</b><i>u</i><b>)</b> function.  If you rounded in that way, then </p>
<ul>
<li>Values of <i>u</i> between <i>a</i> and <i>a</i>+0.5 will be rounded to <i>a</i>.
<li>Values of <i>u</i> between <i>a</i>+0.5 and <i>a</i>+1.5 will be rounded to <i>a+1</i>.
<li>Values of <i>u</i> between <i>a</i>+1.5 and <i>a</i>+2.5 will be rounded to <i>a+2</i>.
<li>...
<li>Values of <i>u</i> between <i>b</i>-1.5 and <i>b</i>-0.5 will be rounded to <i>b-1</i>.
<li>Values of <i>u</i> between <i>b</i>-0.5 and <i>b</i> will be rounded to <i>b</i>.
</ul>
<p>Note that the width of the first and last intervals is half that of the other intervals.  Given that <i>u</i> follows the rectangular distribution, we thus expect half as many values rounded to <i>a</i> and to <i>b</i> as to <i>a</i>+1 or <i>a</i>+2 or ... or <i>b</i>-1.</p>
<p>And indeed, that is exactly what we would see:</p>
<pre>
. set obs 100000
obs was 0, now 100000

. gen double u = (5-1)*runiform() + 1

. gen i = round(u)

. summarize u i 

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           u |    100000    3.005933    1.156486   1.000012   4.999983
           i |    100000     3.00489    1.225757          1          5

. tabulate i

          i |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     12,525       12.53       12.53
          2 |     24,785       24.79       37.31
          3 |     24,886       24.89       62.20
          4 |     25,284       25.28       87.48
          5 |     12,520       12.52      100.00
------------+-----------------------------------
      Total |    100,000      100.00
</pre>
<p>To avoid the problem we need to make the widths of all the intervals equal, and that is what the formula <b>floor((</b><i>b</i><b>-</b><i>a</i><b>+1)*runiform() +</b> <i>a</i><b>)</b> does.</p>
<pre>
           a          a+1         a+2                     b-1          b          b+1
real line  +-----+-----+-----+-----+-----------------------+-----+-----+-----+-----+
int  line  |<--- a --->|<-- a+1 -->|                       |<-- b-1 -->|<--- b --->)
</pre>
<p>Our intervals are of equal width and thus we expect to see roughly the same number of observations in each:</p>
<pre>
. gen better = floor((5-1+1)*runiform() + 1)

. tabulate better

     better |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     19,808       19.81       19.81
          2 |     20,025       20.02       39.83
          3 |     19,963       19.96       59.80
          4 |     20,051       20.05       79.85
          5 |     20,153       20.15      100.00
------------+-----------------------------------
      Total |    100,000      100.00
</pre>
<p>So now you know why we shaved a little off the top when we implemented <b>runiform()</b>; it made the formula </p>
<p style="padding-left:30px;"><b>floor((</b><i>b</i><b>-</b><i>a</i><b>+1)*runiform() +</b> <i>a</i><b>)</b></p>
<p>easier.  Our integer [<i>a</i>, <i>b</i>] formula did not have to concern itself that <b>runiform()</b> would sometimes &#8212; rarely &#8212; return 1.  If <b>runiform()</b> did return the occasional 1, the simple formula above would produce the (correspondingly occasional) <i>b</i>+1. </p>
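<p>The comparison between rounding and flooring is easy to repeat in other languages.  The Python sketch below is a stand-in for the Stata sessions above, not a reproduction of them: it uses Python&#8217;s own generator with an arbitrary seed, so the counts will differ from the <b>tabulate</b> output, but the pattern should be the same.  Rounding is written as <b>floor(x + 0.5)</b>, which is an assumption made to keep the example simple.</p>

```python
import math
import random
from collections import Counter

a, b, n = 1, 5, 100_000
rng = random.Random(2012)                  # arbitrary seed
draws = [rng.getrandbits(32) / 2**32 for _ in range(n)]

# Continuous formula, then rounded to the nearest integer.
rounded = Counter(math.floor((b - a) * u + a + 0.5) for u in draws)
# Integer formula: stretch to b-a+1 equal-width intervals, then floor.
floored = Counter(math.floor((b - a + 1) * u + a) for u in draws)

print("value  rounded  floored")
for value in range(a, b + 1):
    print(f"{value:>5}  {rounded[value]:>7,}  {floored[value]:>7,}")
```

<p>In the rounded column the two endpoints should each receive roughly half as many draws as the interior values, while the floored column should be roughly flat.</p>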
<p>&nbsp;</p>
<p><b>How Stata calculates continuous random numbers</b></p>
<p>I&#8217;ve said that we shaved a little off the top, but the fact was that it was easier for us to do the shaving than not. </p>
<p><b>runiform()</b> is based on the KISS random number generator.  KISS produces 32-bit integers, meaning integers in the range [0, 2<sup>32</sup>-1], or [0, 4,294,967,295].  You might wonder how we converted that range to being continuous over [0, 1).  </p>
<p>Start by thinking of the number KISS produces in its binary form:</p>
<p style="padding-left:30px;">b<sub>31</sub>b<sub>30</sub>b<sub>29</sub>b<sub>28</sub>b<sub>27</sub>b<sub>26</sub>b<sub>25</sub>b<sub>24</sub>b<sub>23</sub>b<sub>22</sub>b<sub>21</sub>b<sub>20</sub>b<sub>19</sub>b<sub>18</sub>b<sub>17</sub>b<sub>16</sub>b<sub>15</sub>b<sub>14</sub>b<sub>13</sub>b<sub>12</sub>b<sub>11</sub>b<sub>10</sub>b<sub>9</sub>b<sub>8</sub>b<sub>7</sub>b<sub>6</sub>b<sub>5</sub>b<sub>4</sub>b<sub>3</sub>b<sub>2</sub>b<sub>1</sub>b<sub>0</sub></p>
<p>The corresponding integer is b<sub>31</sub>*2<sup>31</sup> + b<sub>30</sub>*2<sup>30</sup> + ... + b<sub>0</sub>*2<sup>0</sup>.  All we did was insert a binary point out front:</p>
<p>0&nbsp;<b>.</b>&nbsp;b<sub>31</sub>b<sub>30</sub>b<sub>29</sub>b<sub>28</sub>b<sub>27</sub>b<sub>26</sub>b<sub>25</sub>b<sub>24</sub>b<sub>23</sub>b<sub>22</sub>b<sub>21</sub>b<sub>20</sub>b<sub>19</sub>b<sub>18</sub>b<sub>17</sub>b<sub>16</sub>b<sub>15</sub>b<sub>14</sub>b<sub>13</sub>b<sub>12</sub>b<sub>11</sub>b<sub>10</sub>b<sub>9</sub>b<sub>8</sub>b<sub>7</sub>b<sub>6</sub>b<sub>5</sub>b<sub>4</sub>b<sub>3</sub>b<sub>2</sub>b<sub>1</sub>b<sub>0</sub></p>
<p>making the real value b<sub>31</sub>*2<sup>-1</sup> + b<sub>30</sub>*2<sup>-2</sup> + ... + b<sub>0</sub>*2<sup>-32</sup>.  Doing that is equivalent to dividing by 2<sup>32</sup>, except insertion of the binary point is faster.  Nonetheless, if we had wanted <b>runiform()</b> to produce numbers over [0, 1], we could have divided by 2<sup>32</sup>-1.</p>
<p>Anyway, if the KISS random number generator produced 3190625931, which in binary is </p>
<p style="padding-left:30px;">10111110001011010001011010001011</p>
<p>we converted that to </p>
<p style="padding-left:30px;">0.10111110001011010001011010001011</p>
<p>which equals 0.74287549 in base 10. </p>
<p>The largest number the KISS random number generator can produce is, of course, </p>
<p style="padding-left:30px;">11111111111111111111111111111111</p>
<p>and 0.11111111111111111111111111111111 equals 0.999999999767169356 in base 10.  Thus, the <b>runiform()</b> implementation of KISS generates random numbers in the range [0, 0.999999999767169356].  </p>
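<p>The arithmetic in this example can be checked with a few lines of Python.  The sketch below does not implement KISS; it only performs the conversion step described above, treating a 32-bit integer as a binary fraction by dividing by 2<sup>32</sup>.  The function name <b>binary_point_in_front</b> is made up for the illustration.</p>

```python
def binary_point_in_front(k):
    """Read the 32-bit integer k as the binary fraction 0.b31 b30 ... b0."""
    return k / 2**32

bits = "10111110001011010001011010001011"
k = int(bits, 2)                       # the same integer written in base 10
print(k, binary_point_in_front(k))

largest = binary_point_in_front(2**32 - 1)   # all 32 bits set
print(f"{largest:.18f}")
```

<p>Both divisions are exact in double precision because 2<sup>32</sup> is a power of 2 and the numerators need at most 32 bits.</p>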
<p>I could have presented all of this mathematically in base 10:  KISS produces integers in the range [0, 2<sup>32</sup>-1], and in <b>runiform()</b> we divide by 2<sup>32</sup> to thus produce continuous numbers over the range [0, (2<sup>32</sup>-1)/2<sup>32</sup>].  I could have said that, but it loses the flavor and intuition of my longer explanation, and it would gloss over the fact that we just inserted the binary point.  If I asked you, a base-10 user, to divide 232 by 10, you wouldn&#8217;t actually divide in the same way that you would divide by, say, 9.  Dividing by 9 is work.  Dividing by 10 merely requires shifting the decimal point.  232 divided by 10 is obviously 23.2.  You may not have realized that modern digital computers, when programmed by &#8220;advanced&#8221; programmers, follow similar procedures.</p>
<p>Oh gosh, I do get to say it!  If this sort of thing interests you, consider a career at StataCorp.  We&#8217;d love to have you.</p>
<p>&nbsp;</p>
<p><b>Is it important that runiform() values be stored as doubles?</b></p>
<p>Sometimes it is important.  It&#8217;s obviously not important when you are generating random integers using <b>floor((</b><i>b</i><b>-</b><i>a</i><b>+1)*runiform() +</b> <i>a</i><b>)</b> and -16,777,216 &le; <i>a</i> &lt; <i>b</i> &le; 16,777,216.  Integers in that range fit into a <b>float</b> without rounding. </p>
<p>When creating continuous values, remember that <b>runiform()</b> produces 32 bits. <b>float</b>s store 23 bits and <b>double</b>s store 52, so if you store the result of <b>runiform()</b> as a <b>float</b>, it will be rounded.  Sometimes the rounding matters, and sometimes it does not. Next time, we will discuss drawing random samples without replacement. In that case, the rounding matters.  In most other cases, including drawing random samples with replacement &#8212; something else for later &#8212; the rounding does not matter. Rather than thinking hard about the issue, I store all my non-integer random values as <b>double</b>s.</p>
<p>&nbsp;</p>
<p><b>Tune in for the next episode</b></p>
<p>Yes, please do tune in for the next episode of everything you need to know about using random-number generators.  As I already mentioned, we&#8217;ll discuss drawing random samples without replacement.  In the third installment, I&#8217;m pretty sure we&#8217;ll discuss random samples with replacement.  After that, I&#8217;m a little unsure about the ordering, but I want to discuss oversampling of some groups relative to others and, separately, discuss the manufacturing of fictional data.</p>
<p>Am I forgetting something?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
	</channel>
</rss>
