# STATA Tutorial

STATA is a statistical software package that is widely used by researchers in economics. The interface may look overwhelming at first, but analyzing large datasets is relatively easy.

After opening **Intercooled Stata 9**, four separate windows will appear on your screen:

- *Stata Command:* the window in which you type all STATA commands. STATA is almost exclusively a command-based package, so we will do a lot of typing.
- *Review:* after you type a command and hit Enter, the command appears in the Review window. This allows you to easily keep track of what you have done.
- *Variables:* gives a list of the variables that are available for analysis.
- *Results:* the window that prints the output of your commands.

## Loading and Inspecting the Data

Go to **File > Open**, and browse to the location where the file is saved. Several variables should now appear in the variables window.

- *des:* short for "describe." Type it in the command window and a brief description of the data is given in the Results window.
- *list:* shows the actual numerical values of your data. For small datasets such as this one (15 observations), the command may be useful. To inspect a specific variable, type list and then the variable name. For example, `list invest` shows you all 15 observations on the investment variable.
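A minimal sketch of these two commands follows. It assumes the `auto` example dataset that ships with STATA rather than the tutorial's own file, so the variable `price` is only a stand-in for `invest`.

```stata
* Load a small example dataset that ships with Stata
* (an assumption: the tutorial's own data file is not used here).
sysuse auto, clear

* Brief description of the data: variable names, types and labels.
describe

* Show the values of one variable, restricted to the first five observations.
list price in 1/5
```

The `in 1/5` range is optional; without it, `list price` prints every observation.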

### Recording Output

When analyzing large datasets, it is sometimes handy to have a so-called *log* file. This is a file that keeps track of everything that happens in the result window. In STATA log files have the extension *.smcl* or *.log*. Here is what you need to know about the log files:

- Go to **File > Log > Begin** to open a log file. A screen will pop up, asking where you want the log file to be saved.
- If you want to temporarily turn off the log session, type `log off` in the command window. The session will resume if you type `log on`.
- At the end of your STATA session, type `log close` to close the log file.
- To view the log file, go to **File > Log > View**. Since you will not be able to print directly from this window, the output can be copied and pasted into Microsoft Word and printed out. To print directly from STATA, type `print` followed by the path and filename in the command window.
- If the session is very short, the content of the Results window can be printed directly via **File > Print Result**. Use a log file for longer sessions.
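The same logging steps can also be typed as commands. The sketch below assumes a hypothetical file name, `session1.log`, written to the current working directory, and uses the built-in `auto` dataset only to produce some output.

```stata
* Open a plain-text log file; "session1.log" is a hypothetical name.
* The replace option overwrites an existing file of the same name.
log using session1.log, text replace

sysuse auto, clear
summarize price

* Pause the recording: output produced from here on is not written to the log.
log off
describe

* Resume the recording, then close the log at the end of the session.
log on
summarize mpg
log close
```

After `log close`, the file contains the two `summarize` tables but not the `describe` output.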

### Manipulating the Data

To manipulate data, you often need to generate new variables by transforming or combining existing variables. The command for this is **generate** or **g** for short.

Example: Suppose that in addition to the gnp variable, we need half of gnp. Typing in

g y=gnp/2

will generate a new variable y, which is half the original gnp variable. With this command, standard mathematical operators can be used: +, -, *, /. Powers can be written using the ^ sign. The logarithmic transformation of a variable is

log(x) or ln(x).

You can also use the menus: go to **Data > Create or change variables > Create new variable**.
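As an illustration of these transformations, the sketch below builds three new variables from one existing variable. It assumes the built-in `auto` dataset; the new variable names are made up for the example.

```stata
* Example data shipped with Stata (an assumption, not the tutorial's file).
sysuse auto, clear

* Half of an existing variable, as in the gnp example above.
generate half_price = price/2

* A power, written with the ^ sign.
generate price_sq = price^2

* The natural logarithm; log() and ln() give the same result.
generate ln_price = ln(price)

* Check the new variables.
summarize price half_price price_sq ln_price
```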

## Generating Plots (Scatter Plots and Line Graphs)

### Scatter Plots

A graphical representation often makes a dataset easier to visualize and understand. For this, the **twoway (scatter variable1 variable2)** command is used. The twoway part indicates that the graph is 2-dimensional and the scatter part indicates the type of plot. variable1 is the variable that is graphed along the vertical axis, variable2 along the horizontal axis. For example, typing in

twoway (scatter cpi year)

plots *cpi* (vertical axis) against *year* (horizontal axis).

You can also go to **Graphics > twoway graphs** and it will produce the same results.

### Line Graphs

In order to generate a line graph ("to connect the dots") the command is similar to generating a scatter plot. The command **twoway (connected cpi year)** will instruct STATA to "connect" the dots using lines.

If you do not want to see dots on the graph, use **line** instead of **connected**. To obtain customized labels on either axis, use the **xlabel( )** and/or **ylabel( )** options, where the parentheses contain the values to be labeled on that axis, separated by spaces.

For example, typing in

twoway (connected cpi year), xlabel(1973 1976 1979)

plots the time series of *cpi*, where the years are explicitly marked on the x-axis.
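The sketch below puts the scatter, connected and line plot types side by side. It assumes the `uslifeexp` example dataset that ships with STATA (variables `le` and `year`) in place of the tutorial's `cpi` series; the label values are arbitrary choices.

```stata
* Example time series shipped with Stata: US life expectancy by year.
sysuse uslifeexp, clear

* Dots only.
twoway (scatter le year)

* Dots joined by lines.
twoway (connected le year)

* Lines only, with hand-picked labels on both axes.
twoway (line le year), xlabel(1900 1925 1950 1975 2000) ylabel(40 50 60 70 80)
```

Each `twoway` command replaces the previous graph in the Graph window, so run them one at a time to compare.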

### Frequency Distribution and Histograms

Suppose we are interested in examining the education level within the sample: are most people highly educated or is there a lot of variation? A frequency distribution is generated in STATA by using the tab command (short for tabulate): **tab educ**.

The first column lists the values the **educ** variable takes on. The second column counts the number of individuals corresponding to each education level. These numbers are called *absolute frequencies* and the column as a whole presents the frequency distribution. The numbers in the third column are called relative frequencies. The fourth column lists the so-called cumulative frequencies.

The **tab** command can be extended to several variables. For example, typing in

tab educ male

produces a table with 14 cells, where the data is grouped according to education (7 subgroups in the dataset) and gender (0=female , 1=male). Dividing each of the absolute frequencies by the sample size yields relative frequencies. Fortunately, STATA will do this for you when you complement the tab command with the cell option **tab educ male, cell**.

Now the table contains both absolute and relative frequencies. Note the similarity with a joint probability distribution. The tab command can be used to compute the joint sample distribution of two variables.
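A short sketch of one-way and two-way tables follows, assuming the built-in `auto` dataset, where `rep78` (a repair record with five categories) and `foreign` (0/1) play the roles of `educ` and `male`.

```stata
* Example data shipped with Stata (an assumption, not the tutorial's file).
sysuse auto, clear

* One-way table: absolute, relative and cumulative frequencies.
tabulate rep78

* Two-way table of absolute frequencies.
tabulate rep78 foreign

* Add each cell's share of the whole sample (the joint sample distribution).
tabulate rep78 foreign, cell
```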

### Histogram

A visual representation of a frequency table is called a *histogram*. In a histogram, values of the variable, *educ* in this case, are plotted on the x-axis; relative frequencies, percentages or absolute frequencies are plotted on the y-axis. The basic command to produce histograms is histogram, followed by the variable name. If the variable is discrete, you also need to type in the option discrete. For example, typing in

histogram educ, discrete

will show a graph with the discrete categories of the education level on the x-axis and the density on the y-axis. The **percent** option displays the percentage instead, which is the relative frequency multiplied by 100. The **xlabel** option can be used to mark the x-axis according to your preferences.

Finally, the **width( )** option tells STATA how wide the different bins should be. For example, typing in

histogram educ, discrete percent width(1)

produces a histogram with bins of 1 year of education and percentages displayed on the vertical axis. The histogram represents the distribution within a sample. You can produce histograms of continuous variables in exactly the same way by using the histogram command. Note that in this case, you do not use the discrete option! Also, you can specify the number of bins the data will be grouped in. The syntax is **bin( )** and within the parentheses you type a number.
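The difference between the discrete and continuous cases can be sketched as follows, again assuming the built-in `auto` dataset. The choice of 10 bins is arbitrary.

```stata
* Example data shipped with Stata (an assumption, not the tutorial's file).
sysuse auto, clear

* Discrete variable: one bar per category, percentages on the y-axis.
histogram rep78, discrete percent

* Continuous variable: no discrete option; group the data into 10 bins.
histogram price, bin(10) percent
```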

## Summarizing the Data

### Sample Mean

Let *X* denote the return variable and *x* a particular realization. If the PDF *f* of *X* were known, we could compute the expected value as

$$E[X] = \sum_i x_i f(x_i),$$

where the summation is over all possible realizations of *X*, indexed by *i*. Strictly speaking, this is only correct if *X* is discrete. If *X* is a continuous variable with a continuum of outcomes, the summation above is not well defined. To compute the expectation, we would have to know the functional form of *f(x)* and use an integral:

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$$

In practice the PDF is unknown, so the expectation is estimated by the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, which STATA reports when you type `sum ret`.

A weakness of the sample mean is its sensitivity to extreme values (outliers). Suppose that returns less than -5% are considered extreme and we would like to compute the sample mean return without these extremes. In STATA, type

sum ret if(ret>-5)

This command tells STATA to summarize the return variable, only using returns larger than -5%.

### Sample Median

In STATA, the sample median can be found by using the detail option for the summarize command **sum ret, d**.

When extreme values smaller than -5% have to be omitted, type

sum ret if(ret>-5), d

You can see that the sample mean changes a lot, whereas the sample median remains relatively constant.
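The comparison of mean and median with and without extreme values can be sketched as below. The example assumes the built-in `auto` dataset and treats prices of 10,000 or more as the "extreme" values; both the variable and the cutoff are stand-ins for the tutorial's return data.

```stata
* Example data shipped with Stata (an assumption, not the tutorial's file).
sysuse auto, clear

* Mean and median using all observations.
* After summarize with the detail option, r(mean) and r(p50) hold the two statistics.
quietly summarize price, detail
display "all observations:  mean = " r(mean) "   median = " r(p50)

* The same statistics with the large values left out.
quietly summarize price if price < 10000, detail
display "without extremes:  mean = " r(mean) "   median = " r(p50)
```

Dropping `quietly` prints the full `summarize` tables as well.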

### The Normal Distribution

Suppose *X* has a $N(\mu, \sigma^2)$ distribution. This implies that the standardized variable *Z*, where

$$Z = \frac{X - \mu}{\sigma},$$

has a *N*(0, 1) distribution. STATA contains functions that use this *standard normal* distribution, so if $\mu \neq 0$ and/or $\sigma \neq 1$ you have to standardize first and then use STATA. In the following, assume that *Z ~ N(0,1)*.

The STATA function **norm( )** computes the CDF (cumulative distribution function) of *Z* at any number you specify within parentheses. Thus, you can find *P(Z < z)* for any number *z*. Note that this probability is the area under the PDF (probability density function) to the left of *z*. For example, if you are interested in *P(Z < 1)*, you would type in

display norm(1)

Note that the norm( ) function only computes 'left tails'. Right tails are computed as 1 minus the CDF at a certain point. For example, *P(Z > -0.28)* can be found by typing

display 1-norm(-0.28)

In calculating probabilities of the normal distribution, we will very often make use of the symmetry property. In the previous example, we could have used the fact that *P(Z >* -0.28) = *P(Z <* 0.28), which can be computed by typing

display norm(0.28)

Instead of computing probabilities, let us ask a different question: if *P(Z < z)* = α for a given probability α, what is the value of *z*? Note that we are now working in the opposite direction: from a given probability to finding the value of *z*. To solve this problem in STATA, we use the **invnorm( )** function. For example, suppose we would like to find the value *z* such that 99% of the probability lies to the left of *z*. Simply type in

display invnorm(0.99)

which gives the answer 2.3263479. Thus, we know that *P(Z < 2.3263479)* = 0.99. The functions **norm( )** and **invnorm( )** are each other's inverses.
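The calculations in this section can be collected in one short sketch. Recent STATA releases document these functions under the names `normal()` and `invnormal()`, which are used below; if your copy only recognizes `norm()` and `invnorm()`, substitute those names. The values mean 10, standard deviation 2 and cutoff 12 in the last step are hypothetical numbers chosen for illustration.

```stata
* Left tail: P(Z < 1).
display normal(1)

* Right tail: P(Z > -0.28), computed as 1 minus the CDF ...
display 1 - normal(-0.28)

* ... or, by symmetry, as P(Z < 0.28). Both lines print the same number.
display normal(0.28)

* Quantile: the value z with P(Z < z) = 0.99.
display invnormal(0.99)

* The two functions undo each other: this prints 1.5.
display invnormal(normal(1.5))

* Standardizing first: if X ~ N(10, 2^2) (hypothetical values),
* then P(X < 12) = P(Z < (12 - 10)/2) = P(Z < 1).
display normal((12 - 10)/2)
```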

### Computing Probabilities of the t Distribution

The *t* distribution is, next to the normal, one of the most widely used distributions in econometrics. Suppose a sample $(X_1, \ldots, X_n)$ is taken from a normal distribution with mean $\mu_X$ and variance $\sigma^2$. The statistic

$$Z = \frac{\bar{X} - \mu_X}{\sigma/\sqrt{n}}$$

then has a standard normal distribution. However, if we do not know the value of $\sigma^2$ and instead estimate it by the sample variance $s^2$, the statistic

$$t = \frac{\bar{X} - \mu_X}{s/\sqrt{n}}$$

follows a *t* distribution (with n-1 degrees of freedom).

The function **ttail( )** allows you to compute probabilities in the right tail of the distribution (note that this is different from the normal distribution, where the functions all dealt with left tails of the standard normal distribution). The arguments are the degrees of freedom and the value of interest. For example, to calculate $P(t_{18} > 1)$, type

display ttail(18,1)

Suppose you are interested in the value *t* such that $P(t_{18} > t)$ = 5%. This can be solved using the **invttail( )** function by typing

display invttail(18,0.05)

The answer 1.73 has 5% probability to the right and hence, 95% probability to the left. Thus, 1.73 is the 95%-quantile of the $t_{18}$ distribution.
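A brief sketch of the right-tail convention, using the same 18 degrees of freedom as the examples above:

```stata
* Right tail: P(t18 > 1).
display ttail(18, 1)

* Left tail: P(t18 < 1) is 1 minus the right tail.
display 1 - ttail(18, 1)

* Value with 5% probability to its right (the 95% quantile).
display invttail(18, 0.05)

* Value with 2.5% to its right, as used for a two-sided 95% interval.
display invttail(18, 0.025)

* ttail() and invttail() undo each other: this prints .05.
display ttail(18, invttail(18, 0.05))
```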

## Hypothesis Testing

### Tests For a Single Mean

The wage1.dta file contains a variable wage, which represents the average hourly wage (in \$) for each individual in the sample. The sample mean hourly wage in this case is approximately \$5.90. Suppose that we are interested in testing whether the average hourly earnings are significantly less than \$6. That is,

$$H_0: \mu_X = 6 \quad \text{against} \quad H_1: \mu_X < 6.$$

The output from the summarize command can be used to calculate *t*:

$$t = \frac{\bar{x} - 6}{s/\sqrt{n}},$$

the value of which is then compared to the critical value of the $t_{525}$ distribution (525 degrees of freedom correspond to $n = 526$ observations). There is a direct way to test this hypothesis in STATA. Type in the following command:

ttest wage==6

The output is extensive and shows 3 different alternative hypotheses, the computed *t* statistic and the associated p-values. STATA also computes by default a 95% confidence interval for $\mu_X$. A different confidence level is used if you extend the ttest command with the **level( )** option, where you specify a number between 10 and 99 in parentheses:

ttest wage==6, level(99)

which gives you a 99% confidence interval. As an aside, if you simply need a confidence interval without any testing, you can use **ci**, followed by the variable of interest and possibly the confidence level. The default confidence level with this command is 95%. Thus,

ci wage

ci wage, level(90)

produce a 95% and 90% confidence interval, respectively. Both the **ttest** and **ci** commands may be combined with the **if** qualifier to do the calculations for a specific group. For example, only the wages of individuals living in the West are used when you add `if west==1`. If you need to test a hypothesis about people with at least 10 years of tenure, add `if tenure>=10`, etc.
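The link between the hand calculation and the `ttest` output can be sketched as follows. The example assumes the built-in `auto` dataset and a hypothetical null value of 20 for the mean of `mpg`, in place of the wage data and the value 6.

```stata
* Example data shipped with Stata (an assumption, not the tutorial's file).
sysuse auto, clear

* Direct test of H0: mean(mpg) = 20, with a 95% and then a 99% interval.
ttest mpg == 20
ttest mpg == 20, level(99)

* The same test for one subgroup only.
ttest mpg == 20 if foreign == 1

* By hand: t = (sample mean - 20) / (s / sqrt(n)), using summarize's results.
quietly summarize mpg
scalar tstat = (r(mean) - 20) / (r(sd) / sqrt(r(N)))
scalar df    = r(N) - 1
display "t = " tstat "  with " df " degrees of freedom"

* Left-tail p-value for H1: mean < 20, from the right-tail function ttail().
display "P(T < t) = " 1 - ttail(df, tstat)
```

The hand-computed *t* and left-tail p-value should match the first `ttest` output.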

### Regression Analysis

Economists are often interested in quantifying the relationship between two or more variables. In terms of random variables we look at the relation between *X* and *Y.* Often, intuition suggests that *X* will have an effect on *Y.* In that case, we are trying to explain the variation in *Y* by looking at the variation in *X.* If the causality runs from *X* to *Y,* we call *X* the *explanatory* or *independent* variable and *Y* the *explained* or *dependent* variable. We usually assume the relationship between *X* and *Y* is linear in the population:

$$Y_i = \beta_1 + \beta_2 X_i + u_i, \qquad i = 1, \ldots, n.$$

The $u_i$ represent a random error, since we do not expect the linear relation to hold with equality. The error term can be thought of as capturing all the factors other than $X_i$ that affect $Y_i$.

The model above is called a *linear regression model*. In this context, $\beta_1$ is the *intercept* or *constant* and $\beta_2$ the *slope coefficient*. Both $\beta_1$ and $\beta_2$ are generally unknown parameters and will have to be estimated.

In our dataset we are interested in the relation between the number of hours studied and the midterm grade. A graph of the two variables is suggestive of a linear relation, so that the linear regression model may be a good approximation. If *Y* is the midterm grade and *X* the number of hours studied, then the parameters can be estimated in STATA with the regress command: **regress grade hours_studied**

Note that the dependent variable always comes before the independent variables in this command. The least squares estimates are given in the column labeled 'Coef.'. After the regress command, some output can be obtained through the **predict** command. If you are interested in the fitted values $\hat{Y}_i$ of the sample regression function, type

predict gr_pred

The fitted values of the grade are now stored under the name gr_pred. The residuals can be obtained by using the **resid** option after predict:

predict error, resid

Instead of the last command, we could have also used **g error = grade - gr_pred**. You can graph the actual and fitted grades against the hours studied variable all in one graph: **twoway (scatter grade hours_studied) (connected gr_pred hours_studied)**

The predicted values of the grades lie on a straight line by construction, whereas the actual grades are scattered around the sample regression line.
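The whole regression workflow can be sketched in a few lines. The example assumes the built-in `auto` dataset, with `price` as the dependent and `weight` as the explanatory variable in place of grade and hours studied; the names of the new variables are made up.

```stata
* Example data shipped with Stata (an assumption, not the tutorial's file).
sysuse auto, clear

* Dependent variable first, then the explanatory variable.
regress price weight

* Fitted values and residuals from the regression just estimated.
predict price_hat
predict u_hat, resid

* The residual is the actual value minus the fitted value;
* u_hat and u_check should agree up to rounding.
generate u_check = price - price_hat
summarize u_hat u_check

* Actual values as dots, fitted values as a line, in one graph.
sort weight
twoway (scatter price weight) (line price_hat weight)
```

Sorting by the explanatory variable before plotting keeps the line from zigzagging between observations.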

## Importing & Exporting Data

In many cases the data available to you will not be in STATA format. Here we consider the case of an Excel spreadsheet, which cannot be directly opened into STATA. It turns out there is an easy way to read this type of data into STATA. Also, after finishing a STATA session, the variables in the workspace can be saved in a convenient format, so that it can be opened by other programs. The standard STATA file format, *.dta*, does not carry this feature.

- Open the file in Excel. Then go to **File > Save As** and save the file again, but now in the text (tab delimited) format. When you are prompted, click Yes. Make sure you save it in an easily accessible location, i.e. a short path name without spaces.
- Close Excel and open STATA. Try to access the data that you just saved: it will not work. The traditional way of loading data is reserved for *.dta* files only. There are two ways to import data: a) using the command line and b) using the menu options on top.
- Menu: go to **File > Import > ASCII data created by a spreadsheet.** Then browse in your directory for the .txt file you have just created.
- Command line: change the STATA working directory to the one where you saved the data file by using the DOS command **cd**. For example,

cd N:

The data can be loaded with the **insheet using** command, followed by the exact path name of the file plus extension (a path that contains spaces, such as a directory called my documents, has to be enclosed in double quotes):

insheet using stocks2.txt

In addition to the list command, you can take a quick look at the data by typing `browse`. This provides you with a check to assure that the importing has been successful. In some cases, you may want to have the original Excel file open at the same time and compare the STATA browser window with the Excel spreadsheet.
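A round trip through a tab-delimited text file can be sketched as below. It assumes the built-in `auto` dataset and a hypothetical file name, `auto_tab.txt`, written to the current working directory. The companion command `outsheet` is used here to create the text file; newer STATA releases also offer `import delimited` and `export delimited` for the same purpose.

```stata
* Example data shipped with Stata (an assumption, not the tutorial's file).
sysuse auto, clear

* Export three variables to a tab-delimited text file that other
* programs, such as Excel, can open. "auto_tab.txt" is a hypothetical name.
outsheet make price mpg using auto_tab.txt, replace

* Empty the memory, then read the text file back in.
clear
insheet using auto_tab.txt

* Check that the import worked.
describe
list in 1/5
```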

## Finishing a STATA session

If you started a log file, make sure to type `log close` to end the recording of output. Typing `clear` in the command window will erase all content in STATA's memory space. Now you can close STATA as you would any other program.

## Acknowledgments

PASS would like to thank Martijn van Hasselt for letting us use some of his handouts as reference as well as using his STATA data files to teach this class.