This statistical package provides a depth far beyond the needs of the casual user
Stata is a statistical package that integrates statistics with graphics and data management, as most such software now does. Despite the new point-and-click interface, however, Stata is used most efficiently by those with more than a cursory knowledge of the field. Given the depth and breadth of the techniques offered, the package is geared to the professional statistician or the scientific researcher with in-depth knowledge of statistics. It has long had a following among statisticians for the quality of its algorithms and, as such, is more in the class of SAS than a beginner’s program.
Stata 10 is available for Windows (2000, XP, Vista), Macintosh, and Unix. The professional versions for multiple core/multiple processor units are capable of working with up to 32,766 variables, with the number of observations limited only by memory. Vista presents the usual security concerns, so installation and use must be run as an administrator. After getting past this little “feature,” there were no problems in quickly installing and running the software. If the user remembers to always run as an administrator, there are no problems with updating either. The developers were obviously aware of this Vista shortcoming, as the instructions in the Getting Started manual prominently repeat the warning, as does the support Web site.
For more experienced users, a partial list of the new or improved features in release 10 is given in Table 1.
Paper manual aficionados will be in heaven, as the documentation consists of 13 volumes of various sizes. In reading through the ‘Getting Started with Stata’ and ‘User’s Guide’ volumes, I was struck by how far we have come from the old IBM manuals and how much we have come to expect clarity and brevity in the explanation of even the more complex procedures. Little was found to fault in these two volumes, but I’m still slogging my way through many areas of the statistical manuals. One thing that is always bothersome with the sample data sets is the propensity to avoid anything scientific in the introductory examples. It might whet a few more appetites if we could see some engineering, physics or life science examples up front, rather than the automobile and widget manufacturing examples that are usually employed. Also, the indexing of the specialized statistics and reference manuals may seem a bit odd at first, being organized alphabetically by command rather than by page number, but these are minor quibbles.
The one major problem that the new user may encounter is the lack of a clear demonstration of how to import data from Excel. As this spreadsheet is probably the most common in real-world settings, and as data sets get ever larger, it seems almost comical that the authors advocate cutting and pasting. Granted, for small data sets, this is a very easy procedure, but for the vast majority of modern sets it would have been far more helpful to simply give one worked example using the specialized Stata import commands. Instead, importation procedures are scattered throughout several manuals.
On the positive side, spread throughout the Getting Started guide are Syntax notes that are extremely helpful in that they explain why particular codes and code sequences are used and how they would be modified should slightly differing output be required. Also, the text explains why information in the resultant tables is important and how it may be used.
The help contents are more ‘programming-like’ in that a Windows-style viewer comes up that seems to be filled with type right out of a programming manual. Still, the information is easy to get at and well-indexed.
With the push of a button on the main screen, an Excel-like data browser (Figure 1) appears and can be closed and recalled at any time with simple commands. It is at first annoying that, while this is up, the rest of the main screens are locked. It would be nice if this weren’t so, as many users like to tile or cascade the windows to keep track of multiple tasks. There are several other ways to visualize data in Stata, however. The main work areas consist of the menu bar, and the command, review and variables areas, as well as the colorful results section (Figure 2).
The software likes to see the variables in columns and the observations in the rows, and most analysts are used to this. As was mentioned, the data editor is not the only way to view the data. It can be sliced, diced and summarized by a variety of simple commands such as ‘list,’ ‘tabulate,’ ‘describe’ and ‘codebook.’ The last mentioned is a bit too limiting for general overviews but offers nice summaries of specific groups of data. An example from the ‘describe’ function is displayed in Figure 3.
A very nice feature is the automatic generation of code in the review area when the menu-driven elements are used. If the user is unsure of a particular code but can easily use the menu-driven sequence, the code corresponding to that action then appears in the review section automatically. This is a very helpful teaching tool as well as a nice editing feature.
Manipulation and Analysis
As highlighted in the title of this review, Stata may be run by either command line programming or menu-driven point-and-click. The novice may chuckle at the advice given in the Getting Started manual that it is easier and faster to type in the commands than to use the menu-driven elements. However, I found this to be true and used the command lines to follow all of the pre-cooked examples. Stata is case sensitive but not too fussy about white space and, although the novice must suffer through the Data Management manuals’ programming commands for management, there are the usual gems available for wrestling the data into shape, e.g., transpose, cut, subset, join and connect are all there in one form or another.
As to statistical analysis, Stata has long (and deservedly) enjoyed a reputation for excellence in the depth of its specialized routines, including algorithms and exact computation. From a perusal of the manuals, it becomes apparent that Stata has especial depth in the following areas: multivariate statistics, longitudinal/panel data, survey data, time series and survival analysis/epidemiological tables. This would seem to bias the software toward greatest use in the biological sciences, but it is also a staple in statistics and business applications. Due to the breadth of the analytic routines, the interested reader is referred to the Web site for a fuller appreciation of what Stata can do.
In terms of simple, descriptive statistics, they are all available under the summary commands listed in the data visualization section above. Even at this level, there are refreshing “touches” that greatly assist the analyst at any level. One excellent example is the automatic generation of not only the standard deviation, but the standard error, and 95 percent confidence intervals for the mean with the t-test procedure.
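For readers who want to see what those three quantities amount to, the arithmetic can be sketched in a few lines. This is not Stata code and makes no claim about how Stata computes its output; it is a minimal Python illustration using made-up data, a hypothetical helper named `mean_ci`, and a critical value of Student's t taken from a standard table for 7 degrees of freedom.

```python
import math
import statistics


def mean_ci(values, t_crit):
    """Return (mean, sd, se, lower, upper) for a two-sided interval on the mean.

    t_crit is the critical value of Student's t for len(values) - 1
    degrees of freedom at the desired confidence level.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    half_width = t_crit * se        # half-width of the confidence interval
    return mean, sd, se, mean - half_width, mean + half_width


# Hypothetical measurements, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Two-sided 95 percent critical value for 7 degrees of freedom (table value).
T_CRIT_DF7 = 2.3646

mean, sd, se, lower, upper = mean_ci(sample, T_CRIT_DF7)
print(f"mean={mean:.3f} sd={sd:.3f} se={se:.3f} 95% CI=({lower:.3f}, {upper:.3f})")
```

The critical value is hard-coded only to keep the sketch free of third-party libraries; a statistics package would look it up from the sample size.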
As a genomics researcher, I was quickly drawn to the Multivariate Statistics Module and was delighted at the speed by which Stata withdrew data from the Internet and did a multivariate analysis of variance (MANOVA) on simple teaching data. It was also quite pleasant to find a biochemical example used for the multivariate analysis of covariance (MANCOVA). The manual is especially helpful here, detailing the code, options, mathematical basis for the tests and diagnostics, additional remarks and references.
The really nice surprise is that Stata, like SAS, is especially user friendly to those who, while avoiding any semblance of programming, can code the brief few lines that will perform sophisticated statistical analysis and diagnostics. This, of course, has a none-too-gentle learning curve, but for those who will spend the time, ample rewards accrue with experience. For those users familiar with statistical analysis, the actual generation of results is astonishingly simple. However, it is left to the user to understand what is being done and how well. This is not to say that there is no help available, and several avenues are mentioned in the summary below.
Stata is not a graphics program and lacks the breadth of niche products such as SigmaPlot and Origin, as well as the interactivity of JMP. The new user will miss access to 3-D, surface graphics and the ability to rotate same, facilities sorely missed in many areas of life and physical science research. That said, the software can quickly produce many 2-D graphics and is capable of many layers of edits and annotations. The code for most of these is simple and straightforward, although the graphs also can be easily generated by menu-driven commands. A simple depiction of non-linear regression is displayed in Figure 5.
This program has much to recommend it to the statistician and provides a depth far beyond the needs of the casual user. Long held in high regard for its excellence in many of the more advanced methods and specialized applications for its routines, it is now more user-friendly and contains a wider menu of tests.
Help is available to the novice in the form of manuals, on-line help, the FAQ Web sites and formal courses, as well as a dynamic users group and an excellent tech support department. Working through a few examples will start new users off on the learning curve, and much can be done in a short time. The interested reader is encouraged to peruse the Web site.
• $1,550 corporate and government
• $740 academic (special pricing available at participating colleges and universities)
4905 Lakeway Drive, College Station, Texas 77845
1-800-782-8272; Fax: 1-979-696-4601
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.
Table 1: Some New Features in Stata 10
• logistic binary-response regression
• Poisson count-data regression
• exact tests and joint tests
• predictions with exact CIs
• logistic binary-response regression
• Poisson count-data regression
• random coefficients and intercepts
• hierarchical and multilevel models
• predicted effects
Survey and correlated data
• Cox proportional hazards model
• parametric survival models
• 27 more estimators now supported
• 48 total estimators
• solve for sample size, power, or effect size
• log-rank test of survival curves
• Cox proportional hazards model
• exponential regression
• discriminant analysis: LDA, QDA, and more
• multiple and joint correspondence analysis
• modern nonmetric multidimensional scaling
Dynamic panel data
• Arellano–Bover/Blundell–Bond system estimator
• smaller bias with persistent AR processes
• serially correlated disturbances
• test overidentifying restrictions
• random-utility nested logit
• McFadden’s choice model
• rank-ordered probit
• unbalanced groups
• LIML estimation
• GMM estimation
• tests of overidentifying restrictions
• millisecond resolution
• optional leap second adjustment
• user-defined display formats
• easy conversion
• save estimation results to disk
• restore and compare models
• restore and perform predictions
• restore and perform tests
• fastest version of Stata
• dual core through 32 processors
• more commands parallelized
• all survey estimators now parallelized
• local polynomial regression
• nonlinear seemingly unrelated regression
• at-risk tables underneath survival graphs
• three methods for handling singletons in survey data
• more estimation commands supporting bootstrap and jackknife standard errors
• clustering with both binary and continuous variables by using the Gower measure
• easy comparison tests for nested models
• restricted cubic splines
• Windows automation (OLE)
• tabbed graph windows in Stata for Windows