Statistical Analysis with Electronic Health Records


Vahed Maroufy, PhD

Department of Biostatistics and Data Science, SPH, University of Texas HSC


Abstract: Electronic health records (EHRs), although originally designed for billing and documentation purposes, are valuable resources for providing real-world evidence to evaluate treatment effects and disease risks. They are big data sources with a full patient history, including diagnosis, medication, procedure, demographic, lab and clinical event tables, which makes them strong choices for longitudinal analysis and prediction. They are, however, imprecisely recorded, very noisy and heterogeneous, and hence quite challenging to process, clean and prepare for statistical analyses. Common problems in preparing EHRs include massive noise, extensive missing data, and mixed variable types (cross-sectional, longitudinal, functional, …). In this presentation, I review the different directions and research projects under way in our department. We primarily use the Cerner Health Facts database and the Blue Cross Blue Shield of Texas (BCBSTX) insurance database. Cerner includes about 50M patients over a 15-year period at around 700 hospitals across the USA, and BCBSTX covers over 10M insured patients under 65 years of age in Texas. Specifically, I will discuss the issues and subtleties in data cleaning and processing and share a strategy we are developing for handling these issues. I will also briefly talk about three specific ongoing research projects based on these databases.


HACASA meetings are open to the public!  Please pass on copies of this announcement to colleagues and friends and post on appropriate bulletin boards.                                                     


WHEN and WHERE:        *Tuesday, September 11, 2018* at RAS Building, UT School of Public Health

5:30pm – 6:30pm, Social Hours, Room 102A

6:30pm – 7:30pm, Seminar, Room 102A




RAS Building is the primary building of UT School of Public Health, located at 1200 Pressler Street. For parking information and directions please refer to the UTSPH parking websites, the following have been found helpful:  



Thank you to UTHealth and the Department of Biostatistics and Data Science at UTSPH for their generous support of HACASA by allowing us the use of their facilities for our gatherings, and sponsoring FREE PARKING. See an officer for a voucher at the meeting.


To help us cover the cost of our weekly snacks, we are asking for small donations ($5 or more). Also, please RSVP so we can bring enough snacks and drinks.  Please RSVP to Kristofer Jennings at


January 21, 2014

Heidi Spratt, PhD
Associate Professor, UTMB Bioinformatics Program, The University of Texas Medical Branch

Development of a Metabolic Biomarker Panel for the Early Diagnosis of Hepatocellular Carcinoma in Hepatitis C Infected Populations

Hepatocellular carcinoma (HCC) is the fifth most common cancer and also one of the deadliest. It has recently shown an upward trend in the number of diagnoses per year. One of the precursors to HCC is infection with the hepatitis C virus (HCV). According to the NIH, HCV is one of the main causes of chronic liver disease in the United States. About one-third of HCV patients will develop cirrhosis of the liver. Of those, about 1-2% annually will develop HCC. Ultimately, 80% of HCC cases develop from liver cirrhosis. Hepatitis C causes an estimated ten to twelve thousand deaths each year in the U.S. alone; the virus varies greatly in both its course and disease outcome. Many patients infected with HCV are asymptomatic and have some degree of chronic hepatitis, often associated with some degree of fibrosis of the liver. The prognosis for patients with early-stage fibrosis is frequently good. At the other end of the spectrum are patients with severe HCV who experience all the classic symptoms of the disease, and who ultimately develop liver cirrhosis. The prognosis for such patients is poor, and the disease is usually fatal. Little is known about how HCV infection progresses to HCC in patients with advanced fibrosis, so the main goal of this project is to discover biomarkers for the detection of early-stage liver cancer. Patients with both HCV and HCC will be compared to patients with HCV only to develop a biomarker panel for the detection of early-stage liver cancer. For the development of such biomarkers, nuclear magnetic resonance metabolomics experiments will be conducted on urine samples from infected patients. Machine learning techniques such as multivariate adaptive regression splines will be used to create a biomarker panel that has the ability to predict HCC.


October 12, 2013

Rice University hosts  StatFest 2013

StatFest 2013 was a one-day conference aimed at encouraging undergraduate students from historically under-represented groups to consider careers and graduate studies in the statistical sciences.  The conference is an ongoing initiative of the American Statistical Association through its Committee on Minorities in Statistics.  It included presentations from established professionals, academic leaders, and current graduate students to help attendees understand the opportunities and routes for success in the field.  Panel forums covered information and tips for a rewarding graduate student experience, achieving success as an academic statistician, and opportunities in the private and government arenas, among other topics.

Attendance at the conference was free, though registration was required. Look for next year's StatFest in the Fall.
 Sponsorship was provided by: Rice University, the National Science Foundation, Abbott Laboratories, and the American Statistical Association.

Congratulations 2013 Science Fair Winners!

Junior: Karina Bertelsmann, "Bingo!...In a Round or Two"; Westbrook Intermediate

Ninth: Christine Castagna, "Efficiencies of Sun Rays Capture"; Academy of Science and Technology

Senior: Grant Zhao, "Waist-to-Height Ratio - A Better Index to Diagnose Obesity & Estimate Body Fat: Part II"; Clements HS


April 9, 2013

 Dr. José-Miguel Yamal

Division of Biostatistics, The University of Texas Health Science Center School of Public Health


Title: Technologies for the detection of cancer: algorithm development and novel statistical methodology

Abstract: We consider here two technologies proposed as accurate and low-cost alternatives for detection of cervical intraepithelial neoplasia (quantitative cytology and optical spectroscopy) and one technology for detection of breast cancer lesions (elastography). We discuss the development of classification algorithms using these three technologies. (1) Classification using quantitative cytology involves classifying a patient based on measurements on a collection of a random number of their cells, i.e., classifying a macro-level object based on measurements of embedded (micro-level) observations within each object. Classification problems with this hierarchical, nested structure have not received the same statistical attention as the general classification problem. We present some model-based methodologies that address this structure. (2) Classification using optical spectroscopy involves classifying a patient based on functional data. We trained and tested an algorithm for optical spectroscopy as an adjunct to colposcopy as well as optical spectroscopy not including the colposcopic diagnosis, examining the importance of the probe placement expertise. (3) Classification using axial-shear strain elastograms involves acquiring ultrasound signals before and after a small compression of the breast. We developed a simple algorithm that has the potential to improve the accuracy over using ultrasound images alone. Quantitative cytology, optical spectroscopy, and axial-shear strain elastograms show promise for the detection of pre-cancer and cancer.


March 12, 2013

Dr. Stanley N. Deming

Professor Emeritus, Department of Chemistry, University of Houston


President, Statistical Designs
Topic: Masquerading as a Statistician

I'm an analytical chemist. I am *not* a formally trained statistician.
But I have found that the statistical analysis of laboratory data, the design of experiments, and the sequential simplex method of optimization are very useful tools that can be applied productively in my field. And I've had a lot of fun over the years showing others how they can use these tools in their fields. In keeping with full disclosure, I'm very careful to inform my potential clients early in our discussions that I'm not a formally trained statistician -- that my formal training is in the field of analytical chemistry. Much to my continuing surprise, this disclosure rarely ends the discussion. Thus, over the years, I've been able to appreciate the usefulness of statistical methods properly applied ... and to wonder and speculate about areas in which either the statistical methods don't seem to be fully mature yet, or the application of existing statistical methods doesn't make much sense. As an example of the first area, a question I encounter frequently is, "What do I do with non-detects?" There still seems to be room for work in that area. As an example of the second area, a question I constantly ask myself is, "Although two one-sided t-tests might work for a question involving 'different from a single specification', is it appropriate to use two one-sided t-tests for a question involving 'between two specifications'?" The United States Pharmacopeial Convention recommends this for at least one type of "equivalence testing", but I don't think it's the correct thing to do.

And, by extension, I ask myself, "Is there a p-value approach for between-specification testing that would be more appropriate for making better business decisions?" Questions like this keep me feeling young and active, and let me enjoy my charade -- masquerading as a statistician.

February 12, 2013

Dr. Sanjay S. Shete 

Department of Biostatistics, Division of Quantitative Sciences,

The University of Texas MD Anderson Cancer Center

Topic: Identifying SNPs associated with Mediators in Genomewide Association Studies: Application to Smoking Behavior and Lung Cancer

A mediation model explores the direct and indirect effects between an independent variable and a dependent variable by including other variables (or mediators). Mediation analysis has recently been used to dissect the direct and indirect effects of genetic variants on complex diseases using case-control studies. However, bias could arise in the estimations of the genetic variant-mediator association because the presence or absence of the mediator in the study samples is not sampled following the principles of case-control study design. In this case, the mediation analysis using data from case-control studies might lead to biased estimates of coefficients and indirect effects. In this article, we investigated a multiple-mediation model involving a three-path mediating effect through two mediators using case-control study data. We propose an approach to correct bias in coefficients and provide accurate estimates of the specific indirect effects. Our approach can also be used when the original case-control study is frequency matched on one of the mediators. We conducted simulation studies to investigate the performance of the proposed approach, and showed that it provides more accurate estimates of the indirect effects as well as the percent mediated than standard regressions. We then applied this approach to the multiple-mediation study of the mediating effects of both smoking and chronic obstructive pulmonary disease (COPD) on the association between the CHRNA5-A3 gene locus and lung cancer risk using data from a lung cancer case-control study. The results showed that the genetic variant influences lung cancer risk indirectly through all three different pathways. The percent of genetic association mediated was 18.3% through smoking alone, 30.2% through COPD alone, and 20.6% through the path including both smoking and COPD, and the total genetic variant-lung cancer association explained by the two mediators was 69.1%. 
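The percentages quoted above can be checked with a few lines of arithmetic. The following is only a minimal sketch that re-adds the reported figures; the dictionary keys and the `remaining` quantity are labels introduced here for illustration and do not come from the study itself.

```python
# Percent of the genetic association mediated along each pathway,
# copied from the figures reported in the abstract above.
percent_mediated = {
    "smoking alone": 18.3,
    "COPD alone": 30.2,
    "smoking and COPD": 20.6,
}

# Total share of the association explained by the two mediators.
total_mediated = round(sum(percent_mediated.values()), 1)

# Share not explained by the two mediators (an illustrative label,
# not a quantity named in the abstract).
remaining = round(100.0 - total_mediated, 1)

print(f"total mediated: {total_mediated}%")
print(f"remaining: {remaining}%")
```

The three pathway-specific percentages add up to the 69.1% total stated in the abstract.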

January 15, 2013 

Dr. Diane Schaub

Senior Statistical Applications Analyst  

Office of Performance Improvement, MD Anderson Cancer Center

Title: Quality Improvement at MD Anderson Cancer Center

Abstract:  Performance improvement methods are being utilized at The University of Texas MD Anderson Cancer Center’s (MDACC) clinical and facilities operations to improve patient flow and streamline support departments.  Examples of completed projects by these various groups will be shared, along with the operating and financial gains that have been realized.  MDACC has three distinct live instruction training classes with a common goal to identify, measure and implement changes that result in performance improvement.  The three classes are the widely known Lean and Six Sigma programs, as well as a healthcare-specific Clinical Safety and Effectiveness (CS&E) program; the primary difference between these three methodologies is the specific approach to the instruction and which statistical and management tools are used.  The MDACC Lean training is taught to departmental units, who then undertake their own improvement efforts.  The MDACC Six Sigma Green Belt training is taught to individuals, who are then mentored through the problem solving process for a specific project.  The CS&E training is typically taught to teams consisting of both clinical and administrative members who work on a project and are facilitated by members who have already completed the course. 

November 14, 2012

Dr. Alan Feiveson from NASA 

Cardiac study - convex hull growth 

October 17, 2012

 Dr. Don Berry – MD Anderson Cancer Center

 The Mammography Wars


Abstract: In 1997 I served as co-chair of an NIH Consensus Development Panel on Breast Cancer Screening for Women Ages 40-49. The panel actually developed a consensus, but not everyone outside of the panel agreed ... to say the least! In the ensuing 15 years there has been no other medical intervention or procedure quite as controversial as screening mammography. During the political debates regarding healthcare reform the U.S. Preventive Services Task Force was actually charged with being an Obama death squad because of its recommendations on screening mammography. And speaking of death, along the way some have suggested that my own death would be a positive development for women's health. I'll summarize some highlights of the controversies from my perspective in the trenches. I'll describe the screening trials, statistical modeling, and epidemiology studies (to assess the extent of overdiagnosis, for example), and I'll delve superficially into the politics of breast cancer.

Interface of Computing Sciences and Statistics
May 16-18, 2012
Rice University Statistics Department hosted the Interface of Computing Sciences and Statistics at Rice University.
  Trevor Hastie gave the keynote address. For more information see:

2012 Short Course

Applied Mixed Models

Presented by Dr. Linda Young, PhD

Dept. of Statistics, University of Florida


Sponsored by

Houston Area Chapter of the American Statistical Association

Rice University, Department of Statistics

University of Texas School of Public Health, Division of Biostatistics

Saturday April 28, 2012

 Location: UT School of Public Health Auditorium (1st floor, E101)

1200 Herman Pressler, RAS Building, Houston, TX 77030


Data sets from designed experiments, sample surveys, and observational studies often contain correlated observations due to random effects and repeated measures.  Mixed models can be used to accommodate the correlation structure, produce efficient estimates of means and differences between means, and provide valid estimates of standard errors.  Repeated measures and longitudinal data require special attention because they involve correlated data that arise when the primary sampling units are measured repeatedly over time or under different conditions.  Normal theory models for random effects and repeated measures ANOVA will be used to introduce the concept of correlated data.  These models are then extended to generalized linear mixed models for the analysis of non-normal data, including binomial responses, Poisson counts, and over-dispersed count data.  Methods of assessing the fit and deciding among competing models will be discussed.  Accounting for spatial correlation and radial smoothing splines within mixed models will be presented and their application illustrated. The use of SAS System’s PROC GLIMMIX will be introduced as an extension of PROC MIXED and used to analyze data from pharmaceutical trials, environmental studies, educational research, and laboratory experiments.

This workshop is for those who want to learn about the theory and application of linear and generalized linear mixed models. The material is presented at an applied level, accessible to participants with training in linear statistical models and previous exposure to linear mixed models.  Some experience with SAS’s PROC MIXED would be helpful.

See details in flyer attached at bottom of this page.

April 2012 Presentation
A Better Alternative to Long-Horizon Regressions

Natalia Sizova, Ph.D., Assistant Professor

Department of Economics, Rice University

In long-horizon regressions the dependent variable is aggregated over several future periods, i.e., $y_{t+1} + y_{t+2} + y_{t+3} + \cdots + y_{t+H}$. Long-horizon regressions are often used to uncover long-run relations that are not detectable at higher frequencies. Such regressions find applications in macroeconomics and finance. One example is the test of monetary neutrality of GDP growth, i.e., the independence of the long-run GDP dynamics from the current monetary policy. Such tests are performed using long-horizon regressions. The example in this presentation is the long-run predictability of the stock market, which is also tested with long-horizon regressions. However, it is often found that long-horizon regressions are not sufficiently accurate to conclusively reject or confirm long-run predictability. We re-examine this evidence and suggest a new method of estimating the long-run relations that outperforms long-horizon regressions. Our suggestion is to consider the problem in the frequency domain, which allows us to naturally separate long-run and short-run dynamics. The suggested method preserves the simplicity and intuitiveness of simple linear regressions.
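To make the aggregation step concrete, here is a minimal sketch of a conventional long-horizon regression, the baseline the talk argues against, not the frequency-domain method it proposes. Everything in it is an assumption for illustration: the synthetic series `x` and `y`, the coefficient 0.1, the horizon `H = 12`, and the helper names `long_horizon_sums` and `ols_slope_intercept`.

```python
import random


def long_horizon_sums(y, horizon):
    """Return s[t] = y[t+1] + ... + y[t+horizon] for each t where the sum is defined."""
    return [sum(y[t + 1 : t + 1 + horizon]) for t in range(len(y) - horizon)]


def ols_slope_intercept(x, z):
    """Ordinary least squares fit of z on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx
    return slope, mean_z - slope * mean_x


# Synthetic data (hypothetical): y depends weakly on the previous value of x.
rng = random.Random(0)
T = 500
x = [rng.gauss(0, 1) for _ in range(T)]
y = [0.0] + [0.1 * x[t - 1] + rng.gauss(0, 1) for t in range(1, T)]

H = 12  # horizon, chosen arbitrarily for the illustration
s = long_horizon_sums(y, H)

# Regress the H-period forward sum on the predictor observed at time t.
slope, intercept = ols_slope_intercept(x[: len(s)], s)
print(f"H = {H}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Because consecutive forward sums share H - 1 terms, the regression errors are serially correlated by construction, which is one reason inference from such regressions can be imprecise.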

2012 Science Fair

Congratulations to the 2012 Science Fair HACASA award winners!

HACASA participated in the 2012 Science and Engineering Fair of Houston as a Special Awarding Agency.  This year is the Fair's 54th year as a non-profit organization devoted to the enhancement of math and science education in junior and senior high schools. Congratulations to all the participants! 

This year's HACASA award winners are: 
Kavita Selva in the Junior Division, Physics and Astronomy Category, for the entry "Heat Rejection, Light & Color Transmission Of Low-E Glasses"

Gijs Landwehr in the Ninth Division, Computer Science Category, for the entry "The Statistics of Winning"

Matthew Lovelace in the Senior Division, Mathematics Category, for the entry "Sample Size Does Matter"

Best wishes to these and all of Houston's young scientists.

Want to be a judge at next year's Science Fair? Contact Erin Hodgess, HACASA Science Fair Coordinator, at 

More information on the science fair can be found at: