Advice on software for data matrix 2.5M x 10

  • 1.  Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 14:08
    Can anyone offer an opinion regarding which of these applications for a PC--SAS, Minitab, Stata--can best handle 2.5 million rows and 10 columns of data for statistical analysis?  My consulting project involves limited data management (sorting and selecting subsets of data) and basic statistical computations (univariate analysis, histograms, and maybe some Chi-squares).  There is funding to purchase software, and I'm most familiar with the three above.

    Also, can a typical PC (laptop) handle such a file and analysis or do I need access to a mainframe/server? 

    You may respond to me, and I'll compile and repost a summary or you can post to the group if you wish.

    Thank you,
    MJ-

    -------------------------------------------
    Monica Johnston
    Statistical Consultant & Instructor
    Mostly Math
    -------------------------------------------


  • 2.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 14:27
    Hello Monica,

    I would use SPSS.  It is much more user friendly than the applications that you mentioned.

    -------------------------------------------
    Christopher Oldcorn
    -------------------------------------------








  • 3.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 14:59
    I think SAS can probably handle it, although your typical desktop/laptop computer would need to be fairly fast, unless you don't mind waiting for 15-20 minutes. The limiting factor is going to be the RAM available to process a file that size, although that will be an issue regardless of the software you select.


    -------------------------------------------
    Gabriel Farkas
    -------------------------------------------








  • 4.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-28-2011 12:42


    -------------------------------------------
    Wendy Rotz
    -------------------------------------------

    It sounds like you are doing fairly simple calcs - SAS has sooooo many more capabilities if you will eventually want to do more analyses.  We use SAS on laptops with this size data sets all the time - no problem.

    One consideration though.  SAS is the most expensive and you only get licensed for a year.  If you have the funding for it, this may be one of the few opportunities you may have to obtain the software.  But it would be a larger part of your budget.  This could be justified if you have other projects coming up - especially if you are going to be doing more work for this source (who is paying for the software).






  • 5.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 15:01
    It does depend on the kind of system you have.

    PC SAS will work fine (I'm assuming the file will be about 250 MB in size). Since SAS doesn't keep the
    data in memory, it can handle very large files, and much larger ones if you are using the 64-bit Windows 7 version.

    It should probably work in Stata also (minimum system is 3.5G Windows XP), although you will be somewhat closer to its limits with the 32-bit version and Windows XP, which tops out at about 1G files since it stores the data in memory.  On the other hand, if you are running 64-bit Windows 7 on a system with 8G to 12G of memory, it will handle up to 7G of data in memory on an 8G system (just checked).  You will also have plenty of room for added variables etc.  I've used it with much larger files than yours, but of course with the bigger memory and a 64-bit operating system.
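
    A rough back-of-the-envelope check of the sizes discussed above can be sketched as follows. It assumes every value is stored as an 8-byte double, which is an assumption: the actual width depends on the package and on the variable types. The function name is made up for illustration.

```python
def estimate_bytes(rows, cols, bytes_per_value=8):
    """Rough in-memory size of a dense numeric table (assumes fixed-width values)."""
    return rows * cols * bytes_per_value


size = estimate_bytes(2_500_000, 10)
print(f"{size / 1e6:.0f} MB as 8-byte doubles")  # 200 MB
```

    Under that assumption the raw data are well under the 1G ceiling mentioned for the 32-bit setup, though working copies and added variables take extra room.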

    I stopped using SPSS for large files, because the last time I tried it, it processed files substantially slower than SAS (which only matters when you are looking at really large files).  But that was several years ago, I haven't tried it on the most recent version - which mainly I keep around to answer questions for consulting reasons. 

    Minitab, by the way, not even close. 

    I haven't tried R on the really big files - perhaps someone else out there will have. 

    Ray
    -------------------------------------------
    Raymond Hoffmann
    Associate Professor
    Medical College of Wisconsin
    -------------------------------------------








  • 6.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 14:55
    I'm a SAS user and I'm confident that SAS could do the trick, but SAS is expensive.  For simple statistics I would suggest finding someone with the software who could run it for you; that would probably cost a lot less than buying something. If you want to purchase some software, you can always call the vendors and see if their software can handle your needs.  I bet they all can.

    -------------------------------------------
    Rocco Brunelle
    Senior Statistician
    Bowsher Brunelle Smith LLC
    -------------------------------------------








  • 7.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 15:37


    -------------------------------------------
    Milton Goldsamt
    Survey Statistician
    -------------------------------------------
    As already mentioned, it would be best to contact the vendors and ask them about storage limits and processing speed, and also about the possibility of "overflow errors" due to the large file size. Another issue is the format in which the data file will be stored: can the software recognize the format in which you have the data?  And yes, as already mentioned, the speed of your processor will affect things.  I was fortunate to have a mainframe for large files of over 225,000 records when handling a series of file management and statistical calculations; I hope most PCs will handle your needs.

    Going a little further, if you don't mind the suggestion, had you thought of drawing a fairly small yet statistically representative sample from the full data file for the statistical calculations?  Performing chi-squares with 2.5 million records seems to be overkill, and almost every result is likely to be statistically significant.  I had a file of over a million records years ago for a client, and drawing a representative sample of 1.5% or so (in SPSS) was enough to produce fast and accurate results.
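
    To illustrate the point about chi-squares on 2.5 million records, here is a minimal sketch. The 2x2 table is hypothetical (two equal groups with success rates of 50.5% and 50.0%), and all function names are made up for the example. With the cell proportions held fixed, the Pearson statistic grows in proportion to the number of records.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def table_for(n, p_group1=0.505, p_group2=0.500):
    """Hypothetical counts: two equal groups with the given success rates."""
    half = n // 2
    a = round(half * p_group1)
    c = round(half * p_group2)
    return a, half - a, c, half - c


# 37,500 is a 1.5% sample of the 2,500,000 records.
for n in (37_500, 2_500_000):
    stat = chi_square_2x2(*table_for(n))
    print(n, round(stat, 2), p_value_df1(stat))
```

    In this made-up case the half-point difference is far from significant in the 1.5% sample but overwhelmingly significant in the full file, even though the effect is the same size in both.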

    Best of luck!  I too like SPSS, perhaps Version 19, their latest, can do the trick, but just check first. 




  • 8.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 15:22
    Many years ago, we found that SAS took about 15% more staff time to do identical tasks. We attributed this to human factors in the design of the s/w, i.e., the ease-of-doing all the nitty-gritty tasks involved in data cleaning and preparation that constitute almost all of the time in doing  a project.

    SPSS has a great menu system that allows you to create a first draft of syntax.  This facilitates going back to revise your approach, quality assurance review, and being explicit when you go on the listserv for help.


    2.5 million rows should be no problem. SPSS is limited on the number of lines solely by your disk storage.
    I just ran a quick simulation on my desktop.  It took 10.576 seconds to generate 2.5 million lines with 10 normally distributed 2 digit integers (mean 50, sd 10).
    It took 2.577 seconds to do descriptive statistics.
    My machine is a home brew about 4 years old with 4 64 bit processors and 8G Ram.  The OS is Windows 7. YMMV.
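
    A comparable check can be sketched outside SPSS. The version below is an illustration only, not the SPSS syntax that was actually run: it generates rows lazily and computes the mean and SD of each column in a single pass (Welford's method), so the full data set never has to sit in memory. Clipping the values to 10..99 is an assumption about what "2 digit integers" means, and the row count is scaled down because plain Python is slow; timings will not be comparable to the ones quoted above.

```python
import math
import random
import time


def simulate_rows(n_rows, n_cols=10, mean=50, sd=10, seed=1):
    """Yield rows of integers from a rounded normal(mean, sd), clipped to 10..99."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield [min(99, max(10, round(rng.gauss(mean, sd)))) for _ in range(n_cols)]


def column_stats(rows, n_cols=10):
    """One-pass (Welford) mean and sample SD per column; needs at least two rows."""
    count = 0
    means = [0.0] * n_cols
    m2 = [0.0] * n_cols
    for row in rows:
        count += 1
        for j, x in enumerate(row):
            delta = x - means[j]
            means[j] += delta / count
            m2[j] += delta * (x - means[j])
    sds = [math.sqrt(v / (count - 1)) for v in m2]
    return count, means, sds


start = time.perf_counter()
n, means, sds = column_stats(simulate_rows(50_000))  # raise to 2_500_000 for the full size
elapsed = time.perf_counter() - start
print(f"{n} rows in {elapsed:.2f} s; first column mean {means[0]:.2f}, SD {sds[0]:.2f}")
```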

    If you want to use very esoteric procedures that are not in SPSS, you can save the cleaned and prepped data in the formats for many packages.  R procedures are callable from within SPSS, so any R syntax stays very concise while SPSS handles the data.

    I have been doing statistical consulting since 1972.  Over the years many clients have found it effective to use SPSS for most work and to have some access to special purpose software for esoteric procedures.


    Art Kendall
    Social Research Consultants


    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------








  • 9.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-27-2011 08:06
    I would add that I put a lot of weight on ease of use; lower frustration; clarity of syntax that facilitates quality assurance review, being explicit about what is done to help revision and teamwork; and _saving staff time_.  In my experience, the cost of staff is a very large part of the total budget.  Total cost of ownership is a more important consideration than minimizing a single budget line.  SPSS and almost all packages are available as free trial versions. There are also grad student, government, and academic pricings.

    Clarity of syntax has made it easier for me to work and share with people on other continents over the net.

    Computer time to execute syntax is fairly trivial compared to what it used to take to do the same task and compared to the staff time to draft, test, revise, refine, and review the syntax.

    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------








  • 10.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-27-2011 09:03
    Stata is powerful, has a user-friendly GUI and, depending on your machine and the Stata version you have, is very fast.
     
    Kevin Gray
    Cannon Gray LLC
    Statistical and Analytics Consulting



  • 11.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-27-2011 09:33
    Though it was not one of the initial three software packages you asked about, I would encourage you to take a look at JMP.  JMP is an extremely powerful tool and is produced by the SAS group.  Another key feature is that it is very intuitive but at the same time has an extensive scripting language which allows for user-created functions.  The Statistics staff here at UNH use JMP in most of our courses (undergrad through PhD level) as well as in research/consulting.



    -------------------------------------------
    Philip Loud
    University of New Hampshire
    -------------------------------------------








  • 12.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-29-2011 10:27
    JMP will work well with a data set of this size; and, if the data come in sequence, there will be changes in distribution over time. I recommend using CUSUM plots (in JMP's Graphics/quality control section) to visually check on shifts and/or other changes in distribution over time.  CUSUM has been proved to be an optimum detector of a change in distribution, and many of my recent consulting problems have used CUSUM plots.
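
    For readers who have not met CUSUM before, a minimal sketch of a two-sided tabular CUSUM follows. It is not JMP's implementation; the function name, the example series, and the slack and threshold values are all assumptions chosen for illustration.

```python
def tabular_cusum(values, target, slack, threshold):
    """Two-sided tabular CUSUM; returns the index of the first signal, or None.

    target: in-control mean; slack: allowance k; threshold: decision interval h.
    """
    high = low = 0.0
    for i, x in enumerate(values):
        high = max(0.0, high + (x - target - slack))  # accumulates upward drift
        low = max(0.0, low + (target - slack - x))    # accumulates downward drift
        if high > threshold or low > threshold:
            return i
    return None


# Hypothetical series: 20 in-control points at 50, then a sustained shift to 52.
series = [50.0] * 20 + [52.0] * 20
print(tabular_cusum(series, target=50.0, slack=0.5, threshold=5.0))  # 23
```

    The shift starts at index 20 and the sum crosses the threshold four points later, which is the kind of small sustained change a CUSUM plot makes visible.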

    -------------------------------------------
    James Lucas
    J M Lucas & Associates
    -------------------------------------------








  • 13.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-29-2011 11:36


    -------------------------------------------
    J. Dobbins
    Delmarva Foundation
    -------------------------------------------
    Are you the James Lucas of Quality Control fame and the Cumulative Sum Method?  I was a quality engineer at NCR back in the 80s and 90s and remember the name.

    Greg Dobbins







  • 14.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-27-2011 23:44
    I've used both SAS and R for files with about 1.5M rows and 20 columns, with no problem.  For R, remember to allocate enough memory (and clean up unused objects), especially if you want to transpose the matrix.  Another way to go is to use Perl scripting with R to make it less cumbersome.

    -------------------------------------------
    Julio Molineros
    Biostatistician
    -------------------------------------------








  • 15.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 15:30
    JMP 9 can handle tens of millions of data points on a 64-bit machine. It's very easy-to-use, visual, and interactive software with excellent features for sorting and selecting subsets of data.  Its dynamic linking features across model platforms are excellent for data mining. Check out the program at www.jmp.com. You can download a free 30-day trial.

    Dave

    -------------------------------------------
    David Trindade
    Fellow
    Bloom Energy
    -------------------------------------------








  • 16.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 15:43
    All comments (and Art's simulation!) have been very helpful.  And, I do know how to use SPSS so that is another option.  I'll consider other software mentioned as well.

    btw, I have a Toshiba laptop, 32-bit OS, 2GB RAM and 65GB free hard disk space, Pentium-dual core, Windows Vista. 

    Thank you!
    -------------------------------------------
    Monica Johnston
    Statistical Consultant & Instructor
    Mostly Math
    -------------------------------------------








  • 17.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 16:05

    All good advice.  I've used both SAS and SPLUS side by side for a large data set (about 1 million records).
    For SPLUS, one must have the "bigdata" library.  Both are slow to read in the data, and
    an iterative procedure, like logistic regression, may take 10 or 15 minutes.
    Although I have not used it, there is a library (bit and ff)  for handling large data sets in R.
    http://cran.r-project.org/web/packages/ff/index.html

    A link to a nice description about R

    http://tinyurl.com/4ujmhhf

    http://ff.r-forge.r-project.org/bit&ff2.1-2_WU_Vienna2010.pdf


    The best option is to take a sample of your data to test/develop your software, and afterwards
     run it on the full dataset. You'll likely prefer to run in batch rather than interactively.
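
    One way to draw such a development sample without loading the whole file is reservoir sampling. The sketch below is a generic illustration (Algorithm R), not something specific to SAS, SPLUS, or R; the function name and the generated records are stand-ins, and the sizes are scaled down to keep the example quick.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)       # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = record      # replace with probability k / (i + 1)
    return sample


# Stand-in for a file read one line at a time; a real run would pass the open file.
rows = (f"record {i}" for i in range(250_000))
dev_sample = reservoir_sample(rows, k=2_500, seed=42)
print(len(dev_sample))  # 2500
```

    Because it only ever holds k records, the same function works unchanged on the full 2.5 million rows.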

    -------------------------------------------
    Christopher Barker

    Statistical Planning and Analysis Services, Inc.

    President Elect - San Francisco Bay Area Chapter of the ASA
    -------------------------------------------








  • 18.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 16:16
    I process many different kinds of multi-million record files using PC SAS. 2.5M records should not be a problem on a PC for SAS, especially since you only have 10 variables. For example, a DATA step that reads the American Community Survey over a network, at about that many records, and does some recoding takes about 5 minutes or less. SAS, though, is expensive for a single-person shop to purchase.

    -------------------------------------------
    Michele Burlew
    Episystems, Inc.
    SAS Press Author
    -------------------------------------------








  • 19.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 18:04
    SPSS, SAS and JMP are all good choices.  I'm sure there are even more packages out there that you can use.  R has two constructs for large datasets: bigmatrix and ff (flat file). The problem with these is that the survey functions cannot work with them. The alternative of storing the data in an external database is very inefficient, as R's database interface is very slow and disk-intensive.  R's only benefit in this context is that it is free, but, still, I do not recommend it.

    -------------------------------------------
    Chuck Coleman
    -------------------------------------------








  • 20.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 19:04
    Minitab does a good job also. It's great for the less sophisticated user so it's good for consulting work with non-statisticians.

    -------------------------------------------
    Patrick Spagon
    -------------------------------------------








  • 21.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-27-2011 10:32
    But minitab will never handle that large a file in a reasonable amount of time.

    -------------------------------------------
    Raymond Hoffmann
    Associate Professor
    Medical College of Wisconsin
    -------------------------------------------








  • 22.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 16:21


    -------------------------------------------
    Paul Black
    Neptune & Company
    -------------------------------------------
    Another option that we use is R combined with an SQL database.  In open source we use R with PostgreSQL.
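
    The general pattern (keep the rows in a database, let SQL do the subsetting and aggregation, and pull back only the summaries) can be sketched as below. This is an illustration under stated assumptions, not the setup described above: sqlite3 stands in for the PostgreSQL server so the example is self-contained, and the table name, column names, and generated data are hypothetical.

```python
import random
import sqlite3

# In-memory SQLite database standing in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (grp TEXT, value REAL)")

rng = random.Random(0)
rows = ((rng.choice("ABC"), rng.gauss(50, 10)) for _ in range(100_000))
conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)

# Subsetting and aggregation happen in the database; only the summary comes back.
summary = conn.execute(
    "SELECT grp, COUNT(*), AVG(value) FROM obs "
    "WHERE value >= 0 GROUP BY grp ORDER BY grp"
).fetchall()
for grp, n, mean in summary:
    print(grp, n, round(mean, 2))
conn.close()
```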







  • 23.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-26-2011 16:36
    Monica,

    Using R, I tried creating and manipulating a 2.5M x 10 matrix on my PC. It was no problem. Here are the times:

    - Create the matrix, initialized with zeros: 0.43 sec
    - Create the matrix, initialized with random normal variates: 4.6 sec
    - Plot histogram of matrix contents: 3.1 sec
    - Compute mean of each column: 0.17 sec
    - Compute SD of each column: 0.34 sec

    Admittedly, my PC is less than a year old and likely faster than your laptop. Nonetheless, you shouldn't need a mainframe or server to crunch your data. My clients regularly crunch datasets larger than this on their desktops.

    You can't beat the price of R: it's free. Why enrich SAS, SPSS, or anyone else? Spend a little grant money on the fine documentation available for R. Save the rest for a faster computer... or a nice team dinner, for that matter.

    Paul

    -------------------------------------------
    Paul Teetor
    Quantitative Developer
    ElginILUnited States
    -------------------------------------------








  • 24.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-27-2011 12:51

    I really agree with Paul's sentiments re the software.  We have had transitioning from SAS to R on the "important list" for quite a few months, and are hoping to move in that direction this year.  Our latest "annual fee" from SAS (nothing special - base, stats, graph, intrnet...) was $85,000 - because we consult!  I used to be one of their greatest supporters, but this move in the last few years to milk the consultants has really turned my attitude.  The SAS logic - our consulting means our client doesn't buy a SAS license - like the software runs itself without a trained statistician - great respect for the users there, so we should provide SAS with more income AND - pass it along to our clients; since they are unaware of "competition" maybe that seems OK for them. 

    Sorry to vent - it was just the comment on the software.
    -------------------------------------------
    Janet McDougall
    President
    McDougall Scientific Ltd
    -------------------------------------------








  • 25.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-28-2011 00:21
    Unless you're doing some pretty exotic statistical routines, 2.5 million records is not a problem for any statistical software package mentioned here. Don't use Excel, of course!

    Exotic means something that requires a substantial number of iterations, or cross-validation, or other resource intensive procedures. Simple regression models (even logistic regression) should be no problem, as long as you are not expecting instantaneous turnaround.

    As others have mentioned, R is less adept at very large data files than the other programs, but very large here probably means a number of records that is one or two orders of magnitude larger than your data set.

    So choose the statistical package based on what you are most comfortable with, rather than trying to figure out which package is optimal. They'll all work just fine. If your file size were closer to a billion observations rather than a million, you might need to think more carefully about which package to choose.

    Keep in mind that you need a fair amount of free hard disk space for intermediate files. I'd try to make sure that you have at least 10 times as much free space as the size of your data file. And add as much RAM to your laptop as it can hold.

    -------------------------------------------
    Stephen Simon
    Independent Statistical Consultant
    P. Mean Consulting
    -------------------------------------------




  • 26.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-28-2011 00:49

    Some have asked whether R will support large data sets.
    If you don't know whether your computer will handle
    a data set this size, it's easy to test; just run the
    following program.  I can run it on my several-year-old
    laptop.  The longest part is generating the random data, and
    that takes about 30 seconds on my computer.

    (For R addicts:  Note that I made the data frame in
    steps, rather than at once.  Evidently this handles
    memory better, because when I do it at once, there
    is not enough memory on my computer.  Putting the
    numbers in an array first, then in a data frame, works
    for me.)

    # test R with large data set;
    # 25,000,000 numbers in a
    # 2,500,000 x 10 array

    nums <- rnorm(25000000)
    data <- matrix(nums,ncol=10)
    rm(nums)     # remove objects when done with them
    data.fr <- data.frame(data)
    rm(data)

    colMeans(data.fr)     #  mean for each var
    var(data.fr)          #  var/covar matrix
    sapply(data.fr, sd)   #  standard deviations

    rm(data.fr)   

    -------------------------------------------
    David Rindskopf
    CUNY Graduate Center
    -------------------------------------------








  • 27.  RE:Advice on software for data matrix 2.5M x 10

    Posted 03-28-2011 02:07

    Part A: Why 2.5M

    Here are a few more considerations if working with large quantities of data (and I don't consider 2.5M records with a few columns particularly large):

    As soon as one goes beyond 100 000 records, it is good to think first: What is it that makes 2.5M records significantly richer in information than 100 000? There will rarely be a real difference unless we are dealing with insufficiently stratified surveys with low-frequency pockets (rare combinations of criteria for which we would nevertheless like to assure sufficient coverage).

    If 100 000 is practically equal to 2.5M, then it may be more interesting to analyze 100 000 record chunks drawn by random partitioning from the raw data. This will not only get the estimates we are looking for but also an idea of variability. In the end we can pool to use all data.

    If 2.5M >> 100 000 then we should give some thoughts about re-designing the study (if it has to be done again).
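
    The chunking idea in Part A can be sketched as follows. This is an illustration under assumptions: the data are simulated draws standing in for one column of the real file, the function name is made up, and the sizes are scaled down by a factor of ten (25 chunks of 10,000 rather than 25 chunks of 100,000).

```python
import random
import statistics


def chunk_estimates(values, n_chunks, seed=0):
    """Randomly partition values into n_chunks and return each chunk's mean."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    # Taking every n_chunks-th element of the shuffled list gives a random partition.
    return [statistics.fmean(shuffled[i::n_chunks]) for i in range(n_chunks)]


# Hypothetical data: 250,000 draws standing in for one column of the full file.
rng = random.Random(1)
data = [rng.gauss(50, 10) for _ in range(250_000)]

means = chunk_estimates(data, n_chunks=25)
print("chunk-to-chunk SD of the mean:", round(statistics.stdev(means), 3))
# The chunks are the same size, so the mean of the chunk means is the overall mean.
print("pooled mean:", round(statistics.fmean(means), 3))
```

    The spread of the chunk means gives the idea of variability directly, and pooling the chunks at the end uses all of the data.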

    Part B: New trends in software/computing

    You may consider the following environment for high performance computing on large data sets:
    - check out www.elastic-r.org, it provides:
    a) a portal to Amazon EC2 computing services (as soon as you enable EC2 for your amazon book account, you can use Amazon EC2 computing services)
    b) a set of virtual machines suitable for statistical work, going from a basic single-core Ubuntu 32-bit system with 1-2GB of RAM and a 160GB virtual disk for 0.08 USD per hour to high-performance settings with 64-bit Ubuntu, 8 cores, 8GB RAM and a 1.6TB disk for 0.68 USD per hour. These virtual machines come pre-configured with R 2.12.0 (currently) and a set of connection tools (see point c)
    c) the connection tools allow you to transfer data from your local system to the cloud computer via scp (using a winscp client), to collaboratively access the cloud session, to build Java based interfaces to the cloud session, etc.
    d) persistent virtual disks (for a small rental fee)

    What do you get from this:
    -A very inexpensive scalable computing system enabling you to develop on a small instance and to run an almost
    super-computer when you need it.
    -A very flexible way to communicate with a running virtual machine.

    Check it out and have a nice day,

    Chris.


    -------------------------------------------
    Christian Ritter
    University Catholique De Louvain
    -------------------------------------------