ASA Connect

  • 1.  Big Data Software

    Posted 03-05-2016 00:49

    Hi,

    I am going to collect data about the use of simulations in physics education. This will be big data. How can I handle this data with software? Please let me know if you are familiar with software that could be used to handle big data.

    ------------------------------
    Muhammad Riaz
    Doctoral Student
    Dowling College
    ------------------------------


  • 2.  RE: Big Data Software

    Posted 03-05-2016 11:36

    If you are going to process a lot of data, you need software that uses parallel processing and an up-to-date, optimized BLAS and LAPACK. If you need to collect and store this data, you will need some type of Hadoop-based system.
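
    A quick way to see whether your numerical software can take advantage of this is to check which BLAS/LAPACK it is linked against. As a minimal sketch (assuming Python with NumPy; the matrix size is arbitrary):

        # Print the BLAS/LAPACK libraries NumPy was built against, then time a
        # large matrix product; an optimized, multithreaded BLAS (e.g. OpenBLAS
        # or MKL) will spread this across cores automatically.
        import time
        import numpy as np

        np.show_config()

        n = 4000                  # adjust to your machine's memory
        a = np.random.rand(n, n)
        b = np.random.rand(n, n)

        start = time.time()
        c = a @ b                 # dispatched to the linked BLAS (dgemm)
        print(f"{n}x{n} matrix product took {time.time() - start:.2f} s")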

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 3.  RE: Big Data Software

    Posted 03-07-2016 12:39

    In general, you probably don't need Big Data unless you are looking for tiny niches of importance (<1%).

    For most analytics on the group being analyzed, you can typically take a reasonable random sample and analyze it with any common software package. For most purposes, a sample of 10K-100K observations is more than adequate and easily handled by most packages.
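
    As a minimal sketch of that approach (assuming Python with pandas; the file name and sample size are placeholders):

        # Draw a reproducible random sample from a large data file and analyze
        # the sample instead of the full data set.
        import pandas as pd

        df = pd.read_csv("simulation_results.csv")      # hypothetical file
        sample = df.sample(n=100_000, random_state=42)  # 100K-row sample
        print(sample.describe())                        # summaries on the sample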

    In fact, Big Data can often produce misleading results: because of the large sample size, relationships can appear statistically significant even when they are practically negligible. This is also true of smaller samples, say 100K. The key when looking at results from large samples is to consider not only the statistical significance of the results, but whether the results are meaningful.

    For example, a simple t-test may show a significant difference (p < .001), but the actual difference may be very small. Consultation with SMEs (subject matter experts) is helpful in this regard.
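
    A minimal sketch of that point (assuming Python with NumPy and SciPy; the group sizes and means are made up for illustration):

        # With a very large sample, even a tiny true difference in means comes
        # out "significant" (p well below .001) although the actual difference
        # is practically negligible.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        n = 1_000_000
        group_a = rng.normal(loc=100.0, scale=15, size=n)
        group_b = rng.normal(loc=100.2, scale=15, size=n)  # true difference of 0.2 on a scale of 100

        t_stat, p_value = stats.ttest_ind(group_a, group_b)
        print(f"p-value: {p_value:.2e}")                   # typically far below .001
        print(f"observed difference in means: {group_b.mean() - group_a.mean():.3f}")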

    ------------------------------
    Michael L. Mout, MS, CStat, CSci
    MIKS & Assoc. - Senior Consultant/Owner
    4957 Gray Goose Ln, Ladson, SC 29456
    804-314-5147 (Mbl), 843-871-3039 (Home)
    ------------------------------





  • 4.  RE: Big Data Software

    Posted 03-08-2016 02:29

    Generally, what happens with a simulation is that you run a fairly simple program and it generates thousands of data points. You keep those data points until you have time to process them.
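
    A minimal sketch of that pattern (Python, standard library only; the model is just a placeholder random walk and the file name is hypothetical):

        # Run many simulation steps and append the data points to a CSV file on
        # disk so they can be processed later.
        import csv
        import random

        with open("simulation_output.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["run", "step", "value"])
            for run in range(100):                # 100 independent runs
                value = 0.0
                for step in range(1000):          # 1,000 data points per run
                    value += random.gauss(0, 1)   # placeholder for the real model
                    writer.writerow([run, step, value])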

    Depending upon how the simulation is run, taking a sample of the data may yield misleading results.

    If you look at what BOINC does, it distributes large data sets and simulation models that take hours to run. Part of the reason BOINC works the way it does is that it needs help processing all of the data to get it down to a manageable size.

    I was at a talk a few days ago discussing "Big Data" and biological simulations. They have dozens of servers working 24/7 trying to process all the data the simulations generate. Only 1% of the data might be valuable. But, do you want to pass up on the cure for (enter disease name here)?

    The final data set might fit on a flash drive, but it will take petaflops of computation to get there.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)