Part A: Why 2.5M
Here are a few more considerations for working with large quantities of data (and I don't consider 2.5M records with a few columns particularly large):
As soon as one goes beyond 100 000 records, it is good to ask first: what makes 2.5M records significantly richer in information than 100 000? There will rarely be a real difference, unless we are dealing with insufficiently stratified surveys with low-frequency pockets (rare combinations of criteria for which we would nevertheless like to ensure sufficient coverage).
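To make the "low-frequency pocket" point concrete, here is a minimal sketch in Python. The pocket rate of 1 in 50 000 is a made-up number, and the calculation assumes records fall into the pocket independently; neither comes from real survey data.

```python
import math

def pocket_coverage(n_records, pocket_rate):
    """Expected number of records in a rare pocket, and the chance of
    getting none at all, assuming independent records (an assumption)."""
    expected = n_records * pocket_rate
    # log1p keeps (1 - rate) ** n accurate for very small rates
    p_empty = math.exp(n_records * math.log1p(-pocket_rate))
    return expected, p_empty

rate = 1 / 50_000  # hypothetical rare combination of criteria
for n in (100_000, 2_500_000):
    expected, p_empty = pocket_coverage(n, rate)
    print(f"n={n:>9,}: expected {expected:.0f} records, P(none) = {p_empty:.3g}")
```

With a pocket this rare, 100 000 records yield only a couple of cases on average, which is the situation where the full 2.5M may really add information.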
If 100 000 is practically as informative as 2.5M, then it may be more interesting to analyze chunks of 100 000 records drawn by random partitioning of the raw data. This gives not only the estimates we are looking for but also an idea of their variability. In the end we can pool the chunks to use all the data.
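The chunking idea can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the estimate of interest is taken to be a plain mean, the function name `chunk_estimates` is invented here, and pooling by averaging equal-sized chunks is one simple choice among several.

```python
import random
import statistics

def chunk_estimates(values, chunk_size, estimator=statistics.fmean, seed=0):
    """Shuffle the records, cut them into disjoint chunks of chunk_size,
    and apply the estimator to each chunk. A short last chunk is dropped."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    n_chunks = len(shuffled) // chunk_size
    return [
        estimator(shuffled[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_chunks)
    ]

# Simulated stand-in for the raw data (not real survey data).
rng = random.Random(42)
data = [rng.gauss(10.0, 3.0) for _ in range(250_000)]

estimates = chunk_estimates(data, chunk_size=10_000)
pooled = statistics.fmean(estimates)   # equal-sized chunks: a plain average pools them
spread = statistics.stdev(estimates)   # chunk-to-chunk variability of the estimate
print(f"{len(estimates)} chunks, pooled {pooled:.3f}, sd across chunks {spread:.3f}")
```

The spread across chunks is what a single run on all the data would not show directly. For an estimator other than the mean, the pooling step would need more thought.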
If 2.5M >> 100 000 (that is, the full data set carries much more information), then we should give some thought to redesigning the study (if it has to be done again).
Part B: New trends in software/computing
You may consider the following environment for high performance computing on large data sets:
- Check out www.elastic-r.org. It provides:
a) a portal to Amazon EC2 computing services (you can use them as soon as you enable EC2 for your regular Amazon account)
b) a set of virtual machines suitable for statistical work, going from a basic single-core 32-bit Ubuntu system with 1-2 GB of RAM and a 160 GB virtual disk for 0.08 USD per hour to a high-performance setup with 64-bit Ubuntu, 8 cores, 8 GB of RAM and a 1.6 TB disk for 0.68 USD per hour. These virtual machines come pre-configured with R 2.12.0 (currently) and a set of connection tools (see point c)
c) connection tools that allow you to transfer data from your local system to the cloud computer via scp (using a WinSCP client), to collaboratively access the cloud session, to build Java-based interfaces to the cloud session, etc.
d) persistent virtual disks (for a small rental fee)
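As a rough illustration of what the hourly rates in point b) mean in practice, here is a small cost calculation in Python. The workload (40 hours of development, 5 hours of heavy computation) is a made-up example, the rates are simply the ones quoted above, and the rental fee for persistent disks is left out.

```python
SMALL_USD_PER_HOUR = 0.08  # basic single-core instance, rate quoted above
LARGE_USD_PER_HOUR = 0.68  # 8-core instance, rate quoted above

def compute_cost(dev_hours, run_hours):
    """Cost in USD of developing on the small instance and
    running the final job on the large one (disk rental excluded)."""
    return dev_hours * SMALL_USD_PER_HOUR + run_hours * LARGE_USD_PER_HOUR

# Hypothetical workload: 40 hours of development, 5 hours of heavy computation.
print(f"{compute_cost(40, 5):.2f} USD")  # prints "6.60 USD"
```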
What do you get from this:
- A very inexpensive, scalable computing system that lets you develop on a small instance and run on something close to a supercomputer when you need it.
- A very flexible way to communicate with a running virtual machine.
Check it out and have a nice day,
Chris.
-------------------------------------------
Christian Ritter
University Catholique De Louvain
-------------------------------------------