Survey Design for payroll data- citations and suggestions appreciated

1. Survey Design for payroll data- citations and suggestions appreciated

Chris Barker
Posted 07-10-2012 10:49
This message has been cross-posted to the following eGroups: Statistical Consulting Section and Survey Research Methods Section.
-------------------------------------------
I would appreciate some suggestions and especially appreciate citations to published articles. (Cross posted to Survey Research and Statistical Consulting)

I'm working for the Chief Financial Officer (CFO) of a relatively small business.

The CFO needs an estimate of the number of employees who worked more than 6 hours per day during each year from 1985 to the present, and an estimate of the number of days on which employees worked more than 6 hours during each year (1985 to present). The hours worked each day are recorded on paper time (punch) cards. Computerizing all of the time card records is prohibitively expensive.

The company has eleven offices, and the CFO has payroll records (each covering a 2-week period) for every employee, on paper (not computerized). Payroll is run every 2 weeks at each office. Employees "punch cards" in a machine to record their time in and time out. Determining the number of days worked more than 6 hours requires someone to visually review the paper time card. All eleven offices were open for the entire period under consideration.

By "relatively small business" I mean each of the eleven offices has about 10 employees - and that varies by about 1 or 2 employees over the course of a year. Employees can either be managers or line workers.

The CFO has a record for each employee (which I will see in a few days) of the hire date and discharge date (e.g. when the employee leaves for another job).

For some of the survey designs I considered, it's possible that employees who worked for most or all of the 1985-to-present period may appear twice in a sample.

I had hoped to use a survey design to get the answer. Design tips, suggestions, and especially citations are appreciated.

Thanks in advance

-------------------------------------------
Chris Barker, Ph.D.
Adjunct Associate Professor of Biostatistics
University of Illinois Chicago School of Public Health

and

Director of Communications
Statistics Without Borders

2010 Past President - San Francisco Bay Area Chapter of the American Statistical Association

----------------------------------------------
"In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
-Steve Lacy
-------------------------------------------
2. RE:Survey Design for payroll data- citations and suggestions appreciated

Stephen Simon
Posted 07-10-2012 15:31
My apologies if most or all of this is already obvious to you.

The first thing you need to define is what you want to look at: all the records for a randomly selected employee, all employees on randomly selected days in a time period, randomly selected employees on randomly selected days, or something else entirely.

Also, do you have the same sampling probability for an employee who works for two months and an employee who works for two decades? Would you need to reweight the data to get a meaningful estimate?

These two are intertwined. You might get a simpler sampling system that leads to a more complex weighted estimate versus a more complex sampling system that leads to a simpler unweighted estimate. Which is better overall is hard to say, though I suspect that the latter would be preferable.

One possibility that you've probably already thought about is randomly selecting employees and picking a random time for each employee in your sample. Depending on what you are trying to estimate, this may lead to overrepresentation of short-term employees, which would require reweighting. Another choice is randomly selecting a certain number of time periods and then randomly selecting one employee from each time period. Again, depending on what you are trying to estimate that may lead to overrepresentation of long-term employees.

In the latter case, an employee who stays for a decade has a ten times larger probability of being selected than an employee who stays for a year. If that's what you want, great, but if not, you have to downweight the first employee.

It's tricky to visualize, but one way of thinking of it is to figure out how you would estimate things if you had all the data. Would you compute an estimate for each employee and then calculate an unweighted average of those results across all employees? In that case, you'd want either to make sure that short-term and long-term employees have an equal probability of getting into your sample, or to adjust the weights, giving greater weight to those employees who, because of their shorter tenure, have a smaller probability of getting into your sample.

Or would you just get an estimate for each payroll record and average across all records? In that case, you'd want either to make sure that longer-term employees have a larger sampling probability, or to adjust the weights. This time the weights would go in the opposite direction: if every employee has the same chance of selection, employees with longer tenures are underrepresented and need to be upweighted.
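To make the two weighting directions concrete, here is a minimal Python sketch. The tenures and proportions are made up for illustration and are not from the payroll data. It assumes time periods were sampled first and one employee was drawn from each, so that an employee's chance of selection is roughly proportional to tenure.

```python
# Hypothetical sample: (tenure in years, proportion of sampled days over 6 hours).
# Assumption: selection probability is proportional to tenure (period-first sampling).
sample = [(10, 0.80), (10, 0.70), (1, 0.20), (2, 0.30)]


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


tenures = [t for t, _ in sample]
values = [y for _, y in sample]

# Target 1: average over payroll records. Selection was already proportional
# to tenure, so equal weights are a reasonable choice.
per_record = weighted_mean(values, [1.0] * len(sample))

# Target 2: average over employees. Long-tenure employees are overrepresented,
# so each sampled employee is weighted by the inverse of tenure.
per_employee = weighted_mean(values, [1.0 / t for t in tenures])
```

With these invented numbers the per-employee estimate comes out lower than the per-record estimate, because the two short-tenure employees (who happen to have fewer long days) count for more once they are upweighted.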

You might consider changing the sampling probabilities, so that your chance of choosing an employee is some function of how many payroll records that employee has.
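One possible way to sketch that idea is below. The roster and record counts are invented, and `pps_sample` and `selection_probability` are names made up for this example. It draws employees with probability proportional to their number of payroll records, using `random.choices` from the Python standard library. The draw is with replacement, so a long-tenure employee can appear more than once.

```python
import random

# Hypothetical roster: employee id -> number of biweekly payroll records on file.
records_per_employee = {"A": 520, "B": 26, "C": 130, "D": 260}


def pps_sample(counts, n, seed=None):
    """Draw n employees with replacement, probability proportional to record count."""
    rng = random.Random(seed)
    ids = list(counts)
    return rng.choices(ids, weights=[counts[i] for i in ids], k=n)


def selection_probability(counts, emp):
    """Per-draw selection probability for one employee."""
    return counts[emp] / sum(counts.values())


draws = pps_sample(records_per_employee, n=5, seed=2012)
```

The per-draw probabilities from `selection_probability` are what you would invert to form weights if the target were a per-employee average rather than a per-record one.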

You also need to define the cost of sampling. This may vary, of course, by the sampling plan. So ten time cards from the same employee may or may not cost as much as ten time cards from ten different employees. These costs will define whether one sampling plan is preferable to another sampling plan.

You also need to define the desired width of your confidence interval. Even better would be a function that translates the width of the interval into a dollar figure, so you can find the optimal tradeoff between sample size and precision. In many applications, though, there is no obvious way to quantify the economic benefit of a narrower confidence interval. At a minimum, you have to get a commitment to the desired width of the confidence interval. Otherwise, you have no rational basis for choosing a sample size.

The formula for the width of the confidence interval should take into account both the stratified nature of your sample and the clustering effect of multiple measurements on a single employee. It might be a weighted mean or proportion, which adds another layer of complexity. But any good book on sampling (e.g., Levy and Lemeshow) should have all the formulas you need.
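As a rough illustration of how the desired width drives the sample size, here is a Python sketch. It uses the usual normal-approximation formula for a proportion, inflated by an assumed design effect for clustering and optionally shrunk by a finite population correction. The function names, the default p = 0.5, and the example intraclass correlation are assumptions made for illustration. A real calculation for a stratified, weighted design should follow the formulas in a sampling text.

```python
import math


def cluster_design_effect(records_per_employee, icc):
    """Approximate design effect 1 + (m - 1) * rho for m records per employee."""
    return 1 + (records_per_employee - 1) * icc


def sample_size_for_proportion(half_width, p=0.5, z=1.96, deff=1.0, population=None):
    """Records needed so a CI for a proportion has roughly the given half-width.

    z=1.96 corresponds to a 95% interval. deff is an assumed design effect
    (1.0 means simple random sampling). population, if given, applies a
    finite population correction to the simple-random-sampling size.
    """
    n0 = (z ** 2) * p * (1 - p) / half_width ** 2      # simple random sampling
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)          # finite population correction
    return math.ceil(n0 * deff)                        # inflate for clustering


n_srs = sample_size_for_proportion(0.05)
deff = cluster_design_effect(records_per_employee=5, icc=0.25)
n_clustered = sample_size_for_proportion(0.05, deff=deff)
```

Here `n_srs` is 385 records for a half-width of 5 percentage points under simple random sampling. Taking five records per employee with an assumed intraclass correlation of 0.25 doubles the design effect, and `n_clustered` rises to 769.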

Do you want a single confidence interval across all years or a separate interval for each year? If you get yearly intervals, do you care whether the intervals are independent of one another? If not, then getting the same employee in 1986 and 1988 is irrelevant.

You might avoid some of the complexities of cluster sampling if you deliberately only choose one record for any employee in your sample, but it is unclear from your description if you can do this.

You should definitely consider a sample of employees stratified by office (e.g., get exactly two employees from each office). A stratified sample by work period is another possibility (e.g., make sure that each month is equally represented in your sample). Maybe you should consider a doubly stratified sample. A stratified sample may or may not be better than a purely random sample--it depends on whether there is a lot of homogeneity within strata. But in your setting it seems like such an easy thing to implement that it's almost a no-brainer.
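A stratified draw by office takes only a few lines. The Python sketch below assumes a hypothetical roster keyed by office (11 offices with 10 employees each, to match the description). The function name and employee identifiers are invented.

```python
import random


def stratified_sample(roster_by_office, per_office=2, seed=None):
    """Pick per_office employees at random, without replacement, from each office."""
    rng = random.Random(seed)
    return {
        office: rng.sample(employees, min(per_office, len(employees)))
        for office, employees in roster_by_office.items()
    }


# Hypothetical roster: 11 offices with 10 employees each.
roster = {
    f"office_{k}": [f"emp_{k}_{j}" for j in range(1, 11)]
    for k in range(1, 12)
}
picked = stratified_sample(roster, per_office=2, seed=1985)
```

The same pattern extends to a doubly stratified sample by keying the roster on (office, month) pairs instead of office alone.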

A cluster sample only makes economic sense if the cost of sampling multiple records per individual is a lot cheaper than sampling the same number of records from different individuals. I doubt that is the case here. So avoid cluster effects in your estimate unless it is impossible to avoid them.

Finally, if the sample is needed for some regulatory reason (e.g., to satisfy an IRS audit) then you should consider what the regulatory agency wants to see. They may have some guidelines on how to sample in a way to meet their goals. No point in optimizing the sampling plan for the benefit of the CFO if the regulatory agency is going to shoot down your sampling plan if they consider it inadequate.

I hope this helps.

-------------------------------------------
Stephen Simon
Independent Statistical Consultant
P. Mean Consulting
-------------------------------------------
