PMean: What should go into a data codebook

Steve Simon


Before you start your data entry, you should create a data codebook. If you don’t have a data codebook when you hand your data over to someone else, take the time to create one for their benefit and yours. The data codebook contains a description of your data set. There’s no standard form for a data codebook, and what you describe may depend on a variety of factors, such as the complexity of your data set, the number of people involved in data collection and data entry, and the number of people that you are likely to share your data with. Here are some of the elements that you should think about putting in a data codebook.

The data codebook should have some general information about the data set as a whole. This might include:

In a laboratory experiment, you might want to document the conditions under which the experiment was run.

The data codebook, more importantly, documents information about each variable in your data set (each column). This might include:

If there are several reasons why a measurement might be missing, you should create SEVERAL missing value codes and document what each code represents. This is important because the statistical handling of missing values is strongly dependent on the reason the value is missing.

If you enter your data into a text file, the data codebook can document the structure of your file. This would be the location of each variable (which column the variable starts in and which column it ends in) for a fixed width text file. It would be the delimiter that separates individual data values in a delimited data file. If strings are surrounded by quote marks, the codebook would note this, as well as whether they were single (') or double (") quotes.

The data codebook is also a record of any choices that you make during data entry that you need to remember throughout the entire data entry process. If, for example, you are typing in the date of a visit to the emergency room, what day do you assign to a patient that arrives exactly at midnight? There’s nothing wrong with using the same date that a 11:59pm visitor would have, and there’s nothing wrong with putting a midnight visitor on the following day (see commentary on this at the National Institute for Standards and Technology).? What does matter is that once you decide how to resolve this ambiguous data value, you need to document your choice so that every future value that is ambiguous in the same way is handled in the same way.

There are some web pages that describe what should go into a codebook, though the recommendations vary widely.