
Open data for online psychology experiments

Research in cognitive science is usually confined to a single lab, or to a small group of collaborators. But it need not be. A key problem in encouraging wider collaboration is finding ways of sharing data from human subjects that do not compromise the privacy and confidentiality of the participants (DeAngelis, 2004), or the legal and ethical norms designed to protect that privacy.

Bullock, quoted in DeAngelis (2004), notes other difficulties: psychology traditionally has not built large systems for storing or sharing large data sets, and has not developed the culture of data sharing seen in other disciplines.

The challenge in brief: what would it take to build a general-purpose, internet-based experiment presentation and data collection system in which the resulting data is automatically and anonymously shared after a suitable embargo period? This could be done by individual researchers self-archiving their data, or by depositing the data in a repository. One advantage of such an automated data publishing system would be a reduction in the cost of publishing properly formatted raw psychological data.

This is a practical project, the contemplation and building of which is entirely feasible with today’s technology. The charm of this project is that it requires three issues to be worked out in order to succeed.

One of these issues has to do with the internet: how do we offer reasonable guarantees that the collected data remains anonymous, and is not compromised in transmission back to the experimental server or on the server itself? Although the details vary by institution, the main requirement for making human subjects data publicly available is that the data be presented anonymously: the name, and any other personally identifying information, must not be stored with the data.

This simple practice of separating data from identifying information is complicated by the fact that the data travels over the internet, where it can be intercepted and servers can be compromised. Solutions could include strong encryption or IP anonymization. They could also include discarding potentially identifying data on the client machine before it is ever transmitted, or discarding some of the received information after summary calculations are made, but before it is stored.
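As an illustration, here is a minimal Python sketch of the client-side option: identifying fields are dropped before anything is transmitted, and a random, unlinkable ID is attached in their place. The record structure and field names are hypothetical.

```python
import json
import secrets

# Hypothetical raw record as it might exist on the client machine.
raw_record = {
    "name": "A. Participant",            # identifying: must never leave the client
    "email": "participant@example.org",  # identifying: must never leave the client
    "condition": "B",
    "reaction_times_ms": [412, 388, 455],
}

IDENTIFYING_FIELDS = {"name", "email"}

def anonymize(record, identifying=IDENTIFYING_FIELDS):
    """Drop identifying fields and attach an unlinkable random ID.

    A fresh random token (rather than, say, a hash of the name) means the
    ID cannot be reversed or matched against a list of known participants.
    """
    clean = {k: v for k, v in record.items() if k not in identifying}
    clean["participant_id"] = secrets.token_hex(8)
    return clean

# Only the anonymized payload is ever serialized for transmission.
payload = json.dumps(anonymize(raw_record))
```

The key design choice is that the anonymization happens before serialization, so identifying fields never reach the wire at all, regardless of what happens on the server.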

A second issue is societal: under what circumstances do we have the right to share human subjects data? The answer varies by institution, and involves aligning the policies and interests of different stakeholders. University oversight on these matters can include ethics committees, privacy officers, and legal departments. There may also be assertions of intellectual property by the institution or by the body funding the research. One or all of these groups may need to be consulted, depending on the location of the researcher and/or repository. This is an issue that would need to be explored carefully and sensitively with the relevant stakeholders at the repository institution.

A third issue is experimental. There is no shortage of online experiments on the web; Psychological Research on the Web lists hundreds. As the Top Ten Online Psychology Experiments post points out, it is a little hard to assess the validity of these results because of variations in the speed of the hardware. (That post also notes that we don’t know who is taking these tests, or whether they have understood the instructions properly.) How can we offer reasonable guarantees that data collected on different hardware will be valid? This includes timing accuracy for both input and output (Plant & Turner, 2009), and adjusting stimuli to ensure similarity of size, colour, or volume.
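At least part of this can be measured rather than assumed. Here is a hypothetical Python sketch that estimates the resolution of a client's clock; an experiment runner could flag or down-weight data from machines whose timer granularity approaches the effect size being measured. The function name and sampling scheme are illustrative, not from any particular experiment system.

```python
import time

def timer_resolution(samples=1000):
    """Estimate the smallest interval the local clock can resolve.

    If this granularity approaches the effect size being measured (tens
    of milliseconds in many reaction-time tasks), the client's data
    could be flagged or discarded.
    """
    deltas = []
    for _ in range(samples):
        t0 = time.perf_counter()
        t1 = time.perf_counter()
        while t1 == t0:  # spin until the clock visibly advances
            t1 = time.perf_counter()
        deltas.append(t1 - t0)
    return min(deltas)
```

This only characterizes the clock, not keyboard or display latency, which Plant and Turner show can be much larger; it is a floor on timing error, not a ceiling.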

An embargo period is important for three reasons: (a) to protect participant privacy, (b) to protect the integrity of the experiment, and (c) to protect, to the extent they desire, the researcher’s work. In particular,

(a) It is important not to release data immediately upon collection, because anyone who knows when a given person took part could trace the released data back to them. A standard (known) embargo period has the same weakness, since subtracting the known period from the release date recovers the collection date. A better approach may be a randomized embargo period, or a single release of all the collected data at one set time.

(b) Usually, when online experiments are conducted, the data is not made available until the experiment has finished running, so that no potential participant can look at the results and be influenced by them.

(c) Some researchers may not wish to release their data until they have published, but would be happy to release the data afterwards.
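To make the randomized embargo in (a) concrete, here is a hypothetical Python sketch of a release schedule: a fixed base period plus a random per-record offset, so an observer who knows when a session ran cannot match it to a released record by date arithmetic alone. The base and jitter values are arbitrary illustrations, not recommendations.

```python
import random
from datetime import date, timedelta

def release_date(collected_on, base_days=180, jitter_days=90, rng=random):
    """Return the date on which a collected record becomes public.

    base_days guarantees a minimum embargo (protecting the running
    experiment); the random jitter prevents anyone from recovering the
    collection date by subtracting a known embargo length.
    """
    return collected_on + timedelta(days=base_days + rng.randint(0, jitter_days))
```

The offset would need to be drawn once and stored with the record, so that repeated queries cannot narrow it down.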

The online-experiment-runner could, of course, be made open source, as could the experiments that run on it, but these are separate issues.

Does my account include problems that don’t exist in practice? Are there places where things are actually more complicated than I’ve sketched out here? Are there examples of open data collected on the internet? Do you know of other references on the ethics and practice of making human subjects data available in various contexts (for psychology or otherwise)?

Acknowledgements: thanks to Terry Stewart for many illuminating conversations on open models and open modelling, and to the folks at ISPOC for their model repository. Thanks also to Michael Nielsen, Greg Wilson, Jon Udell, Andre and Carlene Paquette, Jim McGinley, and James Redekop for stimulating questions and feedback.


DeAngelis, T. (2004). ‘Data sharing: a different animal’. APA Monitor on Psychology, February 2004, 35(2).

Plant, RR & Turner, G. (2009). Millisecond precision psychological research in a world of commodity computers: New hardware, new problems? Behavior Research Methods, 41, 598-614. doi: 10.3758/BRM.41.3.598. [Thanks to Mike Lawrence for pointing me at this]


Tools for Psychology and Neuroscience

Open source tools make new options available for designing experiments, doing analysis, and writing papers. Already, we can see hardware becoming available for low-cost experimentation. There is an OpenEEG project. There are open source eye tracking tools for webcams. Stimulus packages like VisionEgg can be used to collect reaction times or to send precise timing signals to fMRI scanners. Neurolens is a free functional neuroimage analysis tool.

Cheaper hardware and software make it easier for students to practice techniques in undergraduate labs, and easier for graduate students to try new ideas that might otherwise be cost-prohibitive.

Results can be collected and annotated using personal wiki lab notebook programs like Garrett Lisi’s deferentialgeometry.org. Although some people, like Lisi, share their notebooks on the web (a practice known as open notebook science), it is not necessary to share wiki notebooks with anyone to receive substantial benefit from them. Wiki notebooks are an aid to the working researcher because they can be used to record methods, references and stimuli in much more detail than the published paper can afford. Lab notebooks, significantly, can include pointers to all of the raw data, together with each transformation along the chain of data provenance. This inspires trust in the analysis, and makes replication easier. Lab notebooks can also be a place to make a record of the commands that were used to generate tables and graphs in languages like R.

R is an open source statistics package. It is scriptable, and can be used in place of SPSS (Revelle, 2008; Baron & Li, 2007). It is multi-platform, can be freely shared with collaborators, and can import and export data in a CSV form that is readable by other statistics packages, spreadsheets, and graphing packages.
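The interchange point is easy to illustrate. Below is a sketch (in Python rather than R, to underline that CSV is tool-neutral) of writing trial data as plain CSV that R’s `read.csv`, SPSS, or a spreadsheet could then import; the field names and values are hypothetical.

```python
import csv
import io

# Hypothetical trial records as they might come out of an experiment runner.
trials = [
    {"subject": 1, "condition": "congruent", "rt_ms": 412},
    {"subject": 1, "condition": "incongruent", "rt_ms": 455},
]

def to_csv(rows):
    """Serialize trial records as plain CSV: a header row of field names,
    then one row per trial, readable by essentially any analysis tool."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

On the R side, this same file would load with a one-liner such as `read.csv("trials.csv")`, which is what makes CSV a sensible lowest common denominator for shared raw data.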

R code can be embedded directly into a LaTeX or OpenOffice document using a utility called Sweave. Sweave can be used with LaTeX to automatically format documents in APA style (Zahn, 2008). With Sweave, when you see a graph or table in a paper, it is always up to date: it is regenerated from the original R code each time the PDF is produced. Including the Sweave source along with the PDF becomes a form of reproducible research, rooted in Donald Knuth’s idea of literate programming. When you want to know in detail how the analysis was done, you need look no further than the source text of the paper itself.
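For readers who have not seen one, here is a minimal sketch of what a Sweave file looks like. The file name and chunk contents are hypothetical; the chunk delimiters `<<…>>=` and `@`, and the inline `\Sexpr{}` macro, are standard Sweave syntax.

```latex
\documentclass{article}
\begin{document}

% Inline result: recomputed from the data every time the PDF is built.
The mean reaction time was \Sexpr{round(mean(rt), 1)} ms.

% A code chunk: echo=FALSE hides the code, fig=TRUE inserts the plot.
<<echo=FALSE, fig=TRUE>>=
rt <- read.csv("reaction_times.csv")$rt_ms
hist(rt, main = "Reaction times")
@

\end{document}
```

Running this through Sweave produces an ordinary LaTeX file with the computed number and figure filled in, which is then compiled to PDF as usual.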


Baron, J. & Li, Y. (9 Nov 2007). ‘Notes on the use of R for psychology experiments and questionnaires.’

Revelle, W. (25 May 2008). ‘Using R for Psychological Research. A simple guide to an elegant package.’ http://www.personality-project.org/R/

Zahn, Ista. (2008). ‘Learning to Sweave in APA Style.’ The PracTeX Journal. http://www.tug.org/pracjourn/2008-1/zahn/