Our corpus is your corpus.

[ Here is our P3P corpus, be it your corpus as well. ]

Giving away your corpus (in empirical language analysis or elsewhere) is perhaps not yet an extremely established practice, but it's nothing original or strange either. And it makes a lot of sense! Stop limiting the impact of your research efforts! Stop wasting the time of your community members! Sharing corpora is one of the many good ideas of Research 2.0: see SSE'10 (and related events), eScience @ Microsoft, R2oSE, ...

Computer Science vs. Science

When you do academic CS research in programming- or software-development-related contexts, the culture of validation is these days such that you are often expected to provide online access to your program, application, library, tool, what have you, as an implementation or illustration. Various open-source repositories are used to this end--as a backend (a storage facility)--but author-hosted download locations are also widely used. In basic terms: if you write a paper, you include a URL. (There is one exception: if your work leverages Haskell, you can usually include the complete source code right in your paper so that one gets convenient access through copy and paste. Sorry for the silly joke.) Metadata-wise, common practice is nowhere near perfect, but it's perfect compared to what follows.

When you do empirical analysis in CS, which results in some statements or data about software/IT artifacts, the culture of validation is essentially that of science. In particular, reproducibility is the crucial requirement. You describe the methodology of your analysis in detail: you define your hypotheses, your input, your measurement techniques, your results (which you also interpret), your threats to validity, what have you. Downloads are not an integral part of science. What would you want to download anyway?

Message of this post?!

I suggest that various artifacts of an empirical analysis in CS in general, and of empirical language analysis in particular, qualify as valuable downloads. In this post, I want to call out the corpora (as in corpora of source projects, buildable projects, built projects, runnable projects, run projects, demos, etc.).

Beyond reproducibility in CS

What's indeed not yet commonplace (if ever done) is that convenient access is provided to the corpora underlying empirical analyses. Consider for example Baxter et al.'s paper on structural properties of Java software, or Cranor et al.'s paper on P3P deployment. These are two seminal papers in their fields. I would lose a night of sleep over each of the two corpora.
Wouldn't it be helpful for researchers if such corpora were made available for one-click download, including useful metadata and potentially even tooling? Let's suppose such convenient access became a best practice. First, reproducibility would be improved. Second, derived research would be simplified. Third, incentives for collaboration would be added.
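
To make "useful metadata" a bit more concrete: a shared corpus could ship with a small machine-readable manifest describing how it was collected and what it contains. The following is purely a hypothetical sketch; every field name and value here is invented for illustration, not taken from any existing corpus or standard:

```
{
  "name": "example-policy-corpus",
  "version": "2010-05",
  "description": "privacy policies collected from publicly listed web sites",
  "license": "see LICENSE; intended for research use",
  "collection": {
    "method": "crawl of well-known policy locations",
    "date": "2010-05"
  },
  "contents": [
    { "path": "policies/", "format": "XML" }
  ],
  "tooling": [
    { "path": "tools/validate", "purpose": "schema validation" }
  ]
}
```

Even such a minimal manifest would let a downstream researcher judge, before downloading, whether the corpus fits their study.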

I contend that convenient access adds little pain for the original author, but adds huge value for the scientific community. Why should we have to execute the description of some corpus from some paper, at substantial cost to us, when the corpus could easily be shared by the primary author? Why should we work hard to "reproduce the corpus" when a little help from the original authors would make reproducibility (of the corpus, and perhaps of most of the research work) a charm?

Naysayers -- get lost

I can think of many reasons why 'convenient access' is not getting off the ground. Here are a few obvious ones:
  • "It's extra work, even if it is little extra work." This problem can be solved if incentives are created. For instance, publications on empirical analysis with 'convenient access' to the corpus could be rated higher than those w/o. Also, just like tool papers in many venues, there could be corpus papers.
  • "There is sufficient, inconvenient access available already." For at least one of the two examples above, I fully understand how I could go about gathering the corpus myself, but I have not executed this plan, even though I could really use this corpus in an ongoing research activity. It's just too much work for me. I am effectively prevented from benefiting from the authors' research beyond their immediate results.
  • "Provision of convenient access is too difficult." Think of a corpus of Java programs. Suddenly, an access provider gets into the business of configuration management. After all, convenience would imply that the corpus builds and runs out of the box. I think the short-term answer is that access to the corpus w/o extra "out-of-the-box" magic is still more convenient than no access. The long-term answer is that we may need a notion of remote access to corpora, where I can give you access to my corpus in my environment, through appropriate web-based or service-oriented interfaces.
  • "Convenient access gives a head start to the competition." I refuse to believe that this is really that relevant in academic practice. For instance, I am sure that the research groups behind the above-mentioned papers have no "corpus monopoly" in mind. I have not done much work on empirical analysis, but I have experience with papers that "give away details", and I must say that the papers which give away the most typically coincide with those which have the highest impact in all possible ways.
  • "There is copyright & Co. in the way." Yes, there is. This is a serious problem, and we had better focus on solving it soon if we want to get anywhere with science and (IT) society in this age. This post would just explode if I tried to comment on that issue here. There are many good ideas around on this issue, and we all understand that some amount of sharing works even now, in this very imperfect world as we have it. If you are pro-Research 2.0, don't get bogged down by this red herring.
Well, I can think of quite a number of other reasons, but I reckon that all the usual suspects have been named, and everything else can perhaps be delegated to some discussion on this blog or elsewhere.

Ralf Lämmel

PS: CS has reached an age at which empirical research is becoming viral and vital. I am grateful for talking to Jean-Marie occasionally, with his lucid vision of Research 2.0 and linguistics for software languages---two topics that are strongly connected. Empirical analysis of software languages has got to be an integral part of software language linguistics. Specialized software-engineering conferences like SLE, ICPC and MSR, and even big ones like ICSE or OOPSLA, have included empirical research for a while now.


An ambitious course on programming paradigms, semantics, and types

We have completed this course.
I hope others find the design or some of the material helpful.
See more about reuse below.

- Parsing and interpretation in Prolog
- Basics of small-step and big-step semantics
- Basics of untyped and typed lambda calculi
- Introduction to Haskell
- Basics of denotational semantics
- Denotational semantics in Haskell
- Basics of static program analysis
- Static program analysis in Haskell
- OO programming in Haskell
- The Expression Problem
- Basics of Constraint-Logic Programming
- Basics of Process Algebra (CCS)
- ... a few more specialized lectures

- English as the course language
- Slides, videos, exercises available online publicly
- 42 hours (45mins each) of lectures over 4 months
- 12 programming-oriented, interactive labs
- Transparent scheme for midterm and final exam
- Heavy reuse of material from other courses
- Use of Twitter for notification and aggregation

This course is my only chance to tell many students at my department about semantics, type systems, bits of interesting functional programming, declarative language processors, and sophisticated declarative methods such as process algebra or constraint-logic programming. Luckily, there was a similar module in our curriculum, but I admit that I had to hijack it a bit. (That is, the module is designed for the last semester of the Bachelor's, whereas the actual course is probably meant for the Master's.) The students have complained in a friendly manner that the course is very heavy, but in return, I have designed the exam to be feasible for Bachelor's students. So everyone is happy: me, because I could cover loads of material and topics; the students, because they receive unusual help in preparing for the exam.

As to the content, students are relatively happy with the techniques that we studied---in particular because we run the course in such a pragmatic mode, where we basically leverage Prolog and Haskell throughout, for every bit of type systems, semantics, and programming concepts. In this manner, the CS students can improve their programming skills in paradigms and domains to which they were not used before. I also plead guilty: there was no way of selling the formal details of many topics. (For instance, an attempt to do soundness proofs almost triggered a riot.) As to the use of Twitter, it seems that not enough students were keen to use it; so we had to continue using a newsgroup in parallel.
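
The pragmatic mode mentioned above, encoding semantics directly as programs rather than proving things about them, can be sketched in a few lines. The course uses Prolog and Haskell for this; the following is only my language-neutral Python sketch of the idea (toy syntax, all constructor names invented), giving a big-step (natural) semantics for a tiny expression language:

```python
# A big-step (natural) semantics encoded directly as an interpreter:
# one clause per syntactic construct, mapping an expression and an
# environment to a value. Toy syntax; not actual course material.

def eval_expr(expr, env):
    tag = expr[0]
    if tag == "lit":                      # ("lit", n): a literal
        return expr[1]
    if tag == "var":                      # ("var", name): variable lookup
        return env[expr[1]]
    if tag == "add":                      # ("add", e1, e2): addition
        return eval_expr(expr[1], env) + eval_expr(expr[2], env)
    if tag == "let":                      # ("let", name, rhs, body): binding
        _, name, rhs, body = expr
        return eval_expr(body, {**env, name: eval_expr(rhs, env)})
    raise ValueError("unknown construct: %r" % (tag,))

# let x = 3 in x + 4
program = ("let", "x", ("lit", 3), ("add", ("var", "x"), ("lit", 4)))
print(eval_expr(program, {}))  # prints 7
```

The payoff of this style, in the course and in general, is that every semantic definition is immediately testable against example programs.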

But again, I am confident that the design of such a course---broad coverage, low formal depth, but operational depth based on declarative programming, and a transparent exam---is an effective way of exposing Bachelor's students (possibly some of them future Master's students) to critically important foundations of computer science and modern programming techniques.

The future
I will mature the course next year. I will be happy if others reuse the design of the course to some extent, or any specific material. Please note that more than half of the material is directly based on other people's work. I probably used 3 resources heavily, and about another 7 for some bits. Each non-original slide has credits in its header area. If you want to reuse anything, please let me know.

Vadim Zaytsev did an excellent job running the lab and helping me prepare the exam at the level of quality that we had in mind. It is good to have such a committed (emerging) declarative programmer next to you for a course like this! As to the reused material, detailed credits are online and on the slides, but I want to emphasize a few sources because of the degree of reuse and my admiration for their work: Hanne Riis Nielson and Flemming Nielson (for their book "Semantics with Applications" with slides); Graham Hutton (for his wonderful introduction to Haskell); Jaakko Järvi (for the slides of his mainly TAPL-based programming course).



Empirical Language Analysis

We have been trying to understand the language P3P. OK, it's pretty simple to say what it is: a small (domain-specific, non-executable) language for privacy policies, used by web sites and checked by user agents (potentially as part of browsers). Arguably, understanding is limited if you look at online samples and the syntax definition alone.
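
For readers who have never seen P3P: a policy is an XML document that declares, per statement, what data is collected, for what purpose, for whom, and for how long. The fragment below is a minimal illustration; the element names follow the W3C P3P 1.0 vocabulary as far as I recall it, while the concrete values are invented:

```
<POLICIES xmlns="http://www.w3.org/2002/01/P3Pv1">
  <POLICY name="sample" discuri="http://example.com/privacy.html">
    <ENTITY>
      <DATA-GROUP>
        <DATA ref="#business.name">Example Inc.</DATA>
      </DATA-GROUP>
    </ENTITY>
    <ACCESS><nonident/></ACCESS>
    <STATEMENT>
      <PURPOSE><admin/><develop/></PURPOSE>
      <RECIPIENT><ours/></RECIPIENT>
      <RETENTION><stated-purpose/></RETENTION>
      <DATA-GROUP>
        <DATA ref="#dynamic.clickstream"/>
      </DATA-GROUP>
    </STATEMENT>
  </POLICY>
</POLICIES>
```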

So we figured we had to
understand usage of the language
in order to understand the language.

Here is the link to the paper:
Joint work with Ekaterina Pek

  • Software Language Engineering
  • Domain-Specific Languages
  • Empirical Analysis
  • Policy Languages
  • P3P
Abstract: P3P is the policy language with which web sites declare the intended use of data that is collected about users of the site. We have systematically collected P3P-based privacy policies from web sites listed in the Google directory, and analysed the resulting corpus with regard to metrics and cloning of policies, adherence to constraints, coverage of the P3P language, and a number of additional properties of language usage. This effort helps in understanding the de facto usage of the non-executable, domain-specific language P3P. Some elements of our methodology may be similarly useful for other software languages.
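
One of the analyses named in the abstract, coverage of the P3P language, can be sketched as a simple corpus computation: which element names of the language's vocabulary actually occur somewhere in the corpus? This is not the paper's actual implementation, just a minimal illustration with a hypothetical six-element vocabulary and two toy "policies" inlined as strings:

```python
import xml.etree.ElementTree as ET

def used_elements(policy_xml):
    """Collect the local element names occurring in one policy."""
    root = ET.fromstring(policy_xml)
    # strip any namespace prefix of the form {uri}tag
    return {el.tag.split('}')[-1] for el in root.iter()}

def coverage(corpus, vocabulary):
    """Fraction of the language's vocabulary exercised by the corpus."""
    used = set()
    for policy in corpus:
        used |= used_elements(policy)
    return len(used & vocabulary) / len(vocabulary)

# Hypothetical tiny vocabulary and toy policies, for illustration only:
VOCAB = {"POLICY", "STATEMENT", "PURPOSE", "RECIPIENT", "RETENTION", "DATA"}
corpus = [
    "<POLICY><STATEMENT><PURPOSE/><DATA/></STATEMENT></POLICY>",
    "<POLICY><STATEMENT><RECIPIENT/></STATEMENT></POLICY>",
]
print(coverage(corpus, VOCAB))  # 5 of the 6 element kinds are used
```

The same skeleton extends naturally to the other per-corpus measurements mentioned in the abstract, such as counting element frequencies instead of mere occurrence.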