Counting Users 2012

History

For various purposes it would be very helpful to have a solid estimate of how many users the facilities have in total, how many use a single facility or several different ones, and how many perform experiments at both neutron and photon sources. Last year's user survey turned out to be remarkably successful. Though its scope was extremely limited, it provided quite a bit of information about users shared between photon and neutron facilities. The results confirmed the general impression that the complementary use of instruments at photon and neutron facilities is becoming increasingly popular. The survey showed that so many scientists use both photons and neutrons that a common data and user management infrastructure would be greatly beneficial from the users' as well as the facilities' point of view. Details on the first user survey can be found on the PaNdata wiki.

Scope of the 2012 user survey

Since the last survey turned out to be really useful, we decided to extend the scope slightly to include some more information, while still keeping everything strictly anonymous. This time we would like to include, in addition to an anonymized ID, a binned age, the country of the user's home institution, the gender and the number of visits. None of these is mandatory for participating in the survey, but naturally the more facilities provide the information, the more useful and conclusive the results will be. The survey is designed not to release any personal information at all.

How to count

To keep the users' email addresses confidential, we will collect irreversible hashes (sha256) of the email addresses of all users active during the period 01/06/2010-31/05/2012, where an active user is anyone (incl. facility staff) who was:
  • on a proposal submitted during that period, including rejected proposals and proposals currently under review ... or
  • on a proposal submitted prior to 06/2010 that grants beamtime during that period ... or
  • on a beamtime application ... or
  • visiting the facility as a user
The definition of an active user does not have to be absolutely identical for each facility, in case a user management system cannot match the criteria exactly. For example, some facilities might not distinguish between proposal and beamtime application. In the end, the criterion should be somewhat stricter than merely having an email address in the database. The procedure is then simply:
  • create a list of email address, age bin, gender, country of the user's home institution and number of visits for the active users of your facility. Valid examples:
    • my.user@gmail.com 31-40 m France 0
    • my.user@gmail.com 1970 France female 7
    • The email address has to be the first entry; the order and format of the remaining entries are free.
    • Age is binned as 0-20, 21-30, 31-40 etc. A birth year is also fine.
    • Leave empty any entries you can't or don't want to supply.
  • convert the email addresses to lower case (if they are not already)
  • calculate the sha256 hash for each email address and compile the hashes into a single file
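If your user database stores birth years and you would rather submit binned ages, the binning described above can be sketched as a small shell function. The function name age_bin and the reference year 2012 are assumptions made for this illustration; neither is prescribed by the survey.

```shell
#!/bin/bash
# Hypothetical helper: map a birth year to the survey's age bins
# (0-20, 21-30, 31-40, ...).
# $1: birth year, $2: reference year (assumed default: 2012)
age_bin() {
    age=$(( ${2:-2012} - $1 ))
    if [ "$age" -le 20 ]; then
        echo "0-20"
    else
        # Bins above 20 start at 21, 31, 41, ... and span ten years each.
        lo=$(( (age - 1) / 10 * 10 + 1 ))
        echo "${lo}-$(( lo + 9 ))"
    fi
}

age_bin 1970    # prints 41-50 (age 42 in 2012)
age_bin 1981    # prints 31-40 (age 31 in 2012)
```

Since a birth year is accepted as well, this step is optional; it only reduces the detail you pass on.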

How to hash

Identical email addresses will result in identical sha256 hashes regardless of where and how they are calculated. This allows the hashes to be matched and the basic information mentioned above to be extracted without ever knowing the email addresses of other facilities' users. The only potential pitfall is the (undesired) inclusion of a newline character at the end of an email address. Please make sure NOT to include the newline. A simple command can be used to convert a file with email addresses into hashes. Let's assume your list is contained in a file named facility.users and looks like this:
my.user@gmail.com 1960 UK m 0
m2.user@yahoo.de   1970 France f 7

Anonymize the file with a command like the following (Note: it has been corrected to work with different bash/awk/shell combinations!):

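#!/bin/bash
# For each line: lower-case it, take the first field (the email address), hash it
# without a trailing newline, then append the remaining fields of the original line.
while read line; do echo -n $line | tr '[A-Z]' '[a-z]' | echo -n `awk '{printf "%s",$1}' | \
        sha256sum` && echo $line | cut -d ' ' -f2- ; done < facility.users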

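results in the following (the "-" in front of the second entry is the placeholder that sha256sum prints in place of a file name when it reads from standard input):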

65b0f00fdc5588617be703e0affff12915182e65e588faba390608136cbe68a4 -1960 UK m 0
1456e09de84f149b40505265a613c13907a5983d8b9a00b0007f8cc0673bdffb -1970 France f 7
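
Before sending the file, it may be worth checking that every line really starts with a well-formed hash and that no plain email address slipped through. The following is a minimal sketch; the function name check_hashes is an assumption made for this example.

```shell
#!/bin/bash
# Hypothetical check: report every non-empty line whose first field is not a
# 64-character lower-case hexadecimal string (the shape of a sha256 hash).
# Returns a non-zero exit status if at least one such line is found.
check_hashes() {
    awk 'NF == 0 { next }
         $1 !~ /^[0-9a-f]+$/ || length($1) != 64 {
             printf "line %d: unexpected first field: %s\n", NR, $1
             bad = 1
         }
         END { exit bad }' "$1"
}
```

Called on the output file, for example as check_hashes facility.hashes (a hypothetical file name), it prints nothing and returns success when all lines look fine.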

Encodings and white space can cause trouble. If the list was created on a Mac or Windows machine, please convert the file to a plain ASCII text file before converting the emails to hashes. Otherwise, Unicode-encoded spaces, for example, might become part of the email address and spoil the hashes:

dos2unix file    # for a file created on Windows
mac2unix file    # for a file created on a Mac
# replace all non-ASCII characters in input.file with spaces and write the result to output.file
iconv -f utf-8 -t iso-8859-1 input.file | perl -pe 's/[^[:ascii:]]/ /g' > output.file
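
To see whether a file needs such a conversion at all, one can first look for bytes outside the printable ASCII range. This is a minimal sketch, assuming a grep that supports the -a option (GNU and BSD grep do); the function name find_odd_bytes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list (with line numbers) all lines containing a byte
# that is not printable ASCII (space to tilde). This flags carriage returns
# from Windows files, tabs, and multi-byte characters such as Unicode spaces.
# Exit status follows grep: 0 if suspicious lines were found, 1 if none.
find_odd_bytes() {
    LC_ALL=C grep -a -n '[^ -~]' "$1"
}
```

If the function prints nothing, the file consists of plain printable ASCII lines and can be hashed as it is.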
There are similar commands/tools available for all kinds of scripting and database query languages. To verify your procedure you could check the hash of dummy.user@dummy.org:
[~]$ echo -n dummy.user@dummy.org | sha256sum
5e6562e6f962a26039d713f2c362aa5d228f094110441a2a6996f06f37ddb11d  -
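
The effect of the trailing newline mentioned above can be made visible by hashing the same address with and without it. The following is only a sketch; the function name hash_email is an assumption, and printf '%s' is used in place of echo -n because it never appends a newline.

```shell
#!/bin/bash
# Hypothetical helper: lower-case an address and print only its sha256 hash.
hash_email() {
    printf '%s' "$1" | tr 'A-Z' 'a-z' | sha256sum | awk '{print $1}'
}

# Correct: no trailing newline is hashed.
good=$(hash_email 'dummy.user@dummy.org')
# Pitfall: plain echo appends a newline, which becomes part of the hashed data.
bad=$(echo 'dummy.user@dummy.org' | sha256sum | awk '{print $1}')

echo "without newline: $good"
echo "with newline:    $bad"
[ "$good" != "$bad" ] && echo "the two hashes differ"
```

Only the hash computed without the newline will match the hashes produced by other facilities for the same address.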

What to do with the hashes

Once you've created a file with the hashes of the email addresses, please send it to frank.schluenzen@desy.de, preferably by 15.09.2012. We will collect all the hashes and produce some basic statistics. The statistics will of course be openly available, and all hashes will be available to everyone who provided an (anonymous) list of users, so that you can extract any additional information with your own tools of choice or, for example, combine figures for differently assembled consortia. Hashes will be scrambled to make it absolutely impossible to retrieve any other information, in particular the email addresses, from the files. This way all facilities will have exactly the same information at hand, which seems only fair. All stats and the (scrambled) hashes for the 2011 survey are available from the PaNdata wiki.
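
As an illustration of what can be done with the collected files, the number of users common to two facilities can be counted by intersecting the hash columns. This is only a sketch: the function name count_shared is hypothetical, and it assumes that the hash is the first whitespace-separated entry of each line.

```shell
#!/bin/bash
# Hypothetical helper: count the hashes that occur in both files.
# $1, $2: files whose first field on each line is a hash
count_shared() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Keep only the hash column and remove duplicates within each file.
    awk 'NF { print $1 }' "$1" | sort -u > "$a"
    awk 'NF { print $1 }' "$2" | sort -u > "$b"
    # comm -12 prints only the lines present in both sorted inputs.
    comm -12 "$a" "$b" | wc -l | tr -d ' '
    rm -f "$a" "$b"
}
```

For two hypothetical files, count_shared neutron.hashes photon.hashes would print the number of users that appear in both lists.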

Naturally, the more facilities participate, the more meaningful the statistics will be, so your participation in this survey will be highly appreciated.

Note: If you collect users for different facilities (e.g. you manage users of both a neutron and a photon source), please keep the lists of users separate for the different facilities and make sure that users using both facilities appear in both lists! If you manage users of both a synchrotron and a free-electron laser, it would be interesting to separate these as well, but it is not essential if it's too much of an effort.