Detect cheating in Microsoft Excel submissions
Figure out a way to generate data sets and results programmatically, so you can verify each student's assignment individually. This means a lot of effort up front, but it may pay off for later courses.
Use a single dataset and a single result set. Ask students to explain their reasoning for the most complex tasks of the assignment, and use that to check for cheating. No two people think exactly the same way or describe something complex in the same manner.
My recommendation is the latter. Most jobs that can be done with tools like Excel will be automated in the near future. The best thing your students can learn is to reason (and articulate that reasoning effectively) and to choose the appropriate tool or technique for the job.
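That said, if you do go the programmatic route, a minimal sketch of the idea: seed a random generator with each student's ID, so every dataset is unique but exactly reproducible at grading time (the two-column layout here is just an illustration, not a real assignment):

```python
import random

def make_dataset(student_id, n_rows=50):
    """Deterministically generate a per-student data set from the student's ID.

    Seeding with the ID means the same data (and thus the same expected
    results) can be regenerated at grading time without storing anything.
    """
    rng = random.Random(student_id)  # same ID -> same data, every run
    return [(i, round(rng.uniform(10.0, 100.0), 2)) for i in range(n_rows)]
```

You would then export each list to a worksheet and compute the expected answers from the same seed.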
Oh, hey, I've written scripts to deal with this problem, on a similar scale (about 800 students each semester). We create new "starting point" workbooks each semester.
The first step is to check the metadata. This can be automatically extracted by the command-line tool exiftool, the Python package openpyxl, or by just parsing the XML.
Then, if the creation date is older than your provided file, you know they downloaded a file from a previous semester. On the other hand, if the creation date is newer and the creator is the student rather than you, that means they created a new workbook from scratch and copied & pasted the contents in. There is little reason to do this unless they are trying to cover up cheating (but of course the student should get a chance to explain themselves).
Then there is the modified date—you can check that this is in a reasonable range (the days before the due date)—and lastModifiedBy. In our case, students turn in about a dozen workbooks over the semester, so it's worth it to keep track of their username and flag differences. If you're just getting one workbook, that might not make sense. I also check if the username seems to match the student's name; I have some automated checks which approve most of them, and prompt me for the remainder. If the name in the metadata is clearly a different person's name, we ask the student why that is.
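If you'd rather not depend on exiftool or openpyxl, those fields can be read straight out of the zip with the standard library. A sketch (the element names are the standard OOXML core-properties ones):

```python
import zipfile
from xml.etree import ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

def core_properties(xlsx):
    """Pull creation/modification metadata straight out of docProps/core.xml."""
    with zipfile.ZipFile(xlsx) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))

    def text(tag):
        el = root.find(tag, NS)
        return None if el is None else el.text

    return {
        "creator": text("dc:creator"),
        "created": text("dcterms:created"),
        "modified": text("dcterms:modified"),
        "lastModifiedBy": text("cp:lastModifiedBy"),
    }
```

Comparing "created" against the date you published the starting file, and "creator" against your own name, covers the two red flags above.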
Another thing which can be found is links. It's possible in Excel to refer to values in a different file, which creates a "link". When the workbook is uploaded without the corresponding linked file, the values can't be updated. Students often inadvertently create links when copying and pasting. The name of the file they copied from is stored in the workbook, which might be enough to make it clear it's from an illicit source. You can detect the presence of links by listing the files in the zip (every .xlsx file is really a zip file): if 'xl/externalLinks/externalLink1.xml' is present, then there are links.
I find the target of the link using openpyxl: in the workbook object, all links are in workbook._external_links. Given a link object link, the filename is in link.file_link.Target. If you want to do it at a lower level, the name of the linked file is in 'xl/externalLinks/_rels/externalLink1.xml.rels'.
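The low-level route can be scripted with just the standard library. This sketch scans every .rels file under xl/externalLinks/ and collects the Target attributes, which is where the name of the source workbook ends up:

```python
import zipfile
from xml.etree import ElementTree as ET

def external_link_targets(xlsx):
    """Return the Target of every external link stored in an .xlsx file.

    Works at the zip/XML level, so it needs no Excel libraries.
    """
    targets = []
    with zipfile.ZipFile(xlsx) as zf:
        for name in zf.namelist():
            # each externalLinkN.xml has a .rels file naming the linked workbook
            if name.startswith("xl/externalLinks/_rels/") and name.endswith(".rels"):
                root = ET.fromstring(zf.read(name))
                for rel in root:
                    targets.append(rel.get("Target"))
    return targets
```

A target like "Solutions_Spring2021.xlsx" speaks for itself.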
I also compute a hash of each file in order to find exact duplicates (generally only students who re-upload the starting file).
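The exact-duplicate pass is a few lines; this sketch groups files by SHA-256 and reports any group with more than one member:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def exact_duplicate_groups(paths):
    """Group submissions whose bytes are identical.

    Returns a list of groups (each a list of paths) containing
    more than one file -- byte-for-byte copies of each other.
    """
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[hashlib.sha256(Path(p).read_bytes()).hexdigest()].append(p)
    return [group for group in by_hash.values() if len(group) > 1]
```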
A more time-consuming step is to extract all the cell contents (including formulas). A hash of those will find duplicates while ignoring metadata, formatting and colors, graphs, filters, and pivot tables; it takes only the actual cell contents into account. (So if the main thing your students are doing is creating a pivot table, this won't be helpful.)
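The hashing step itself is simple once the contents are extracted. A sketch, assuming you've already pulled the cell values into plain Python lists (openpyxl's ws.iter_rows() yields cells whose .value is the formula string when the workbook is loaded with defaults):

```python
import hashlib

def content_fingerprint(sheets):
    """Hash a workbook's cell contents, ignoring formatting entirely.

    `sheets` maps sheet name -> list of rows, each row a list of
    cell values/formulas. Identical contents -> identical fingerprint.
    """
    h = hashlib.sha256()
    for name in sorted(sheets):
        h.update(name.encode())
        for row in sheets[name]:
            h.update(repr(row).encode())
    return h.hexdigest()
```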
I also feed these cell contents to simhash to find "near-duplicates". I've used a command-line tool by Bart Massey as well as a Python package. Frankly, this doesn't seem very helpful most of the time. It has found suspicious similarities on several occasions, but it takes a lot of my time to check through the potential matches that come up.
I also find cell values that occur in a handful of workbooks (in, say, 2–12 of the 800 files). The assumption is that cell values which occur in more files than that are "natural" for many students to include, and that a copying ring would not be larger than that. I look at these and decide whether they're suspicious (two students making the typo "Parrt 1" seems normal; a longer phrase with identical misspellings is not).
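Counting which workbooks contain each value is a single dictionary pass. A sketch, with the 2–12 window from above as default thresholds:

```python
from collections import defaultdict

def rare_shared_values(workbook_cells, lo=2, hi=12):
    """Find cell values shared by an unusually small group of workbooks.

    workbook_cells: dict mapping workbook name -> iterable of cell values.
    Values appearing in between `lo` and `hi` workbooks are returned,
    along with the workbooks that contain them.
    """
    owners = defaultdict(set)
    for wb, cells in workbook_cells.items():
        for v in set(cells):  # count each workbook once per value
            owners[v].add(wb)
    return {v: sorted(ws) for v, ws in owners.items() if lo <= len(ws) <= hi}
```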
It does take a couple hours of my time each time a batch of workbooks comes in, and it's far from foolproof, but it tends to turn up several clear cases of cheating each time.
I assume you need clues which can be obtained automatically.
It might be an idea to check the positions and sizes of graphs, diagrams, drawings, etc. Those are usually placed by mouse, so the positions should differ from student to student.
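The positions are recoverable from the drawing XML inside the zip. A sketch (the element names follow the OOXML SpreadsheetDrawing schema; the offsets are in EMUs, so identical tuples across submissions are a strong signal):

```python
import zipfile
from xml.etree import ElementTree as ET

XDR = "{http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing}"

def drawing_anchors(xlsx):
    """Collect (col, colOff, row, rowOff) anchor tuples for every drawing."""
    anchors = []
    with zipfile.ZipFile(xlsx) as zf:
        for name in zf.namelist():
            if name.startswith("xl/drawings/drawing") and name.endswith(".xml"):
                root = ET.fromstring(zf.read(name))
                for frm in root.iter(XDR + "from"):
                    anchors.append(tuple(int(frm.find(XDR + t).text)
                                         for t in ("col", "colOff", "row", "rowOff")))
    return anchors
```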
If you want to be subtle, you might give them individual sheets to start from (perhaps containing the initial data set). Those files can be secretly marked in the underlying XML, e.g. by adding an extra tag per student (I did not test this, but I'm sure no one will detect it) or by adding a personalized "document creator" (but this is visible to the students). You just have to unzip the .xlsx file and modify docProps\core.xml.
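As the answer notes, this is untested; one way the marking step might look is to rewrite the zip and smuggle the marker in as an XML comment, which Excel should ignore when opening. Be aware that Excel may rewrite core.xml when the student saves, so the marker might not survive a round-trip:

```python
import zipfile

def tag_workbook(src, dst, student_id):
    """Copy an .xlsx, hiding a per-student marker as an XML comment
    in docProps/core.xml (openable as usual, but greppable later)."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "docProps/core.xml":
                marker = ("<!-- id:%s -->" % student_id).encode()
                data = data.replace(b"</cp:coreProperties>",
                                    marker + b"</cp:coreProperties>")
            zout.writestr(item, data)
```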
A pure file copy can easily be detected by calculating a cryptographic checksum (e.g. SHA-256) for each file. If two checksums are the same, the files are exact copies.