Gathering decent-sized economic history datasets requires learning by doing and making good choices about the tools of the trade. It’s a craft.
Over time, as your skills improve and as the technology itself becomes better and smoother, new challenges are much more easily handled, and what you learn can be rolled out, adapted, and improved as you tackle more ‘difficult’ datasets.
I’ve converged on a process of data capture which is simple in outline, applied to various datasets to date, and which I continually adapt. I illustrate it in the image above, for the 18th century Irish fiscal dataset previewed here, but it holds good for much else I’ve tackled.
The starting point of course is the source material, and what you can learn about it substantively, from the literatures, and from the people who write them. In this case, we’re dealing with the Journals of the Irish House of Commons, printed in various editions and more or less available in digital versions, but without clean access to the tabular numeric data. To get at that data in a structured way, I used a mix of direct keying-in and OCR (Optical Character Recognition) technology, which can work pretty well in specific domains.
In this case, the domain for OCR was just the numeric data: the text of the categories in the tables was just as efficiently typed in directly, since it often remained structurally similar from year to year, and so usually required only basic rearrangement as I worked through the century.
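One advantage of restricting OCR to the numeric columns is that the output is easy to validate afterwards. Here is a minimal Python sketch of that idea, with hypothetical OCR output, which flags any line that doesn’t parse as a plausible pounds-shillings-pence row:

```python
import re

# Hypothetical OCR output lines: "pounds shillings pence" columns
ocr_lines = ["1250 13 4", "31O 6 8", "74 19 11"]

# A valid line is three integers; shillings must be < 20, pence < 12
pattern = re.compile(r"^(\d+) (\d{1,2}) (\d{1,2})$")

def flag_suspect(line):
    m = pattern.match(line)
    if not m:
        return True  # non-digit characters, e.g. OCR reading 0 as the letter O
    pounds, shillings, pence = map(int, m.groups())
    return shillings >= 20 or pence >= 12

suspects = [line for line in ocr_lines if flag_suspect(line)]
# suspects == ["31O 6 8"]  (the letter O has crept in for a zero)
```

Anything flagged goes back for a second look against the page image; the clean lines can flow straight into the spreadsheet.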
Other options can include outsourcing the tedious task of core data-entry, but my focus here is on DIY efforts: much early-stage work doesn’t have even the (now) modest funds needed to buy in transcription services. Also, there’s something very right about really working through every data point yourself, at least for a while.
The two other core elements of the process are Excel and MySQL.
Excel, for all its limitations, is a superb workhorse for data gathering, and I find that its advanced features can be used—up to a point—to add a lot of value to the work. Things as basic but useful as graphing data or cross-checking totals can be done quickly, all in the one package.
For historical quantitative data, which often comes with a variety of denominations (of currencies, quantities, and prices) and with concerns over provenance, quality, and fidelity to the source, Excel can, up to certain limits, be adapted to deal with the complications, running on a standard machine.
This image shows the main ‘checking’ Excel worksheet of the 18th century Irish Treasury Account from the Journals, rather than the underlying data. View even gorier details by clicking on the image.
This Excel worksheet uses the ‘names’ feature, whereby you can ‘name’ areas, in this case, such as the columns for pounds, shillings, pence, and fractions in the underlying main data worksheet—and refer to those areas by name in formulas, rather than using the default cell references. This can make formulas easier to interpret, and thus less error-prone—it’s not great if the formulas you write to check errors are themselves flaky.
The Excel worksheet in the image checks that the totals of the underlying individual revenue and expenditure items in each year, usually a few hundred, add up to the grand totals printed in the original account, and also that the total money ‘in’ and the total money ‘out’ are equal to each other, as should be the case in this sort of essentially cash account. The checking is to the fraction. The red cells use the ‘conditional formatting’ feature in Excel, whereby the appearance of a cell can be made to depend on its value—in this case, any non-zero discrepancy gets bold white text on a dramatic red background.
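The same fraction-level check can be sketched outside Excel. Here is a minimal Python version, with hypothetical item values, that converts pounds, shillings, and pence (20 shillings to the pound, 12 pence to the shilling) into pence before comparing a computed total against the printed grand total:

```python
from fractions import Fraction

def to_pence(pounds, shillings, pence, fraction=Fraction(0)):
    """Convert an old-style £sd amount into pence, keeping the fraction exact."""
    return pounds * 240 + shillings * 12 + pence + fraction

# Hypothetical revenue items for one year: (pounds, shillings, pence, fraction)
items = [
    (1250, 13, 4, Fraction(1, 2)),
    (310, 6, 8, Fraction(0)),
]

# Grand total as printed in the (hypothetical) original account
printed_total = to_pence(1561, 0, 0, Fraction(1, 2))

computed_total = sum(to_pence(*item) for item in items)

# A non-zero discrepancy flags a data-entry or printing error
discrepancy = computed_total - printed_total
# discrepancy == 0 here, so this year's items reconcile
```

Using exact fractions rather than floating point keeps the check honest down to the last farthing, which is exactly the standard the worksheet works to.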
The resulting blotches of red look a bit problematic at first, but most of the errors in this version were under £1, and in some cases a few pennies. I hope that few, if any, of the errors which remain are data-entry errors on my part. The resolution of some digitised images was not great, so that may be a factor, and at least some remaining errors are probably printing errors in the originals. But at least a lot of data-entry errors have been caught. I hope.
None of this is rocket science, but it can provide real quality assurance for challenging data.
However, I’ve found that using some of the more advanced features of Excel, such as its database capabilities, can run into memory/speed problems: even some of the basic formulas in the sheet pictured above can take an annoyingly long time to recalculate, after one data item has been edited, even for a modestly-sized sheet (e.g., a few thousand rows) on a half-decent machine.
The great leap forward in structured data capture and analysis comes when you move to a proper relational database environment. I’ve run with MySQL as it’s powerful, free in its community edition, and eminently adaptable. Most usefully, for a project with no budget other than your time, MySQL can generally ‘talk’ seamlessly to many other free software tools, thanks to packages developed by the open source world.
In the process I use, MySQL is really the foundational technology upon which lots of other nice things can be built, like interactive online charts—in fact, any dynamic online version of the data, which can be much more engaging—and illuminating—than posting up flat files alone.
With specialist programming, which is for another day, you can do neat things like roll out your data from MySQL into any file format you care to mention, e.g., text, pdf, R, TeX, and most usefully for many end-users, properly formatted Excel tables—without error-prone and redundant copying and pasting.
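The roll-out idea can itself be sketched briefly: query the database once, and write a properly formatted flat file from the result, with no copying and pasting anywhere. The sketch below uses sqlite3 and csv from the Python standard library as a stand-in for a MySQL connection; all table and column names are hypothetical:

```python
import csv
import io
import sqlite3

# Stand-in database; a MySQL driver would change only the connection line
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (year INTEGER, pence INTEGER)")
conn.executemany("INSERT INTO totals VALUES (?, ?)",
                 [(1760, 374640), (1761, 310200)])

# One query, one export routine: rerun it whenever the data changes
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["year", "pence"])
writer.writerows(conn.execute("SELECT year, pence FROM totals ORDER BY year"))

csv_text = buffer.getvalue()
```

Swap the CSV writer for a TeX-table or Excel writer and the same pattern covers the other formats: the data lives in one place, and every output file is generated from it.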
MySQL, like any powerful technology, can have a steep learning curve, but some of the online resources for this are just superb, such as the excellent MySQL Tutorial.
Other sites allow you to play with, and thus learn, MySQL code online, via ‘fiddle’ sites, without the need to install the software on your machine, at least at first.
I hope to blog here from time to time on some of the more advanced stuff, particularly on MySQL. For now, one last screenshot to illustrate the next level: a version of the 18th century public finances dataset, in MySQL form but viewed through Navicat, one of a number of GUIs (Graphical User Interfaces). MySQL can be driven directly from a command-line tool, but for most humanoids, GUI packages like Navicat are a nice combination of usability and power. They sit on top of MySQL and let you interact with your databases at the level of code, like a command-line interface, but also through more intuitive and equally powerful menus, windows, and wizards, for ease of use and speed. Click to view the detail.
Navicat is a commercial product, but not that expensive for educational and similar uses, and I think it’s worth it—once you’ve seen the underlying power of MySQL.