Our Chief Technology Officer Frank Wickström is in charge of planning our technology strategy. He wrote this blog post about how to keep personal data personal with database sanitization.
TL;DR: We made a sanitization tool for your database dumps so that you can worry less about dumping data and more about fixing things. Find it at: https://github.com/andersinno/python-database-sanitizer.
At Anders, we hold security and privacy in the utmost regard. We have strict rules on how data can be managed and shared, while trying our best not to step on the developers' toes too much.
However, one scenario has been a pain point for some time: how to debug database content with as little risk of data exposure as possible. While our developers are under NDA not to disclose any customer data, we have always found it troubling that debugging production issues often means replicating the production environment as closely as possible locally, which in turn often requires taking and examining a database dump. While this is a very straightforward task, keeping the data locally puts sensitive customer data at unnecessary risk of exposure.
Managed test databases solve this to some extent but might slow down the work compared to local debugging. However, keeping data locally on developers’ machines heightens the risk of the data getting forgotten on disk, which in turn adds one more vector of attack. This is why database dumps have always been kept to a minimum and only used as the last resort in debugging.
With GDPR in place, more discussion on the topic started to appear online. Many wondered how they could handle cases where they had personal data on their local machines and end users wanted a report of what data was stored about them, or a full erasure of all their personal data from the system. As we have always strived to keep personal data personal, we knew that we could do better than the system we had in place.
Tools for handling database dumps started to appear around the time GDPR enforcement began, but many of them were highly specific to a programming language, a framework, or a database type. We wanted something more agnostic, something that would work for applications written in any language, and we wanted to support at least PostgreSQL and MySQL. So we created it: https://github.com/andersinno/python-database-sanitizer. Don't let the name fool you: "database-sanitizer" was simply already taken. The tool needs Python to run, but your application can be written in your language of choice.
At Anders, we like it when things work out-of-the-box and developers do not need to spend endless hours figuring things out. So, we tried our best to take care of figuring out which commands to run while you figure out what you need to sanitize. We try to keep things simple while still making the tool flexible enough for other developers to extend the way sanitization works when needed. We do this through plugin support for additional sanitization functions.
HOW DOES IT WORK?
The tool works by taking a database dump with either "pg_dump" or "mysqldump" and then going through the output, checking the table and column names. If a field is found in the config file, its content is run through a sanitization function that changes the value accordingly, and the tool outputs a sanitized database dump. For convenience's sake, we have included a few sanitization functions for the most common use cases, such as user fields and dates. However, developers can also write their own sanitization functions and either place them in a "/sanitizers" folder at the root of the project or point to a Python package such as "sanitizer_package.sanitizers.uuid".
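To make the idea concrete, here is a minimal sketch of the lookup-and-replace step described above. It is not the tool's actual code or configuration format: the function names, the STRATEGY mapping, and the "auth_user" table are hypothetical, chosen only to illustrate how a table and column name can be matched to a sanitization function.

```python
import hashlib


def sanitize_email(value):
    """Hypothetical sanitizer: swap an e-mail for a deterministic placeholder."""
    if value is None:
        return None
    # Hashing keeps the placeholder stable, so equal inputs stay equal.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"user-{digest}@example.com"


def sanitize_empty(value):
    """Hypothetical sanitizer: blank out a value but keep NULLs as NULLs."""
    return None if value is None else ""


# Illustrative stand-in for the config file: table -> column -> sanitizer.
STRATEGY = {
    "auth_user": {
        "email": sanitize_email,
        "password": sanitize_empty,
    },
}


def sanitize_row(table, row, strategy=STRATEGY):
    """Return a copy of `row` with every configured column sanitized."""
    columns = strategy.get(table, {})
    return {
        column: columns[column](value) if column in columns else value
        for column, value in row.items()
    }


row = {"id": 1, "email": "jane@customer.com", "password": "pbkdf2$abc"}
clean = sanitize_row("auth_user", row)
```

Columns that are not listed for a table pass through unchanged, which is why deciding what needs sanitizing is the part left to you.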
Want to give it a spin? It is as easy as "pip install database-sanitizer"
WORKING WITH DJANGO?
While the sanitizer is agnostic to the technologies used in the project that reads the database, we rely heavily on Django at Anders and have made a convenience wrapper for it: https://github.com/andersinno/django-sanitized-dump. Since Django uses an ORM for specifying its database structure, the wrapper uses Django's "models" to output a ready-made sanitization configuration that includes all of the fields. The developers can then specify which sanitizer to use for each field.
Frank Wickström, CTO of Anders
Writer:
Frank Wickström
Chief Technology Officer at Anders
Motto:
Scientia potentia est. (Knowledge is power.)