DetaBord Private SQL

Introduction

DetaBord uses a proprietary solution for private SQL queries.

Find more information about the general approach here: “Differentially Private SQL with Bounded User Contribution”

The DetaBord Private SQL plugin can (in contrast to smartnoise for example) handle Joins – through a Trino (formally Presto) connector even over distributed private datasets.

Flow

SQL statements are first parsed and validated and then rewritten for bounded aggregations with differential privacy. The schema below illustrated the flow of the private SQL DetaBord plugin:

dp sql schema

Privacy Properties

Privacy properties can be set through the metadata definition of the used datasets. An example metadata definition in DetaBord looks like this:

name: 'Census'
description: 'Description'
type: 'CSV'
connection: '/path/to/data/census.csv'
privacy_budget: 100
privacy_budget_interval_days: 30
synth_allowed: true
privacy_level: 2
Census:
  table:
    censor_dims: true
    clamp_columns: false
    clamp_counts: false
    max_ids: 10
    row_privacy: false
    rows: 150
    sample_max_ids: true
    tau: 0
    age:
      type: int
      bounded: true
      lower: 0
      upper: 100
      use_auto_bounds: false
      auto_bounds_prob: 0.9
    id:
      private_id: true
      type: int
    workclass:
      cardinality: 9
      allowed_values: 'Private,Self-emp-not-inc,...'
      type: string
      selectable: false
    email:
      type: string
      mask: '(.*)@(.*).{3}$'

The metadata definition borrows some of the privacy properties from open dp (or more precicely, the metadata is a superset of open dp’s defintion): smartnoise metadata

  • name: The name of the data set.
  • description: Data set description
  • connection: Connection URI, file path for CSVs, DB connection string for SQL
  • type: Data set type
  • privacy_budget: Privacy budget property. The privacy budget limits the maximum allowed information to be published about this data set.
  • privacy_budget_interval_days: Reset the privacy budget after this amount of days. Default is 0 (no reset).
  • synth_allowed: true to allow synthesized data for exploration. The DQ0 data synthesizer can be a powerful tool to learn more about data sets without consuming (more) privacy budget.
  • privacy_level: 0, 1, 2 in ascending order of privacy protection. Use 0 for public data sets, 1 more non-private data sets, and 2 for private data sets.
  • schema (Census): Name of the database

Table level properties:

  • row_privacy: Tells the system to treat each row as being a single individual. This is common with social science datasets. Default is false.
  • rows: Number of rows
  • max_ids: Specifies how many rows each unique user can appear in. If any user appears in more rows than specified, the system will randomly sample to enforce this limit (see sample_max_ids). Default is 1.
  • sample_max_ids: If the data curator can be certain that each user appears at most max_ids times in the table, this setting can be enabled to skip the reservoir sampling step. Default is true.
  • censor_dims: Drops GROUP BY output rows that might reveal the presence of individuals in the database. For example, a query doing GROUP BY over last names would reveal the existence of an individual with a rare last name. Data owners may override this setting if the dimensions are public or non-sensitive. Default is true.
  • clamp_counts: Differentially private counts can sometimes be negative. Setting this option to True will clamp negative counts to be 0. Does not affect privacy, but may impact utility. Default is false.
  • clamp_columns: By default, the system clamps all input data to ensure that it falls within the lower and upper bounds specified for that column. If the data curator can be certain that the data never fall outside the specified ranges, this step can be disabled. Default is true.
  • use_dpsu: Tells the system to use Differential Private Set Union for censoring of rare dimensions. Does not impact privacy. Default is false.
  • tau: Privacy thresholding value. Group sizes below this value are considered private and won’t answer. Default is 0 (disabled).

Column level properties:

  • type: This type attribute indicates the simple type for all values in the column. Type may be one of “int”, “float”, “string”, “boolean”, or “date”. The “date” type includes date or time types. This property is required.
  • private_key: Indicates that this column is the private identifier (e.g. “UserID”, “Household”). Only columns which have private_id set to ‘true’ are treated as individuals subject to privacy protection. Default is false.
  • selectable: Set to true to allow this column to be selectable outside private aggregations. Default is false.
  • lower: Valid on numeric columns. Specifies the lower bound for values in this column.
  • upper: Valid on numeric columns. Specifies the upper bound for values in this column.
  • use_auto_bounds: DQ0 provides a mechanism to calculate reasonable bounds automaticaly. Set this to true to use the calculated values (stored in the additional properties auto_lower and auto_upper) instead of the manual ones. Default is false.
  • auto_bounds_prob: For auto bound calculation: the probability of not selecting false positives.
  • cardinality: This is an optional hint, valid on columns intended to be used as categories or keys in a GROUP BY. Specifies the approximate number of distinct keys in this column.
  • allowed_values: An optional propertiy for string type columns. List of strings (comma-seperated) indicating the allowed values this column can have.
  • mask: Valid on string columns. Can be used to mask returned values, e.g. to hide parts of e-mail addresses etc.

Join Waitlist

Join the AI for Life Sciences Challenge

Let's talk