CLI

The Command Line Interface (CLI) is the program you use to communicate with the DetaBord instance. Refer to the CLI Installation Manual for installation instructions.

Introduction

The DetaBord CLI is executed in a terminal session (e.g. bash) as “dq0” (assuming that the path to the CLI installation is added to your PATH). All commands of the CLI follow the form dq0 [context] [command] [arguments].

Examples:

  • dq0 user login to log in to the DetaBord instance
  • dq0 data list to list available data sets
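
For scripting, the command form described above can be captured in a small helper. The following Python sketch only assembles the argument list and does not execute anything. The function name build_dq0_command is a hypothetical helper, not part of the CLI or the SDK.

```python
import shlex


def build_dq0_command(context, command, *arguments):
    """Assemble 'dq0 [context] [command] [arguments]' as an argument list."""
    return ["dq0", context, command, *arguments]


# The two examples from above, rendered as shell-safe strings.
print(shlex.join(build_dq0_command("user", "login")))  # dq0 user login
print(shlex.join(build_dq0_command("data", "list")))   # dq0 data list
```

The resulting list can be passed to subprocess.run(cmd, check=True), assuming dq0 is on your PATH.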

First Steps

When the CLI is used for the first time, the DetaBord data quarantine instance to be used must first be registered. This is done with the following command:

dq0 proxy add --scheme https --hostname [URL] --port [PORT]

Example for https://dq0.io:8000:

dq0 proxy add --scheme https --hostname dq0.io --port 8000

Ask your DQ0 administrator about the URL and port of your instance.
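
If your administrator gives you a full URL, it can be split into the three values the registration command expects. This is a minimal sketch using Python's standard urllib.parse module. The helper name proxy_add_args is hypothetical, and the sketch only builds the argument list without running it.

```python
from urllib.parse import urlsplit


def proxy_add_args(url):
    """Translate an instance URL into the arguments of 'dq0 proxy add'."""
    parts = urlsplit(url)
    if not parts.scheme or parts.hostname is None or parts.port is None:
        raise ValueError("expected a URL of the form scheme://hostname:port")
    return [
        "dq0", "proxy", "add",
        "--scheme", parts.scheme,
        "--hostname", parts.hostname,
        "--port", str(parts.port),
    ]


print(" ".join(proxy_add_args("https://dq0.io:8000")))
# dq0 proxy add --scheme https --hostname dq0.io --port 8000
```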

Help

You can always run the DetaBord CLI with the -h or --help argument to find out about the individual commands:

Examples:

  • dq0 -h
  • dq0 data -h

Login

In order to communicate with the DetaBord instance, you have to log in, i.e. authorize and authenticate.

If you are not yet registered, you can register directly via the CLI using the following command:

dq0 user register

You will then be asked for a user name (email address) and password.

In the course of this registration, the CLI also creates an SSH key pair (private and public key), which is used to encrypt the communication with DetaBord.

Note: All communication with the DetaBord instance is encrypted end-to-end. You can therefore only communicate with the instance from the computer on which you registered.

The registration request must first be confirmed by your DetaBord administrator. Only then can you log in with your chosen credentials using the following command:

dq0 user login

After successful login, the session is valid for 30 days.

Projects

Everything you do with DetaBord is organized by projects. To create a new project, call:

dq0 project create [PROJECT-NAME]

Example:

dq0 project create My-Project

This will create a new folder in your local directory called “My-Project”. This new folder contains a meta file for project management and some templates to help you get started with DetaBord development.

To get a list of your available projects, use:

dq0 project list

Example response:

+--------------------------------------+--------------+-------------+---------+------+----------------+----------------+---------------------------+----------------+
| PROJECTUUID                          | PROJECTNAME  | EXPERIMENTS | COMMITS | RUNS | DATASETS       | MODELS         | UPDATEDAT                 | LOCALAVAILABLE |
|                                      |              |             |         |      | (USED/CREATED) | (USED/CREATED) |                           |                |
+--------------------------------------+--------------+-------------+---------+------+----------------+----------------+---------------------------+----------------+
| afff0f3f-6299-450c-9ac8-69ebcf49d23d |      DemoXYZ |           1 |       1 |    1 |          1 / 1 |          0 / 1 | 2021-01-08T11:02:19+01:00 |           true |
+--------------------------------------+--------------+-------------+---------+------+----------------+----------------+---------------------------+----------------+
# Total Items: 1, page: 1, pageSize: 100

To get information about one project, use:

dq0 project info --project-path=[PATH-TO-PROJECT-FOLDER]

You can omit the project-path argument if you change to your project directory. This is true for all commands where you need a project-uuid or project-path argument.

As projects are created locally on your machine and the project’s code is managed by your external version control system (e.g. your company’s git repositories), you need to sync the project’s content with DetaBord before you can start runs (e.g. training jobs) on the DetaBord instance.

To sync a project with the DetaBord instance, use:

dq0 project deploy --project-path=[PATH-TO-PROJECT-FOLDER]

or inside the project’s directory:

dq0 project deploy

Data

This section describes how you can manage data sets available on the DetaBord instance.

List data sets

Use the following command to display a list of all available data sets:

dq0 data list

A response can look like this:

+----+--------------------------------------+--------------+------------+-----------------+-------------+---------------------------+
| ID | DATAUUID                             | DATANAME     | TYPE       | DESCRIPTION     | PERMISSIONS | UPDATEDAT                 |
+----+--------------------------------------+--------------+------------+-----------------+-------------+---------------------------+
|  1 | 81e497ef-c37d-41f3-8381-dcd6c268a7fd | Test dataset | PostgreSQL | Some test data  |             | 2021-01-08T10:59:43+01:00 |
|  2 | 802bf101-9087-4856-8399-506d7728ab70 |       Census |        CSV | Description     |             | 2021-01-07T14:41:40+01:00 |
+----+--------------------------------------+--------------+------------+-----------------+-------------+---------------------------+

The data set with the name “Census” is of type “CSV” (comma separated values file); it has the ID “2” and the UUID (universally unique identifier) “802bf101-9087-4856-8399-506d7728ab70”.
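
When scripting against the tabular output, the UUID of a data set can be looked up by name. The following is a minimal sketch of a parser for this kind of bordered table, run against a shortened copy of the response above. The function parse_cli_table is a hypothetical helper. It assumes one line per row and no "|" characters inside cells, so it does not handle wrapped cells.

```python
def parse_cli_table(text):
    """Parse a '+---+' bordered table into a list of row dictionaries."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip borders, summary lines, and blank lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Shortened copy of the example response (fewer columns).
SAMPLE = """
+----+--------------------------------------+--------------+------------+
| ID | DATAUUID                             | DATANAME     | TYPE       |
+----+--------------------------------------+--------------+------------+
|  1 | 81e497ef-c37d-41f3-8381-dcd6c268a7fd | Test dataset | PostgreSQL |
|  2 | 802bf101-9087-4856-8399-506d7728ab70 |       Census |        CSV |
+----+--------------------------------------+--------------+------------+
"""

datasets = parse_cli_table(SAMPLE)
census = next(row for row in datasets if row["DATANAME"] == "Census")
print(census["DATAUUID"])  # 802bf101-9087-4856-8399-506d7728ab70
```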

Data set info

You can use the following command to display detailed information, including access statistics, about a data set:

dq0 data info --data-uuid UUID

or

dq0 data info --data-id ID

Example:

dq0 data info --data-uuid 802bf101-9087-4856-8399-506d7728ab70

Example response (in JSON format):

{
  "commit_uuid": "602d2329-c7ab-44b6-a6ff-6bea996ce41b",
  "data_uuid": "802bf101-9087-4856-8399-506d7728ab70",
  "data_name": "Census",
  "data_type": "CSV",
  "data_description": "Description",
  "privacy_budget": {
    "initial": 100,
    "current": 79.69
  },
  "data_usage": 89,
  "data_size": 1000,
  "data_meta": "base64encoded-metadata",
  "created_at": 1610026900,
  "updated_at": 1610026900
}
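
The response can be processed with any JSON library. The sketch below uses an abbreviated copy of the example to compute how much privacy budget has been spent and to convert a timestamp. It rests on two assumptions not stated above: that "current" is the budget still available, and that the timestamps are Unix epoch seconds.

```python
import json
from datetime import datetime, timezone

# Abbreviated copy of the example response above.
response = """
{
  "data_name": "Census",
  "privacy_budget": {"initial": 100, "current": 79.69},
  "created_at": 1610026900
}
"""

info = json.loads(response)
budget = info["privacy_budget"]

# Assumption: "current" is the budget that is still available.
spent = round(budget["initial"] - budget["current"], 2)
remaining_fraction = budget["current"] / budget["initial"]

# Assumption: timestamps are Unix epoch seconds.
created = datetime.fromtimestamp(info["created_at"], tz=timezone.utc)

print(f"{info['data_name']}: {spent} spent, {remaining_fraction:.1%} remaining")
# Census: 20.31 spent, 79.7% remaining
print(created.isoformat())
# 2021-01-07T13:41:40+00:00
```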

Attach Data Sets to Projects

If you want to train a model on a sensitive data set inside the DetaBord quarantine, there are two important prerequisites:

  1. Your model code needs to use the DetaBord SDK methods to read the selected data sets at runtime (i.e. use the dq0.sdk.data data source classes and the read() function).
  2. The DetaBord platform needs to know which data source shall be connected to the runtime. Therefore, an available data set needs to be attached to your project.

To attach a data set, copy the data set’s UUID (from the data list or data info command) and use it in the following command:

dq0 project attach --project-path=[PATH-TO-PROJECT-FOLDER] --data-uuid=[DATA-UUID]

or inside the project directory:

dq0 project attach --data-uuid=[DATA-UUID]

Use the detach command to remove a data set from a project:

dq0 project detach --data-uuid=[DATA-UUID]

Experiments & Commits

Experiments organize your attempts to create good models. Create a new experiment whenever you want to try a different approach. You can create as many experiments as you like. Each experiment belongs to one project, can have its own parameters and entry points, and can contain many runs, i.e. parametrized experiment executions.

Create a new experiment with:

dq0 experiment create [NAME] [--project-path=[PATH-TO-PROJECT-FOLDER]]

Delete an existing experiment with:

dq0 experiment delete --experiment-uuid=[UUID]

Rename an experiment:

dq0 experiment update --experiment-uuid=[UUID] --experiment-name=[NEW-NAME]

Get all available experiments of the project:

dq0 experiment list [--project-path=[PATH-TO-PROJECT-FOLDER]]

Example response:

+--------------------------------------+---------+----------+-------+---------------------------+
| UUID                                 | NAME    | #COMMITS | #RUNS | UPDATEDAT                 |
+--------------------------------------+---------+----------+-------+---------------------------+
| d7a3d540-15cc-48f7-92da-caca9dfe20aa | Default |        1 |     1 | 2021-01-07T14:42:08+01:00 |
+--------------------------------------+---------+----------+-------+---------------------------+

Info for one specific experiment:

dq0 experiment info --experiment-uuid=[UUID]

Running a training job

Before running a training job, make sure your code is in sync with the DetaBord platform instance. To sync your code run:

dq0 project deploy [--project-path=[PATH-TO-PROJECT-FOLDER]]

This command will return a commit UUID that you can use to start the run. Example project deploy response:

{
  "message": "project successfully deployed with new commit uuid: c99eb85d-f39d-4362-8640-9f981ede687d"
}

The latest commit is stored in your local project metadata automatically.

To start the training job, use:

dq0 commit run [--project-path=[PATH-TO-PROJECT-FOLDER]]

With arguments:

dq0 commit run [ARG1]=[VAL1] [ARG2]=[VAL2] --mlproject-entry-point=[ENTRY_POINT]
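
When runs are started from a script, the [ARG]=[VAL] pairs can be generated from a dictionary. This is a sketch only: commit_run_args is a hypothetical helper, the parameter names and the entry point are made up, and it assumes that --project-path can be combined with the other arguments.

```python
def commit_run_args(parameters, entry_point=None, project_path=None):
    """Build the argument list for 'dq0 commit run' from a parameter dict."""
    args = ["dq0", "commit", "run"]
    args += [f"{name}={value}" for name, value in parameters.items()]
    if entry_point is not None:
        args.append(f"--mlproject-entry-point={entry_point}")
    if project_path is not None:
        args.append(f"--project-path={project_path}")
    return args


# 'epochs', 'learning_rate' and 'train' are made-up values for illustration.
print(" ".join(commit_run_args({"epochs": 10, "learning_rate": 0.01}, entry_point="train")))
# dq0 commit run epochs=10 learning_rate=0.01 --mlproject-entry-point=train
```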

Track your runs with:

dq0 run list [--project-path=[PATH-TO-PROJECT-FOLDER]]

and

dq0 job info --job-uuid=[JOB-UUID]

To inspect the job’s results, use the artifact commands:

dq0 artifact tree-structure --run-uuid=[RUN-UUID] --level=5

dq0 artifact download --run-uuid=[RUN-UUID] --path=[ARTIFACT-PATH] --download-path=[LOCAL-DOWNLOAD-PATH]

Example:

dq0 artifact download --run-uuid=ae38b2aa-4976-4155-a51b-897bbbb93a1c --path=path/to/artifact --download-path=/path/to/your/local/file.txt

Queries

To send a query, specify the query string, the data sets to use, and any additional parameters:

dq0 query create --datasets=[DATASET1-NAME] --query='[QUERY-STRING]' [--project-path=[PATH-TO-PROJECT-FOLDER]]

Example:

dq0 query create --datasets=dataset1 --query='SELECT COUNT(*) FROM db;'

You can also point to a YAML file containing the query string. Example:

dq0 query create --datasets=dataset1,dataset2 --query-path=/path/to/query.yaml

Get information about a running query job with:

dq0 query info --query-uuid=[JOB-UUID]

Example output:

{
  "user_id": 2,
  "user_name": "12@gradient0.com",
  "job_uuid": "4298634d-727d-48dd-96a9-29f8ce2f563b",
  "job_name": "Query Run",
  "job_type": "query.run",
  "job_logs": "2021-01-11T14:58:42Z | dq0.sql.runner | INFO | [__KEYWORD_STARTED__] Started with args: ...",
  "job_progress": 1,
  "job_state": "finished",
  "created_at": 1610377119,
  "updated_at": 1610377125
}
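
A script can inspect this output to decide whether the results are worth requesting. The sketch below works on an abbreviated copy of the example. "finished" is the only job state shown in this document, so other state names are not assumed. The runtime estimate treats the timestamps as Unix epoch seconds, which is also an assumption.

```python
import json


def summarize_job(info):
    """Return (finished, runtime_seconds) for a job info dictionary."""
    finished = info["job_state"] == "finished"
    # Approximate runtime: time between creation and last update.
    runtime = info["updated_at"] - info["created_at"]
    return finished, runtime


# Abbreviated copy of the example output above.
output = """
{
  "job_uuid": "4298634d-727d-48dd-96a9-29f8ce2f563b",
  "job_state": "finished",
  "job_progress": 1,
  "created_at": 1610377119,
  "updated_at": 1610377125
}
"""

finished, runtime = summarize_job(json.loads(output))
print(finished, runtime)  # True 6
```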

Get the query results (once released) with:

dq0 query result --query-uuid=[JOB-UUID]

Edit Data (DATA OWNER ROLE)

Data Metadata

The data_meta field contains a (base64 encoded) string of the data set’s metadata definition. Data metadata is defined in YAML format and looks like this:

name: 'Census'
description: 'Description'
type: 'CSV'
connection: '/path/to/data/census.csv'
privacy_budget: 100
privacy_budget_interval_days: 30
synth_allowed: true
privacy_level: 2
Census:
  table:
    censor_dims: true
    clamp_columns: false
    clamp_counts: false
    max_ids: 10
    row_privacy: false
    rows: 150
    sample_max_ids: true
    tau: 0
    age:
      type: int
      bounded: true
      lower: 0
      upper: 100
      use_auto_bounds: false
      auto_bounds_prob: 0.9
    id:
      private_id: true
      type: int
    workclass:
      cardinality: 9
      allowed_values: 'Private,Self-emp-not-inc,...'
      type: string
      selectable: false
    email:
      type: string
      mask: '(.*)@(.*).{3}$'
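
Because data_meta is delivered as a base64 string, it has to be decoded before the YAML can be read. The following is a minimal sketch using Python's standard base64 module. It assumes the standard base64 alphabet and UTF-8 text; both helper names are hypothetical.

```python
import base64


def decode_data_meta(data_meta):
    """Decode the base64 'data_meta' string into YAML text."""
    return base64.b64decode(data_meta).decode("utf-8")


def encode_data_meta(yaml_text):
    """Inverse operation, useful for round-trip checks."""
    return base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")


yaml_text = "name: 'Census'\ntype: 'CSV'\nprivacy_budget: 100\n"
encoded = encode_data_meta(yaml_text)
print(decode_data_meta(encoded) == yaml_text)  # True
```

The decoded text can then be parsed with a YAML library of your choice, for example yaml.safe_load from PyYAML.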

The metadata definition borrows some of its privacy properties from OpenDP (more precisely, it is a superset of OpenDP’s SmartNoise metadata definition).

  • name: The name of the data set.
  • description: Data set description
  • connection: Connection URI, file path for CSVs, DB connection string for SQL
  • type: Data set type
  • privacy_budget: Privacy budget property. The privacy budget limits the maximum allowed information to be published about this data set.
  • privacy_budget_interval_days: Reset the privacy budget after this number of days. Default is 0 (no reset).
  • synth_allowed: true to allow synthesized data for exploration. The DetaBord data synthesizer can be a powerful tool to learn more about data sets without consuming (more) privacy budget.
  • privacy_level: 0, 1, 2 in ascending order of privacy protection. Use 0 for public data sets, 1 for non-private data sets, and 2 for private data sets.
  • schema (Census): Name of the database

Table level properties:

  • row_privacy: Tells the system to treat each row as being a single individual. This is common with social science datasets. Default is false.
  • rows: Number of rows
  • max_ids: Specifies how many rows each unique user can appear in. If any user appears in more rows than specified, the system will randomly sample to enforce this limit (see sample_max_ids). Default is 1.
  • sample_max_ids: If the data curator can be certain that each user appears at most max_ids times in the table, this setting can be disabled to skip the reservoir sampling step. Default is true.
  • censor_dims: Drops GROUP BY output rows that might reveal the presence of individuals in the database. For example, a query doing GROUP BY over last names would reveal the existence of an individual with a rare last name. Data owners may override this setting if the dimensions are public or non-sensitive. Default is true.
  • clamp_counts: Differentially private counts can sometimes be negative. Setting this option to true will clamp negative counts to 0. Does not affect privacy, but may impact utility. Default is false.
  • clamp_columns: By default, the system clamps all input data to ensure that it falls within the lower and upper bounds specified for that column. If the data curator can be certain that the data never falls outside the specified ranges, this step can be disabled. Default is true.
  • use_dpsu: Tells the system to use Differentially Private Set Union for censoring of rare dimensions. Does not impact privacy. Default is false.
  • tau: Privacy thresholding value. Groups smaller than this value are considered private and are not returned. Default is 0 (disabled).
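
To illustrate what max_ids and sample_max_ids mean, here is a sketch of per-user reservoir sampling that keeps at most max_ids randomly chosen rows for each private identifier. It shows the general technique only and is not DetaBord's implementation. The function name, the row layout, and the seed parameter are assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_max_ids(rows, id_column, max_ids, seed=None):
    """Keep at most max_ids randomly chosen rows per private id."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # private id -> sampled rows
    seen = defaultdict(int)         # private id -> rows seen so far
    for row in rows:
        key = row[id_column]
        seen[key] += 1
        if len(reservoirs[key]) < max_ids:
            reservoirs[key].append(row)
        else:
            # Reservoir step: keep the new row with probability max_ids / seen.
            slot = rng.randrange(seen[key])
            if slot < max_ids:
                reservoirs[key][slot] = row
    return [row for kept in reservoirs.values() for row in kept]


# User 1 appears in 15 rows, user 2 in one row.
rows = [{"id": 1, "age": a} for a in range(20, 35)] + [{"id": 2, "age": 50}]
sampled = sample_max_ids(rows, "id", max_ids=10, seed=0)
print(len(sampled))  # 11: ten rows for id 1, one row for id 2
```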

Column level properties:

  • type: This type attribute indicates the simple type for all values in the column. Type may be one of “int”, “float”, “string”, “boolean”, or “date”. The “date” type includes date or time types. This property is required.
  • private_id: Indicates that this column is the private identifier (e.g. “UserID”, “Household”). Only columns which have private_id set to ‘true’ are treated as individuals subject to privacy protection. Default is false.
  • selectable: Set to true to allow this column to be selectable outside private aggregations. Default is false.
  • lower: Valid on numeric columns. Specifies the lower bound for values in this column.
  • upper: Valid on numeric columns. Specifies the upper bound for values in this column.
  • use_auto_bounds: DetaBord provides a mechanism to calculate reasonable bounds automatically. Set this to true to use the calculated values (stored in the additional properties auto_lower and auto_upper) instead of the manual ones. Default is false.
  • auto_bounds_prob: For auto bound calculation: the probability of not selecting false positives.
  • cardinality: This is an optional hint, valid on columns intended to be used as categories or keys in a GROUP BY. Specifies the approximate number of distinct keys in this column.
  • allowed_values: An optional property for string type columns. List of strings (comma-separated) indicating the allowed values this column can have.
  • mask: Valid on string columns. Can be used to mask returned values, e.g. to hide parts of e-mail addresses etc.
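
The interplay of lower, upper and use_auto_bounds can be sketched as follows: the bounds are chosen first, then each value is forced into that range. This is an illustration of the described behaviour, not the platform's code. The helper names and the auto_lower and auto_upper values are made up.

```python
def effective_bounds(column):
    """Pick the bounds to apply, following use_auto_bounds."""
    if column.get("use_auto_bounds", False):
        return column["auto_lower"], column["auto_upper"]
    return column["lower"], column["upper"]


def clamp(value, lower, upper):
    """Force a value into the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


# The 'age' column from the example metadata, plus made-up auto bounds.
age = {"type": "int", "lower": 0, "upper": 100,
       "use_auto_bounds": False, "auto_lower": 17, "auto_upper": 90}

lower, upper = effective_bounds(age)
print([clamp(value, lower, upper) for value in [-3, 42, 130]])  # [0, 42, 100]
```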

Add or Update a data set

To add a new data set to DetaBord, use:

dq0 data add --meta-path=/path/to/my_config.yaml

where my_config.yaml contains a data set definition in the above metadata format.

Update an existing data set with:

dq0 data update --data-uuid=[UUID] --meta-path=/path/to/my_config.yaml

Remove a data set

To remove an existing data set from DetaBord use:

dq0 data remove --data-uuid=[UUID]

Audits (DATA OWNER ROLE)

One of the more important aspects of privacy and data protection is to keep track of what is going on. DetaBord offers an extensive auditing system that Data Owners and Administrators can use to inspect what happened on the platform.

Use the following command to get a list of all recent audited events:

dq0 audit list

A response to this command can look like this:

+---------------------------+--------------------+------------------+----------------------------------------+
| TIMESTAMP                 | ACTOR              | ACTION           | DESCRIPTION                            |
+---------------------------+--------------------+------------------+----------------------------------------+
| 2021-01-11T15:58:39+01:00 | user@gradient0.com | query.added      | query with uuid                        |
|                           |                    |                  | '4298634d-727d-48dd-96a9-29f8ce2f563b' |
|                           |                    |                  | was added by user 'jb@gradient0.com'   |
| 2021-01-11T12:08:28+01:00 | user@gradient0.com | query.added      | query with uuid                        |
|                           |                    |                  | 'e62cd857-ded7-4f8d-83e6-ec8820301278' |
|                           |                    |                  | was added by user 'jb@gradient0.com'   |
| 2021-01-08T16:31:51+01:00 | user@gradient0.com | user.loggedIn    | user with email                        |
|                           |                    |                  | 'jb@gradient0.com' and device          |
|                           |                    |                  | '597dce26-5219-43e2-add3-a9322ac40210' |
|                           |                    |                  | logged in successfully                 |
+---------------------------+--------------------+------------------+----------------------------------------+
# Total Items: 3, page: 1, pageSize: 100

You can control the output with the optional flags --json (output in JSON format), --page and --page-size:

dq0 audit list --json --page=1 --page-size=10

{
  "total": 36,
  "page": 1,
  "page_size": 10,
  "items": [
    {
      "timestamp": "1610363349",
      "actor": "jb@gradient0.com",
      "action": "query.added",
      "description": "query with uuid '80918f4f-5530-42b4-91cb-7625c6687ad2' was added by user 'jb@gradient0.com'"
    },
    {
      "timestamp": "1610363308",
      "actor": "jb@gradient0.com",
      "action": "query.added",
      "description": "query with uuid 'e62cd857-ded7-4f8d-83e6-ec8820301278' was added by user 'jb@gradient0.com'"
    },
    {
      "timestamp": "1610119911",
      "actor": "jb@gradient0.com",
      "action": "user.loggedIn",
      "description": "user with email 'jb@gradient0.com' and device '597dce26-5219-43e2-add3-a9322ac40210' logged in successfully"
    }
  ]
}
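
The JSON form is convenient for post-processing. The sketch below uses an abbreviated copy of the response to work out how many pages exist, to count events per action, and to convert the most recent timestamp. It assumes that the timestamp strings hold Unix epoch seconds.

```python
import json
import math
from collections import Counter
from datetime import datetime, timezone

# Abbreviated copy of the JSON response above (descriptions omitted).
response = json.loads("""
{
  "total": 36,
  "page": 1,
  "page_size": 10,
  "items": [
    {"timestamp": "1610363349", "actor": "jb@gradient0.com", "action": "query.added"},
    {"timestamp": "1610363308", "actor": "jb@gradient0.com", "action": "query.added"},
    {"timestamp": "1610119911", "actor": "jb@gradient0.com", "action": "user.loggedIn"}
  ]
}
""")

# Number of pages needed to read all events at this page size.
pages = math.ceil(response["total"] / response["page_size"])

# Count events per action type.
actions = Counter(item["action"] for item in response["items"])

# Timestamps arrive as strings; this assumes Unix epoch seconds.
latest = max(int(item["timestamp"]) for item in response["items"])
latest_utc = datetime.fromtimestamp(latest, tz=timezone.utc)

print(pages)                   # 4
print(actions["query.added"])  # 2
print(latest_utc.isoformat())  # 2021-01-11T11:09:09+00:00
```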
