# Bulk processing

Bulk import lets you **import data in bulk into the mediarithmics platform**.

You can import:

* [Offline activities](/data-streams/data-ingestion/bulk-processing/imports/offline-activities.md) such as offline purchases and store visits
* User segments such as email lists, cookie lists, user account lists, etc.
* User profiles such as CRM data and scoring
* User association such as CRM Onboarding
* User dissociation
* User suppression requests such as GDPR Suppression requests, and Opt-Out Management

## How it works

You upload files associated with a document import definition:

* Files represent the data.
* Document imports represent what mediarithmics should do with the data.

If you need to track users in real time, you should read [the real-time tracking guide](/data-streams/data-ingestion/real-time-user-tracking.md).

The two steps for bulk import are:

1. Create the document import definition to tell mediarithmics what you are importing
2. Upload files associated with the document import definition. Each uploaded file creates a new document import execution.

{% hint style="info" %}
For maximum performance:

* Keep each file to a maximum size of 100M.
* Use the document import for multiple records when there will be more than 1,000 per file.
{% endhint %}
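
The size limit above can be checked locally before uploading. The following is a minimal sketch, not part of the mediarithmics tooling: `check_import_file` is a hypothetical helper name, and it assumes that "100M" means 100 × 1024 × 1024 bytes, which this page does not specify.

```bash
# Hypothetical pre-upload check (illustrative helper, not a mediarithmics tool).
# Assumption: "100M" is read as 100 * 1024 * 1024 bytes.
MAX_BYTES=$((100 * 1024 * 1024))

check_import_file() {
  local file="$1"
  local bytes lines
  # Arithmetic expansion strips the padding some wc implementations add.
  bytes=$(( $(wc -c < "$file") ))
  lines=$(( $(wc -l < "$file") ))   # one command per line
  if [ "$bytes" -gt "$MAX_BYTES" ]; then
    echo "too large: $file ($bytes bytes), split it before uploading" >&2
    return 1
  fi
  echo "ok: $file ($bytes bytes, $lines lines)"
}

# Usage:
#   check_import_file ./profiles.ndjson && echo "ready to upload"
```

The line count is only reported, not enforced, since the 1,000-record figure is a recommendation rather than a limit.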

*How do you choose between creating a new document import and adding a new file to an existing one? We recommend creating a new document import each time you have a new set of files to upload. For example, if you upload CRM profiles every night, you should create a new "User profiles from CRM - " document import every night instead of uploading new files to a single "User profiles from CRM" document import.*

{% hint style="success" %}
Each line in the uploaded file is a command to execute. Depending on the document import type, you have different commands available.
{% endhint %}

## User identifiers in imports

When importing data, you need to properly add [user identifiers](/user-points.md#user-identifiers). This will ensure your data is associated with the proper [UserPoint](/user-points.md).

{% hint style="warning" %}
Only one identifier is allowed per line. For example, you shouldn't specify the user agent ID if the Email Hash is already used in a line.

However, you don't have to always use the same type of identifier in your document. For example, one line could use the user account ID while another uses the email hash.
{% endhint %}

## Document import

Document imports define what you are about to upload in one or multiple files.

A document import object has the following properties:

| field                     | type    | description                                                                                                                                                                                                                                                                                  |
| ------------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| document\_type            | Enum    | <p>The type of data you want to import. Should be <code>USER\_ACTIVITY</code>, <code>USER\_SEGMENT</code>, <code>USER\_PROFILE</code>, <code>USER\_CHOICE</code>, <code>USER\_IDENTIFIERS\_DELETION</code>, or <code>USER\_IDENTIFIERS\_ASSOCIATION\_DECLARATIONS</code>.</p> |
| mime\_type                | Enum    | The format of the imported data: `APPLICATION_X_NDJSON` or `TEXT_CSV`. It should match the format of the uploaded file, e.g. `.ndjson` or `.csv`. The CSV format can be chosen only for `USER_SEGMENT` imports. |
| encoding                  | String  | Encoding of the data that will be imported. Usually `utf-8`. |
| name                      | String  | The name of your import.                                                                                                                                                                                                                                                                     |
| priority                  | Enum    | `LOW`, `MEDIUM` or `HIGH`                                                                                                                                                                                                                                                                    |
| use\_processing\_pipeline | Boolean | Use this parameter if the import should go through activity analyzers or session aggregation for instance. Values are `true` or `false`. Default is `false`                                                                                                                                  |
| shuffle\_lines            | Boolean | Shuffles the lines of the file for better performance. Values are `true` or `false`. Default is `true`. |

```javascript
// Sample document import object
{
    "document_type": "USER_ACTIVITY",
    "mime_type": "APPLICATION_X_NDJSON",
    "encoding": "utf-8",
    "name": "<YOUR_DOCUMENT_IMPORT_NAME>"
}
```

## Create a document import

<mark style="color:green;">`POST`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports`

#### Path Parameters

| Name       | Type    | Description                                                |
| ---------- | ------- | ---------------------------------------------------------- |
| datamartId | integer | The ID of the datamart in which your data will be imported |

#### Request Body

| Name | Type   | Description                                   |
| ---- | ------ | --------------------------------------------- |
| data | object | The document import object you wish to create |

Response:

{% tabs %}
{% tab title="200 " %}

```javascript
{
  "status": "ok",
  "data": {
    "id": "36271",
    "datafarm_key": "DF_KEY",
    "datamart_id": "DATAMART_ID",
    "document_type": "USER_PROFILE",
    "mime_type": "APPLICATION_X_NDJSON",
    "encoding": "utf-8",
    "name": "YOUR_DOCUMENT_IMPORT_NAME",
    "priority": "MEDIUM",
    "shuffle_lines" : true, 
    "use_processing_pipeline" : false
  }
}
```

{% endtab %}
{% endtabs %}

Here is a sample request using **curl**:

```bash
curl -X POST \
  "https://api.mediarithmics.com/v1/datamarts/<DATAMART_ID>/document_imports" \
  -H 'Authorization: <YOUR_API_TOKEN>' \
  -H 'Content-Type: application/json' \
  -d '{
          "document_type": "USER_ACTIVITY",
          "mime_type": "APPLICATION_X_NDJSON",
          "encoding": "utf-8",
          "name": "<YOUR_DOCUMENT_IMPORT_NAME>"
      }'
```

## List document imports

<mark style="color:blue;">`GET`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports`

You can list all document imports for a datamart or search them with filters.

#### Path Parameters

| Name       | Type    | Description            |
| ---------- | ------- | ---------------------- |
| datamartId | integer | The ID of the datamart |

#### Query Parameters

| Name            | Type   | Description                                                                                                                                                                                                                                           |
| --------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| keywords        | string | Keywords to match against document import names. The match is case sensitive. |
| mime\_type      | string | Filter on a specific mime type. Supported values are `APPLICATION_X_NDJSON` or `TEXT_CSV`. |
| document\_types | string | Filter on specific document types. Supported values are `USER_PROFILE`, `USER_ACTIVITY` or `USER_SEGMENT`. Multiple filters can be separated with commas. *Examples:* `&document_types=USER_PROFILE` *or* `&document_types=USER_PROFILE,USER_ACTIVITY` |
| order\_by       | string | Results are sorted by ID by default. Specify `&order_by=name` to sort them by name. |

The query is paginated as described in [using our API guide](/resources/api-overview.md).

{% tabs %}
{% tab title="200 " %}

```javascript
{
  "status": "ok",
  "data": [
    {
      "id": "19538",
      "datafarm_key": "DF_KEY",
      "datamart_id": "DATAMART_ID",
      "document_type": "USER_PROFILE",
      "mime_type": "APPLICATION_X_NDJSON",
      "encoding": "utf-8",
      "name": "December 2020 user profiles",
      "priority": "MEDIUM",
      "shuffle_lines" : true, 
      "use_processing_pipeline" : false
    },
    {
      "id": "19552",
      "datafarm_key": "DF_KEY",
      "datamart_id": "DATAMART_ID",
      "document_type": "USER_PROFILE",
      "mime_type": "APPLICATION_X_NDJSON",
      "encoding": "utf-8",
      "name": "January 2021 user profiles",
      "priority": "MEDIUM",
      "shuffle_lines" : true, 
      "use_processing_pipeline" : false
    },
    {
      "id": "19553",
      "datafarm_key": "DF_EU_2020_02",
      "datamart_id": "1509",
      "document_type": "USER_PROFILE",
      "mime_type": "APPLICATION_X_NDJSON",
      "encoding": "utf-8",
      "name": "February 2021 user profiles",
      "priority": "MEDIUM",
      "shuffle_lines" : true, 
      "use_processing_pipeline" : false
    }
  ],
  "count": 3,
  "total": 3,
  "first_result": 0,
  "max_result": 50,
  "max_results": 50
}
```

{% endtab %}
{% endtabs %}
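
The query parameters above can be combined in a single request. The sketch below wraps the call in a hypothetical `list_document_imports` function. `MICS_API_TOKEN` is an assumed environment variable holding your API token. It relies on curl's `-G` flag, which appends the `--data-urlencode` values to the URL as a query string. Pagination parameters are left out because this page does not name them.

```bash
# Sketch: list document imports filtered by keywords and document types,
# sorted by name. Function name and MICS_API_TOKEN are illustrative assumptions.
list_document_imports() {
  local datamart_id="$1" keywords="$2" document_types="$3"
  curl -G "https://api.mediarithmics.com/v1/datamarts/${datamart_id}/document_imports" \
    -H "Authorization: ${MICS_API_TOKEN}" \
    --data-urlencode "keywords=${keywords}" \
    --data-urlencode "document_types=${document_types}" \
    --data-urlencode "order_by=name"
}

# Usage:
#   list_document_imports <DATAMART_ID> "CRM" "USER_PROFILE,USER_ACTIVITY"
```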

## Get a document import

<mark style="color:blue;">`GET`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports/:importId`

#### Path Parameters

| Name       | Type    | Description                   |
| ---------- | ------- | ----------------------------- |
| datamartId | integer | The ID of the datamart        |
| importId   | integer | The ID of the document import |

{% tabs %}
{% tab title="200 " %}

```json
{
  "status": "ok",
  "data": {
    "id": "36271",
    "datafarm_key": "DF_KEY",
    "datamart_id": "DATAMART_ID",
    "document_type": "USER_PROFILE",
    "mime_type": "APPLICATION_X_NDJSON",
    "encoding": "utf-8",
    "name": "December 2020 user profiles",
    "priority": "MEDIUM",
    "shuffle_lines" : true, 
    "use_processing_pipeline" : false
  }
}
```

{% endtab %}
{% endtabs %}

## Update a document import

<mark style="color:orange;">`PUT`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports/:importId`

#### Path Parameters

| Name       | Type    | Description                   |
| ---------- | ------- | ----------------------------- |
| datamartId | integer | The ID of the datamart        |
| importId   | integer | The ID of the document import |

#### Request Body

| Name | Type   | Description                       |
| ---- | ------ | --------------------------------- |
| data | object | The document import object to put |

{% tabs %}
{% tab title="200 " %}

```json
{
  "status": "ok",
  "data": {
    "id": "36271",
    "datafarm_key": "DF_KEY",
    "datamart_id": "DATAMART_ID",
    "document_type": "USER_PROFILE",
    "mime_type": "APPLICATION_X_NDJSON",
    "encoding": "utf-8",
    "name": "YOUR_DOCUMENT_IMPORT_NAME",
    "priority": "MEDIUM",
    "shuffle_lines" : true, 
    "use_processing_pipeline" : false
  }
}
```

{% endtab %}
{% endtabs %}
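
As an illustration, an update request could look like the sketch below. `update_document_import` and `MICS_API_TOKEN` are hypothetical names. The example payload repeats the creation fields and changes only `priority`, on the assumption that `PUT` expects the complete document import object; this page does not say whether partial bodies are accepted.

```bash
# Sketch: update a document import by sending the document import object.
# Function name and MICS_API_TOKEN are illustrative assumptions.
update_document_import() {
  local datamart_id="$1" import_id="$2" payload="$3"
  curl -X PUT \
    "https://api.mediarithmics.com/v1/datamarts/${datamart_id}/document_imports/${import_id}" \
    -H "Authorization: ${MICS_API_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$payload"
}

# Usage (raises the priority to HIGH, other fields unchanged):
#   update_document_import <DATAMART_ID> <IMPORT_ID> '{
#       "document_type": "USER_PROFILE",
#       "mime_type": "APPLICATION_X_NDJSON",
#       "encoding": "utf-8",
#       "name": "<YOUR_DOCUMENT_IMPORT_NAME>",
#       "priority": "HIGH"
#   }'
```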

## Remove a document import

<mark style="color:red;">`DELETE`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports/:importId`

Removes a document import that you no longer want to keep in the system.

#### Path Parameters

| Name       | Type    | Description                   |
| ---------- | ------- | ----------------------------- |
| datamartId | integer | The ID of the datamart        |
| importId   | integer | The ID of the document import |
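
A deletion request could be sketched as follows. `delete_document_import` and `MICS_API_TOKEN` are hypothetical names, and the response format is not shown because this page does not document it.

```bash
# Sketch: remove a document import.
# Function name and MICS_API_TOKEN are illustrative assumptions.
delete_document_import() {
  local datamart_id="$1" import_id="$2"
  curl -X DELETE \
    "https://api.mediarithmics.com/v1/datamarts/${datamart_id}/document_imports/${import_id}" \
    -H "Authorization: ${MICS_API_TOKEN}"
}

# Usage:
#   delete_document_import <DATAMART_ID> <IMPORT_ID>
```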

## File upload

A file upload creates an execution.

After creation, the execution is at the `PENDING` status. It goes into the `RUNNING` status when the import starts and `SUCCEEDED` status once the platform has correctly imported the file.

## Create an execution

<mark style="color:green;">`POST`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports/:importId/executions`

You create an execution and upload a file with this endpoint.

#### Path Parameters

| Name                                         | Type   | Description                   |
| -------------------------------------------- | ------ | ----------------------------- |
| datamartId<mark style="color:red;">\*</mark> | string | The ID of the datamart        |
| importId<mark style="color:red;">\*</mark>   | string | The ID of the document import |

#### Headers

| Name                                           | Type   | Description                |
| ---------------------------------------------- | ------ | -------------------------- |
| Content-Type<mark style="color:red;">\*</mark> | string | The MIME type of the uploaded file, for example `application/x-ndjson`. |

{% tabs %}
{% tab title="200 " %}

```javascript
{
    "status": "ok",
    "data": {
        "parameters": null,
        "result": null,
        "error": null,
        "id": "11597785",
        "status": "PENDING",
        "creation_date": 1609410143659,
        "start_date": null,
        "duration": null,
        "organisation_id": "1426",
        "user_id": null,
        "cancel_status": null,
        "debug": null,
        "is_retryable": false,
        "permalink_uri": "MTowOjA6NDI1MzAxMg==",
        "num_tasks": null,
        "completed_tasks": null,
        "erroneous_tasks": null,
        "retry_count": 0,
        "job_type": "DOCUMENT_IMPORT",
        "import_mode": "MANUAL_FILE",
        "import_type": null
    }
}
```

{% endtab %}
{% endtabs %}

See an example:

```
curl --location --request POST 'https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports/:importId/executions/' \
--header 'Content-Type: application/x-ndjson' \
--header 'Authorization: api:TOKEN' \
--data-binary '@/Users/username/path/to/the/file.ndjson'
```

You retrieve metadata about the created execution, notably an `id` property that you can use to track the execution.
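
To reuse that `id` in a script, you can capture it from the response. The sketch below uses plain `sed` to avoid extra dependencies; a JSON-aware tool would be more robust. `upload_file`, `extract_execution_id` and `MICS_API_TOKEN` are hypothetical names, and the pattern assumes the `id` is a quoted number as in the sample response above.

```bash
# Sketch: upload a file, then read the execution id from the JSON response.
# Helper names and MICS_API_TOKEN are illustrative assumptions.
upload_file() {
  local datamart_id="$1" import_id="$2" file="$3"
  curl --silent --request POST \
    "https://api.mediarithmics.com/v1/datamarts/${datamart_id}/document_imports/${import_id}/executions" \
    --header "Content-Type: application/x-ndjson" \
    --header "Authorization: ${MICS_API_TOKEN}" \
    --data-binary "@${file}"
}

# Reads a response on stdin and prints the value of the "id" property.
# The leading quote in the pattern skips keys such as "organisation_id".
extract_execution_id() {
  sed -n 's/.*"id": *"\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Usage:
#   execution_id=$(upload_file <DATAMART_ID> <IMPORT_ID> ./file.ndjson | extract_execution_id)
```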

## List executions

<mark style="color:blue;">`GET`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports/:importId/executions`

You can list all executions for a document import and retrieve useful data like their status, execution time and error messages.

#### Path Parameters

| Name                                         | Type    | Description               |
| -------------------------------------------- | ------- | ------------------------- |
| datamartId<mark style="color:red;">\*</mark> | integer | The ID of the datamart    |
| importId<mark style="color:red;">\*</mark>   | integer | The ID of the document import |

{% tabs %}
{% tab title="200 " %}

```javascript
{
    "status": "ok",
    "data": [
        {
            "parameters": {
                "datamart_id": 1609,
                "document_import_id": 19718,
                "mime_type": "APPLICATION_X_NDJSON",
                "document_type": "USER_PROFILE",
                "input_file_name": "requestBody9664967795462448677asRaw",
                "file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody9664967795462448677asRaw-2020-12-31_10.22.23-KzgivDim3y.json",
                "number_of_lines": 4,
                "segment_id": null
            },
            "result": {
                "total_success": 4,
                "total_failure": 0,
                "input_file_name": "requestBody9664967795462448677asRaw",
                "input_file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody9664967795462448677asRaw-2020-12-31_10.22.23-KzgivDim3y.json",
                "error_file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody9664967795462448677asRaw-2020-12-31_10.22.23-KzgivDim3y_errors.csv",
                "possible_issue_on_identifiers": false,
                "top_identifiers": {}
            },
            "error": null,
            "id": "11597785",
            "status": "SUCCEEDED",
            "creation_date": 1609410143659,
            "start_date": 1609410150976,
            "duration": 3059,
            "organisation_id": "1426",
            "user_id": null,
            "cancel_status": null,
            "debug": null,
            "is_retryable": false,
            "permalink_uri": "MTowOjA6NDI1MzAxMg==",
            "num_tasks": 4,
            "completed_tasks": 4,
            "erroneous_tasks": 0,
            "retry_count": 0,
            "job_type": "DOCUMENT_IMPORT",
            "import_mode": "MANUAL_FILE",
            "import_type": null,
            "end_date": 1609410154035
        },
        {
            "parameters": {
                "datamart_id": 1609,
                "document_import_id": 19718,
                "mime_type": "APPLICATION_X_NDJSON",
                "document_type": "USER_PROFILE",
                "input_file_name": "requestBody17471990940413569967asRaw",
                "file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody17471990940413569967asRaw-2020-10-19_09.54.45-JvP1ssxKSu.json",
                "number_of_lines": 4,
                "segment_id": null
            },
            "result": {
                "total_success": 0,
                "total_failure": 4,
                "input_file_name": "requestBody17471990940413569967asRaw",
                "input_file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody17471990940413569967asRaw-2020-10-19_09.54.45-JvP1ssxKSu.json",
                "error_file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody17471990940413569967asRaw-2020-10-19_09.54.45-JvP1ssxKSu_errors.csv",
                "possible_issue_on_identifiers": false,
                "top_identifiers": {}
            },
            "error": {
                "message": "0 success, 4 failures\nSaved errors:\nNo profile id found while upserting a user profile Error id = 9d5016ea-6b7b-4c64-bc74-60ba207e3bed.\nNo profile id found while upserting a user profile Error id = 99f8d9bb-4c94-49ea-8bb2-934bc6056cac.\nNo profile id found while upserting a user profile Error id = d1216b0e-619c-4d92-9098-cc5ae4ac8e16.\nNo profile id found while upserting a user profile Error id = a92d3258-163c-4b9d-949e-94f9006cd77d.\n"
            },
            "id": "11170897",
            "status": "SUCCEEDED",
            "creation_date": 1603101286198,
            "start_date": 1603101317674,
            "duration": 1062,
            "organisation_id": "1426",
            "user_id": null,
            "cancel_status": null,
            "debug": null,
            "is_retryable": false,
            "permalink_uri": "MTowOjA6MzgyNjEyNA==",
            "num_tasks": 4,
            "completed_tasks": 0,
            "erroneous_tasks": 4,
            "retry_count": 0,
            "job_type": "DOCUMENT_IMPORT",
            "import_mode": "MANUAL_FILE",
            "import_type": null,
            "end_date": 1603101318736
        }
    ],
    "count": 2,
    "total": 2,
    "first_result": 0,
    "max_result": 50,
    "max_results": 50
}
```

{% endtab %}
{% endtabs %}

## Get an execution

<mark style="color:blue;">`GET`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports/:importId/executions/:executionId`

Gets a specific execution and retrieves useful data like its status, execution time and error messages.

#### Path Parameters

| Name                                          | Type    | Description                                                                                       |
| --------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------- |
| datamartId<mark style="color:red;">\*</mark>  | integer | The ID of the datamart                                                                            |
| importId<mark style="color:red;">\*</mark>    | integer | The ID of the document import                                                                     |
| executionId<mark style="color:red;">\*</mark> | integer | The ID of the execution (usually retrieved from "create execution" or "list executions" requests) |

{% tabs %}
{% tab title="200 " %}

```javascript
{
    "status": "ok",
    "data": {
        "parameters": {
            "datamart_id": 1609,
            "document_import_id": 19718,
            "mime_type": "APPLICATION_X_NDJSON",
            "document_type": "USER_PROFILE",
            "input_file_name": "requestBody9664967795462448677asRaw",
            "file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody9664967795462448677asRaw-2020-12-31_10.22.23-KzgivDim3y.json",
            "number_of_lines": 4,
            "segment_id": null
        },
        "result": {
            "total_success": 4,
            "total_failure": 0,
            "input_file_name": "requestBody9664967795462448677asRaw",
            "input_file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody9664967795462448677asRaw-2020-12-31_10.22.23-KzgivDim3y.json",
            "error_file_uri": "mics://data_file/tenants/1426/datamarts/1509/document_imports/19518/requestBody9664967795462448677asRaw-2020-12-31_10.22.23-KzgivDim3y_errors.csv",
            "possible_issue_on_identifiers": false,
            "top_identifiers": {}
        },
        "error": null,
        "id": "11597785",
        "status": "SUCCEEDED",
        "creation_date": 1609410143659,
        "start_date": 1609410150976,
        "duration": 3059,
        "organisation_id": "1426",
        "user_id": null,
        "cancel_status": null,
        "debug": null,
        "is_retryable": false,
        "permalink_uri": "MTowOjA6NDI1MzAxMg==",
        "num_tasks": 4,
        "completed_tasks": 4,
        "erroneous_tasks": 0,
        "retry_count": 0,
        "job_type": "DOCUMENT_IMPORT",
        "import_mode": "MANUAL_FILE",
        "import_type": null,
        "end_date": 1609410154035
    }
}
```

{% endtab %}
{% endtabs %}
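
This endpoint can be polled to wait for an import to finish. The sketch below is one possible approach. The helper names, `MICS_API_TOKEN` and the `POLL_INTERVAL` variable (30 seconds by default) are assumptions. It treats any status other than `PENDING` or `RUNNING` as final, since this page does not list every possible status. In the list executions sample on this page, one execution has the `SUCCEEDED` status with `total_failure` set to 4, so the sketch also shows how to read that counter.

```bash
# Sketch: poll an execution until it leaves the PENDING and RUNNING statuses.
# Helper names, MICS_API_TOKEN and POLL_INTERVAL are illustrative assumptions.

# Prints the execution status. The upper-case pattern skips the top-level
# "status": "ok", and the leading quote skips "cancel_status".
extract_execution_status() {
  sed -n 's/.*"status": *"\([A-Z_][A-Z_]*\)".*/\1/p' | head -n 1
}

# Prints result.total_failure when the response contains it.
extract_total_failure() {
  sed -n 's/.*"total_failure": *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

wait_for_execution() {
  local datamart_id="$1" import_id="$2" execution_id="$3"
  local status
  while true; do
    status=$(curl --silent \
      "https://api.mediarithmics.com/v1/datamarts/${datamart_id}/document_imports/${import_id}/executions/${execution_id}" \
      --header "Authorization: ${MICS_API_TOKEN}" | extract_execution_status)
    case "$status" in
      PENDING|RUNNING) sleep "${POLL_INTERVAL:-30}" ;;
      "") echo "could not read the execution status" >&2; return 1 ;;
      *) echo "$status"; return 0 ;;
    esac
  done
}

# Usage:
#   final_status=$(wait_for_execution <DATAMART_ID> <IMPORT_ID> <EXECUTION_ID>)
```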

## Cancel an execution

<mark style="color:green;">`POST`</mark> `https://api.mediarithmics.com/v1/datamarts/:datamartId/document_imports/:importId/executions/:executionId/action`

Cancels a specific execution.

#### Path Parameters

| Name                                          | Type   | Description                                                                                       |
| --------------------------------------------- | ------ | ------------------------------------------------------------------------------------------------- |
| datamartId<mark style="color:red;">\*</mark>  | string | The ID of the datamart                                                                            |
| importId<mark style="color:red;">\*</mark>    | string | The ID of the document import                                                                     |
| executionId<mark style="color:red;">\*</mark> | string | The ID of the execution (usually retrieved from "create execution" or "list executions" requests) |

#### Request Body

| Name                                   | Type | Description                    |
| -------------------------------------- | ---- | ------------------------------ |
| body<mark style="color:red;">\*</mark> | json | Must be: `{"action":"CANCEL"}` |

{% tabs %}
{% tab title="200: OK " %}

```javascript
{
  "status": "ok",
  "data": {
    "parameters": null,
    "result": null,
    "error": null,
    "id": "22747195",
    "status": "CANCELED",
    "creation_date": 1646060596034,
    "start_date": null,
    "duration": null,
    "organisation_id": "1581",
    "user_id": null,
    "cancel_status": "REQUESTED",
    "debug": null,
    "is_retryable": false,
    "community_id": "1581",
    "num_tasks": null,
    "completed_tasks": null,
    "erroneous_tasks": null,
    "retry_count": 0,
    "permalink_uri": null,
    "job_type": "DOCUMENT_IMPORT",
    "import_mode": "MANUAL_FILE",
    "import_type": null
  }
}
```

{% endtab %}
{% endtabs %}

An execution can only be cancelled while its status is `PENDING`.
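
A cancellation request could be sketched as follows. `cancel_execution` and `MICS_API_TOKEN` are hypothetical names, and the `Content-Type: application/json` header is an assumption based on the JSON body.

```bash
# Sketch: request the cancellation of an execution.
# Function name, MICS_API_TOKEN and the Content-Type header are assumptions.
cancel_execution() {
  local datamart_id="$1" import_id="$2" execution_id="$3"
  curl --request POST \
    "https://api.mediarithmics.com/v1/datamarts/${datamart_id}/document_imports/${import_id}/executions/${execution_id}/action" \
    --header "Authorization: ${MICS_API_TOKEN}" \
    --header "Content-Type: application/json" \
    --data '{"action":"CANCEL"}'
}

# Usage:
#   cancel_execution <DATAMART_ID> <IMPORT_ID> <EXECUTION_ID>
```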

## Splitting large files

If you need to import files larger than 100 MB, split them first and call the upload API once per resulting file.

You can split large files using the `split` shell command.

```bash
split -l <LINE_NUMBER> ./your/file/path
```
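
Splitting and uploading can be chained in one script. The sketch below splits by line count, so each command line stays intact, then uploads every chunk, which creates one execution per chunk. `upload_in_chunks` and `MICS_API_TOKEN` are hypothetical names, and the NDJSON content type is an assumption taken from the upload example on this page.

```bash
# Sketch: split a large file by lines and upload each chunk.
# Function name and MICS_API_TOKEN are illustrative assumptions.
upload_in_chunks() {
  local datamart_id="$1" import_id="$2" file="$3" lines_per_chunk="$4"
  local workdir chunk
  workdir=$(mktemp -d)
  # split writes chunk_aa, chunk_ab, ... with at most lines_per_chunk lines each.
  split -l "$lines_per_chunk" "$file" "${workdir}/chunk_"
  for chunk in "${workdir}"/chunk_*; do
    curl --silent --request POST \
      "https://api.mediarithmics.com/v1/datamarts/${datamart_id}/document_imports/${import_id}/executions" \
      --header "Content-Type: application/x-ndjson" \
      --header "Authorization: ${MICS_API_TOKEN}" \
      --data-binary "@${chunk}"
    echo   # newline between the JSON responses
  done
  rm -rf "$workdir"
}

# Usage:
#   upload_in_chunks <DATAMART_ID> <IMPORT_ID> ./your/file/path <LINE_NUMBER>
```

Choose the line count so that each chunk stays under the size limit.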

