Bulk Split Data with the API

When you upload a data file that has more complex data than a CSV can naturally represent, the split operation can be useful.

Note: In this article, rows are referred to as "units".

Any column of your uploaded CSV may hold several values separated by a delimiter character (by default, a blank space). Calling split on such a column tells Appen to treat its contents as a collection of discrete items rather than as a single block of text.

Note: You are strongly advised to post complex data formats to Appen using JSON rather than CSV, because JSON is better suited to complex data. The split operation is provided as a convenience for existing datasets and is not recommended for new users.

 

Split

Method: GET
Endpoint: /jobs/{job_id}/units/split
Parameters:
  • on: A comma-delimited list of columns to be split.
  • with: The internal delimiter for the column. Default is the space character (" ").

Note: "Unit" refers to a row. The term "unit" has been removed from the UI.

Suppose your existing dataset is an arbitrary collection of major authors.

author,major_works,countries_active
Homer,The Iliad|The Odyssey,Greece
Dickens,David Copperfield|Bleak House,England
Nabokov,Camera Obscura|Lolita,Russia|United States
Rabelais,Gargantua and Pantagruel,France
Cervantes,Don Quixote,Spain

When this data is posted as a CSV to Appen, one row is created for each of the five rows of data, and each row carries the values from the three CSV columns. When initially posted, Appen treats every value as free text with no depth or structure. After the initial data post, Dickens' major_works field is set to the single string David Copperfield|Bleak House.

To let Appen know that the major_works and countries_active columns are each actually collections of delimited values, you can use the split operation.

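curl -X PUT --data-urlencode "key={api_key}" "https://api.figure-eight.com/v1/jobs/{job_id}/units/split?on=major_works,countries_active&with=%7C"

Note: Be careful to URL-encode the parameters (here the | delimiter is encoded as %7C) and to quote the URL so that the shell does not interpret the & character.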
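
One way to avoid hand-encoding mistakes is to build the query string programmatically. The following is a minimal sketch in Python using only the standard library. The helper name build_split_url and the job ID 12345 are illustrative assumptions, and the base URL is simply the one used in the curl example above.

```python
from urllib.parse import urlencode

# Base URL copied from the curl example; treat it as an assumption for your account.
BASE_URL = "https://api.figure-eight.com/v1"


def build_split_url(job_id, columns, delimiter=" "):
    """Return the split endpoint URL with URL-encoded `on` and `with` parameters.

    build_split_url is a hypothetical helper, not part of any Appen client library.
    """
    # urlencode percent-encodes reserved characters such as "|" and ",".
    query = urlencode({"on": ",".join(columns), "with": delimiter})
    return f"{BASE_URL}/jobs/{job_id}/units/split?{query}"


# 12345 is a placeholder job ID.
url = build_split_url("12345", ["major_works", "countries_active"], "|")
print(url)
```

The resulting URL can be passed to any HTTP client. The API key still has to be sent with the request, as in the curl command.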

After the PUT, Appen will consider Dickens' major_works field to be set to the collection [ "David Copperfield", "Bleak House" ]. Similarly, Nabokov's countries_active field will be set to [ "Russia", "United States" ]. The brackets indicate a data structure that is analogous to a List or Vector in Java, a list in Python, an Array in Ruby, etc. If you were to request Homer's major_works from Appen, it would be returned as a JSON array:

{"major_works": ["The Iliad", "The Odyssey"]}

Because the author field was not split, it will not be treated as a collection:

{"author": "Homer"}
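
To make the effect of the split concrete, the transformation can be mimicked locally. The sketch below only illustrates the resulting data shape and is not Appen's implementation. The function name split_columns is hypothetical, and turning a value that contains no delimiter into a one-item list is an assumption rather than documented behavior.

```python
import csv
import io
import json

# Two rows of the sample dataset from this article.
SAMPLE_CSV = """author,major_works,countries_active
Homer,The Iliad|The Odyssey,Greece
Nabokov,Camera Obscura|Lolita,Russia|United States
"""


def split_columns(rows, on, delimiter=" "):
    """Return copies of rows with the columns in `on` split into lists.

    Hypothetical local stand-in for the split operation; columns not
    listed in `on` are left as plain strings.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        for column in on:
            # Assumption: a value without the delimiter becomes a one-item list.
            new_row[column] = row[column].split(delimiter)
        result.append(new_row)
    return result


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
split_rows = split_columns(rows, on=["major_works", "countries_active"], delimiter="|")
print(json.dumps(split_rows[0], indent=2))
```

Here the author column stays a plain string because it is not named in on, which mirrors the difference between the two JSON responses above.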
