Database: Tasks history

This document discusses the database, particularly the part of the database that works with the tasks history.

The main problem HOT TM solves is that it helps mappers to manage the mapping of a large area. From the mapper’s point of view, the solution of this problem is straightforward. Split the large area into smaller parts and let the mappers lock and unlock them to communicate what part of the area is being worked on.

HOT TM uses the Project and Task terminology to denote a large area with additional information like the description of the purpose of mapping, and its smaller parts. Each task has its state, e.g. locked or unlocked. (In the HOT TM code base the “state” is often called “status”.) However, a task can also be understand as a thing to do instead of a part of the area. To avoid the misunderstanding, we introduce the Action to denote a transition from one task state to another.

Using the terminology of Project, Task and Action introduced above, we can describe the common mapping workflow. First, Celestine creates new project with all tasks unlocked. Then, Monica, Michal, Marcel and Miriam each locks random task and map that part of the project. Miriam and Marcel are finished with their tasks, unlock them and lock another random tasks. We can see the figure representing the task states (rounded boxes) with corresponding actions (arrows):

digraph {
    rankdir="LR"
    node [shape="box" style="rounded"]

    unlocked
    locked

    unlocked -> locked [label="map"]
    locked -> unlocked [label="finish"]
}

In addition, Radek and Ramona comes into the mapping workflow. They find tasks that have been recently mapped and check that these tasks have been mapped properly. To improve the representation of the workflow, we give the meaning to the locked and unlocked states and we name the actions, as shown in the following figure:

digraph {
     rankdir="LR"
     node [shape="box" style="rounded"]

     lm [label="locked\nfor mapping"]
     lc [label="locked\nfor checking"]

     tm [label="unlocked\nto map"]
     tc [label="unlocked\nto check"]
     d [label="unlocked\ndone"]

     tm -> lm [label="map"]
     lm -> tc [label="finish"]
     tc -> lc [label="check"]
     lc -> d [label="good"]
     lc -> tm [label="bad"]

     {rank="source" ; tm}
     {rank="sink" ; d}
 }

Mapping workflow

The names of the actions is the first half of the database interface. The other half, the DTOs, identify the project, the task, the transition and the mapper. The DTOs should not be confused with the states.

Note that the terminology and the naming of task states only reassembles HOT TM naming – there are more states with slightly different names in HOT TM code.

The purpose of the tasks history

Keeping the task history helps Radek and Ramona understand the evolution of the task. When they do the bad action to move the task state back to unlocked to map, there is a new round of mapping and checking of the task with possibly new names of mappers. When there are new mappers, Radek and Ramona may be in the need of using bad action on the same task multiple times.

On the other hand, when task history is kept, Monica, Michal, Marcel and Miriam can see if it was them who failed to map the task properly.

Keeping the task history is interesting also from the statistics point of view – tasks having multiple mappers and a lot of bad actions are probably hard to map, or it is possible to track the time between lock and unlock actions.

When we are aware of the common mapping workflow, familiar with the Project, Task and Action terminology, and we understand the need for the tasks history, we can finally discuss the database interface and the database schema.

The database interface

The database interface is the best described by the hot_tm_proposal.database package:

HOT TM database abstraction.

This package abstracts the database. It contains the modules with the queries to store and retrieve the information from the PostGIS database (based on the PostgreSQL).

The database interface (procedure names and DTOs) that modules must implement is:

create_project(pid)

Create new project in the database identified by pid . pid is a number greater than zero.

map_random(pid, by)

Select random task of project pid and set its status to locked by. by is a string representing the mapper’s name. Return pid, tid, and geometry.

finish(pid, tid, by)

Set the status of task tid of the project pid to unlocked to check by. tid is a number greater than zero. Return pid and tid.

check_recent(pid, by)

Select the newest mapped task of pproject pid and set its status to locked by. Return pid, tid and geometry.

good(pid, tid, by)

Set the status of task tid of the project pid to unlocked done by. Return pid and tid.

bad(pid, tid, by)

Set the status of task tid of the project pid to unlocked to map by. Return pid and tid.

all_tasks_states

Return the list of dictionaries representing the state of the project’s tasks. The list contains dictionaries with project identifier pid, task identifier tid, and the current state of each particular task.

There are following modules in the database package:

like_tm_admin

Try to be as close as possible to the TM Admin. Not yet implemented.

almost_tm_admin

Use the database schema of the TM Admin, but different database queries.

actions_history

Keep read-only information about tasks separated from what happenned to the tasks.

The database package also contains db module, but the db module is used solely to keep the database pool and to provide helper procedures for the database access.

The database schema

There are tables common to all tested database schemas and particular tables that are implementation specific. There is always projects table with the following columns:

  • pid – project identifier, integer, primary key, automatically

  • geometry – area geometry (GeoJSON when no PostGIS extension)

  • description – text, the project description

The approach of Tasks history in HOT TM is to use tasks and task_history tables, but the implementation leads to the duplicated information and therefore increased complexity to keep the information synchronized.

Tasks history in TM Admin describes the approach where only tasks table is used and task history is concatenated as JSON list in the history column.

Both approaches understand the task and the task history as an object that can be considered to Create, Read, Update, Delete (CRUD). Task state is a property of the task being updated and the action is a property of the record in the task history. The approaches only differ in how the objects, task and task history, are stored in the database.

CRUD is not always the best way to work with objects, especially when the history is needed. This is because the update on the object ignores its history, so the history must be kept elsewhere. Instead of updating the task status and keeping the separated history synchronized, it is more convenient to keep the history and compute the status when needed.

In Actions history, we introduce the database schema with tasks and actions tables, where the database records are only created and read, but never updated. Action is first class database record and task state is a read-only database query.

Non-goals

This document does not cover the whole database schema or the whole database schema for the tasks. This document covers part of the database, limited to tasks and the transitions between the task states. It is expected there are similar documents covering other parts of the database as groups or campaigns, and that there is another document that puts all these parts together and introduces overall database schema.

This document does not cover task issues or annotations. The document aims solely on the main function of tasks within the HOT TM and keeping the tasks history.

This document does not introduce production-ready database interface or schema.

Tasks history in HOT TM

We do not provide test implementation for the tasks history that is comparable to the HOT TM current implementation. In general, our approach is to read the code, try to understand the idea behind the code and implement a restricted version to test that idea. We do not see the idea hidden in the current implementation of HOT TM.

tasks table has the following columns:

  • pid – project identifier, reference to projects table

  • tid – task identifier, tid and pid are together primary key in tasks table

  • geometry – task geometry (GeoJSON when no PostGIS extension)

  • status – string for simplicity, task’s current state

  • locked_by – string for simplicity, who did the last “lock” modification

  • mapped_by – string for simplicity, who did finish action

  • validated_by – string for simplicity, who did good action

  • task_history – relation to the task_history table

and task_history has the columns:

  • pid – project identifier, reference to projects table

  • tid – task identifier, reference to tasks table

  • hid – task history identifier

  • action_purpose – string, what action happened

  • action_text – string, note about the action

  • action_date – timestamp, when the action happened

  • action_by – string, the author of the action

Tasks history in TM Admin

TM Admin is a new approach for the complex solution of Tasking Manager-like projects. The database schema in this section follows the idea of storing the task history in the JSON column.

We provide two implementations: like_tm_admin and almost_tm_admin. Former tries to be as close as possible to the TM Admin, latter keeps the database schema but uses different database queries.

tasks table has the following columns:

  • pid – project identifier, reference to projects table

  • tid – task identifier, tid and pid are together primary key in tasks table

  • geometry – task geometry (GeoJSON when no PostGIS extension)

  • status – string for simplicity, task’s current state

  • locked_by – string for simplicity, who did the last “lock” modification

  • mapped_by – string for simplicity, who did finish action

  • validated_by – string for simplicity, who did good action

  • history – JSON (better or binary) list of action records where each action record contain purpose, text, date, by.

like_tm_admin

almost_tm_admin

Actions history

We introduce the database schema where the information about the tasks is only added to the database. The tasks stores the read-only information about tasks that do not change, like project identifier pid, task identifier tid and task border geometry. On the other hand, the actions table stores what happenned to the tasks.

tasks table has the following columns:

  • pid – project identifier, reference to projects table

  • tid – task identifier, tid and pid are together primary key in tasks table

  • geometry – task geometry (GeoJSON when no PostGIS extension)

and actions table has the columns:

  • pid – project identifier, reference to projects table

  • tid – task identifier, reference to tasks table

  • aid – action identifier

  • purpose – what action happened?

  • text – note about the action?

  • date – when the action happened?

  • by – the author of the action?

actions_history

Experiments

We use hot_tm_proposal.database.test module to perform the tests of the database module:


In this section, we describe what experiments we performed. We provide the source code of the experiments in the test module of the hot_tm_proposal.database Python3 package.

We compare Tasks history in TM Admin and Actions history in the following ways:

  1. We create the database with two projects, each having 1000 tasks and we measure the overall time to create the projects.

    Setting LONGER_TEST = True will create 10 projects.

  2. We measure the overall time for 100 task state changes of the single project: find random task to map and lock it for mapping, then immediately unlock it to check.

    Then, we measure the overall time for 1000 task state changes of the single project: find random task to map or check, lock it and then immediately unlock it to check.

    Setting LONGER_TEST = True will measure 5000 actions over all 10 projects randomly.

  3. We measure the overall time for retrieving the last status of each task of the area.

_images/db-test-1000.svg

Test two projects with 1000 requests to map or check

_images/db-test-5000.svg

Test ten projects with 5000 requests to map or check

Conclusion

From the results in Test two projects with 1000 requests to map or check and Test ten projects with 5000 requests to map or check, we can see that for the single mapper, the Tasks history in TM Admin that stores task history in JSON column of the tasks table, yields better results than Actions history. However, we need to be aware of that:

  • We did not try to retrieve all actions of a single project chronologically ordered, i.e., from the newest to the oldest.

    Doing this using Actions history is trivial read-only query. Doing this using Tasks history in TM Admin leads to read-only query to retrieve data (JSON lists) from all (jsonb) history columns, followed by merging the data (JSON lists) by the date contained in the dictionaries (actions) in the JSON lists.

  • We implemented the tested functionality, like map random task, as a single (serializable) query, but we have no idea what approach will be used within TM Admin.

  • We did not tested parallel accesses to the database, i.e. we do not know how these two approaches compare in the environment of many mappers; yet this needs to be tested.

  • We did not implemented the database design from Tasks history in HOT TM, because the relations between the tables were not clear enough after reading the source code of HOT TM. This would probably not influence the results.