Diffstat (limited to 'docs')
-rw-r--r--  docs/building.md                                    12
-rw-r--r--  docs/clientapi.md                                  129
-rw-r--r--  docs/design.md                                     104
-rw-r--r--  docs/designgoals.md                                 39
-rw-r--r--  docs/index.md                                       25
-rw-r--r--  docs/logging.md                                     27
-rw-r--r--  docs/queries.md                                    104
-rw-r--r--  docs/resource.md                                    59
-rw-r--r--  docs/sinksh.md (renamed from docs/akonadish.md)      0
-rw-r--r--  docs/storage.md                                     23
-rw-r--r--  docs/terminology.md                                  2
-rw-r--r--  docs/tradeoffs.md                                   36
12 files changed, 298 insertions, 262 deletions
diff --git a/docs/building.md b/docs/building.md
index 907827d..17ef54b 100644
--- a/docs/building.md
+++ b/docs/building.md
@@ -85,3 +85,15 @@ mkdir build && cd build
 cmake ..
 make install
 ```
+
+# Dependencies
+
+* ExtraCmakeModules >= 0.0.10
+* Qt >= 5.2
+* KF5::Async >= 0.1
+* flatbuffers >= 1.0
+* libgit2
+* readline
+
+## Maildir Resource
+* KF5::Mime
diff --git a/docs/clientapi.md b/docs/clientapi.md
index 219f972..be8ff19 100644
--- a/docs/clientapi.md
+++ b/docs/clientapi.md
@@ -13,16 +13,6 @@ The client API consists of:
 * property-level on-demand loading of data
 * streaming support for large properties (attachments)
 
-## Domain Types
-A set of standardized domain types is defined. This is necessary to decouple applications from resources (so a calendar can access events from all resources), and to have a "language" for queries.
-
-The definition of the domain model directly affects:
-
-* granularity for data retrieval (email property, or individual subject, date, ...)
-* queriable properties for filtering and sorting (sender, id, ...)
-
-The purpose of these domain types is strictly to be the interface and the types are not necessarily meant to be used by applications directly, or to be restricted by any other specifications (such as ical). By nature these types will be part of the evolving interface, and will need to be adjusted for every new property that an application must understand.
-
 ## Store Facade
 The store is always accessed through a store-specific facade, which hides:
 
@@ -52,118 +42,12 @@ Each modification is associated with a specific revision, which allows the synch
 ### Conflict Resolution
 Conflicts can occur at two points:
 
-* While i.e. an editor is open and we receive an update for the same entity
-* After a modification is sent to the synchronizer but before it's processed
+* In the client: while e.g. an editor is open and we receive an update for the same entity
+* In the synchronizer: after a modification has been sent to the synchronizer but before it's processed
 
 In the first case the client is responsible for resolving the conflict; in the latter case it's the synchronizer's responsibility.
 A small window exists where the client has already started the modification (i.e. the command is in the socket) and a notification that the same entity has changed has not yet arrived. In such a case the synchronizer may reject the modification because the revision the modification refers to is no longer available.
 
-This design allows the synchronizer to be in control of the revisions, and keeps it from having to wait for all clients to update until it can drop revisions.
-
-## Query System
-The query system should allow for efficient retrieval for just the amount of data required by the client. Efficient querying is supported by the indexes provided by the resources.
-
-The query always retrieves a set of entities matching the query, while not necessarily all properties of the entity need to be populated.
-
-Queries should are declarative to keep the specification simple and to allow the implementation to choose the most efficient execution.
-
-Queries can be kept open (live) to receive updates as the store changes.
-
-### Query
-The query consists of:
-
-* a set of filters to match the wanted entities
-* the set of properties to retrieve for each entity
-
-Queryable properties are defined by the [[Domain Types]] above.
-
-### Query Result
-The result is returned directly after running the query in form of a QAbstractItemModel. Each row in the model represents a matching entity.
-
-The model allows to access the domain object directly, or to access individual properties directly via the rows columns.
-
-The model is always populated asynchronously. It is therefore initially empty and will then populate itself gradually, through the regular update mechanisms (rowsInserted).
-
-Tree Queries allow the application to query for i.e. a folder hierarchy in a single query. This is necessary for performance reasons to avoid recursive querying in large hierarchies. To avoid on the other hand loading large hierchies directly into memory, the model only populates the toplevel rows automatically, all other rows need to be populated by calling `QAbstractItemModel::fetchMore(QModelIndex);`. This way the resource can deal efficiently with the query (i.e. by avoiding the many roundtrips that would be necessary with recursive queries), while keeping the amount of data in memory to a minimum (i.e. if the majority of the folder tree is collapsed in the view anyways). A tree result set can therefore be seen as a set of sets, where every subset corresponds to the children of one parent.
-
-If the query is live, the model updates itself if the update applies to one of the already loaded subsets (otherwise it's currently irrelevant and will load once the subset is loaded).
-
-#### Enhancements
-* Asynchronous loading of entities/properties can be achieved by returning an invalid QVariant initially, and emitting dataChanged once the value is loaded.
-* To avoid loading a large list when not all data is necessary, a batch size could be defined to guarantee for instance that there is sufficient data to fill the screen, and the fetchMore mechanism can be used to gradually load more data as required when scrolling in the application.
-
-#### Filter
-A filter consists of:
-
-* a property to filter on as defined by the [[Domain Types]]
-* a comparator to use
-* a value
-
-The available comparators are:
-
-* equal
-* greater than
-* less than
-* inclusive range
-
-Value types include:
-
-* Null
-* Bool
-* Regular Expression
-* Substring
-* A type-specific literal value (e.g. string, number, date, ..)
-
-Filters can be combined using AND, OR, NOT.
-
-#### Example
-```
-query = {
-    offset: int
-    limit: int
-    filter = {
-        and {
-            collection = foo
-            or {
-                resource = res1
-                resource = res2
-            }
-        }
-    }
-}
-```
-
-possible API:
-
-```
-query.filter().and().property("collection") = "foo"
-query.filter().and().or().property("resource") = "res1"
-query.filter().and().or().property("resource") = "res2"
-query.filter().and().property("start-date") = InclusiveRange(QDateTime, QDateTime)
-```
-
-The problem is that it is difficult to adjust an individual resource property like that.
-
-### Usecases ###
-Mail:
-
-* All mails in folder X within date-range Y that are unread.
-* All mails (in all folders) that contain the string X in property Y.
-
-Todos:
-
-* Give me all the todos in that collection where their RELATED-TO field maps to no other todo UID field in the collection
-* Give me all the todos in that collection where their RELATED-TO field has a given value
-* Give me all the collections which have a given collection as parent and which have a descendant matching a criteria on its attributes;
-
-Events:
-
-* All events of calendar X within date-range Y.
-
-Generic:
-* entity with identifier X
-* all entities of resource X
-
 ### Lazy Loading ###
 The system provides property-level lazy loading. This allows e.g. deferring the download of attachments until the attachment is accessed, at the expense of having to have access to the source (which could be connected via the internet).
 
@@ -173,12 +57,3 @@ Note: We should perhaps define a minimum set of properties that *must* be availa
 
 ### Data streaming ###
 Large properties such as attachments should be streamable. An API that allows retrieving a single property of a defined entity in a streamable fashion is probably enough.
-
-### Indexes ###
-Since only properties of the domain types can be queried, default implementations for commonly used indexes can be provided. These indexes are populated by generic preprocessors that use the domain-type interface to extract properties from individual entites.
-
-## Notifications ##
-A notification mechanism is required to inform clients about changes. Running queries will automatically update the result-set if a notification is received.
-
-Note: A notification could supply a hint on what changed, allowing clients to ignore revisions with irrelevant changes.
-A running query can do all of that transparently behind the scenes. Note that the hints should indeed only hint what has changed, and not supply the actual changeset. These hints should be tailored to what we see as useful, and must therefore be easy to modify.
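
Note: the streaming accessor asked for above does not exist yet; the following is a purely hypothetical sketch of its shape (none of these names are existing Sink API), only illustrating "retrieve a single property of a defined entity in a streamable fashion".

```cpp
#include <QByteArray>
#include <functional>

struct PropertyStreamObserver
{
    std::function<void(const QByteArray &chunk)> onChunk; // called per chunk
    std::function<void()> onFinished;                     // stream complete
    std::function<void(int code)> onError;                // transfer failed
};

// Assumed entry point, e.g.:
// store.streamProperty(entityId, "attachment", observer);
```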
diff --git a/docs/design.md b/docs/design.md
index 4451b49..2890450 100644
--- a/docs/design.md
+++ b/docs/design.md
@@ -1,101 +1,45 @@
-# Design Goals
-## Axioms
-1. Personal information is stored in multiple sources (address books, email stores, calendar files, ...)
-2. These sources may local, remote or a mix of local and remote
-
-## Requirements
-1. Local mirrors of these sources must be available to 1..N local clients simultaneously
-2. Local clients must be able to make (or at least request) changes to the data in the local mirrors
-3. Local mirrors must be usable without network, even if the source is remote
-4. Local mirrors must be able to syncronoize local changes to their sources (local or remote)
-5. Local mirrors must be able to syncronize remote changes and propagate those to local clients
-6. Content must be searchable by a number of terms (dates, identities, body text ...)
-7. This must all run with acceptable performance on a moderate consumer-grade desktop system
-
-Nice to haves:
-
-1. As-close-to-zero-copy-as-possible for data
-2. Simple change notification semantics
-3. Resource-specific syncronization techniques
-4. Data agnostic storage
-
-Immediate goals:
-
-1. Ease development of new features in existing resources
-2. Ease maintenance of existing resources
-3. Make adding new resources easy
-4. Make adding new types of data or data relations easy
-5. Improve performance relative to existing Akonadi implementation
-
-Long-term goals:
-
-1. Project view: given a query, show all items in all stores that match that query easily and quickly
-
-Implications of the above:
-
-* Local mirrors must support multi-reader, but are probably best served with single-writer semantics as this simplifies both local change recording as well as remote synchronization by keeping it in one process which can process write requests (local or remote) in sequential fashion.
-* There is no requirement for a central server if the readers can concurrently access the local mirror directly
-* A storage system which requires a schema (e.g. relational databases) are a poor fit given the desire for data agnosticism and low memory copying
-
 # Overview
 
-## Client API
-The client facing API hides all Sink internals from the applications and emulates a unified store that provides data through a standardized interface.
+Sink is a data access layer that additionally handles synchronization with external sources and indexing of data for efficient queries.
+
+## Store
+The client-facing Store API hides all Sink internals from the applications and emulates a unified store that provides data through a standardized interface.
 This allows applications to transparently use various data sources with various data source formats.
 
 ## Resource
 A resource is a plugin that provides access to an additional source. It consists of a store, a synchronizer process that executes synchronization & change replay to the source and maintains the store, as well as a facade plugin for the client api.
 
-## Store
+## Storage / Indexes
 Each resource maintains a store that can either store the full dataset for offline access or only metadata for quick lookups. Resources can define how data is stored.
+The store consists of revisions, with every revision containing one entity.
+
+The store additionally contains various secondary indexes for efficient lookups.
 
 ## Types
 ### Domain Type
-The domain types exposed in the public interface.
+The domain types exposed in the public interface provide standardized access to the store. The domain types and their properties directly define the granularity of data retrieval, and thus also what queries can be executed.
 
 ### Buffer Type
-The individual buffer types as specified by the resource. The are internal types that don't necessarily have a 1:1 mapping to the domain types, although that is the default case that the default implementations expect.
+The buffers used by the resources in the store may differ from resource to resource, and don't necessarily have a 1:1 mapping to the domain types.
+This allows resources to store data in a way that is convenient/efficient for synchronization, although it may require a bit more effort when accessing the data.
+The individual buffer types are specified by the resource and are internal to it. Default buffer types exist for all domain types.
+
+### Commands
+Commands are used to modify the store. The resource processes commands that are generated by clients and the synchronizer.
+
+### Notifications
+The resource emits notifications to inform clients of new revisions and other changes.
 
 ## Mechanisms
 ### Change Replay
 The change replay is based on the revisions in the store. Clients (as well as the write-back mechanism that replays changes to the source) are informed that a new revision is available. Each client can then go through all new revisions (starting from the last seen revision), and thus update its state to the latest revision.
 
-### Preprocessor pipeline
-Each resource has an internal pipeline of preprocessors that can be used for tasks such as indexing or filtering. The pipeline guarantees that the preprocessor steps are executed before the entity is persisted.
-
-# Tradeoffs/Design Decisions
-* Key-Value store instead of relational
-    * `+` Schemaless, easier to evolve
-    * `-` No need to fully normalize the data in order to make it queriable. And without full normalization SQL is not really useful and bad performance wise.
-    * `-` We need to maintain our own indexes
-
-* Individual store per resource
-    * Storage format defined by resource individually
-    * `-` Each resource needs to define it's own schema
-    * `+` Resources can adjust storage format to map well on what it has to synchronize
-    * `+` Synchronization state can directly be embedded into messages
-    * `+` Individual resources could switch to another store technology
-    * `+` Easier maintenance
-    * `+` Resource is only responsible for it's own store and doesn't accidentaly break another resources store
-    * `-` Inter`-`resource moves are both more complicated and more expensive from a client perspective
-    * `+` Inter`-`resource moves become simple additions and removals from a resource perspective
-    * `-` No system`-`wide unique id per message (only resource/id tuple identifies a message uniquely)
-    * `+` Stores can work fully concurrently (also for writing)
+### Synchronization
+The synchronizer executes a periodic synchronization that results in change commands to synchronize the store with the source.
+The change-replay mechanism is used to write changes that happened locally back to the source.
 
-* Indexes defined and maintained by resources
-    * `-` Relational queries accross resources are expensive (depending on the query perhaps not even feasible)
-    * `-` Each resource needs to define it's own set of indexes
-    * `+` Flexible design as it allows to change indexes on a per resource level
-    * `+` Indexes can be optimized towards resources main usecases
-    * `+` Indexes can be shared with the source (IMAP serverside threading)
+### Command processing
+The resources have an internal persistent command queue that is populated by the synchronizer and by clients, and that is processed continuously.
 
-* Shared domain types as common interface for client applications
-    * `-` yet another abstraction layer that requires translation to other layers and maintenance
-    * `+` decoupling of domain logic from data access
-    * `+` allows to evolve types according to needs (not coupled to specific application domain types)
+Each resource has an internal pipeline of preprocessors that can be used for tasks such as indexing or filtering, and through which every command goes before it enters the store. The pipeline guarantees that the preprocessor steps are executed on any command before the entity is persisted.
 
-# Risks
-* key-value store does not perform with large amounts of data
-* query performance is not sufficient
-* turnaround time for modifications is too high to feel responsive
-* design turns out similarly complex as Akonadi
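
Note: a small sketch of the change-replay contract described above — every consumer tracks the last revision it has seen and walks all newer revisions when notified. The names are illustrative, not Sink API.

```cpp
#include <cstdint>
#include <functional>

void replayChanges(uint64_t &lastSeenRevision, uint64_t latestRevision,
                   const std::function<void(uint64_t)> &applyRevision)
{
    // Each revision contains exactly one entity (see "Storage / Indexes"),
    // so applying revisions in order reproduces the latest state.
    while (lastSeenRevision < latestRevision) {
        ++lastSeenRevision;
        applyRevision(lastSeenRevision);
    }
}
```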
diff --git a/docs/designgoals.md b/docs/designgoals.md
new file mode 100644
index 0000000..4ffeeac
--- /dev/null
+++ b/docs/designgoals.md
@@ -0,0 +1,39 @@
+# Design Goals
+## Axioms
+1. Personal information is stored in multiple sources (address books, email stores, calendar files, ...)
+2. These sources may be local, remote, or a mix of local and remote
+
+## Requirements
+1. Local mirrors of these sources must be available to 1..N local clients simultaneously
+2. Local clients must be able to make (or at least request) changes to the data in the local mirrors
+3. Local mirrors must be usable without network, even if the source is remote
+4. Local mirrors must be able to synchronize local changes to their sources (local or remote)
+5. Local mirrors must be able to synchronize remote changes and propagate those to local clients
+6. Content must be searchable by a number of terms (dates, identities, body text ...)
+7. This must all run with acceptable performance on a moderate consumer-grade desktop system
+
+Nice to haves:
+
+1. As-close-to-zero-copy-as-possible for data
+2. Simple change notification semantics
+3. Resource-specific synchronization techniques
+4. Data-agnostic storage
+
+Immediate goals:
+
+1. Ease development of new features in existing resources
+2. Ease maintenance of existing resources
+3. Make adding new resources easy
+4. Make adding new types of data or data relations easy
+5. Improve performance relative to the existing Akonadi implementation
+
+Long-term goals:
+
+1. Project view: given a query, show all items in all stores that match that query easily and quickly
+
+Implications of the above:
+
+* Local mirrors must support multiple readers, but are probably best served with single-writer semantics, as this simplifies both local change recording and remote synchronization by keeping them in one process which can handle write requests (local or remote) sequentially.
+* There is no requirement for a central server if the readers can concurrently access the local mirror directly
+* A storage system which requires a schema (e.g. relational databases) is a poor fit given the desire for data agnosticism and low memory copying
+
diff --git a/docs/index.md b/docs/index.md
index 3019cfd..90d04b6 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,17 +1,18 @@
-# Index
-* Design
-    * Design Goals
-    * Overview
-    * Client API
-    * Storage
-    * Resource
-    * Facade
-    * Logging
-* Extending Akoandi Next
-    * Steps to add support for new types
-    * Steps for adding support for a type to a resource
+Sink is a data access layer handling synchronization, caching and indexing.
+
+Discussion of the code should be done on the kde-pim at kde.org mailing list
+or in #kontact on IRC.
+
+Note that all feature development should happen in feature branches, and that
+the mainline development branch is "develop". Master is for releases. It is
+recommended (though not required) to use the ["git flow" tools](https://github.com/nvie/gitflow) to make branched
+development easy (and easy for others to coordinate with).
+
+For further information on the project see the [KDE Phabricator instance](https://phabricator.kde.org/project/view/5/).
 
 # Documentation
 This documentation is built using [mkdocs.org](http://mkdocs.org).
 
 Use `mkdocs serve` to run a local webserver to view the docs.
+
+The documentation is also published at [http://api.kde.org/doc/sink/](http://api.kde.org/doc/sink/) and rebuilt nightly.
diff --git a/docs/logging.md b/docs/logging.md
index a495a7a..3d5ea61 100644
--- a/docs/logging.md
+++ b/docs/logging.md
@@ -10,13 +10,34 @@ For debugging purposes a logging framework is required. Simple qDebugs() proved
  * logfiles
  * a commandline monitor tool
  * some other developer tool
+This way we get complete logs even if a resource was not started from the console (e.g. because it was already running).
 
 ## Debug levels
-* trace: trace individual codepaths. Likely outputs way to much information for all normal cases and likely is only ever temporarily enabled. Trace points are likely only inserted into code fragments that are known to be problematic.
+* trace: traces individual codepaths. Likely outputs far too much information for normal cases and is only ever temporarily enabled for certain areas.
 * log: Comprehensive debug output. Enabled on demand
 * warning: Only warnings, should always be logged.
 * error: Critical messages that should never appear. Should always be logged.
 
+## Debug areas
+Debug areas split the code into sections that can be enabled/disabled as one.
+This is supposed to give finer-grained control over what is logged or displayed.
+
+Debug areas may align with classes, but don't have to; they should be defined in whatever way is most useful.
+
+Areas could be:
+
+* resource.sync.performance
+* resource.sync
+* resource.listener
+* resource.pipeline
+* resource.store
+* resource.communication
+* client.communication
+* client.communication.org.sink.resource.maildir.identifier1
+* client.queryrunner
+* client.queryrunner.performance
+* common.typeindex
+
 ## Collected information
 In addition to the regular message we want:
 
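
Note: one possible mapping of the areas above onto Qt's categorized logging, as a sketch. The mechanism (QLoggingCategory and filter rules) is stock Qt; Sink's actual logging macros may differ.

```cpp
#include <QLoggingCategory>

// Declare categories named after the debug areas listed above.
Q_LOGGING_CATEGORY(lcResourceSync, "resource.sync")
Q_LOGGING_CATEGORY(lcResourcePipeline, "resource.pipeline")

void configureLogging()
{
    // Enable/disable whole areas (and their subareas) at runtime.
    QLoggingCategory::setFilterRules(
        QStringLiteral("resource.sync.debug=true\n"
                       "resource.pipeline.debug=false"));
}

void duringSync()
{
    qCDebug(lcResourceSync) << "fetching folder list";    // "log" level
    qCWarning(lcResourcePipeline) << "slow preprocessor"; // "warning" level
}
```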
@@ -24,5 +45,5 @@ In addition to the regular message we want:
 * threadid?
 * timestamp
 * sourcefile + position + function name
-* application name / resource identfier
-* component identifier (i.e. resource access)
+* application name / resource identifier
+* area (e.g. resource access)
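
Note: most of the fields listed above map directly onto Qt's message pattern placeholders; a minimal sketch using the stock qSetMessagePattern (the exact pattern is only an example):

```cpp
#include <QtGlobal>
#include <QString>

void setupMessagePattern()
{
    // pid, threadid, timestamp, file/line/function, category ("area") and
    // the message itself are all standard placeholders.
    qSetMessagePattern(QStringLiteral(
        "[%{time yyyy-MM-dd h:mm:ss.zzz}] %{pid}/%{threadid} %{category} "
        "%{file}:%{line} %{function} - %{message}"));
}
```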
diff --git a/docs/queries.md b/docs/queries.md
new file mode 100644
index 0000000..8676392
--- /dev/null
+++ b/docs/queries.md
@@ -0,0 +1,104 @@
+## Query System
+The query system should allow for efficient retrieval of just the amount of data required by the client. Efficient querying is supported by the indexes provided by the resources.
+
+The query always retrieves a set of entities matching the query, though not all properties of each entity necessarily need to be populated.
+
+Queries are declarative to keep the specification simple and to allow the implementation to choose the most efficient execution.
+
+Queries can be kept open (live) to receive updates as the store changes.
+
+### Query
+The query consists of:
+
+* a set of filters to match the wanted entities
+* the set of properties to retrieve for each entity
+
+Queryable properties are defined by the [[Domain Types]].
+
+### Query Result
+The result is returned directly after running the query in the form of a QAbstractItemModel. Each row in the model represents a matching entity.
+
+The model allows accessing the domain object directly, or accessing individual properties via the row's columns.
+
+The model is always populated asynchronously. It is therefore initially empty and will then populate itself gradually, through the regular update mechanisms (rowsInserted).
+
+Tree queries allow the application to query e.g. for a folder hierarchy in a single query. This is necessary for performance reasons, to avoid recursive querying in large hierarchies. To avoid, on the other hand, loading large hierarchies directly into memory, the model only populates the toplevel rows automatically; all other rows need to be populated by calling `QAbstractItemModel::fetchMore(QModelIndex);`. This way the resource can deal with the query efficiently (i.e. by avoiding the many roundtrips that would be necessary with recursive queries), while keeping the amount of data in memory to a minimum (i.e. if the majority of the folder tree is collapsed in the view anyway). A tree result set can therefore be seen as a set of sets, where every subset corresponds to the children of one parent.
+
+If the query is live, the model updates itself if the update applies to one of the already loaded subsets (otherwise it's currently irrelevant and will load once the subset is loaded).
+
+#### Enhancements
+* Asynchronous loading of entities/properties can be achieved by returning an invalid QVariant initially, and emitting dataChanged once the value is loaded.
+* To avoid loading a large list when not all data is necessary, a batch size could be defined to guarantee, for instance, that there is sufficient data to fill the screen, and the fetchMore mechanism can be used to gradually load more data as required when scrolling in the application.
+
+#### Filter
+A filter consists of:
+
+* a property to filter on, as defined by the [[Domain Types]]
+* a comparator to use
+* a value
+
+The available comparators are:
+
+* equal
+* greater than
+* less than
+* inclusive range
+
+Value types include:
+
+* Null
+* Bool
+* Regular Expression
+* Substring
+* A type-specific literal value (e.g. string, number, date, ..)
+
+Filters can be combined using AND, OR, NOT.
+
+#### Example
+```
+query = {
+    offset: int
+    limit: int
+    filter = {
+        and {
+            collection = foo
+            or {
+                resource = res1
+                resource = res2
+            }
+        }
+    }
+}
+```
+
+possible API:
+
+```
+query.filter().and().property("collection") = "foo"
+query.filter().and().or().property("resource") = "res1"
+query.filter().and().or().property("resource") = "res2"
+query.filter().and().property("start-date") = InclusiveRange(QDateTime, QDateTime)
+```
+
+The problem is that it is difficult to adjust an individual resource property like that.
+
+### Usecases ###
+Mail:
+
+* All mails in folder X within date-range Y that are unread.
+* All mails (in all folders) that contain the string X in property Y.
+
+Todos:
+
+* Give me all the todos in that collection where their RELATED-TO field maps to no other todo UID field in the collection
+* Give me all the todos in that collection where their RELATED-TO field has a given value
+* Give me all the collections which have a given collection as parent and which have a descendant matching a criteria on its attributes
+
+Events:
+
+* All events of calendar X within date-range Y.
+
+Generic:
+* entity with identifier X
+* all entities of resource X
+
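
Note: a minimal sketch of how a client might consume such a result model. Only the QAbstractItemModel semantics described above are assumed (asynchronous population via rowsInserted, lazy subtree population via fetchMore); the function receiving the model is hypothetical.

```cpp
#include <QAbstractItemModel>
#include <QDebug>
#include <QObject>
#include <QSharedPointer>

void watchFolderTree(const QSharedPointer<QAbstractItemModel> &model)
{
    // The model is initially empty and fills in asynchronously.
    QObject::connect(model.data(), &QAbstractItemModel::rowsInserted,
        [model](const QModelIndex &parent, int first, int last) {
            for (int row = first; row <= last; row++) {
                const QModelIndex index = model->index(row, 0, parent);
                qDebug() << "new row:" << index.data(Qt::DisplayRole).toString();
                // Only toplevel rows are populated automatically; children
                // must be requested explicitly when the subtree is needed.
                if (model->canFetchMore(index)) {
                    model->fetchMore(index);
                }
            }
        });
}
```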
diff --git a/docs/resource.md b/docs/resource.md
index defbf9a..8c87522 100644
--- a/docs/resource.md
+++ b/docs/resource.md
@@ -4,7 +4,7 @@ The resource consists of:
 * a plugin providing the client-api facade
 * a configuration setting of the filters
 
-# Synchronizer
+## Synchronizer
 The synchronizer process is responsible for processing all commands, executing synchronizations with the source, and replaying changes to the source.
 
 Processing of commands happens in the pipeline, which executes all preprocessors before the entity is persisted.
@@ -16,7 +16,15 @@ The synchronizer process has the following primary components:
 * Listener: Opens a socket and listens for incoming connections. On connection all incoming commands are read and entered into command queues. Control commands (i.e. a sync) don't require persistence and are therefore processed directly.
 * Synchronization: Handles synchronization to the source, as well as change-replay to the source. The modification commands generated by the synchronization enter the command queue as well.
 
-# Preprocessors
+A resource can:
+
+* provide a full mirror of the source.
+* provide metadata for efficient access to the source.
+
+In the former case the local mirror is fully functional locally, and changes can be replayed to the source once a connection is established again.
+In the latter case the resource is only functional if a connection to the source is available (which is e.g. not a problem if the source is a local maildir on disk).
+
+## Preprocessors
 Preprocessors are small processors that are guaranteed to be processed before a new/modified/deleted entity reaches storage. They can therefore be used for various tasks that need to be executed on every entity.
 
 Usecases:
@@ -33,16 +41,29 @@ The following kinds of preprocessors exist:
 
 Preprocessors are typically read-only, e.g. so as not to break signatures of emails. Extra flags that are accessible through the sink domain model can therefore be stored in the local buffer of each resource.
 
-## Requirements
+### Requirements
 * A preprocessor must work with batch processing. Because batch-processing is vital for efficient writing to the database, all preprocessors have to be included in the batch processing.
 * Preprocessors need to be fast, since they directly affect how fast a message is processed by the system.
 
-## Design
+### Design
 Commands are processed in batches. Each preprocessor thus has the following workflow:
 * startBatch is called: The preprocessor can do necessary preparation steps to prepare for the batch (like starting a transaction on an external database)
 * add/modify/remove is called for every command in the batch: The preprocessor executes the desired actions.
 * endBatch is called: If the preprocessor wrote to an external database it can now commit the transaction.
 
+### Generic Preprocessors
+Most preprocessors will likely be used by several resources, and are either completely generic or domain specific (such as only for mail).
+It is therefore desirable to have default implementations for common preprocessors that are ready to be plugged in.
+
+The domain type adaptors provide a generic interface to access most properties of the entities, on top of which generic preprocessors can be implemented.
+That way it is trivial to e.g. implement a preprocessor that populates a hierarchy index of collections.
+
+### Preprocessors generating additional entities
+A preprocessor, such as an email threading preprocessor, might generate additional entities (a thread entity is a regular entity, just like the mail that spawned the thread).
+
+In such a case the preprocessor must invoke the complete pipeline for the new entity.
+
+
 ## Indexes
 Most indexes are implemented as preprocessors to guarantee that they are always updated together with the data.
 
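
Note: a minimal sketch of the batch contract described under "Design"; the interface shape and names are illustrative, not Sink's actual preprocessor API.

```cpp
#include <QByteArray>

class Preprocessor
{
public:
    virtual ~Preprocessor() = default;
    // Prepare for the batch, e.g. start a transaction on an external database.
    virtual void startBatch() {}
    // Called once per command in the batch.
    virtual void add(const QByteArray &newEntity) = 0;
    virtual void modify(const QByteArray &oldEntity, const QByteArray &newEntity) = 0;
    virtual void remove(const QByteArray &oldEntity) = 0;
    // Commit whatever startBatch() opened.
    virtual void endBatch() {}
};
```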
@@ -65,6 +86,9 @@ Index types:
 * sort indexes (i.e. sorted by date)
     * Could also be a lookup in the range index (increase date range until sufficient matches are available)
 
+### Default implementations
+Since only properties of the domain types can be queried, default implementations for commonly used indexes can be provided. These indexes are populated by generic preprocessors that use the domain-type interface to extract properties from individual entities.
+
 ### Example index implementations
 * uid lookup
     * add:
@@ -106,25 +130,14 @@ Building the index on-demand is a matter of replaying the relevant dataset and u
 
 The index's status information can be recorded using the latest revision the index has been updated with.
 
-## Generic Preprocessors
-Most preprocessors will likely be used by several resources, and are either completely generic, or domain specific (such as only for mail).
-It is therefore desirable to have default implementations for common preprocessors that are ready to be plugged in.
-
-The domain type adaptors provide a generic interface to access most properties of the entities, on top of which generic preprocessors can be implemented.
-It is that way trivial to i.e. implement a preprocessor that populates a hierarchy index of collections.
-
-## Preprocessors generating additional entities
-A preprocessor, such as an email threading preprocessors, might generate additional entities (A thread entity is a regular entity, just like the mail that spawned the thread).
-
-In such a case the preprocessor must invoke the complete pipeline for the new entity.
-
 # Pipeline
 A pipeline is an assembly of a set of preprocessors with a defined order. A modification is always persisted at the end of the pipeline once all preprocessors have been processed.
 
-# Synchronization / Change Replay
-* The synchronization can either:
-    * Generate a full diff directly on top of the db. The diffing process can work against a single revision/snapshot (using transactions). It then generates a necessary changeset for the store.
-    * If the source supports incremental changes the changeset can directly be generated from that information.
+# Synchronization
+The synchronization can either:
+
+* Generate a full diff directly on top of the db. The diffing process can work against a single revision/snapshot (using transactions). It then generates the necessary changeset for the store.
+* If the source supports incremental changes, the changeset can be generated directly from that information.
 
 The changeset is then simply inserted into the regular modification queue and processed like all other modifications. The synchronizer has to ensure that only changes that didn't already come from the source are replayed to it. This is done by marking changes that don't require change replay to the source.
 
@@ -142,8 +155,12 @@ The remoteid mapping has to be updated in two places:
 * New entities that are synchronized immediately get a localid assigned, which is then recorded together with the remoteid. This is required to be able to reference other entities directly in the command queue (i.e. for parent folders).
 * Entities created by clients get a remoteid assigned during change replay, so the entity can be recognized during the next sync.
 
+## Change Replay
+To replay local changes to the source, the synchronizer replays all revisions of the store and maintains the current replay state in the synchronization store.
+Changes that already came from the source via the synchronizer are not replayed to the source again.
+
 # Testing / Inspection
-Resources new to be tested, which often requires inspections into the current state of the resource. This is difficult in an asynchronous system where the whole backend logic is encapsulated in a separate process without running tests in a vastly different setup from how it will be run in production.
+Resources have to be tested, which often requires inspections into the current state of the resource. This is difficult in an asynchronous system where the whole backend logic is encapsulated in a separate process, without running tests in a vastly different setup from how it will be run in production.
 
 To alleviate this, inspection commands are introduced. Inspection commands are special commands that the resource processes just like all other commands, and that have the sole purpose of inspecting the current resource state. Because the command is processed with the same mechanism as other commands, we can rely on the ordering of commands in a way that a prior command is guaranteed to have been executed once the inspection command is processed.
 
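
Note: a pseudocode-level sketch of the "full diff" strategy described above — compare a source listing against a store snapshot (one revision, via a transaction) and emit create/modify/remove commands into the regular modification queue. All names here are illustrative.

```cpp
#include <QByteArray>
#include <QHash>
#include <functional>

struct SyncCommands // assumed interface into the command queue
{
    std::function<void(const QByteArray &remoteId)> create, modify, remove;
};

void fullDiffSync(const QHash<QByteArray, QByteArray> &source,   // remoteid -> content hash
                  const QHash<QByteArray, QByteArray> &snapshot, // remoteid -> content hash
                  const SyncCommands &commands)
{
    for (auto it = source.constBegin(); it != source.constEnd(); ++it) {
        if (!snapshot.contains(it.key())) {
            commands.create(it.key());           // new on the source
        } else if (snapshot.value(it.key()) != it.value()) {
            commands.modify(it.key());           // changed on the source
        }
    }
    for (auto it = snapshot.constBegin(); it != snapshot.constEnd(); ++it) {
        if (!source.contains(it.key())) {
            commands.remove(it.key());           // gone from the source
        }
    }
}
```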
diff --git a/docs/akonadish.md b/docs/sinksh.md
index 9884169..9884169 100644
--- a/docs/akonadish.md
+++ b/docs/sinksh.md
diff --git a/docs/storage.md b/docs/storage.md
index 4852131..afd55d8 100644
--- a/docs/storage.md
+++ b/docs/storage.md
@@ -1,17 +1,3 @@
-## Store access
-Access to the entities happens through a well defined interface that defines a property-map for each supported domain type. A property map could look like:
-```
-Event {
-    startDate: QDateTime
-    subject: QString
-    ...
-}
-```
-
-This property map can be freely extended with new properties for various features. It shouldn't adhere to any external specification and exists solely to define how to access the data.
-
-Clients will map these properties to the values of their domain object implementations, and resources will map the properties to the values in their buffers.
-
 ## Storage Model
 The storage model is simple:
 ```
@@ -42,8 +28,7 @@ Each entity can be as normalized/denormalized as useful. It is not necessary to
 
 Denormalized:
 
-* priority is that mime message stays intact (signatures/encryption)
-* could we still provide a streaming api for attachments?
+* priority is that the mime message stays intact (signatures/encryption)
 
 ```
 Mail {
@@ -55,7 +40,7 @@ Mail {
 Normalized:
 
 * priority is that we can access individual members efficiently.
-* we don't care about exact reproducability of e.g. ical file
+* we don't care about exact reproducibility of e.g. an ical file
 ```
 Event {
     id
@@ -101,7 +86,7 @@ The resource can be effectively removed from disk (besides configuration),
 by deleting the directories matching `$RESOURCE_IDENTIFIER*` and everything they contain.
 
 #### Design Considerations
-* The stores are split by buffertype, so a full scan (which is done by type), doesn't require filtering by type first. The downside is that an additional lookup is required to get from revision to the data.
+The stores are split by buffer type, so a full scan (which is done by type) doesn't require filtering by type first. The downside is that an additional lookup is required to get from the revision to the data.
 
 ### Revisions
 Every operation (create/delete/modify) leads to a new revision. The revision is an ever-increasing number for the complete store.
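
Note: an illustrative sketch only — one ever-increasing revision counter for the complete store, where each revision points at exactly one entity. The key layout is assumed, not Sink's actual schema.

```cpp
#include <QByteArray>
#include <QMap>
#include <cstdint>

struct RevisionedStore
{
    uint64_t maxRevision = 0;
    QMap<uint64_t, QByteArray> revisions;  // revision -> entity uid
    QMap<QByteArray, QByteArray> entities; // "uid:revision" -> serialized buffer

    // Create/modify/delete all record a new revision.
    void record(const QByteArray &uid, const QByteArray &buffer)
    {
        const uint64_t revision = ++maxRevision;
        revisions.insert(revision, uid);
        entities.insert(uid + ':' + QByteArray::number(qulonglong(revision)), buffer);
    }
};
```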
@@ -167,6 +152,8 @@ Using regular files as the interface has the advantages:
 The copy is necessary to guarantee that the file remains for the client/resource even if the resource removes the file on its side as part of a sync.
 The copy could be optimized by using hardlinks, which is not a portable solution though. For some next-gen copy-on-write filesystems copying is a very cheap operation.
 
+A downside of the file-based design is that it's not possible to stream directly from a remote resource into application memory; it always has to go via a file.
+
 ## Database choice
 By design we're interested in key-value stores or perhaps document databases. This is because a fixed schema is not useful for this design, which makes
 SQL not very useful (it would just be a very slow key-value store). While document databases would allow for indexes on certain properties (which is something we need), we have not yet found any contenders that look like they would be useful for this system.
diff --git a/docs/terminology.md b/docs/terminology.md
index 1826bec..5238c79 100644
--- a/docs/terminology.md
+++ b/docs/terminology.md
@@ -13,7 +13,7 @@ It is recommended to familiarize yourself with the terms before going further in
 * resource: A plugin which provides client command processing, a store facade and synchronization for a given type of store. The resource also manages the configuration for a given source including server settings, local paths, etc.
 * store facade: An object provided by resources which provides transformations between domain objects and the store.
 * synchronizer: The operating system process responsible for overseeing the process of modifying and synchronizing a store. To accomplish this, a synchronizer loads the correct resource plugin, manages pipelines and handles client communication. One synchronizer is created for each source that is accessed by clients; these processes are shared by all clients.
-* Preprocessor: A component that takes an entity and performs some modification of it (e.g. changes the folder an email is in) or processes it in some way (e.g. indexes it)
+* preprocessor: A component that takes an entity and performs some modification of it (e.g. changes the folder an email is in) or processes it in some way (e.g. indexes it)
 * pipeline: A run-time definable set of filters which are applied to an entity after a resource has performed a specific kind of function on it (create, modify, delete)
 * query: A declarative method for requesting entities from one or more sources that match a given set of constraints
 * command: Clients request modifications, additions and deletions to the store by sending commands to a synchronizer for processing
diff --git a/docs/tradeoffs.md b/docs/tradeoffs.md
new file mode 100644
index 0000000..d0e32c1
--- /dev/null
+++ b/docs/tradeoffs.md
@@ -0,0 +1,36 @@
+# Tradeoffs/Design Decisions
+* Key-value store instead of relational
+    * `+` Schemaless, easier to evolve
+    * `+` No need to fully normalize the data in order to make it queryable. And without full normalization SQL is not really useful and performs badly.
+    * `-` We need to maintain our own indexes
+
+* Individual store per resource
+    * Storage format defined by each resource individually
+    * `-` Each resource needs to define its own schema
+    * `+` Resources can adjust the storage format to map well onto what they have to synchronize
+    * `+` Synchronization state can be embedded directly into messages
+    * `+` Individual resources could switch to another store technology
+    * `+` Easier maintenance
+    * `+` A resource is only responsible for its own store and doesn't accidentally break another resource's store
+    * `-` Inter-resource moves are both more complicated and more expensive from a client perspective
+    * `+` Inter-resource moves become simple additions and removals from a resource perspective
+    * `-` No system-wide unique id per message (only the resource/id tuple identifies a message uniquely)
+    * `+` Stores can work fully concurrently (also for writing)
+
+* Indexes defined and maintained by resources
+    * `-` Relational queries across resources are expensive (depending on the query, perhaps not even feasible)
+    * `-` Each resource needs to define its own set of indexes
+    * `+` Flexible design, as it allows indexes to be changed on a per-resource level
+    * `+` Indexes can be optimized towards a resource's main use cases
+    * `+` Indexes can be shared with the source (IMAP serverside threading)
+
+* Shared domain types as common interface for client applications
+    * `-` Yet another abstraction layer that requires translation to other layers and maintenance
+    * `+` Decoupling of domain logic from data access
+    * `+` Allows types to evolve according to needs (not coupled to specific application domain types)
+
+# Risks
+* The key-value store does not perform well with large amounts of data
+* Query performance is not sufficient
+* Turnaround time for modifications is too high to feel responsive
+* The design turns out to be as complex as Akonadi