-rw-r--r--  .gitignore        |   1
-rw-r--r--  docs/clientapi.md | 213
-rw-r--r--  docs/design.md    |  98
-rw-r--r--  docs/index.md     |  17
-rw-r--r--  docs/logging.md   |  26
-rw-r--r--  docs/resource.md  |  42
-rw-r--r--  docs/storage.md   | 208
-rw-r--r--  mkdocs.yml        |   1
8 files changed, 606 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
 *.swp
 *.kdev4
+site
diff --git a/docs/clientapi.md b/docs/clientapi.md
new file mode 100644
index 0000000..a2ac18e
--- /dev/null
+++ b/docs/clientapi.md
@@ -0,0 +1,213 @@
The client API consists of:

* a modification API for messages (Create/Modify/Delete)
* a query API to retrieve messages
* a resource facade to abstract the resource implementation details
* a set of standardized domain types
* a notification mechanism to be notified about changes from individual stores

## Requirements/Design goals
* zero-copy should be possible (mmap support)
    * likely only possible up to the application domain boundary until we rewrite portions of the applications
    * most importantly we should hide how the data is stored (in parts, or in one mmapped buffer)
    * support for mmapped buffers implies that we keep track of the lifetime of the loaded values
* property-level on-demand loading
* streaming support for certain properties (attachments)

## Domain Types
A set of standardized domain types is defined. This is necessary to decouple applications from resources (so a calendar can access events from all resources), and to have a common "language" for queries.

The definition of the domain model directly affects:

* the granularity of data retrieval (the whole email property, or the individual subject, date, ...)
* the queryable properties (sender, id, ...)
* the properties used for sorting (the 10 latest emails)

The purpose of these domain types is strictly to be the interface. The types are not meant to be used by applications directly, nor to be restricted by any other specification (such as iCal). By nature these types will be part of the evolving interface, and will need to be adjusted for every new property that an application must understand.

### Akonadi Domain Types
This is a proposed set of types that we will need to evolve into what we actually require. Hierarchical types are required to be able to query for a result set of mixed types.

Items:

* Item
    * Incidence
        * Event
        * Todo
        * Journal
        * Freebusy
    * Note
* Contact

Collections:

* Collection
    * Mail Folder
    * Calendar
    * Tasklist
    * Journal
    * Contact Group
    * Address Book

Relations:

* Relation
    * Tag

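To illustrate why the hierarchy matters for mixed-type result sets, here is a minimal C++ sketch; the class names and members are assumptions for illustration, not the actual Akonadi Next types:

```
#include <QByteArray>
#include <QList>
#include <QSharedPointer>

// Hypothetical hierarchy mirroring the list above.
struct Item {
    QByteArray id;
    virtual ~Item() {}
};
struct Incidence : public Item {};
struct Event : public Incidence {};
struct Todo : public Incidence {};
struct Note : public Item {};

// Because Event and Todo share the Incidence base, a single query for
// "Incidence" can return a result set of mixed concrete types.
using ResultSet = QList<QSharedPointer<Item>>;
```
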
## Store Facade
The store is always accessed through a store-specific facade, which hides:

* store access (one store could use a database, and another one plain files)
* message type (flatbuffers, ...)
* indexes
* synchronizer communication
* notifications

This abstraction layer allows each resource to separately define how data is stored and retrieved, so tradeoffs can be chosen to suit the expected access patterns or the structure of the source data. Further, it allows individual resources to choose different technologies as suitable. Logic can still be shared among resources, while keeping the maintenance effort reasonable, by providing default implementations that are suitable for most workloads.

Because the facade also implements querying of indexes, a resource may use server-side searching to fulfill the query, and fall back to local searches when the server is not available.

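A minimal sketch of what such a facade interface could look like; the method names are assumptions, not the concrete Akonadi Next API:

```
#include <functional>

class Query;  // declarative query, see "Query System" below

// One facade per resource and domain type. Implementations hide storage
// access, buffer format, index lookups and synchronizer communication.
template <typename DomainType>
class StoreFacade {
public:
    virtual ~StoreFacade() {}
    virtual void create(const DomainType &domainObject) = 0;
    virtual void modify(const DomainType &domainObject) = 0;
    virtual void remove(const DomainType &domainObject) = 0;
    // May be served from local indexes or by server-side search.
    virtual void load(const Query &query,
                      const std::function<void(const DomainType &)> &resultCallback) = 0;
};
```
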
## Modifications
Modifications are stored by the client sending modification commands to the synchronizer. The synchronizer is responsible for ensuring that modifications are not lost and are eventually persisted. A small window therefore exists, while a modification is being transferred to the synchronizer, in which a modification can get lost.

The API consists of the following calls:

* create(domainObject, resource)
* modify(domainObject, resource)
* remove(domainObject, resource)

The changeset can be recorded by the domain object adapter while the properties are set, and is then sent to the synchronizer once modify is called.

Each modification is associated with a specific revision, which allows the synchronizer to do automatic conflict resolution.

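A sketch of how an adapter could record the changeset as properties are set; all names here are illustrative assumptions:

```
#include <QByteArray>
#include <QSet>
#include <QString>
#include <QtGlobal>

// Records which properties were touched, so modify() only needs to send
// the changeset plus the base revision to the synchronizer.
class EventAdaptor {
public:
    void setSubject(const QString &subject) {
        m_subject = subject;
        m_changedProperties.insert("subject");
    }
    QSet<QByteArray> changedProperties() const { return m_changedProperties; }
    qint64 baseRevision() const { return m_baseRevision; }
private:
    QString m_subject;
    QSet<QByteArray> m_changedProperties;
    qint64 m_baseRevision = 0;
};
```
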
### Conflict Resolution
Conflicts can occur at two points in the client:

* while e.g. an editor is open and we receive an update for the same entity
* after a modification has been sent to the synchronizer but before it's processed

In the first case the client is responsible for resolving the conflict; in the latter case it's the synchronizer's responsibility.
A small window exists where the client has already started the modification (i.e. the command is in the socket), and a notification that the same entity has been changed has not yet arrived. In such a case the synchronizer may reject the modification because the revision the modification refers to is no longer available.

This design allows the synchronizer to be in control of the revisions, and keeps it from having to wait for all clients to update before it can drop revisions.

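The synchronizer-side check could look roughly like the following sketch; the store interface and the merge step are assumptions for illustration:

```
#include <QtGlobal>

// All names in this sketch are assumptions, not the actual implementation.
struct Modification { qint64 baseRevision; /* entity id, changeset, ... */ };

struct Store {
    bool hasRevision(qint64 revision) const;
    qint64 currentRevision() const;
    bool merge(const Modification &modification);  // automatic resolution
    void apply(const Modification &modification);
};

enum class Outcome { Applied, Merged, Rejected };

Outcome processModification(Store &store, const Modification &mod)
{
    if (!store.hasRevision(mod.baseRevision)) {
        return Outcome::Rejected;  // referenced revision already dropped
    }
    if (store.currentRevision() != mod.baseRevision) {
        // The entity changed since the client read it: try to merge.
        return store.merge(mod) ? Outcome::Merged : Outcome::Rejected;
    }
    store.apply(mod);
    return Outcome::Applied;
}
```
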
## Query System
The query system should allow for efficient retrieval of just the amount of data required by the client. Efficient querying is supported by the indexes provided by the resources.

A query always retrieves a set of entities matching the query, while not necessarily all properties of each entity need to be populated.

Queries should be declarative, to keep the specification simple and to allow the implementation to choose the most efficient execution.

Queries can be kept open to receive updates as the store changes, and can be modified to adjust the result set.

### Query
The query consists of:

* a declarative set of filters to match the wanted entities
* the set of properties to retrieve for each entity
* a limit for the number of entities to retrieve
* an offset to retrieve more entities

Queryable properties are defined by the [[Domain Types]] above.

Other requirements:

* modifiable: to facilitate adjustments, such as changing the date-range while scrolling in the mail view
* serializable: to persist queries, e.g. to store a "smart folder" query in a config file

#### Filter
A filter consists of:

* a property to filter on, as defined by the [[Domain Types]]
* a comparator to use
* a value

The available comparators are:

* equal
* greater than
* less than
* inclusive range

Value types include:

* Null
* Bool
* Regular Expression
* Substring
* A type-specific literal value (e.g. string, number, date, ...)

Filters can be combined using AND, OR, NOT.

#### Example
```
query = {
    offset: int
    limit: int
    filter = {
        and {
            collection = foo
            or {
                resource = res1
                resource = res2
            }
        }
    }
}
```

Possible API:

```
query.filter().and().property("collection") = "foo"
query.filter().and().or().property("resource") = "res1"
query.filter().and().or().property("resource") = "res2"
query.filter().and().property("start-date") = InclusiveRange(QDateTime, QDateTime)
```

The problem is that it is difficult to adjust an individual resource property like that.

### Usecases ###
Mail:

* All mails in folder X within date-range Y that are unread.
* All mails (in all folders) that contain the string X in property Y.

Todos:

* Give me all the todos in that collection where their RELATED-TO field maps to no other todo UID field in the collection
* Give me all the todos in that collection where their RELATED-TO field has a given value
* Give me all the collections which have a given collection as parent and which have a descendant matching a criteria on its attributes

Events:

* All events of calendar X within date-range Y.

Generic:

* entity with identifier X
* all entities of resource X

### Lazy Loading ###
The system provides property-level lazy loading. This allows, for instance, the download of attachments to be deferred until an attachment is accessed, at the expense of requiring access to the source (which may only be reachable over the internet).

To achieve this, the query system must check for the availability of all requested properties on all matched entities. If a property is not available, a command is sent to the synchronizer to retrieve said properties. Once all properties are available the query can complete.

Note: We should perhaps define a minimum set of properties that *must* be available. Otherwise local search will not work. On the other hand, if a resource implements server-side search, it may not care whether local search works.

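The availability check could be sketched as follows; the entity interface and the id:property encoding are assumptions, not the actual implementation:

```
#include <QByteArray>
#include <QSet>
#include <QVector>

// Assumed minimal entity interface for this sketch.
struct Entity {
    QByteArray id;
    bool hasProperty(const QByteArray &name) const;
};

// Collect every requested property that is not yet locally available;
// anything missing is requested from the synchronizer before the query
// completes.
QVector<QByteArray> missingProperties(const QVector<Entity> &matches,
                                      const QSet<QByteArray> &requested)
{
    QVector<QByteArray> missing;
    for (const Entity &entity : matches) {
        for (const QByteArray &property : requested) {
            if (!entity.hasProperty(property)) {
                missing.append(entity.id + ":" + property);
            }
        }
    }
    return missing;  // empty means the query can complete immediately
}
```
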
### Data streaming ###
Large objects such as attachments should be streamable. An API that allows a single property of a given entity to be retrieved in a streamable fashion is probably enough.

### Indexes ###
Since only properties of the domain types can be queried, default implementations for commonly used indexes can be provided. These indexes are populated by generic preprocessors that use the domain-type interface to extract properties from individual entities.

## Notifications ##
A notification mechanism is required to inform clients about changes. Running queries will automatically update the result set when a notification is received.

A notification consists of:

* the latest revision of the store
* a hint about which properties changed

The revision allows the client to fetch only the data that changed.
The hint allows the client to avoid fetching data it's not interested in.
A running query can do all of that transparently behind the scenes.

Note that the hints should indeed only hint at what has changed, and not supply the actual changeset. These hints should be tailored to what we see as useful, and must therefore be easy to modify.

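As a sketch, such a notification could be as small as the following; the struct and field names are assumptions:

```
#include <QByteArray>
#include <QSet>
#include <QtGlobal>

// Sketch of a notification payload.
struct Notification {
    qint64 revision;                     // latest revision of the store
    QSet<QByteArray> changedProperties;  // a hint only, not the changeset
};
```
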
diff --git a/docs/design.md b/docs/design.md
new file mode 100644
index 0000000..9b074b1
--- /dev/null
+++ b/docs/design.md
@@ -0,0 +1,98 @@
# Goals
## Axioms
1. Personal information is stored in multiple sources (address books, email stores, calendar files, ...)
2. These sources may be local, remote, or a mix of local and remote

## Requirements
1. Local mirrors of these sources must be available to 1..N local clients simultaneously
2. Local clients must be able to make (or at least request) changes to the data in the local mirrors
3. Local mirrors must be usable without a network connection, even if the source is remote
4. Local mirrors must be able to synchronize local changes to their sources (local or remote)
5. Local mirrors must be able to synchronize remote changes and propagate those to local clients
6. Content must be searchable by a number of terms (dates, identities, body text, ...)
7. This must all run with acceptable performance on a moderate consumer-grade desktop system

Nice to haves:

1. As-close-to-zero-copy-as-possible for data
2. Simple change notification semantics
3. Resource-specific synchronization techniques
4. Data-agnostic storage

Immediate goals:

1. Ease development of new features in existing resources
2. Ease maintenance of existing resources
3. Make adding new resources easy
4. Make adding new types of data or data relations easy
5. Improve performance relative to the existing Akonadi implementation

Long-term goals:

1. Project view: given a query, show all items in all stores that match that query easily and quickly

Implications of the above:

* Local mirrors must support multiple readers, but are probably best served with single-writer semantics, as this simplifies both local change recording and remote synchronization by keeping them in one process which can handle write requests (local or remote) sequentially
* There is no requirement for a central server if the readers can concurrently access the local mirror directly
* A storage system which requires a schema (e.g. relational databases) is a poor fit given the desire for data agnosticism and low memory copying

# Overview

# Types
## Domain Type
The domain types exposed in the public interface.

## Buffer Type
The individual buffer types as specified by the resource. These are internal types that don't necessarily have a 1:1 mapping to the domain types, although that is the default case that the default implementations expect.

## Steps to add support for new types
* Add the new type to applicationdomaintypes.h and implement `getTypeName()` (see the sketch below)
* Implement `TypeImplementation<>` for updating indexes etc.
* Add a type.fbs default schema for the type

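A rough sketch of the first two steps for a hypothetical new type; the exact signatures in applicationdomaintypes.h and of `TypeImplementation<>` are assumptions here:

```
#include <QByteArray>
#include <QString>

// All signatures below are assumptions for illustration.
struct Journal {
    QString subject;  // properties of the new domain type
};

// Step 1: a name under which the type is addressed in queries and commands.
template <typename T> QByteArray getTypeName();
template <> QByteArray getTypeName<Journal>() { return "journal"; }

// Step 2: hooks for updating indexes etc. for the new type.
template <typename T> class TypeImplementation;
template <> class TypeImplementation<Journal> {
public:
    static void index(const Journal &) { /* update indexes */ }
};
```
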
## Steps for adding support for a type to a resource
* Add a TypeAdaptorFactory, which can either register resource-specific mappers or stick to what the default implementation in TypeImplementation provides
* Add a TypeFacade that injects the TypeAdaptorFactory into the GenericFacade
* Register the facade in the resource
* Add synchronization code that creates the relevant objects

# Change Replay
The change replay is based on the revisions in the store. Clients (and also the write-back mechanism) are informed that a new revision is available. Each client can then go through all new revisions (starting from the last seen revision) and thus update its state to the latest revision.

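A client-side replay loop could be sketched as follows, assuming a minimal store interface (all names here are illustrative):

```
#include <QtGlobal>

// Assumed minimal interfaces for this sketch.
struct Change { /* entity id, operation, ... */ };
struct Store {
    qint64 latestRevision() const;
    Change changeForRevision(qint64 revision) const;
};
void applyChange(const Change &change);

// Walk all revisions since the last seen one and apply them in order.
void replay(Store &store, qint64 &lastSeenRevision)
{
    const qint64 latest = store.latestRevision();
    for (qint64 rev = lastSeenRevision + 1; rev <= latest; ++rev) {
        applyChange(store.changeForRevision(rev));
        lastSeenRevision = rev;  // persist this to resume after a restart
    }
}
```
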
# Tradeoffs/Design Decisions
* Key-value store instead of relational
    * `+` Schemaless, easier to evolve
    * `+` No need to fully normalize the data in order to make it queryable (without full normalization SQL is not really useful anyway, and performs badly)
    * `-` We need to maintain our own indexes

* Individual store per resource
    * Storage format defined by each resource individually
        * `-` Each resource needs to define its own schema
        * `+` Resources can adjust the storage format to map well onto what they have to synchronize
        * `+` Synchronization state can be embedded directly into the messages
    * `+` Individual resources could switch to another store technology
    * `+` Easier maintenance
    * `+` A resource is only responsible for its own store and can't accidentally break another resource's store
    * `-` Inter-resource moves are both more complicated and more expensive from a client perspective
    * `+` Inter-resource moves become simple additions and removals from a resource perspective
    * `-` No system-wide unique id per message (only the resource/id tuple identifies a message uniquely)
    * `+` Stores can work fully concurrently (also for writing)

* Indexes defined and maintained by resources
    * `-` Relational queries across resources are expensive (depending on the query, perhaps not even feasible)
    * `-` Each resource needs to define its own set of indexes
    * `+` Flexible design, as indexes can be changed on a per-resource level
    * `+` Indexes can be optimized towards each resource's main usecases
    * `+` Indexes can be shared with the source (IMAP server-side threading)

* Shared domain types as the common interface for client applications
    * `-` Yet another abstraction layer that requires translation to other layers and maintenance
    * `+` Decoupling of domain logic from data access
    * `+` Allows the types to evolve according to needs (not coupled to a specific application's domain types)

# Risks
* The key-value store does not perform well with large amounts of data
* Query performance is not sufficient
* Turnaround time for modifications is too high to feel responsive
* The design turns out to be as complex as Akonadi 1
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..3019cfd
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,17 @@
# Index
* Design
    * Design Goals
    * Overview
* Client API
* Storage
* Resource
    * Facade
* Logging
* Extending Akonadi Next
    * Steps to add support for new types
    * Steps for adding support for a type to a resource

# Documentation
This documentation is built using [mkdocs.org](http://mkdocs.org).

Use `mkdocs serve` to run a local webserver to view the docs.
diff --git a/docs/logging.md b/docs/logging.md
new file mode 100644
index 0000000..a88bc0f
--- /dev/null
+++ b/docs/logging.md
@@ -0,0 +1,26 @@
For debugging purposes a logging framework is required. Simple qDebug()s have proven insufficient for any non-trivial software. Developers should be able to add detailed debug information that allows problems to be analyzed, and users should be able to record that information at runtime to debug a problem. The aim is to get away from the situation where developers remove messages because "it's too noisy", and then have to ship a special version with additional debug output to a user to debug a problem, just to remove the debug output again afterwards.

## Requirements
* runtime configurability of the debug level for specific components
* queryable debug logs. If everything is turned on, a *lot* of information will be generated.
* integration with the system log. It likely makes sense to integrate with the system log instead of rolling our own solution or using .xsession-errors as a dumping ground. In any case, simply logging to the console is not enough.
* debug information *must* be available in release builds
* it may make sense to be able to disable certain debug output (configurable per debug level) for certain components at compile time, for performance reasons
* ideally little interaction with stdout (i.e. only warnings). Proper monitoring should happen through:
    * logfiles
    * a commandline monitor tool
    * akonadiconsole

## Debug levels
* trace: Traces individual codepaths. Likely outputs far too much information for normal use, and is probably only ever enabled temporarily. Trace points are likely only inserted into code fragments that are known to be problematic.
* debug: Comprehensive debug output. Enabled on demand.
* warning: Only warnings; should always be logged.
* error: Critical messages that should never appear. Should always be logged.

## Collected information
In addition to the regular message we want:

* pid
* threadid?
* timestamp
* source file + position + function name
* application name
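
Put together, one log record would then carry something like the following; the struct and field names are a sketch, not a defined format:

```
#include <QByteArray>
#include <QDateTime>
#include <QString>

enum class LogLevel { Trace, Debug, Warning, Error };

// Sketch of a log record carrying the fields listed above.
struct LogRecord {
    LogLevel level;
    qint64 pid;
    qint64 threadId;
    QDateTime timestamp;
    QByteArray sourceFile;   // __FILE__
    int line;                // __LINE__
    QByteArray function;     // __func__
    QByteArray application;
    QString message;
};
```
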
diff --git a/docs/resource.md b/docs/resource.md
new file mode 100644
index 0000000..847831d
--- /dev/null
+++ b/docs/resource.md
@@ -0,0 +1,42 @@
The resource consists of:

* the synchronizer process
* a plugin providing the client-api facade
* a configuration setting up the filters

# Synchronizer
The synchronization can either:

* Generate a full diff directly on top of the db. The diffing process can work against a single revision, and could even stop writing other changes to disk while the process is ongoing (but doesn't have to, thanks to the revision). It then generates the necessary changeset for the store.
* If the source supports incremental changes, generate the changeset directly from that information.

The changeset is then simply inserted into the regular modification queue and processed like all other modifications.
The synchronizer already knows that it doesn't have to replay this changeset to the source, since replay no longer goes via the store.

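The full-diff variant could be sketched like this; every type and function name here is an assumption for illustration:

```
#include <QVector>

// Assumed minimal interfaces for this sketch.
struct Entity {};
struct Change {};

struct Source { QVector<Entity> listEntities() const; };
struct Store {
    qint64 latestRevision() const;
    QVector<Entity> entitiesAt(qint64 revision) const;
    void enqueueModification(const Change &change);  // regular queue
};

QVector<Change> computeDiff(const QVector<Entity> &local,
                            const QVector<Entity> &remote);

void synchronize(Source &source, Store &store)
{
    // Diff against a single revision so the snapshot stays consistent;
    // other changes may keep being written in the meantime.
    const qint64 base = store.latestRevision();
    const QVector<Change> changeset =
        computeDiff(store.entitiesAt(base), source.listEntities());
    // The changeset goes through the regular modification queue; the
    // synchronizer knows not to replay these changes back to the source.
    for (const Change &change : changeset) {
        store.enqueueModification(change);
    }
}
```
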
# Preprocessors
Preprocessors are small processors that are guaranteed to run before a new/modified/deleted entity reaches storage. They can therefore be used for various tasks that need to be executed on every entity.

Usecases:

* updating various indexes
* detecting spam/scam mail and setting appropriate flags
* email filtering

Preprocessors need to be fast, since they directly affect how fast a message is processed by the system.

The following kinds of preprocessors exist:

* filtering preprocessors that can potentially move an entity to another resource
* passive filters that extract data which is stored externally (e.g. indexers)
* flag extractors that produce data stored with the entity (spam detection)

Filters should typically be read-only, e.g. so as not to break the signatures of emails. Extra flags that are accessible through the akonadi domain model can therefore be stored in the local-only buffer of each resource.

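A minimal sketch of what a preprocessor interface could look like (assumed names, not the real API):

```
// Forward declaration; the concrete entity type is resource-defined.
struct Entity;

class Preprocessor {
public:
    virtual ~Preprocessor() {}
    // Called before the entity reaches storage; must be fast, since it
    // directly adds to the processing latency of every message.
    virtual void newEntity(Entity &entity) = 0;
    virtual void modifiedEntity(const Entity &oldEntity, Entity &newEntity) = 0;
    virtual void deletedEntity(const Entity &entity) = 0;
};

// A passive filter (e.g. an indexer) would implement these hooks read-only;
// a flag extractor would write into the entity's local-only buffer.
```
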
# Generic Preprocessors
Most preprocessors will likely be used by several resources, and are either completely generic or domain-specific (such as only for mail).
It is therefore desirable to have default implementations for common preprocessors that are ready to be plugged in.

The domain types provide a generic interface to access most properties of the entities, on top of which generic preprocessors can be implemented.
That way it is trivial to e.g. implement a preprocessor that populates a hierarchy index of collections.

# Pipeline
A pipeline is an assembly of a set of preprocessors in a defined order. A modification is only persisted at the end of the pipeline, once all preprocessors have run.
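
Reusing the Preprocessor interface sketched above, a pipeline could be as simple as the following sketch (names assumed):

```
#include <memory>
#include <vector>

class Pipeline {
public:
    void process(Entity &entity) {
        // Preprocessors run in their defined order ...
        for (const auto &preprocessor : m_preprocessors) {
            preprocessor->newEntity(entity);
        }
        persist(entity);  // ... and only then is the modification persisted.
    }
private:
    void persist(Entity &entity);
    std::vector<std::unique_ptr<Preprocessor>> m_preprocessors;
};
```
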
diff --git a/docs/storage.md b/docs/storage.md
new file mode 100644
index 0000000..3a8d74a
--- /dev/null
+++ b/docs/storage.md
@@ -0,0 +1,208 @@
## Store access
Access to the entities happens through a well-defined interface that defines a property map for each supported domain type. A property map could look like:
```
Event {
    startDate: QDateTime
    subject: QString
    ...
}
```

This property map can be freely extended with new properties for various features. It shouldn't adhere to any external specification and exists solely to define how to access the data.

Clients will map these properties to the values of their domain object implementations, and resources will map the properties to the values in their buffers.

## Storage Model
The storage model is simple:
```
Entity {
    Id
    Revision {
        Revision-Id,
        Property*
    }+
}*
```

The store consists of entities that each have an id and a set of properties. Each entity can have multiple revisions.

An entity is uniquely identified by:

* Resource + Id

The additional revision identifies a specific instance/version of the entity.

Uri Scheme:

    akonadi://resource/id:revision

## Store Entities
Each entity can be as normalized/denormalized as useful. It is not necessary to have one solution that fits everything.

Denormalized:

* priority is that the mime message stays intact (signatures/encryption)
* could we still provide a streaming api for attachments?

```
Mail {
    id
    mimeMessage
}
```

Normalized:

* priority is that we can access individual members efficiently
* we don't care about exact reproducibility of e.g. an ical file

```
Event {
    id
    subject
    startDate
    attendees
    ...
}
```

Of course any combination of the two can be used, including duplicating data into individual properties while keeping the complete struct intact. The question then becomes, though, which copy is used for conflict resolution (perhaps this would result in more problems than it solves).

#### Optional Properties
For each domain type, we want to define a set of required and a set of optional properties. The required properties are the minimum bar for each resource, and are required in order for applications to work as expected. Optional properties may only be shown by the UI if they are actually supported by the backend.

However, we'd like to be able to support local-only storage for resources that don't support an optional property. Each entity thus has a "local" buffer that provides default local-only storage. This local-only buffer provides storage for all properties of the respective domain type.

Each resource can freely define how the properties are split, though it should push as many as possible into the synchronized buffer so they can be synchronized. Note that the resource is free to add more properties to its synchronized buffer even though they may not be required by the specification.

The advantage of this is that a resource only needs to specify a minimal set of properties, while everything else is taken care of by the local-only buffer. This is supposed to make it easier for resource implementors to get something working.

### Value Format
Each entity value in the key-value store consists of the following individual buffers:

* Metadata: metadata that is required for every entity (revision, ...)
* Resource: the buffer defined by the resource (synchronized properties, values that help with synchronization such as remoteIds)
* Local-only: default storage buffer that is domain-type specific

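As a sketch, an entity value would then bundle three opaque buffers; the struct and field names are assumptions (the client API above names flatbuffers as one possible message type):

```
#include <QByteArray>

// Sketch of the three buffers making up one stored entity value.
struct EntityValue {
    QByteArray metadata;   // required for every entity: revision, ...
    QByteArray resource;   // resource-defined: synchronized properties, remoteIds
    QByteArray localOnly;  // domain-type specific local-only storage
};
```
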
## Database
### Storage Layout
Storage is split up into multiple named databases that reside in the same database environment.

```
$DATADIR/akonadi2/storage/$RESOURCE_IDENTIFIER/$BUFFERTYPE.main
                                               $BUFFERTYPE.index.$INDEXTYPE
```

The resource can be effectively removed from disk (besides configuration) by deleting the `$RESOURCE_IDENTIFIER` directory and everything it contains.

#### Design Considerations
* The stores are split by buffer type, so a full scan (which is done by type) doesn't require filtering by type first. The downside is that an additional lookup is required to get from a revision to the data.

### Revisions
Every operation (create/delete/modify) leads to a new revision. The revision is an ever-increasing number for the complete store.

#### Design Considerations
By having one revision for the complete store instead of one per type, the change replay always works across all types. This is especially important for the write-back mechanism that replays the changes to the source.

### Database choice
By design we're interested in key-value stores or perhaps document databases. This is because a fixed schema is not useful for this design, which makes SQL not very useful (it would just be a very slow key-value store). While document databases would allow for indexes on certain properties (which is something we need), we have not yet found any contenders that look like they would be useful for this system.

### Requirements
* portable; minimally: Linux, Windows, MacOS X
* multi-thread and multi-process concurrency with a single writer and multiple readers
    * This is required so we don't block clients while a resource is writing, and is deemed essential for performance and for reducing complexity.
* reasonably fast, so we can implement all necessary queries with sufficient performance
* can deal with large amounts of data
* on-disk storage with ACID properties
* memory consumption suitable for a desktop system (no in-memory stores)

Other useful properties:

* suitable to implement some indexes (the fewer tools we need the better)
* support for transactions
* small overhead in on-disk size

### Contenders
* LMDB
    * support for mmapped values
    * good read performance, ok write performance
    * fairly complex API
    * up to double the storage size due to paging (with a 4k page size, 4001-byte values are the worst case)
    * size limit of 4GB on 32-bit systems, virtually no limit on 64-bit (leads to 2GB of actual payload with write amplification)
    * limited key-search capabilities
    * ACID transactions
    * MVCC concurrency
    * no compaction, the database always grows (pages get reused but are never freed)
* Berkeley DB (bdb)
    * performance is supposedly worse than LMDB (LMDB was written as a successor to bdb for OpenLDAP)
    * Oracle sits behind it (it has an AGPL licence though)
* rocksdb
    * => no multiprocess
* kyotocabinet http://fallabs.com/kyotocabinet/
    * fast, low on-disk overhead, simple API
    * => no multiprocess
    * GPL
* hamsterdb
    * => no multiprocess
* sqlite4
    * not yet released
* bangdb
    * not yet released as opensource, looks promising on paper
* redis
    * => loads everything into memory
    * => not embeddable
* couchdb
    * MVCC concurrency
    * document store
    * not embeddable (unless we write akonadi in erlang ;)
* https://github.com/simonhf/sharedhashfile
    * not portable (i.e. Windows; it's a mostly-Linux thing)
* http://sphia.org/architecture.html
    * => no multiprocess
* leveldb
    * => no multiprocess
* ejdb http://ejdb.org/#ejdb-c-library
    * modified version of kyoto cabinet
    * => multiprocess requires locking, so effectively no multiprocess
    * is more of a document store
    * no updates since September 2013
* http://unqlite.org
    * bad performance with a large database (looks like O(n))
    * like LMDB, roughly 2x the data size
    * includes a document store
    * mmapped read access
    * reading about 30% of the speed of LMDB
    * slow writes with transactions

## Indexes

### Index Choice
In addition to the primary store, indexes are required for efficient lookups.

Since indexes always need to be updated, they directly affect how fast we can write data. When reading, typically only a subset of the available indexes is used, so a slow index doesn't affect all queries.

#### Contenders
* xapian:
    * fast fulltext searching
    * no MVCC concurrency
    * only supports one writer at a time
    * if a reader is reading blocks that have since been changed by a writer, it throws a DatabaseModifiedException. This means most of the Xapian code needs to live in `while (1) { try { ... } catch (...) { } }` blocks and must be able to start from scratch.
    * wildcard searching (as of 2015-01) isn't ideal. It works by expanding the word into all other words in the query, which typically makes the query size huge. This huge query is then sent to the database. Baloo has had to configure this expansion of terms so that it consumes less memory.
    * non-existent UTF support - it does not support text normalization or splitting terms at custom characters such as '_'
* lmdb:
    * sorted keys
    * sorted duplicate keys
    * no FTS
    * MVCC concurrency
* sqlite:
    * SQL
    * updates lock the database for readers
    * concurrent reading is possible
    * requires duplicating the data: once in a column as data, and again in the FTS index
* lucenePlusPlus:
    * fast full text searching
    * MVCC concurrency

## Useful Resources
* LMDB
    * Wikipedia for a good overview: https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database
    * Benchmarks: http://symas.com/mdb/microbench/
    * Tradeoffs: http://symas.com/is-lmdb-a-leveldb-killer/
    * Disk space benchmark: http://symas.com/mdb/ondisk/
    * LMDB instead of Kyoto Cabinet as redis backend: http://www.anchor.com.au/blog/2013/05/second-strike-with-lightning/

diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 0000000..50abdae
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1 @@
site_name: Akonadi Next