Overview
Metadata management systems are software that allow organizations to store, monitor, and manage data across the enterprise. As more information moves onto servers or into the cloud, more and more metadata becomes available for analyzing that data and making better decisions. These systems also enable users to take advantage of new features and to deliver more useful reports to management.
There are two common types of metadata management software (MMS): hosted MMS and self-hosted MMS. Self-hosted MMS gives an organization direct control over security, reliability, and scalability, but it must run on infrastructure the organization operates itself. Hosted MMS, on the other hand, runs on a vendor's infrastructure and requires no additional infrastructure beyond what the organization already uses, which makes it attractive both for smaller teams and for larger enterprises with complex needs.
Data replication
There are two broad ways to replicate data: it can be quick to set up (e.g. SQL Server replication) or more detailed to configure (e.g. Oracle replication), and either approach can be performed manually or automated. Replication can also span multiple sites: a single database may need to be replicated to several sites, or a single table may be shared across databases located at different sites. In the past, multiple data sources were often merged into tables in the same database, an approach called cross-docking. It is no longer recommended, because the number of tables grows without bound. To avoid extra effort, keep related tables and rows together instead of creating every copy separately: with multiple copies of the data, the original tables can be moved or replaced at a later date. If data is lost or an error is detected, you can easily identify which tables were affected and replace them with a newer version later on. Data duplication works as follows:
On replication: you select tables from your repository without duplicating them; they remain the same tables, and you repeat this step each time. Sometimes tables and rows end up logically duplicated without being physical copies; in that case, delete the redundant rows to prevent errors. You can then build further tables by combining the duplicates on the same columns without replicating again, or simply replicate tables while there is enough disk space to hold both copies and delete the older ones afterwards. How much duplication you accumulate depends on how long you repeat these steps and whether replication can be maintained over long periods.
To prevent replication failure: use a backup strategy such as full recovery or incremental restore. If that is not possible and you want to reduce duplication, start migrating tables from the non-replicated set into replication.
To avoid unnecessary redundancy: evaluate replication methods and storage solutions where your business requires them. Common replication strategies include partitioning, replication groups, deduplication, mirroring, copy-on-write, and snapshots.
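As a concrete starting point, here is a minimal sketch of pointing a MySQL replica at its primary, in the MySQL 5.5-era syntax used elsewhere in this article; the host name, user, password, and binary-log coordinates are placeholder assumptions:

-- Run on the replica; every connection value below is illustrative.
CHANGE MASTER TO
    MASTER_HOST = 'primary.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'repl_password',
    MASTER_LOG_FILE = 'mysql-bin.000001',  -- binlog coordinates taken
    MASTER_LOG_POS = 4;                    -- from SHOW MASTER STATUS
START SLAVE;

-- Verify that both replication threads are running:
SHOW SLAVE STATUS;

Once replication is running, the duplication and backup strategies above determine how much redundant data the replicas accumulate.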
Table Partitioning
Table partitioning splits a table into multiple logical partitions, depending on how much data your application handles. Each partition holds a subset of the table's rows, so the size of a partition depends on the number of rows assigned to it. However, some kinds of data are more suitable for partitioning than others. Consider a sales department whose table holds only a few thousand records: rather than splitting the physical products into large numbers of individual items, partitioning the table may be enough to process that information in small chunks. Partitioning works poorly when the rows are too similar in terms of brand name, price, and quality: a sales manager cannot tell three kinds of cars apart by the partition key even though the cars themselves differ. Partitions that do not line up with the underlying objects increase data duplication and cause performance problems, and the scheme can become costly and resource-intensive because each partition carries its own unique value or index. It is therefore better to implement a balanced partition scheme for your dataset. An efficient solution is to combine balanced partitions with hash partitions, as follows:
First, create a master and a replica set in MySQL 5.5 or above. Write a script that creates a temporary table with two partitions, each holding one column, and hash the columns of each partition into a separate column of the main table. Add a few columns from each partition and assign an index to each column of this hash-partition column. Then insert the next row into the master table. Once all the partitions have been created successfully, execute the query that adds the rows of the next partition to the main table, and save the whole table. When the partitioning is done, attach all the partitions back to the master table; you can then run queries against individual partitions and find the relevant ones. In our case we created 20 partitions on one table, for a total of 120 partitions across the schema.
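As a minimal sketch, a hash-partitioned table of this kind can be declared in MySQL 5.5+ as follows; the table and column names are illustrative, not taken from the setup described above:

-- MySQL routes each row to partition (id MOD 20) automatically on INSERT.
CREATE TABLE sales (
    id INT NOT NULL,
    sold_at DATE NOT NULL,
    amount DECIMAL(10,2),
    PRIMARY KEY (id)
)
PARTITION BY HASH(id)
PARTITIONS 20;

Queries that filter on id with an equality predicate are then pruned to a single partition rather than scanning the whole table.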
The most important property of these partition plans is that partitions are distributed in blocks, and every block is assigned a unique identifier. In our MySQL setup we created 3 blocks, each holding a different set of partitions on one table. When a new partition plan is needed later, this arrangement does not change the existing partition set. The previous partition plan can be modified to achieve everything described in this section, but that is not recommended, because it wastes time. Instead, choose a simple partition-level plan that does not depend on the complexity of your business.
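For reference, this is roughly what modifying an existing hash-partition plan looks like in MySQL, reusing the illustrative sales table from the sketch above:

-- Grow the plan from 20 to 24 hash partitions.
ALTER TABLE sales ADD PARTITION PARTITIONS 4;

-- Shrink it again by merging four hash partitions away.
ALTER TABLE sales COALESCE PARTITION 4;

-- Either statement redistributes the existing rows across the new layout,
-- which is why settling on a simple plan up front is cheaper.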
Table Deletion
Table deletion supports deleting or altering the data within a table. Unlike other forms of table manipulation, deletion removes only the rows in a table; it does not delete columns. Suppose a customer's order table is deleted or modified: one can then create a new order table without losing any orders from previous months, which makes future changes easier to predict and preserves continuity. As with the partitioning above, hash partitioning creates individual partitions, while balanced partitioning groups hash partitions into blocks. The difference when deleting is the granularity: a hash-partition plan deletes individual partitions, while a balanced-partition plan deletes whole blocks of them. Here is the pseudocode for a basic deletion plan (a SQL sketch follows these plans):
create three empty blocks of partitions
hash the table's partitions into the four blocks
merge block 1 into block 2 (deleting the two freed partitions)
delete the remaining partitions from the main table
delete the main table
delete the last two partitions
The method I recommend for optimizing memory usage is the same plan with two more empty blocks created up front; the hashing, merging, and deletion steps are otherwise identical.
This technique was implemented in our system and it is quite stable and reliable.
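To ground the plans above, here is a minimal SQL sketch of partition-level deletion in MySQL; the table and partition names are illustrative assumptions, not the original system:

-- RANGE partitions can be dropped by name, removing all their rows at once.
CREATE TABLE orders (
    id INT NOT NULL,
    created DATE NOT NULL
)
PARTITION BY RANGE (YEAR(created)) (
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024)
);

-- Far cheaper than DELETE FROM orders WHERE YEAR(created) < 2023;
ALTER TABLE orders DROP PARTITION p2021, p2022;

-- Hash partitions cannot be dropped by name; they are merged instead,
-- which redistributes their rows rather than deleting them:
-- ALTER TABLE sales COALESCE PARTITION 2;

Dropping a partition discards its storage directly instead of scanning rows, which is what makes partition-level deletion plans cheap on memory and I/O.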
Deletion Optimization with C++ code: https://gitlab-ci.com/zcxz/project-deletion-optimization.git
Deletion optimizers:
You can improve the efficiency of the Deletion Optimizer to gain further efficiency and stability. The project provides a complete package supporting the development of optimized deletion plans, including parallelized file access. Its primary focus is high-throughput deletion and reduced bandwidth consumption. The source code can be found here: https://github.com/zcxz/project-deletion-optimization.
Deletion Optimizer is open-sourced under an MIT license. Contributions can be made on GitHub.
Manual Deletion Control with PHP code: https://gitlab-ci.com/zcxz/project-deletion-control.git
Deletion Control
Deletion control is an extension of the DnDBP control protocol that supports bulk deletion, such as unloading entire tables, and it requires no changes beyond the code itself. Manual control can also be implemented in the Perl programming language and is easy to develop; unfortunately, it is not supported by MySQL 5, so it is not worth considering there, and no alternative solution is available yet. See the reviews of the DnDBP control protocol for further discussion. The source code can be found here: https://github.com/zcxz/project-deletion-control.
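For context, plain SQL already offers table-level bulk deletion independent of DnDBP; a minimal sketch, with an illustrative table name:

-- Remove every row but keep the table definition.
TRUNCATE TABLE orders;

-- Remove the table itself, definition included.
DROP TABLE orders;

-- Row-level bulk deletion, by contrast, scans and logs each row:
-- DELETE FROM orders WHERE created < '2022-01-01';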