- 1. Outline of Major Topics
- 2. Data Modeling
- 3. Process Modeling
- 4. Data Models, Architecture, DBMS Architecture
- 5. Examples of Data Models
- 6. Relational Models | Data Structures (i.e. Tables)
- 7. Relational Models | Constraints
- 8. Data Model | Operations
- 9. Keys and Identifiers
- 10. Integrity and Consistency
- 11. Null Values
- 12. Surrogates - Things and Names
- 13. ANSI/SPARC 3-Level DB Architecture: Separating Concerns
- 13.1 Conceptual Schema
- 👀13.2 External Schema - Window Into the Database
- 13.3 Knowledge Check Disaster
- 13.4 Internal Schema
- 14. Physical Data Independence
- 15. Logical Data Independence
- 16. ANSI/SPARC DBMS Framework Part 1
- Schema Compiler
- 17. ANSI/SPARC DBMS Framework Part 2
- Query Transformer
- 18. Metadata Chart
- 19. Metadata - What Is It?
1. Outline of Major Topics
Major topics of this course:
- data modeling
- process modeling
- database efficiency
2. Data Modeling
The data modeling process has two steps:
1) Fix a perception of structures of reality
2) Represent that perception
Database people use two different languages for this. One language is the EER (Extended Entity Relationship) Model, which is good for fixing a perception of structures of reality. The second, the Relational Model, is good for representing this perception inside a database.
In the data modeling process we select aspects of reality that are important to us while ignoring others. And we use abstraction to organize these aspects.
3. Process Modeling
In Process Modeling we aim at fixing and representing a perception of processes of reality. Processes are not represented directly in the database.
Rather, they are reflected in the way we use the database through the database management system.
There are two ways in which we can use the database through the DBMS (Database Management System).
1) One is to embed data manipulation language code inside a program, or
2) to use the data manipulation language directly to issue queries on the database.
If you have a program that builds, for example, a user interface for an application, then inside that program it is possible to have data manipulation language statements that access a database, either to update the data or to retrieve the data and display it back through the interface.
Alternatively, it is possible to use the data manipulation language directly to do ad hoc queries on the database, or ad hoc updates and maintenance of the database.
The data manipulation language we are going to use in this course is SQL. You are going to spend a major amount of time in the course learning how to build programs like that and how to embed SQL statements to manipulate and maintain the database.
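As a sketch of the first style, here is host-language code with embedded DML. This is a minimal illustration, assuming Python with its built-in sqlite3 module and a made-up RegularUser table; the in-memory database stands in for a real DBMS.

```python
import sqlite3

# Embedded DML: SQL statements inside host-language code.
# RegularUser and its columns are made-up examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RegularUser (Email TEXT, CurrCity TEXT)")

def register_user(email, city):
    # Embedded update: the program changes the database on behalf of a UI.
    conn.execute("INSERT INTO RegularUser (Email, CurrCity) VALUES (?, ?)",
                 (email, city))

def users_in(city):
    # Embedded retrieval: results are handed back to the interface for display.
    rows = conn.execute("SELECT Email FROM RegularUser WHERE CurrCity = ?",
                        (city,))
    return [email for (email,) in rows]

register_user("user1@gt.edu", "Atlanta")
register_user("user2@gt.edu", "Roswell")
print(users_in("Atlanta"))  # ['user1@gt.edu']
```

The second style, ad hoc use, corresponds to typing the same SELECT directly into an SQL console.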
4. Data Models, Architecture, DBMS Architecture
In order to do data modeling, we need to talk about a tool called “Data Model”.
Data Models contain formalisms to allow us to express
- data structures
- constraints
- operations
- keys and identifiers
- integrity & consistency
- NULL values
- surrogates
There are several additional important aspects I want to talk about before we get into the details of the two major data models of this course, namely the EER (Extended Entity Relationship) Model and the Relational Model.
Another important aspect is Database Architecture.
It turns out that a database model of structures of reality has several layers to it. We want to look at what those layers are. We also want to take a look at what the architecture of a DBMS that allows us to create such a system would be. Specifically, we will look at the
- ANSI/SPARC 3-level DB architecture and the corresponding DBMS architecture
- Data Independence
Finally, we look at Metadata, what it is, and why it is so important.
5. Examples of Data Models
We defined a database to be a model of structures of reality.
A data model is the tool or the formalism that we use to create such a model. We should always be talking about the following three elements when talking about a data model:
- data structures
- integrity constraints
- operations
Three data models:
1) Entity Relationship Model: We will use this model to fix a perception of reality in this course
2) Relational Model: We will use to implement that model in a DBMS
3) Hierarchical Model: Was implemented in the first database system, the IBM IMS database system, back in 1967. Interestingly, the hierarchical model is also the fundamental model of XML databases today.
6. Relational Models | Data Structures (i.e. Tables)
Relational Model Example
Tables have names, in this case RegularUser; each column has a name; and each column has a data type.
A table has columns; the number of columns is also called the degree of the table. A relation also has a number of rows, in this case 7 rows.
The table name, column names and data types together constitute the schema of this table. The schema represents aspects of the data that are stable over time. In other words: The schema is not expected to change.
The state of the database, represented by the rows in the table, reflects aspects that are dynamic and change over time. The idea is that whereas the schema in general represents the structure of the data, the rows represent the current state of reality that is modelled by this table.
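The schema/state split can be sketched in code. A minimal illustration, assuming Python's sqlite3 and made-up column names: the CREATE TABLE statement is the stable schema; the rows are the changing state.

```python
import sqlite3

# The CREATE TABLE statement below is the schema: table name, column
# names, data types. It is expected to stay stable over time.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE RegularUser (
    Email     TEXT,
    BirthDate TEXT,
    CurrCity  TEXT)""")

# The rows are the state: they change as the modelled reality changes.
conn.executemany("INSERT INTO RegularUser VALUES (?, ?, ?)", [
    ("user1@gt.edu", "1985-03-01", "Atlanta"),
    ("user2@gt.edu", "1990-07-15", "Roswell"),
])

cur = conn.execute("SELECT * FROM RegularUser")
degree = len(cur.description)  # number of columns
cardinality = conn.execute("SELECT COUNT(*) FROM RegularUser").fetchone()[0]
print(degree, cardinality)  # 3 2
```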
7. Relational Models | Constraints
Constraints represent rules that cannot be expressed by the data structures we just looked at.
- We might want Emails to be unique. That would allow us to think about Emails as representing unique RegularUsers. So, for each user out in the real world, there is a corresponding row in this table.
- It would not make sense for Email to be NULL in this case.
- We also might have a constraint saying that Birthday must be after 1900-01-01.
- We might have a constraint that Hometown must be a city in the US.
Notice that none of these constraints are expressed by the table structure or the data types.
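The first three constraints can be written as SQL column constraints. A hedged sketch, assuming SQLite syntax and the lecture's column names:

```python
import sqlite3

# The unique-Email, non-NULL-Email and Birthday constraints as SQL
# column constraints (SQLite syntax). Column names are made up to
# match the lecture example.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE RegularUser (
    Email     TEXT NOT NULL UNIQUE,
    BirthDate TEXT CHECK (BirthDate > '1900-01-01'),
    Hometown  TEXT)""")
# "Hometown must be a US city" is not expressible as a simple column
# constraint; it would need a lookup table of cities and a foreign key.

conn.execute("INSERT INTO RegularUser VALUES ('a@gt.edu', '1985-03-01', 'Atlanta')")
try:
    # Violates the uniqueness constraint on Email.
    conn.execute("INSERT INTO RegularUser VALUES ('a@gt.edu', '1990-01-01', 'Boston')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```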
8. Data Model | Operations
The third aspect of a data model is operations.
Example of database maintenance: Operations support change and retrieval of data
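A minimal sketch of these operations, assuming SQLite and a made-up RegularUser table: INSERT, UPDATE and DELETE change the state; SELECT retrieves from it.

```python
import sqlite3

# Operations in SQL form: INSERT/UPDATE/DELETE change the state;
# SELECT retrieves from it. Table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RegularUser (Email TEXT, CurrCity TEXT)")
conn.execute("INSERT INTO RegularUser VALUES ('a@gt.edu', 'Atlanta')")  # change
conn.execute("UPDATE RegularUser SET CurrCity = 'Roswell' WHERE Email = 'a@gt.edu'")
conn.execute("DELETE FROM RegularUser WHERE CurrCity = 'Boston'")       # matches nothing here
rows = conn.execute("SELECT * FROM RegularUser").fetchall()             # retrieval
print(rows)  # [('a@gt.edu', 'Roswell')]
```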
9. Keys and Identifiers
Keys are uniqueness constraints
10. Integrity and Consistency
Integrity and consistency are two highly desirable properties of databases. To give you a really great example of integrity, I have a confession to make: you know me as Leo Mark, but I am really working under cover. My full name is Leo Mark Christensen. Why did I drop the last name? I was born in Denmark, where a lot of people are named Christensen, many with the first name Leo. So I dropped it, and from that day on, I became Leo Mark.
Integrity has to do with: does the database reflect reality well?
Consistency has to do with: is the database without internal conflicts?
We have our RegularUser table. The RegularUser table has BirthDate, Name and CurrCity. The User table has Email and Address.
Inconsistency
Integrity
The CurrCity of User1 is Atlanta, but the Address of User1 is Roswell. There seems to be an inconsistency. User4's CurrCity is Atlanta, but in the User table, the Address is Roswell. Obviously there is some redundancy between CurrCity and Address.
You now know that my full name is Leo Mark Christensen. User4 actually happens to be my daughter, Louise Mark Christensen. Funny name; she was born in Denmark, and in Denmark children inherit both the last and the middle name of the father. So she has a man's name in the middle. On her 18th birthday, the extended family was gathered in the kitchen for breakfast, and the phone rang. A man's voice at the other end asked if Louise Mark was present. The military was calling: it assumed that a person named Mark must be male and therefore should have signed up for Selective Service.
11. Null Values
SSRegist = Selective Service. In the US, a male older than 18 has to register for Selective Service.
Huh?
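The point of the SSRegist column is that NULL can mean "unknown" or "does not apply", and a NULL never satisfies an ordinary comparison. A minimal sketch, assuming SQLite and made-up rows:

```python
import sqlite3

# NULL marks "unknown" or "does not apply". A NULL satisfies neither
# SSRegist = 'yes' nor SSRegist <> 'yes', so user3 appears in neither
# count. Rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RegularUser (Email TEXT, SSRegist TEXT)")
conn.executemany("INSERT INTO RegularUser VALUES (?, ?)", [
    ("user1@gt.edu", "yes"),
    ("user2@gt.edu", "no"),
    ("user3@gt.edu", None),   # NULL
])

reg = conn.execute(
    "SELECT COUNT(*) FROM RegularUser WHERE SSRegist = 'yes'").fetchone()[0]
notreg = conn.execute(
    "SELECT COUNT(*) FROM RegularUser WHERE SSRegist <> 'yes'").fetchone()[0]
print(reg, notreg)  # 1 1 -- the NULL row is in neither result
```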
12. Surrogates - Things and Names
Everything is a thing.
And some things are just that: Things.
Examples: A tablet is a thing, not a name for anything; a camera is a thing, not a name for anything.
But then, there are some things that are also names of things. When I say names, I mean it in a very general sense.
Examples: The text string “Leo” is a thing, but it is also my first name. The text string “GTO1” is a thing which is the name on my license tag, and the text string “49” is a thing which is the name of my age.
Let us say, we have a set of RegularUsers in reality, and we would like to record Email, Name and Addr. So, we create a table in the database with the columns Email, Name, and Addr.
The Name may change (due to marriage) and the address may change due to a move. Is it still the same person? It probably is.
This type of representation is called a name-based representation. In it, you are what is known about you, no more, no less.
Now, let us consider an alternative kind of doing this.
So, we have RegularUsers, and we would like to record Email, Name and Addr. However, this time, instead of having just the three columns Email, Name and Addr in the RegularUser table, we add an extra column, User, for system-generated, unique identifiers. These are also called surrogates.
They represent inside the system the Users outside in the real world. So for each instance of a RegularUser out in the real world, you have a surrogate inside the database to represent that User.
Now, let us reconsider Lisa. This representation allows the Email to be changed while still making it possible to recognize that it is the same User.
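A sketch of the surrogate-based representation, assuming SQLite, where an INTEGER PRIMARY KEY plays the role of the system-generated surrogate, and a hypothetical user Lisa:

```python
import sqlite3

# Surrogate-based representation: UserID is a system-generated
# identifier that stands for the person. Lisa's email can change
# while the surrogate still identifies the same user. Names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE RegularUser (
    UserID INTEGER PRIMARY KEY,   -- system-generated surrogate
    Email  TEXT,
    Name   TEXT)""")

cur = conn.execute("INSERT INTO RegularUser (Email, Name) VALUES (?, ?)",
                   ("lisa@gt.edu", "Lisa"))
lisa_id = cur.lastrowid

# Lisa changes her email address.
conn.execute("UPDATE RegularUser SET Email = ? WHERE UserID = ?",
             ("lisa.new@gt.edu", lisa_id))
row = conn.execute("SELECT UserID, Email FROM RegularUser").fetchone()
print(row)  # (1, 'lisa.new@gt.edu') -- same surrogate, new name-like value
```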
“Das Ding an sich” (“the thing in itself”). Kant
13. ANSI/SPARC 3-Level DB Architecture: Separating Concerns
We will be talking about an architecture of a database and a DBMS that is used to create and maintain the database. The architecture we are going to look at is called ANSI/SPARC 3-Level DB Architecture.
A database is a model of structures of reality. It is divided into
- Schema
- Data
The schema describes the intension, or the type, of the data, and the data describes the extension. The separation between schema and data is made because it makes data access more effective and efficient.
For example, when we want to search for some data, we write a query against the schema, and that query will bring back as a result all the data that fits the structures of the schema that were used in the query.
The separation into a single schema and data, however, is not the end of it. If the database only consisted of schema and data, then the schema would have to describe aspects of what the meaning of the data is, how it is used, and how it is internally organized. This would mean that queries against the DB through that schema would be able to reference how the data is physically organized (huh?) 👀
The implication of that is: if we decide to change how the data is physically organized, then the applications of the DB would be affected. Therefore the ANSI/SPARC 3-level architecture separates out aspects of how data is physically organized into what is known as an Internal Schema.
And the data now sits under the Internal Schema, so that to the DB access it appears that it goes through the schema to access the data.
But instead, what happens is that the request goes through the schema, the request or change is translated to one at the Internal Schema level, the data is accessed, and the answer is sent back.
The benefit of this is that now it is actually possible to change the way data is stored without affecting the applications that run against the schema.
However, a DB is used by many different applications with different needs to display data. Therefore, for each need, for each application of the DB, we could create an External Schema. The ANSI/SPARC 3-level architecture thus prescribes a third level, namely the External Schema, that separates out aspects of how data is used by individual applications.
So, now you have applications running against External Schemata that represent data in the preferred format for those applications. Requests through an External Schema are then translated through the Conceptual Schema into requests on the Internal Schema. Data is accessed, and the response is translated back up through the schemata and presented to the application in the format that it needs.
Turn this illustration by 90° and you get 3-Level ANSI/SPARC DB architecture that people normally see depicted.
You have a Conceptual Schema in the middle, you have an Internal Schema implementing that, and you have a number of External Schemata, one for each major application.
The Conceptual Schema concentrates on the meaning of data, the Internal Schema concentrates on the storage of data, and the External Schemata on the use of data.
13.1 Conceptual Schema
The Conceptual Schema describes all conceptually relevant, general, time-invariant structural aspects of reality. It describes nothing related to data representation, physical organisation, access, or use.
Example: the RegularUser table. At the Conceptual Level, the only thing that is visible to the query language is this table and what is in it.
You cannot include in this query anything about how the results are displayed (other than their order) or how the data is accessed.
👀13.2 External Schema - Window Into the Database
External Schema describes parts of the information in the Conceptual Schema in a form convenient to a particular user group’s view. It is logically derived from the Conceptual Schema.
Here is an example of a virtual table that exists in the External Schema. The name of the virtual table - or the view - is HighPayFemales. It is defined as the result of the following query:
The query selects some of the columns of the Conceptual Schema table: it selects Email, MaidenName, CurrentCity from RegularUser where the Sex equals “female” and the Salary is greater than 10,000.
The “Create View” statement creates the virtual table HighPayFemales in the External Schema. So, in other words, the virtual table named HighPayFemales at the top has the attributes Email, MaidenName, CurrentCity. It never really exists as stored data; it is just a view or a window into the DB.
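A runnable sketch of that view, assuming SQLite and made-up rows (the column names follow the lecture):

```python
import sqlite3

# The HighPayFemales view as a stored query over RegularUser.
# The view holds no data of its own; it is recomputed on each use.
# All rows are made up to match the lecture example.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE RegularUser (
    Email TEXT, MaidenName TEXT, CurrentCity TEXT, Sex TEXT, Salary INTEGER)""")
conn.executemany("INSERT INTO RegularUser VALUES (?, ?, ?, ?, ?)", [
    ("a@gt.edu", "Smith", "Atlanta", "female", 12000),
    ("b@gt.edu", "Jones", "Roswell", "female", 8000),
    ("c@gt.edu", None,    "Atlanta", "male",   15000),
])
conn.execute("""CREATE VIEW HighPayFemales AS
    SELECT Email, MaidenName, CurrentCity
    FROM RegularUser
    WHERE Sex = 'female' AND Salary > 10000""")

result = conn.execute("SELECT * FROM HighPayFemales").fetchall()
print(result)  # [('a@gt.edu', 'Smith', 'Atlanta')]
```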
Unfortunately, I cannot add any Achievement, Badge or LineDance. I need to switch to the relevant collection. But this is not how I meant it.
When I define the columns (i.e. relations in DB jargon) of the collection “New View into The DB” not in the current collection/table but in its source collections (i.e. in the collections Achievements, Badges, and Line Dance), I have the same effect: I cannot choose any of these from the collection “New View into The DB”; I must put the data into the source collections. It is exactly the same. I did not expect that.
👀 Hmmm…
13.3 Knowledge Check Disaster
The truth is, whenever it comes to a knowledge check, I fail. Always. I will never be a master of DB, just a master of disaster.
Or is it possible that this question was just too early? The chapters still to come are:
- Internal Schema
- Physical Data Independence
- Logical Data Independence
Never mind, I do not have to earn my living with this, anyway. I am learning this for fun.
13.4 Internal Schema
An Internal Schema describes how the information in the Conceptual Schema is physically represented in order to get the overall best query and update performance of the DB.
Here, we have in the middle the table in our Conceptual Schema. That may be represented, for example, by a corresponding table underneath in a Physical Schema. In this example it has the same attributes, and it may be sorted in some particular way, for example on LastName.
In the case that there are many queries on Salary, it might make queries more efficient if there were an index defined on Salary. We could define the index as a B+-tree, for example; we will look at those later in the course.
B+-trees give logarithmic-time access to data, so it would be possible to efficiently ask a query like the one we just saw when we defined the External Schema.
Furthermore, it might be that we would like to have a fast access on CurrentCity, and therefore we could create an additional index on the DB on CurrentCity.
That index would basically be a table with two columns: (1) the values of CurrentCity and (2) pointers to the rows in the DB that contain that particular value of CurrentCity.
Notice that neither the B+-tree nor the index can be seen by the applications. Because of that, it is possible to replace them, remove them, or add additional indices, all without affecting the applications that run on the database.
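A sketch of such an Internal Schema decision, assuming SQLite and made-up data: the index exists only at the internal level, and the query text never mentions it.

```python
import sqlite3

# An Internal Schema decision: an index on CurrentCity. The SELECT
# below never mentions the index; only the query plan does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RegularUser (Email TEXT, CurrentCity TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO RegularUser VALUES (?, ?, ?)", [
    ("a@gt.edu", "Atlanta", 12000),
    ("b@gt.edu", "Roswell", 8000),
])
conn.execute("CREATE INDEX idx_city ON RegularUser (CurrentCity)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT Email FROM RegularUser WHERE CurrentCity = 'Atlanta'"
).fetchall()
print(plan)  # the plan mentions idx_city; the query text does not
```

Dropping or replacing idx_city would change the plan but not the query, which is the point of the separation.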
14. Physical Data Independence
The three levels in the ANSI/SPARC architecture provide for two types of independence:
(1) Physical data independence - a measure of how much you can change the Internal Schema representation of a DB without changing or affecting the applications that run on the External Schema. DB technology allows us to make that separation almost perfect. It is very similar to the idea of object-oriented programming (*), which supports encapsulation, where the implementation of a class can be changed without affecting the applications that access the class.
(*) 👀This reminds me of this course:
15. Logical Data Independence
Logical data independence is a measure of how much you can change the Conceptual Schema without changing the applications that run on the DB or the External Schema. It is more difficult to provide logical data independence than physical data independence. The reason is that the External Schemata, against which the applications are written, are logically derived from the Conceptual Schema.
Needless to say, if you have an application that accesses a table in an External Schema, and you go and change the Conceptual Schema, then it is possible that the application will be affected.
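A small sketch of how a view can shield an application from one kind of Conceptual Schema change (a column rename), assuming SQLite 3.25+ and made-up names; more drastic schema changes cannot always be hidden this way:

```python
import sqlite3

# A view shields the application from a Conceptual Schema change.
# The app queries only the External-Schema view UserCity; when the
# base column is renamed, only the view definition is redone.
# Requires SQLite 3.25+ for RENAME COLUMN; all names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RegularUser (Email TEXT, CurrCity TEXT)")
conn.execute("INSERT INTO RegularUser VALUES ('a@gt.edu', 'Atlanta')")
conn.execute("CREATE VIEW UserCity AS SELECT Email, CurrCity AS City FROM RegularUser")

def app_query():
    # The application is written against the External Schema only.
    return conn.execute("SELECT City FROM UserCity").fetchall()

before = app_query()
# Conceptual Schema change: rename the column, then rederive the view.
conn.execute("DROP VIEW UserCity")
conn.execute("ALTER TABLE RegularUser RENAME COLUMN CurrCity TO CurrentCity")
conn.execute("CREATE VIEW UserCity AS SELECT Email, CurrentCity AS City FROM RegularUser")
print(before == app_query())  # True: the application code is unaffected
```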
16. ANSI/SPARC DBMS Framework Part 1
Here is a description of the ANSI/SPARC DBMS architecture that is necessary in order to create and support a 3-level DB as we have just looked at. This architecture, or framework, was proposed by the American National Standards Institute (ANSI). And amazingly, it was done as early as 1975, before there were any commercial implementations of relational databases.
It had a profound impact on the way relational databases were built. The framework consists of two pieces:
(1) Schema Compiler
(2) Query Transformer
Schema Compiler
The hexagons represent people working in different roles with the DBMS; the boxes represent processes or pieces of software that transform text. The triangle represents the metadatabase, where schema definitions are stored.
The job or the role of the Enterprise Administrator is to define conceptual DB schemata. Using the language or the interface (1), the Enterprise Admin will define a Conceptual Schema. The Conceptual Schema Processor will check it for syntax, and through interface (2) it will store that Conceptual Schema in the Metadatabase.
An Application Systems Admin is responsible for defining External Schemata. That human role can look at the Conceptual Schema that is currently defined through the language interface (3) and can express an External Schema definition through the interface (4). The External Schema Processor will read that External Schema definition, check it for correct syntax, and check that it is correctly logically derived from the Conceptual Schema that was previously defined. Finally it will store that External Schema definition inside the Meta DB. And, as we already talked about, multiple External Schemata can be defined from the same Conceptual Schema and stored as Metadata.
Next, a person acting in the role of DB Admin can look at the Conceptual Schema that has been defined for the DB through the interface (3). Then that DB Admin can define an Internal Schema using the language in interface (13). The Internal Schema Processor will parse that Internal Schema definition, make sure that it is syntactically correct, and make sure that it actually physically implements and supports what was defined in the Conceptual Schema. When all of that is checked, the Internal Schema Processor can store that Internal Schema definition inside the Meta DB.
So, with those 3 levels completed, there is now a definition of what the 3 levels of the DB will look like stored inside the Meta DB.
17. ANSI/SPARC DBMS Framework Part 2
Query Transformer
A person acting in the role of User can now express queries on the DB using language interface (12). We previously talked about two different ways that could be done.
- The User can issue LH (?) queries, and those queries would then be translated down through the levels, executed on data, and the result would come back to the User. Or
- The User, or a programmer implementing a User interface, would include access to the DB inside some host language code, and when, in the execution of that program, the DB access statements are made, they are sent down through the translation, evaluated, and the answer comes back to the User.
So, let us talk about these translations here:
There is a DB query expressed in interface (12). The External to Conceptual Schema transformer will read the Metadata, which describes what the External Schema looks like and what the Conceptual Schema looks like, and, using that information from the Meta DB coming through interface (36), will translate the query that was issued at interface (12) to a query expressed at interface (31).
Likewise, the Conceptual to Internal Schema transformer will read the Conceptual Schema definition and the Internal Schema definition through interface (36) and translate the query at interface (31) to a query at interface (30).
Finally, the Internal Schema to Storage transformer will read the Internal Schema definition through interface (34) and translate the query from the Internal Schema level (30) to one at interface (21), which is essentially the Operating System level language. Through Operating System calls or DB system calls, the operation will be executed on the DB, the result will come back from the DB system, and it will be routed and translated back up through the levels and returned as the answer to the query.
If this process took place every single time a query is executed on the DB, the DB would be incredibly inefficient. Therefore it is actually done slightly differently. But the functionality of what happens is exactly what I just described.
18. Metadata Chart
Here is a somewhat different view of what I just described to you.
In this chart, the DBA staff covers the three human roles I described before: defining the Conceptual Schema, the Internal Schema and the External Schemata. That is done with DDL (Data Definition Language) statements; the DDL compiler will compile them and store them in the Meta DB.
With that in place, a Casual User can now write an interactive query. That query is processed by a Query Compiler, it is optimized to find an efficient way to execute it, and it is given as a database command to the Runtime DB Processor, which will execute it together with all other queries on the DB.
Looking at Application Programmers: they can write Application Programs, which consist of host language code with embedded DB access. That code is passed to a Precompiler, and out of the Precompiler come two things:
- Host Language Code and
- the DB queries, or Data Manipulation Language (DML) statements.
Those DML statements are then compiled, optimized, and combined with the compiled Host Language Code into Compiled Transactions. The compiled transactions are then transferred to the DB and executed by the Runtime Database Processor (RDP).
Since there may be many concurrent Users on the DB, the Concurrency Control subsystem makes sure that all these competing transactions and queries on the DB are executed in a proper order.
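The unit the concurrency-control subsystem schedules is the transaction. A minimal sketch of that unit, assuming SQLite and a made-up Account table: either all statements of a transaction take effect or none do, so competing updates never leave partial results behind.

```python
import sqlite3

# A transaction is the unit the concurrency-control subsystem schedules:
# either all of its statements take effect, or none do. Account and its
# rows are made up; isolation_level=None gives manual transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Account (Owner TEXT, Balance INTEGER)")
conn.execute("INSERT INTO Account VALUES ('A', 100), ('B', 100)")

def transfer(amount):
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE Account SET Balance = Balance - ? WHERE Owner = 'A'",
                     (amount,))
        conn.execute("UPDATE Account SET Balance = Balance + ? WHERE Owner = 'B'",
                     (amount,))
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # undo the partial transfer

transfer(30)
total = conn.execute("SELECT SUM(Balance) FROM Account").fetchone()[0]
print(total)  # 200 -- money moved, none created or destroyed
```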
19. Metadata - What Is It?
Metadata is essential for making everything happen in the DB.
One must distinguish between two kinds of metadata. One is Systems Metadata, the other is Business Metadata.
Systems Metadata include
- Where the data came from
- How the data were changed
- How the data are stored
- How data are mapped
- Who owns the data
- Who can access the data
- Data usage history
- Data usage statistics
Systems Metadata is absolutely critical to make the DBMS work.
Business Metadata include
- What data are available
- Where data are located
- What the data mean
- How to access the data
- Predefined reports
- Predefined queries
- How current the data are
Business Metadata are critical for businesses, for example in Data Warehouse applications.
Later in this course, we are going to talk about metadata and the use of it in data warehousing.