
KU 5TH SEM ASSIGNMENT - BSIT (TA) - 53 (DATA WAREHOUSING & DATA MINING)

Assignment: TA (Compulsory)

1. With a neat diagram, explain the main parts of the computer.
A computer has three basic parts -
i). The central processing unit (CPU), which performs all the arithmetic and logical operations. It can be thought of as the heart of any computer, and computers are often identified by the type of CPU they use.
ii). The memory, which holds the programs and data. Almost all the computers that we come across these days are what are known as "stored program computers": the programs are stored beforehand in the memory, and the CPU accesses these programs line by line and executes them.
iii). The input/output devices: these devices facilitate the interaction of the user with the computer. The input devices are used to send information to the computer, while the output devices accept the processed information from the computer and make it available to the user.
Diagram: [block diagram showing the CPU connected to the memory and to the input/output devices]


2. Briefly explain the types of memories.
There are two types of memory: the primary memory, which is embedded in the computer and is the main source of data for the CPU, and the secondary memory, such as floppy disks and CDs, which can be carried around and used on different computers. Secondary memory costs much less than primary memory, but the CPU can access data only from the primary memory. The main advantage of computer memories, both primary and secondary, is that they can store data indefinitely and accurately.

3. Describe the basic concept of databases.
The Concept of Database :-
We have seen in the previous section how data can be stored in a computer. Such stored data becomes a "database" - a collection of data. For example, if all the marks scored by all the students of a class are stored in the computer memory, it can be called a database. From such a database, we can answer questions like: Who has scored the highest marks? In which subject have the maximum number of students failed? Which students are weak in more than one subject? Of course, appropriate programs have to be written to do these computations. Also, as the database becomes very large and more and more data keeps getting included at different periods of time, there are several other problems about "maintaining" the data, which will not be dealt with here.
Since handling such databases has become one of the primary jobs of the computer in recent years, it becomes difficult for the average user to keep writing such programs. Hence, special languages - called database query languages - have been devised, which make such programming easy; these languages help in getting specific "queries" answered easily.
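For illustration, here is a minimal sketch of such queries using Python's built-in sqlite3 module; the table and the student names are invented for the example.

import sqlite3

# A tiny in-memory database of student marks (all names are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO marks VALUES (?, ?, ?)", [
    ("Anita", "Maths", 91), ("Anita", "Physics", 34),
    ("Ravi",  "Maths", 78), ("Ravi",  "Physics", 30),
])

# "Who has scored the highest marks?"
print(conn.execute("SELECT student, MAX(score) FROM marks").fetchone())

# "In which subject have the maximum number of students failed?" (pass mark 35)
print(conn.execute("""SELECT subject, COUNT(*) AS failures
                      FROM marks WHERE score < 35
                      GROUP BY subject ORDER BY failures DESC""").fetchone())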

4. With an example, explain the different views of data.
Data is normally stored in tabular form. Unless storage in other formats becomes advantageous, we store data in what are technically called "relations" or, in simple terms, "tables".
Views are mainly of two types:
i). Simple view
ii). Complex view
Simple view:
    - It is created by selecting only one table.
    - It does not contain functions.
    - DML operations (SELECT, INSERT, UPDATE, DELETE, MERGE, CALL, LOCK TABLE) can be performed through a simple view.
Complex view:
    - It is created by selecting more than one table.
    - It can contain functions.
    - DML operations cannot always be performed through a complex view.
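As a rough sketch of the two kinds of views, using sqlite3 with invented table names (note that SQLite views are read-only, so the DML point above applies to engines such as Oracle):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE marks (student_id INTEGER, subject TEXT, score INTEGER)")
conn.execute("INSERT INTO students VALUES (1, 'Anita')")
conn.execute("INSERT INTO marks VALUES (1, 'Maths', 91)")

# Simple view: selects from a single table and uses no functions
conn.execute("CREATE VIEW v_students AS SELECT id, name FROM students")

# Complex view: joins more than one table and uses an aggregate function
conn.execute("""CREATE VIEW v_totals AS
    SELECT s.name, SUM(m.score) AS total
    FROM students s JOIN marks m ON m.student_id = s.id
    GROUP BY s.name""")

print(conn.execute("SELECT * FROM v_totals").fetchall())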

5. Briefly explain the concept of normalization.
Normalization is dealt with in several chapters of any book on database management systems. Here we will take the simplest definition, which suffices for our purpose: no field should have subfields.

Again, consider a student table in which, under the field "marks", there are three subfields: marks for subject1, marks for subject2 and marks for subject3.

However, it is preferable to split these subfields into regular fields of their own. Quite often, the original table, which comes with subfields, will have to be modified suitably by this process of "normalization".
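A minimal sketch of this splitting in Python, with invented field names:

# Unnormalised record: the "marks" field has subfields
unnormalised = {"Anita": {"subject1": 91, "subject2": 34, "subject3": 77}}

# Normalised form: one row per (student, subject, score); every field is atomic
normalised = [(student, subject, score)
              for student, marks in unnormalised.items()
              for subject, score in marks.items()]
print(normalised)   # [('Anita', 'subject1', 91), ('Anita', 'subject2', 34), ...]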

6. Explain the concept of the data warehouse delivery process in detail.
The concept of the data warehouse delivery process :-
This section deals with the data warehouse from a different viewpoint - how the different components that go into it enable the building of a data warehouse. The study helps us in two ways:
   i) to have a clear view of the data warehouse building process.
   ii) to understand the working of the data warehouse in the context of its components.
Now we look at the concepts in detail :-
   i). IT strategy : The company should have an overall IT strategy, and data warehousing has to be a part of that overall strategy.
   ii). Business case analysis : This looks obvious, but is most often misunderstood. An overall understanding of the business and the importance of the various components therein is a must. This ensures that one can clearly justify the appropriate level of investment that goes into the data warehouse design, and also the amount of returns accruing from it.
   iii). Education : This has two roles to play - one is to make people, especially top-level policy makers, comfortable with the concept; the second is to aid the prototyping activity.
   iv). Business requirements : As discussed earlier, it is essential that the business requirements are fully understood by the data warehouse planner. This ensures that the warehouse is incorporated adequately into the overall setup of the organization.
   v). Technical blueprint : This is the stage where the overall architecture that satisfies the requirements is delivered.
   vi). Building the vision : Here the first physical infrastructure becomes available. The major infrastructure components are set up, and the first stages of loading and generation of data start.
   vii). History load : Here the system is made fully operational by loading the required history into the warehouse - i.e. whatever data is available from previous years is put into the data warehouse to make it fully operational.
   viii). Ad hoc query : Now we configure a query tool to operate against the data warehouse.
   ix). Automation : This phase automates the various operational processes, such as -
a) Extracting and loading data from the sources.
b) Transforming the data into a form suitable for analysis.
c) Backing up, restoration and archiving.
d) Generating aggregations.
e) Monitoring query profiles.
   x). Extending scope : There is no single mechanism by which this is achieved. As and when needed, a new set of data may be added, new formats may be included, or even major changes may be involved.
   xi). Requirement evolution : Business requirements will constantly change during the life of the warehouse. Hence, the process that supports the warehouse also needs to be constantly monitored and modified.

7. What are the three major activities of a data warehouse? Explain.

The three major activities of a data warehouse are :-
  i) Populating the warehouse (i.e. inclusion of data).
  ii) Day-to-day management of the warehouse.
  iii) The ability to accommodate changes.

   i). The processes that populate the warehouse have to be able to extract the data, clean it up, and make it available to the analysis systems. This is done on a daily / weekly basis, depending on the quantum of data to be incorporated.

   ii). The day-to-day management of the data warehouse is not to be confused with the maintenance and management of hardware and software. When large amounts of data are stored and new data are continually added at regular intervals, maintaining the "quality" of the data becomes an important element.
   iii). The ability to accommodate changes implies that the system is structured in such a way as to cope with future changes without the entire system being remodeled. Based on these, one can identify the processes that a typical data warehouse scheme should support.

8. Explain the extract and load process of the data warehouse.
Extract and load process : This forms the first stage of data warehousing. External physical systems, like the sales counters which give the sales data, the inventory systems which give inventory levels etc., constantly feed data to the warehouse. Needless to say, the format of this external data has to be monitored and modified before loading it into the warehouse. The data warehouse must extract the data from the source systems, load it into its databases, remove unwanted fields (either because they are not needed or because they are already there in the database), add new fields / reference data, and finally reconcile it with the other data. We shall see a few more details of these broad actions in the subsequent paragraphs.
     i). A mechanism should be evolved to control the extraction of data, check its consistency etc. For example, in some systems, the data is not authenticated until it is audited.
     ii). Having a set of consistent data is equally important. This especially matters when several online systems are feeding the data.
     iii). Once data is extracted from the source systems, it is loaded into a temporary data storage before it is "cleaned" and loaded into the warehouse.
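A minimal extract-transform-load sketch in Python; the source records, fields and staging table are invented for the illustration.

import sqlite3

def extract(source_rows):
    # Pull raw records from a source system (here, just a list)
    return list(source_rows)

def transform(rows):
    # Drop unwanted fields, add a derived field, reject inconsistent rows
    cleaned = []
    for r in rows:
        if r["qty"] < 0:                 # inconsistent record: reject it
            continue
        cleaned.append((r["item"], r["qty"], r["qty"] * r["price"]))
    return cleaned

def load(conn, rows):
    # Load into temporary / staging storage before final reconciliation
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (item TEXT, qty INTEGER, amount REAL)")
source = [{"item": "soap", "qty": 3, "price": 20.0},
          {"item": "soap", "qty": -1, "price": 20.0}]    # a bad record
load(conn, transform(extract(source)))
print(conn.execute("SELECT * FROM staging_sales").fetchall())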

9. In what ways does data need to be cleaned up and checked? Explain briefly.
Data needs to be cleaned up and checked in the following ways :-
     i) It should be consistent with itself.
     ii) It should be consistent with other data from the same source.
     iii) It should be consistent with other data from other sources.
     iv) It should be consistent with the information already available in the data warehouse.

While it is easy to list out the needs of "clean" data, it is more difficult to set up systems that automatically clean up the data. The normal course is to suspect the quality of the data if it does not meet the usual standards of common sense, or if it contradicts data from other sources or data already available in the data warehouse. Normal intuition doubts the validity of the new data, and effective measures like rechecking, retransmission etc. are undertaken. When none of these is possible, one may even resort to ignoring the entire set of data and getting on with the next set of incoming data.
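A small sketch of such consistency checks; the record fields and the tolerance are invented.

# Flag records that fail simple consistency checks (illustrative fields)
def check_record(rec, warehouse_prices):
    problems = []
    # i) consistent with itself: the amount must match qty * price
    if rec["amount"] != rec["qty"] * rec["price"]:
        problems.append("inconsistent with itself")
    # iv) consistent with what the warehouse already knows about this item
    known = warehouse_prices.get(rec["item"])
    if known is not None and abs(rec["price"] - known) > 0.5 * known:
        problems.append("contradicts data already in the warehouse")
    return problems

rec = {"item": "soap", "qty": 3, "price": 20.0, "amount": 55.0}
print(check_record(rec, {"soap": 20.0}))   # -> ['inconsistent with itself']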

10. Explain the architecture of a data warehouse.
The architecture of a data warehouse is indicated below. Before we proceed further, we should be clear about the concept of architecture. It only gives the major items that make up a data warehouse. The size and complexity of each of these items depend on the actual size of the warehouse itself, the specific requirements of the warehouse and the actual details of implementation.

Diagram: [the major components of the data warehouse architecture]

11. Briefly explain the functions of each manager of the data warehouse.

The warehouse manager : The warehouse manager is a component that performs all the operations necessary to support the warehouse management process. Unlike the load manager, the warehouse management process is driven by the extent to which the operational management of the data warehouse has been automated.

The warehouse manager can easily be termed the most complex of the warehouse components, and it performs a variety of tasks, a few of which are listed below.
     i) Analyze the data to confirm data consistency and data integrity.
     ii) Transform and merge the source data from the temporary data storage into the warehouse.
     iii) Create indexes, cross references, partition views etc.
     iv) Check for normalization.
     v) Generate new aggregations, if needed.
     vi) Update all existing aggregations.
     vii) Create backups of data.
     viii) Archive the data that needs to be archived.

12. Explain the star schema to represent the sales analysis.
Star schemas are database schemas that structure the data to exploit a typical decision support enquiry. When the components of typical enquiries are examined, a few similarities stand out.
     i) The queries examine a set of factual transactions - sales, for example.
     ii) The queries analyze the facts in different ways - by aggregating them on different bases / graphing them in different ways.

The central concept of most such transactions is a "fact table". The surrounding references are called dimension tables. The combination can be called a star schema.
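A minimal star schema for sales analysis, sketched with sqlite3; all table and column names are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension tables: the reference data surrounding the fact table
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT)")
# Fact table: one row per sales transaction, keyed by the dimension tables
conn.execute("""CREATE TABLE fact_sales (
    product_id INTEGER, store_id INTEGER, date_id INTEGER,
    qty INTEGER, amount REAL)""")
conn.execute("INSERT INTO dim_date VALUES (1, 'Jan')")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 2, 40.0), (2, 1, 1, 1, 20.0)])

# A typical decision-support enquiry aggregates the facts along a dimension
print(conn.execute("""SELECT d.month, SUM(f.amount)
                      FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
                      GROUP BY d.month""").fetchall())   # -> [('Jan', 60.0)]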


13. What do you mean by partitioning of data? Explain briefly.
Partitioning of data :-
In most warehouses, the size of the fact tables tends to become very large. This leads to several problems of management, backup, processing etc. These difficulties can be overcome by partitioning each fact table into separate partitions.

Data warehouses exploit this idea by partitioning the large volume of data into data sets. For example, data can be partitioned on a weekly / monthly basis, so as to minimize the amount of data scanned before answering a query. This technique minimizes the data to be scanned without the overhead of using an index, and it improves the overall efficiency of the system. However, having too many partitions can be counterproductive, so an optimal size and number of partitions is of vital importance.

Partitioning generally helps in the following ways.
   i) It assists in better management of data.
   ii) Backup / recovery is easier, since the volumes are smaller.
   iii) Star schemas with partitions produce better performance.
   iv) Since several hardware architectures operate better in a partitioned environment, the overall system performance improves.
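As a toy sketch of monthly partitioning in Python; the records and partition keys are invented for the illustration.

from collections import defaultdict

# Route each fact row into a monthly partition, so that a query bounded
# to one month scans only that partition instead of the whole fact table
partitions = defaultdict(list)
facts = [("2024-01-15", "soap", 120.0), ("2024-02-03", "soap", 80.0)]
for date, item, amount in facts:
    partitions[date[:7]].append((date, item, amount))   # key = "YYYY-MM"

# A January query now touches only the "2024-01" partition
print(sum(amount for _, _, amount in partitions["2024-01"]))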

14. Describe the terms data mart and metadata.
Data mart :-

A data mart is a subset of the information content of a data warehouse, stored in its own database. The data of a data mart may have been collected through the warehouse or, in some cases, directly from the source. In a crude sense, if you consider a data warehouse as a wholesale shop of data, a data mart can be thought of as a retailer.

Metadata :-

Metadata is simply data about data. Data normally describes objects - their quantity, size, how they are stored etc. Similarly, metadata stores data about how the data (of the objects) is stored, etc.

Metadata is useful in a number of ways. It can map data sources to the common view of information within the warehouse. It is helpful in query management, to direct a query to the most appropriate source, etc.

The structure of metadata is different for each process. This means that for each volume of data there are multiple sets of metadata describing the same volume. While this is a very convenient way of managing data, managing the metadata itself is not an easy task.
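As a small illustration (all names invented), a metadata entry might map source-system fields to the warehouse's common view:

# Metadata sketch: map source-system fields to the warehouse's common view
field_map = {
    ("pos_system", "SaleAmt"):   ("fact_sales", "amount", "REAL"),
    ("web_orders", "order_val"): ("fact_sales", "amount", "REAL"),
}
# The query manager can consult such a map to direct a query
# to the most appropriate source for the requested information.
print(field_map[("pos_system", "SaleAmt")])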



15. Enlist the differences between fact and dimension.

A fact is a measured transaction (a sale, for example), while a dimension is the reference data against which the facts are analyzed. Whether a given item of data acts as a fact or as a dimension depends on how it is used; the design should ensure that key dimensions are not made into fact tables.

Consider the following example :-

Let us elaborate a little on the example. Consider a customer A. If there is a situation where the warehouse is building profiles of customers, then A becomes a fact - against the name A, we can list his address, purchases, debts etc. One can ask questions like "How many purchases has A made in the last 3 months?". Then A is a fact. On the other hand, if the data is likely to be used to answer questions like "How many customers have made more than 10 purchases in the last 6 months?", where one uses the data of A, as well as of other customers, to give the answer, then A becomes a dimension. The rule in such cases is: avoid making A a candidate key.


16. Explain the design of the starflake schema in detail.

A starflake schema, as we have defined previously, is a schema that uses a combination of denormalized star and normalized snowflake schemas. They are most appropriate in decision support data warehouses. Generally, the detailed transactions are stored within a central fact table, which may be partitioned horizontally or vertically. A series of combining database views is created to allow the user access tools to treat the fact table partitions as a single, large table.

The key reference data is structured into a set of dimensions. These can be referenced from the fact table. Each dimension is stored in a series of normalized tables (snowflakes), with an additional denormalized star dimension table.
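A sketch of such a combining view over two horizontal partitions, using sqlite3 with invented table names:

import sqlite3

conn = sqlite3.connect(":memory:")
# Two horizontal partitions of the fact table (illustrative)
for part in ("fact_sales_2024_01", "fact_sales_2024_02"):
    conn.execute(f"CREATE TABLE {part} (item TEXT, amount REAL)")
conn.execute("INSERT INTO fact_sales_2024_01 VALUES ('soap', 120.0)")
conn.execute("INSERT INTO fact_sales_2024_02 VALUES ('soap', 80.0)")

# The combining view lets access tools see the partitions as one large table
conn.execute("""CREATE VIEW fact_sales AS
    SELECT * FROM fact_sales_2024_01
    UNION ALL
    SELECT * FROM fact_sales_2024_02""")
print(conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone())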



17. What is query redirection? Explain.

Query Redirection :-

One of the basic requirements for the successful operation of a starflake schema (or any schema, for that matter) is the ability to direct a query to the most appropriate source. Note that once the available data grows beyond a certain size, partitioning becomes essential. In such a scenario, in order to optimize the time spent on querying, the queries should be directed to the appropriate partitions that store the data required by the query.

The basic method is to design the access tool in such a way that it automatically determines the locality to which the query is to be redirected.
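A minimal sketch of such redirection; the partition map and the fallback table name are invented.

# Map month-bounded queries to the partition that stores their data
partitions = {"2024-01": "fact_sales_2024_01", "2024-02": "fact_sales_2024_02"}

def redirect(query_month):
    # Fall back to the full table when no partition matches
    return partitions.get(query_month, "fact_sales_all")

print(redirect("2024-01"))   # -> fact_sales_2024_01
print(redirect("2023-12"))   # -> fact_sales_all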



18. In detail, explain the multidimensional schema.

Multidimensional schemas :-

Before we close, we look at the interesting concept of multiple dimensions. This is a very convenient method of analyzing data when it goes beyond the normal tabular relations.

For example, a store maintains a table of the sales of each item over a month, in each of its 10 outlets.

This is a 2-dimensional table. On the other hand, if the company wants the data of all the items sold by its outlets, this can be obtained simply by superimposing the 2-dimensional tables for each of these items - one behind the other. It then becomes a 3-dimensional view.

The query, instead of looking for a 2-dimensional rectangle of data, will then look for a 3-dimensional cuboid of data.

There is no reason why the dimensioning should stop at 3 dimensions. In fact, almost all queries can be thought of as approaching a multidimensional unit of data from a multidimensional volume of the schema.
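A toy sketch of such a cube, using numpy; the item / outlet / day sizes and the query bounds are invented for the illustration.

import numpy as np

# Sales cube indexed by (item, outlet, day): 4 items, 10 outlets, 30 days
cube = np.random.randint(0, 50, size=(4, 10, 30))

# A 2-dimensional table is one slice of the cube:
# sales of item 0 across all outlets and days
table_item0 = cube[0]

# A 3-dimensional query carves out a cuboid:
# items 0-1, outlets 2-4, the first week
cuboid = cube[0:2, 2:5, 0:7]
print(cuboid.sum())   # total sales within that cuboid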

19. Why is partitioning needed in a large data warehouse?

Partitioning is needed in any large data warehouse to ensure that performance and manageability are improved. It also helps query redirection to send queries to the appropriate partition, thereby reducing the overall time taken for query processing.

20. Explain the types of partitioning in detail.
i). Horizontal partitioning :-
This essentially means that the table is partitioned after the first few thousand entries, then the next few thousand entries, and so on. This is because, in most cases, not all the information in the fact table is needed all the time. Thus horizontal partitioning helps reduce the query access time by directly cutting down the amount of data to be scanned by the queries.
ii). Vertical partitioning :-
As the name suggests, a vertical partitioning scheme divides the table vertically - i.e. each row is divided into two or more partitions.
iii). Hardware partitioning :-
Needless to say, the data warehouse design process should try to maximize the performance of the system. One way to ensure this is to optimize the database design with respect to a specific hardware architecture.


21. Explain the mechanism of row splitting.

Row splitting :-

The method involves identifying the not-so-frequently used fields and putting them into another table. This ensures that the frequently used fields can be accessed more often, with much less computation time.
It can be noted that row splitting does not reduce or increase the overall storage needed, whereas normalization may involve a change in the overall storage space needed. In row splitting the mapping is one to one, whereas normalization may produce one-to-many relationships.
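A sketch of row splitting with sqlite3; the table and field names are invented.

import sqlite3

conn = sqlite3.connect(":memory:")
# Frequently used fields stay in the main table ...
conn.execute("CREATE TABLE sales_hot (sale_id INTEGER PRIMARY KEY, amount REAL)")
# ... while rarely used fields are split out, mapped one to one by the key
conn.execute("CREATE TABLE sales_cold (sale_id INTEGER PRIMARY KEY, remarks TEXT)")
conn.execute("INSERT INTO sales_hot VALUES (1, 120.0)")
conn.execute("INSERT INTO sales_cold VALUES (1, 'festival discount applied')")

# The full row is recovered, when needed, by joining on the shared key
print(conn.execute("""SELECT h.amount, c.remarks
                      FROM sales_hot h JOIN sales_cold c
                      ON c.sale_id = h.sale_id""").fetchall())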
22. Explain the guidelines used for hardware partitioning.
Guidelines used for hardware partitioning :-
Needless to say, the data warehouse design process should try to maximize the performance of the system. One way to ensure this is to optimize the database design with respect to a specific hardware architecture. Obviously, the exact details of the optimization depend on the hardware platform. Normally the following guidelines are useful :-
i). Maximize the processing, disk and I/O operations.
ii). Reduce bottlenecks at the CPU and I/O.

23. What is aggregation? Explain the need for aggregation. Give an example.
Aggregation : Data aggregation is an essential component of any decision support data warehouse. It helps us ensure cost-effective query performance, which means that the costs incurred to get the answer to a query are more than offset by the benefits of that answer. Data aggregation attempts to do this by reducing the processing power needed to process queries. However, too much aggregation leads to unacceptable levels of operational cost, while too little aggregation may not improve performance to the required level. A fine balance between the two is essential to meet the requirements stated above. One thumb rule often suggested is that about three out of every four queries should be optimized by the aggregation process, while the fourth takes its own time to get processed. A second, though minor, advantage of aggregation is that it allows us to see the overall trends in the data. While looking at individual data items, such overall trends may not be obvious, whereas aggregated data helps us draw such conclusions easily.
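A small sqlite3 sketch of a pre-computed aggregation; the tables and figures are invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (month TEXT, item TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    ("2024-01", "soap", 120.0), ("2024-01", "oil", 60.0), ("2024-02", "soap", 80.0)])

# The summary table is built once, when the warehouse is loaded ...
conn.execute("""CREATE TABLE summary_monthly AS
                SELECT month, SUM(amount) AS total
                FROM fact_sales GROUP BY month""")

# ... so a trend query scans a few summary rows instead of every transaction
print(conn.execute("SELECT * FROM summary_monthly ORDER BY month").fetchall())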

24. Explain the different aspects of designing the summary table.

Summary tables are designed by following the steps given below :-
i). Decide the dimensions along which aggregation is to be done.
ii). Determine the aggregation of multiple facts.
iii). Aggregate multiple facts into the summary table.
iv). Determine the level of aggregation and the extent of embedding.
v). Design time into the table.
vi). Index the summary table.

25. Give the reasons for creating the data mart.

The following are the reasons for which data marts are created :-
   i). Since the volume of data scanned is small, they speed up query processing.
   ii). Data can be structured in a form suitable for a user access tool.
   iii). Data can be segmented or partitioned so that it can be used on different platforms, and so that different control strategies become applicable.


26. Explain the two stages in setting up data marts.
There are two stages in setting up data marts :-
i). Decide whether data marts are needed at all. The facts listed above may help you decide whether it is worthwhile to set up data marts or to operate from the warehouse itself. The problem is almost similar to that of a merchant deciding whether or not he wants to set up retail shops.
ii). If you decide that setting up data marts is desirable, then the following steps have to be gone through before you can freeze the actual strategy of data marting.
a) Identify the natural functional splits of the organization.
b) Identify the natural splits of the data.
c) Check whether the proposed access tools have any special database structures.
d) Identify the infrastructure issues, if any, that can help in identifying the data marts.
e) Look for restrictions on access control. They can serve to demarcate the warehouse details.

27. What are the disadvantages of data marts?

There are certain disadvantages :-
i). The cost of setting up and operating data marts is quite high.
ii). Once a data marting strategy is put in place, the data mart formats become fixed. It may be fairly difficult to change the strategy later, because the data mart formats also have to be changed.


28. What is the role of access control issues in data mart design?
Role of access control issues in data mart design :-
This is one of the major constraints in data mart design. Any data warehouse, with its huge volume of data, is more often than not subject to various access controls as to who can access which part of the data. The easiest case is where the data is partitioned so clearly that the user of each partition cannot access any other data. In such cases, each partition can be put in a data mart, and its user can access only his own data.
In the data warehouse, the data pertaining to all these marts is stored, but the partitioning is retained. If a super user wants an overall view of the data, suitable aggregations can be generated.

29. Explain the purpose of using metadata in detail.

Metadata will be used for the following purposes :-
i). Data transformation and loading.
ii). Data management.
iii). Query generation.


30. Explain the concept of metadata management.
Metadata should be able to describe the data as it resides in the data warehouse. This helps the warehouse manager to control data movements. The purpose of the metadata is to describe the objects in the database. Some of these descriptions are listed here.
- Tables
    - Columns
        - Names
        - Types
- Indexes
    - Columns
        - Name
        - Type
- Views
    - Columns
        - Name
        - Type
- Constraints
    - Name
    - Type
    - Table
        - Columns
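A sketch of these object descriptions as a Python structure; all object names are invented.

# Metadata describing warehouse objects (all names are illustrative)
metadata = {
    "tables": {"fact_sales": {"columns": {"amount": "REAL", "month": "TEXT"}}},
    "indexes": {"idx_sales_month": {"table": "fact_sales", "columns": ["month"]}},
    "views": {"v_totals": {"columns": {"total": "REAL"}}},
    "constraints": {"pk_sales": {"type": "PRIMARY KEY", "table": "fact_sales",
                                 "columns": ["sale_id"]}},
}
print(metadata["tables"]["fact_sales"]["columns"])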

31. How does the query manager use the metadata? Explain in detail.

Metadata is also required to generate queries. The query manager uses the metadata to build a history of all the queries run and to generate a query profile for each user or group of users.
We simply list a few of the commonly used items of query metadata below. The names are self-explanatory.
- Query
    - Table accessed
        - Column accessed
            - Name
            - Reference identifier
    - Restrictions applied
        - Column name
        - Table name
        - Reference identifier
        - Restrictions
    - Join criteria applied
        - Column name
        - Table name
        - Reference identifier
        - Column name
        - Table name
        - Reference identifier
    - Aggregate function used
        - Column name
        - Reference identifier
        - Aggregate function
    - Group by criteria
        - Column name
        - Reference identifier
        - Sort direction
    - Syntax
    - Resources
        - Disk
            - Read
            - Write
            - Temporary
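A sketch of the kind of query-profile record the query manager might build from such metadata; all field names are invented.

query_profile = {
    "query": "SELECT month, SUM(amount) FROM fact_sales GROUP BY month",
    "tables_accessed": ["fact_sales"],
    "columns_accessed": ["month", "amount"],
    "aggregate_functions": ["SUM"],
    "group_by": ["month"],
    "resources": {"disk_reads": 1024, "disk_writes": 0},
    "user": "analyst_01",
}
print(query_profile["aggregate_functions"])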

32. Why do we need different managers for a data warehouse? Explain.

Need for managers for a data warehouse :-
Data warehouses are not just large databases. They are complex environments that integrate many technologies. They are not static, but keep changing continuously, both in content and in structure. Thus, there is a constant need for maintenance and management. Since huge amounts of time, money and effort are involved in the development of a data warehouse, sophisticated management tools are always justified in its case.
When computer systems were in their initial stages of development, there used to be an army of human managers who went around doing all the administration and management. But such a scheme became both unwieldy and prone to errors as the systems grew in size and complexity. Further, most of the management principles were ad hoc in nature and were subject to human error and fatigue.

33. With a neat diagram, explain the boundaries of the process managers.

A schematic diagram defines the boundaries of the three types of process managers - the load manager, the warehouse manager and the query manager.

Diagram: [schematic showing the boundaries of the load manager, the warehouse manager and the query manager]

34. Explain the responsibilities of each manager of the data warehouse.
Warehouse manager :-
The warehouse manager is responsible for maintaining the data of the warehouse. It should also create and maintain a layer of metadata. Some of the responsibilities of the warehouse manager are:
    - Data movement
    - Metadata management
    - Performance monitoring
    - Archiving
Data movement includes the transfer of data within the warehouse, aggregation, and the creation and maintenance of tables, indexes and other objects of importance. The warehouse manager should be able to create new aggregations as well as remove old ones. The creation of additional rows / columns, keeping track of the aggregation processes and the creation of metadata are also its functions.

35. What are the different system management tools used for a data warehouse?
The different system management tools used for a data warehouse are :-
i). Configuration managers
ii). Schedule managers
iii). Event managers
iv). Database managers
v). Backup recovery managers
vi). Resource and performance monitors
