Discussion Topic:
Inmon (2011) identified that there are architectural, economical and technical considerations as part of data warehousing. Define two of each and provide examples.
Discussion Post:
Several considerations must be studied when designing a data warehouse. They fall into three groups: architectural, technical, and economic.
Architecture: The data warehouse design and architecture should reflect the business requirements. Building a data warehouse on a flawed architecture is a very serious mistake, because it is difficult and costly to rectify once it has been implemented (Khan, 2003). A top-down approach is preferred when it is vital for an organization to analyze data from multiple departments. A bottom-up approach is favored when the priority is to serve the analytical needs of individual business functions. Moreover, a data warehouse should be scalable and flexible: as usage grows over time and data volumes increase rapidly, it should be able to scale adequately.
Technical: Data warehouse implementation should not be governed by technical considerations. Technology should be treated as an enabler rather than as the solution (Khan, 2003). At the same time, a data warehouse should be implemented using a well-established methodology. Two of the most common technical considerations are the grain of the fact table and the update strategy for the data. The grain of the fact table is usually governed by the measurements that need to be stored in it (Kimball Group, 2016). Similarly, how frequently data is refreshed in the data warehouse depends on the operational requirements and the limitations of the underlying hardware resources. Some elements can be refreshed frequently, such as nightly or in real time, whereas others, such as aggregated and summarized data, can be refreshed weekly or monthly.
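To make the idea of grain and refresh frequency concrete, here is a minimal Python sketch. The sales line items, the store and product codes, and the function names are all hypothetical and not taken from Khan (2003) or the Kimball Group. It assumes a fact table declared at a grain of one row per date, store, and product, which could be refreshed nightly, and a monthly summary derived from it, which could be refreshed less often.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales line items: (sale_date, store, product, quantity).
line_items = [
    (date(2024, 1, 1), "S1", "P1", 2),
    (date(2024, 1, 1), "S1", "P1", 3),
    (date(2024, 1, 1), "S2", "P1", 1),
    (date(2024, 1, 2), "S1", "P2", 4),
]

def to_daily_grain(items):
    """Roll line items up to a grain of one row per date, store, and product."""
    totals = defaultdict(int)
    for sale_date, store, product, quantity in items:
        totals[(sale_date, store, product)] += quantity
    return dict(totals)

def to_monthly_summary(daily_rows):
    """Summarize the daily fact rows to one row per year, month, and product."""
    totals = defaultdict(int)
    for (sale_date, _store, product), quantity in daily_rows.items():
        totals[(sale_date.year, sale_date.month, product)] += quantity
    return dict(totals)

daily_facts = to_daily_grain(line_items)           # candidate for a nightly refresh
monthly_summary = to_monthly_summary(daily_facts)  # candidate for a monthly refresh
```

Choosing the daily grain means the two line items for the same store and product on the same day collapse into one fact row, while the monthly summary can always be rebuilt from the finer-grained rows.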
Economic: A data warehouse project should be undertaken only after it has been justified by a cost/benefit analysis. A well-justified project improves the probability of success and of support from business end users (Khan, 2003). Economic considerations can also determine whether to buy hardware and build the data warehouse in-house or to use a cloud-based data warehouse solution. Conventional data warehouses, which are based on centralized proprietary databases, are costly when storing vast quantities of data. Therefore, the decision whether to use a data warehouse for unstructured, sensor-generated streaming data can also be driven by cost.
References
Khan, A. (2003). Data Warehousing 101: Concepts and Implementation. San Jose, CA: Khan Consulting and Publishing, LLC.
Kimball Group. (2016). Grain. Retrieved from https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/grain/
Reply to Discussion Post (120-150 words with references):
Task 2: 130-150 words with reference
Discussion Topic:
Define the importance of ontologies for unstructured data warehouses. Provide an example of unstructured data and the use of an ontology used to manage this data.
Discussion Post:
An ontology is a set of concepts within a specified domain, together with the relationships among those concepts. It is a structural framework that is usually used to organize information. An ontology is used to describe the entities that exist in a specific domain and to define the domain itself.
With unstructured data, ontologies make navigation easy, because a user can move through the ontology from one concept to another. Ontologies also help improve data management. They are easy to extend, since new relationships and concept matches can be added to an existing ontology. Taxonomies and ontologies both play an important part in handling unstructured data: they make it possible to process unstructured data analytically (Ontotext, 2018).
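As a rough illustration of navigating from one concept to another, the sketch below stores a tiny ontology as (subject, relationship, object) triples. The concepts, the relationship names, and the choice of customer emails as the unstructured data are all hypothetical and are not drawn from Ontotext (2018).

```python
# Hypothetical mini-ontology for customer emails, stored as
# (subject, relationship, object) triples.
triples = [
    ("Complaint", "is_a", "CustomerEmail"),
    ("Inquiry", "is_a", "CustomerEmail"),
    ("CustomerEmail", "is_a", "UnstructuredDocument"),
    ("Complaint", "mentions", "Product"),
    ("Product", "has_synonym", "Item"),
]

def related(concept, relationship):
    """Return the concepts reached from `concept` by one relationship."""
    return [o for s, r, o in triples if s == concept and r == relationship]

def ancestors(concept):
    """Follow is_a links upward to collect every broader concept."""
    found = []
    frontier = related(concept, "is_a")
    while frontier:
        current = frontier.pop(0)
        if current not in found:
            found.append(current)
            frontier.extend(related(current, "is_a"))
    return found

broader = ancestors("Complaint")
```

Extending the ontology is a matter of appending another triple, which is one way to read the claim that relationships and concept matches are easy to add.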
References
Ontotext. (2018). What are Ontologies? Retrieved from Ontotext: https://www.ontotext.com/knowledgehub/fundamentals/what-are-ontologies/
Reply to Discussion Post (120-150 words with references):
Task 3: 130-150 words with reference
Discussion Topic:
Search the web for an instance involving the use of data mining for cluster or outlier analysis. (One good example is fraud detection). Describe the example and relate what the impact was. Provide the link. Use references and justification to support your point of view.
Discussion Post:
When we observe the world around us, we naturally tend to organize, group, differentiate and catalog what we see in order to understand it better. Similarly, clustering helps marketers understand and grow their customer base by segmenting customers according to historical purchases, interests or activity, so that each segment can be targeted separately.
One instance of data mining for cluster analysis is a study that segmented prepaid telecom customers using the k-means algorithm (Băcilă et al., 2012). The study covered natural persons who are prepaid subscribers. These customers have no contractual relationship with the telecom carrier and buy credit in advance. The analysis excluded subscribers who had not recharged within the past three months and had spent nothing on calls, SMS or Internet in that period. Its aims were to identify subscriber profiles in the overall population and to determine the efficiency of k-means cluster analysis on high data volumes. Subscribers were classified into categories using four variables, each measured over six months: the sum of the amounts recharged, the value of SMS sent, the value of Internet traffic, and the value of calls made. To group subscribers into segments, the study used the non-hierarchical k-means clustering method, which partitions the population so that the variation within each cluster is kept to a minimum. The resulting segments therefore reflect the subscribers' behavioral values (recharge values, call values, SMS expenditure and Internet expenditure). The ANOVA analysis ranked the factors' contribution to splitting the population into groups in the following order: recharge value, call value, Internet expenditure and sent SMS value. The Sig. value was smaller than 0.05, so the results are statistically significant.
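The segmentation step can be sketched in a few lines of Python. This is a plain textbook k-means (Lloyd's algorithm), not the study's actual procedure: the subscriber values are invented, the four columns simply mirror the variables named above, and the software, initialization and scaling used in the study are not known here.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) using Euclidean distance.

    Centroids start at the first k points so the run is deterministic;
    real analyses usually use random or k-means++ initialization.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical subscribers: (recharge, SMS, Internet, calls) over 6 months.
subscribers = [
    (10.0, 1.0, 0.5, 8.0),
    (200.0, 20.0, 60.0, 110.0),
    (12.0, 2.0, 1.0, 9.0),
    (210.0, 25.0, 55.0, 120.0),
    (9.0, 1.5, 0.0, 7.0),
]
labels, centroids = kmeans(subscribers, k=2)
```

With this toy data the low-spending and high-spending subscribers end up in separate clusters. In practice the variables would usually be standardized first, so that the one with the largest range does not dominate the distance.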
Reference:
Băcilă, M.-F., et al. (2012). Prepaid Telecom Customer Segmentation Using the K-Mean Algorithm. Retrieved from https://www.researchgate.net/publication/268445170_Prepaid_Telecom_Customer_Segmentation_Using_the_K-Mean_Algorithm
Reply to Discussion Post (120-150 words with references):
Task 4: 130-150 words with reference
Discussion Topic:
Search the web for an instance involving the use of data mining for cluster or outlier analysis. (One good example is fraud detection). Describe the example and relate what the impact was. Provide the link. Use references and justification to support your point of view.
Discussion Post:
Clustering is a data mining technique that automatically groups objects with similar characteristics into meaningful or useful clusters. Now that technology has become an integral part of business processes, the transfer of information has become more complicated, so data mining techniques can play an important role. Various industries use clustering to solve challenging organizational problems, and the stock market is one good example. Hajizadeh, Davari and Shahrabi (2010) describe a pairwise clustering approach applied to the companies of the Dow Jones index in order to identify similar temporal behavior in traded stock prices. The objective of this analysis is to understand the underlying dynamics that govern the companies' stock prices. In particular, it is useful to find, within a given stock market index, groups of companies that share similar temporal behavior, and clustering is a good strategy for doing so. The chaotic map clustering algorithm is used: a map is associated with each company, and the correlation coefficients of the financial time series are associated with the coupling strengths between maps. Simulating the chaotic map dynamics gives rise to a natural partition of the data, as companies belonging to the same industrial branch are often grouped together. The identified clusters of companies within a given stock market index can be exploited in portfolio optimization strategies (Hajizadeh, Davari and Shahrabi, 2010).
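The sketch below shows the general idea of grouping companies by the correlation of their return series. It is a deliberately simplified stand-in and not the chaotic map clustering algorithm described above: it just links any two companies whose Pearson correlation meets a threshold and reports the connected groups. The tickers, the returns and the 0.8 threshold are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_groups(series, threshold=0.8):
    """Group names whose series are linked by correlation >= threshold."""
    names = list(series)
    groups = []
    unvisited = set(names)
    for name in names:
        if name not in unvisited:
            continue
        # Grow one group by following strong correlations outward.
        group, stack = [], [name]
        unvisited.discard(name)
        while stack:
            current = stack.pop()
            group.append(current)
            for other in names:
                if (other in unvisited
                        and pearson(series[current], series[other]) >= threshold):
                    unvisited.discard(other)
                    stack.append(other)
        groups.append(sorted(group))
    return groups

# Hypothetical daily returns for four made-up tickers.
returns = {
    "BANK_A": [0.01, -0.02, 0.03, 0.00, 0.01],
    "BANK_B": [0.02, -0.03, 0.04, 0.01, 0.01],
    "TECH_A": [-0.01, 0.03, -0.02, 0.02, -0.01],
    "TECH_B": [-0.02, 0.04, -0.03, 0.02, 0.00],
}
groups = correlation_groups(returns)
```

On this toy data the two hypothetical banks fall into one group and the two hypothetical technology companies into another, which mirrors the observation that companies from the same industrial branch tend to be grouped together.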
Reference
Hajizadeh, E., Davari, H. and Shahrabi, J. (2010). Application of data mining techniques in stock markets: A survey. [online] academia. Available at: https://www.academia.edu/30094548/Application_of_data_mining_techniques_in_stock_markets_A_survey [Accessed 29 Mar. 2019].
Reply to Discussion Post (120-150 words with references):