The Subdue structural discovery system is being used as the data mining tool to study the "Orizaba Fault" in Mexico, as part of a research project of the geologist Dr. Burke Burkart. We analyze the information of the Earthquake Database
hierarchical description of the input data where later substructures are defined in terms of substructures discovered on previous iterations.
There are other components that make Subdue more powerful. We can specify predefined substructures that Subdue looks for in the data. This allows Subdue to use previous knowledge as a starting point and guide the discovery process. Subdue uses an inexact graph match technique so that instances of substructures that are slightly different can be matched. We can also iterate Subdue’s discovery process in order to find more substructures in new iterations that might contain substructures found in previous iterations. Figure 1 shows a simple example of Subdue’s operation. Subdue finds four instances of the triangle-on-square substructure in the geometric figure. The graph representation used to describe the substructure, as well as the input graph, is shown in the middle.
Figure 1: Subdue’s Example (vertices: objects or attributes; edges: relationships; 4 instances of the substructure)
The Earthquake Database
The earthquake database contains information collected from several catalogs (gs.gov). These catalogs were provided by sources like the National Geophysical Data Center of the National Oceanic and Atmospheric Administration (NOAA). The database has records of earthquakes from 2000 B.C. through the current week. An earthquake record consists of 35 fields: source catalog, date, time, latitude, longitude, magnitude, intensity, and seismic related information such as cultural effects, isoseismal map, geographic region and stations used for the computations. Earthquakes of magnitude below 1.0 are not stored in the database; most earthquake magnitudes range from 2.5 to 9.5.
There are some differences between catalogs, e.g. it is possible to find the same earthquake with a slightly different epicenter or magnitude in two catalogs. This is due to the methods and instruments used to compute the data. For example, epicenters and magnitudes are currently calculated by computer programs using seismographic data. The problem is that the formulae in those programs contain assumptions about the earth. If those assumptions are violated, the results can differ.
The size of the Earthquake database is extremely large (e.g. 2.2 MB for the 1995 data alone), so we could not use all the information in our experiments; we used only subsets of information corresponding to periods of time between 6 months and 1 year. We created a relational database containing the earthquake information (the 35 fields). This eased the extraction of information for the experiments, because we can use SQL queries to extract the desired subset of the database. We use the Data Mining approach instead of queries because we do not pre-set the information to be included in the result. This means that we cannot prepare a query that uncovers novel structural patterns in the same way as the Subdue system.
Earthquake Database Knowledge Representation
Every record in the database represents an earthquake event. In this domain we used two kinds of edges to connect the events (earthquakes). The first type of edge is the “near_in_distance” edge, which is set between two events if the distance between them is less than or equal to 75 kilometers. The second type of edge is the “near_in_time” edge, which is set between two events if they happened within 36 hours of each other. We chose those parameters for two reasons. First, together they generate enough edges for the system to find substructures that contain them, but not so many that the graph is overloaded and those edges are the only substructures found. Second, our geology specialist told us that 75 kilometers was reasonable for the size of the area of study and that the effects of one earthquake on another usually appear within 36 hours. An earthquake event in graph form is shown in Figure 2. All the fields of the Earthquake database are included except for the empty fields, which would bias the system because there are so many of them.
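The two edge rules can be illustrated with a minimal Python sketch. It is not Subdue's input format or the authors' code: the haversine distance, the mean Earth radius, the dictionary field names, and the three sample events are all assumptions made up for illustration. Only the 75 kilometer and 36 hour thresholds come from the text.

```python
import math
from datetime import datetime, timedelta
from itertools import combinations

MAX_DISTANCE_KM = 75.0              # distance threshold from the text
MAX_TIME_GAP = timedelta(hours=36)  # time threshold from the text
EARTH_RADIUS_KM = 6371.0            # mean Earth radius (assumption of this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def build_event_edges(events):
    """Return (i, j, label) edges for every pair of events within a threshold."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(events), 2):
        if haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= MAX_DISTANCE_KM:
            edges.append((i, j, "near_in_distance"))
        if abs(a["time"] - b["time"]) <= MAX_TIME_GAP:
            edges.append((i, j, "near_in_time"))
    return edges


# Hypothetical events, invented for this example (not taken from the database).
events = [
    {"lat": 18.85, "lon": -97.10, "time": datetime(1995, 3, 1, 6, 0)},
    {"lat": 18.90, "lon": -97.30, "time": datetime(1995, 3, 2, 10, 0)},
    {"lat": 16.00, "lon": -95.00, "time": datetime(1995, 3, 2, 12, 0)},
]
edges = build_event_edges(events)
for edge in edges:
    print(edge)
```

In this sample the first two events are close in both space and time, so they receive both edge types, while the third event is linked to the others only by “near_in_time” edges. Comparing every pair is quadratic in the number of events, which is acceptable for a sketch but would need indexing for larger subsets.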
Figure 2: Earthquake Knowledge Representation
Earthquake Database Experimental Results
We chose only a subset of the database to run the experiments. For example, we took 6 months of information and ran Subdue on it, so the query to extract the information from the database included the year and month of the earthquakes that we wanted. We started using all the fields of the database, but the year field affected our results because the values were all the same, so we decided to exclude that field.
We wanted to take a random sample from the database (from the 5 years of information and keeping the same graph size) but that would affect the “near_in_time” edges,
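As a rough illustration of this extraction step, the sketch below selects six months of one year with an SQL query and leaves the year column out of the result. The table name `earthquakes`, its column names, the reduced set of fields, and the sample rows are hypothetical; the actual schema of the 35-field database is not given in the text.

```python
import sqlite3

# In-memory database with a hypothetical, heavily reduced schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE earthquakes (catalog TEXT, year INTEGER, month INTEGER, "
    "day INTEGER, latitude REAL, longitude REAL, magnitude REAL)"
)
# Made-up rows, used only to exercise the query.
conn.executemany(
    "INSERT INTO earthquakes VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("A", 1995, 2, 14, 18.8, -97.1, 3.1),
        ("A", 1995, 7, 3, 18.9, -97.3, 4.0),
        ("B", 1994, 3, 9, 16.0, -95.0, 2.7),
    ],
)


def extract_subset(conn, year, first_month, last_month):
    """Select events from one year and month range, omitting the year column."""
    cursor = conn.execute(
        "SELECT catalog, month, day, latitude, longitude, magnitude "
        "FROM earthquakes "
        "WHERE year = ? AND month BETWEEN ? AND ? "
        "ORDER BY month, day",
        (year, first_month, last_month),
    )
    return cursor.fetchall()


# First six months of 1995; the constant year value is not part of the result.
subset = extract_subset(conn, 1995, 1, 6)
print(subset)
```

The year is used only in the `WHERE` clause, so the constant value never reaches the graph that is built from the selected rows.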