A&A 606, A39 (2017)
DOI: 10.1051/0004-6361/201730968
© ESO 2017
Astronomy & Astrophysics
Automated novelty detection in the WISE survey with one-class support vector machines*
A. Solarz¹, M. Bilicki²,¹,³, M. Gromadzki⁴, A. Pollo¹,⁵, A. Durkalec¹, and M. Wypych⁵

1 National Centre for Nuclear Research, ul. Hoża 69, 00-681 Warsaw, Poland
  e-mail: aleksandra.solarz@ncbj.gov.pl; bilicki@strw.leidenuniv.nl
2 Leiden Observatory, Leiden University, 2333 CA Leiden, The Netherlands
3 Janusz Gil Institute of Astronomy, University of Zielona Góra, 65-417 Zielona Góra, Poland
4 Warsaw University Astronomical Observatory, 00-001 Warszawa, Poland
5 The Astronomical Observatory of the Jagiellonian University, 31-007 Kraków, Poland

Received 10 April 2017 / Accepted 15 June 2017

ABSTRACT

Wide-angle photometric surveys of previously uncharted sky areas or wavelength regimes will always bring in unexpected sources – novelties or even anomalies – whose existence and properties cannot be easily predicted from earlier observations. Such objects can be efficiently located with novelty detection algorithms. Here we present an application of such a method, called one-class support vector machines (OCSVM), to search for anomalous patterns among sources preselected from the mid-infrared AllWISE catalogue covering the whole sky. To create a model of expected data we train the algorithm on a set of objects with spectroscopic identifications from the SDSS DR13 database, present also in AllWISE. The OCSVM method detects as anomalous those sources whose patterns – WISE photometric measurements in this case – are inconsistent with the model. Among the detected anomalies we find artefacts, such as objects with spurious photometry due to blending, but more importantly also real sources of genuine astrophysical interest.
Among the latter, OCSVM has identified a sample of heavily reddened AGN/quasar candidates distributed uniformly over the sky and in a large part absent from other WISE-based AGN catalogues. It also allowed us to find a specific group of sources of mixed types, mostly stars and compact galaxies. By combining the semi-supervised OCSVM algorithm with standard classification methods it will be possible to improve the latter by accounting for sources which are not present in the training sample, but are otherwise well represented in the target set. Anomaly detection adds flexibility to automated source separation procedures and helps verify the reliability and representativeness of the training samples. It should thus be considered as an essential step in supervised classification schemes to ensure completeness and purity of produced catalogues.

Key words. infrared: galaxies – infrared: stars – galaxies: statistics – stars: statistics – Galaxy: fundamental parameters

1. Introduction

Catalogues of astronomical objects derived from sky surveys serve as a foundation for any subsequent scientific analysis. One of their primary uses is to provide information about statistical properties and spatial distribution of the observed sources, and to identify rare objects, especially those whose presence in the dataset is not expected. Regardless of the aim of the survey, it is important to identify what characteristic properties each class of objects exhibits. This information is crucial in order to separate the desired type of sources from the heap of collected data for further analysis. The nature of an astronomical object can be determined most reliably by analysing its electromagnetic spectrum. However, even the largest spectroscopic surveys undertaken today, designed to provide detailed information about each observed object, usually cover just a fraction of all the sources available for a given instrument.
Photometric observations, on the other hand, are capable of delivering data for many more sources at a significantly faster rate and lower cost. For photometric data the traditional tools for object separation are colour-colour (CC) and colour-magnitude (CM) diagrams, where various types of objects (like stars and galaxies) appear in separate areas due to differences in observed colours (e.g. Walker et al. 1989; Pollo et al. 2010; Jarrett et al. 2017). Today's largest photometric datasets, such as SuperCOSMOS (Hambly et al. 2001), WISE (Wright et al. 2010), or Pan-STARRS (Chambers et al. 2016), contain of the order of a billion catalogued sources each, which means that even now the traditional ways of dealing with the resulting catalogues by direct human inspection are not practicable. These numbers are expected to grow by orders of magnitude with future experiments such as the LSST¹ or SKA², so it is crucial to develop automated methods for source classification in the associated data products. With the advent of self-learning algorithms, the task of source separation can now be dealt with much more efficiently and much more reliably, due to the ability of the algorithm to work in a multidimensional rather than two-dimensional parameter space (e.g. CC or CM diagrams) as is usually the case for human analysis. Machine learning schemes are now widely used to automatically classify astronomical sources. Owing to automated algorithms, selecting objects of significantly different properties (compactness, colour, etc.) from sky surveys has become quite straightforward (Zhang & Zhao 2004; Solarz et al. 2012; Cavuoti et al. 2014; Shi et al. 2015; Heinis et al. 2016).

* The catalogues of outlier data are only available at the CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/606/A39
¹ www.lsst.org
² www.skatelescope.org
Article published by EDP Sciences. A39, page 1 of 13

However, depending on the nature of the survey, we can expect different types of objects to appear within the field of view. In surveys covering large areas of the sky and reaching deep enough to encompass significant amounts of both Galactic and extragalactic sources, the source separation is usually complicated. In such a case restricting the search to just a few basic classes (e.g. stars, galaxies, and quasars) is not sufficient any more, as the closer we get to the Galactic plane the more diverse objects we can expect to find: planetary nebulae, special types of stars (embedded in envelopes or undergoing catastrophic events), regions of interstellar matter, etc. Moreover, the wavelength regime in which observations are made determines what kind of objects can be expected to appear. For instance, in optical surveys we will find far fewer dust-rich objects than in infrared (IR) ones, while hot stars clearly visible in optical and ultraviolet bands will fade away at longer wavelengths. On the other hand, if any unknown objects are present within the data, their properties should stand out from the crowd of the expected ones. However, detecting these outliers is not straightforward, as it is not uncommon for rare objects to mimic the appearance of well-known ones; for instance, a star and a compact galaxy could both be classified as the same source type based on their angular size only. In Kurcz et al. (2016) an attempt was made to perform an automated, supervised source classification of IR sources from the all-sky survey conducted by the WISE satellite. The training sample was based on the most secure identifications of SDSS spectroscopic sources divided into three classes of expected (normal) objects: stars, galaxies, and quasars. However, such a standard approach of supervised classification is not designed to correctly handle objects with patterns absent during training, or in other words, anomalous sources.
Moreover, especially in low-resolution surveys like WISE, in areas of high observed source density such as low Galactic latitudes, measurements are plagued by effects of overcrowding and therefore blending of objects. In such cases the measured properties of objects can display deviant characteristics and create further training biases. These issues were partly avoided in Kurcz et al. (2016) by removing the data from the lowest Galactic latitudes (|b| < 10° and wider by the Galactic bulge) in order to improve the classification. Nevertheless, both the catalogue of galaxy candidates and especially that of the putative quasars obtained there exhibited large-scale on-sky variations resulting from issues with the data themselves, but also – and maybe more importantly – from imperfections of the classification approach. One of the goals of the present study is to examine whether those results could be explained by the existence of unaccounted-for anomalous sources in the WISE data, and what the prospects are for improving that classification. By definition, characteristics of the unexpected sources are not known a priori, rendering the standard multi-class approach inapplicable. Novelty detection schemes offer a solution to these problems: such methods are designed to recognise cases when a special population of data points differs in some aspects from the data which are used to train the machine learning algorithm. It is common to apply these methods to datasets which contain a large number of examples representing the "normal" populations, but for which data describing the "anomalous" populations are insufficient. In this work objects inconsistent with the training data will be defined as anomalous, as in principle they should display novel/outlying properties in the parameter space.
In other words, "unknown" patterns in the target set will manifest themselves in the form of points deviating from the "known" sources. A comprehensive review of this type of methodology developed for machine learning can be found in Hodge & Austin (2004), Agyemang et al. (2006), and Chandola et al. (2009). According to Hodge & Austin (2004), a user can approach the problem of novelty detection in three different ways. The first is based on unsupervised clustering, where outliers are detected without any previous knowledge about the data. The second approach uses supervised classification, where data have to be prelabelled as normal and unknown. The third method is a mixture of the first two and is referred to as semi-supervised recognition, where the algorithm models the normality of the data, and no knowledge about the true nature of test data is assumed. In this third approach the observer designs an algorithm to create a model of how normal data behave based on a large number of representative examples introduced during training. Next, the algorithm investigates previously unseen patterns by comparing them to the model of normality and searches for a score of novelty. This decision threshold is then used to infer whether the data are behaving in a different manner with respect to the training set or not. A wide selection of outlier detection algorithms is currently available. Some, based on unsupervised approaches like random forest (Ho 1998), were used by Baron & Poznanski (2017) to find SDSS galaxies with abnormal spectra. Other methods, based on semi-supervised graph-based methods like label propagation and label spreading (described in detail in Chapelle et al. 2006), were used by Škoda et al. (2016, 2017) to identify artefacts and interesting celestial objects in the LAMOST survey.
In the present work we use a knowledge-based novelty detection method designed to create a boundary within the structure of the training dataset, support vector machines (SVM, Vapnik 1995), to show the power of anomaly detection algorithms and discuss how they could improve automatic selection schemes for present and forthcoming surveys covering large areas of the sky. As a case study, we will search for potential new source classes in the WISE dataset. The paper is organized as follows: in Sect. 2 we present the data and the parameter space used by the SVM algorithm; a description of the one-class SVM algorithm we use can be found in Sect. 3, where we discuss the steps of anomaly detection and describe the training process; Sect. 4 contains the results of the application of those procedures to AllWISE data; a summary and conclusions are given in Sect. 5.

2. Data

In this paper we perform anomaly detection in the Wide-field Infrared Survey Explorer (WISE) data, aiming at improving the early all-sky classification results of Kurcz et al. (2016) and at searching for deviant objects with unexpected properties. Exploration of the publicly available AllWISE catalogue (Cutri et al. 2013), which contains over 747 million sources with photometric information, allows us to test the power of basic artificial intelligence algorithms for anomaly detection in order to obtain information about special objects contained within the dataset. Currently AllWISE is the deepest all-sky dataset available to the public which at the same time provides vast amounts of data that can be used to test automatic schemes of classification and detection of novelty. The WISE telescope (Wright et al. 2010), launched by NASA in December 2009, has been scanning the whole sky, originally in four passbands (W1, W2, W3, and W4) covering near- and mid-IR wavelengths centred at 3.4, 4.6, 12, and 23 µm, respectively. The AllWISE Source Catalogue was produced by combining the WISE single-exposure images from the WISE 4-Band Cryo, 3-Band Cryo, and NEOWISE Post-Cryo survey phases (Mainzer et al. 2014). The angular resolution of the filters is 6.1″, 6.4″, 6.5″, and 12.0″, respectively, and the sensitivity to point sources at the 5σ detection limit is estimated to be no less than 0.054, 0.071, 0.73, and 5 mJy, which is equivalent to 16.6, 15.6, 11.3, and 8.0 Vega mag, respectively³.

2.1. Source preselection: WISE × SDSS cross-match

The source preselection and parameter space for the purpose of this study follows that of Kurcz et al. (2016). Namely, we focus on reliable measurements only in the W1 and W2 channels to maximise the completeness and uniformity of the sample. We thus use AllWISE sources which meet the following criteria: profile-fit measurement signal-to-noise ratios w1snr ≥ 5 and w2snr ≥ 2; saturated pixel fractions w1sat and w2sat ≤ 0.1.
To ensure that we do not retain any severe artefacts, we also require cc_flags = '0' in both the W1 and W2 positions (i.e. excluding the values D, P, H, and O), which removes sources flagged as diffraction spikes, persistence, halos, or optical ghosts. We emphasise that we do not use the W3 and W4 channels for the preselection nor for the SVM analysis; the former band will only be employed in the verification phase for CC plots. In the domain of knowledge-based machine learning it is necessary to create a template for the classification of known objects. Therefore, the training data should representatively sample the underlying distribution of target objects within a given parameter space. In the case of this study, an ideal training set would be constructed from a subsample of securely measured WISE sources with well-defined types. However, since at present such datasets are not available for WISE, an external dataset containing sources of interest is needed to create the basis for the training process. For this purpose we construct the training set by cross-matching the AllWISE dataset with the Sloan Digital Sky Survey (SDSS, York et al. 2000) DR13 (SDSS Collaboration et al. 2016), which provides spectroscopic measurements (Bolton et al. 2012). The SDSS spectroscopic sample includes over 4.4 million sources, where galaxies comprise 59%, quasars 23%, and stars 18%. The cross-matching procedure was performed using a 1″ matching radius and resulted in 3 million common sources, of which galaxies constitute 70%, quasars 12%, and stars 18%. This sample of AllWISE sources with a counterpart in the SDSS DR13 spectroscopic catalogue will henceforth be referred to as AllWISE × SDSS, and below we provide details on cuts applied to it before the training procedure.
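As a minimal sketch only, the preselection criteria above can be expressed with the AllWISE column names quoted in the text (w1snr, w2snr, w1sat, w2sat, cc_flags); the example records and the helper function are hypothetical illustrations, not part of the released pipeline.

```python
def passes_preselection(src):
    """Apply the W1/W2 quality criteria described in Sect. 2.1."""
    return (src["w1snr"] >= 5.0 and        # profile-fit S/N in W1
            src["w2snr"] >= 2.0 and        # profile-fit S/N in W2
            src["w1sat"] <= 0.1 and        # saturated pixel fraction, W1
            src["w2sat"] <= 0.1 and        # saturated pixel fraction, W2
            src["cc_flags"][0] == "0" and  # no artefact flag in W1
            src["cc_flags"][1] == "0")     # no artefact flag in W2

# Hypothetical catalogue rows: one clean source, one flagged as a halo (H) in W1.
catalogue = [
    {"w1snr": 12.0, "w2snr": 4.0, "w1sat": 0.0, "w2sat": 0.0, "cc_flags": "0000"},
    {"w1snr": 30.0, "w2snr": 9.0, "w1sat": 0.0, "w2sat": 0.0, "cc_flags": "H000"},
]
clean = [s for s in catalogue if passes_preselection(s)]
print(len(clean))  # 1
```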
2.2. Parameter space

As in Kurcz et al. (2016), where SDSS DR10 was used, the cross-match between AllWISE and SDSS DR13 practically does not provide galaxies fainter than Vega W1 = 16 or W2 = 16. For the sake of completeness we thus trim all our catalogues, including the target AllWISE, at these limits; the same applies to any other cuts described below. At the bright end the matched catalogue contains practically only stars; we thus apply additional criteria of W1 > 9.5 and W2 > 9.5, as otherwise such a population of bright stars would be identified as anomalies by our scheme. In the earlier related studies by Kurcz et al. (2016) and Krakowski et al. (2016), where a more classical approach to supervised learning was applied, the SDSS-based training sets were purified of sources with problematic redshift measurements according to SDSS parameters such as zWarning and zErr;

³ http://wise2.ipac.caltech.edu/docs/release/allwise/expsup/sec2_3a.html
this was done to avoid type misidentifications which could have detrimental effects on multi-class source identification. In the case of outlier detection schemes the aim is to search for sources which do not exhibit patterns learned during training. As such algorithms are not designed to provide distinctions between specific classes but rather to show unexpected sources, the quality of the redshift measurement is of little interest here. Even though the SDSS spectroscopic data themselves may contain anomalous sources, the focus of our study is to find interesting objects within the infrared WISE catalogue. Therefore, even if an object exhibits deviant properties at optical wavelengths but is otherwise well detected, it will still be included in our training sample as a known source. For that reason we do not apply any data cleaning on the SDSS spectroscopic database. As mentioned above, our parameter space was limited to two out of four available WISE passbands: W1 and W2. This is to ensure as many objects as possible in the final catalogue, as the W3 and especially the W4 filters have much lower sensitivities and a much shorter data acquisition period (limited to the cryogenic phase), which leads to their much lower detection rates than at the shorter wavelengths. The W3 and W4 passbands are dominated by upper limits and non-detections in the WISE database, and using them in our study would lead to losing the majority of objects, introducing a severely non-uniform distribution of the sources, and significant biases in the photometry. To ensure the maximum coverage of the parameter space by known sources, instead of using the W1 and W2 measurements separately, we employ the W1 magnitude and the W1 − W2 colour. Even though using flux measurements from each filter separately is mathematically equivalent to employing the colours derived from them (e.g. Wolf et al. 2001), usage of colours can enhance the spread area within the parameter space for the considered objects.
Finally, to extend the parameter space, we also use a concentration parameter defined as the difference between flux measurements in two circular apertures in the W1 passband, of radii equal to 5.5″ and 11.0″, centred on a source:

w1mag13 ≡ w1mag_1 − w1mag_3,   (1)

used previously by Bilicki et al. (2014, 2016), Kurcz et al. (2016), and Krakowski et al. (2016). It serves as a proxy for morphological information: extended sources will typically have larger w1mag13 values than point-like ones (see Fig. 1). We emphasise that currently the WISE database does not provide any reliable extended source identifications nor isophotal magnitudes, except for a small subset (∼500 000) of objects in common with the 2MASS Extended Source Catalogue and for those in some of the GAMA fields (Cluver et al. 2014; Jarrett et al. 2017). To summarize this part, the chosen parameter space has three dimensions:

1. W1 magnitude measurement;
2. W1 − W2 colour;
3. concentration parameter w1mag13.

All the WISE magnitudes will be given in the Vega system.

2.3. Quality cuts

To purify the data further, we apply two cuts on the concentration parameter. First, we require that w1mag13 ≥ 0 to remove objects with measured flux decreasing with increased aperture, which are most likely artefacts of source extraction in high-density areas. This cut removes 3360 sources from the AllWISE × SDSS training set. In our full WISE dataset, this cut eliminates about 600 000 sources, the vast majority of which are located within the Galactic plane and bulge, in the Magellanic Clouds, and in M31; these are regions of severe blending in WISE, which is a further confirmation of the spurious nature of these w1mag13 < 0 objects. Due to the much lower angular resolution of WISE compared to SDSS (6.1″ in the W1 channel vs. 1.3″ in the r band), two objects appearing in close vicinity of one another can be well separated in SDSS, but may be blended in WISE.
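Purely as an illustration, the three-dimensional parameter space of Sect. 2.2 together with the magnitude limits and the w1mag13 ≥ 0 cut can be sketched as follows; the input magnitudes are invented values and the function names are ours, not the paper's.

```python
def features(w1, w2, w1mag_1, w1mag_3):
    """Build the 3D feature vector: W1, W1 - W2 colour, concentration."""
    w1mag13 = w1mag_1 - w1mag_3  # Eq. (1): aperture-difference morphology proxy
    return (w1, w1 - w2, w1mag13)

def passes_cuts(w1, w2, w1mag13):
    """Magnitude limits of Sect. 2.2 and the lower concentration cut of Sect. 2.3."""
    return 9.5 < w1 < 16.0 and 9.5 < w2 < 16.0 and w1mag13 >= 0.0

w1, colour, w1mag13 = features(w1=14.2, w2=13.9, w1mag_1=15.0, w1mag_3=14.4)
print(passes_cuts(w1, 13.9, w1mag13))  # True
```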
This could then lead to high values of the w1mag13 parameter, suggesting an extended source which in fact is a blend. Such objects would introduce biases during the anomaly detection process. The distribution of the w1mag13 parameter in our training set is illustrated in Fig. 1. It peaks at ∼0.6 for stars and quasars, and at ∼0.7 for galaxies. Only a small fraction (2%) of the training sources have w1mag13 > 1, and we examined a representative sample of such objects by eye, starting from those with the most extreme values. We have found that the vast majority of them are indeed blends, and this happens even if w1mag13 ∼ 1. An example is shown in Fig. 2. Two well-separated sources in SDSS (a quasar and a star) are blended in WISE, and the 11″ aperture centred on the object of interest gathers a large amount of flux from the second source. Such objects are not usable for the purposes of source separation and anomaly detection despite the usually excellent quality of their SDSS measurements. Owing to these considerations, we will not be using objects with w1mag13 > 1 for training; we then also have to remove such sources from the target catalogue. Such a cut removes a considerable number of AllWISE objects. However, for the purpose of the present analysis, these cuts are necessary as the training sample has to reflect the target sample in terms of parameter ranges. Otherwise target sources with parameter values differing significantly from the input ranges would be automatically marked as anomalous. After all the cuts discussed above, our training sample includes almost 2.3 million sources, of which 81% are galaxies, 13% are stars, and 6% are quasars (see Table 1). The final AllWISE catalogue that will be used for the novelty detection is composed of 237 million objects; see the map in Fig. 3. The most prominent features are our Galaxy and the Magellanic Clouds; however, there is lower surface density in the Bulge, consistent with blending effects in areas of high projected density⁴. Also visible are stripes related to WISE instrumental issues⁵.

Fig. 2. Example of a quasar with a clean SDSS detection at α = 192.00, δ = 10.17 (lower panel, colour image constructed from u, g, and r filters) for which the WISE-derived concentration parameter (w1mag13 ≡ w1mag_1 − w1mag_3) is 1.13 because of blending with a nearby star (upper panel, single-band image with W1 flux). Concentric circles mark the apertures in which the w1mag_1 (5.5″) and w1mag_3 (11″) magnitudes were measured in WISE.

Table 1. Summary of the training samples of SDSS objects used to train the OCSVM classifier.

SDSS class   Nobj before cuts   Nobj after cuts
Galaxy       1 918 469          1 827 211
Star         321 416            298 254
QSO          148 309            141 471

Fig. 3. Sky distribution of the 237 million AllWISE sources used for anomaly detection in this study. See text for details of sample preselection.

⁴ http://wise2.ipac.caltech.edu/docs/release/allsky/expsup/sec2_2.html
⁵ http://wise2.ipac.caltech.edu/docs/release/allwise/expsup/sec2_2.html#w1sat

3. Novelty detection

After the introduction of kernel methods (Vapnik 1995; Shawe-Taylor & Cristianini 2004), pattern recognition schemes
(ridge regression, e.g. Murphy 2012; Fisher discriminant, e.g. Mika et al. 1999; principal component analysis, e.g. Schölkopf et al. 1999; spectral clustering, e.g. Langone et al. 2015; etc.) have gained in popularity in many branches of science where the amount of data being collected is increasing to the point where human processing is no longer practicable, which is the current situation in astronomy. Kernel methods explore linear and non-linear pair-wise similarity measures. Using non-linear kernels is equivalent to mapping data from the original input space onto a higher dimensional feature space where distinction between patterns can be easier. This conventional pattern recognition focuses on two or more classes. In a two-class problem we are dealing with a set of training examples X = {(x_i, ω_i) | x_i ∈ R^D, i = 1 ... N}, which contain D-dimensional vectors of D characteristic properties (features or observables) for each of the N examples (in astronomy: sources). Depending on the class the object belongs to, it is then given a certain label ω ∈ {−1, 1}. Next, out of the training dataset a function h(x) is constructed to estimate which label should be assigned to a new input vector x_0: ω = h(x_0|X):

h(x_0|X): R^D → [−1, 1].   (2)

In the case of classification schemes of more than two classes it is typical to decompose the problem into multiple binary problems. The final classification result combines partial outcomes of binary classifiers by a ranking method. This conventional way, however, ignores any new/outlying data that do not belong to the considered classes. Without any freedom, the algorithm is forced to classify a source as one of the predefined classes, even if it does not fit any presented category, for example objects that do not occur in an optical-based training sample but are detected in the IR. To tackle the problem of novel data detection it is possible to modify the standard supervised classification scheme to one-class classification.
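To make the one-class idea concrete, here is a deliberately simplified toy (not the OCSVM described in Sect. 3.2): the model of normality is just the centroid of the training vectors, the novelty score n(x) is the Euclidean distance to it, and a threshold k plays the role of the decision boundary. All numbers are invented.

```python
import math

def normality_model(train):
    """Toy model of normality: the centroid of the training vectors."""
    dim = len(train[0])
    return [sum(x[j] for x in train) / len(train) for j in range(dim)]

def novelty_score(x, centre):
    """n(x): Euclidean distance from the model of normality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, centre)))

train = [(14.0, 0.3, 0.6), (14.2, 0.4, 0.7), (13.8, 0.2, 0.6)]  # invented (W1, W1-W2, w1mag13)
centre = normality_model(train)
k = 1.0  # decision threshold: score <= k -> "normal", otherwise "deviant"

print(novelty_score((14.1, 0.3, 0.65), centre) <= k)  # True: normal
print(novelty_score((10.0, 2.5, 0.1), centre) <= k)   # False: anomalous
```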
Here, the main class composed of normal/expected data points will be detected separately from all the other data points. In the usual approach to novelty detection it is assumed that the normal class is well sampled, while the outlying class is undersampled. A model of normality N(θ) (not to be confused with the normal distribution), where θ is a free parameter of the model, is deduced and used to assign novelty scores n(x) to the previously unseen data x. In this sense, increasing scores can be understood as increasing deviation of the points from the normality model. We define the normality threshold as z(x) = k in such a way that an example x will be classified as normal if z(x) ≤ k or as deviant in the opposite case. Therefore z(x) = k defines the decision boundary. In this way the possibility of misclassification of the objects missing in the training sample is very low, as they will occur simply as deviations from normal. To search for anomalies within the AllWISE data we have chosen to use a semi-supervised method belonging to the knowledge-based algorithms (e.g. Schölkopf et al. 2000) – the support vector machines – as these approaches focus on creating the decision boundary to contain the normal data points and are sensitive to outliers in both the training and test set. On the other hand, they do not depend on the distribution of the data within the training set; however, knowledge-based approaches to novelty detection have one drawback: the complexity associated with the computational time of kernel functions (see Sect. 3.1). Nevertheless, present-day technology coupled with parallelised computational capabilities significantly shortens both the search for the proper kernel for a given dataset and the corresponding calculations. There are also several other algorithms for novelty detection, such as reconstruction-based techniques like neural networks (e.g. Hawkins et al. 2002; Markou & Singh 2003) or subspace-based methods (e.g.
Jolliffe 2002; Hoffmann 2007) that model the underlying data and reconstruct an error, defined as a distance between the test vector and the output of the system, which is then translated into a novelty score. However, even though reconstruction methods offer a flexible way to deal with high dimensionality of the data, they require a predefinition of parameters to define the structure of the model, which leads to two basic problems: the first is the selection of the most effective training method to enable the integration of new units into the existing structure, and the second is the need to add a priori information about the saturation point (when no more new units can be added). Another large family of novelty detection techniques are distance-based approaches, which do not require any a priori knowledge about the data distribution, like nearest neighbour-based techniques (e.g. Hautamäki et al. 2004; Angiulli et al. 2009) or clustering-based techniques (e.g. Yang & Wang 2003; Basu et al. 2004). However, they require a definition of distance metrics to establish similarity between data points, which becomes an increasingly persistent problem especially when dealing with high dimensionality of the parameter space (e.g. Kriegel et al. 2009), as distance measures in many dimensions lose the ability to differentiate between normal and outlying data points. Moreover, these methods lack the flexibility of parameter tuning, making them unsuitable for full automation. Owing to the above reasons, we chose to use support vector machines for our study; a detailed description of our approach follows.

3.1. Support vector machines

Support vector machines is one of the most commonly used conventional classification algorithms in astronomy. The idea of SVM is based on structural risk minimization (Vapnik & Chervonenkis 1974). For many applications SVM have shown better performance and accuracy than other learning machines and have been used in many branches of astrophysics to solve classification problems and build catalogues (e.g. Beaumont et al. 2011; Fadely et al. 2012; Małek et al. 2013; Solarz et al. 2015; Kovács & Szapudi 2015; Heinis et al. 2016; Marton et al. 2016; Kurcz et al. 2016; Krakowski et al. 2016). Support vector machines map input points onto a high-dimensional feature space and find a hyperplane separating two or more classes with as large a margin as possible between points belonging to each category in this space. The solution, the best hyperplane, is then composed of input points lying on the boundary, called support vectors (SVs). Here we outline the basis of the SVM theory in application to classification schemes. Training of an SVM algorithm starts with a set of observations with labels (y_1, x_1), ..., (y_l, x_l), where x_i ∈ R^N belongs to one of two classes and has a label y_i ∈ {−1, 1} for i = 1, ..., l. Each point should contain a vector of features, characteristic values which describe it. Then the algorithm maps each vector from the input space X onto a feature space H using a non-linear function Φ: X → H. The desired separation plane w · z + b = 0 is defined by the pair (w, b) in such a way that each point x_i is separated according to a decision function

f(x_i) = sgn(w · z_i + b),   (3)

where w ∈ H and b ∈ R. In principle, it is not the explicit knowledge of the mapping function Φ that is needed, but the dot product of the transformed points ⟨Φ(x_i), Φ(x_j)⟩ (Cortes & Vapnik 1995). Therefore, instead of working with Φ it is possible to work with K: X × X → R, where K takes two points as input and returns a real value representing ⟨Φ(x_i), Φ(x_j)⟩. The only condition is that Φ exists if and only if K (called a kernel) is positive definite (satisfies Mercer's condition; Mercer 1909). Therefore, any function which meets this criterion can be a kernel function. The most commonly featured kernel functions are linear, sigmoid, radial basis, and polynomial, which we describe in more detail in Sect. 3.3.

Fig. 4. Schematic representation of OCSVM using an example of the default radial basis kernel. The presented case of classification shows the tightest decision boundary which envelopes the known data (red circle), which can be treated as finding a separating hyperplane in the traditional SVM sense (green line). Unknown objects fall outside the sphere and are marked as outliers.

3.2. One-class SVM reformulation

Schölkopf et al. (2000) introduced an extension of the SVM methodology to pattern recognition as an open set problem. Unlike the traditional SVM algorithm, which is designed to differentiate between classes contained within a given set, one-class SVM (hereafter OCSVM) recognizes patterns in a much larger space of classes, unseen in training but which occur in testing. For that purpose, in the absence of a second class in the training data, the algorithm defines an "origin" by mapping feature vectors onto a feature space through an appropriate kernel function and then separates them by a hyperplane with a maximum margin with respect to the origin. The resulting discriminant function is trained to assign positive values in the region surrounding the majority of the training points and negative elsewhere.
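As an aside, the radial basis kernel named above reduces to a one-line formula, K(x_i, x_j) = exp(−γ‖x_i − x_j‖²). The sketch below (our own illustration, with an arbitrary γ) also shows two properties of this kernel: symmetry and K(x, x) = 1.

```python
import math

def rbf_kernel(xi, xj, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||xi - xj||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

a, b = (14.0, 0.3, 0.6), (15.5, 1.2, 0.1)
print(rbf_kernel(a, a))                      # 1.0: identical points are maximally similar
print(rbf_kernel(a, b) == rbf_kernel(b, a))  # True: the kernel is symmetric
```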
The hyperplane parameters are derived by solving a quadratic programming problem:

minimize over (w, ξ, ρ):   (1/2) w · w + (1/(νl)) Σ_{i=1}^{l} ξi − ρ   (4)

subject to   (w · Φ(xi)) ≥ ρ − ξi,  i = 1, 2, ..., l;  ξi ≥ 0,   (5)

where w and ρ are parameters of the separation hyperplane, Φ is the mapping function from the input parameter space to a feature space, ν is the asymptotic fraction of outliers (anomalies) allowed, l is the number of training points, and the ξi are slack variables which penalize misclassifications. The decision function f(x) = sgn(w · Φ(x) − ρ) determines point labels (e.g. +1 for known instances and −1 for novel points). A schematic idea behind OCSVM is shown in Fig. 4.

In this approach the parameter ν is interpreted as the asymptotic fraction of data labelled as outliers. Choosing the outlier fraction ν therefore presumes a priori knowledge of how frequently novel points appear (e.g. Manevitz & Yousef 2007). Otherwise, its value has to be tuned as a free parameter together with the other unknowns. It is worth noting that domain-based approaches such as this one regulate the position of the novelty boundary using only those data in the closest proximity to it; the properties of the distribution of data in the training set have no influence on this process (e.g. Tax & Duin 1999; Le et al. 2010, 2011; Liu et al. 2011). The main drawback of the presented method is the complexity associated with the choice and computation of the kernel functions. Moreover, the parameters controlling the size of the boundary area need to be properly adjusted, which increases the computational time (e.g. Tax & Duin 2004). In this work we use the R (R Core Team 2013)6 implementation of SVM included in the e1071 package (Meyer et al. 2015), which provides an interface to libsvm. We use the doParallel7 and caret8 packages to parallelize the computations.

3.3. Classification scheme

To make a selection of outlying data it is crucial to create the best-suited classifier for a given dataset.
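To make the decision rule of Eqs. (4) and (5) concrete, the following minimal sketch trains a one-class SVM on a single "known" class and flags points far from it as novelties. It uses Python with scikit-learn's OneClassSVM rather than the R e1071 implementation used in this work; the data are made-up placeholders, `nu` corresponds to the ν parameter (expected outlier fraction), and `gamma` to the radial basis kernel scale γ.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical "known" training class: a tight cloud in a 2D feature space
# (standing in for, e.g., WISE magnitudes/colours of spectroscopic sources).
X_train = rng.normal(loc=0.0, scale=0.3, size=(1000, 2))

# nu ~ asymptotic outlier fraction nu; gamma sets the RBF kernel scale.
clf = OneClassSVM(kernel="rbf", nu=0.01, gamma=0.1).fit(X_train)

# Decision function f(x) = sgn(w . Phi(x) - rho):
# +1 for points consistent with the trained model, -1 for novelties.
known = np.array([[0.1, -0.2]])
novel = np.array([[5.0, 5.0]])
print(clf.predict(known))  # [1]
print(clf.predict(novel))  # [-1]
```

Note that only the support vectors (accessible as `clf.support_`) determine the boundary, in line with the domain-based character of the method described above.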
In our application, the classifier is trained on sources with spectroscopic measurements in the SDSS database, treated as a single class, which are also present in the AllWISE catalogue. For this purpose we include all sources from the AllWISE × SDSS cross-match in the training sample. Unlike in the case of classical SVM, the imbalance of the training set has no influence on the OCSVM training, as we create only one known class; the quantities and ratios of specific classes are not an issue here. For that reason OCSVM can also be treated as an alternative approach to dealing with imbalanced datasets, one which incurs no information loss during training (e.g. Batuwita & Palade 2013). With the training sample selected, the algorithm has to be trained to recognise the normal patterns, which in the case of OCSVM means finding the best-suited volume encompassing the training points; this volume will later serve as a decision boundary between what the algorithm finds to be normal and deviating patterns. This procedure involves searching for the most appropriate kernel function, which governs the topology of the surface enclosing the training sample. To ensure the best performance of the algorithm it is necessary to choose an appropriate kernel function for the given training set and to find its meta-parameters to train the novelty detector which
6 http://www.R-project.org
7 https://cran.r-project.org/web/packages/doParallel/index.html
8 https://cran.r-project.org/web/packages/caret/index.html
will best suit the input data. As no two datasets are the same, it is natural that there is no universal kernel function optimal for every classification problem. This makes testing several functions a vital step in any kernel-based machine learning process (Sangeetha & Kalpana 2010). In this study we test four basic shapes of kernel functions in the application of novelty detection to the AllWISE dataset:

i) linear kernel: u^T v;
ii) sigmoid kernel: tanh(γ u^T v + C);
iii) radial basis kernel: exp(−γ ||u − v||²);
iv) polynomial kernel: (γ u^T v + C)^d,

where u and v are vectors in the input space, ||u − v||² is the squared Euclidean distance between the two feature vectors, γ is a scaling parameter, d is the degree of the polynomial function, and C is a constant. These meta-parameters need to be tuned for each given dataset; an exception is the linear kernel, which does not have any free parameters.

Taking into account all the above points, the following steps need to be applied for each considered kernel function in order to determine the best-fitting topology of the separation hypersurface for any dataset:

1. Division of the training set: The full training set is divided into two subsets, where one is used for the actual training and the other is used as a validation subset to verify the accuracy of the created hypersurface against sources of known class which were not used by the algorithm to find the model. We create the training set out of a random 99% of the known objects; the validation set contains the remaining 1% of known sources. The percentage left for validation is small but sufficient to verify whether the classifier works well on previously unseen data.

2. Wide grid search: Training a classifier means finding the best set of kernel function meta-parameters. They are determined by searching through a loosely spaced grid of meta-parameter values describing each kernel (e.g.
d, γ, and C in the case of the polynomial kernel) and the ν parameter specifying the expected outlier fraction.

3. Estimation of training accuracy: For each tested combination of meta-parameters we count how many times an object of known nature was correctly classified by the OCSVM (true positive; TP) and how many times a known object was classified as an outlier (false negative; FN). Based on these counts the accuracy is calculated as acc_train = TP/(TP + FN). Moreover, we count the number of SVs used to find the decision boundary. The fewer the points treated as SVs, the better: when the data are well structured, only a small number of SVs are used and all the remaining training points play no part in the calculation of the final boundary. A high number of SVs means that the topology of the surface is complex and that the data cannot be easily contained within the boundary.

4. Estimation of validation accuracy: The algorithm trained in the previous step, with the best-suited meta-parameters of its kernel, is applied to classify the validation set. Thanks to the knowledge of the true labels of this set it is possible to verify how well the trained algorithm works on previously unseen data by calculating acc_valid = TP/(TP + FN), and therefore to estimate how well it will work on truly unknown data. As above, the number of SVs is also taken into account here.

5. Fine-tuning of the grid search: We tighten the grid search around the best values from the wide grid to fine-tune the meta-parameter choice for the best performance (by repeated measurements of acc_train and acc_valid).

Table 2. Input parameter ranges for the grid search for the kernels tested in Sect. 4 (see text for details).

Kernel     | ν           | γ            | C   | d
Linear     | 0.0001–0.69 | −            | −   | −
Radial     | 0.0001–0.69 | 0.001–10 000 | −   | −
Sigmoidal  | 0.0001–0.69 | 0.001–10 000 | 0–4 | −
Polynomial | 0.0001–0.69 | 0.001–10 000 | 0–4 | 2–3
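The wide grid search of steps 1–4 can be sketched in a few lines. This is an illustrative Python version using scikit-learn's OneClassSVM in place of the R e1071 implementation used in this work; the data, grid values, and 99%/1% split are placeholders mirroring the procedure and Table 2, and the accuracy is TP/(TP + FN) as defined above.

```python
import itertools
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Single "known" class (placeholder data); 99% for training, 1% for validation.
X = rng.normal(size=(2000, 2))
n_valid = len(X) // 100
X_train, X_valid = X[n_valid:], X[:n_valid]

def accuracy(clf, data):
    """TP / (TP + FN): fraction of known objects labelled +1 (known)."""
    return np.mean(clf.predict(data) == 1)

# Loosely spaced grid for the radial basis kernel (cf. Table 2).
nu_grid = [0.001, 0.01, 0.1, 0.5]
gamma_grid = [0.001, 0.1, 10.0, 1000.0]

best = None
for nu, gamma in itertools.product(nu_grid, gamma_grid):
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
    score = (accuracy(clf, X_train),   # acc_train (step 3)
             accuracy(clf, X_valid),   # acc_valid (step 4)
             -len(clf.support_))       # prefer fewer support vectors
    if best is None or score > best[0]:
        best = (score, nu, gamma)

(acc_tr, acc_va, neg_nsv), nu, gamma = best
print(f"best: nu={nu}, gamma={gamma}, "
      f"acc_train={acc_tr:.3f}, acc_valid={acc_va:.3f}, N_SV={-neg_nsv}")
```

The lexicographic tuple comparison encodes the selection criteria of the text: highest training accuracy, then highest validation accuracy, then the smallest number of SVs.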
The search for the free parameters of each kernel is done within reasonable expected ranges. To find these ranges we follow the scheme of Chapelle & Zien (2005), where it is first necessary to fix initial values for each set of parameters which will provide the most reliable orders of magnitude. In the case of the one-class problem we use the median of the pairwise distances of all training points as the default for γ. The default for ν is taken as the inverse of the empirical variance s² in the feature space, calculated as s² = (1/n) Σ_i K_ii − (1/n²) Σ_{i,j} K_ij from the n × n kernel matrix K. For the degree of the polynomial we consider only two possible values, 2 and 3, as higher degrees would create a boundary whose topology is too complex, which would result in overfitting the model. We then use multiples (10^k for k ∈ {−3, ..., 3}) of the default values as the grid search range.

After pinpointing the best set of parameters for each kernel function we perform a fine-tuning of the grid search, where we search the grid around those best parameters with a much smaller step (multiples of 2^k). However, we find that the maxima of the performance around the optimal parameters found on the sparser grid are very broad, and the fine-tuning does not improve the performance of the classifier. The ranges of searched parameter values for each kernel are summarized in Table 2.

Upon finalization of the training, and after verifying that the classifier performs at a satisfactory level, it is applied to the AllWISE target data to search for true outlying objects.

4. Results of OCSVM application to AllWISE data

In this section we present the results of applying the OCSVM algorithm to the AllWISE catalogue. Following the discussion presented in the previous section, we started by determining the most appropriate kernel for this training set. We found that in this case the preferred kernel function is the radial basis one, with optimal parameters γ = 0.1 and ν = 0.001, as it provides the highest training and validation accuracies and is characterized by the smallest number of SVs. The optimal parameters and training/validation performance of all four tested kernels are summarised in Table 3.

Table 3. Performance of the kernel functions tested in the training of the OCSVM algorithm.

Kernel  | ν     | γ   | C | d | NSV | acc_train | acc_valid
Linear  | 0.1   | −   | − | − | 892 | 27.02%    | 31.33%
Sigmoid | 0.001 | 1   | 0 | − | 125 | 99.99%    | 98.80%
Radial  | 0.001 | 0.1 | − | − | 48  | 99.99%    | 99.98%
Poly    | 0.005 | 100 | 0 | 3 | 53  | 97.87%    | 96.66%

Having determined the proper kernel, we trained the OCSVM anomaly detector on the AllWISE × SDSS training set, and subsequently applied it to the AllWISE data preselected as described in Sect. 2. As a result, we obtained a sample of 642 353 sources classified as unknown by the algorithm. We show their sky distribution in Fig. 5; as is obvious from the plot, the vast majority of these sources are located within the Galactic Plane and Bulge (90% are within |bGal| < 15°) and in other confusion areas: the Magellanic Clouds, Galactic dust clouds, and even M31 and M33. This is to be expected, as the 6″ spatial resolution of the WISE satellite leads to severe blending in areas of high projected density, which in turn results in anomalous (spurious) photometric properties of these blended objects. However, as discussed below, besides such artefacts our anomaly detector also flagged a considerable number of genuine sources of astrophysical interest.

Fig. 5. Sky positions of all the objects classified as unknown in the application of the OCSVM anomaly detector to the AllWISE catalogue (642 353 sources), shown in Galactic coordinates.

To gain insight into the nature of these anomalies, we started by looking at their WISE colours. It is important to note that the OCSVM algorithm itself does not provide any means of discriminating between various populations among the outliers. Relying on the colours to identify groups in the resultant anomalies is the most basic approach. It would be possible to refine this task by employing clustering algorithms (e.g. Han et al. 2011), which could find different classes within the outlier group. For the bright sources the problem can also be approached by using passband images directly (e.g. Hoyle 2016), but the speed of data processing would then decrease significantly. In this work, however, we restrict ourselves to a first look at the anomalies, and for this we mostly use a single WISE colour, W1−W2. In Fig. 6 we present their W1−W2 distribution and compare it to the training set, divided according to SDSS source classes (stars, galaxies, quasars). We observe multi-modal behaviour of this colour for the detected anomalies, with three peaks at W1−W2 ∼ −1, ∼−0.5, and ∼1.7. The peaks are separated by minima at W1−W2 = −0.65 and at ∼0.8. It is interesting to note that the latter is the same as the WISE active galactic nuclei (AGN) separation criterion first proposed by Stern et al. (2012); we discuss these red sources in more detail in Sect. 4.2. The total number of sources contained in each considered group is shown in Table 4.
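The division by colour described above amounts to simple cuts on the W1−W2 values; a minimal Python sketch, with made-up placeholder colours and the thresholds −0.65, 0, and 0.8 quoted in the text:

```python
import numpy as np

# Hypothetical W1-W2 colours for a handful of anomalies (placeholder values).
w1_w2 = np.array([-1.1, -0.7, -0.3, 0.4, 1.7, 2.3])

# Cuts used in the text: < -0.65 (blending artefacts), -0.65..0 (problematic),
# 0..0.8 (intermediate mix), >= 0.8 (AGN/quasar candidates).
groups = {
    "artefacts":      w1_w2 < -0.65,
    "problematic":    (w1_w2 >= -0.65) & (w1_w2 < 0.0),
    "intermediate":   (w1_w2 >= 0.0) & (w1_w2 < 0.8),
    "agn_candidates": w1_w2 >= 0.8,
}

for name, mask in groups.items():
    print(name, w1_w2[mask])
```

The four boolean masks are mutually exclusive and together cover the full sample, so every anomaly lands in exactly one group.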
This division into roughly three groups is also confirmed in the colour-colour (CC) diagram where the W2−W3 WISE colour is used as the second dimension (Fig. 7). To construct this diagram we used only those sources which have a positive signal-to-noise ratio in the W3 band, that is 38% of the full anomaly sample. In addition, for objects with 0 < w3snr < 2, which have only W3 upper limits in the WISE database, we applied the correction of +0.75 mag as discussed in the Appendix of Krakowski et al. (2016). This CC diagram gives indications of the nature of the detected anomalies: according to Fig. 26 in Jarrett et al. (2011) or Fig. 5 in Cluver et al. (2014), stars concentrate around W1−W2 ≃ W2−W3 ≃ 0; elliptical galaxies have W1−W2 ≳ 0 and 0.5 < W2−W3 < 1.5, while spirals span 0 < W1−W2 < 0.5 and 1 < W2−W3 < 4.5; quasars are much redder in W1−W2 than most inactive galaxies, while their W2−W3 is similar to that of some spirals. More specific types of sources are also located on that diagram, such as (U)LIRGs or brown dwarfs, but here we will restrict ourselves to the three basic classes (stars, galaxies, and quasars) in our basic division of the anomalies. Compared with the theoretical W1−W2 vs. W2−W3 diagram, we see that the upper cloud of our anomalies is located roughly at the (obscured) AGN locus, while the two lower ones do not seem to be consistent with any normal sources in this plane. Below we discuss in more detail the possible nature of these three groups of sources. We reiterate that, as most of the detected anomalies do not have measurements in the W3 band, the distinction will be made based only on their W1−W2 colour.

Fig. 6. Distributions of the W1−W2 colour for the AllWISE anomalies found in this work (solid black) compared with known sources from AllWISE × SDSS used to train the OCSVM algorithm. Galaxies (1 827 241 objects) are marked by blue dashed lines; stars (298 269 objects) by orange dotted lines; and quasars (141 494 objects) by magenta dot-dashed lines. Vertical lines mark the colour cuts applied to the OCSVM anomaly sample, dividing it into four possible subgroups.

Table 4. Number of sources identified as anomalies by the OCSVM algorithm, further divided according to the W1−W2 colour cuts.

Colour cut        | Nobj
W1−W2 < 0         | 575 598
0 ≤ W1−W2 < 0.8   | 26 990
W1−W2 ≥ 0.8       | 39 940

4.1. Anomalies with extremely low W1−W2 colour: photometric artefacts

We begin our detailed investigation into the nature of the detected anomalies by looking at those with extremely low W1−W2 < −0.65.
This is the majority (55%) of the outliers found by OCSVM. Their sky distribution (Fig. 8) and the fact that they have such an extremely blue, unphysical W1−W2 colour are a clear indication that these are sources with spurious photometry due to blending. In what follows we will not deal with these sources.

Fig. 7. Left panel: WISE colour-colour (W1−W2 vs. W2−W3) diagram for the sources identified by OCSVM as outliers in AllWISE. The plot shows only sources detected in the W3 band (188 496 objects comprising 38% of all the anomalies). The greyscale marks the density of displayed points in linear bins. Right panel: W1−W2 vs. W2−W3 diagram for the AllWISE × SDSS sources used for OCSVM training. We note the different ranges of the axes in the two panels.

Even after removing the W1−W2 < −0.65 artefacts, there is still a significant fraction of sources with a very negative and most likely non-physical W1−W2 colour (cf. Fig. 6). Almost all such sources are confined to the Galactic plane and bulge; to inspect whether their photometry is indeed spurious, we checked their properties in GLIMPSE (Benjamin et al. 2003)9, a Spitzer survey covering the inner Galactic Plane and Bulge within |bGal| < 5° & |lGal| < 65°. As the filter coverages of Spitzer IRAC I1 (centred at 3.6 µm) and IRAC I2 (centred at 4.5 µm) are very comparable to WISE W1 and W2, respectively (Jarrett et al. 2011), it is an adequate test to confront the corresponding flux measurements from the GLIMPSE catalogue with those of the OCSVM AllWISE anomalies. Our sample of outliers with 0 > W1−W2 > −0.65 has almost 17 000 counterparts in GLIMPSE within a 3″ matching radius. By comparing the IRAC I1 & I2 vs. WISE W1 & W2 measurements, we found that for the shorter-wavelength channels the IRAC and WISE magnitudes match very well, while in the I2 vs. W2 comparison there is a clear discrepancy: the WISE measurements in this band significantly underestimate the fluxes with respect to IRAC. What is more, this bias increases with decreasing W1−W2 colour (Fig. 9).

9 We used the GLIMPSE II 2.1 Data Release, http://www.astro.wisc.edu/glimpse/glimpse2_dataprod_v2.1.pdf

Fig. 9. Difference between Spitzer IRAC I2 and WISE W2 magnitudes for the anomalous sources with W1−W2 < −0.65 (354 301 sources) as a function of the WISE W1−W2 colour. The I2 measurements were taken from the GLIMPSE survey of the Galactic Plane. Contours mark the density of the displayed points in linear bins.

Sources with 0 > W1−W2 > −0.65, however, could hide a fraction of real sources of astronomical interest, as objects with moderately negative W1−W2 colours have been reported (e.g. Banerji et al. 2013; Jarrett et al. 2017). The current parameter space does not allow for a proper distinction between the real and spurious sources within this anomaly group; only a more in-depth analysis with clustering algorithms could provide more insight on that matter. As this task is beyond the scope of this work, in the present approach we treat all anomalies with W1−W2 < 0 either as having spurious W2 photometry or as problematic, and we remove them from further examination. This cut only affects confusion areas (the Galaxy, the Magellanic Clouds) and significantly purifies the anomaly dataset.

4.2. Anomalies with W1−W2 > 0.8: AGN/quasar candidates

We now turn our attention to the group of sources with W1−W2 > 0.8, which clearly stand out in the CC diagram (Fig. 7); there are almost 40 000 such anomalies in our sample. As already suggested by their location in the CC diagram, their properties are consistent with their being AGN/QSO, i.e. high-redshift extragalactic sources. There are several lines of evidence supporting this hypothesis. First of all, these sources are very uniformly distributed over the entire sky (Fig. 10) and are preferentially located outside the Galactic Bulge, except for some at the Galactic equator (5200 within |bGal| < 3°), which must be artefacts of WISE blending in a similar way to the very low W1−W2 sources of Sect. 4.1. Secondly, these anomalies are mostly faint: their W1 counts peak at the limit of the catalogue used here, W1 = 16 (see Fig. 11).
Furthermore, almost all of them (over 95%) have WISE detections in the W3 channel (w3snr > 0). We reiterate that this channel was not used in the source preselection procedure; thus, as its sensitivity in WISE is much lower than that of the two shorter-wavelength bandpasses, this indicates that these sources are intrinsically bright at (observed) 12 µm. As we can quite safely exclude the situation in which stellar light would be redshifted to this channel (one would need z > 2, which for W1 < 16 would imply an intrinsic brightness of ∼−30 mag or brighter at rest-frame λ ∼ 1 µm), this points to emission from dust, as 12 µm observations are sensitive to warm dust radiation (e.g. Sauvage et al. 2005) and polycyclic aromatic hydrocarbon emission lines (PAH; e.g. Brandl et al. 2006) at redshifts lower than 2. On the other hand, a cross-match with the all-sky 2MASS data (Skrutskie et al. 2006) gave only 1500 sources, most of them in its Point Source Catalogue (PSC), and just a handful (45) are extended (i.e. in the 2MASS XSC; Jarrett et al. 2000). This shows that most of our QSO candidates are not in the local volume, as 2MASS provides a very complete census of the local Universe (Bilicki et al. 2014; Rahman et al. 2016).

Further insight into the nature of these objects is gained by checking for their presence and properties in the SDSS photometric catalogue10. For training we used only sources with SDSS spectroscopy, while the general photometric dataset from Sloan is obviously much larger and more complete, at the price of much more limited information on the real nature of the detected sources. By cross-matching with the SDSS photometric data, we found about 7000 of our AGN candidates to have counterparts there within a 1″ matching radius. About 3000 of them are also present in the SDSS DR12 photometric redshift catalogue of Beck et al. (2016), which contains SDSS-resolved galaxies up to z = 1, and these matched objects have a mean ⟨z⟩ ∼ 0.5. By extrapolating to the full extragalactic sky, this exercise suggests that about 40% of these anomalies would have no optical counterpart in an SDSS-depth all-sky catalogue if one existed. On the other hand, about 25% of these AGN candidates seem to be residing in optically resolved galaxies at z < 1.

10 Available at http://skyserver.sdss.org/dr13
Fig. 11. W1 magnitude distributions for the three main types of anomalies identified by OCSVM in AllWISE: W1−W2 < 0, solid black (575 423 sources); 0 ≤ W1−W2 < 0.8, orange dotted (26 990 sources); and W1−W2 ≥ 0.8, blue dashed (39 940 sources), compared with the known sources from AllWISE × SDSS used to train the OCSVM algorithm: galaxies, magenta dot-dashed (1 827 241 sources); stars, red dashed (298 269 sources); and quasars, cyan triple-dot-dashed (141 494 sources).

According to studies of AGNs identified in both WISE and SDSS (e.g. Yan et al. 2013; Donoso et al. 2014), the combined optical-MIR r − W2 colour can be used as a diagnostic to differentiate between unobscured/type-1 and dust-obscured/type-2 AGN/QSO candidates, the division being at r − W2 ∼ 6 (both in Vega). We have thus checked the behaviour of this colour in our sample of anomalies with W1−W2 > 0.8 which are also present in SDSS. Indeed, we observe bimodality in the r − W2 colour, with the division roughly at r − W2 ∼ 6 (Fig. 12); a similar bimodality is also present in the distribution of the r − W1 colour. This suggests that the OCSVM selects both types of AGN populations, although we emphasise that their W1−W2 colour is in most cases much redder than for the AllWISE × SDSS spectroscopic QSOs used as part of the training (Fig. 13).

In the next step we compared our findings with other WISE-based QSO candidate selections. We limit ourselves to those works which presented all-sky WISE data in this context, namely Secrest et al. (2015) and Kurcz et al. (2016). The OCSVM-selected QSO candidates have much redder IR colours than the QSO candidates of Kurcz et al. (2016), but also redder than the quasars in the SDSS-based training sample used both here and in that paper (cf. Fig. 13). In addition, they do not show the anomalous sky distribution present in Kurcz et al. (2016, a non-uniform distribution on the sky with somewhat larger surface density close to the ecliptic than at the ecliptic poles).
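The catalogue comparisons in this section rely on positional cross-matches within small radii (3″ with GLIMPSE, 1″ with SDSS). A minimal, brute-force sketch of such a match on the sphere is given below in Python; the coordinates are made-up placeholders, and the actual matching for this work was presumably done with dedicated tools such as TOPCAT/STILTS (acknowledged at the end of the paper), which scale far better than this O(N²) loop.

```python
import numpy as np

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees via the haversine formula."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    d = (np.sin((dec2 - dec1) / 2) ** 2
         + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2)
    return np.degrees(2 * np.arcsin(np.sqrt(d)))

def crossmatch(ra_a, dec_a, ra_b, dec_b, radius_arcsec):
    """For each source in catalogue A, find the nearest B source within radius."""
    matches = []
    for i, (ra, dec) in enumerate(zip(ra_a, dec_a)):
        sep = angular_sep_deg(ra, dec, np.asarray(ra_b), np.asarray(dec_b))
        j = int(np.argmin(sep))
        if sep[j] * 3600.0 <= radius_arcsec:
            matches.append((i, j))
    return matches

# Toy usage: one source in catalogue A, two in B; 1" matching radius.
matches = crossmatch([10.0], [20.0], [10.0001, 50.0], [20.0001, -30.0], 1.0)
print(matches)  # [(0, 0)]
```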
There are over 22 000 common sources between the OCSVM QSO sample and the Kurcz et al. (2016) AllWISE QSO candidate one, and these common sources have W1−W2 < 1.8. Beyond those values, however, OCSVM selects much redder QSO candidates, missed by the classical approach to source classification: in Kurcz et al. (2016) the very high W1−W2 objects were assigned to random classes and had roughly equal probabilities of belonging to any of them.

Fig. 13. Histograms of the W1−W2 colour for quasars and quasar candidates from the following datasets: AllWISE paired up with SDSS DR13 spectroscopic (dotted orange; 141 494 sources; SDSS Collaboration et al. 2016); AllWISE OCSVM AGN candidates (black solid line; 39 940 sources; this paper); AllWISE SVM AGN selection (magenta dot-dashed; 4 443 962 sources; Kurcz et al. 2016); and AllWISEAGN (blue dashed; 1 354 775 sources; Secrest et al. 2015).

Finally, we compared the results of the OCSVM QSO selection with the publicly available AllWISEAGN catalogue (Secrest et al. 2015), containing over 1.4 million AGN candidates extracted from AllWISE following the formulae of Mateos et al. (2012). In addition to using another method of AGN/QSO selection (colour-based vs. automated), the AllWISEAGN sample also uses preselection criteria different from ours. Namely, Secrest et al. (2015) required all their sources to have S/N ≥ 5 in all of the first three WISE channels, while we do not use the W3 band for selection or classification. We note that, due to AllWISE observational limitations, the 12 µm S/N requirement of Secrest et al. (2015) leads to a very non-uniform sky coverage of their AGN candidates, varying by over an order of magnitude in surface density on different patches of the sky (Figs. 1 and 2 therein); no such issues are evident in our sample, except for the Galactic equator area. On the other hand, our QSO candidate sample is much shallower than the AllWISEAGN one because of our requirements of W1 < 16 and W2 < 16, while no such cuts were applied in Secrest et al. (2015); the latter sample thus reaches formally to the full depth of AllWISE (modulo the additional W3 preselection), which is W1 ∼ 17 on most of the sky and over W1 = 18 by the ecliptic poles (Jarrett et al. 2011). There are over 25 000 common sources between our QSO candidate dataset and AllWISEAGN, which means that almost 30% of our sample outside the Galaxy was not identified by Secrest et al. (2015) as AGN candidates. Taking into account that our dataset is one magnitude shallower than AllWISEAGN, we expect that applying our method at the full depth of AllWISE would bring of the order of 100 000 more QSO candidates not contained in the Secrest et al. (2015) sample and uniformly distributed over the sky. We plan to work on such a selection in the near future.

Fig. 14. Sky distribution of the AllWISE anomalous sources with 0 ≤ W1−W2 < 0.8 (26 990 objects).

4.3. Anomalies with intermediate W1−W2 colour: mixture of stars and compact galaxies

We find approximately 27 000 anomalous sources with intermediate W1−W2 colours (0 ≤ W1−W2 < 0.8), mostly located at low Galactic latitudes but outside the Bulge area, except for a small fraction near the Galactic Centre which again are presumably photometric artefacts (Fig. 14). Interestingly, there is also an enhancement in the surface density of these outliers towards the Galactic Anticentre. Only 14% of them are located at |bGal| > 30°, i.e. on half of the sky. Regarding their photometric properties in WISE, these sources are mostly faint, peaking at the limit of the catalogue, W1 = 16. About half of them have W3 detections, which is very different from the AGN candidate case of the previous section. Moreover, they appear to be very compact, having w1mag13 values below 0.1. This causes their anomalous behaviour for the algorithm, as practically no sources in the training sample have this property (cf. Fig. 1). Similarly to what was done in Sect. 4.2, we paired up this sample with external datasets.
Unlike in that case, here almost exactly half of them have counterparts in the 2MASS PSC, and the sky distribution of the matches roughly follows that of this anomaly sample. Except for a handful of real artefacts from the Galactic Bulge, all these objects are faint in the JHKs bands. On the other hand, none of these anomalies has a counterpart in the 2MASS XSC. Altogether, this means that except for obvious artefacts, these sources are either stars or compact galaxies unresolved by 2MASS.

Further evidence that they are a mixture of these two source types comes from a cross-match with the SDSS photometric catalogue. Here we find only ∼4500 matches, partly because this outlier dataset overlaps with the SDSS footprint only to a small extent, with practically no anomalies in the north Galactic cap where the SDSS coverage is best.

Fig. 15. Optical-infrared colour-colour diagram for sources in the AllWISE anomaly sample with intermediate 0 ≤ W1−W2 < 0.8 which are also present in the SDSS photometric dataset. Blue and red dots are respectively stars and galaxies according to the SDSS morphological classification. All magnitudes here are AB; WISE W1 was converted following Jarrett et al. (2011).

As shown in Prakash et al. (2015), a combination of the optical r and i bands and WISE W1 can be used for an efficient separation of stars from galaxies (see also DESI Collaboration et al. 2016 for a similar separation using the z band rather than i). We indeed observe such a division in our anomaly sample, as shown in Fig. 15; the magnitudes are in the AB system for a straightforward comparison with the SDSS and DESI studies. Colour-coding by the SDSS morphological classification (blue = stars; red = galaxies) confirms the two-class nature of the part of the anomaly sample which has SDSS counterparts. The galaxies from this anomaly subset matched with SDSS (1500) are also present in the SDSS DR12 photometric redshift catalogue (Beck et al. 2016), and their redshift distribution is shifted towards slightly smaller redshifts than in the SDSS DR12 sample.

Based on the above considerations, we conclude that the AllWISE anomalies identified by OCSVM with 0 ≤ W1−W2 < 0.8 colour are a mixture of stars (probably dominating), compact galaxies outside the local volume, and a handful of actual artefacts. However, only with the addition of optical photometry does the distinction between stars and galaxies become more straightforward. To distinguish between stars and galaxies without resorting to optical measurements, i.e. based on WISE data alone, proper motion measurements could possibly be used. We note that Kurcz et al. (2016) made an attempt to use proper motions as a discriminating parameter for automated source classification in AllWISE data, but no improvement was found after adding them; in fact, the accuracy of the classifier decreased. This effect can be attributed to the fact that the WISE proper motions are not yet accurate enough to be used in source classification; they are reliable only for a small subset of WISE sources with high signal-to-noise ratio (Kirkpatrick et al. 2014, 2016). In future WISE data releases (Faherty et al. 2015; Meisner et al. 2017a,b) and in planned surveys like LSST, proper motions should become an important parameter to be used in classification schemes.

5. Summary and future prospects

In this work we demonstrated the power of automatic semi-supervised outlier detection, based on the one-class reformulation of the SVM algorithm, and applied it to the WISE survey. By design, the algorithm creates a model of standard patterns and relations in the data through training on a set of known objects. Then, data from a target set can be fitted to this model of normality and classified as either known or unknown. The most relevant feature of the OCSVM algorithm is its ability to detect real outliers among the sources in a given dataset.

In the present application to AllWISE, we found three main groups of such anomalies. The first group contains actual photometric artefacts located in areas of high surface density, which have underestimated fluxes at 4.6 µm, most likely due to blending, and thus unphysical mid-IR colours. The second group includes real astrophysical objects whose nature is consistent with that of a dusty AGN population; their main outlying property is their very red mid-IR colour, which made these sources unclassifiable in the classical approaches to automated AllWISE object division. The third group of anomalous sources is a mix of IR-bright stars and compact galaxies, underrepresented in the optically selected training set.
By adding these so far missing sources, specific to mid-IR selection but not present in optically driven training sets, automated source separation in AllWISE data should provide more reliable results than was possible with the classical SVM approach presented in Kurcz et al. (2016). For the best performance of SVM-based automated source classification, it would be advisable to apply novelty detection to the data before the traditional classification is conducted. This approach should ensure that the training sample includes a sufficient number of the object types contained within the survey, and should bring insight into how well a training sample built from another survey can represent the data to be classified.

The OCSVM algorithm can be used not only as an outlier detector, but also as a means of testing the adequacy of training samples for fully supervised classification methods. When a pattern in the data does not match any of the previously learned templates, a standard supervised classifier will assign membership to a randomly chosen class. The versatility of the OCSVM algorithm in dealing with outlying and otherwise unclassifiable data should be taken advantage of by using it as a primary step to create reliable training samples, as well as to provide insight into what types of objects the supervisor should expect to find. In a more remote future, algorithms such as OCSVM should prove essential in the efficient search for novel, unexpected, or just rare objects in the ever growing volume of data collected by planned surveys like SPICA, SKA, or LSST.

Acknowledgements. This publication makes use of data products from the Wide-field Infrared Survey Explorer, which is a joint project of the University of California, Los Angeles, and the Jet Propulsion Laboratory/California Institute of Technology, funded by the National Aeronautics and Space Administration. Funding for SDSS-III has been provided by the Alfred P.
Sloan Foundation, the Participating Institutions, the National Science Foundation, and the U.S. Department of Energy Office of Science. The SDSS-III web site is http://www.sdss3.org/. SDSS-III is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS-III Collaboration including the University of Arizona, the Brazilian Participation Group, Brookhaven National Laboratory, Carnegie Mellon University, University of Florida, the French Participation Group, the German Participation Group, Harvard University, the Instituto de Astrofísica de Canarias, the Michigan State/Notre Dame/JINA Participation Group, Johns Hopkins University, Lawrence Berkeley National Laboratory, Max Planck Institute for Astrophysics, Max Planck Institute for Extraterrestrial Physics, New Mexico State University, New York University, Ohio State University, Pennsylvania State University, University of Portsmouth, Princeton University, the Spanish Participation Group, University of Tokyo, University of Utah, Vanderbilt University, University of Virginia, University of Washington, and Yale University. The authors would like to thank the anonymous referee for the comments and recommendations which helped to improve this manuscript. Special thanks to Mark Taylor for the TOPCAT (Taylor 2005) and STILTS (Taylor 2006) software. This research has made use of the Aladin sky atlas developed at CDS, Strasbourg Observatory, France. This research has been supported by National Science Centre grants number UMO-2015/16/S/ST9/00438, UMO-2012/07/D/ST9/02785, UMO-2012/07/B/ST9/04425, and UMO-2015/17/D/ST9/02121.

References

Agyemang, M., Barker, K., & Alhajj, R. 2006, Intell. Data Anal., 10, 521
Angiulli, F., Fassetti, F., & Palopoli, L. 2009, ACM Trans. Database Syst., 34, 7
Banerji, M., McMahon, R. G., Hewett, P. C., Gonzalez-Solares, E., & Koposov, S. E.
2013, MNRAS, 429, L55
Baron, D., & Poznanski, D. 2017, MNRAS, 465, 4530
Basu, S., Bilenko, M., & Mooney, R. J. 2004, in Proc. Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04 (New York, NY, USA: ACM), 59
Batuwita, R., & Palade, V. 2013, Class Imbalance Learning Methods for Support Vector Machines (John Wiley and Sons, Inc.), 83
Beaumont, C. N., Williams, J. P., & Goodman, A. A. 2011, ApJ, 741, 14
Beck, R., Dobos, L., Budavári, T., Szalay, A. S., & Csabai, I. 2016, MNRAS, 460, 1371
Benjamin, R. A., Churchwell, E., Babler, B. L., et al. 2003, PASP, 115, 953
Bilicki, M., Jarrett, T. H., Peacock, J. A., Cluver, M. E., & Steward, L. 2014, ApJS, 210, 9
Bilicki, M., Peacock, J. A., Jarrett, T. H., et al. 2016, ApJS, 225, 5
Blanton, M. R., & Roweis, S. 2007, AJ, 133, 734
Bolton, A. S., Schlegel, D. J., Aubourg, É., et al. 2012, AJ, 144, 144
Brandl, B. R., Bernard-Salas, J., Spoon, H. W. W., et al. 2006, ApJ, 653, 1129
Cavuoti, S., Brescia, M., D'Abrusco, R., Longo, G., & Paolillo, M. 2014, MNRAS, 437, 968
Chambers, K. C., Magnier, E. A., Metcalfe, N., et al. 2016, ArXiv e-prints [arXiv:1612.05560]
Chandola, V., Banerjee, A., & Kumar, V. 2009, ACM Comput. Surv., 41, 15
Chapelle, O., & Zien, A. 2005, in AISTATS 2005, Max-Planck-Gesellschaft, 57
Chapelle, O., Schölkopf, B., & Zien, A. 2006, Semi-Supervised Learning, Adaptive Computation and Machine Learning (Cambridge, USA: MIT Press), 508
Cluver, M. E., Jarrett, T. H., Hopkins, A. M., et al. 2014, ApJ, 782, 90
Cortes, C., & Vapnik, V. 1995, Mach. Learn., 20, 273
Cutri, R. M., Wright, E. L., Conrow, T., et al. 2013, Explanatory Supplement to the AllWISE Data Release Products, Tech. rep., ed. R. M. Cutri et al.
DESI Collaboration, Aghamousa, A., Aguilar, J., et al. 2016, ArXiv e-prints [arXiv:1611.00036]
Donoso, E., Yan, L., Stern, D., & Assef, R. J. 2014, ApJ, 789, 44
Fadely, R., Hogg, D. W., & Willman, B. 2012, ApJ, 760, 15
Faherty, J. K., Alatalo, K., Anderson, L. D., et al. 2015, ArXiv e-prints [arXiv:1505.01923]
Hambly, N. C., MacGillivray, H. T., Read, M. A., et al. 2001, MNRAS, 326, 1279
Han, J., Kamber, M., & Pei, J.
2011, Data Mining: Concepts and Techniques, 3rd edn. (San Francisco, USA: Morgan Kaufmann Publishers Inc.)
Hautamäki, V., Kärkkäinen, I., & Fränti, P. 2004, in Proc. 17th International Conference on Pattern Recognition (ICPR '04), Vol. 3 (Washington, DC, USA: IEEE Computer Society), 430
Hawkins, S., He, H., Williams, G. J., & Baxter, R. A. 2002, in Proc. 4th International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2000 (London, UK: Springer-Verlag), 170
Heinis, S., Kumar, S., Gezari, S., et al. 2016, ApJ, 821, 86
Ho, T. K. 1998, IEEE Trans. Pattern Anal. Mach. Intell., 20, 832
Hodge, V., & Austin, J. 2004, Artif. Intell. Rev., 22, 85
Hoffmann, H. 2007, Pattern Recogn., 40, 863
Hoyle, B. 2016, Astron. Comput., 16, 34
Jarrett, T. H., Chester, T., Cutri, R., et al. 2000, AJ, 119, 2498
Jarrett, T. H., Cohen, M., Masci, F., et al. 2011, ApJ, 735, 112
Jarrett, T. H., Cluver, M. E., Magoulas, C., et al. 2017, ApJ, 836, 182
Jolliffe, I. 2002, Principal Component Analysis (New York: Springer-Verlag)
Kirkpatrick, J. D., Schneider, A., Fajardo-Acosta, S., et al. 2014, ApJ, 783, 122
Kirkpatrick, J. D., Kellogg, K., Schneider, A. C., et al. 2016, ApJS, 224, 36
Kovács, A., & Szapudi, I. 2015, MNRAS, 448, 1305
Krakowski, T., Małek, K., Bilicki, M., et al. 2016, A&A, 596, A39
Kriegel, H.-P., Kröger, P., & Zimek, A. 2009, ACM Trans. Knowl. Discov. Data, 3, 1
Kurcz, A., Bilicki, M., Solarz, A., et al. 2016, A&A, 592, A25
Langone, R., Mall, R., Alzate, C., & Suykens, J. A. K. 2015, ArXiv e-prints [arXiv:1505.00477]
Le, T., Tran, D., Ma, W., & Sharma, D. 2010, An Optimal Sphere and Two Large Margins Approach for Novelty Detection, 2010 Int. Joint Conf. Neural Networks (IJCNN)
Le, T., Tran, D., Ma, W., & Sharma, D. 2011, Multiple Distribution Data Description Learning Algorithm for Novelty Detection, Adv. Knowledge Discovery Data Mining, Proc., 246
Liu, Y.-H., Liu, Y.-C., & Chen, Y.-Z. 2011, Expert Syst. Appl., 38, 6222
Mainzer, A., Bauer, J., Cutri, R. M., et al. 2014, ApJ, 792, 30
Małek, K., Solarz, A., Pollo, A., et al.
2013, A&A, 557, A16
Manevitz, L., & Yousef, M. 2007, Neurocomput., 70, 1466
Markou, M., & Singh, S. 2003, Signal Processing, 83, 2499
Marton, G., Tóth, L. V., Paladini, R., et al. 2016, MNRAS, 458, 3479
Mateos, S., Alonso-Herrero, A., Carrera, F. J., et al. 2012, MNRAS, 426, 3271
Meisner, A. M., Lang, D., & Schlegel, D. J. 2017a, ArXiv e-prints [arXiv:1705.06746]
Meisner, A. M., Lang, D., & Schlegel, D. J. 2017b, AJ, 153, 38
Mercer, J. 1909, Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 209, 415
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. 2015, e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, R package version 1.6-7
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. 1999, in Proceedings of the 1999 IEEE Signal Processing Society Workshop, 9, Max-Planck-Gesellschaft (IEEE), 41
Murphy, K. P. 2012, Machine Learning: A Probabilistic Perspective (The MIT Press)
Pollo, A., Rybka, P., & Takeuchi, T. T. 2010, A&A, 514, A3
Prakash, A., Licquia, T. C., Newman, J. A., & Rao, S. M. 2015, ApJ, 803, 105
R Core Team. 2013, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria
Rahman, M., Ménard, B., & Scranton, R. 2016, MNRAS, 457, 3912
Sangeetha, R., & Kalpana, B. 2010, A Comparative Study and Choice of an Appropriate Kernel for Support Vector Machines, eds. V. V. Das & R. Vijaykumar (Berlin, Heidelberg: Springer Berlin Heidelberg), 549
Sauvage, M., Tuffs, R. J., & Popescu, C. C. 2005, Space Sci. Rev., 119, 313
Schölkopf, B., Smola, A. J., & Müller, K.-R. 1999, in Advances in Kernel Methods, eds. B. Schölkopf, C. J. C. Burges, & A. J. Smola (Cambridge, USA: MIT Press), 327
Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., & Platt, J. 2000, Adv. Neural Inf. Process. Syst., 582
SDSS Collaboration, Albareti, F. D., Allende Prieto, C., et al. 2016, ArXiv e-prints [arXiv:1608.02013]
Secrest, N. J., Dudik, R. P., Dorland, B. N., et al. 2015, ApJS, 221, 12
Shawe-Taylor, J., & Cristianini, N.
2004, Kernel Methods for Pattern Analysis (Cambridge, UK: Cambridge University Press)
Shi, F., Liu, Y.-Y., Sun, G.-L., et al. 2015, MNRAS, 453, 122
Skrutskie, M. F., Cutri, R. M., Stiening, R., et al. 2006, AJ, 131, 1163
Solarz, A., Pollo, A., Takeuchi, T. T., et al. 2012, A&A, 541, A50
Solarz, A., Pollo, A., Takeuchi, T. T., et al. 2015, A&A, 582, A58
Stern, D., Assef, R. J., Benford, D. J., et al. 2012, ApJ, 753, 30
Tax, D. M., & Duin, R. P. 2004, Mach. Learn., 54, 45
Tax, D. M. J., & Duin, R. P. W. 1999, Patt. Recog. Lett., 20, 1191
Taylor, M. B. 2005, in Astronomical Data Analysis Software and Systems XIV, eds. P. Shopbell, M. Britton, & R. Ebert, ASP Conf. Ser., 347, 29
Taylor, M. B. 2006, in Astronomical Data Analysis Software and Systems XV, eds. C. Gabriel, C. Arviset, D. Ponz, & S. Enrique, ASP Conf. Ser., 351, 666
Škoda, P., Shakurova, K., Koza, J., & Palička, A. 2016, ArXiv e-prints [arXiv:1612.07549]
Škoda, P., Palička, A., Koza, J., & Shakurova, K. 2017, IAU Symp., 325, 180
Vapnik, V. N. 1995, The Nature of Statistical Learning Theory (New York, USA: Springer-Verlag New York, Inc.)
Vapnik, V., & Chervonenkis, A. 1974, Theory of Pattern Recognition [in Russian] (Moscow: Nauka); German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung (Berlin: Akademie-Verlag, 1979)
Walker, H. J., Volk, K., Wainscoat, R. J., Schwartz, D. E., & Cohen, M. 1989, AJ, 98, 2163
Wolf, C., Meisenheimer, K., Röser, H.-J., et al. 2001, A&A, 365, 681
Wright, E. L., Eisenhardt, P. R. M., Mainzer, A. K., et al. 2010, AJ, 140, 1868
Yan, L., Donoso, E., Tsai, C.-W., et al. 2013, AJ, 145, 55
Yang, J., & Wang, W. 2003, CluSeq: Efficient and Effective Sequence Clustering
York, D. G., Adelman, J., Anderson, Jr., J. E., et al. 2000, AJ, 120, 1579
Zhang, Y., & Zhao, Y. 2004, A&A, 422, 1113