Browsing by Author "BRESSAN, Stephane"
Now showing 1 - 20 of 21
Results Per Page
Sort Options
- ItemAn Application of Constraint Programming to Mobile Data(2005-10-19) BRESSAN, Stephane; TOK, Wee Hyong; PRESTWICH, StevenGiven the current ubiquity of wireless networks and mobile devices, data must be disseminated to a large number of these devices in an effective and efficient manner. A promising technique for such applications is data broadcasting. We introduce a new optimization problem derived from mobile data broadcasting applications. A single broadcaster transmits pages of data in sequence in response to requests from multiple clients, each asking for different subsets of a set of available pages. A transmitted page can be read by any number of clients, and we aim to minimise the total time taken to transmit all pages. Users impose additional constraints on how long they are willing to wait between pages, because of limited power supply in mobile devices. This problem is closely related to the multiple sequence alignment problem from bioinformatics. We present a first constraint model for this problem, designed for use with randomized local search algorithms.
- ItemA comparative study of synthetic dataset generation techniquesDANDEKAR, Ashish; ZEN, Remmy A. M.; BRESSAN, StephaneUnrestricted availability of the datasets is important for the researchers to evaluate their strategies to solve the research problems. While publicly releasing the datasets, it is equally important to protect the privacy of the respective data owners. Synthetic datasets that preserve the utility while protecting the privacy of the data owners stands as a midway. There are two ways to synthetically generate the data. Firstly, one can generate a fully synthetic dataset by subsampling it from a synthetically generated population. This technique is known as fully synthetic dataset generation. Secondly, one can generate a partially synthetic dataset by synthesizing the values of sensitive attributes. This technique is known as partially synthetic dataset generation. The datasets generated by these two techniques vary in their utilities as well as in their risks of disclosure. We perform a comparative study of these techniques with the use of different dataset synthesisers such as linear regression, decision tree, random forest and neural network. We evaluate the e ectiveness of these techniques towards the amounts of utility that they preserve and the risks of disclosure that they su er.
- ItemDifferential Privacy for Regularised Linear RegressionDANDEKAR, Ashish; BASU, Debabrota; BRESSAN, StephaneRecent attacks on machine learning models such as membership inference attacks increase the concern for privacy. Linear regression is such an essential statistical machine learning model at risk. For a given dataset, linear regression determines the parameters of the linear equation connecting the predictor variables to the response variable. As such linear regression yields a set of unstable and over tted parameters. Regularisation terms are added to the loss function of linear regression in order to avoid overftting. LASSO, ridge, and elastic net are three variants of regularised linear regression. We present an e-differentially private functional mechanism for the aforementioned variants of regularised linear regression. We empirically and comparatively analyze its effectiveness. A functional mechanism achieves differential privacy for linear regression by adding noise to the loss function. We empirically show that an e-differentially private functional mechanism causes more error than the non-private linear regression models whereas their performances are comparable. We also discuss caveats in the functional mechanism, such as non-convexity of the noisy loss function, which causes instability in the results of di erentially private linear regression models. This discussion puts forth the need of designing a differentially private mechanism that produces a noisy loss function that is convex.
- ItemDiscovering Semantics from Data-Centric XML(2013-06-17) LI, Luochen; LE, Thuy Ngoc; WU, Huayu; LING, Tok Wang; BRESSAN, StephaneIn database applications in general, and in applications using XML in particular, the availability of a conceptual schema or of elements of semantics constitute invaluable leverage for improving the effectiveness, and sometimes the efficiency, of many tasks including query processing, keyword search and schema and data integration. The Object-Relationship-Attribute model for Semi-Structured data (ORA-SS) model is a conceptual model designed to capture the semantics of the important constructs in a variety of data models in general and in semi-structured data models in particular. It is specifically intended to capture the semantics of object classes, object identifiers, relationship types, object attributes and relationship attributes underlying XML schemas and data. For a given application, we refer to the set of instances of these semantic concepts as the ORA-semantics of the application. While ORA-SS can be used a priori for the design of new applications, we are interested, in this work, in the automatically discovering of the ORA-semantics from existing XML data and XML schemas. In this paper, we present a novel approach to automatically discover the ORA-semantics from XML schemas and XML data. The approach we proposed is based on a set of techniques and heuristics that identify the different semantic concepts. We empirically and comparatively evaluate the effectiveness of the approach. We show by experiments that the semantics discovered by our approach has more than 90% accuracy.
- ItemDiscovering Semantics from XML(2012-03-26T01:54:23Z) LI, Luochen; LE, Thuy Ngoc; LING, Tok Wang; WU, Huayu; BRESSAN, StephaneIn database applications in general, and in applications using XML in particular, the availability of a conceptual schema or of elements of semantics constitute invaluable leverage for improving the effectiveness, and sometimes the efficiency, of many tasks including query processing, keyword search and schema and data integration. The Object-Relationship-Attribute model for Semi-Structured data (ORA-SS) model is a conceptual model designed to capture the semantics of the important constructs in a variety of data models in general and in semi-structured data models in particular. It is specifically intended to capture the semantics of object classes, object identifiers, relationship types, object attributes and relationship attributes underlying XML schemas and data. For a given application, we refer to the set of instances of these semantic concepts as the ORA-semantics of the application. While ORA-SS can be used a priori for the design of new applications, we are interested, in this work, in the automatically discovering of the ORA-semantics from existing XML data and schemas. In this paper, we present a novel approach to automatically discover the ORA-semantics from XML schemas and XML data. The approach we proposed is based on a set of techniques and heuristics that identify the different semantic concepts. We empirically and comparatively evaluate the effectiveness of the approach.
- ItemEdit Distance between XML and Probabilistic XML Documents(2011-06-03) TANG, Ruiming; WU, Huayu; NOBARI, Sadegh; BRESSAN, StephaneWe propose an efficient algorithm for computing of the edit distance between an XML document and a probabilistic XML document. Probabilistic XML is a hierarchical data model capturing uncertainty of both value and structure. It is suitable to many modern applications such as information extraction, scientific data management and data integration. The computation of similarity is an essential building block for the comparison, alignment, clustering and classification of data in these applications. Several algorithms exist for measuring the structural similarity between XML documents among themselves or XML documents and XML document type definitions and schemas. The new challenge in efficiently computing the similarity between an XML document and a probabilistic XML document is the multiplicity of the possible worlds that a probabilistic XML document represents. In this paper, we devise and discuss algorithms for computing the similarity between an XML document and a probabilistic XML document. We empirically and comparatively evaluate their performance. In the absence of established corpora and benchmarks for probabilistic XML, we also propose and use random probabilistic XML models together with the associated random generation algorithms.
- ItemAn Efficient and Truthful Pricing Mechanism for Team Formation in Crowdsourcing Markets(2014-10-20T08:36:51Z) LIU, Qing; LUO, Tie; TANG, Ruiming; BRESSAN, StephaneIn a crowdsourcing market, a requester is looking to form a team of workers to perform a complex task that requires a variety of skills. Candidate workers advertise their certified skills and bid prices for their participation. We design four incentive mechanisms for selecting workers to form a valid team (that can complete the task) and determining each individual worker’s payment.We examine the profitability, individually rationality, computationally efficiency, and truthfulness for each of the four mechanisms. Our study analytically shows that one of the mechanisms, called TruTeam, is superior to the others particularly due to its computationally efficiency and truthfulness. Furthermore, our extensive simulations confirm the analysis and demonstrate that TruTeam is an efficient and truthful pricing mechanism for team formation in crowdsourcing markets.
- ItemEvaluation of Di erentially Private Non-parametric Machine Learning as a ServiceDANDEKAR, Ashish; BASU, Debabrota; BRESSAN, StephaneMachine learning algorithms create models from training data for the purpose of estimation, prediction and classi cation. While releasing parametric machine learning models requires the release of the parameters of the model, releasing non-parametric machine learning models requires the release of the training dataset along with the parameters. Indeed, estimation, prediction or classi cation with non-parametric models computes some form of correlation between new data and the training data. The release of the training dataset creates a risk of breach of privacy. An alternative to the release of the training dataset is the presentation of the non-parametric model as a service. Still, the non-parametric model as a service may leak information about the training dataset. We study how to provide di erential privacy guarantees for non-parametric models as a service. This cannot be achieved by perturbation of the output but requires perturbation of the model functions. We show how to apply the perturbation to the model functions of histogram, kernel density estimator, kernel SVM and Gaussian process regression in order to provide ( ; )-di erential privacy. We evaluate the trade-o between the privacy guarantee and the error incurred for each of these non-parametric machine learning algorithms on benchmarks and real-world datasets.Our contribution is twofold. We show that functional perturbation is not only pragmatic for releasing machine learning models as a service but also yields higher e ectiveness than output perturbation mechanisms for speci ed privacy parameters. We show the practical step to perturbate the model functions of histogram, kernel SVM, Gaussian process regression along with kernel density estimator. We evaluate the tradeo between the privacy guarantee and the error incurred for each of these non-parametric machine learning algorithms for a real-world dataset as well as a selection of benchmarks.
- ItemExpressing and Processing Path-centric XML Queries(2012-04-02T09:09:42Z) WU, Huayu; TANG, Ruiming; LING, Tok Wang; BRESSAN, StephaneThe support for navigation and browsing in XML query languages is based on XPath. XQuery, a W3C standardized XML query language, introduces FLOWR constructs on top of XPath, to express more complex query purposes. However, a family of very practical queries, which aim to return or manipulate paths as first class objects, cannot be expressed by XPath or XQuery FLOWR expressions. We call them path-centric queries. Of course, a user can program an XQuery user-defined recursive function to meet pathcentric queries, but the concern is the convenience and practicality of requiring normal database users to be equipped with programming skills. In this paper, we analyze the root cause of the limited expressivity of XPath and XQuery for path-centric queries. Based on the analysis, we propose seamless extension to XQuery FLOWR to elegantly express path-centric queries. We further investigate intra-path aggregation, an analytical operation in pathcentric queries, and show how they can be easily expressed and efficiently processed.
- ItemFast Community Detection(2013-05-21T01:36:27Z) SONG, Yi; BRESSAN, StephaneWe propose an algorithm for detecting communities in networks. The algorithm exploits degree and clustering coefficient of vertices as we hypothesize that these metrics characterize dense connections indicative of a community. Each vertex, independently, seeks the community to which it belongs by visiting its neighbor vertices and choosing its peers on the basis of their degrees and clustering coefficients. The algorithm is intrinsically data parallel. We devise a version for graphics Processing Unit (GPU). We empirically evaluate the performance of our method. We measure and compare its efficiency and effectiveness to several state of the art community detection algorithms. Effectiveness is quantified by five metrics, namely, modularity, conductance, internal density, cut ratio and weighted community clustering. Efficiency is measured by the running time. Clearly the opportunity to parallelize our algorithm yields an efficient solution to the community detection problem.
- ItemForce-directed Layout Community Detection(2013-05-21T01:33:30Z) SONG, Yi; BRESSAN, StephaneIn this paper, we propose a graph-layout based method for detecting communities in networks. We first project the graph onto a Euclidean space using Fruchterman-Reingold algorithm, a force-based graph drawing algorithm. We then cluster the vertices according to their Euclidean distance. The idea is similar to that of dimension reduction. The graph drawing in two or more dimension provides a heuristic decision as whether vertices are connected by a short path based on their Euclidean distance. We study community detection for both disjoint and overlapping communities. For the case of disjoint communities, we use k-means clustering. For the case of overlapping communities, we use fuzzy-c means algorithm. We evaluate the performance of our different algorithms for varying parameters and number of iterations. We compare the results to several state of the art community detection algorithms, each of which clusters the graph directly or indirectly according to geodesic distance. We show that, for non-trivially small graphs, our method is both effective and efficient. We measure effectiveness using modularity when the communities are not known in advance and precision when the communities are known in advance. We measure efficiency with running time. The running time of our algorithms can be controlled by the number of iterations of the Fruchterman-Reingold algorithm.
- ItemA Framework for Conditioning Uncertain(2012-01-25) TANG, Ruiming; CHENG, Reynold; WU, Huayu; BRESSAN, StephaneWe propose a framework for representing conditioned probabilistic relational data. In this framework the existence of tuples in possible worlds is determined by Boolean expressions composed from elementary events. The probability of a possible world is computed from the probabilities associated with these elementary events. In addition, a set of global constraints conditions the database. Conditioning is the formalization of the process of adding knowledge to a database. Some worlds may be impossible given the constraints and the probabilities of possible worlds are accordingly re-defined. The new constraints can come from the observation of the existence or non-existence of a tuple, from the knowledge of a specific rule, such as the existence of an exclusive set of tuples, or from the knowledge of a general rule, such as a functional dependency. We are therefore interested in computing a concise representation of the possible worlds and their respective probabilities after the addition of new constraints, namely an equivalent probabilistic database instance without constraints after conditioning. We devise and present a general algorithm for this computation. Unfortunately, the general problem involves the simplification of general Boolean expressions and is NP-hard. We therefore identify specific practical families of constraints for which we devise and present efficient algorithms.
- ItemMeasuring XML Structured-ness with Entropy(2011-06-03) TANG, Ruiming; WU, Huayu; BRESSAN, StephaneXML is semi-structured. It can be used to annotate unstructured data, to represent structured data and almost anything in-between. Yet, it is unclear how to formally characterize, yet to quantify, structuredness of XML. In this paper we propose and evaluate entropy-based metrics for XML structured-ness. The metrics measure the structural uniformity of path and subtrees, respectively. We empirically study the correlation of these metrics with real and synthetic data sets.
- ItemNon-blocking Spatial Join(2007-07-25) TOK, Wee Hyong; BRESSAN, Stephane; LEE, Mong LiWe propose and study sequential non-blocking algorithms for the processing of spatial joins on continuous data streams with unpredictable arrival rates or on large collections of spatial data that are not indexed. Given two sets of spatial data represented by their bounding boxes, the algorithms immediately and continuously compute and output the pairs of data from each set whose bounding boxes intersect. The different algorithms we propose take advantage of different possible characteristics of the data such as clustering of the input to build indexes or synopses to accelerate the production of results. We comparatively analyze the performance of the proposed algorithms using several synthetic and realistic data sets.
- ItemThe Price is Right(2012-05-28T08:07:59Z) TANG, Ruiming; WU, Huayu; KISTER, Thomas; BRESSAN, StephaneData is a modern commodity. Electronic data market places are being designed and deployed. Yet current pricing models either focus on processing or are proprietary, opaque and not conducive of a healthy commodity market dynamics. In this paper we propose a generic data pricing model that is based on minimal lineages, i.e. sets of tuples contributing to the result of a query. We show that the proposed model fulfils desirable properties such as contribution monotonicity, bounded price and arbitrage-freedom. We extend the model to the case of probabilistic databases in the X-tuple model. We propose algorithms to the compute the price of a query based on our pricing models.
- ItemPublishing Trajectory with Differential Privacy: A Priori vs A Posteriori Sampling Mechanisms(2013-04-16T03:02:38Z) SHAO, Dongxu; JIANG, Kaifeng; KISTER, Thomas; BRESSAN, Stephane; TAN, Kian LeeIt is now possible to collect and share trajectory data for any ship in the world by various means such as satellite and VHF systems. However, the publication of such data also creates new risks for privacy breach with consequences on the security and liability of the stakeholders. Thus, there is an urgent need to develop methods for preserving the privacy of published trajectory data. In this paper, we propose and comparatively investigate two mechanisms for the publication of the trajectory of individual ships under differential privacy guarantees. Traditionally, privacy and differential privacy is achieved by perturbation of the result or the data according to the sensitivity of the query. Our approach, instead, combines sampling and interpolation. We present and compare two techniques in which we sample and interpolate (a priori) and interpolate and sample (a posteriori), respectively. We show that both techniques achieve a $(0, \delta)$ form of differential privacy. We analytically and empirically, with real ship trajectories, study the privacy guarantee and utility of the methods.
- ItemRicochet: A Family of Unconstrained Algorithms for Graph clustering(2007-07-30) WIJAYA, Derry Tanti; BRESSAN, StephanePartitional graph clustering algorithms like K-means and Star necessitate a priori decisions on the number of clusters and threshold on the weight of edges to be considered, respectively. These decisions are difficult to make and their impact on clustering performance can be significant. We propose a family of algorithms for weighted graph clustering that neither requires a predefined number of clusters, unlike K-means, nor a threshold on the weight of edges, unlike Star. To do so, we use re-assignment of vertices as a halting criterion, as in K-means, and a metric for selecting clusters’ seeds, as in Star. Pictorially, the algorithms’ strategy resembles the rippling of stones thrown in a pond, thus the name ‘Ricochet’. We evaluate the performance of our proposed algorithms using standard datasets. In particular, we evaluate the impact of removing the constraints on the number of clusters and threshold by comparing the performance of our algorithms with K-means and Star. We are also comparing the performance of our algorithms with Markov Clustering which is not parameterized by number of clusters nor threshold but has a fine tuning parameter that impacts the coarseness of the result clusters.
- ItemSensitive Label Privacy Protection on Social Network Data(2012-03-30) SONG, Yi; KARRAS, Panagiotis; XIAO, Qian; BRESSAN, StephaneThe publication of social network data presents opportunities for data mining and analytics for strategic public, commercial and academic applications. Yet the publication of social network data entails a privacy threat for their users. Sensitive information should be protected. The challenge is to devise methods to publish these data in a form that affords utility without compromising privacy. Previous research has proposed various privacy models with the corresponding protection mechanisms. These early privacy models are mostly concerned with identity and link disclosure. The social networks are modeled as graphs in which users are nodes. The threat definitions and the protection mechanisms leverage structural properties of the graph. This paper is motivated by the recognition of the need for a finer grain and more personalized privacy. We propose a privacy protection scheme that not only prevents the disclosure of identity of users but also the disclosure of selected features in users' profiles. An individual user can select which features of her profile she wishes to conceal. The social networks are modeled as graphs in which users are nodes and features are labels. Labels are denoted either as sensitive or as non-sensitive. We treat node labels both as background knowledge an adversary may possess, and as sensitive information that has to be protected. We present privacy protection algorithms that allow for graph data to be published in a form such that an adversary who possesses information about a node's neighborhood cannot safely infer its identity and its sensitive labels. To this aim, the algorithms transform the original graph into a graph in which nodes are sufficiently indistinguishable. The algorithms are designed to do so while losing as little information and while preserving as much utility as possible. We evaluate empirically the extent to which the algorithms preserve the original graph's structure and properties. We show that our solution is effective, efficient and scalable while offering stronger privacy guarantees than those in previous research.
- ItemTop-k Queries over Uncertain Scores(2016-09-03) LIU, Qing; BASU, Debabrota; ABDESSALEM, Talel; BRESSAN, StephaneModern recommendation systems leverage some forms of collaborative user or crowd sourced collection of information. For instance, services like TripAdvisor, Airbnb and HungyGoWhere rely on user-generated content to describe and classify hotels, vacation rentals and restaurants. By nature of such independent collection of information, the multiplicity, diversity and varying quality of the information collected result in uncertainty. Objects, such as the services o ffered by hotels, vacation rentals and restaurants, have uncertain scores for their various features. In this context, ranking of uncertain data becomes a crucial issue. Several data models for uncertain data and several semantics for probabilistic top-k queries have been proposed in the literature. We consider here a model of objects with uncertain scores given as probability distributions and the semantics proposed by the state of the art reference work of Soliman, Hyas and Ben-David. In this paper, we explore the design space of Metropolis-Hastings Markov chain Monte Carlo algorithms for answering probabilistic top-k queries over a database of objects with uncertain scores. We are able to devise several algorithms that yield better performance than the reference algorithm. We empirically and comparatively prove the eff ectiveness and effi ciency of these new algorithms.
- ItemTwig'n Join: Progressive Query Processing of Multiple XML Streams(2007-09-03T08:58:02Z) TOK, Wee Hyong; BRESSAN, Stephane; LEE, Mong-LiWe propose a practical approach to the progressive processing of (FWR) XQuery queries on multiple XML streams, called Twig.n Join (or TnJ). The query is decomposed into a query plan combining several Twig queries on the individual streams, followed by a join and a final Twig query. The processing is itself accordingly decomposed into three pipelined stages progressively producing streams of XML fragments. We use the recently proposed TwigM algorithm for the progressive evaluation of Twig queries. We devise an original progressive join algorithm for the XML fragments leveraging our previous work on relational result-rate based progressive joins. In addition, we propose a multi-way XML value join algorithm, which uses a result-oriented approach to determine the probing sequence for multiple XML streams. We comparatively evaluate the performance of the Twig.n Join variants using both synthetic and real-life data from standard XML query processing benchmarks. We show that Twig.n Join with a result-rate based approach is most effective and efficient.