RETEXO: Scalable Neural Network Training over Distributed GraphsAashish Kolluri, Sarthak Choudhary, Bryan Hooi, and Prateek SaxenaArxiv, Under Review 2023
Graph neural networks offer a promising approach to supervised learning over graph data. Graph data, especially when it is privacy-sensitive or too large to train on centrally, is often stored partitioned across disparate processing units (clients) which want to minimize the communication costs during collaborative training. The fully-distributed setup takes such partitioning to its extreme, wherein features of only a single node and its adjacent edges are kept locally with one client processor. Existing GNNs are not architected for training in such setups and incur prohibitive costs therein. We propose RETEXO, a novel transformation of existing GNNs that improves the communication efficiency during training in the fully-distributed setup. We experimentally confirm that RETEXO offers up to 6 orders of magnitude better communication efficiency even when training shallow GNNs, with a minimal trade-off in accuracy for supervised node classification tasks.
User-customizable Transpilation of Scripting LanguagesOOPSLA 2023
LPGNet: Link Private Graph Networks for Node ClassificationAashish Kolluri, Teodora Baluta, Bryan Hooi, and Prateek SaxenaCCS 2022
Classification tasks on labeled graph-structured data have many important applications ranging from social recommendation to financial modeling. Deep neural networks are increasingly being used for node classification on graphs, wherein nodes with similar features have to be given the same label. Graph convolutional networks (GCNs) are one such widely studied neural network architecture that perform well on this task. However, powerful link-stealing attacks on GCNs have recently shown that even with black-box access to the trained model, inferring which links (or edges) are present in the training graph is practical. In this paper, we present a new neural network architecture called LPGNet for training on graphs with privacy-sensitive edges. LPGNet provides differential privacy guarantees for edges using a novel design for how graph edge structure is used during training. We empirically show that LPGNet models often lie in the sweet spot between providing privacy and utility: They can offer better utility than "trivially" private architectures which use no edge information (e.g. vanilla MLPs) and better resilience against existing link-stealing attacks than vanilla GCNs which use the full edge structure. LPGNet also offers consistently better privacy-utility tradeoffs than DPGCN, which is the state-of-the-art mechanism for retrofitting differential privacy into conventional GCNs, in most of our evaluated datasets.
An Empirical Snapshot of Differential Privacy in Graph Analysis ApplicationsAashish Kolluri, Kareem Shehata, and Prateek SaxenaUnder Review 2022
Personalized recommendations, online advertisement campaigns, and many more applications are fueled using large-scale analytics on social and contact graphs. These applications are now being hindered by the lack of data sharing between businesses due to privacy concerns of users present in the graph. Differential privacy is a strong privacy guarantee that enables sharing data while providing privacy to the users. Therefore, designing differentially private graph queries that support these applications is an attractive solution. Surprisingly, in the past decade only a few differentially private graph queries have been designed and most of them are not evaluated on an end application. Therefore, the utility of these queries in end applications is poorly understood compared to other queries on relational databases, machine learning, and aggregate statistics for numerical data. In this work, we answer how useful are state-of-the-art differentially private graph queries in two end applications — influence maximization and social recommendation. Mainly we conclude that most of the existing queries are not practically viable for both of these applications even at a conservative privacy budget of ε = 2. The private queries that compute complex outputs such as communities and synthetic graphs often sacrifice utility to a point that makes them less useful than a simple degree query. Finally, we identify the drawbacks in the existing private queries and provide potential directions for future work.
Private Hierarchical Clustering in Federated NetworksAashish Kolluri, Teodora Baluta, and Prateek SaxenaCCS 2021
Analyzing structural properties of social networks, such as identifying their clusters or finding their central nodes, has many applications. However, these applications are not supported by federated social networks that allow users to store their social contacts locally on their end devices. In the federated regime, users want access to personalized services while also keeping their social contacts private. In this paper, we take a step towards enabling analytics on federated networks with differential privacy guarantees about protecting the user’s social contacts. Specifically, we present the first work to compute hierarchical cluster trees using local differential privacy. Our algorithms for computing them are novel and come with theoretical bounds on the quality of the trees learned. Empirically, our differentially private algorithms learn trees that are of comparable quality (with at most about 10% utility loss) to the trees obtained from the non-private algorithms, while having reasonable privacy (0.5 łeq ε łeq 2). Private hierarchical cluster trees enable new application setups where a service provider can query the community structure around a target user without having their social contacts. We show the utility of such queries by redesigning two state-of-the-art social recommendation algorithms for the federated social network setup. Our recommendation algorithms significantly outperform the baselines that do not use social contacts.
SynGuar: Guaranteeing Generalization in Programming by ExampleBo Wang, Teodora Baluta, Aashish Kolluri, and Prateek SaxenaESEC/FSE 2021
Programming by Example (PBE) is a program synthesis paradigm in which the synthesizer creates a program that matches a set of given examples. In many applications of such synthesis (e.g., program repair or reverse engineering), we are to reconstruct a program that is close to a specific target program, not merely to produce some program that satisfies the seen examples. In such settings, we wish that the synthesized program generalizes well, i.e., has as few errors as possible on the unobserved examples capturing the target function behavior. In this paper, we propose the first framework (called SynGuar) for PBE synthesizers that guarantees to achieve low generalization error with high probability. Our main contribution is a procedure to dynamically calculate how many additional examples suffice to theoretically guarantee generalization. We show how our techniques can be used in 2 well-known synthesis approaches: PROSE and STUN (synthesis through unification), for common string-manipulation program benchmarks. We find that often a few hundred examples suffice to provably bound generalization error below 5% with high (≥ 98%) probability on these benchmarks. Further, we confirm this empirically: SynGuar significantly improves the accuracy of existing synthesizers in generating the right target programs. But with fewer examples chosen arbitrarily, the same baseline synthesizers (without SynGuar) overfit and lose accuracy.
Localizing Vulnerabilities Statistically From One ExploitAsiaCCS 2021
Automatic vulnerability diagnosis can help security analysts identify and, therefore, quickly patch disclosed vulnerabilities. The vulnerability localization problem is to automatically find a program point at which the "root cause" of the bug can be fixed. This paper employs a statistical localization approach to analyze a given exploit. Our main technical contribution is a novel procedure to systematically construct a test-suite which enables high-fidelity localization. We build our techniques in a tool called VulnLoc which automatically pinpoints vulnerability locations, given just one exploit, with high accuracy. VulnLoc does not make any assumptions about the availability of source code, test suites, or specialized knowledge of the type of vulnerability. It identifies actionable locations in its Top-5 outputs, where a correct patch can be applied, for about 88% of 43 CVEs arising in large real-world applications we study. These include 6 different classes of security flaws. Our results highlight the under-explored power of statistical analyses, when combined with suitable test-generation techniques.
Exploiting the Laws of Order in Smart ContractsISSTA 2019
We investigate a family of bugs in blockchain-based smart contracts, which we dub event-ordering (or EO) bugs. These bugs are intimately related to the dynamic ordering of contract events, i.e. calls of its functions, and enable potential exploits of millions of USD worth of crypto-coins. Previous techniques to detect EO bugs have been restricted to those bugs that involve just one or two event orderings. Our work provides a new formulation of the general class of EO bugs arising in long permutations of such events by using techniques from concurrent program analysis. The technical challenge in detecting EO bugs in blockchain contracts is the inherent combinatorial blowup in path and state space analysis, even for simple contracts. We propose the first use of partial-order reduction techniques, using automatically extracted happens-before relations along with several dynamic symbolic execution optimizations. We build EthRacer, an automatic analysis tool that runs directly on Ethereum bytecode and requires no hints from users. It flags 8% of over 10, 000 contracts analyzed, providing compact event traces (witnesses) that human analysts can examine in only a few minutes per contract. More than half of the flagged contracts are likely to have unintended behaviour.
Finding The Greedy, Prodigal, and Suicidal Contracts at ScaleACSAC 2018
Smart contracts—stateful executable objects hosted on blockchains like Ethereum—carry billions of dollars worth of coins and cannot be updated once deployed. We present a new systematic characterization of a class of trace vulnerabilities, which result from analyzing multiple invocations of a contract over its lifetime. We focus attention on three example properties of such trace vulnerabilities: finding contracts that either lock funds indefinitely, leak them carelessly to arbitrary users, or can be killed by anyone. We implemented Maian, the first tool for specifying and reasoning about trace properties, which employs interprocedural symbolic analysis and concrete validator for exhibiting real exploits. Our analysis of nearly one million contracts flags 34, 200 (2, 365 distinct) contracts vulnerable, in 10 seconds per contract. On a subset of 3, 759 contracts which we sampled for concrete validation and manual analysis, we reproduce real exploits at a true positive rate of 89%, yielding exploits for 3, 686 contracts. Our tool finds exploits for the infamous Parity bug that indirectly locked $200 million US worth in Ether, which previous analyses failed to capture.
(Invited Paper) on the Security of Blockchain Consensus ProtocolsSourav Das, Aashish Kolluri, Prateek Saxena, and Haifeng YuICISS 2018
In the last decade, several permissionless proof-of-work blockchain protocols have focused on scalability. Since these protocols are very difficult to change once deployed, their robustness and security are of paramount importance. This paper summarizes the desired end properties of blockchain consensus protocols and sheds light on the critical role of theoretical analyses of their design. We summarize the major paradigms in prior constructions and discuss open issues in this space.