Yale University Department of Computer Science

TAG:  virus software 
Published Time: -
Filetype: pdf
Filesize: 0
Yale University Department of Computer Science Inoculation Strategies for Victims of Viruses and the Sum-of-Squares Partition Problem James Aspnes 12 Kevin Chang 13 Aleksandr Yampolskiy 14 YALEU/DCS/TR-1295 July 2004 1 Department of Computer Science, Yale University, New Haven, CT 06520-8285, USA. 2 Email: aspnes@cs.yale.edu. Supported in part by NSF grants CCR-0098078 and CNS- 0305258. 3 Email: kevin.chang@yale.edu. Supported by NSF grant CCR-0331548. 4 Email: aleksandr.yampolskiy@yale.edu. Supported by NSF grants CCR-0098078 and ANI- 0207399.
Inoculation Strategies for Victims of Viruses and the Sum-of-Squares Partition Problem James Aspnes† Kevin Chang‡ Aleksandr Yampolskiy§ Abstract We propose a simple game for modeling containment of the spread of viruses in a graph of n nodes. Each node must choose to either install anti-virus software at some known cost C, or risk infection andaloss Lifavirusthat starts at a random initial point in the graph can reach it without being stopped by some intermediate node. The goal of individual nodes is to minimize their individual expected cost. We prove many game theoretic properties of the model, including an easily applied characterization of Nashequilibria, culminating in our showing that allowing selfish users to choose Nashequilibrium strategies is highly undesirable, because the price of anarchy is an unacceptable ( n) in the worst case. This shows in particular that a centralized solution can give a much better total cost than an equilibrium solution. Though it is NP-hard to compute such a social optimum, we show that the problem can be reduced to a previously unconsidered combinatorial problem that we call the sum-of-squares partition problem. Using a greedy algorithm based on sparse cuts, we show that this problem can be approximated to within a factor of O (log2 n), giving the same approximation ratio for the inoculation game. 1 Introduction Consider a system in whichnmachines, each of which mayor may not be protected against viruses, are connected by a network in the form of a graph, and any virus that infects some machine eventually infects all of its unprotected neighbors. If anti-virus software is available, a natural response would be to protect all the machines—but perhaps the anti-virus software itself creates costs, both in money and time to purchase and install the software and in reducedeciency or usability of the protected machine. Suppose that protectinga machine by installing anti-virus software costs the owner of the machine C, but that having the machine be infected costs L, which mayor may not be greater than C. If the virus spreads from some initial machine chosen uniformly at random, on which machines does it make sense to install the anti-virus software? The answer will depend on whether we adopt the perspective of the owner of a single machine or of the society as a whole. When the anti-virus software costs more than the loss from infection, no economically rational machine owner will install the anti-virus software, every machine will be infected, and the system will incur asocial cost of Ln. But for many graphs, selective inoculation of a few central machines can limit the spread of infection to a small subset of the graph, greatly reducing the total cost of infection in return fora small investment in anti-virus software. We can ask how much of an improvementa centralized solution can provide, and how easy it is to find a good centralized solution. Department of Computer Science, Yale University, New Haven, CT 06520-8285, USA. Email: aspnes@cs.yale.edu. Supported in part by NSF grants CCR-0098078 and CNS-0305258. Email: kevin.chang@yale.edu. Supported by NSF grant CCR-0331548. § Email: aleksandr.yampolskiy@yale.edu. Supported by NSF grants CCR-0098078 and ANI-0207399. 1
After discussing some previous work on related problems (in Section 2), we give a complete characterization of the Nashequilibria for an anti-virus software installation game in which each machine'sownerseparately chooses whether or not to install the software, without regard to the eect on other machines. (This game is defined in Section 3.) We show (in Section 4) that finding either the most or least expensive equilibrium is NP-hard, but that some Nashequilibrium can be computed in O (n3) time and that any population of nodes will quickly converge toa Nash equilibrium by updating their strategies locally based on the other nodes'strategies. Unfortunately, the cost of any such Nash equilibrium maybe badly suboptimal; the price of anarchy for this game is ( n) in the worst case. This shows that for many graphs and values of Cand L, letting the users choose individually whether or not to inoculate their machines will give bad results. We then consider (in Section 5) the possibility of a centralized solution in which a dictator computes and enforces an optimal inoculation plan. We show that essentially the same argument that shows that extreme Nashequilibria are hard to find applies to the optimal solution as well. However, we show that the problem of finding an optimal inoculation plan reduces to a graph partition problem in which we are asked to partition the graph by removingmnodes; the quality of the partition is measured by the sum of the squares of the sizes of its components. We give (in Section6) a polynomial-time approximation algorithm that removes O (log2 n) mnodesinorderto achievea partition with quality within O (1) of the optimum. Conclusions and open problems appear in Section 7. 2 Related work In this section, we describe three classes of work related to this paper: virus propagation models, economic models of investment insecurity, and game-theoretic models of security. We then discuss some work on the graph partition problem that is related to the sum-of-squares partition problem we consider in Section 6. 2.1 Viruspropagation models Traditional epidemiological models characterize the viral infection in terms of birthrate and death rate of the virus[4,9]. Usually, these models assume that an infected individual is equally likely to infect any other individual in the population; meanwhile, computer viruses usually spread via localized interactions. Kephartand White extended the traditional model by transferring it onto a directed random graph[16]. Later works (e.g. , [15,17,25]) studied virus propagation over other kinds of graphs, including Internet-like power-law graphs[22,23,26]. We do not restrict the network topology in anyway and consider a general undirected graph. Our model is in someways closer to models in percolation theory (see[18]): an infected node infects all of its unprotected neighbors, spreading infection throughout the graph until it is blocked by an anti-virus software. 2.2 Economicmodels of security Our work is motivated in part by an observation that security technologies exhibit network externalities[1].             Specifically, the benefit obtained by using security technology (anti-virus software in our case) does not accrue only to the user of the security technology but rather to all users of the network. Aspnesetal. [3]also consider anti-virus immunization, and proposed studying how to encourage highly connected nodes touse anti-viral techniques. 2
We assume that costs of installation and infection are known. Alternatively, one could use risk analysis to estimate the costs and benefits from installing a security technology (see, for example,     [13]), or estimate values based on empirical studies of the costs of security breaches[6,10]. 2.3 Game-theoretic models of security Application of game theory to network security has yielded interesting results [11,12,24]. For example, Bell uses a simple game to study network reliability. In the game, the router tries to find a least cost path and a network tester tries to maximize this cost by failing links[5]. Kunreuther and Heal recently introduced the notion of interdependent security (IDS) games, in which decisions to adopt security technology by one agentaect other agents[20]. Kearnsand Ortiz subsequently extended their paper and gave an algorithm for finding approximate Nashequilibria in this model[14]. Our work is similar to work on IDS games in certain respects: each agent in both our game and an IDS game makes a decision whether or not to invest money in a security technology, and this decisionaects other agents. The maindierences are that we assume that installing anti-virus software protects against all badeects of viruses, while the IDS work concentrates on negative side-eects of security breaches even on protected parties; and we assumea restricted network topology that contains the spread of viruses, while the IDS work assumes a complete topology. 2.4 Graphpartition problems In Section6, we describe and provide an approximate solution fora new graph partitioning problem. Previous work on other forms of graph partitioning includes the approximation algo- rithmofLeighton and Rao[21]for sparsest cut, from which the same authors derive apseudo- approximation algorithm forb-balanced cuts, where each side of the cut must have sizeb|V|or greater. The case ofb=1/2 is graph bisection, for which Feigeand Krauthgamer[8]giveagood approximation algorithm. Even et al. [7]give O (logn)-ratio pseudo-approximation algorithms for several balanced partitioning problems, including the-separator problem and thek-balanced partitioning problem. 3 Our model We represent network topology by an undirected graph G= (V,E), where V={0,1,...,n 1}is a set of network hosts and E V?Visasetof (bidirectional) communication links. Our basic model for installing anti-virus software is a one-round game: Players. Our game hasnplayers corresponding to nodes of the graph. Initially, all nodes are insecure and vulnerable to infection. Strategies. We denote the strategy of i by a i . Each node i has two possible actions: do nothing and risk being infected or inoculate itself by installing anti-virus software. Node i 'sstrategy a i is the probability that it inoculates itself. Nodes'choices can be summarized in a strategy profile~a 2[0,1]n. Ifa i is 0or1, we say that node i adopts a pure strategy; otherwise, its strategy is mixed. We call nodes that install anti-virus software secure and denote the set of such nodes by I ~a . We associate an 3
attack graph G ~a =GI ~a with~a. It is essentially the network graph with secure nodes and their edges removed (see also Figure 1). Note that both I ~a andG ~a are random variables unless all strategies are pure. Attack model. After the nodes made their choices, the adversary picks some node uniformly at random as a starting point for infection. Infection then propagates through the network graph. Node i gets infected if it has no anti-virus software installed and if any of its neighbors become infected. Individual costs. Suppose it costs Ctoinstallanti-virus software. If anode is infected, it suers alossequaltoL. For simplicity, we assume that both Cand Lareknownandare the same for all nodes; we discuss possible consequences of removing these assumptions in Section 7. The cost of a mixed strategy~a 2 [0,1]ntonodeiis cost i (~a) =a i C+ (1 a i ) L·p i (~a) . Herep i (~a) is the probability of node i being infected given the strategy profile~a, conditioned on the event that node i does not install the anti-virus software. It is equal to the probability that some vulnerable node reachable from i without passing through a secure node is the initial point of infection. For pure strategies, this is justk i /n, where k i is the size of the connected component containing i in the attack graph G ~a . 1 2 3 0 4 5 G 1 2 3 0 4 5 G a Figure1: Sample graph Ganditsattack graph G ~a for~a=010100. Social cost. The total social cost of a strategy profile is the sum of the individual costs. For pure strategies, there is a simple characterization of the total social cost in terms of the attack graphG ~a . Because each node in the same component of G ~a has the same chance of infection, thek i nodes in the i -thcomponent between them face a loss of k i · (Lk i /n) = (L/n) k2 i . So the social cost is cost(~a) = n 1 X j=0 cost j (~a) = n 1 X j=0 a j C+ (1 a j ) L·p j (~a) =C|I ~a |+ L n l X i=1 k2 i , 4
wherek 1 ,k 2 ,...k ' are the sizes of the components in G ~a . 4 Nashequilibria We consider first the choices that the nodes will make in the absence of coordination, by examining theNash equilibria of the game defined in Section 3. The assumption that the nodes will reach a Nashequilibrium is a very strong one, as it requires assuming that the nodes are aware of each other'schoicesto install or not and that the nodes can evaluate C (printed on the box for the antivirus      software) and L (which is more problematic). It also assumes that the nodes can compute a Nashequilibrium in a reasonable amount of time, which is not always possible for some games. However, we can show that Nashequilibria for our game are easily characterized in terms of the sizes of the components of the attack graph (Section 4.1), and that a population will converge to someNash equilibrium quickly even though finding the best or worst pure equilibrium as measured by total cost is NP-hard (Section 4.2). We can further imagine that some of thediculties of limited information could be overcome by considering an iterated game where nodes pay Ctorenttheanti-virus software in each round and update their strategies based on observations of losses to viruses and the strategies of other nodes in previous rounds; though we do not analyze this multi-round game formally, a simplified version is implicit in our convergence result. We also show that the hardness of finding the worst- case equilibrium does not prevent obtaining further information about its behavior; for example, its total cost is nondecreasing as a function of the inoculation cost C (Section 4.3). Unfortunately, selfish behavior proves to be highly undesirable, because the cost ofa Nash equilibrium solution maybe very far from the social optimum. In Section 4.4, we prove that while the price of anarchy, defined as the ratio of total cost between the worst Nashequilibrium and the social optimum never exceedsn, this bound is tight up to constant factors for some graphs and choicesofCand L. 4.1 Characterization of mixed and pure equilibria A useful feature of the Nashequilibrium for our game is its simple characterization: there is always a component-size thresholdt=Cn/Lsuchthateach node will install the anti-virus software if it would otherwise end up in a component of vulnerable nodes with expected size greater thant, and will not install the software if it would otherwise end up in a component with expected size less thant. When the expected component size equalst, the node isindierent between installing and not installing and may adoptamixed strategy. The threshold arises in a natural way: it is the break-even point at which the cost Cofinstallingthe software equals the expected loss L (t/n) of not installing. We define~a[i/x]to be the strategy vector that is identical to~a, except the i 'thcomponenta i isreplacedbyx. Note that attack graph G ~a[i/0] is the attack graph in which player i never installs the anti-virus software. Theorem1 (Characterization of mixed equilibria): Suppose S (i) is the expected size of the insecure component that contains node i of the attack graph G ~a[i/0] , (i.e. S (i) =np i (~a) ). FixC,L. Let the threshold bet=Cn/L. A strategy profile ~aisa Nashequilibrium if and only if 5
(a) For all i such that a i =1, S (i) t. (b) For all i such that a i =0, S (i) t. (c) For all i such that 0<a i <1, S (i) =t. Proof: Suppose~aisa Nashequilibrium and consider node i. The expected cost to node i is a i C+(1 a i )(L/n) S (i). 1. Supposea i =0. Then nodeihasexpected cost (L/n) S (i). If (L/n) S (i) >C, then i will want find the situationa i =1 withcostCpreferable. Thus, we must have S (i) CL/n=t. 2. Supposea i =1. Then nodeihasexpected cost C. If (L/n) S (i) <C, then i would find the situationa i =0 withexpected cost (L/n) S (i) <Cpreferable. Thus, we must have S (i) CL/n=t. 3. Suppose0<a i <1. If (L/n) S (i) >C, then i will find the situationa i =1 preferable. If (L/n) S (i) <C, then i will find the situationa i =0 preferable. Thus, we must have S (i) =CL/n=t. Thus, ~asatisfies condition (a), (b), and (c) above. Conversely, suppose ~asatisfies conditions (a), (b) and (c) of the theorem. Consider node i. 1. Supposea i =0. Then nodeiwillhave expected cost (L/n) S (i) <C, and thus will not want to switch to a dierenta i that puts any weight on installing at cost C. 2. Supposea i =1. Then nodeiwillhave cost C, and thus will not want to switch to a dierent a i that puts any weight on being insecure at expected cost (L/n) S (i) C. 3. Suppose0<a i <1. Then nodeiwillhave expected costa i C+ (1 a i )(L/n) S (i) =C. Switching to any other strategy will have the same expected cost. Thus, ~aisaNash equilibrium. A special case of Theorem 1 is the following characterization for pure Nashequilibria. Because nodes in a pure Nash equilibrium do not make randomized choices, the attack graph is nota random object, but a determined graph. We have the same threshold conditions as before, but the removal of randomness simplifies the situation. Corollary2 (Characterization of pure equilibria) Fix C,L. Let the threshold bet=Cn/L. A strategy profile~aisapure Nashequilibrium if and only if (a) Every component in attack graph G ~a hassizeatmostt. (b) Inserting any secure nodej 2 I ~a and its edges into G ~a yields a component of size at leastt. For example, let C=0.5 and L=1, and consider the graph in Figure 1. The threshold for this graph is t=Cn/L=3. Then Corollary 2 tells us that pure strategy ~a=010100 must bea Nash equilibrium for these Cand L. 6
4.2 Computing pure Nashequilibria Designing algorithms for finding mixed Nashequilibriaor proving hardness results for finding optimized mixed equilibria would most likely involve estimating or otherwise manipulating the expected value of the sizes of components in the attack graph, which is at the very least anon- trivial problem. Furthermore, in the absence of central control, nodes attempting to calculate their best strategy based on a mixed strategy paradigm would possibly run into similar computational issues. Thus, we turn our attention to the computation and hardness of pure Nashequilibria. Corollary     2givesusa powerful tool with which to reason about pure Nashequilibria. We now show that computing the best or worst pure Nashequilibriais hard, but that finding some intermediate Nashequilibrium is easy. A consequence of this algorithm is that the existence of a pure Nash equilibrium is always guaranteed. (The existence of a mixed Nashequilibrium is a consequence of Nash'stheorem.) Theorem 3Both computing the pure Nashequilibrium with lowest cost and computing the pure Nashequilibrium with highest cost are NP-hard problems. Proof: We reduce vertex cover to the decision problem"Does there exist a pure Nashequilib- riumwithcostless thanc?"and we reduce independent dominating set to"Does there exist a pure Nash equilibrium with cost greater thanc?" Fix some graph G= (V,E), and set C/L=1.5/nsothatt=Cn/L=1.5, where tis the component size threshold from Corollary 2. Then from Corollary 2, in any Nashequilibrium the components of the attack graph all have size at most 1, and any secure node is adjacent to some insecure node (as otherwise it coulduninstall its software and be in a component of size at most 1). It follows that ina Nashequilibrium (a) every vulnerable node is either isolated or has all neighbors secure, and (b) every secure node has an insecure neighbor. We now argue that Ghasavertexcover of size kif and only if the inoculation game on Gwiththeabove settings of Cand Lhasa Nashequilibrium with kor fewer secure nodes, or equivalently an equilibrium with social cost Ck+(n k) L/nor less, as each insecure node must be in a component of size 1 and contribute exactly L/nexpectedcost. Given a minimal vertex cover V0 V, observe that installing the software on all nodes in V0 satisfies condition (a) because V0 is a vertex cover, and (b) because V0 is minimal. Conversely, if V0 is the set of secure nodes ina Nashequilibrium, then V0 is a vertex cover by condition (a). This concludes the proof that finding a minimum-cost Nashequilibrium is NP-hard. Fora maximum cost equilibrium, consider the set of insecure vertices. These consist of isolated nodes (which are already in components of size 1) and nodes that do not install the software because all their neighbors do. Given an independent dominating set V0 V, installing the software on all nodes except the nodes in V0 satisfies condition (a) because V0 is independent and (b) because V0 is a dominating set. Conversely, the insecure nodes in any Nashequilibrium are independent by condition (a) and dominating by condition (b). This shows that Ghasan independent dominating set of size kif and only if it hasa Nashequilibrium with no more thank insecure nodes, which occurs only if it hasa Nashequilibrium with at leastn ksecurenodesor, equivalently, a cost of at least C (n k) + (L/n)(k). Theorem 3saysthatfinding extreme pure equilibria is hard. But what if we just want to converge to some equilibrium, but we don'tcarewhichone? Suppose we implement the process 7
of convergence implied by the Nashequilibrium: at each step, some participant, whose current strategy is suboptimal given the others'strategies, switches. This is an easy process to implement because each participant can detect if its strategy is suboptimal using the t=Cn/Lcomponent size threshold from Corollary 2.1 But does this process converge toa Nashequilibrium? If it does, how long does it take? By choosing an appropriate potential function, we can show that this process does indeed converge to a Nash equilibrium quickly: Theorem 4Starting from any pure strategy profile~a, if at each step some participant witha suboptimal strategy switches its strategy, the system converges to a pure Nashequilibrium in no more than 2nsteps. Proof: Lett=Cn/L. For any strategy profile~a, consider the set S big (~a) of "big"components ofG ~a of size greater thantandtheset S small (~a) of "small"components of G ~a of size less than or equal tot. Define a potential function by (~a) = X A 2S big (~a) |A| X A 2S small (~a) |A|. It is easy to see thatn (~a) nforany~a. We will now show that each step of the process reduces by at least one. There are two main cases: 1. Some node i switches from insecure to secure. In this case i was previously an element of a component inS big ofsizem>t. This former component becomes one or more new components with total sizem 1; if all of the resulting components are big, is reduced by exactly one; otherwise, is reduced by more than one as some components move from the positive to the negative side of the ledger. 2. Some node i switches from secure to insecure. In this case the resulting component containing ihasmtelements, and it replaces one or more old components with total sizem 1. As both the new component and the old components are small, the neteect on it by one. If each step reduces by one, the number of steps must be less than thedierence initial and final value of , which is at most n (n) =2n. As a special case, we can start with ~a=1nandconvergeto an equilibrium from above by checking each node once. Each such test requires computing the size of the component in the attack graph, which takes time O (|V|+|E|) =O (n2) using depth-first search; this gives: Corollary 5ANash equilibrium can be computed intime O (n3) . It is not hard to see that the 2 nin Theorem 4 is close to the best possible bound, although a more careful analysis might reduce it slightly. A lower bound of n steps is trivial: in a system with C<L/nandnoplayers secure in the initial strategy profile, it takesnstepsfor all players to 1 We must assume in this implementation either that the choice to install software or not is reversible, or that each player can observe the other players'intended actions and respond accordingly. 8
install the anti-virus software. To get closer to 2 n, consider aline witht= p n 1 2 . Now consider an execution of the process where initially players 1 throughn p n, in increasing order, install to escape the single overlarge component; but then all players not at positionsk p nforsomek uninstall; this takes 2 n 2 p nsteps. We also have: Corollary 6Apure Nashequilibrium always exists. 4.3 Consequences of changes in the inoculation cost Though Theorem 3 suggests that we cannot hope to characterize the worst pure Nashequilibrium exactly, we can giveadescription of how it reacts to changes in the inoculation cost C. Theorem 7Thecost of the worst pure Nashequilibrium is anon-decreasing function of Cwhen Crangesover[2 L/n,L) . Proof: Fix some price of anti-virus software, C 2L/nsothatb Cn/Lc 2. Suppose we increase the price from Cto C0=C+ ( > 0). We denote the worst-cost Nash equilibrium when the price is Cby~aandtheworst-cost equilibrium when the price is C0 by ~ b. If price increment is L/n, then the threshold (in Theorem 1) increases by at most one; that is, bC0n/Lc b Cn/Lc+1. We consider two cases: Case1: ~aisa Nashequilibrium for C0. This case is easy. Because ~ bis a worst-cost Nash equilibrium forC0, we have: cost C (~a) <cost C 0 (~a) cost C 0 ( ~ b) . Case2: ~ais nota Nashequilibrium for C0. This can happen only ifb C0 n/Lc=bCn/Lc+1. Specifically, there must exist anodew 2 I ~a such that adding it into attack graph G ~a yieldsa component of size bCn/Lcbutnotb C0 n/Lc. Let us denote the sizes of components adjacent towinG by k 1 ,...,k s .2 Wethenhave: P s i=1 k i =bCn/Lc1. We define anew strategy ~a0=~a[w/0], which is the same as~aexceptweno longer install anti-virus software on nodew. Moreover, cost C0 ( ~ a0) cost C (~a) = L n bCn/Lc2 C+ L n s X i=1 k2 i ! L n 0 @ bCn/Lc2 s X i=1 k i ! 2 1 A C = L n bCn/Lc2 (bCn/Lc 1)2C. (1) Equation (1) is non-negative whenever 2b Cn/Lc 1 Cn/L, 2 We say that a component K Vis adjacent to nodewif 9 v 2 Ks.t. (v,w) 2 E. 9
which always holds by assumption for all C 2L/n. We repeat this process until there do not exist any nodes violating Nashequilibrium condition. At each step, the cost of our new strategy does not decrease. Therefore, if at the end we get a Nashequilibrium ~ d, then cost C (~a) cost C0 (~a) cost C0 ( ~ d) cost C0 ( ~ b) . Hence, cost C0 ( ~ b) cost C (~a) cost C0 ( ~ d) cost C (~a) cost C 0 ( ~ a0) cost C (~a) 0. Because we chose Carbitrarily, our argument holds for all values of. 4.4 Priceofanarchy The notion of the price of anarchy was introduced by Koutsoupiasand Papadimitriouin[19]. It is defined as the worst-case ratio between the cost ofa Nashequilibrium and the cost of the optimal solution, and is thus a measure of how faraway a Nashequilibrium can be from the social optimum.3 Inourgame, we show that the price of anarchy is quite high, ( n). This is a consequence of two simple lemmas: Lemma8 (Lower bound). Let Gbethestargraph K 1,n . Let the price of the anti-virus software beC=L (n 1) /n. Then (G,C,L) =n/2. Proof: The given Cand Lsatisfyt=Cn/L=n 1. From Corollary 2, it follows that installing anti-virus software on exactly one node isa Nashequilibrium. If pure Nashstrategy~ainstallsanti- virus software on some node that is not the center node, the cost will be C+L (n 1)2/n=L (n1). An optimal strategy for the star with the given Cand Lis~a= (1,0,..., 0) (i.e. , only the center node installs anti-virus software.) Its cost is C+L (n 1) /n=2L (n 1) /n. The price of anarchy is therefore L (n 1) 2L (n 1) /n = n 2 . Lemma9 (Upper bound). Fix any graph Gandcosts C,L. Then (G,C,L) n. 3 Because our game has a random component, the cost is an expected cost. 10
Proof: Let~adenotethe optimum solution. IfC>L, no node ina Nashequilibrium will install anti-virus software. Hence, there is only one Nashequilibrium~a=0n, whose cost is Ln. If the optimum solution contains at least one secure node, then then cost(~a) C>L. (Otherwise, ~a=0nand (G,C,L) =1.) We thus have: (G,C,L) Ln L =n. IfC L, then the expected cost of the worst Nashequilibrium~ais at most Cn, because the expected cost to each node is at most C (if the expected cost to anode is greater than C, then it will want to switch to installing the software with probability 1.) If the optimum solution contains at least one secure node, then cost(~a) C. Otherwise, the optimum solution contains no secure nodes and hence cost(~a) LC. (G,C,L) Cn C =n. 5 Optimization Allowing users to selfishly choose whether or not to install anti-virus software may be grossly inefficient,         relative to the social optimum. An alternative approach to this problem is fora benevolent dictator to attempt to maximize social welfare by centrally computinga solution and imposing it on all nodes. Diculties with this approach arise from the hardness of computing the optimum solution to the inoculation problem. In the first two sections, we givea characterization of the optimum solution and use it to show that the inoculation problem is NP-hard. This suggests computing an approximate solution. We can find in polynomial time a solution with approximation ratio at most O (log2 n); such a solution is substantially better than the ( n) ratio derived from the worst Nashequilibrium. 5.1 Characterization We have a graph-theoretic characterization of optimum strategies, similar to our characterization of Nashequilibria in Theorem 1: Theorem 10Fix C,Land lett=Cn/L. If~ais an optimum strategy, then every component in attack graph G ~a has size at most max(1, (t+1) /2) . Proof: Strategy ~apartitions Gintodisjoint components. Pick some component from attack graph with at least two nodes; we call it K V (k=|K|). (If we can'tfindacomponent with at least two nodes, then all components in attack graph must have size one, and the theorem follows.) If we install an anti-virus on some node of K, we may getmnew components in G ~a , where 0 m k 1. We denote component sizes byk 1 ,...,k m , where P m i=1 k i =k1. Because~ais an 11
optimal strategy, installing anti-virus on an extra node cannot improve the total cost. Therefore, we have: C+ L n m X i=1 k2 i ! Lk2 n , k2 m X i=1 k2 i ! t. (2) Ifm=0, then Equation (2) becomes: k p t (t+1) /2. Meanwhile, form>0, k2 m X i=1 k2 i ! k2 m X i=1 k i ! 2 =k2 (k1)2 =2 k1. (3) Combining Equations (2) and (3), we get: k (t+1) /2. 5.2 Hardnessof optimal solution Unfortunately, the optimal solution is hard to compute. Theorem 11Itis NP-hard to compute an optimal strategy. Proof: The proof is by reduction from vertex cover and is similar to the proof of Theorem 3. 5.3 Reduction to sum-of-squares partition Because it is unlikely that we can find an optimal solution, we naturally consider approximation algorithms. The optimization problem that defines the inoculation problem can be posed as follows: choose the set of secure nodes I ~a that minimizes the following objective function: C|I ~a |+ L n X V2 (I ~a ) |V|2, where (I ~a ) is the set of connected components created by the removal of nodes in I ~a . 12
For the purposes of our approximation algorithm for the inoculation problem, we assume that we can"guess"m=|I ~a |, the number of secure nodes in an optimum configuration. This assumption is without loss of generality, because we can run our algorithm on all possible choices ofm=1,...n and take the best solution. Thus, a solution to the inoculation problem is reduced to finding a solution to the problem of removing mnodes from a given graph to minimize the sum of the squares of the sizes of the surviving components. We discuss this problem in Section 6. 6 Sum-of-squares partitions In Section5.3, we came across the following problem, which we now analyze in more detail. Problem 12 (Sum-of-Squares Partition Problem) By removing a set Fofatmostmnodes, par- titionthegraph into disconnected components H 1 ,...,H k , such that P i |H i |2 isminimum. Though we have arrived at this combinatorial optimization problem via our study of containing computer viruses, it maybe of independent interest. Note that it inherits NP-hardness from the inoculation problem. The edge version of the sum-of-squares-partition problem is similar, but asks for the removal of m edges, rather than nodes, to disconnect the graph. Our goal in this section is to prove the following theorem: Theorem 13Let OPT be the optimum objective function value for the Sum-of-Squares Partition Problem on Gwith the removal of at mostmnodes. We can find a set Fof O (log2 n) mnodes, such that their removal creates disconnected components H 1 ,...,H l such that P i |H i |2O (1) ·OPT. An immediate consequence of this theorem is the existence of an approximation algorithm for the inoculation problem: Corollary 14If OPT is the cost of the optimum solution for the inoculation problem, there exists a polynomial-time approximation algorithm that finds a solution with cost at most O (log2 n) ·OPT. Proof: Suppose an optimum solution contains msecure nodes, and the sizes of the insecure node components arek 1 ,...,k p , sothatOPT=Cm+L/n P i k2 i . Using our approximation algorithm for the sum-of-squares partition problem, we can find a set of O (log2 n) msecurenodessuch that the sum of the squares of the corresponding insecure components is at most O (1) P i k2 i . Thus, the cost of the approximate solution is: O (log2n) ·Cm+O (1) · L n X i k2 i O (log2n) ·Cm+O (log2 n) · L n X i k2 i =O (log2n) ·OPT. 13
6.1 ProofofTheorem 13 Our proof of Theorem 13 is based on the algorithm PartitionGraph given in Figure 2. It uses known algorithms for sparse cuts. The sparse cut literature has focused on edge cuts. However, in the case of our problem, cuts that involve removing nodes in order to disconnect the graph are more applicable. Fortunately, the Leighton-Raoapproximation algorithm for finding sparse cuts extends to node cuts: there is a well known process for converting anode cut algorithm in an undirected graph to an edge cut algorithm in a directed graph. Since the Leighton-Raosparsecut algorithm extends to edge cuts for directed graphs, it can be extended to node cuts. In particular, we will use the following fact, which is implicit in the discussion of balanced node cuts in[21]. Theorem 15There exists an O (logn) -approximation algorithm for the following problem: Given graphG, find anode cut that partitions the nodes Gintothreesets: two sets defining disconnected subgraphs with node sets, V 1 andV 2 , and a set of removed nodes R, such that the quantity (|V 1 |+ |R| 2 )(|V 2 |+ |R| 2 ) |R| (4) is maximized. We refer to the quantity in expression (4) as the sparsity of the cut. In the literature, sparsity is usually given as the the inverse of expression (4), and finding the sparsest cutis cast as a minimization problem. We have presented it as a maximization problem, since this is more natural for our application. Improvements to the approximation ratio for sparsest (edge) cut have recently been discovered [2]. However, the authors do not mention whether or not their results extend to sparse cuts for directed graphs. Our algorithm for solving the sum-of-squares partition problem, PartitionGraph (see Fig- ure2), achieves the approximation results claimed in Theorem 13. The general approach of the algorithm is similar to the greedy logn-approximation algorithm for set cover. A high-leveldescrip- tionisthatwe repeatedly remove the node cut that gives us the best per-node-benefit, quantified as its cost-eectiveness. Suppose we have a connected subgraph Hwithknodes. If node cut Rcreates connected components with node sets V 1 andV 2 , this cut has decreased the objective function value ( P size of connected component2) byk2|V 1 |2|V 2 |2. We thus define the cost-eectivenessof node cutRby (k2|V 1 |2|V 2 |2) /|R|. The cost-eectiveness of Risequalto k2|V 1 |2|V 2 |2 |R| = (|V 1 |+|V 2 |+|R|)2|V 1 |2|V 2 |2 |R| = |R|2+2|V 1 ||V 2 |+2|R| (|V 1 |+|V 2 |) |R| = 2|V 1 ||V 2 | |R| +|R|+2(k|R|) = 2|V 1 ||V 2 | |R| +2k|R|. We then have the following relation between finding sparse cuts and cost-eectiveness . 14
Lemma 16LetHbeagraph withknodes. If is the maximum cost-eectiveness of all node cutsofH, the Leighton-Raosparsestnode cut algorithm will find a cut with cost-eectiveness at least/ (clogk) , for some constantc. Proof: The sparsity of anode cut that removes node set Rand partitions the remaining nodes of Hintoconnected components with node sets V 1 andV 2 is given by: (|V 1 |+ |R| 2 )(|V 2 |+ |R| 2 ) |R| = |V 1 ||V 2 |+ |R|2 4 + |R| 2 (|V 1 |+|V 2 |) |R| = |V 1 ||V 2 | |R| + |R| 4 + 1 2 (k|R|) = |V 1 ||V 2 | |R| + k 2 |R| 4 . We then have the following relationships between the cost-eectiveness of a cut, , and its sparsity, . = |V 1 ||V 2 | |R| + k 2 |R| 4 =/2k/2+|R|/4/4. and > 2. Thus, we know there exists anode cut with sparsity at least/4 (i.e. the cut with the highest cost-eectiveness ). The sparsest cut algorithm on Hwillfindanode cut with sparsity at least / (clogk), for some constantc. This node cut will have cost-eectiveness at least 2 / (clogk). Input: AGraphG. A set of marked nodes F. 1. Use the Leighton-Raoalgorithmto find an approximate most cost-eective cut in eachcon- nectedcomponent of G. 2. LetH 1 ,...,H k be the components of Gin which the Leighton-Raoalgorithm found a cut that removes at most (20 clogn) mnodes, where cis the constant from Lemma 16. If no such component exists, then halt and output Fas the final set of marked nodes. 3. Otherwise, choose the component H j from among those considered in Step 2, for which the cost-eectiveness is highest. Let the cut be H j =V 1 [V 2 [R, where Risthefinal set of marked nodes. 4. SetF=F[R. Replace H j byV 1 andV 2 inG. If|F|> (36 clog2 n) m, then halt and output Fas the set of marked nodes. 5. Otherwise, repeat. Figure2: Algorithm PartitionGraph We now give some lemmas that characterize the behavior of the PartitionGraph algorithm. Fix an optimum solution for the sum-of-squares partition problem; let Fbeanoptimumset ofm removed nodes. Lemma 17PartitionGraph outputs at most O (log2 n) mmarkednodes. 15
Proof: Since the algorithm halts as soon as we augment the set of marked nodes such that|F|> (36 clog2n) m, we know that at the beginning of each iteration, Fcontainsatmost (36 clog2 n) m marked nodes. The lemma follows, because we add at most (20 clogn) mmarkednodesper iteration. In the next few proofs, we will abuse notation slightly and denote the order of graph G (i.e. the number of nodes) by|G|=|V (G) |. We will also denote an "intersection"ofagraph Gandanode setVasGV, which is the set of nodes that Gand Vshare. Lemma 18Suppose after a number of iterations, our graph Gconsistsofk connected components H 1 ,...,H k , and letS= P |H i |2. Either there existsacomponent H i such that the Leighton-Raoalgorithmwill find anode cut inH i with at most 20 cmlognremoved nodes and cost-eectiveness at least S/ (18 cmlogn) , orS 72OPT (or possibly both). Proof: We assume that S>2OPT. Note that the node cut defined by the set FGdivides G into a graph with objective function value at most OPT. The node cut thus induces a cost decrease ofatleastS/2. DefineF i =FH i andm i =|F i |. Also, let the subgraph induced by removing vertices in F i H i fromH i be composed of connected components Hj i forj=1,...,r i . (i.e, the optimum set of marked nodes partitions H i into these components.) Note that P i P j |Hj i |2 OPT. Since the total reduction in our objective function value from removing[ i F i fromGisatleast S/2 duetoourassumption that S>2OPT, we have: X i 0 @ |H i |2 X j |Hj i |2 1 A S/2, (5) because the outer summand on the left hand side of the inequality is the amount the objective function is reduced in each component. Let Ibethesetof indices i for which |H i |2 P r i j=1 |Hj i |2/m i S/ (4m) (i.e. the per node benefit is at leastS/ (4 m).) We have two cases. 1. There exists ani 2 I such that for all j=1,...,r i , |Hj i |1/3|H i |. We assume that m i <1/50|H i |, because otherwise removing all nodes in H i will give usa trivial node cut with cost-eectiveness at least |H i |2/ (50m i ) >S/ (18 cmlogn) for suciently large n. With this assumption, we know that there exists a set R F i that defines anode cut of Hj i that creates two connected components, V 1 andV 2 suchthat1/3|H i | | V 1 |and1/3|H i | | V 2 |. The cost-eectiveness of this cut will be 2 |V 1 ||V 2 | |R| +2|H i ||R| 2|H i |2 9m i 2|H i |2 P r i j=1 |Hj i |2 9m i S/ (18m) . Lemma 16guarantees that the Leighton-Raoalgorithmwill find a cutin H i with cost- eectiveness at least S/ (18 cmlogn). The node cut output by the algorithm cannot contain     more than 20 cmlognnodes. Such anode cut would have cost-eectiveness at most S/ (20 cmlogn), since any cutin Gcandecreasethe objective function value by at most S, which is less than the guaranteed cost-eectiveness of S/ (18 cmlogn). 16
2. For all i 2I, there exists a jsuchthat|Hj i |>1/3|H i |. Also, note that OPT> P i2 I |Hj i |2. Claim: P i2 I |H i |2 P j |Hj i |2 S/8. Proof of claim: Let I be the set of intervals such that|H i |2 P r i j=1 |Hj i |2/m i S/ (4m). Recalling equation (5), we get S/2 X i2 I 0 @ |H i |2 X j |Hj i |2 1 A + X i2 I 0 @ |H i |2 X j |Hj i |2 1 A . Also, we have X i2 I 0 @ |H i |2 X j |Hj i |2 1 A = X i2 I m i |H i |2 P j |Hj i |2 m i X i2 I m i S/ (4m) S/ (4m) X i2 I m i S/4. Combining these two inequalities proves the claim. We have the inequalities: OPT> X i2 I |Hj i |2 X i2 I 1 9 |H i |2 1 9 S 8 , where we used our claim for the last inequality. Thus, OPT 72 S. We now give the proof of Theorem 13. Leta j be the number of connected components that comprise the graph at the beginning of the jthiteration, and let those connected components be Hj 1 ,...,Hj a j . LetS j = P a j i=1 |Hj i |2 be the value of the objective function at the beginning of the j'thiteration; thus S 0 n2 isitsinitial value. Let lbethenumber of iterations the algorithm needs to terminate, and S l+1 be the objective function'sfinalvalue. We wish to show that after the algorithm terminates, we have reduced the objective function value to S l+1 =O (1) ·OPT. Let Fbethefinalsetof marked nodes removed from G. If the algorithm terminates at Step 2 of the l'thiteration because the Leighton-Raoalgorithmonly found node cuts with more than (20 clogn) mremovednodes, then from Lemma 18 we know that S l+1 72OPT. Thus, we assume this does not occur. Furthermore, we assume that S l+1 72OPT (in order to apply the"either"part of Lemma 18 to alliterations.) In order to reason about the decrease in the objective function value after each iteration, we impute to each node in Fa per-node-decrease in the objective function value, given by the cost- eectiveness of its node cut. We then show that the total imputed decrease in the objective function is at least n2 O (1) ·OPT, from which the theorem will follow. More formally, suppose the set of marked nodes is given by the sequence F={f 1 ,...,f k }, where the removed nodes are ordered in the order in which they were removed from the graph: nodes removed at an earlier iteration occur earlier in the sequence. We havek=|F|= (log 2 n) m. 17
Letb j be the iteration in whichf j was removed. We impute tof j the value j =cost-eectiveness of cut removed in iterationb j . From Lemma18, we know that j S b j / (18 cmlogn). SetT 0 =S 0 andT i =T i1i to be the value of the objective function after nodef i 's per-node-decrease contribution has been made. Note T k =S l+1 . Claim: For alli, T i T i1 T i1 / (18 cmlogn) Proof of claim: Proving the claim reduces to proving that i =T i1 TiT i1 / (18 cmlogn). Fix ani. We have two cases. 1. b i =b i1 (i.e. f i andf i1 were removed in the same iteration.) Then i S b i / (18 cmlogn), butS b i >T i , sinceS b i is the objective function value at the beginning of iterationb i , whereas T i is the objective function value "during"iterationb i . 2. b i =b i1 +1 (i.e. f i was removed in the iteration afterf i was removed.) Then i S b i / (18 cmlogn) =T i / (18 cmlogn), since in this case T i is the objective function value at the start of iteration b i . This proves the claim. We therefore have T k T 0 (11/ (18 cmlogn))kn2 (11/ (18 cmlogn))k. Sincek>36 cmlog2 n, it follows thatS l+1 =T k =O (1) O (1) ·OPT. The algorithm given above can be adapted ina straightforward way to yield an algorithm for the edge cut version of the sum-of-squares partition problem (instead of taking sparse node cuts, take sparse edge cuts), from which an analog to Theorem 13 maybe derived. The analysis of an edge cut algorithm would perhaps be slightly easier than our analysis for node cuts, since node cuts modify the node set, causing minor complications. Furthermore, such an algorithm can use sparse cut algorithms with better approximation ratios, and thus remove only O (log 3/2 n) mnodes to achieve the same constant approximation ratio for the sum of squares. 7 Conclusions and future research We have described a simple economic game that represents thedicult problem of choosing on which nodes to install anti-virus software to contain the spread of computer viruses in a network. The Nashequilibria of this game have a simple characterization, and we can show that in the worst case, the ratio between the social cost ofa Nashequilibrium and asocial optimum can be linear in the number of nodes. Our model makes some very strong simplifying assumptions: every infected node eventually infects all unprotected neighbors; the costs of installing the anti-virus software and becoming in- fectedareknown and equal for all nodes; the virus imposes no costs on protected nodes; and nodes can observe which of the other nodes intend to install the anti-virus software and adjust their own strategies in response. None of these assumptions correspond completely to reality, but we believe that as a first step the resulting model is a reasonable compromise between accuracy and analyzability, and that the results obtained with the model (especially the characterization of Nashequilibria) are similar to what one might expect with a more complex model that took into account limited information and learning by individual nodes. The natural next step is to incor- poratemore details in the model and see if such changes aect the results; this might involve both theoretical work to predict theeect of changes and experimental or observational work to study how real-world decision-makers choose whether or not to deploy specific security mechanisms. 18
We have also shown how anear-optimal deployment of anti-virus software can be computed by reduction to the sum-of-squares partition problem, anew variant of classical graph partitioning problems where the goal is to removemvertices so as to minimize the sum of the squares of the sizes of surviving components. Though it is NP-hard to solve this problem exactly, we give a polynomial-time O (log2 n)-approximation algorithm for sum-of-squares partition, which yields a corresponding approximation algorithm for anti-virus software deployment. This algorithm maybe of use as a network administration tool for choosing how to deploy anti-virus software to minimize the combined costs of deployment and infection and as a public-health tool for designing inoculation strategies for containing outbreaks of highly-infectious diseases when a good approximation to the interaction graph can be computed but the initial source of contagion is unknown. Whether or not a polynomial-time algorithm with abetter approximation ratio exists remains open. 8 Acknowledgments The authors would like to thank Joan Feigenbaum, Hong Jiang, and Yang Richard Yang for many useful discussions. References [1]R. Anderson. Why information security is hard-an economic perspective, 2001. Available at http://www.cl.cam.ac.uk/~rja14/econsec.html. [2]S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning.         In Proceedings of the 36thAnnualACM Symposium on the Theory of Computing, pages 222-231,2004. [3]J. Aspnes, J. Feigenbaum, M. Mitzenmacher, and D. Parkes. Towards better definitions and measures of Internet security. In Workshop on Large-Scale Network Security and Deployment Obstacles, 2003. [4]N. T. Bailey. The Mathematical theory of infectious diseases and its applications. Hafner Press, 1975. [5]M. G. H. Bell. The measurement of reliability in stochastic transport networks. In Proceedings of 2001Intelligent Transportation Systems, pages 1183-1188,2001. [6]K. Campbell, L. A. Gordon, M. P. Loeb, and L. Zhou. The economic cost of publicly announced information security breaches: Empirical evidence from the stock market. Journal of Computer Security, 2003. [7]G. Even, J. Naor, S. Rao, and B. Schieber. Fast approximate graph partitioning algorithms. SIAMJournalon Computing, 28:2187-2214,1999. [8]U. Feigeand R. Krauthgamer. Apolylogarithmic approximation of the minimum bisection. SIAMJournalon Computing, 31:1090-1118,2002. [9]J. C. Frauenthal. Mathematical modeling in epidemiology. Springer-Verlag, New York, 1980. 19
[10]L. A. Gordonand M. P. Loeb. The economics of information security investment. ACM transactions on information and system security, pages 438-457,2002. [11]S. N. Hamilton, W. L. Miller, A. Ott, and O. S. Saydjari. Challenges in applying game theory to the domain of information warfare. In 4th Information survivability workshop (ISW- 2001/2002) , Vancouver, Canada, 2002. [12]S. N. Hamilton, W. L. Miller, A. Ott, and O. S. Saydjari. The role of game theory in information warfare. In 4th Information survivability workshop (ISW-2001/2002) , Vancouver, Canada, 2002. [13]K. Hoo. How much is enough? A risk-management approach to computer security. Consortium for Research on Information Security Policy (CRISP) Working Paper., 2000. [14]M. Kearnsand L. Ortiz. Algorithms for interdependent security games. InS. Thrun, L. Saul, andB. Sch¨olkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. [15]J. O. Kephart, D. M. Chess, and S. R. White. Computers and epidemiology. In IEEE Spectrum, pages 20-26,1993. [16]J. O. Kephartand S. R. White. Directed-graph epidemiological models of computer viruses. In IEEESymposium on Security and Privacy, pages 343-361,1991. [17]J. O. Kephartand S. R. White. Measuring and modeling computer virus prevalence. In Proceedings of the IEEE Symposium on Security and Privacy, 1993. [18]H. Kesten. Percolation Theory for Mathematicians, volume 2. Birkh¨auser, Boston, 1982. [19]E. Koutsoupiasand C. Papadimitriou. Worst-case equilibria. In Proceedings of the 16thannual symposium on theoretical aspects of computer science, pages 403-413,1999. [20]H. Kunreutherand G. Heal. Interdependent security. Journal of Risk and Uncertainty (Special Issue on Terrorist Risks) , 2003. [21]T. Leightonand S. Rao. Multicommodity max-flowmin-cut theorems and their use in designing approximation algorithms. Journal of the Association for Computing Machinery, 46(6):787- 832,1999. [22]R. Pastor-Satorrasand A. Vespignani. Epidemics and immunization scale-free networks. In S. Bornholdtand H. Schuster, editors, Handbook of graphs and networks: from the genome to theInternet, pages 113-132, Berlin, 2002. Wiley-VCH. [23]R. Pastor-Satorrasand A. Vespignani. Immunization of complex networks. In Physical Review Letters, volume 65,2002. [24]P. F. Syverson. A dierent look at secure distributed computation. In IEEE Computer Security Foundations Workshop (CSFW 10) , pages 109-115. IEEE Computer Society Press, June 1997. [25]C. Wang, J. C. Knight, and M. C. Elder. On computer viral infection and the eect of immunization. In ACSAC, pages 246-256,2000. 20
[26]C. Zou, D. Towsley, and W. Gong. Emailvirus propagation modeling and analysis. Technical Report CSE-03-04, University of Massachusetts, Amherst, 2002. 21
Google Search
Google
Popular Articles