Title: Approximating the Top Eigenvector in Random Order Streams

Abstract: When rows of an $n \times d$ matrix $A$ are given in a stream, we study algorithms for approximating the top eigenvector of the matrix $A^T A$ (equivalently, the top right singular vector of $A$). We consider worst case inputs $A$ but assume that the rows are presented to the streaming algorithm in a uniformly random order. We show that when the gap parameter $R = \sigma_1(A)^2/\sigma_2(A)^2 = \Omega(1)$, there is a randomized algorithm that uses $O(h \cdot d \cdot \mathrm{polylog}(d))$ bits of space and outputs a unit vector $v$ that has correlation $1 - O(1/\sqrt{R})$ with the top eigenvector $v_1$. Here $h$ denotes the number of heavy rows in the matrix, defined as the rows with Euclidean norm at least $\|A\|_F/\sqrt{d \cdot \mathrm{polylog}(d)}$. We also provide a lower bound showing that any algorithm using $O(hd/R)$ bits of space can obtain at most $1 - \Omega(1/R^2)$ correlation with the top eigenvector. Thus, parameterizing the space complexity in terms of the number of heavy rows is necessary for high accuracy solutions. Our results improve upon the $R = \Omega(\log n \cdot \log d)$ requirement in a recent work of Price and Xun (FOCS 2024). We note that the algorithm of Price and Xun works for arbitrary order streams whereas our algorithm requires the stronger assumption that the rows are presented in a uniformly random order. We additionally show that the gap requirements in their analysis can be brought down to $R = \Omega(\log^2 d)$ for arbitrary order streams and $R = \Omega(\log d)$ for random order streams. The requirement of $R = \Omega(\log d)$ for random order streams is nearly tight for their analysis, as we obtain a simple instance with $R = \Omega(\log d/\log\log d)$ for which their algorithm, with any fixed learning rate, cannot output a vector approximating the top eigenvector $v_1$.

Section: Introduction
We consider the problem of approximating the top eigenvector in the streaming setting. In this problem, we are given vectors $a_1, \ldots, a_n \in \mathbb{R}^d$ one at a time in a stream. Let $A$ be an $n \times d$ matrix with rows $a_1, \ldots, a_n$. The task is to approximate the top eigenvector of the matrix $A^T A$. Throughout the paper, we use $v_1 \in \mathbb{R}^d$ to denote the top eigenvector of $A^T A$. We focus on obtaining streaming algorithms that use a small amount of space and can output a unit vector $v$ such that $\langle v, v_1 \rangle^2 \ge 1 - f(R)$, where $f(R)$ is a decreasing function of the gap $R = \lambda_1(A^T A)/\lambda_2(A^T A)$. Here $\lambda_1(\cdot), \lambda_2(\cdot)$ denote the two largest eigenvalues. As the gap $R$ becomes larger, the eigenvector approximation problem becomes easier and we want more accurate approximations to the eigenvector $v_1$.
If one is allowed to use $\tilde{O}(d^2)$ bits of space, we can maintain the matrix $A^T A = \sum_i a_i a_i^T$ as we see the rows $a_i$ in the stream, and at the end of processing the stream, we can compute the exact top eigenvector $v_1$. When the dimension $d$ is large, the requirement of $\Omega(d^2)$ bits of memory can be impractical (see, e.g., the applications that require a large value of $d$ in Mitliagkas et al. (2013)). Hence, an interesting question is to study non-trivial streaming algorithms that use less memory. In this work, we focus on obtaining algorithms that use $\tilde{O}(d)$ bits of space.
In the offline setting (where the entire matrix A is available to us), fast iterative algorithms such as Gu (2015); Musco and Musco (2015); Musco et al. (2018) can be used to quickly obtain accurate approximations to the top eigenvector when the gap R = Ω(1). In a single pass streaming setting, we cannot run these algorithms as these iterative algorithms need to see the entire matrix multiple times.
There have been two major lines of work studying the problem of eigenvector approximation and the related Principal Component Analysis (PCA) problem in the streaming setting with near-linear in d memory. In the first line of work, each row encountered in the stream is assumed to be sampled independently from an unknown distribution with mean 0 and covariance Σ and the task is to approximate the top eigenvector of Σ using the samples. In this line of work, the sample complexity required for algorithms using O(d • polylog(d)) bits of space to output an approximation to v 1 , is the main question. The algorithms are usually a variant of Oja's algorithm (Oja, 1982;Jain et al., 2016; Allen-Zhu and Li, 2017; Huang et al., 2021; Kumar and Sarkar, 2023) or the block power method (Hardt and Price, 2014;Balcan et al., 2016). We note that Kumar and Sarkar (2023) relax the i.i.d. assumption and analyze the sample complexity of Oja's algorithm for estimating the top eigenvector in the Markovian data setting.
The other line of work studies algorithms for arbitrary streams appearing in an arbitrary order. In this setting, we want algorithms to work for any input stream given in any order. A problem closely related to the eigenvector estimation problem is the Frobenius-norm Low Rank Approximation (Clarkson and Woodruff, 2017; Boutsidis et al., 2016;Upadhyay, 2016;Ghashami et al., 2016). The deterministic Frequent Directions sketch of Ghashami et al. (2016) can, using Õ(d/ε) bits of space, output a unit vector u such that
$\|A(I - uu^T)\|_F^2 \le (1 + \varepsilon)\,\|A(I - v_1 v_1^T)\|_F^2.$
Although the vector u is a 1 + ε approximate solution to the Frobenius norm Low Rank Approximation problem, it is possible that the vector u may be (nearly) orthogonal to the top eigenvector v 1 . Hence the Frequent Directions sketch does not guarantee top eigenvector approximation. Recently, Price and Xun (2024) study the eigenvector approximation problem in arbitrary streams and obtain results in terms of the gap R of the instance. Price and Xun prove that when R = Ω(log n • log d), a variant of Oja's algorithm outputs a unit vector v such that
$\langle v, v_1 \rangle^2 \ge 1 - \frac{C \log d}{R} - \frac{1}{\mathrm{poly}(d)}$
where $C$ is a large enough universal constant. On the lower bound side, Price and Xun showed that any algorithm that outputs a vector $v$ satisfying $\langle v, v_1 \rangle^2 \ge 1 - \frac{1}{CR^2}$ must use $\Omega(d^2/R^3)$ bits of space while processing the stream. This lower bound shows that in the important case of $R = O(1)$, the correlation that can be obtained by an algorithm using $\tilde{O}(d)$ bits of space is at most a constant less than 1. Thus, the current best algorithms for arbitrary streams work only when $R = \Omega(\log n \cdot \log d)$, and for the important case of $R = O(1)$, there are no existing algorithms requiring significantly fewer than $d^2$ bits of memory. They also give a lower bound on the size of mergeable summaries for approximating the top eigenvector.
We identify an instance with R = Θ(log d/ log log d) where the algorithm of Price and Xun fails to produce a vector with even a constant correlation with the vector v 1 . This shows that their algorithm or other variants of Oja's algorithm may fail to extend to the case when R = O(1). We further show that the algorithm of Price and Xun fails to produce such a vector even when the rows in our hard instance are ordered uniformly at random, showing that even randomly ordered streams can be hard to solve for variants of Oja's algorithm.
In this work, we focus on algorithms that work on worst case inputs $A$ while assuming that the rows of $A$ are uniformly randomly ordered. This model is mid-way between the i.i.d. setting and the arbitrary order stream setting in terms of the generality of streams that can be modeled. We note that a number of works (Munro and Paterson, 1980; Guha et al., 2005; Chakrabarti et al., 2008; Guha and McGregor, 2009; Assadi and Sundaresan, 2023) have previously considered streaming algorithms and lower bounds for worst case inputs with random order streams, as it is a natural model often arising in practical settings. See Gupta and Singla (2021) for a gentle introduction to the random-order model.
Our algorithms are parameterized in terms of the number of heavy rows in the stream. We define a row $a_i$ to be heavy if $\|a_i\|_2 \ge \|A\|_F/\sqrt{d \cdot \mathrm{polylog}(d)}$. Note that in any stream of rows, by definition, there are at most $d \cdot \mathrm{polylog}(d)$ heavy rows. We state our theorem informally below:
Theorem 1.1. Let $a_1, \ldots, a_n \in \mathbb{R}^d$ be a randomly ordered stream and let $A$ denote the $n \times d$ matrix with rows given by $a_1, \ldots, a_n$. If $R = \lambda_1(A^T A)/\lambda_2(A^T A) > C$ for a large enough constant $C$ and the number of heavy rows in the stream is at most $h$, then there is a streaming algorithm using $O(h \cdot d \cdot \mathrm{polylog}(d))$ bits of space and outputting a unit vector $v$ satisfying
$\langle v, v_1 \rangle^2 \ge 1 - O(1/\sqrt{R})$
with probability $\ge 4/5$.
Our algorithm is a variant of the block power method. Along the way, we also improve the gap requirements in the results of Price and Xun (2024). We show that by subsampling a stream of rows, the algorithm of Price and Xun can be made to work even when the gap $R$ is $\Omega(\log^2 d)$ in arbitrary order streams, improving on the $\Omega(\log n \cdot \log d)$ requirement in their analysis. We also show that in random order streams, a gap of $\Omega(\log d)$ is sufficient for their algorithm, though our algorithm improves on this and works even for a constant gap.
Similar to the lower bound of Price and Xun, we show that any algorithm for random order streams must use $\Omega(h \cdot d/R)$ bits of space to output a vector $v$ satisfying $\langle v, v_1 \rangle^2 \ge 1 - 1/(CR^2)$ where $C$ is a constant. We summarize the theorem below.
Theorem 1.2. Consider an arbitrary random order stream $a_1, \ldots, a_n$ with the gap parameter
$\sigma_1(A)^2/\sigma_2(A)^2 = R.$
Let $h$ be the number of heavy rows in the stream. Any streaming algorithm that outputs a unit vector $v$ such that
$\langle v, v_1 \rangle^2 \ge 1 - 1/(CR^2)$
for a large enough constant $C$, with probability $\ge 1 - (1/2)^{R+1}$ over the ordering of the stream and its internal randomness, must use $\Omega(h \cdot d/R)$ bits of space.
Techniques. The randomized power method (Gu, 2015) for approximating the top eigenvector samples a random Gaussian vector $g$ and iteratively computes the vector $v = (A^T A)^t g$ for $t = \Theta(\log d)$ iterations. When the gap $R$ is large, $v/\|v\|_2$ is a good approximation for $v_1$. The algorithm needs to see the quadratic form $A^T A$ multiple times and hence cannot be implemented in the single-pass streaming setting of this paper.
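For intuition, the offline power method described above can be sketched in NumPy. This is a minimal illustration and not the authors' code: the function name `power_method`, the seed, the constant 4 in the iteration count, and the diagonal test matrix are all assumptions made for the example.

```python
import numpy as np

def power_method(A, t, rng):
    """Offline power method: unit vector approximating the top eigenvector of A^T A."""
    d = A.shape[1]
    v = rng.standard_normal(d)          # random Gaussian starting vector g
    for _ in range(t):
        v = A.T @ (A @ v)               # one multiplication by A^T A (a full pass over A)
        v /= np.linalg.norm(v)          # renormalize; the direction is unchanged
    return v

rng = np.random.default_rng(0)
d = 50
# Hypothetical test matrix with singular values 2, 1, ..., 1, so R = 4 and v_1 = e_1.
A = np.diag(np.concatenate(([2.0], np.ones(d - 1))))
v = power_method(A, t=int(np.ceil(4 * np.log(d))), rng=rng)
corr = v[0] ** 2                        # squared correlation with v_1 = e_1
```

Each loop iteration touches all of `A`, which is exactly why this procedure needs multiple passes and cannot be run as-is on a single-pass stream.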
Assume that the stream is randomly ordered and that there are no heavy rows. Our key observation is that if the stream is long enough, then we can see $t$ approximations $B_j^T B_j$ of the quadratic form $A^T A$. Here the matrices $B_1, \ldots, B_t$ are formed by sampling and rescaling the rows of the matrix $A$ and, importantly, the rows of $B_1, \ldots, B_t$ do not overlap in the stream, that is, they appear one after the other. Thus we can compute
$v' = (B_t^T B_t) \cdots (B_1^T B_1) \cdot g$
for the starting vector $g$ in a single pass over the stream. We prove that such matrices $B_j$ exist using the row norm sampling result of Magdon-Ismail (2010). Now, the main issue is to show that $v'/\|v'\|_2$ is a good approximation to the top eigenvector $v_1$. We crucially use a singular value inequality of Wang and Xi (1997) to prove that $\|B_j^T B_j - A^T A\|_2 \le \varepsilon \|A\|_2^2$ for all $j$ suffices for $v'/\|v'\|_2$ to be a good approximation to $v_1$.
The above analysis assumes that there are no heavy rows. Indeed, suppose that a matrix $A$ has a row $a$ with a large Euclidean norm which is orthogonal to all the other rows. Also assume that the top eigenvector of $A^T A$ is in this direction. Since the matrices $B_1, \ldots, B_t$ are non-overlapping substreams of the matrix $A$, at most one of the matrices $B_j$ can contain the row $a$, and hence the vector $v'/\|v'\|_2$ will not be a good approximation to $a/\|a\|_2$, the top eigenvector. Thus, we need to handle the heavy rows separately. We show that, by storing all the rows with a Euclidean norm larger than $\|A\|_F/\sqrt{d \cdot \mathrm{polylog}(d)}$ and running the above described algorithm on the remaining set of rows, we can obtain a good approximation to the top eigenvector.
Our lower bound (Theorem 1.2) shows that any single-pass streaming algorithm must use space proportional to the number of heavy rows, and therefore our procedure that handles the heavy rows separately gives near-optimal bounds. Finally, the row norm sampling technique of Magdon-Ismail (2010) serves as a general technique to reduce the number of rows in the stream while (approximately) preserving the top eigenvector. We use this observation to improve the $R = \Omega(\log n \cdot \log d)$ requirement for arbitrary streams in Price and Xun (2024) to $R = \Omega(\log^2 d)$. We then show that, assuming a uniformly random order, the analysis of Price and Xun (2024) can be improved to show that $R = \Omega(\log d)$ suffices. Thus, for random order streams, techniques before our work can be used to approximate the top eigenvector when the gap $R = \Omega(\log d)$. Our work improves upon this to give an algorithm that works for streams with $R = \Omega(1)$.
Implications to practice. Often, in practical situations, we can assume that the rows being streamed are sampled independently from a nice-enough distribution, in which case Oja's algorithm, as discussed, can approximate the top eigenvector accurately given enough samples. However, independence and assumptions on the covariance matrix can be very strong assumptions in some cases and in such cases, our algorithm only requires that the order of the rows in the stream be uniformly random, in which case we output an approximation with provable guarantees.
Organization. We first introduce the row-norm sampling procedure to obtain approximate quadratic forms. The proof is a slight modification of that of Magdon-Ismail (2010). The only difference is that we consider a version that samples each row in the input independently with an appropriate probability and keeps the sampled rows after scaling them appropriately. We then introduce and analyze our block power iteration algorithm when all rows have roughly the same Euclidean norm, and then extend it to the general case, which is our main result. Finally, we provide a lower bound showing that $\Omega(hd/R)$ bits of space is necessary to obtain a high correlation with the top eigenvector. Due to space constraints, all of our proofs are placed in the appendix.

Section: Power Method with Approximate Quadratic Forms
In this section, we present and analyze our algorithm for approximating the top eigenvector of A T A when the rows of A are presented to the algorithm in a uniformly random order.
We first show a row sampling technique that reduces the number of rows in the stream. The rownorm sampling technique for approximating the quadratic form A T A with spectral norm guarantees was given by Magdon-Ismail (2010). The technique works irrespective of the order of the rows.

Section: Sampling for Row Reduction
Theorem 2.1. Let $A$ be an arbitrary $n \times d$ matrix. Given $p \in [0, 1]^n$, let $Q$ be an $n \times n$ diagonal matrix such that for each $i \in [n]$, we independently set $Q_{ii} = 1/\sqrt{p_i}$ with probability $p_i$ and $0$ otherwise. If for all $i$,
$p_i \ge \min\left(1,\; C\,\frac{\|a_i\|_2^2}{\varepsilon^2 \|A\|_2^2}\,\log d\right),$
then with probability $1 - 1/\mathrm{poly}(d)$,
$\|A^T A - A^T Q^T Q A\|_2 \le \varepsilon \|A\|_2^2.$
With probability at least $1 - 1/\mathrm{poly}(d)$, the matrix $Q$ has at most $O(\varepsilon^{-2} \rho \log d)$ non-zero entries, where $\rho = \|A\|_F^2/\|A\|_2^2$ denotes the stable rank of the matrix $A$.
Note that given the value of ∥A∥ 2 , the sampling procedure in this theorem can be performed in a stream. Additionally, as the original stream is uniformly randomly ordered, the sub-sampled stream is also uniformly randomly ordered assuming that the sampling is independent of the order of the rows.
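The sampling procedure of Theorem 2.1 can be sketched as follows, assuming $\|A\|_2$ is known. This is an illustrative sketch rather than the paper's implementation: the function name `row_norm_sample`, the constant `C = 10`, and the Gaussian test matrix are assumptions, and the matrix is held in memory only so that the error can be measured.

```python
import numpy as np

def row_norm_sample(A, eps, C, rng):
    """Keep row a_i independently with probability p_i and rescale it by 1/sqrt(p_i)."""
    d = A.shape[1]
    spec_sq = np.linalg.norm(A, 2) ** 2          # ||A||_2^2, assumed known to the sampler
    row_sq = np.sum(A * A, axis=1)               # squared row norms ||a_i||_2^2
    p = np.minimum(1.0, C * row_sq * np.log(d) / (eps ** 2 * spec_sq))
    keep = rng.random(A.shape[0]) < p            # one independent coin per row
    return A[keep] / np.sqrt(p[keep])[:, None]   # rows of Q A that are non-zero

rng = np.random.default_rng(1)
n, d = 20000, 20
A = rng.standard_normal((n, d))
A[:, 0] *= 3.0                                   # plant a dominant direction
B = row_norm_sample(A, eps=0.5, C=10.0, rng=rng)
# Relative spectral error ||A^T A - B^T B||_2 / ||A||_2^2, to compare against eps.
err = np.linalg.norm(A.T @ A - B.T @ B, 2) / np.linalg.norm(A, 2) ** 2
```

Because each row's coin flip depends only on that row and on $\|A\|_2$, the decision can be made the moment the row arrives, which is what makes the procedure streamable.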
Given that all of the non-zero entries of the matrix have absolute value at least $1/\mathrm{poly}(nd)$ and at most $\mathrm{poly}(nd)$, we have that $\|A\|_2^2$ lies in the interval $[1/\mathrm{poly}(nd), \mathrm{poly}(nd)]$. Thus, we can guess the value of $\|A\|_2^2$ as $2^i/\mathrm{poly}(nd)$ for $i = 0, \ldots, O(\log(nd))$. One of these values must be a 2-approximation to $\|A\|_2^2$, and sub-sampling the rows using that guess satisfies the conditions in the above theorem. We can run the streaming algorithms on all the streams simultaneously to obtain $O(\log nd)$ vectors $u_1, \ldots, u_{O(\log nd)}$ as the candidates for being an approximation to the top eigenvector. From Theorem 2.1, the candidate vector $u_j$ computed on the stream obtained by sampling the rows with the correct probabilities is a good approximation to the top eigenvector, and therefore $\|A \cdot u_j\|_2$ is large for that value of $j$. Thus, the vector $u_j$ with the largest value $\|A \cdot u_j\|_2$ is a good approximation to the top eigenvector $v_1$. If $G$ is a Gaussian matrix with $O(\varepsilon^{-2} \log d)$ rows, then for all $u_j$, we can approximate $\|A \cdot u_j\|_2$ up to a $1 \pm \varepsilon$ factor using $\|G \cdot A \cdot u_j\|_2$ by the Johnson-Lindenstrauss lemma. Additionally, the matrix $G \cdot A$ can be maintained in the stream using $O(\varepsilon^{-2} \cdot d \log d)$ bits (when we see a row $a_i$, we sample an independent Gaussian vector $g_i$ and add $g_i a_i^T$ to an accumulator to maintain $G \cdot A$). Thus, at the end of processing the stream, we can compute a vector $u_j$ that has a large value $\|A \cdot u_j\|_2$, and hence is a good approximation for $v_1$.
If we can process each created stream using $s$ bits of space, then the overall space requirement is $O(s \cdot \log(nd) + d \cdot \mathrm{polylog}(d))$ bits, using $O(s)$ bits for each guess for the value of $\|A\|_2^2$ and $O(d \cdot \mathrm{polylog}(d))$ bits for storing a Gaussian sketch of the matrix with $\varepsilon = 1/\mathrm{polylog}(d)$.
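The candidate-selection step can be sketched as below: the sketch $G \cdot A$ is accumulated row by row, and the candidate with the largest sketched norm is kept. The sketch size `k`, the two candidate directions, and the test matrix are hypothetical choices for illustration, and $G$ is scaled by $1/\sqrt{k}$ so that sketched norms are directly comparable to true norms.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 2000, 30, 400            # k = number of rows of the sketch G (illustrative)
A = rng.standard_normal((n, d))
A[:, 0] *= 3.0                     # e_1 is (close to) the top right singular vector

GA = np.zeros((k, d))              # accumulator for G @ A; neither G nor A is stored
for a in A:                        # single pass over the stream of rows
    g = rng.standard_normal(k) / np.sqrt(k)   # fresh Gaussian column g_i for this row
    GA += np.outer(g, a)           # G A = sum_i g_i a_i^T

# Hypothetical candidates u_j, standing in for the outputs of the parallel guesses.
candidates = [np.eye(d)[0], np.eye(d)[1]]
scores = [np.linalg.norm(GA @ u) for u in candidates]   # estimates of ||A u_j||_2
best = int(np.argmax(scores))
```

The accumulator has `k * d` entries regardless of the stream length, which matches the role of the $O(d \cdot \mathrm{polylog}(d))$ term in the space bound.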

Section: Random-Order Streams with bounds on Norms
Algorithm 1: Approximate Eigenvector for Streams with no Large Norms
Input: An n × d matrix A with n = Ω(η · ρ(A) · log^2 d / ε^2) and max_i ∥a_i∥_2^2 / min_i ∥a_i∥_2^2 ≤ η
Output: A vector z
t ← ⌈C_1 log d⌉
Compute G · A in the stream, where G is a Gaussian matrix with O(ε^{-2} log d) rows
for ρ = 1, 2, 4, . . . , d simultaneously do
    p ← C_2 ηρ log d / (nε^2)        // p ≤ 1/(5t) for ρ ≤ 2 · ρ(A)
    z_ρ ∼ N(0, 1)^d
    for j = 1, . . . , t do
        y_j ← Bin(n, p)
        if y_j > 2np then
            return ⊥
        end
        // The matrix A_{j·(2np) : j·(2np)+y_j} corresponds to B_j in the analysis.
        acc ← 0
        for i = (j − 1) · (2np) + 1, . . . , (j − 1) · (2np) + y_j do
            acc ← acc + ⟨a_i, z_ρ⟩ · a_i
        end
        // Here acc = B_j^T B_j z_ρ
        z_ρ ← acc
        z_ρ ← z_ρ / ∥z_ρ∥_2
    end
end
return arg max over z ∈ {z_1, z_2, z_4, . . . , z_d} of ∥(G · A) z∥_2
We now present the analysis of the block power method for random order streams assuming that the Euclidean norms of all the rows in A are close to each other. We later remove this assumption. Suppose there exists a parameter η such that
$\left(\max_i \|a_i\|_2^2\right) / \left(\min_i \|a_i\|_2^2\right) \le \eta.$
If η is close to 1 then all the rows in the stream have roughly the same norm.
Let $p = C \eta \rho \log(d)/(\varepsilon^2 n)$. We can see that for any row $a_i$ in the stream,
$C\,\frac{\|a_i\|_2^2}{\varepsilon^2 \|A\|_2^2}\,\log d \;\le\; C\,\frac{\eta \|A\|_F^2/n}{\varepsilon^2 \|A\|_2^2}\,\log d \;\le\; \frac{C \eta \rho \log d}{n \varepsilon^2} = p.$
Thus, $p$ is at least the probability with which we need to sample each row in the row-norm sampling result in Theorem 2.1. Now if we perform such a sampling of the rows of $A$, we sample $\mathrm{Bin}(n, p)$ rows, a number which is tightly concentrated around $np = \varepsilon^{-2} C \eta \rho \log d$. Thus, if we first sample $y \sim \mathrm{Bin}(n, p)$ and then consider the first $y$ rows in the random order stream, then we will have sampled from a distribution satisfying the requirements in Theorem 2.1 and can therefore obtain a matrix $B$ such that
$\|B^T B - A^T A\|_2 \le \varepsilon \|A\|_2^2.$
Thus, assuming that the rows appear in a uniformly random order lets us show that the first $y$ rows of the stream can be used to compute an approximation to the quadratic form $A^T A$. We will now show that we can obtain $O(\log d)$ such quadratic forms in the stream given that the stream is long enough.
Assume that the number of rows in the stream is $n = \Omega(\eta \rho \log^2 d/\varepsilon^2)$. We partition the stream into $t = \Theta(\log d)$ groups as follows: the first $2np$ rows are placed in group 1, the second $2np$ rows are placed in group 2, and so on. Note that since $n = \Omega(\eta \rho \log^2 d/\varepsilon^2)$, we can form $t$ such groups. Since the rows are uniformly randomly ordered, the joint distribution of the rows appearing in group 1 is the same as the joint distribution of the rows appearing in group 2, and so on. Let $y_1, \ldots, y_t \sim \mathrm{Bin}(n, p)$ be drawn independently. With probability $\ge 1 - 1/\mathrm{poly}(d)$, we have $y_i \le (3/2)np$ for all $i$. For $i = 1, \ldots, t$, let $B_i$ be the matrix formed by the first $y_i$ rows in group $i$. Using a union bound, we have that with probability $\ge 1 - 1/\mathrm{poly}(d)$, for all $i = 1, \ldots, t$,
$\left\|A^T A - \frac{1}{p} B_i^T B_i\right\|_2 \le \varepsilon \|A\|_2^2.$
Conditioned on the above event, we will now show that running the power method on the blocks B 1 , . . . , B t lets us approximate the top singular vector of the matrix A.
Assumption 2.2. We assume that σ 1 (A)/σ 2 (A) ≥ 2.
Lemma 2.3. Let $\varepsilon > 1/\mathrm{poly}(d)$ be an accuracy parameter and $t = \Omega(\log d)$ be the number of iterations. Let $\varepsilon \le c/t^2$ for a small constant $c$. Suppose $B_1, \ldots, B_t$ all satisfy $\|A^T A - B_j^T B_j\|_2 \le \varepsilon \|A\|_2^2$ for $\varepsilon < 1/5$. If $g$ is a random vector sampled from the Gaussian distribution, then the unit vector
$v := \frac{(B_t^T B_t) \cdots (B_1^T B_1)\, g}{\|(B_t^T B_t) \cdots (B_1^T B_1)\, g\|_2}$
satisfies
$\langle v, v_1 \rangle^2 \ge \frac{1}{1 + C' t \sqrt{\varepsilon}}$
with probability $\ge 9/10$ for a large enough constant $C'$. Here $v_1$ denotes the top right singular vector of the matrix $A$.
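The product in Lemma 2.3 can be applied in a single pass because the blocks are consecutive and disjoint. The sketch below illustrates this under simplifying assumptions that are not part of the paper's algorithm: the block size is fixed rather than drawn as $\mathrm{Bin}(n, p)$, there is no guessing of the stable rank, and the function name `block_power`, the Gaussian test matrix, and all constants are illustrative. Rescaling by $1/p$ is omitted since the vector is renormalized after every block.

```python
import numpy as np

def block_power(stream, block_size, t, rng):
    """Single pass: apply B_1^T B_1, ..., B_t^T B_t (disjoint consecutive blocks) to a Gaussian g."""
    d = stream.shape[1]
    z = rng.standard_normal(d)
    for j in range(t):
        block = stream[j * block_size:(j + 1) * block_size]   # rows seen during phase j
        z = block.T @ (block @ z)                               # z <- B_j^T B_j z
        z /= np.linalg.norm(z)                                  # keep z a unit vector
    return z

rng = np.random.default_rng(3)
n, d = 40000, 20
A = rng.standard_normal((n, d))
A[:, 0] *= 3.0                         # gap R is roughly 9; top eigenvector is near e_1
A = A[rng.permutation(n)]              # present the rows in a uniformly random order
v1 = np.linalg.svd(A, full_matrices=False)[2][0]   # exact top right singular vector
z = block_power(A, block_size=2000, t=12, rng=rng)
corr = float(np.dot(z, v1) ** 2)
```

Only the current vector `z` and one row at a time are needed in a true streaming implementation; the array slicing here is a convenience for the demonstration.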
To prove this lemma, our strategy is to show that the matrix product
$M := (B_t^T B_t) \cdots (B_1^T B_1)$
has a stable rank close to 1, meaning it has one very large singular value and the rest of the singular values are small. We can then argue that the vector $v = Mg/\|Mg\|_2$ is in the direction of the top singular vector of $M$. Using the fact that
$v_1^T (B_j^T B_j) v_1 \ge (1 - \varepsilon)\|A\|_2^2$
for all $j$, we show that the top singular vector of $M$ must have a large correlation with $v_1$. Therefore, it follows that the vector $v$ has a large correlation with $v_1$ as well. As part of the proof, we crucially use an inequality from Wang and Xi (1997).
If $t = \Theta(\log d)$ and $1/\mathrm{poly}(d) \le \varepsilon \le c/(\log d)^2$, then the above lemma shows that $v$ has a large correlation with the top singular vector $v_1$. Using this lemma, we show that Algorithm 1 can be used to obtain an approximation for $v_1$ in random order streams with bounded norms.
Theorem 2.4. [...] $\|A\|_F^2/\|A\|_2^2$ and the rows in the stream are ordered uniformly at random, then we can compute a vector $v$ using the block power method that satisfies
$|\langle v_1, v \rangle|^2 \ge 1 - 3\alpha$
with probability $\ge 4/5$ if $\sigma_1(A)/\sigma_2(A) \ge 2$. The algorithm uses $O(d \cdot \mathrm{polylog}(d)/\alpha^4)$ bits of space.
Proof. Set $\varepsilon = \alpha^2/(C \log^2 d)$ for a large enough constant $C$. Assuming $n = \Omega(\alpha^{-4} \rho \eta \log^6 d)$, we have $n = \Omega(\varepsilon^{-2} \rho \eta \log^2 d)$. Now consider the execution of Algorithm 1 on the matrix $A$, with parameters $\eta$ and $\varepsilon$. Let $\rho = 2^j$ be such that $\rho(A)/2 \le \rho \le \rho(A)$, and consider the execution in the algorithm with parameter $\rho$. Using Theorem 2.1, with probability $\ge 1 - 1/\mathrm{poly}(d)$, the algorithm computes $t$ matrices $B_1, \ldots, B_t$ such that for all $j \in [t]$,
$\left\|\frac{1}{p} B_j^T B_j - A^T A\right\|_2 \le \varepsilon \|A\|_2^2.$
Noting that
$z_\rho = (B_t^T B_t) \cdots (B_1^T B_1)\, g \,/\, \|(B_t^T B_t) \cdots (B_1^T B_1)\, g\|_2,$
by Lemma 2.3, we have with probability $\ge 9/10$ that
$\langle z_\rho, v_1 \rangle^2 \ge \frac{1}{1 + C' t \sqrt{\varepsilon}} \ge 1 - \alpha.$
Thus, for $\rho$ which satisfies $\rho(A)/2 \le \rho \le \rho(A)$, the algorithm computes a vector $z_\rho$ that has a large correlation with the vector $v_1$. Since the algorithm does not know the exact value of $\rho$, it computes an approximation for $\|Az\|_2^2$ for all $z \in \{z_1, z_2, z_4, \ldots, z_d\}$. First, we condition on the event, which holds with probability $\ge 1 - 1/\mathrm{poly}(d)$, that for all $z_i$, $\|GAz_i\|_2^2 = (1 \pm \varepsilon)\|Az_i\|_2^2$. Since $\langle z_\rho, v_1 \rangle^2 \ge (1 - \alpha)$, we note that $\|GAz_\rho\|_2^2 \ge (1 - \varepsilon)(1 - \alpha)\sigma_1(A)^2$. Now, for the vector $z$ returned by the algorithm, we have $\|Az\|_2^2 \ge (1 - O(\varepsilon))(1 - \alpha)\sigma_1(A)^2$, which implies that
$\langle z, v_1 \rangle^2 \cdot \sigma_1(A)^2 + (1 - \langle z, v_1 \rangle^2)\,\frac{\sigma_1(A)^2}{R} \ge \|Az\|_2^2 \ge (1 - \alpha - O(\varepsilon))\,\sigma_1(A)^2$
and therefore $\langle z, v_1 \rangle^2 \ge 1 - 3\alpha$ since $R \ge 2$.
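The final step of the proof can be spelled out as follows. The shorthand $x := \langle z, v_1 \rangle^2$ is introduced here only for readability; the last inequality assumes the constant $C$ in $\varepsilon = \alpha^2/(C \log^2 d)$ is large enough that the $O(\varepsilon)$ term is at most $\alpha$.

```latex
\begin{align*}
x\,\sigma_1(A)^2 + (1 - x)\,\frac{\sigma_1(A)^2}{R}
  &\ge (1 - \alpha - O(\varepsilon))\,\sigma_1(A)^2
  && \text{(divide by } \sigma_1(A)^2\text{)} \\
x\left(1 - \frac{1}{R}\right)
  &\ge 1 - \frac{1}{R} - \alpha - O(\varepsilon) \\
x &\ge 1 - \frac{\alpha + O(\varepsilon)}{1 - 1/R}
  \;\ge\; 1 - 2\alpha - O(\varepsilon)
  \;\ge\; 1 - 3\alpha
  && \text{(using } R \ge 2\text{)}
\end{align*}
```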

Section: Random Order Streams without Norm Bounds
Assuming that the random order streams are long enough, Theorem 2.4 shows that if all the squared row norms are within an η factor, then the block power method outputs a vector with a large correlation with the top eigenvector of the matrix A T A. For general streams, the factor η could be quite large and hence the algorithm requires very long streams to output an approximation to v 1 .
If there are no heavy rows, i.e., rows with a Euclidean norm larger than $\|A\|_F/\sqrt{d \cdot \mathrm{polylog}(d)}$, then the row norm sampling procedure in Theorem 2.1 can be used to convert any randomly ordered stream of rows into a uniformly random stream of rows that all have the same norm. The row norm sampling procedure computes a probability
$p_i = \min\left(1,\; C \varepsilon^{-2} \|a_i\|_2^2 \log d / \|A\|_2^2\right)$
and samples the row $a_i$ with probability $p_i$. If sampled, the row $a_i$ is scaled by $1/\sqrt{p_i}$. From Theorem 2.1, we have that the top eigenvector of the quadratic form of the sampled-and-rescaled submatrix is a good approximation to the top eigenvector of $A^T A$ when the gap $R$ is large enough. Suppose $p_i < 1$. If the row $a_i$ is sampled, we then have
$\|a_i/\sqrt{p_i}\|_2 = \frac{\varepsilon \|A\|_2}{\sqrt{C \log d}}.$
Thus, if $p_i < 1$ for all $i$, then all the sampled-and-rescaled rows have the same Euclidean norm and therefore, we can run the algorithm from Theorem 2.4 by setting $\eta = 1$. Note that $p_i = 1$ only if
$\|a_i\|_2^2 \ge \varepsilon^2 \|A\|_2^2/(C \log d).$
Since we assumed that there are no heavy rows, there is no row with p i = 1 as long as ε ≥ 1/ polylog(d). Thus, using Theorem 2.4 on the row norm sampled substream directly gives us a good approximation to the top eigenvector. However, in general, the streams can have rows with large Euclidean norm. We will now state our theorem and describe how such streams can be handled. 
Theorem 2.5. Let $R = \sigma_1(A)^2/\sigma_2(A)^2$. Assume $2 \le R \le C_1 \log^2 d$. Let $h$ be the number of rows in $A$ with norm at least $\|A\|_F/\sqrt{d \cdot \mathrm{polylog}(d)}$, where $\mathrm{polylog}(d) = \log^{C_2} d$ for a large enough universal constant $C_2$. Given the rows of the matrix $A$ in a uniformly random order, there is an algorithm that uses $O((h+1) \cdot d \cdot \mathrm{polylog}(d) \cdot \log n)$ bits of space and outputs a vector $v$ such that, with probability $\ge 4/5$, $v$ satisfies
$\langle v, v_1 \rangle^2 \ge 1 - 8/\sqrt{R},$
where $v_1$ is the top eigenvector of the matrix $A^T A$.
The key idea in proving this theorem is to partition the matrix A into A heavy and A light , where A heavy denotes the matrix with the heavy rows and A light denotes the matrix with the rest of the rows of A.
Since we assume that there are at most h heavy rows, we can store the matrix A heavy using O(h • d • polylog(d)) bits of space. Now consider the following two cases:
(i) $\|A_{\mathrm{heavy}}\|_2 \ge (1 - \beta)\|A\|_2$ or (ii) $\|A_{\mathrm{heavy}}\|_2 < (1 - \beta)\|A\|_2$
for some parameter $\beta$. In the first case, we can show that the top eigenvector $u$ of $A_{\mathrm{heavy}}^T A_{\mathrm{heavy}}$ is a good approximation for $v_1$. Since we store the full matrix $A_{\mathrm{heavy}}$, we can compute $u$ exactly at the end of the stream. Suppose instead that $\|A_{\mathrm{heavy}}\|_2 < (1 - \beta)\|A\|_2$. By the triangle inequality, we have $\|A_{\mathrm{light}}\|_2 > \beta \|A\|_2$. If we set $\beta$ large enough compared to $1/R$, then we can show that the top eigenvector $u'$ of $A_{\mathrm{light}}^T A_{\mathrm{light}}$ is a good approximation of $v_1$. From the above discussion, since all the rows of $A_{\mathrm{light}}$ are light, we can obtain a stream using Theorem 2.1 such that all the rows have the same norm and, additionally, the top eigenvector of this stream is a good approximation for $u'$ and therefore for $v_1$. We then approximate the top eigenvector of the new stream using Theorem 2.4. Setting $\beta$ appropriately, we show that this procedure can be used to compute a vector $v$ satisfying $\langle v, v_1 \rangle^2 \ge 1 - O(1/\sqrt{R})$, proving the theorem.
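The heavy/light partition and case (i) can be illustrated with the following sketch. It makes several assumptions that are not in the paper: the helper name `split_heavy_light` is hypothetical, the $\mathrm{polylog}(d)$ factor in the threshold is dropped, $\|A\|_F$ is computed offline instead of being guessed, and the test matrix has a single planted heavy row.

```python
import numpy as np

def split_heavy_light(A, threshold):
    """Partition rows by Euclidean norm: heavy rows are stored, light rows go to the sampler."""
    norms = np.linalg.norm(A, axis=1)
    return A[norms >= threshold], A[norms < threshold]

rng = np.random.default_rng(4)
n, d = 5000, 40
A = rng.standard_normal((n, d)) / np.sqrt(n)     # many light rows, A_light^T A_light is near I
A[0] = 5.0 * np.eye(d)[0]                        # one planted heavy row along e_1
threshold = np.linalg.norm(A, "fro") / np.sqrt(d)    # polylog(d) factor dropped for illustration
A_heavy, A_light = split_heavy_light(A, threshold)

# Case (i): the heavy part carries almost all of the spectral norm of A,
# so the top eigenvector u of A_heavy^T A_heavy is used as the output.
u = np.linalg.svd(A_heavy, full_matrices=False)[2][0]
v1 = np.linalg.svd(A, full_matrices=False)[2][0]
corr = float(np.dot(u, v1) ** 2)
```

In case (ii) one would instead feed `A_light` through row norm sampling and the block power method, as described in the text above.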

Section: Lower Bounds
Our algorithm uses $\tilde{O}(h \cdot d)$ space when the number of heavy rows in the stream is $h$. We want to argue that this is nearly tight. We show the following theorem.
Theorem 3.1. Given a dimension $d$, let $h$ and $R$ be arbitrary with
$R \le h \le d$ and $R^2 \cdot h = O(d)$.
Consider an algorithm $\mathcal{A}$ with the following property:
Given any fixed $n \times d$ matrix $A$ with $O(h)$ heavy rows and gap $\sigma_1(A)^2/\sigma_2(A)^2 \ge R$, in the form of a uniform random order stream, the algorithm $\mathcal{A}$ outputs a unit vector $v$ such that, with probability $\ge 1 - (1/2)^{4R+4}$ over the randomness of the stream and the internal randomness of the algorithm,
$|\langle v, v_1 \rangle|^2 \ge 1 - c/R^2.$
If $c$ is a small enough constant, then the algorithm $\mathcal{A}$ must use $\Omega(h \cdot d/R)$ bits of space.
The theorem shows that a streaming algorithm must use Ω(hd/R) bits of space assuming that with high probability, it outputs a vector with a large enough correlation with the top eigenvector of A T A when the rows are given in a random order stream.
Our proof uses the same lower bound instance as that of Price and Xun (2024). The key difference from their proof is that our lower bound must hold against random order streams.
Section: Improving the Gap Requirements in the Algorithm of Price and Xun

Section: Arbitrary Order Streams
As discussed in Section 2.1, we can guess an approximation of $\|A\|_2^2$ in powers of 2 and sample at most $O(d \log d/\varepsilon^2)$ rows in the stream to obtain a matrix $B$, in the form of a stream, satisfying
$\|B^T B - A^T A\|_2 \le \varepsilon \|A\|_2^2$
with a large probability. Using Weyl's inequalities, we obtain that
$\sigma_2(B^T B) \le \sigma_2(A^T A) + \varepsilon \|A\|_2^2$ and $\sigma_1(B^T B) \ge (1 - \varepsilon)\,\sigma_1(A^T A)$,
implying $R' = \sigma_1(B)^2/\sigma_2(B)^2 \ge (1 - \varepsilon)/(1/R + \varepsilon)$. For $\varepsilon = 1/(2R) \le 1/2$, we note $R' \ge R/3$. Let $n' = O(R^2 \cdot d \log d)$ be the number of rows in the matrix $B$ and note that $R' = \Omega(\log n' \cdot \log d)$ assuming $R = \Omega(\log^2 d)$. Hence, running the algorithm of Price and Xun on the rows of the matrix $B$, we compute a vector $v$ for which
$|\langle v, v_1' \rangle|^2 \ge 1 - \frac{\log d}{C R'} - \frac{1}{\mathrm{poly}(d)}$
with a large probability, where $v_1'$ is the top eigenvector of the matrix $B^T B$. We now note that if $v_1$ denotes the top eigenvector of the matrix $A^T A$, then $|\langle v_1, v_1' \rangle|^2 \ge 1 - O(1/R)$, which therefore implies that with a large probability,
$|\langle v, v_1 \rangle|^2 \ge 1 - \frac{\log d}{C R}.$
Thus, sub-sampling the stream using row norm sampling and then running the algorithm of Price and Xun (2024), we obtain an algorithm for arbitrary order streams with a gap R = Ω(log 2 d).
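The claim $R' \ge R/3$ used in this section follows by substituting $\varepsilon = 1/(2R)$ into the bound obtained from Weyl's inequalities; the short derivation below only assumes $R \ge 1$.

```latex
\begin{align*}
R' \;\ge\; \frac{1 - \varepsilon}{1/R + \varepsilon}
   \;=\; \frac{1 - \frac{1}{2R}}{\frac{1}{R} + \frac{1}{2R}}
   \;=\; \frac{2R - 1}{2R} \cdot \frac{2R}{3}
   \;=\; \frac{2R - 1}{3}
   \;\ge\; \frac{R}{3}
   \qquad \text{for } R \ge 1 .
\end{align*}
```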

Section: Random Order Streams
Lemma 3.5 in Price and Xun (2024) can be tightened when the rows of the stream are uniformly randomly ordered. Specifically, we want to bound the following quantity:
$\sum_{i=1}^{n} \langle a_i, P \hat{v}_{i-1} \rangle^2$
where $P = I - v_1 v_1^T$ denotes the projection away from the top eigenvector, and $\hat{v}_{i-1}$ is a function of $v_1, a_1, \ldots, a_{i-1}$. We have
$E[\langle a_i, P \hat{v}_{i-1} \rangle^2] = E\big[E[\langle a_i, P \hat{v}_{i-1} \rangle^2 \mid a_1, \ldots, a_{i-1}]\big].$
Given that the first $i - 1$ rows are $a_1, \ldots, a_{i-1}$, assuming uniform random order, we have
$E[\langle a_i, P \hat{v}_{i-1} \rangle^2 \mid a_1, \ldots, a_{i-1}] = \frac{1}{n - i + 1}\, \hat{v}_{i-1}^T P \left(A^T A - a_1 a_1^T - \cdots - a_{i-1} a_{i-1}^T\right) P \hat{v}_{i-1} \le \frac{\sigma_2(A)^2}{n - i + 1}.$
Hence $E[\langle a_i, P \hat{v}_{i-1} \rangle^2] \le \sigma_2(A)^2/(n - i + 1)$ and $E\left[\sum_{i=1}^{n} \langle a_i, P \hat{v}_{i-1} \rangle^2\right] \le \sigma_2(A)^2 (1 + \log n)$.
Price and Xun define $\eta \cdot \sigma_2(A)^2$ as $\sigma_2$ and in that notation, we obtain $\eta \sum_{i=1}^{n} \langle a_i, P \hat{v}_{i-1} \rangle^2 \le 10 \sigma_2 (1 + \log n)$ with probability $\ge 9/10$ by Markov's inequality. In the proof of Lemma 3.6 in Price and Xun (2024), if $\sigma_1/\sigma_2 \ge 20(1 + \log_2 n)$, we obtain $\log \|v_n\|_2 \gtrsim \sigma_1$. Now, $\sigma_1 = \Omega(\log d)$ ensures that the proof of Theorem 1.1 in their work goes through.
Using the row-norm sampling analysis from the previous section, we can assume n = poly(d) and therefore a gap of O(log d) between the top two eigenvalues of A T A is enough for Oja's algorithm to output a vector with a large correlation with the top eigenvector in random order streams.
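The $(1 + \log n)$ factor in the expectation bound of this section comes from a harmonic sum. The derivation below uses only the per-row bound $\sigma_2(A)^2/(n - i + 1)$ stated in the text; the symbol $X$ for the summed quantity is shorthand introduced here.

```latex
\begin{align*}
E[X] \;:=\; E\!\left[\sum_{i=1}^{n} \langle a_i, P\hat{v}_{i-1} \rangle^2\right]
  &\le \sigma_2(A)^2 \sum_{i=1}^{n} \frac{1}{n - i + 1}
   = \sigma_2(A)^2 \sum_{k=1}^{n} \frac{1}{k}
   \le \sigma_2(A)^2\,(1 + \ln n), \\
\Pr\big[X \ge 10\,E[X]\big] &\le \frac{1}{10}
  \qquad \text{(Markov's inequality, since } X \ge 0\text{)} .
\end{align*}
```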

Section: Hard Instance for Oja's Algorithm
At a high level, the algorithm of Price and Xun (2024) runs Oja's algorithm with different learning rates η and in the event that the norm of the output vector with each of the learning rates η is small, then the row with the largest norm is output. The algorithm is simple and can be implemented using an overall space of O(d • polylog(d)) bits.
The algorithm initializes z 0 = g where g is a random Gaussian vector. The algorithm streams through the rows a 1 , . . . , a n and performs the following operation
z i ← z i-1 + η • ⟨z i-1 , a i ⟩a i .
The algorithm computes the smallest learning rate η for which ∥z n ∥ 2 is large enough, and then outputs either z n /∥z n ∥ 2 or ā/∥ā∥ 2 as an approximation to the top eigenvector of the matrix A T A. Here ā denotes the row in A with the largest Euclidean norm.
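The core update can be sketched as follows for a single fixed learning rate. This is only the inner loop, not the full algorithm of Price and Xun: the search over learning rates and the fallback to the largest-norm row are omitted, and the function name `oja`, the value `eta = 2.0`, and the test matrix are assumptions for illustration. The vector is renormalized at each step and its log-norm is tracked separately, which leaves the direction unchanged and avoids overflow.

```python
import numpy as np

def oja(stream, eta, rng):
    """Run z_i = z_{i-1} + eta * <z_{i-1}, a_i> a_i, tracking log ||z_n||_2 separately."""
    z = rng.standard_normal(stream.shape[1])
    z /= np.linalg.norm(z)
    log_norm = 0.0                       # log of the norm growth since the start
    for a in stream:
        z = z + eta * np.dot(z, a) * a   # Oja update with learning rate eta
        s = np.linalg.norm(z)
        log_norm += np.log(s)
        z /= s                           # rescale; the direction is unaffected
    return z, log_norm

rng = np.random.default_rng(5)
n, d = 4000, 20
A = rng.standard_normal((n, d)) / np.sqrt(n)
A[:, 0] *= 5.0                           # sigma_1^2 is about 25 and sigma_2^2 is about 1
v1 = np.linalg.svd(A, full_matrices=False)[2][0]
z, log_norm = oja(A, eta=2.0, rng=rng)
corr = float(np.dot(z, v1) ** 2)
```

On this easy instance the gap is large, so a single learning rate suffices; the hard instance discussed next is designed so that no fixed rate works.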
The following theorem shows that at gaps ≤ O(log d/ log log d), we cannot use Oja's algorithm with a fixed learning rate η to obtain constant correlation with the top eigenvector.
Theorem 5.1. Given a dimension $d$, a constant $c > 0$, and a parameter $M$, for all gap parameters $R = O_c(\log d/\log\log d)$ there is a stream of vectors $a_1, \ldots, a_n \in \mathbb{R}^d$ with $n = O(R + M)$ such that:
1. $\sigma_1(A)^2/\sigma_2(A)^2 \ge R/2$, and
2. Oja's algorithm with any learning rate $\eta < M$ fails to output a unit vector $v$ that satisfies, with probability $\ge 9/10$,
$$|\langle v, v_1\rangle| \ge c$$
where $v_1$ is the top eigenvector of the matrix $A^TA$.
Moreover, the result holds irrespective of the order in which the vectors $a_1, \ldots, a_n$ are presented to Oja's algorithm. We will additionally show that even keeping track of the largest norm vector is insufficient to output a vector that has a large correlation with $v_1$.

Section: A Omitted Proofs
A.1 Proof of Theorem 2.1
Proof. Let $X_i$ denote the indicator random variable of the event that $Q_{ii}$ is nonzero. Note that $\mathbb{E}[X_i] = p_i$ and that $X_1, \ldots, X_n$ are independent. Define the $d \times d$ random matrix $Y_i = (X_i/p_i - 1)a_ia_i^T$, where $a_i$ denotes the $i$-th row of $A$. We note that
$$A^TQ^TQA - A^TA = \sum_{i=1}^n (X_i/p_i - 1)a_ia_i^T = \sum_{i=1}^n Y_i.$$
We use the Matrix Bernstein inequality (Tropp, 2015) to bound $\|\sum_i Y_i\|_2$. We first uniformly upper bound $\|Y_i\|_2$. If $p_i = 1$, by definition $\|Y_i\|_2 = 0$ with probability 1.
Let $p_i \ne 0$. Then,
$$\|(X_i/p_i - 1)a_ia_i^T\|_2 \le \|a_ia_i^T\|_2/p_i \le \frac{\varepsilon^2\|A\|_2^2}{C\log d}$$
with probability 1. We now bound $\|\sum_i \mathbb{E}[Y_i^2]\|_2$:
$$\sum_i \mathbb{E}[Y_i^2] = \sum_i \mathbb{E}[(X_i/p_i - 1)^2]\,\|a_i\|_2^2\,a_ia_i^T = \sum_{i : p_i > 0} (1/p_i - 1)\,\|a_i\|_2^2\,a_ia_i^T \preceq \sum_{i : p_i > 0} \frac{\varepsilon^2\|A\|_2^2}{C\|a_i\|_2^2\log d}\,\|a_i\|_2^2\,a_ia_i^T \preceq \frac{\varepsilon^2\|A\|_2^2}{C\log d}\,A^TA$$
which implies $\|\sum_i \mathbb{E}[Y_i^2]\|_2 \le \varepsilon^2\|A\|_2^4/(C\log d)$. Now, we obtain
$$\Pr\Big[\big\|\textstyle\sum_i Y_i\big\|_2 \ge \varepsilon\|A\|_2^2\Big] \le 2d \cdot \exp\left(-\frac{\varepsilon^2\|A\|_2^4/2}{\varepsilon^2\|A\|_2^4/(C\log d) + \varepsilon^3\|A\|_2^4/(3C\log d)}\right) \le 2d \cdot \exp\left(-\frac{C\log d}{2(1+\varepsilon/3)}\right).$$
If $C \ge 6(1+\varepsilon/3)$, then $\Pr[\|\sum_i Y_i\|_2 \ge \varepsilon\|A\|_2^2] \le 2/d^2$, which implies that with probability $\ge 1 - 2/d^2$, $\|A^TA - A^TQ^TQA\|_2 \le \varepsilon\|A\|_2^2$.
Now, the number of non-zero entries in the matrix $Q$ is equal to $\sum_i X_i$. We note that $\mathbb{E}[\sum_i X_i] \le C\varepsilon^{-2}\rho \cdot \log d$. By a Chernoff bound, we obtain that $\sum_i X_i = O(\varepsilon^{-2}\rho \cdot \log d)$ with probability $\ge 1 - 1/\mathrm{poly}(d)$.

A.2 Proof of Lemma 2.3
Proof. Define $M := (B_t^TB_t) \cdots (B_1^TB_1)$.
Our strategy is to show that if v 1 is the top singular vector of the matrix A, then ∥v T 1 M ∥ 2 is comparable to ∥M ∥ F given that σ 1 (A)/σ 2 (A) ≥ 2. We can then prove the lemma using simple properties of the Gaussian vector g.

For an arbitrary $j$, let $(B_j^TB_j)v_1 = \alpha v_1 + \Delta$ where $\Delta \perp v_1$. We note that $v_1^T(B_j^TB_j)v_1 = \alpha$. We have $\alpha = v_1^TB_j^TB_jv_1 \ge (1-\varepsilon)\sigma_1(A)^2$ using the fact that $\|B_j^TB_j - A^TA\|_2 \le \varepsilon\|A\|_2^2$ and $v_1^TA^TAv_1 = \sigma_1(A)^2 = \|A\|_2^2$.
If we show that $\Delta$ is small, then the vector $(B_j^TB_j)v_1$ is oriented in a direction very close to that of $v_1$. Note that
$$\|(B_j^TB_j)v_1\|_2 \le \|B_j^TB_j\|_2 \le (1+\varepsilon)\sigma_1(A)^2 \quad\text{and}\quad \|(B_j^TB_j)v_1\|_2^2 = \alpha^2 + \|\Delta\|_2^2$$
which implies $\|\Delta\|_2^2 \le ((1+\varepsilon)^2 - (1-\varepsilon)^2)\sigma_1(A)^4 = 4\varepsilon \cdot \sigma_1(A)^4$ and thus $\|\Delta\|_2 \le \sqrt{4\varepsilon}\,\sigma_1(A)^2$. Now,
$$\begin{aligned}
\|M^Tv_1\|_2 &= \|(B_1^TB_1) \cdots (B_{t-1}^TB_{t-1})(\langle B_t^TB_tv_1, v_1\rangle v_1 + \Delta_1)\|_2 \\
&\ge \langle B_t^TB_tv_1, v_1\rangle\,\|(B_1^TB_1) \cdots (B_{t-1}^TB_{t-1})v_1\|_2 - \|(B_1^TB_1) \cdots (B_{t-1}^TB_{t-1})\|_2\,\|\Delta_1\|_2 \\
&\ge (1-\varepsilon)\sigma_1(A)^2\,\|(B_1^TB_1) \cdots (B_{t-1}^TB_{t-1})v_1\|_2 - \sqrt{4\varepsilon}\,\sigma_1(A)^2\,\|(B_1^TB_1) \cdots (B_{t-1}^TB_{t-1})\|_2.
\end{aligned}$$
Expanding similarly, we obtain
$$\|M^Tv_1\|_2 \ge (1-\varepsilon)^t\sigma_1(A)^{2t} - t\sqrt{4\varepsilon}\,(1+\varepsilon)^{t-1}\sigma_1(A)^{2t}.$$
Assuming $\varepsilon \le c/t$ for a small constant $c$, we note that $(1-\varepsilon)^t \ge 1 - 2t\varepsilon$ and $(1+\varepsilon)^t \le 1 + 2t\varepsilon$, which implies
$$\|M^Tv_1\|_2 = \|(B_1^TB_1) \cdots (B_t^TB_t)v_1\|_2 \ge (1 - 2t\varepsilon - 4t\sqrt{\varepsilon})\,\sigma_1(A)^{2t}.$$
We shall now show a bound on $\|M\|_F = \|(B_t^TB_t) \cdots (B_1^TB_1)\|_F$ which lets us show that the unit vector $\hat{v}$ is highly correlated with $v_1$. To bound the quantity $\|M\|_F$, we first note the following facts:
1. $\|B_j^TB_j\|_2 \le (1+\varepsilon)\sigma_1(A)^2$, and
2. $\sigma_2(B_j^TB_j) \le \sigma_2(A)^2 + \varepsilon\sigma_1(A)^2 \le (1/4+\varepsilon)\sigma_1(A)^2$ by our gap assumption.
Now, we use the following theorem.
Theorem A.1 (Wang and Xi, 1997, Theorem 3(ii)). For any $r > 0$ and any matrices $A_1, \ldots, A_t$,
$$\sum_i \big(\sigma_i(A_1 \cdots A_t)\big)^r \le \sum_i \sigma_i(A_1)^r \cdots \sigma_i(A_t)^r.$$
Applying the above theorem with $r = 2$, we obtain
$$\|(B_t^TB_t) \cdots (B_1^TB_1)\|_F^2 \le (1+\varepsilon)^{2t}\sigma_1(A)^{4t} + (d-1)(1/4+\varepsilon)^t\sigma_1(A)^{4t} \le (1+4t\varepsilon)\sigma_1(A)^{4t} + \frac{d}{3^t}\,\sigma_1(A)^{4t}.$$
When $t \ge 3\log(d/\varepsilon)$, we have $\|(B_t^TB_t) \cdots (B_1^TB_1)\|_F^2 \le (1+4t\varepsilon+\varepsilon)\sigma_1(A)^{4t}$. We now use the following lemma.
Lemma A.2. Let $g$ be a Gaussian random vector with each of the components being an independent standard Gaussian random variable. Let $\hat{v} = Mg/\|Mg\|_2$. For any unit vector $v$, with probability $\ge 4/5$,
$$|\langle \hat{v}, v\rangle|^2 \ge \frac{1}{1 + C\,\dfrac{\|M\|_F^2 - \|M^Tv\|_2^2}{\|M^Tv\|_2^2}}$$
for a large enough universal constant $C$.
Proof. Since $v$ is a unit vector, we can write $\|Mg\|_2^2 = |v^TMg|^2 + \|(I - vv^T)Mg\|_2^2$. Hence, we have
$$|\langle \hat{v}, v\rangle|^2 = \frac{|v^TMg|^2}{\|Mg\|_2^2} = \frac{1}{1 + \dfrac{\|(I - vv^T)Mg\|_2^2}{|v^TMg|^2}}.$$
We now note that $v^TMg \sim N(0, \|M^Tv\|_2^2)$ and
$$\mathbb{E}[\|(I - vv^T)Mg\|_2^2] = \mathrm{tr}(M^T(I - vv^T)M) = \|M\|_F^2 - \|M^Tv\|_2^2.$$
By a union bound, with probability $\ge 4/5$, we have
$$\frac{\|(I - vv^T)Mg\|_2^2}{|v^TMg|^2} \le C\,\frac{\|M\|_F^2 - \|M^Tv\|_2^2}{\|M^Tv\|_2^2}$$
for a large enough constant $C$. Therefore, with probability $\ge 4/5$, we get that
$$|\langle \hat{v}, v\rangle|^2 \ge \frac{1}{1 + C\,\dfrac{\|M\|_F^2 - \|M^Tv\|_2^2}{\|M^Tv\|_2^2}}.$$
Applying the above lemma for $M = (B_t^TB_t) \cdots (B_1^TB_1)$ and $v = v_1$, we obtain
$$|\langle \hat{v}, v_1\rangle|^2 \ge \frac{1}{1 + C't\sqrt{\varepsilon}}$$
with probability $\ge 4/5$.

Section: A.3 Proof of Theorem 2.5
Proof. Partition the matrix $A$ into $A_{\mathrm{light}}$ and $A_{\mathrm{heavy}}$, where $A_{\mathrm{heavy}}$ is the submatrix with rows $a_i$ such that $\|a_i\|_2 > \|A\|_F/\sqrt{d \cdot \mathrm{polylog}(d)}$ and $A_{\mathrm{light}}$ consists of the remaining rows. From our assumption, the number of rows in $A_{\mathrm{heavy}}$ is at most $h$. Note that given a uniformly random stream of rows of $A$, we can obtain a uniformly random stream of rows of $A_{\mathrm{light}}$ by just filtering out the rows in $A_{\mathrm{heavy}}$.
Suppose $\|A_{\mathrm{heavy}} \cdot v_1\|_2 \ge (1-\beta)\|A\|_2$ for a parameter $\beta$ to be chosen later. Let $v_1'$ be the top singular vector of the matrix $A_{\mathrm{heavy}}$. Note
$$\|A \cdot v_1'\|_2^2 \ge \|A_{\mathrm{heavy}} \cdot v_1'\|_2^2 \ge \|A_{\mathrm{heavy}} \cdot v_1\|_2^2 \ge (1-\beta)^2\|A\|_2^2$$
and therefore we have $\langle v_1', v_1\rangle^2 \ge 1 - 4\beta$, assuming $R \ge 2$. Thus, while processing the stream, we can store all the heavy rows and at the end of the stream compute the top right singular vector of $A_{\mathrm{heavy}}$, in order to obtain a good approximation for $v_1$.
Suppose $\|A_{\mathrm{heavy}} \cdot v_1\|_2 \le (1-\beta)\|A\|_2$. This implies $\|A_{\mathrm{light}} \cdot v_1\|_2^2 \ge \|A\|_2^2 - \|A_{\mathrm{heavy}} \cdot v_1\|_2^2 \ge \beta \cdot \|A\|_2^2$. If we set $\beta \ge 2/R$, we have
$$\frac{\sigma_1(A_{\mathrm{light}})^2}{\sigma_2(A_{\mathrm{light}})^2} \ge \frac{\beta\|A\|_2^2}{\sigma_2(A)^2} \ge 2.$$
Let $v_1'$ be the top singular vector of $A_{\mathrm{light}}$. We will describe how to approximate $v_1'$. Consider applying the row norm sampling procedure with parameter $\varepsilon$ to the matrix $A_{\mathrm{light}}$. Given a row $a_i \in A_{\mathrm{light}}$, the corresponding sampling probability $p_i$ is given by
$$p_i = \frac{C\log d \cdot \|a_i\|_2^2}{\varepsilon^2\|A_{\mathrm{light}}\|_2^2} \le \frac{C\log d \cdot \|A\|_F^2/(d \cdot \mathrm{polylog}(d))}{\varepsilon^2\beta^2\|A\|_2^2} \le \frac{C}{\varepsilon^2\beta^2\,\mathrm{polylog}(d)}.$$
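The sampling probabilities above can be sketched in code as follows. This is an illustration under stated assumptions rather than the paper's procedure: capping $p_i$ at 1, rescaling a kept row by $1/\sqrt{p_i}$, using the natural logarithm, and the default value of the constant `C` are all our choices, and the squared spectral norm is passed in as a known quantity. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sampling_probabilities(rows: Sequence[Sequence[float]],
                           spectral_norm_sq: float,
                           eps: float, C: float = 1.0) -> List[float]:
    """p_i = min(1, C * log(d) * ||a_i||^2 / (eps^2 * ||A||_2^2))."""
    d = len(rows[0])
    scale = C * math.log(d) / (eps ** 2 * spectral_norm_sq)
    return [min(1.0, scale * sum(x * x for x in a)) for a in rows]


def sample_rows(rows: Sequence[Sequence[float]], probs: Sequence[float],
                rng: random.Random) -> List[List[float]]:
    """Keep row i independently with probability p_i, rescaled by 1/sqrt(p_i)."""
    sampled = []
    for a, p in zip(rows, probs):
        if p > 0.0 and rng.random() < p:
            weight = 1.0 / math.sqrt(p)
            sampled.append([weight * x for x in a])
    return sampled


rows = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.1]]
probs = sampling_probabilities(rows, spectral_norm_sq=4.0, eps=1.0)
print(probs)
print(sample_rows(rows, probs, random.Random(0)))
```

With the $1/\sqrt{p_i}$ rescaling, the Gram matrix of the sampled rows equals $A^TA$ in expectation, which is the property the Matrix Bernstein argument in A.1 relies on.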
Assuming that $\varepsilon^2\beta^2 \ge 1/\mathrm{polylog}(d)$, we obtain that $p_i < 1$ for all the rows in the matrix $A_{\mathrm{light}}$. Let $B_{\mathrm{light}}$ be the matrix obtained after applying the row norm sampling procedure to the matrix $A_{\mathrm{light}}$. Note that $\rho(B_{\mathrm{light}}) \approx \rho(A_{\mathrm{light}})$ and the number of rows in $B_{\mathrm{light}}$ is $\Theta(\rho(A_{\mathrm{light}}) \cdot \log d \cdot \varepsilon^{-2})$, and therefore $\Theta(\rho(B_{\mathrm{light}}) \cdot \log d \cdot \varepsilon^{-2})$. Setting $\varepsilon = \alpha^2/\log^{5/2} d$, we obtain that the number of rows in the matrix $B_{\mathrm{light}}$ is $\Theta(\alpha^{-4} \cdot \rho(B_{\mathrm{light}}) \cdot \log^6 d)$ and thus, assuming $\varepsilon^2\beta^2 = \alpha^4\beta^2/\log^5 d \ge 1/\mathrm{polylog}(d)$, we can use Theorem 2.4 to obtain a vector $v$ satisfying $\langle v, v_1'\rangle^2 \ge 1 - 3\alpha$. We will now show that $v_1'$ has a large correlation with $v_1$, which then implies that $v$ has a large correlation with $v_1$. Since $\|A_{\mathrm{light}}\|_2 \ge \|A\|_2 - \|A_{\mathrm{heavy}}\|_2 \ge \beta\|A\|_2$,
$$\|A_{\mathrm{light}}\|_2^2 = \|A_{\mathrm{light}} \cdot v_1'\|_2^2 \ge \beta\|A\|_2^2.$$

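Consider the following upper bound on $\|A_{\mathrm{light}} \cdot v_1'\|_2^2$:
$$\begin{aligned}
\|A_{\mathrm{light}}\|_2^2 = \|A_{\mathrm{light}} \cdot v_1'\|_2^2 &= \|A_{\mathrm{light}} \cdot (\langle v_1', v_1\rangle \cdot v_1 + (I - v_1v_1^T)v_1')\|_2^2 \\
&= \|\langle v_1, v_1'\rangle A_{\mathrm{light}} \cdot v_1 + A_{\mathrm{light}}(I - v_1v_1^T)v_1'\|_2^2 \\
&\le (1+\theta) \cdot \langle v_1, v_1'\rangle^2 \cdot \|A_{\mathrm{light}} \cdot v_1\|_2^2 + (1+1/\theta) \cdot \|A_{\mathrm{light}}(I - v_1v_1^T)v_1'\|_2^2
\end{aligned}$$
for any $\theta > 0$. Using the fact that the rows of the matrix $A_{\mathrm{light}}$ are a subset of the rows of the matrix $A$ and that $\|A(I - v_1v_1^T)\|_2 = \sigma_2(A) = \sigma_1(A)/\sqrt{R}$, we have
$$\begin{aligned}
\|A_{\mathrm{light}}\|_2^2 &\le (1+\theta) \cdot \langle v_1, v_1'\rangle^2 \cdot \|A_{\mathrm{light}}\|_2^2 + (1+1/\theta) \cdot \frac{\sigma_1^2}{R} \cdot (1 - \langle v_1, v_1'\rangle^2) \\
&= \langle v_1, v_1'\rangle^2\big((1+\theta) \cdot \|A_{\mathrm{light}}\|_2^2 - (1+1/\theta)\sigma_1^2/R\big) + (1+1/\theta) \cdot \sigma_1^2/R
\end{aligned}$$
which implies
$$\langle v_1, v_1'\rangle^2 \ge \frac{\|A_{\mathrm{light}}\|_2^2 - (1+1/\theta) \cdot \sigma_1^2/R}{(1+\theta)\|A_{\mathrm{light}}\|_2^2 - (1+1/\theta)\sigma_1^2/R} = 1 - \frac{\theta \cdot \|A_{\mathrm{light}}\|_2^2}{(1+\theta)\|A_{\mathrm{light}}\|_2^2 - (1+1/\theta)\sigma_1^2/R} \ge 1 - \frac{\theta}{1 + \theta - (1+1/\theta)/(R\beta)}$$
using the fact that $\|A_{\mathrm{light}}\|_2^2 \ge \beta\sigma_1^2$. Now assuming $R\beta > 1$ and picking $\theta = 2/(R\beta - 1)$, we obtain
$$\langle v_1, v_1'\rangle^2 \ge 1 - \frac{4R\beta}{(1+R\beta)^2} \ge 1 - \frac{4}{R\beta}.$$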
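The substitution $\theta = 2/(R\beta - 1)$ in the last step is easy to get wrong by hand, so the following short sketch checks it numerically. The function names are ours; the code only evaluates the two expressions from the derivation above and assumes $R\beta > 1$ so that $\theta$ is well defined.

```python
import math


def bound_with_theta(theta: float, R: float, beta: float) -> float:
    """Evaluate 1 - theta / (1 + theta - (1 + 1/theta) / (R * beta))."""
    x = R * beta
    return 1.0 - theta / (1.0 + theta - (1.0 + 1.0 / theta) / x)


def bound_closed_form(R: float, beta: float) -> float:
    """Evaluate 1 - 4 R beta / (1 + R beta)^2."""
    x = R * beta
    return 1.0 - 4.0 * x / (1.0 + x) ** 2


for R, beta in [(16.0, 0.25), (100.0, 0.1), (1e4, 0.01)]:
    theta = 2.0 / (R * beta - 1.0)
    lhs = bound_with_theta(theta, R, beta)
    rhs = bound_closed_form(R, beta)
    print(R, beta, lhs, rhs, math.isclose(lhs, rhs), rhs >= 1.0 - 4.0 / (R * beta))
```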
Lemma A.3. Let $Y_1, \ldots, Y_\ell$ be independent random variables. Let $i \sim [\ell]$ be a uniform random variable independent of $X$. We have
$$I(X; Y_1) + \cdots + I(X; Y_\ell) \ge \ell \cdot (I(X; Y_i) - \log_2 \ell).$$
Proof. By definition, we have $I(X; Y_i) = H(Y_i) - H(Y_i \mid X)$. Now, we note that
$$H(Y_i) \le H(Y_i, i) = H(i) + H(Y_i \mid i) = \log_2 \ell + \frac{H(Y_1) + \cdots + H(Y_\ell)}{\ell}.$$
We now lower bound $H(Y_i \mid X)$. Since conditioning never increases entropy, we obtain $H(Y_i \mid X) \ge H(Y_i \mid i, X)$. As $X$ is independent of $i$, we have
$$H(Y_i \mid X) \ge H(Y_i \mid i, X) = \frac{H(Y_1 \mid X) + \cdots + H(Y_\ell \mid X)}{\ell}$$
which then implies
$$I(X; Y_i) \le H(i) + \frac{H(Y_1) + \cdots + H(Y_\ell)}{\ell} - \frac{H(Y_1 \mid X) + \cdots + H(Y_\ell \mid X)}{\ell} \le H(i) + \frac{I(X; Y_1) + \cdots + I(X; Y_\ell)}{\ell}.$$
Since $H(i) = \log_2 \ell$, we have the proof.
Using this lemma,
$$I(s_{\mathrm{mid}}; x_{\pi^{-1}(1)}[(1-\gamma) \cdot d + 1 : d]) + \cdots + I(s_{\mathrm{mid}}; x_{\pi^{-1}(h/2)}[(1-\gamma) \cdot d + 1 : d]) \ge (h/2) \cdot \big(I(s_{\mathrm{mid}}; x_i[(1-\gamma) \cdot d + 1 : d]) - \log_2(h/2)\big) \ge \Omega(hd/R) - h\log_2 h.$$
Lemma A.4. If $X, Y$ are independent, then $I(Z; (X, Y)) \ge I(Z; X) + I(Z; Y)$.
Proof.
$$I(Z; (X, Y)) = H((X, Y)) - H((X, Y) \mid Z) = H(X) + H(Y) - H((X, Y) \mid Z).$$
Now, we note that for any three random variables $X$, $Y$, $Z$, we have $H((X, Y) \mid Z) \le H(X \mid Z) + H(Y \mid Z)$, which proves the lemma.
Using the independence of $x_1, \ldots, x_h$ conditioned on the event $E$, we obtain
$$I\big(s_{\mathrm{mid}}; (x_{\pi^{-1}(1)}[(1-\gamma) \cdot d + 1 : d], \ldots, x_{\pi^{-1}(h/2)}[(1-\gamma) \cdot d + 1 : d])\big) \ge \Omega(hd/R) - h\log_2 h$$
which then implies $H(s_{\mathrm{mid}}) \ge \Omega(hd/R)$ using the fact that $R^2 \cdot h = O(d)$. Finally, we have $\max |s_{\mathrm{mid}}| \ge \Omega(hd/R)$. Here $|s_{\mathrm{mid}}|$ is the number of bits used in the representation of the state $s_{\mathrm{mid}}$.
A.5 Proof of Theorem 5.1
Proof. Our instance consists of the following vectors:
1. $R$ copies of the vector $(1/\sqrt{R})e_1$,
2. 1 copy of the vector $(1/\sqrt{R-\varepsilon})e_2$, and
3. $\alpha$ copies of the vector $(1/\sqrt{\alpha \cdot R})e_3$,
where $\alpha = 2M$. Let $A$ be a matrix with rows given by the stream of vectors defined above. We note that the matrix $A$ has rank 3 and the non-zero eigenvalues of the matrix $A^TA$ are $1$, $1/(R-\varepsilon)$, $1/R$, and therefore the gap $\lambda_1(A^TA)/\lambda_2(A^TA) = R - \varepsilon$. The top eigenvector of the matrix $A^TA$ is $e_1$ and the row with the largest norm is $(1/\sqrt{R-\varepsilon})e_2$. Thus, the row with the largest norm is not useful to obtain correlation with the true top eigenvector $e_1$.
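To make the construction concrete, the sketch below builds the stream for small illustrative parameters and reads off the spectrum. Because every row is a multiple of a standard basis vector, $A^TA$ is diagonal and its diagonal entries are its eigenvalues. The helper names and the parameter values ($R = 4$, $\varepsilon = 0.5$, $M = 3$, $d = 5$) are ours and chosen only for illustration.

```python
import math
from typing import List


def hard_instance(R: int, eps: float, M: int, d: int) -> List[List[float]]:
    """R copies of e_1/sqrt(R), one copy of e_2/sqrt(R - eps), and
    alpha = 2M copies of e_3/sqrt(alpha * R), as rows in R^d (d >= 3)."""
    alpha = 2 * M

    def scaled_basis(k: int, scale: float) -> List[float]:
        row = [0.0] * d
        row[k] = scale
        return row

    rows = [scaled_basis(0, 1.0 / math.sqrt(R)) for _ in range(R)]
    rows.append(scaled_basis(1, 1.0 / math.sqrt(R - eps)))
    rows += [scaled_basis(2, 1.0 / math.sqrt(alpha * R)) for _ in range(alpha)]
    return rows


def gram_diagonal(rows: List[List[float]]) -> List[float]:
    """Diagonal of A^T A; equals the eigenvalues here since A^T A is diagonal."""
    d = len(rows[0])
    return [sum(a[j] ** 2 for a in rows) for j in range(d)]


rows = hard_instance(R=4, eps=0.5, M=3, d=5)
eigs = gram_diagonal(rows)
gap = eigs[0] / eigs[1]
largest = max(rows, key=lambda a: sum(x * x for x in a))
print("eigenvalues:", eigs)
print("gap:", gap)
print("largest-norm row:", largest)
```

The largest-norm row points along $e_2$, which is orthogonal to the top eigenvector $e_1$, matching the remark that this row gives no useful correlation.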
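Consider an execution of Oja's algorithm with a learning rate $\eta$ on the above stream of vectors. The final vector $z_n$ can be written as
$$z_n = \left(I + \frac{\eta}{R}e_1e_1^T\right)^{R}\left(I + \frac{\eta}{R\alpha}e_3e_3^T\right)^{\alpha}\left(I + \frac{\eta}{R-\varepsilon}e_2e_2^T\right)z_0.$$
For $j \in [d]$, let $z_{ij}$ denote the $j$-th coordinate of the vector $z_i$ so that we have
$$z_{n1} = \left(1 + \frac{\eta}{R}\right)^{R} \cdot z_{01}, \qquad z_{n2} = \left(1 + \frac{\eta}{R-\varepsilon}\right) \cdot z_{02}, \qquad\text{and}\qquad z_{n3} = \left(1 + \frac{\eta}{R\alpha}\right)^{\alpha} \cdot z_{03}.$$
We note that $z_{nj} = z_{0j}$ for all $j > 3$. Since $\alpha = 2M$ and $\eta < M$, we have $\eta/(R\alpha) \le 1/2$ and therefore
$$(1 + \eta/(R\alpha)) \ge \exp(\eta/(2R\alpha)) \quad\text{and}\quad (1 + \eta/(R\alpha))^{\alpha} \ge \exp(\eta/(2R)).$$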
Recall that we want to show that $|\langle z_n, e_1\rangle| < c\|z_n\|_2$ with a large probability. Suppose otherwise and that with probability $\ge 1/10$, we have
$$|\langle z_n, e_1\rangle| > c\|z_n\|_2 > c\|(0, 0, 0, z_{04}, \ldots, z_{0d})\|_2.$$
Since $z_0$ is initialized to be a random Gaussian, we have $\|(0, 0, 0, z_{04}, \ldots, z_{0d})\|_2 \ge \sqrt{d}/2$ with probability $1 - \exp(-d)$. Thus, we have with probability $\ge 1/11$ that
$$|z_{n1}| \ge c\sqrt{d}/2$$
which implies the learning rate must satisfy
$$(1 + \eta/R)^R \ge c'\sqrt{d}/2$$
since $|z_{01}| \le 10$ with probability $\ge 99/100$. Hence $\eta \ge R((c'd^{1/2})^{1/R} - 1)$. Now consider $|\langle z_n, e_3\rangle|/|\langle z_n, e_1\rangle|$. We have
$$\frac{|\langle z_n, e_3\rangle|}{|\langle z_n, e_1\rangle|} = \frac{\exp(\eta/R)}{(1 + \eta/R)^R} \cdot \frac{|z_{03}|}{|z_{01}|}.$$
With probability $\ge 95/100$, we have $1/C \le |z_{03}|/|z_{01}| \le C$ for a large enough constant $C$. We now consider the expression
$$\frac{\exp(\eta/R)}{(1 + \eta/R)^R}.$$
The expression is minimized at $\eta = R^2 - R$ and is increasing in the range $\eta \in [R^2 - R, \infty)$. When $R = O(\log d/\log\log d)$, we have that $R^2 - R \le R((c'd^{1/2})^{1/R} - 1)$ and therefore for all $\eta \ge R((c'd^{1/2})^{1/R} - 1)$, we have
$$\frac{\exp(\eta/R)}{(1 + \eta/R)^R} \ge \frac{\exp((c'd^{1/2})^{1/R})}{e \cdot c'd^{1/2}}.$$
When $R = O(\log d/\log\log d)$, we have
$$\frac{\exp(\eta/R)}{(1 + \eta/R)^R} \ge \mathrm{poly}(d)$$
which then implies $|\langle z_n, e_3\rangle| \ge |\langle z_n, e_1\rangle| \cdot \mathrm{poly}(d)/C$ with probability $\ge 95/100$, which contradicts our assumption that $|\langle z_n, e_1\rangle| \ge c\|z_n\|_2$.
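The behaviour of the expression $\exp(\eta/R)/(1+\eta/R)^R$ used above can be checked numerically. The sketch below works in log-space to avoid overflow; `log_ratio` is a name we introduce, and the values $R = 5$ and $x = 1000$ (standing in for $c'd^{1/2}$) are arbitrary illustrative choices.

```python
import math


def log_ratio(eta: float, R: float) -> float:
    """Natural log of exp(eta/R) / (1 + eta/R)^R."""
    return eta / R - R * math.log1p(eta / R)


R = 5.0
eta_star = R * R - R  # claimed minimizer of the expression

# The expression should not be smaller on either side of eta_star.
print(log_ratio(eta_star - 1.0, R), log_ratio(eta_star, R), log_ratio(eta_star + 1.0, R))

# At eta_0 = R * (x^(1/R) - 1) the expression equals exp(x^(1/R)) / (e * x).
x = 1000.0
eta_0 = R * (x ** (1.0 / R) - 1.0)
print(log_ratio(eta_0, R), x ** (1.0 / R) - 1.0 - math.log(x))
```

The minimizer follows from setting the derivative of the logarithm, $1/R - 1/(1+\eta/R)$, to zero, which gives $1 + \eta/R = R$.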
NeurIPS Paper Checklist

Section: Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes] Justification: Our paper is purely theoretical studying space-efficient algorithms for approximating the top eigenvector. We prove all the claims made in the abstract and introduction.

Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes] Justification: We do not have a specific limitations section but we do qualify all the statements noting the assumptions that need to be made to prove that our algorithms work.

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes] Justification: We include all the proofs in the main body and the appendix.

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA] Justification: No experimental results are given in this paper.


Section: Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes] Justification: Results in this paper are purely theoretical. While the algorithms proposed in this paper may be used with potentially negative consequences, the authors are unaware of such uses.

Section: Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [No] Justification: Our work studies algorithms for the top eigenvector estimation problem and is purely theoretical. While our algorithms may be used to impact society in a negative way, we are unaware of such use cases.

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA] Justification: [NA]

Section: Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA] Justification: [NA]

Section: Acknowledgements
The authors were supported in part by a Simons Investigator Award and NSF CCF-2335412. D. Woodruff was visiting Google Research while performing this work.

Section: 
We therefore have
Setting β = 1/ √ R and α = 1/ √ R, we satisfy all the requirements assuming that R ≤ polylog(d) and obtain a vector v satisfying ⟨v, v 1 ⟩ 2 ≥ 1 -8/ √ R. When ∥A heavy ∥ 2 ≥ (1 -β)∥A∥ 2 , we already have a vector v ′ = top eigenvector of A heavy that satisfies ⟨v, v 1 ⟩ 2 ≥ 1 -4β ≥ 1 -4/ √ R. Thus, in both the cases, we obtain a vector v satisfying ⟨v,
The procedure described requires knowing the approximate values of $\|A\|_F$ and $\|A_{\mathrm{light}}\|_2$. Since we assume that all the non-zero entries of the matrix have an absolute value at least $1/\mathrm{poly}(d)$ and at most $\mathrm{poly}(d)$, the values $\|A\|_F$ and $\|A_{\mathrm{light}}\|_2$ lie in the interval $[1/\mathrm{poly}(d), \mathrm{poly}(nd)]$. Hence, using $O(\log nd)$ guesses each for $\|A\|_F$ and $\|A_{\mathrm{light}}\|_2$ and using a Gaussian sketch of $A$ similar to that in Algorithm 1, we can obtain a vector satisfying the guarantees in the theorem.
A.4 Proof of Theorem 3.1
Proof. Let $x_1, \ldots, x_h$ be drawn independently and uniformly at random from $\{+1, -1\}^d$. Let $i \sim [h]$ be drawn uniformly at random, and for an integer $k$ to be chosen later, let $y_1, \ldots, y_k \in \mathbb{R}^d$ be vectors that share the first $(1-\gamma)d$ coordinates with the vector $x_i$. Each of the last $\gamma \cdot d$ coordinates of each of $y_1, \ldots, y_k$ is sampled uniformly at random from the set $\{+1, -1\}$. Define $z_1, \ldots, z_{h+k}$ such that $z_j = x_j$ for $j \le h$ and $z_j = y_{j-h}$ for $j > h$. Now consider the stream $z_1, \ldots, z_{h+k}$. Price and Xun argue that when $k \ge 4R$, the gap of this stream is at least $R$ with large probability over the randomness used in the construction of the stream. Let $\pi : [h+k] \to [h+k]$ be a uniformly random permutation independent of $i$. Consider the following event $E$:
We have that the probability of the event E is
Let S i be the set of permutations π that satisfy the above event. Therefore we have Pr π [π ∈ S i ] ≥ (1/2) k+1 . If the probability of failure, δ, of the algorithm A satisfies δ ≤ (1/2) k+4 , we have that
Let s mid be the state of the algorithm after h/2 steps and s fin be the final state of the algorithm. The randomness in s fin is from the following sources: (i) randomness of the vectors x 1 , . . . , x h , (ii) the index i ∈ [h], (iii) the vectors y 1 , . . . , y k , (iv) the permutation π, and (v) the internal randomness of the algorithm. From here on, condition on the event E, i.e., that the permutation π ∈ S i . We will not explicitly mention that all entropy and information terms in the proof are conditioned on E. Since π(i) ≤ h/2, we have
Using the data processing inequality, we obtain that
When h ≤ cd/R 2 , k = 4R, γ = 1/4 and ε ≤ c/k 2 for a small constant, we have as in the proof of Theorem 1.5 in Price and Xun (2024) that,
which now implies
Note that conditioned on the event E, the distribution of i is uniform over { π -1 (1), . . . , π -1 (h/2) }. We now prove the following lemma:


References:
[b0] Zeyuan Allen-Zhu; Yuanzhi Li (2017). First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. IEEE
[b1] Sepehr Assadi; Janani Sundaresan (2023). (Noisy) gap cycle counting strikes back: Random order streaming lower bounds for connected components and beyond. 
[b2] Maria-Florina Balcan; Simon Shaolei Du; Yining Wang; Adams Wei Yu (2016). An improved gap-dependency analysis of the noisy power method. PMLR
[b3] Christos Boutsidis; David P Woodruff; Peilin Zhong (2016). Optimal principal component analysis in distributed and streaming models. 
[b4] Amit Chakrabarti; Graham Cormode; Andrew McGregor (2008). Robust lower bounds for communication and stream computation. 
[b5] Kenneth L Clarkson; David P Woodruff (2017). Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM)
[b6] Mina Ghashami; Edo Liberty; Jeff M Phillips; David P Woodruff (2016). Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing
[b7] Ming Gu (2015). Subspace iteration randomization and singular value problems. SIAM Journal on Scientific Computing
[b8] Sudipto Guha; Andrew McGregor (2009). Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing
[b9] Sudipto Guha; Andrew McGregor; Suresh Venkatasubramanian (2005). Streaming and sublinear approximation of entropy and information distances. 
[b10] Anupam Gupta; Sahil Singla (2021). Random-order models. Oxford University Press
[b11] Moritz Hardt; Eric Price (2014). The noisy power method: A meta algorithm with applications. Advances in neural information processing systems
[b12] De Huang; Jonathan Niles-Weed; Rachel Ward (2021). Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates. PMLR
[b13] Prateek Jain; Chi Jin; Sham M Kakade; Praneeth Netrapalli; Aaron Sidford (2016). Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. PMLR
[b14] Syamantak Kumar; Purnamrita Sarkar (2023). Streaming pca for markovian data. 
[b15] Malik Magdon-Ismail (2010). Row sampling for matrix algorithms via a non-commutative bernstein bound. 
[b16] Ioannis Mitliagkas; Constantine Caramanis; Prateek Jain (2013). Memory-limited, streaming PCA. 
[b17] Ian Munro; Mike S Paterson (1980). Selection and sorting with limited storage. Theoretical computer science
[b18] Cameron Musco; Christopher Musco (2015). Randomized block krylov methods for stronger and faster approximate singular value decomposition. Advances in neural information processing systems
[b19] Cameron Musco; Christopher Musco; Aaron Sidford (2018). Stability of the lanczos method for matrix function approximation. SIAM
[b20] Erkki Oja (1982). Simplified neuron model as a principal component analyzer. Journal of mathematical biology
[b21] Eric Price; Zhiyang Xun (2024). Spectral guarantees for adversarial streaming PCA. FOCS
[b22] Joel A Tropp (2015). An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning
[b23] Jalaj Upadhyay (2016). Fast and space-optimal low-rank factorization in the streaming model with application in differential privacy. 
[b24] Bo-Ying Wang; Bo-Yan Xi (1997). Some inequalities for singular values of matrix products. Linear algebra and its applications

Figures:
Figure fig_0:
Type: figure
Caption: Theorem 2.4. Let $\alpha \ge 1/\mathrm{poly}(d)$ be an accuracy parameter. Let $\eta$ be a parameter such that $\max_i \|a_i\|_2^2 / \min_i \|a_i\|_2^2 \le \eta$. If the number of rows in the stream $n = \Omega(\alpha^{-4} \cdot \rho(A) \cdot \eta \cdot \log^6 d)$, where $\rho(A) = \|A\|_F^2 / \|A\|_2^2$, […] Footnote: $\mathrm{Bin}(n, p)$ denotes the binomial distribution with parameters $n$ and $p$.
Data:

Figure fig_2: 2
Type: figure
Caption: Theorem 2 .25. Let A be an n × d matrix with its non-zero entries satisfying 1/ poly(d) ≤ |A i,j | ≤ poly(d), and hence representable using O(log d) bits of precision.
Data: 



Formulas:
Formula formula_0: $\|A(I - uu^T)\|_F^2 \le (1 + \varepsilon)\|A(I - v_1 v_1^T)\|_F^2.$

Formula formula_1: $\langle v, v_1\rangle^2 \ge 1 - \frac{C \log d}{R} - \frac{1}{\mathrm{poly}(d)}$

Formula formula_2: $\langle v, v_1\rangle^2 \ge 1 - O(1/\sqrt{R})$

Formula formula_3: $\frac{\sigma_1(A)^2}{\sigma_2(A)^2} = R.$

Formula formula_4: $\langle v, v_1\rangle^2 \ge 1 - 1/(CR^2)$

Formula formula_5: $v' = (B_t^T B_t) \cdots (B_1^T B_1)$

Formula formula_6: $p_i \ge \min\left(1, \frac{C \|a_i\|_2^2}{\varepsilon^2 \|A\|_2^2} \log d\right)$, then with probability $1 - 1/\mathrm{poly}(d)$, $\|A^T A - A^T Q^T Q A\|_2 \le \varepsilon \|A\|_2^2$
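
The sampling rule in formula_6 can be illustrated with a minimal sketch. It assumes the squared spectral norm $\|A\|_2^2$ is supplied by the caller; the function names `sampling_probabilities` and `sample_rows` and the parameter `spectral_norm_sq` are hypothetical and only serve the illustration. Each kept row is rescaled by $1/\sqrt{p_i}$, which is the usual convention for the sampling matrix $Q$ and is assumed here.

```python
import math
import random

def sampling_probabilities(rows, eps, C, spectral_norm_sq):
    """Return p_i = min(1, C * ||a_i||^2 * log(d) / (eps^2 * ||A||_2^2)) per row."""
    d = len(rows[0])
    scale = C * math.log(d) / (eps ** 2 * spectral_norm_sq)
    return [min(1.0, scale * sum(x * x for x in a)) for a in rows]

def sample_rows(rows, probs, rng):
    """Keep row i independently with probability p_i, rescaled by 1/sqrt(p_i)."""
    kept = []
    for a, p in zip(rows, probs):
        if p > 0 and rng.random() < p:
            kept.append([x / math.sqrt(p) for x in a])
    return kept

# Usage: a 2 x 2 diagonal example whose squared spectral norm is 4.
rows = [[2.0, 0.0], [0.0, 1.0]]
probs = sampling_probabilities(rows, eps=1.0, C=1.0, spectral_norm_sq=4.0)
sampled = sample_rows(rows, probs, random.Random(0))
```

Rows with larger norm receive proportionally larger probabilities, so the heaviest rows are kept with probability 1 once $C \log d / \varepsilon^2$ is large enough.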

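Formula formula_7:
Input: An $n \times d$ matrix $A$ with $n = \Omega(\eta \cdot \rho(A) \cdot \log^2 d / \varepsilon^2)$ and $\max_i \|a_i\|_2^2 / \min_i \|a_i\|_2^2 \le \eta$
Output: A vector $z$
$t \leftarrow \lceil C_1 \log d \rceil$
Compute $G \cdot A$ in the stream, where $G$ is a Gaussian matrix with $O(\varepsilon^{-2} \log d)$ rows
for $\rho = 1, 2, 4, \ldots, d$ simultaneously do
    $p \leftarrow C_2 \eta \rho \log d / (n \varepsilon^2)$    // $p \le 1/(5t)$ for $\rho \le 2 \cdot \rho(A)$
    $z_\rho \sim N(0, 1)^d$
    for $j = 1, \ldots, t$ do
        $y_j \leftarrow \mathrm{Bin}(n, p)$
        if $y_j > 2np$ then return $\perp$ end
        // The rows $a_{(j-1) \cdot (2np) + 1}, \ldots, a_{(j-1) \cdot (2np) + y_j}$ form the matrix $B_j$ in the analysis.
        acc $\leftarrow 0$
        for $i = (j-1) \cdot (2np) + 1, \ldots, (j-1) \cdot (2np) + y_j$ do
            acc $\leftarrow$ acc $+ \langle a_i, z_\rho \rangle \cdot a_i$
        end
        // Here acc $= B_j^T B_j z_\rho$
        $z_\rho \leftarrow$ acc
        $z_\rho \leftarrow z_\rho / \|z_\rho\|_2$
    end
end
return $\arg\max_{z \in \{z_1, z_2, z_4, \ldots, z_d\}} \|(G \cdot A) z\|_2$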
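
The inner loop of the algorithm above can be sketched in a few lines. This is a simplified illustration, not the full procedure: it assumes a single fixed block length instead of the binomial block sizes $y_j$, runs one guess instead of all $\rho = 1, 2, 4, \ldots, d$ in parallel, and omits the Gaussian sketch $G \cdot A$ used for the final selection. The name `block_power_iteration` and its parameters are hypothetical.

```python
import math
import random

def block_power_iteration(rows, t, block_len, seed=0):
    """Apply (B_t^T B_t) ... (B_1^T B_1) to a Gaussian start vector.

    B_j is the j-th block of block_len consecutive rows of the stream.
    Returns the normalised result, or None if some block maps z to zero.
    """
    rng = random.Random(seed)
    d = len(rows[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for j in range(t):
        block = rows[j * block_len:(j + 1) * block_len]
        acc = [0.0] * d
        for a in block:
            inner = sum(ai * zi for ai, zi in zip(a, z))
            for k in range(d):
                acc[k] += inner * a[k]
        # Here acc = B_j^T B_j z, accumulated one row at a time.
        norm = math.sqrt(sum(x * x for x in acc))
        if norm == 0.0:
            return None
        z = [x / norm for x in acc]
    return z

# Usage: a stream whose Gram matrix per block is diag(90, 10, 10).
stream = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]] * 80
z = block_power_iteration(stream, t=8, block_len=30, seed=0)
```

Only the current vector and one accumulator are stored, so the working memory is $O(d)$ numbers per guess, independent of the stream length.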

Formula formula_8: $(\max_i \|a_i\|_2^2) / (\min_i \|a_i\|_2^2) \le \eta.$

Formula formula_9: $\frac{C \|a_i\|_2^2}{\varepsilon^2 \|A\|_2^2} \log d \le \frac{C \, \eta \|A\|_F^2 / n}{\varepsilon^2 \|A\|_2^2} \log d \le \frac{C \eta \rho \log d}{n \varepsilon^2} = p.$

Formula formula_10: $\|B^T B - A^T A\|_2 \le \varepsilon \|A\|_2^2$

Formula formula_11: $\|A^T A - \frac{1}{p} B_i^T B_i\|_2 \le \varepsilon \|A\|_2^2.$

Formula formula_12: $v := \frac{(B_t^T B_t) \cdots (B_1^T B_1) g}{\|(B_t^T B_t) \cdots (B_1^T B_1) g\|_2}$ satisfies $\langle v, v_1\rangle^2 \ge \frac{1}{1 + C' t \sqrt{\varepsilon}}$

Formula formula_13: $M := (B_t^T B_t) \cdots (B_1^T B_1)$

Formula formula_14: $v_1^T (B_j^T B_j) v_1 \ge (1 - \varepsilon) \|A\|_2^2$

Formula formula_15: $|\langle v_1, v\rangle|^2 \ge 1 - 3\alpha$ with probability $\ge 4/5$ if $\sigma_1(A)/\sigma_2(A) \ge 2$. The algorithm uses $O(d \cdot \mathrm{polylog}(d) / \alpha^4)$ bits of space.

Formula formula_16: $\|\frac{1}{p} B_j^T B_j - A^T A\|_2 \le \varepsilon \|A\|_2^2.$

Formula formula_17: $z_\rho = (B_t^T B_t) \cdots (B_1^T B_1) g / \|(B_t^T B_t) \cdots (B_1^T B_1) g\|_2$

Formula formula_18: $\langle z_\rho, v_1\rangle^2 \ge \frac{1}{1 + C' t \sqrt{\varepsilon}} \ge 1 - \alpha.$

Formula formula_19: $z_i$, $\|G A z_i\|_2^2 = (1 \pm \varepsilon) \|A z_i\|_2^2$. Since $\langle z_\rho, v_1\rangle^2 \ge (1 - \alpha)$, we note that $\|G A z_\rho\|_2^2 \ge (1 - \varepsilon)(1 - \alpha) \sigma_1(A)^2$

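Formula formula_20: $\ge (1 - O(\varepsilon))(1 - \alpha) \sigma_1(A)^2$ which implies that $\langle z, v_1\rangle^2 \cdot \sigma_1(A)^2 + (1 - \langle z, v_1\rangle^2) \frac{\sigma_1(A)^2}{R} \ge \|A z\|_2^2 \ge (1 - \alpha - O(\varepsilon)) \sigma_1(A)^2$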

Formula formula_21: $v_1\rangle^2 \ge 1 - 3\alpha$ since $R \ge 2$.

Formula formula_22: $p_i = \min(1, C \varepsilon^{-2} \|a_i\|_2^2 \log d / \|A\|_2^2)$

Formula formula_23: $\|a_i / \sqrt{p_i}\|_2 = \frac{\varepsilon \|A\|_2}{\sqrt{C \log d}}.$

Formula formula_24: $\|a_i\|_2^2 \ge \varepsilon^2 \|A\|_2^2 / (C \log d).$

Formula formula_25: Let $R = \sigma_1(A)^2 / \sigma_2(A)^2$. Assume $2 \le R \le C_1 \log^2 d$.

Formula formula_26: $\ge 4/5$, $v$ satisfies $\langle v, v_1\rangle^2 \ge 1 - 8/\sqrt{R}$,

Formula formula_27: (i) $\|A_{\mathrm{heavy}}\|_2 \ge (1 - \beta) \|A\|_2$ or (ii) $\|A_{\mathrm{heavy}}\|_2 < (1 - \beta) \|A\|_2$

Formula formula_28: $v_1\rangle^2 \ge 1 - O(1/\sqrt{R})$, proving the theorem.

Formula formula_29: $R \le h \le d$ and $R^2 \cdot h = O(d)$.

Formula formula_30: $v_1\rangle|^2 \ge 1 - c/R^2$.

Formula formula_31: $\|B^T B - A^T A\|_2 \le \varepsilon \|A\|_2^2$

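Formula formula_32: $\sigma_2(B^T B) \le \sigma_2(A^T A) + \varepsilon \|A\|_2^2$ and $\sigma_1(B^T B) \ge (1 - \varepsilon) \sigma_1(A^T A)$, implying $R' = \sigma_1(B)^2 / \sigma_2(B)^2 \ge (1 - \varepsilon) / (1/R + \varepsilon)$. For $\varepsilon = 1/(2R) \le 1/2$, we note $R' \ge R/3$. Let $n' = O(R^2 \cdot d \log d)$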
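
The bound $R' \ge R/3$ follows by direct substitution. A short worked version, assuming only $R \ge 1$, is:

```latex
\[
R' \;\ge\; \frac{1-\varepsilon}{1/R+\varepsilon}
\;=\; \frac{1-\frac{1}{2R}}{\frac{1}{R}+\frac{1}{2R}}
\;=\; \frac{2R}{3}\left(1-\frac{1}{2R}\right)
\;=\; \frac{2R-1}{3}
\;\ge\; \frac{R}{3}
\qquad \text{for } \varepsilon = \frac{1}{2R},\ R \ge 1 .
\]
```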

Formula formula_33: $|\langle v, v_1'\rangle|^2 \ge 1 - \frac{\log d}{C R'} - \frac{1}{\mathrm{poly}(d)}$

Formula formula_34: A T A, then |⟨v 1 , v ′ 1 ⟩| 2 ≥ 1 -O(1/R

Formula formula_35: $|\langle v, v_1\rangle|^2 \ge 1 - \frac{\log d}{C R}.$

Formula formula_36: $\sum_{i=1}^{n} \langle a_i, P v_{i-1}\rangle^2$

Formula formula_37: $E[\langle a_i, P v_{i-1}\rangle^2] = E[E[\langle a_i, P v_{i-1}\rangle^2 \mid a_1, \ldots, a_{i-1}]].$

Formula formula_38: $E[\langle a_i, P v_{i-1}\rangle^2 \mid a_1, \ldots, a_{i-1}] = \frac{1}{n - i + 1} v_{i-1}^T P (A^T A - a_1 a_1^T - \cdots - a_{i-1} a_{i-1}^T) P v_{i-1} \le \frac{\sigma_2(A)^2}{n - i + 1}$. Hence $E[\langle a_i, P v_{i-1}\rangle^2] \le \sigma_2(A)^2 / (n - i + 1)$ and $E[\sum_{i=1}^{n} \langle a_i, P v_{i-1}\rangle^2] \le \sigma_2(A)^2 (1 + \log n)$.
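
The final bound is the harmonic sum over the per-step bounds. Spelled out, reading $\log$ as the natural logarithm (an assumption about the intended base):

```latex
\[
\sum_{i=1}^{n} \frac{\sigma_2(A)^2}{n-i+1}
\;=\; \sigma_2(A)^2 \sum_{k=1}^{n} \frac{1}{k}
\;\le\; \sigma_2(A)^2 \left(1 + \int_{1}^{n} \frac{dx}{x}\right)
\;=\; \sigma_2(A)^2\,(1 + \ln n).
\]
```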

Formula formula_39: $z_i \leftarrow z_{i-1} + \eta \cdot \langle z_{i-1}, a_i\rangle a_i.$
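
The update rule above can be run directly over a stream of rows. The following is a minimal sketch; the function name `oja_pass` is hypothetical, and the per-step renormalisation is an addition for numerical stability. Because the update is linear in $z$, rescaling changes only the length of the iterate and not its direction.

```python
import math

def oja_pass(rows, eta, z0):
    """One pass of z <- z + eta * <z, a> * a over the rows, renormalised each step."""
    z = list(z0)
    for a in rows:
        inner = sum(zi * ai for zi, ai in zip(z, a))
        z = [zi + eta * inner * ai for zi, ai in zip(z, a)]
        # (I + eta * a a^T) is invertible for eta > 0, so the norm stays positive.
        norm = math.sqrt(sum(zi * zi for zi in z))
        z = [zi / norm for zi in z]
    return z

# Usage: 50 rows along e_1 and 50 much lighter rows along e_2.
stream = [[1.0, 0.0]] * 50 + [[0.0, 0.1]] * 50
z = oja_pass(stream, eta=1.0, z0=[1.0, 1.0])
```

In this toy stream the first coordinate is multiplied by $2$ at each of the 50 heavy rows while the second grows by a factor of $1.01$ per light row, so the output is essentially $e_1$.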

Formula formula_40: $|\langle v, v_1\rangle| \ge c$

Formula formula_41: $A^T Q^T Q A - A^T A = \sum_{i=1}^{n} (X_i / p_i - 1) a_i a_i^T = \sum_{i=1}^{n} Y_i.$

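Formula formula_42: Let $p_i \ne 0$. Then, $\|(X_i / p_i - 1) a_i a_i^T\|_2 \le \|a_i a_i^T\|_2 / p_i \le \varepsilon^2 \|A\|_2^2 / (C \log d)$ with probability 1. We now bound $\|\sum_i E[Y_i^2]\|_2$.
$\sum_i E[Y_i^2] = \sum_i E[(X_i / p_i - 1)^2] \, \|a_i\|_2^2 \, a_i a_i^T = \sum_{i : p_i > 0} (1/p_i - 1) \|a_i\|_2^2 \, a_i a_i^T \preceq \sum_{i : p_i > 0} \frac{\varepsilon^2 \|A\|_2^2}{C \|a_i\|_2^2 \log d} \|a_i\|_2^2 \, a_i a_i^T \preceq \frac{\varepsilon^2 \|A\|_2^2}{C \log d} A^T A$,
which implies $\|\sum_i E[Y_i^2]\|_2 \le \varepsilon^2 \|A\|_2^4 / (C \log d)$. Now, we obtain
$\Pr[\|\sum_i Y_i\|_2 \ge \varepsilon \|A\|_2^2] \le 2d \cdot \exp\left(-\frac{\varepsilon^2 \|A\|_2^4 / 2}{\varepsilon^2 \|A\|_2^4 / (C \log d) + \varepsilon^3 \|A\|_2^4 / (3 C \log d)}\right) \le 2d \cdot \exp\left(-\frac{C \log d}{2 (1 + \varepsilon/3)}\right)$.
If $C \ge 6 (1 + \varepsilon/3)$, then $\Pr[\|\sum_i Y_i\|_2 \ge \varepsilon \|A\|_2^2] \le 2/d^2$, which implies that with probability $\ge 1 - 2/d^2$, $\|A^T A - A^T Q^T Q A\|_2 \le \varepsilon \|A\|_2^2$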

Formula formula_43: $E[\sum_i X_i] \le C \varepsilon^{-2} \rho \cdot \log d$. By a Chernoff bound, we obtain that $\sum_i X_i = O(\varepsilon^{-2} \rho \cdot \log d)$ with probability $\ge 1 - 1/\mathrm{poly}(d)$.
A.2 Proof of Lemma 2.3
Proof. Define $M := (B_t^T B_t) \cdots (B_1^T B_1)$.

Formula formula_44: $(B_j^T B_j) v_1 = \alpha v_1 + \Delta$ where $\Delta \perp v_1$. We note that $v_1^T (B_j^T B_j) v_1 = \alpha$. We have $\alpha = v_1^T B_j^T B_j v_1 \ge (1 - \varepsilon) \sigma_1(A)^2$ using the fact that $\|B_j^T B_j - A^T A\|_2 \le \varepsilon \|A\|_2^2$ and $v_1^T A^T A v_1 = \sigma_1(A)^2 = \|A\|_2^2$.

Formula formula_45: $\|(B_j^T B_j) v_1\|_2 \le \|B_j^T B_j\|_2 \le (1 + \varepsilon) \sigma_1(A)^2$ and $\|(B_j^T B_j) v_1\|_2^2 = \alpha^2 + \|\Delta\|_2^2$, which implies $\|\Delta\|_2^2 \le ((1 + \varepsilon)^2 - (1 - \varepsilon)^2) \sigma_1(A)^4 = 4\varepsilon \cdot \sigma_1(A)^4$ and thus $\|\Delta\|_2 \le \sqrt{4\varepsilon} \, \sigma_1(A)^2$. Now,
$\|M^T v_1\|_2 = \|(B_1^T B_1) \cdots (B_{t-1}^T B_{t-1}) (\langle B_t^T B_t v_1, v_1\rangle v_1 + \Delta_1)\|_2$
$\ge \langle B_t^T B_t v_1, v_1\rangle \|(B_1^T B_1) \cdots (B_{t-1}^T B_{t-1}) v_1\|_2 - \|(B_1^T B_1) \cdots (B_{t-1}^T B_{t-1})\|_2 \|\Delta_1\|_2$
$\ge ((1 - \varepsilon) \sigma_1(A)^2) \|(B_1^T B_1) \cdots (B_{t-1}^T B_{t-1}) v_1\|_2 - (\sqrt{4\varepsilon} \, \sigma_1(A)^2) \|(B_1^T B_1) \cdots (B_{t-1}^T B_{t-1})\|_2$.

Formula formula_46: $\|M^T v_1\|_2 \ge (1 - \varepsilon)^t \sigma_1(A)^{2t} - t \sqrt{4\varepsilon} (1 + \varepsilon)^{t-1} \sigma_1(A)^{2t}.$

Formula formula_47: $\|M^T v_1\|_2 = \|(B_1^T B_1) \cdots (B_t^T B_t) v_1\|_2 \ge (1 - 2t\varepsilon - 4t\sqrt{\varepsilon}) \sigma_1(A)^{2t}.$

Formula formula_48: $\|M\|_F = \|(B_t^T B_t) \cdots (B_1^T B_1)\|_F$

Formula formula_49: 1. $\|B_j^T B_j\|_2 \le (1 + \varepsilon) \sigma_1(A)^2$, and 2. $\sigma_2(B_j^T B_j) \le \sigma_2(A)^2 + \varepsilon \sigma_1(A)^2 \le (1/4 + \varepsilon) \sigma_1(A)^2$

Formula formula_50: $\sum_i (\sigma_i(A_1 \cdots A_t))^r \le \sum_i \sigma_i(A_1)^r \cdots \sigma_i(A_t)^r.$

Formula formula_51: $\|(B_t^T B_t) \cdots (B_1^T B_1)\|_F^2 \le (1 + \varepsilon)^{2t} \sigma_1(A)^{4t} + (d - 1)(1/4 + \varepsilon)^{t} \sigma_1(A)^{4t} \le (1 + 4t\varepsilon) \sigma_1(A)^{4t} + \frac{d}{3^t} \sigma_1(A)^{4t}$. When $t \ge 3 \log(d/\varepsilon)$, we have $\|(B_t^T B_t) \cdots (B_1^T B_1)\|_F^2 \le (1 + 4t\varepsilon + \varepsilon) \sigma_1(A)^{4t}$

Formula formula_52: $\ge 4/5$, $|\langle \hat{v}, v\rangle|^2 \ge \frac{1}{1 + C \frac{\|M\|_F^2 - \|M^T v\|_2^2}{\|M^T v\|_2^2}}$

Formula formula_53: $|\langle \hat{v}, v\rangle|^2 = \frac{|v^T M g|^2}{\|M g\|_2^2} = \frac{1}{1 + \frac{\|(I - v v^T) M g\|_2^2}{|v^T M g|^2}}$. We now note that $v^T M g \sim N(0, \|M^T v\|_2^2)$ and $E[\|(I - v v^T) M g\|_2^2] = \mathrm{tr}(M^T (I - v v^T) M) = \|M\|_F^2 - \|M^T v\|_2^2$

Formula formula_54: $\frac{\|(I - v v^T) M g\|_2^2}{|v^T M g|^2} \le C \frac{\|M\|_F^2 - \|M^T v\|_2^2}{\|M^T v\|_2^2}$

Formula formula_55: $|\langle \hat{v}, v\rangle|^2 \ge \frac{1}{1 + C \frac{\|M\|_F^2 - \|M^T v\|_2^2}{\|M^T v\|_2^2}}.$

Formula formula_56: $M = (B_t^T B_t) \cdots (B_1^T B_1)$ and $v = v_1$, we obtain $|\langle v, v_1\rangle|^2 \ge \frac{1}{1 + C' t \sqrt{\varepsilon}}$

Formula formula_57: $\|A \cdot v_1'\|_2^2 \ge \|A_{\mathrm{heavy}} \cdot v_1'\|_2^2 \ge \|A_{\mathrm{heavy}} \cdot v_1\|_2^2 \ge (1 - \beta)^2 \|A\|_2^2$

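Formula formula_58: Suppose $\|A_{\mathrm{heavy}} \cdot v_1\|_2 \le (1 - \beta) \|A\|_2$. This implies $\|A_{\mathrm{light}} \cdot v_1\|_2^2 \ge \|A\|_2^2 - \|A_{\mathrm{heavy}} \cdot v_1\|_2^2 \ge \beta \cdot \|A\|_2^2$. If we set $\beta \ge 2/R$, we have $\frac{\sigma_1(A_{\mathrm{light}})^2}{\sigma_2(A_{\mathrm{light}})^2} \ge \frac{\beta \|A\|_2^2}{\sigma_2(A)^2} \ge 2$.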

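Formula formula_59: $p_i = \frac{C \log d \cdot \|a_i\|_2^2}{\varepsilon^2 \|A_{\mathrm{light}}\|_2^2} \le \frac{C \log d \cdot \|A\|_F^2 / (d \cdot \mathrm{polylog}(d))}{\varepsilon^2 \beta^2 \|A\|_2^2} \le \frac{C}{\varepsilon^2 \beta^2 \, \mathrm{polylog}(d)}$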

Formula formula_60: $B_{\mathrm{light}}$ is $\Theta(\rho(A_{\mathrm{light}}) \cdot \log d \cdot \varepsilon^{-2})$, and therefore $\Theta(\rho(B_{\mathrm{light}}) \cdot \log d \cdot \varepsilon^{-2})$

Formula formula_61: $\|A_{\mathrm{light}}\|_2 \ge \|A\|_2 - \|A_{\mathrm{heavy}}\|_2 \ge \beta \|A\|_2$, $\|A_{\mathrm{light}}\|_2^2 = \|A_{\mathrm{light}} \cdot v_1'\|_2^2 \ge \beta \|A\|_2^2$.

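Formula formula_62: $\|A_{\mathrm{light}} \cdot v_1'\|_2^2$:
$\|A_{\mathrm{light}}\|_2^2 = \|A_{\mathrm{light}} \cdot v_1'\|_2^2 = \|A_{\mathrm{light}} \cdot (\langle v_1', v_1\rangle \cdot v_1 + (I - v_1 v_1^T) v_1')\|_2^2$
$= \|\langle v_1, v_1'\rangle A_{\mathrm{light}} \cdot v_1 + A_{\mathrm{light}} (I - v_1 v_1^T) v_1'\|_2^2$
$\le (1 + \theta) \cdot \langle v_1, v_1'\rangle^2 \cdot \|A_{\mathrm{light}} \cdot v_1\|_2^2 + (1 + 1/\theta) \cdot \|A_{\mathrm{light}} (I - v_1 v_1^T) v_1'\|_2^2$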

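Formula formula_63: $v_1 v_1^T)\|_2 = \sigma_2(A) = \sigma_1(A)/\sqrt{R}$, we have
$\|A_{\mathrm{light}}\|_2^2 \le (1 + \theta) \cdot \langle v_1, v_1'\rangle^2 \cdot \|A_{\mathrm{light}}\|_2^2 + (1 + 1/\theta) \cdot \frac{\sigma_1^2}{R} \cdot (1 - \langle v_1, v_1'\rangle^2) = \langle v_1, v_1'\rangle^2 \left((1 + \theta) \cdot \|A_{\mathrm{light}}\|_2^2 - (1 + 1/\theta) \sigma_1^2 / R\right) + (1 + 1/\theta) \cdot \sigma_1^2 / R$,
which implies
$\langle v_1, v_1'\rangle^2 \ge \frac{\|A_{\mathrm{light}}\|_2^2 - (1 + 1/\theta) \cdot \sigma_1^2 / R}{(1 + \theta) \|A_{\mathrm{light}}\|_2^2 - (1 + 1/\theta) \sigma_1^2 / R} = 1 - \frac{\theta \cdot \|A_{\mathrm{light}}\|_2^2}{(1 + \theta) \|A_{\mathrm{light}}\|_2^2 - (1 + 1/\theta) \sigma_1^2 / R} \ge 1 - \frac{\theta}{1 + \theta - (1 + 1/\theta)/(R\beta)}$
using the fact that $\|A_{\mathrm{light}}\|_2^2 \ge \beta^2 \sigma_1^2$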

Formula formula_64: $\langle v_1, v_1'\rangle^2 \ge 1 - \frac{4 R \beta}{(1 + R\beta)^2} \ge 1 - \frac{4}{R\beta}.$

Formula formula_65: $I(X ; Y_1) + \cdots + I(X ; Y_\ell) \ge \ell \cdot (I(X ; Y_i) - \log_2 \ell).$

Formula formula_66: $I(X ; Y_i) = H(Y_i) - H(Y_i \mid X)$. Now, we note that $H(Y_i) \le H(Y_i, i) = H(i) + H(Y_i \mid i) = \log_2 \ell + \frac{H(Y_1) + \cdots + H(Y_\ell)}{\ell}$

Formula formula_67: $H(Y_i \mid X) \ge H(Y_i \mid i, X)$. As $X$ is independent of $i$, we have $H(Y_i \mid X) \ge H(Y_i \mid i, X) = \frac{H(Y_1 \mid X) + \cdots + H(Y_\ell \mid X)}{\ell}$, which then implies $I(X ; Y_i) \le H(i) + \frac{H(Y_1) + \cdots + H(Y_\ell)}{\ell} - \frac{H(Y_1 \mid X) + \cdots + H(Y_\ell \mid X)}{\ell} \le H(i) + \frac{I(X ; Y_1) + \cdots + I(X ; Y_\ell)}{\ell}$.

Formula formula_68: $I(s_{\mathrm{mid}} ; x_{\pi^{-1}(1)}[(1 - \gamma) \cdot d + 1 : d]) + \cdots + I(s_{\mathrm{mid}} ; x_{\pi^{-1}(h/2)}[(1 - \gamma) \cdot d + 1 : d]) \ge (h/2) \cdot \left(I(s_{\mathrm{mid}} ; x_i[(1 - \gamma) \cdot d + 1 : d]) - \log_2(h/2)\right) \ge \Omega(hd/R) - h \log_2 h$.
Lemma A.4. If $X, Y$ are independent, then $I(Z ; (X, Y)) \ge I(Z ; X) + I(Z ; Y)$.

Formula formula_69: $I(Z ; (X, Y)) = H((X, Y)) - H((X, Y) \mid Z) = H(X) + H(Y) - H((X, Y) \mid Z).$

Formula formula_70: $H((X, Y) \mid Z) \le H(X \mid Z) + H(Y \mid Z)$, which proves the lemma.

Formula formula_71: $I(s_{\mathrm{mid}} ; (x_{\pi^{-1}(1)}[(1 - \gamma) \cdot d + 1 : d], \ldots, x_{\pi^{-1}(h/2)}[(1 - \gamma) \cdot d + 1 : d])) \ge \Omega(hd/R) - h \log_2 h$

Formula formula_72: $H(s_{\mathrm{mid}}) \ge \Omega(hd/R)$ using the fact that $R^2 \cdot h = O(d)$. Finally, we have $\max |s_{\mathrm{mid}}| \ge \Omega(hd/R)$.

Formula formula_73: $(1/\sqrt{R}) e_1$,

Formula formula_74: $(1/\sqrt{\alpha \cdot R}) e_3$.

Formula formula_75: z n = I + η R e 1 e T 1 R I + η Rα e 3 e T 3 α I + 1 R -ε e 2 e T 2 v 0 .

Formula formula_76: z n1 = 1 + η R R • z 01 , z n2 = 1 + η R -ε

Formula formula_77: $z_{n3} = \left(1 + \frac{\eta}{R\alpha}\right)^{\alpha} \cdot z_{03}.$

Formula formula_78: $(1 + \eta/(R\alpha)) \ge \exp(\eta/(2R\alpha))$ and $(1 + \eta/(R\alpha))^{\alpha} \ge \exp(\eta/(2R))$.

Formula formula_79: $|\langle z_n, e_1\rangle| > c \|z_n\|_2 > c \|(0, 0, 0, z_{04}, \ldots, z_{0d})\|_2.$

Formula formula_80: |z n1 | ≥ c √ d/2

Formula formula_81: (1 + η/R) R ≥ c ′ √ d/2

Formula formula_82: $\frac{|\langle z_n, e_3\rangle|}{|\langle z_n, e_1\rangle|} = \frac{\exp(\eta/R)}{(1 + \eta/R)^R} \cdot \frac{|z_{03}|}{|z_{01}|}.$

Formula formula_83: $\frac{\exp(\eta/R)}{(1 + \eta/R)^R}$. The expression is minimized at $\eta = R^2 - R$ and is increasing in the range $\eta \in [R^2 - R, \infty)$. When $R = O(\log d / \log\log d)$, we have that $R^2 - R \le R((c' d^{1/2})^{1/R} - 1)$ and therefore for all $\eta \ge R((c' d^{1/2})^{1/R} - 1)$, we have $\frac{\exp(\eta/R)}{(1 + \eta/R)^R} \ge \frac{\exp((c' d^{1/2})^{1/R})}{e \cdot c' d^{1/2}}$. When $R = O(\log d / \log\log d)$, we have $\frac{\exp(\eta/R)}{(1 + \eta/R)^R} \ge \mathrm{poly}(d)$
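
The location of the minimizer can be checked numerically. Writing $f(\eta) = \eta/R - R \ln(1 + \eta/R)$ for the logarithm of the ratio, $f'(\eta) = 1/R - 1/(1 + \eta/R)$ vanishes exactly when $1 + \eta/R = R$, that is at $\eta = R^2 - R$. The helper name `log_ratio` below is hypothetical and the values of $R$ and $\eta$ are arbitrary illustrative choices.

```python
import math

def log_ratio(eta, R):
    """Natural log of exp(eta / R) / (1 + eta / R) ** R."""
    return eta / R - R * math.log1p(eta / R)

# Usage: for R = 5 the stationary point is eta = R^2 - R = 20.
R = 5
eta_star = R * R - R
values = [log_ratio(eta, R) for eta in (eta_star - 1, eta_star, eta_star + 1)]
```

Working with the logarithm avoids overflow in `exp` and in the power when $\eta$ or $R$ is large.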

