
Randomization, Approximation and Optimization

Fan Haoqiang Tsinghua, IIIS

a motivating example

suppose you are facing a large volume of network traffic: 1 TB/s, for 10 years. your memory is very limited.

General Flow Watcher

and you want to count the number of distinct packets, maybe counting the number of visitors to a particular site.

and you get MLE (memory limit exceeded)

is it ever possible given such limited memory?

negative result

it is not possible to precisely count the number of distinct elements if you cannot store them.

negative result

proof sketch: there must be two different sets of packets that lead the device to the same state (pigeon-hole principle), and it will fail because it cannot distinguish between the two sets.


how to make things work?

what if we only care about an approximation of the number of distinct packets? useful in plenty of cases.

still negative

it is not even possible to approximate the number of packets to within a constant factor. an α-approximation means: for every input sequence, at any time, D/α ≤ D̂ ≤ α·D,

where D is the (true) number of distinct elements, and D̂ is the estimated number.

proof sketch

there is no α-approximation algorithm that uses only O(log N) space. key idea: the number of internal states of the algorithm is too small.

proof sketch

n/2-dimensional subspaces in GF(p)^n serve as a family of sparse subsets; with only poly(N) internal states, two of them must collide heavily,

and a heavy collision forces an unbounded approximation ratio.

how to make it possible?

here randomization comes into play: we can accept a small probability of failure. the probability should depend only on the coin tosses made by us, not on the input.

randomized approximation

a randomized algorithm is said to be an (ε, δ)-approximation iff, for every input, Pr[(1−ε)·D ≤ D̂ ≤ (1+ε)·D] ≥ 1 − δ.

is it possible to construct such a randomized algorithm?

our first attempt

let's play with our first idea: the Bitmap Algorithm. pick a good hash function

and record all hash values encountered in a bitmap
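as an aside, here is a minimal sketch of the bitmap idea in Python (our own illustration, not the talk's code: blake2b stands in for the "good hash function", and the estimate inverts the expected number of empty bins, the so-called linear counting trick):

```python
import hashlib
import math

M = 1 << 14  # bitmap size; must be comfortably larger than D for this to work

def bucket(x: str) -> int:
    # blake2b stands in for a "good hash function" mapping elements to bins
    h = hashlib.blake2b(x.encode(), digest_size=8).digest()
    return int.from_bytes(h, "big") % M

def estimate_distinct(stream) -> float:
    bitmap = bytearray(M)              # one byte per bin, for simplicity
    for x in stream:
        bitmap[bucket(x)] = 1          # duplicates hit the same bin: no effect
    zeros = M - sum(bitmap)
    # E[zeros] = M * (1 - 1/M)^D ~ M * exp(-D/M), so invert: D ~ M * ln(M / zeros)
    return M * math.log(M / zeros)

# 50_000 packets but only 1000 distinct ones
stream = [f"pkt-{i % 1000}" for i in range(50_000)]
print(round(estimate_distinct(stream)))   # close to 1000
```

note that seeing the same packet many times changes nothing: the same bit is set again.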

coupon collector problem

how many coupons do you have to collect before having k out of n different types of coupons? an easy calculation of the expectation gives: E = n·(1/n + 1/(n−1) + … + 1/(n−k+1)) = n·(H_n − H_{n−k})
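the formula can be sanity-checked with exact arithmetic (a small sketch of ours; `expected_draws` is a hypothetical helper name):

```python
from fractions import Fraction

def expected_draws(n: int, k: int) -> Fraction:
    # while holding i distinct types, a new type appears with prob. (n-i)/n,
    # so the expected wait for it is n/(n-i); sum over i = 0..k-1
    return sum(Fraction(n, n - i) for i in range(k))

print(expected_draws(2, 2))              # 3: both of 2 coupon types take 3 draws on average
print(float(expected_draws(100, 100)))   # 100 * H_100, about 518.7
```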

coupon collection

how good is the approximation? not so sharp, but good enough

coupon collector problem

this only works when D is small compared to the bitmap size. so we only saved a logarithmic factor of space (bits instead of whole elements) rather than reaching the desired polylogarithmic space. any way to improve this idea?

reduce the number of elements

add a random sampler to make the number of elements bounded.

sample only 1/k of all packets, feed them to the bitmap counter, and multiply the result by k. works when D and k are of the same order of magnitude.

multi-resolution

what if we don't know how big D is in advance? build filters with different sampling factors and use the most accurate one.

(diagram: filters with sampling factors 1/1, 1/2, 1/4, ...)

sampling

does the sampling technique really work? negative: sampling half of the data does not mean getting half of the distinct elements.

how to make things work?

the crux: if an element is seen multiple times, its probability of being chosen accumulates. but duplicates should not have any effect on the count.

another way of sampling

base our decision on the element itself: choose x iff hash(x) < 1/k. now the number of chosen distinct elements scales linearly with D.

and it works

multi-resolution sampling + bitmap algorithm: polylogarithmic space with a constant approximation ratio. nearly reaches our goal.

(diagram: filters with sampling factors 1/1, 1/2, 1/4, ...)

law of small numbers

if something happens with probability p, and you have n independent chances of it, what's the distribution of the number of events that really happened? approximately the Poisson distribution with mean n·p.

wait a minute

when you have a strong sense that your algorithm will work, it is time to analyze it formally.

the bitmap

how large should the bitmap be? what's the role of the bitmaps in our algorithm?

simplified model

the part that really works is not the bitmap; it is the multi-resolution idea.

(diagram: filters with sampling factors 1/1, 1/2, 1/4, ...)

make it simple

what if the bitmap has only one bin? another algorithm

simplified algorithm

for each element encountered, compute its hash. the filter at level j accepts the element if its hash is less than 2^(−j). find the highest level L that any element ever reaches, and output 2^L. this uses only O(log N) bits of space now!
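a sketch of this simplified algorithm in Python (our illustration; blake2b plays the role of the ideal hash into [0, 1)):

```python
import hashlib

def h01(x: str) -> float:
    # blake2b stands in for an ideal hash into [0, 1)
    v = int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")
    return v / 2**64

def max_level_estimate(stream) -> int:
    best = 0
    for x in stream:
        hv, level = h01(x), 0
        while hv < 2.0 ** -(level + 1):   # element passes the (level+1)-th filter
            level += 1
        best = max(best, level)           # remember the highest level reached
    return 2 ** best                      # about D, up to a constant factor

est = max_level_estimate(f"pkt-{i}" for i in range(1000))
print(est)   # a power of two in the rough vicinity of 1000
```

only `best`, a single small integer, is kept between elements.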

intuition

intuitively, this algorithm should give a constant-factor approximation.


intuition

an element has a 2^(−j) chance of passing the j-th filter. if we really see one such element, it should be backed up by about 2^j distinct elements. how do we make this argument rigorous?

Bernoulli trials

suppose you have done n independent experiments, each of which has probability p of giving a positive outcome. what's your chance of getting at least one positive outcome? 1 − (1−p)^n. what's your chance of getting no positive outcome? (1−p)^n.

our first analysis

if the (exact) number of distinct elements is D, the chance that the output exceeds c·D is at most O(1/c), and the chance that it falls below D/c is at most O(1/c).

so we have a very good chance of giving an output in the range [D/c, c·D].

amplify the probabilities

is there a way to improve our chance of giving a good estimate (without changing the space complexity)? repeat (in parallel) many instances of our algorithm

and take the median

take the median of the T instances (instead of the average). the median falls below the good range only if at least half of the instances do, and by Chernoff bounds that probability decays exponentially in T.

as T increases, the probability approaches zero. the same applies to the other side.
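the median trick in code (a toy sketch of ours, with a deliberately noisy one-shot estimator standing in for a real one):

```python
import random
from statistics import median

def amplify(estimator, trials: int, rng: random.Random):
    # run `trials` independent copies and take the median: if each copy lands
    # in the good range with prob. > 0.5 + c, the median fails only with
    # probability exp(-Theta(c^2 * trials))
    return median(estimator(rng) for _ in range(trials))

# toy one-shot estimator: correct value 100, but off by a factor of 3
# in either direction 20% of the time
def noisy(rng):
    r = rng.random()
    if r < 0.2:
        return 300
    if r > 0.8:
        return 33
    return 100

print(amplify(noisy, 101, random.Random(0)))   # 100 with overwhelming probability
```

the average, in contrast, would be dragged up by the factor-3 outliers.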

probability amplification

so we only have to make sure that each one-sided failure rate is smaller than 0.5 by a constant margin. works well in practice.

wait a minute

there is a bug in our analysis (and also in the analysis of many other algorithms). our source of randomness

hash functions

we require that the hash values are

independently and identically distributed (i.i.d.)

ideally, the hash should be something like:

if x has not been encountered: fetch a fresh number from /dev/random and remember it; else: return the value we have already assigned to x

however

but we don't have enough space to remember all previously assigned values. so how things really work is:

randomly choose some parameters at the beginning; use these parameters to compute the hash function

information theoretic

so long as the parameters are much shorter than the full table of hash values, the hash values cannot be fully independent

take an example

say our hash is

hash(x)=x*a mod N

then hash(0) is always 0, and hash(x) and hash(y) have a very good chance of colliding whenever a·(x−y) ≡ 0 (mod N).

fix the bug

hash(x)=a*x+b mod P

P is a (big) prime number; a and b are randomly chosen. given hash(0) and hash(1), the hash function is uniquely determined.
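this family can be written down and its pairwise independence checked exhaustively on a tiny prime (a sketch of ours; for fixed x1 ≠ x2 the map (a, b) ↦ (hash(x1), hash(x2)) is a bijection, so every pair of hash values arises from exactly one choice of parameters):

```python
import random

P = 2_147_483_647  # a big Mersenne prime (illustrative choice)

def make_hash(rng: random.Random):
    # draw hash(x) = (a*x + b) mod P from the pairwise-independent family
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P

# exhaustive check on a tiny prime: for fixed x1 != x2, every pair of
# hash values occurs for exactly one (a, b)
p = 7
x1, x2 = 2, 5
counts = {}
for a in range(p):
    for b in range(p):
        pair = ((a * x1 + b) % p, (a * x2 + b) % p)
        counts[pair] = counts.get(pair, 0) + 1
assert len(counts) == p * p and all(c == 1 for c in counts.values())
```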

what to do

now that the hash function is not purely random, how do we base our analysis on such highly correlated hash values? find some guarantees that still hold.

pairwise independent

a nice property of our (linear) hash function: if we only look at two hash values hash(x) and hash(y), they are i.i.d. uniform for any choice of distinct x and y. pairwise independence does not imply global independence, but it turns out to be good enough.

weaker bound

suppose you conducted n experiments, each of which has probability p of giving a positive outcome, and they are only pairwise independent. what's your chance of getting at least one positive outcome? what's your chance of not getting any positive outcome?

Chebyshev bound

base our analysis on the expectation and the variance: Pr[|X − E[X]| ≥ t] ≤ Var[X] / t².

the conclusion

let's forget about the math. the conclusion is that we have at least a 0.057 chance of getting an output in the range [D/3, 3D]. the probability can be amplified to an arbitrarily high value, but the constant 3 cannot be improved by sheer repetition.

improve the accuracy

is there a way to improve the accuracy? finer-grained filters? 1, (1+a), (1+a)², ..., (1+a)^n

oh, no

not working. our previous approach works because 2 is a magic number that cannot be improved.

another view

only the element with the minimum hash value matters. in expectation, the minimum of D i.i.d. uniform [0, 1) variables is 1/(D+1).

another way

so here comes our next algorithm:

maintain the minimum m of the (normalized) hash values encountered; output 1/m

repeat many times, take the median

does it work?

a somewhat enhanced version of our first algorithm. it is guaranteed to be no worse than the previous algorithm, up to a constant factor. does the accuracy scale with the number of instances?

however

no, not if we only take the median: the distribution is skewed.

do not give up

repeat in another way: what if we keep the k-th smallest hash value?

and it works!

let ε be the desired relative accuracy and set k = Θ(1/ε²). consider the events "hash(x) < (1+ε)·k/D", each occurring with probability (1+ε)·k/D. the chance that fewer than k of these events occur is small (using Chebyshev, O(1/(ε²·k))); the same applies to the other side.

much better distribution

at the expense of increased space; but the space complexity remains polylogarithmic for any given ε and δ.
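the k-th-smallest idea (often called the KMV sketch) in a few lines of Python (our illustration, written for clarity rather than speed):

```python
import hashlib

def h01(x: str) -> float:
    # blake2b stands in for an ideal hash into [0, 1)
    v = int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")
    return v / 2**64

def kmv_estimate(stream, k: int = 256) -> float:
    smallest = set()                          # the k smallest distinct hash values
    for x in stream:
        hv = h01(x)
        if hv in smallest:
            continue                          # duplicates change nothing
        smallest.add(hv)
        if len(smallest) > k:
            smallest.discard(max(smallest))   # keep only the k smallest
    if len(smallest) < k:
        return float(len(smallest))           # fewer than k distinct: exact count
    # the k-th smallest of D uniform values sits near k / D
    return (k - 1) / max(smallest)

stream = [f"item-{i % 10000}" for i in range(30_000)]
print(round(kmv_estimate(stream)))   # roughly 10000
```

the relative error shrinks like 1/sqrt(k), so a larger k buys accuracy with space.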

the end of the story?

what's the intuition behind the k-th smallest number? it is a bitmap with k bins in disguise: the two ideas lead to the same algorithm.

what to do

what else do we have to do after devising an algorithm? show that it is optimal

the lower bound

communicate from the past to the present

enough space to distinguish between G1 and G2


coding theory

there exist exponentially many subsets of [N], each with cardinality N/2, such that any two subsets have at most (say) N/4 elements in common. counting argument: consider how many other subsets a single subset can rule out.

communication complexity

if you can approximately count the number of distinct elements to arbitrarily good precision, you can differentiate G1 from G2. results from communication complexity then say that at least Ω(N) space is required.

comments

hash functions are far from perfect; we need a solid basis for our analysis. what about hash-based data structures?

treap

a treap requires that the chance of any one node having the smallest weight among n nodes is 1/n. this is not hashing; /dev/random will do the job.

hashing strings

polynomial hashing: if the computation is done in a field (mod a prime P), hash(A) = hash(B) has at most max{len(A), len(B)} solutions for the base. choosing P sufficiently large suffices to guarantee that all of the strings have different hash values with high probability. if we require that all sub-strings of a string of length L avoid collisions, P should be even greater (there are on the order of L² sub-strings).
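a minimal polynomial hash in Python (our sketch; the constants P and B are illustrative choices, not prescribed by the talk, and in practice the base should be chosen randomly):

```python
P = 1_000_000_007   # a big prime: the field GF(P)
B = 131             # base; in practice, pick B at random in [1, P)

def poly_hash(s: str) -> int:
    # hash(s) = s[0]*B^(n-1) + s[1]*B^(n-2) + ... + s[n-1]  (mod P)
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % P
    return h

# two strings of length <= n collide for at most n choices of the base,
# because their hash difference is a nonzero polynomial of degree < n in B
assert poly_hash("abc") != poly_hash("abd")
assert poly_hash("hello") == poly_hash("hello")
```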

over a ring

what if hashing is done over a ring? mod 2^32 is a common practice.

counterexample

01101001100101101001011001101001... (the Thue-Morse sequence) collides with its bitwise complement for all odd bases.

counterexample

each doubling step multiplies the hash difference by (base^(2^n) − 1): if the two halves collided in the lowest n bits, the doubled string collides in even more bits. a string of modest length (a few thousand characters) is thus able to fail even a mod 2^64 hash.
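the counterexample can be made concrete (our verification sketch): by the doubling argument, a Thue-Morse prefix of length 2^11 = 2048 piles up at least 76 factors of two in the hash difference, so it collides with its complement mod 2^64 for every odd base.

```python
# Thue-Morse: T[i] = parity of the bits of i; tbar is its complement
n = 2048
t    = "".join(str(bin(i).count("1") % 2) for i in range(n))
tbar = "".join(str(1 - bin(i).count("1") % 2) for i in range(n))

def hash_mod_2_64(s: str, base: int) -> int:
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % 2**64
    return h

assert t != tbar
for base in (3, 131, 1_000_003):   # any odd base fails
    assert hash_mod_2_64(t, base) == hash_mod_2_64(tbar, base)
```

with a prime modulus and a random base, no such universal bad input exists.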

more on randomized algorithms

Monte Carlo and Las Vegas examples

polynomial identity testing

is P(x₁, ..., xₙ) ≡ 0? given a polynomial expression of degree at most d, see if it is equal to zero (after expanding).

Monte Carlo

choose a sufficiently large field and randomly sample the inputs. if the polynomial is non-zero, the chance that it evaluates to zero is at most d/|F| (the Schwartz-Zippel lemma), proved by induction on the number of variables and the degree of the polynomial.
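the test in code (our sketch; `probably_zero` is a hypothetical helper name):

```python
import random

P = 2**61 - 1   # a large prime field

def probably_zero(poly, nvars, trials=10, seed=0):
    # Schwartz-Zippel: a nonzero polynomial of total degree d evaluates to
    # zero at a uniformly random point of GF(P)^nvars with prob. at most d/P
    rng = random.Random(seed)
    for _ in range(trials):
        point = [rng.randrange(P) for _ in range(nvars)]
        if poly(*point) % P != 0:
            return False     # a nonzero evaluation proves non-identity
    return True              # all trials zero: identically zero w.h.p.

# (x + y)^2 - x^2 - 2xy - y^2 expands to zero; dropping the 2xy term does not
assert probably_zero(lambda x, y: (x + y)**2 - x*x - 2*x*y - y*y, 2)
assert not probably_zero(lambda x, y: (x + y)**2 - x*x - y*y, 2)
```

no expansion of the expression is ever needed, which is the whole point.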

Monte Carlo

no efficient deterministic algorithm is known yet

associativity test

given an operator (as a multiplication table over n elements), see if it has associativity. the brute-force algorithm runs in O(n³).

randomized algorithm

let u, v, w be vectors of random numbers in a field, indexed by the set's elements. define the convolution (u ∗ v)_c = Σ_{a·b=c} u_a · v_b.

if the operator has associativity, then (u ∗ v) ∗ w and u ∗ (v ∗ w) are identical;

and with high probability, if u, v, w are randomly sampled, the converse holds as well.

give me a proof

look closely at what we are doing. reduction to PIT

and more?

distributive test? test if a subring is an ideal? normal subgroup?

median in rectangle

in a number matrix, find the sub-matrix with the greatest median.

1 6 5 7
2 15 9 4
13 16 10 14
11 3 8 12

baseline

algorithm based on binary search.

linear time algorithm

an algorithm that runs in linear time, based on random sampling. a very clever idea.

recall

the minimum of k i.i.d. uniform [0, 1] random variables is 1/(k+1) in expectation. we have an efficient way of filtering out rectangles whose median is smaller than a threshold. we need a way to quickly handle a small number of rectangles.

compression

checking all the rectangles against a threshold can be done efficiently (with 2-D prefix counts).

compression

use a list to maintain the numbers. during each iteration of the binary search, the list halves.


Las Vegas

randomized global min-cut


min cut

choose an edge with probability proportional to its cost and merge its two endpoints; repeat until only one (merged) edge remains.


analyzing the algorithm

intuitively, this algorithm doesn't seem to have a good chance of success, even if we repeat it multiple times. how do we bound the probability of giving the correct output in a single run?

analyzing the algorithm

look at one minimum cut and compute its chance of surviving


key observation

the probability of destroying a given cut in one round is proportional to its cost, and the min-cut is smaller than the average cost of cutting out a single node.

telescoping product

so, the chance that the cut will survive till the end is at least: ∏_{t=3}^{n} (1 − 2/t) = 2 / (n(n−1)).

so, we need to repeat it O(n² log n) times. this leads to an O(n⁴ log n) algorithm.
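the contraction algorithm plus repetition can be sketched as follows (our unoptimized illustration: each contraction rescans the edge list, rather than the O(n²)-per-run adjacency-matrix version discussed next):

```python
import random

def karger_once(n, edges, rng):
    # one run: repeatedly pick an edge with prob. proportional to its cost and
    # merge its endpoints, until two super-nodes remain; return that cut's cost
    comp = list(range(n))                   # node -> current super-node
    remaining = n
    while remaining > 2:
        live = [(u, v, c) for u, v, c in edges if comp[u] != comp[v]]
        u, v, _ = rng.choices(live, weights=[c for _, _, c in live])[0]
        a, b = comp[u], comp[v]
        comp = [a if x == b else x for x in comp]   # contract b into a
        remaining -= 1
    return sum(c for u, v, c in edges if comp[u] != comp[v])

def min_cut(n, edges, runs, seed=0):
    rng = random.Random(seed)
    return min(karger_once(n, edges, rng) for _ in range(runs))

# two triangles joined by a single cheap edge: the min cut is 1
edges = [(0,1,1), (1,2,1), (0,2,1), (3,4,1), (4,5,1), (3,5,1), (2,3,1)]
print(min_cut(6, edges, runs=300))   # 1 with overwhelming probability
```

every run outputs some valid cut, so taking the minimum over runs never undershoots.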

implementation detail

how to run one instance of the algorithm in O(n²) time? use an adjacency matrix to store the graph, and maintain the total weight of the edges incident to each node. two-step sampling: choose a node, then choose an edge.

beat the baseline

our baseline algorithm is to run network flow O(n) times, which is expensive even with the state-of-the-art network flow algorithm. is there a way to improve the randomized algorithm?

heuristic that works

we are more likely to fail at the second half of the execution.


recursive

when the graph's size is reduced by a factor of √2, run two independent instances of the rest of the algorithm and choose the better one


the overall complexity

in general, the recurrence T(n) = 2·T(n/√2) + O(n²) solves to O(n² log n), and a single run now succeeds with probability Ω(1/log n).

yet another example

a linear-time MST algorithm

MST

idea: what if we randomly sample the edges? apply the algorithm recursively, and use the MST of the subgraph to prune the remaining edges.

after pruning

consider Kruskal's algorithm, adding edges in order of weight from lightest to heaviest. at most n−1 edges will actually be added; in expectation only O(n) edges will be considered (each considered edge is discarded with prob. 0.5).

make it work

after pruning, we have only O(n) edges left. the pruning step can be implemented in linear time using some (sophisticated) data structures. more work is needed to make it really work: how do we reduce the number of nodes?

Boruvka's step

find the lightest outgoing edge for each node and contract the components these edges form. this reduces the number of nodes by at least half.
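Boruvka's algorithm in full, built from repeated such steps (our sketch; it assumes distinct edge weights so the selected edges cannot form cycles):

```python
def boruvka_mst(n, edges):
    # edges: list of (w, u, v) with distinct weights w; nodes are 0..n-1
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mst_weight, components = 0, n
    while components > 1:
        best = {}                           # component -> lightest outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in best or w < best[r][0]:
                        best[r] = (w, ru, rv)
        for w, ru, rv in set(best.values()):
            if find(ru) != find(rv):        # may have merged earlier this round
                parent[find(ru)] = find(rv)
                mst_weight += w
                components -= 1
    return mst_weight

# square 0-1-2-3 with a diagonal: the MST uses the three lightest edges
print(boruvka_mst(4, [(1,0,1), (2,1,2), (3,2,3), (4,3,0), (5,0,2)]))   # 6
```

each round at least halves the number of components, so there are O(log n) rounds.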

Boruvka's step

Optimization

now it is time to turn to another topic

(numeric) optimization

maximize f(x) subject to some constraints

motivation

find a direction along which two sets of points are furthest separated: the separating hyperplane with the largest margin.

quadratic programming

maximize the margin subject to the separation constraints: a convex (quadratic) problem.

general attack

gradient descent

general attack

iterative algorithm: x ← x − η·∇f(x)

some guarantees

when the algorithm terminates, the gradient is zero, hence a stationary point (hopefully a local minimum) is reached. if the step size η is small enough, every step is guaranteed to improve the objective function.

practical issues

how to choose the step size? how fast will it converge? how to handle constraints?

step size

it should be large enough so that progress is made it should be small enough to avoid divergence

step size

mathematically, setting η inversely proportional to the current iteration number will do the job. in practice, extensive hand-tuning is required.

step size and speed

consider minimizing an ill-conditioned quadratic such as f(x, y) = x² + 50y², starting at (1, 1). the step size has to be lower than 0.02 or y will blow up, but then convergence on x will be very slow.
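a concrete run (our sketch, using f(x, y) = x² + 50y², one function consistent with the slide's 0.02 threshold: the y-gradient is 100y, so each update multiplies y by (1 − 100η)):

```python
def gd(eta, steps=100):
    # gradient descent on f(x, y) = x^2 + 50*y^2, gradient (2x, 100y)
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = x - eta * 2 * x, y - eta * 100 * y
    return x, y

x_ok,  y_ok  = gd(0.019)   # stable: |1 - 100*eta| = 0.9 < 1
x_bad, y_bad = gd(0.021)   # divergent: |1 - 100*eta| = 1.1 > 1
print(abs(y_ok))    # tiny: y has converged
print(abs(y_bad))   # huge: y blew up
print(abs(x_ok))    # but x only shrinks by a factor 0.962 per step
```

no single step size is good for both coordinates: this is the conditioning problem.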

step size and speed

the eigen-value gap problem

second order methods

use (estimated) Hessian to find the best direction conjugate gradient, line search

constraints

descend, then project back onto the feasible set

duality

how to get rid of the constraints? Lagrange multipliers duality

duality

meditate on the following fact: the maximum margin is the same as the minimum distance between the two polytopes (the convex hulls of the point sets).

Lagrange multipliers

if x* is the solution to the original problem, then there is a λ such that x* is also the maximizer of the Lagrangian L(x, λ). consider the gradient: ∇f(x*) = λ·∇g(x*).

dual of the program

minimize ‖Σᵢ αᵢ·xᵢ − Σⱼ βⱼ·yⱼ‖² subject to: α, β are non-negative, and sum(α) = 1, sum(β) = 1. these constraints are easier to handle.

coordinate descent

if all but two variables are fixed, the optimization problem becomes trivial optimize two at a time

coordinate descent

active set

another issue

before applying coordinate descent to all optimization problems, be aware of its pitfall: it can get stuck at a local minimum.

neural networks

the paramount success story of gradient descent algorithms

what's a neural network

a non-linear parametrized function family with a layered structure

neuron

non-linear computing unit

y = f(x1*w1 + x2*w2 + x3*w3)  (inputs x1, x2, x3 with weights w1, w2, w3)

training neural networks

given some input-output pairs, find proper values for the weights w such that the network's output is closest to the desired output

a (numeric) optimization problem

very hard

full of local minima due to its non-linearity; NP-hard for a wide range of networks; virtually impossible to compute the optimal solution.

gradient descent

the way neural networks are trained: randomly choose the initial weights, then use gradient descent.

back propagation

use the chain rule to compute the derivatives: from ∂L/∂y, compute ∂L/∂w and ∂L/∂x, and propagate backwards layer by layer.
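back-propagation for a single neuron, checked against numerical differentiation (our sketch, assuming y = tanh(w·x) and L = (y − t)²):

```python
import math

def forward(w, x):
    # a single neuron: y = tanh(w . x)
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def loss(w, x, t):
    return (forward(w, x) - t) ** 2

def grad(w, x, t):
    # chain rule: dL/dw_i = dL/dy * dy/ds * ds/dw_i
    # with y = tanh(s), s = w.x, L = (y - t)^2
    y = forward(w, x)
    dL_dy = 2 * (y - t)
    dy_ds = 1 - y * y            # tanh'(s)
    return [dL_dy * dy_ds * xi for xi in x]

# sanity check against central finite differences
w, x, t = [0.3, -0.2, 0.5], [1.0, 2.0, -1.0], 0.7
g = grad(w, x, t)
eps = 1e-6
for i in range(3):
    wp = w[:]; wp[i] += eps
    wm = w[:]; wm[i] -= eps
    num = (loss(wp, x, t) - loss(wm, x, t)) / (2 * eps)
    assert abs(num - g[i]) < 1e-6
```

in a deep network the same local rule is applied layer by layer, reusing ∂L/∂x of one layer as ∂L/∂y of the layer below it.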

very hard

(hopefully) converges to a local minimum. good enough in practice, but requires massive computation.

google's style

a neural network with millions of neurons and one billion trainable parameters. optimization is done on a 1000-machine cluster (16000 cores in total); it would have run for 10 years on one PC.

optimization

(parallelized) gradient descent. the result is surely not a (local) minimum, or even close to one, but it has been good enough to give surprising results.

the idea of machine learning

the traditional view of supervised learning: training samples, model, prediction.

(x, y) training pairs → model w → prediction y' for a new input x'

example

MNIST

the theorem

the most important result in statistical learning: given sufficient training samples, under some conditions (tons of technical details omitted), if the model fits the training samples well, it is mathematically guaranteed that the model will give good predictions on unseen samples.

a tangible example

SVM

support vector

theorem

let s be the number of support vectors out of n training samples; the expected generalization error is at most E[s]/n (via a leave-one-out argument).

how to reason about unseen examples?

unseen example

assume the samples are obtained from a (fixed) distribution. only the support vectors matter.

support vector

leave one out

assume the model is trained on all but one (randomly chosen) training sample; the error on the held-out sample is an unbiased estimator of the generalization error. the model will not change so long as the held-out sample is not a support vector.

reinforcement learning

a neural network that plays games: backgammon, on par with human champions.

learning algorithm

TD(λ): the network plays with itself and learns from the games.

game playing

estimate the probability of winning for each arrangement of the board with f(x, w); search one step ahead and greedily choose the best action.

TD-lambda

the loss function: f should be in line with the final result, and f should not change too rapidly during the game.

reinforcement learning

play the game with itself, and run gradient descent on the parameters of f. Gerald Tesauro, 1994, "TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play": human master performance after 1,500,000 rounds of training.

neural networks at work

facial keypoint detection trained on tens of thousands of images

neural networks at work

used in face++, a leading face recognition system built by megvii

mailto career@megvii.com if you are interested in an internship
