
wc2014


Randomization, Approximation and Optimization
Fan Haoqiang, Tsinghua IIIS

a motivating example
suppose you are facing a large volume of network traffic (say 1T/s for 10 years) on a "General Flow Watcher" device whose memory is very limited, and you want to count the number of distinct packets: maybe counting the number of visitors to a particular site.

and MLE
is it ever possible given such limited memory, or is a memory-limit-exceeded verdict unavoidable?

negative result
it is not possible to precisely count the number of distinct elements if you cannot store them all.

negative result
proof sketch: there must be two different sets of packets that lead the device to the same internal state (pigeon-hole principle). the device then fails because it cannot distinguish the two sets: append an element that belongs to one set but not the other, and the true counts become N and N + 1 while the device's answers stay identical.

how to make things work?
what if we only care about an approximation of the number of distinct packets? that is useful in plenty of practical cases.

still negative
it is not even possible to approximate the number of distinct packets up to a constant factor (at least not deterministically). an α-approximation means that for every input sequence, at any time,

D/α ≤ (estimated count) ≤ α·D,

where D is the true number of distinct elements.

proof sketch
there is no deterministic α-approximation algorithm that uses only a small amount of memory. key idea: the number of internal states of the algorithm is too small.

proof sketch
consider the n/2-dimensional subspaces of GF(p)^n, viewed as sparse subsets of the universe. a device with only poly(N) internal states cannot tell all of them apart, so two very different subsets collide heavily in the state space, and confusing them forces an unbounded approximation ratio.

how to make it possible?
here randomization comes into play: we can accept a small probability of failure, where the probability depends only on the coin tosses made by us, not on the input.

randomized approximation
a randomized algorithm is said to be an (α, δ)-approximation iff, for every input sequence and at any time, Pr[ D/α ≤ output ≤ α·D ] ≥ 1 − δ, where the probability is taken only over the algorithm's own coin tosses.

is it possible to construct such a randomized algorithm?

our first attempt
let's play with our first idea: the Bitmap Algorithm. pick a good hash function that maps each packet into one of m bins,

and record in a bitmap which hash values have been encountered.
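
to make the idea concrete, here is a minimal Python sketch of the bitmap counter (the hash parameters, the bitmap size m and the inversion formula m·ln(m/empty) are illustrative choices, not the speaker's code):

import math, random

def bitmap_estimate(stream, m=1 << 16, seed=0):
    # hash every packet into one of m bins, remember which bins were hit,
    # and invert the expected fill rate to estimate the number of distinct packets
    rnd = random.Random(seed)
    p = (1 << 61) - 1                      # a big prime used as the hash modulus
    a, b = rnd.randrange(1, p), rnd.randrange(p)
    bitmap = bytearray(m)
    for x in stream:
        bitmap[(a * hash(x) + b) % p % m] = 1
    empty = m - sum(bitmap)                # number of untouched bins
    if empty == 0:
        return float('inf')                # bitmap saturated: m was too small for D
    # E[empty] = m * (1 - 1/m)^D ~ m * exp(-D/m), so solve for D
    return m * math.log(m / empty)

print(round(bitmap_estimate(str(i % 1000) for i in range(10**5))))   # roughly 1000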

coupon collector problem
how many coupons do you have to collect before having k out of n different types of coupons? an easy calculation of the expectation gives about n · ln(n / (n − k)) draws (about n · ln n to collect all types).

coupon collection
how good is the approximation? not so sharp, but good enough

coupon collector problem
this only works when D is small compared to the bitmap size; so we only saved a log-factor of space (a bit per element instead of the whole element) rather than the desired dramatic reduction. any way to improve this idea?

reduce the number of elements
add a random sampler in front of the counter to keep the number of elements bounded: sample only 1/k of all packets, feed them to the bitmap counter, and multiply the result by k. this works when D and k are of the same order of magnitude.

multi-resolution
what if we don't know how big D is in advance? build filters with different sampling factors (1/1, 1/2, 1/4, ...) and use the most accurate one.

sampling
does the sampling technique really work? negative: sampling half of the data does not mean seeing half of the distinct elements.

how to make things work?
the crux: if an element is seen multiple times, its probability of being chosen accumulates; but repeated copies of an element should not have any effect on a distinct count.

another way of sampling
base our decision on the element itself: choose x iff hash(x) falls below the sampling threshold. now the number of distinct elements that pass the filter scales linearly with the sampling rate, no matter how often each element repeats.

and it works
multi-resolution sampling + bitmap algorithm: a small (polylogarithmic) amount of space with a constant approximation ratio. this nearly reaches our goal. (one filter per sampling rate 1/1, 1/2, 1/4, ..., each feeding its own bitmap.)

law of small numbers
if something happens with probability p and you have n independent chances for it, what's the distribution of the number of events that really happen? approximately a Poisson distribution with mean n·p (for small p and large n).

wait a minute
when you have a strong sense that your algorithm will work, it is time to analyze it formally.

the bitmap
how large should the bitmap be? what's the role of the bitmaps in our algorithm?

simplified model
the part that really does the work is not the bitmap; it is the multi-resolution idea (the cascade of 1/1, 1/2, 1/4, ... filters).

make it simple
what if the bitmap had only one bin? that gives another algorithm.

simplified algorithm
for each element encountered, compute its hash; the filter at level i accepts the element if its hash (viewed as a fraction of the hash range) is less than 2^(-i). find the highest level that any element ever reaches and output 2 raised to that level. this uses only a tiny amount of memory (one small counter plus the hash parameters) now!
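
a rough Python sketch of this one-value-per-stream algorithm, under the assumption that the hash behaves like a uniform value in [0, 1); the 61-bit prime and the function name are illustrative:

import math, random

def max_level_estimate(stream, seed=0):
    # the filter at level i accepts an element iff its hash, viewed as a
    # fraction in [0,1), is below 2**-i; remember only the highest level reached
    rnd = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rnd.randrange(1, p), rnd.randrange(p)
    best = 0
    for x in stream:
        u = ((a * hash(x) + b) % p) / p          # pseudo-uniform value in [0,1)
        level = 60 if u == 0 else min(60, int(-math.log2(u)))
        best = max(best, level)
    return 2 ** best                             # output 2 raised to the highest level

print(max_level_estimate(str(i % 5000) for i in range(10**5)))   # right order of magnitude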

intuition
intuitively, this algorithm should give a constant-factor approximation.


intuition
an element has a 2^(-i) chance of passing the i-th filter, so if we really see one element at level i, it should be backed up by roughly 2^i distinct elements. how do we make this argument rigorous?

Bernoulli trials
suppose you have done n independent experiments, each of which has probability p of giving a positive outcome. what's your chance of getting at least one positive outcome? what's your chance of getting no positive outcome at all? the latter is (1 − p)^n, roughly e^(−n·p).

our first analysis
if the (exact) number of distinct elements is D, the chance that the output is larger than D by more than a constant factor is small, and so is the chance that it is smaller than D by more than a constant factor.

so we have a very good chance of giving an output within a constant factor of D.

amplify the probabilities
is there a way to improve our chance of giving a good estimate (without changing the space complexity)? repeat many instances of our algorithm in parallel.

and take the median
take the median (instead of the average). the median of T independent copies falls below the lower bound only if at least half of the copies do; since each copy does so with probability strictly below 1/2, this happens with probability exponentially small in T.

as T increases, the probability approaches zero. the same applies to the upper bound.

probability amplification
so we only have to make sure that each one-sided failure rate is smaller than 0.5 by a constant margin. this works well in practice.
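
a minimal sketch of the median trick; the helper name amplify and the default of 15 copies are illustrative choices:

import statistics

def amplify(run_once, copies=15):
    # run `copies` independent instances of a randomized estimator
    # (each with its own seed) and return the median of their outputs
    return statistics.median(run_once(seed) for seed in range(copies))

# usage (list_of_packets is a hypothetical stream stored as a list):
# amplify(lambda s: max_level_estimate(list_of_packets, seed=s))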

wait a minute
there is a bug in our analysis (and also in the analyses of many other algorithms): our source of randomness.

hash functions
we require that the hash values are

independently and identically distributed
ideally, the hash should be something like:
if x has not been encountered, fetch a fresh number from /dev/random; else return the value we have already assigned to x.

however
but we don't have enough space to remember all previously assigned values. so how things really work is:
randomly choose some parameters at the beginning, and use these parameters to compute the hash function.

information theoretic
as long as the description of the hash parameters is much shorter than the input, the hash values cannot be fully independent.

take an example
say our hash is
hash(x)=x*a mod N

then hash(0) is always 0, and whenever a shares a factor with N, whole families of inputs (those differing by N / gcd(a, N)) collide with certainty; for a random a this happens with a very good chance.

fix the bug
hash(x) = (a*x + b) mod P

P is a (big) prime number, and a and b are randomly chosen. given hash(0) and hash(1), the hash function is uniquely determined.
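
a small sketch of this fixed hash family; the prime P and the helper name make_hash are illustrative, but the pairwise-independence property is exactly the slide's point:

import random

P = (1 << 61) - 1                      # a big prime, as on the slide

def make_hash(seed=None):
    # h(x) = (a*x + b) mod P with a, b chosen at random once, at the beginning;
    # for any two distinct integer inputs x != y, the pair (h(x), h(y)) is uniform
    rnd = random.Random(seed)
    a, b = rnd.randrange(1, P), rnd.randrange(P)
    return lambda x: (a * x + b) % P

h = make_hash(42)
print(h(0), h(1))   # these two values pin down a and b, hence the whole function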

what to do
now that the hash function is not purely random, how do we base our analysis on such highly correlated hash values? find some guarantees that still hold.

pairwise independent
a nice property of our (linear) hash function: if we only look at two hash values hash(x) and hash(y), they are independently and uniformly distributed, for any fixed choice of x ≠ y. pairwise independence does not mean global independence, but it turns out to be good enough.

weaker bound
suppose you conducted n experiments, each of which has probability p of giving a positive outcome, and they are only pairwise independent. what's your chance of getting at least one positive outcome? what's your chance of not getting any positive outcome?

Chebyshev bound
base our analysis on expectation and variance: with pairwise independence the variance of the number of positive outcomes is still at most n·p, so Chebyshev's inequality bounds the chance of getting none by about 1/(n·p).

the conclusion
let's forget about the math. the conclusion is that we have at least a 0.057 chance of getting an output within a factor of 3 of the truth; the probability can be amplified to an arbitrarily high value, but the constant 3 cannot be improved by sheer repetition.

improve the accuracy
is there a way to improve the accuracy? finer-grained filters with thresholds 1, 1+a, (1+a)^2, ..., (1+a)^n?

oh, no
not working. our previous approach works because 2 is a magic number here, and it cannot be improved just by refining the filter thresholds.

another view
only the element with the minimum hash value matters. in expectation, the minimum of D i.i.d. uniform variables on [0, 1] is 1 / (D + 1).

another way
so here comes our next algorithm:
maintain the minimum of the hash values encountered, and output (roughly) one divided by that minimum.

repeat many times, take the median

does it work?
this is a somewhat enhanced version of our first algorithm: it is guaranteed to be no worse than the previous algorithm (up to a constant factor). but does the accuracy scale with the number of repeated instances?

however
no, not if we only take the median: the distribution of the estimate is skewed.

do not give up
repeat in another way: what if we keep the k-th smallest hash value?

and it works!
let ε be the desired relative accuracy and take k on the order of 1/ε². the estimate overshoots (1 + ε)·D only if at least k hash values fall below k / ((1 + ε)·D); each of the D distinct elements does so with probability k / ((1 + ε)·D), so only about k/(1 + ε) such events are expected, and Chebyshev bounds the chance that as many as k occur. the same applies to the other side.

much better distribution
at the expense of increased space, but the space complexity remains modest (growing with 1/ε²) for any given accuracy and failure probability.
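
a hedged sketch of the k-th-smallest-hash estimator; k = 400 (roughly 1/ε² for ε around 5%) and the bookkeeping details are illustrative:

import heapq, random

def kmv_estimate(stream, k=400, seed=0):
    # keep the k smallest distinct hash values; if the k-th smallest is the
    # fraction v of the hash range, output k / v
    rnd = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rnd.randrange(1, p), rnd.randrange(p)
    heap, seen = [], set()          # max-heap (negated) of the k smallest hashes
    for x in stream:
        h = (a * hash(x) + b) % p
        if h in seen:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            seen.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            seen.discard(evicted)
            seen.add(h)
    if len(heap) < k:
        return float(len(heap))     # fewer than k distinct elements: exact count
    return k * p / (-heap[0])       # k divided by the k-th smallest hash fraction

print(round(kmv_estimate(str(i % 20000) for i in range(10**5))))   # roughly 20000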

the end of the story?
what's the intuition behind the k-th smallest number? it is essentially a bitmap with k bins; the two ideas lead to the same algorithm.

what to do
what else do we have to do after devising an algorithm? show that it is optimal

the lower bound
communicate from the past to the present: the memory content is the only message the first half of the stream can send to the second half, so it must contain enough information to distinguish between candidate sets G1 and G2 (feed G1 first; then feeding G1 again versus feeding G2 changes the distinct count by very different amounts).

coding theory
there exists a large (exponentially big) family of subsets of the universe, each with the same cardinality, such that any two of them have only a small fraction of their elements in common. counting argument: consider how many other subsets one subset can rule out.

communication complexity
if you can approximately count the number of distinct elements to arbitrarily good precision, you can tell which set of the family you were fed (feed G1 first, then the unknown set, and watch how much the count grows). results from communication complexity then say that a lot of memory is required.

comments
hash functions are far from perfect; we need a solid basis for the analysis. what about other hash-based data structures?

treap
a treap only requires that the chance of any one node having the smallest weight among n nodes is 1/n. that is not really hashing; /dev/random will do the job.

hashing strings
polynomial hashing: if the computation is done in a field (mod a prime P), the equation hash(A) = hash(B) is a polynomial of degree at most max{len(A), len(B)} in the base, so it has at most that many solutions. choosing P large enough therefore suffices to guarantee that a collection of strings all get different hash values with high probability over a random base; if we require that all sub-strings of a string of length N avoid collisions, P should be even greater, since there are many more pairs.
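
a minimal sketch of polynomial string hashing over a field; the prime P, the random base and the function name poly_hash are illustrative choices:

import random

P = (1 << 61) - 1                               # hash arithmetic is done in a field (mod a prime)
BASE = random.Random(2014).randrange(256, P)    # the base is the random parameter

def poly_hash(s):
    # hash(s) = sum over i of s[i] * BASE**i, all modulo P;
    # hash(A) == hash(B) for A != B is a polynomial equation of degree
    # at most max(len(A), len(B)) in BASE, so only that many bases are "bad"
    h = 0
    for ch in reversed(s):
        h = (h * BASE + ord(ch)) % P
    return h

print(poly_hash("hello"), poly_hash("world"))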

over a ring
what if hashing is done over a ring instead, e.g. mod 2^32 (a common practice)?

counterexample
the Thue–Morse string 01101001100101101001011001101001... collides with its bit-wise inverse for all odd bases when hashing mod a power of two.

counterexample
if two strings collide on the lowest n bits of the modulus, suitably combining them produces a collision on even more bits; iterating this, a string of quite modest length is able to fail a mod 2^64 hash.

more on randomized algorithms
Monte Carlo and Las Vegas examples

polynomial identity testing
is P(x1, ..., xn) ≡ 0? given a polynomial expression of degree at most d, see if it is equal to zero (after expanding).

Monte Carlo
choose a sufficiently large field and randomly sample values for the variables. if the polynomial is non-zero, the chance that it evaluates to zero at the random point is at most (degree) / (field size); this is proved by induction on the number of variables and the degree of the polynomial.
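
a small sketch of the Monte Carlo identity test; f and g are arbitrary callables supplied by the user, and the field size and trial count are illustrative:

import random

def probably_identical(f, g, n_vars, field=(1 << 61) - 1, trials=20):
    # evaluate both expressions at random points of a large field;
    # if f - g is a non-zero polynomial of degree d, one random evaluation
    # is zero with probability at most d / field (Schwartz-Zippel)
    for _ in range(trials):
        point = [random.randrange(field) for _ in range(n_vars)]
        if f(*point) % field != g(*point) % field:
            return False            # a witness: definitely not identical
    return True                     # identical with high probability

# example: (x + y)^2 versus x^2 + 2xy + y^2
print(probably_identical(lambda x, y: (x + y) ** 2,
                         lambda x, y: x * x + 2 * x * y + y * y, 2))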

Monte Carlo
no deterministic polynomial-time algorithm is known yet.

associativity test
given an operator on a set of n elements (an n × n multiplication table), see if it is associative. the brute-force algorithm checks all triples and runs in O(n³).

randomized algorithm
let a_x, b_x, c_x be random numbers in a field, one coefficient per element x. define the formal sums A = Σ a_x·x, B = Σ b_x·x, C = Σ c_x·x, and extend the operator to such sums bilinearly (a convolution driven by the multiplication table).

if the operator is associative, then (A∘B)∘C and A∘(B∘C) are always equal,

and, with high probability over randomly sampled coefficients, the converse holds as well: if the two sides agree on a random sample, the operator is (very likely) associative.

give me a proof
look closely at what we are doing: the difference between the two sides is a polynomial in the random coefficients, so the test is a reduction to polynomial identity testing.
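
a sketch of such a randomized associativity test (in the spirit of the reduction to identity testing); the coefficient field GF(2), the trial count and the function names are assumptions, not the lecture's exact construction:

import random

def is_associative_probably(op, trials=30, seed=0):
    # op is an n*n table with entries in 0..n-1; extend it bilinearly to
    # random formal GF(2) sums A, B, C and compare (A∘B)∘C with A∘(B∘C)
    n = len(op)
    rnd = random.Random(seed)

    def mul(u, v):                   # bilinear extension of the table over GF(2)
        w = [0] * n
        for i in range(n):
            if u[i]:
                for j in range(n):
                    if v[j]:
                        w[op[i][j]] ^= 1
        return w

    for _ in range(trials):
        A, B, C = ([rnd.randint(0, 1) for _ in range(n)] for _ in range(3))
        if mul(mul(A, B), C) != mul(A, mul(B, C)):
            return False             # a certificate of non-associativity
    return True                      # associative with high probability

# addition modulo 3 is associative
print(is_associative_probably([[(i + j) % 3 for j in range(3)] for i in range(3)]))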

and more?
a distributivity test? testing whether a subring is an ideal? whether a subgroup is normal?

median in rectangle
in an n × m matrix of numbers, find one sub-matrix (contiguous rectangle) with the greatest median.
(example 4×4 matrix:)
 1  6  5  7
 2 15  9  4
13 16 10 14
11  3  8 12

baseline
an algorithm based on binary search over the answer.

linear time algorithm
an algorithm that runs in linear time, based on random sampling; a very clever idea.

recall
the minimum of k i.i.d. uniform 0~1 random variables is about 1/(k+1) in expectation. we have an efficient way of filtering out rectangles whose median is smaller than a threshold; we also need a way to quickly handle the small number of remaining rectangles.

compression
processing the remaining rectangles can be done in (nearly) linear time.

compression
use a list to maintain the numbers. during each iteration of the binary search, the list halves.

Las Vegas
randomized global min-cut

min cut
choose an edge with probability proportional to its cost and merge its two endpoints; repeat until only one (super-)edge, i.e. two super-nodes, remains.

analyzing the algorithm
intuitively, this algorithm doesn't seem to have a good chance of success, even if we repeat it multiple times. how do we bound the probability of giving the correct output in a single run?

analyzing the algorithm
look at one particular minimum cut and compute its chance of surviving all the contractions.

key observation
the probability of destroying a fixed cut in one round is proportional to its cost, and the cost of the min-cut is at most the average cost of cutting a single node; so when i nodes remain, the min cut is destroyed with probability at most 2/i.

telescoping product
so, the chance that the cut will survive till the end is at least
(1 − 2/n)(1 − 2/(n−1))···(1 − 2/3) = 2 / (n(n−1)),

so we need to repeat the procedure O(n² log n) times; with O(n²) work per run this leads to an O(n⁴ log n) algorithm.

implementation detail
how to run one instance of the algorithm in O(n²) time? use an adjacency matrix to store the (multi-)graph, maintain the total weight of the edges incident to each node, and sample in two steps: first choose a node, then choose one of its edges.
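
a simple Python sketch of the contraction algorithm on a connected graph; this edge-list version is for illustration only and is not the O(n²) adjacency-matrix implementation described above:

import random

def one_contraction_run(n, edges, rng):
    # one run of random contraction: repeatedly pick an edge with probability
    # proportional to its weight, merge its endpoints, stop at two super-nodes.
    # edges is a list of (u, v, weight) triples over vertices 0..n-1.
    comp = list(range(n))                       # union-find over super-nodes

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x

    alive = n
    while alive > 2:
        live = [(u, v, w) for (u, v, w) in edges if find(u) != find(v)]
        u, v, w = rng.choices(live, weights=[wt for _, _, wt in live])[0]
        comp[find(u)] = find(v)                 # contract the chosen edge
        alive -= 1
    # weight of the cut separating the two remaining super-nodes
    return sum(w for u, v, w in edges if find(u) != find(v))

def min_cut(n, edges, runs=100, seed=0):
    rng = random.Random(seed)
    return min(one_contraction_run(n, edges, rng) for _ in range(runs))

# a 4-cycle with unit weights: its minimum cut has weight 2
print(min_cut(4, [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]))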

beat the baseline
our baseline algorithm is to run a network-flow computation about n times, and even state-of-the-art network-flow algorithms are slow; is there instead a way to improve the randomized algorithm?

heuristic that works
we are more likely to fail during the second half of the execution, when few nodes remain and each contraction destroys the cut with a larger probability.

recursive
when the graph's size has shrunk by a factor of about √2, run two independent instances of the rest of the algorithm and keep the better of the two answers.

the overall complexity
T(n) = 2·T(n/√2) + O(n²) = O(n² log n) for one recursive run, whose success probability is Ω(1/log n); O(log² n) repetitions give roughly an O(n² log³ n) algorithm overall.
in general

yet another example
the MST linear algorithm

MST
idea: what if we randomly sample the edges? recursive application of the algorithm. and use MST of the subgraph to prune remaining edges

after pruning
consider Kruskal's algorithm on the sampled subgraph: add edges in order of weight from lightest to heaviest. only n − 1 edges will actually be added, while each considered edge is discarded with probability 0.5, so only O(n) edges survive the pruning in expectation.

make it work
after pruning, we have only O(n) edges left. the pruning step can be implemented in linear time using some data structures (MST verification); more work is needed to make it really work: how do we reduce the number of nodes?

Boruvka's step
find the lightest outgoing edge of each node, add all of these edges, and contract the resulting components; this reduces the number of nodes by at least half.
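
a sketch of one Boruvka step; the distinct-weights assumption and the function names are illustrative:

def boruvka_step(n, edges):
    # every node picks its lightest incident edge; the picked edges join the MST
    # and their endpoints are contracted. edges are (weight, u, v) triples and
    # all weights are assumed distinct so that the picked edges form no cycle.
    best = [None] * n
    for w, u, v in edges:
        for a in (u, v):
            if best[a] is None or w < best[a][0]:
                best[a] = (w, u, v)
    picked = {e for e in best if e is not None}

    parent = list(range(n))                      # union-find for the contraction
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for w, u, v in picked:
        parent[find(u)] = find(v)

    roots = sorted({find(i) for i in range(n)})
    new_id = {r: i for i, r in enumerate(roots)}
    new_edges = [(w, new_id[find(u)], new_id[find(v)])
                 for w, u, v in edges if find(u) != find(v)]
    return len(roots), new_edges, picked         # the node count at least halves

# one step on the path 0-1-2-3 picks all three edges and contracts to one node
print(boruvka_step(4, [(5, 0, 1), (3, 1, 2), (7, 2, 3)]))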

Optimization
now it is time to turn to another topic

(numeric) optimization
maximize an objective function subject to some constraints

motivation
find a direction along which two sets of points are furthest separated: the separating hyperplane with the largest margin.

quadratic programming
minimize a quadratic objective subject to linear constraints; this is a convex problem.

general attack
gradient descent

general attack
an iterative algorithm: repeatedly move the current point a small step along the gradient direction, x ← x + η·∇f(x) (or against it, when minimizing).

some guarantees
when the algorithm terminates, the gradient is zero, hence a stationary point (hopefully a local optimum) is reached. if the step size η is small enough, each step is guaranteed to improve the objective function.

practical issues
how to choose the step size? how fast will it converge? how to handle constraints?

step size
it should be large enough that progress is made, and small enough to avoid divergence.

step size
mathematically, setting the step size inversely proportional to the current iteration count will do the job; in practice, extensive hand-tuning is required.

step size and speed
consider minimizing a quadratic whose two coordinates have very different curvatures, starting from a generic point: the step size has to stay below 0.02 (set by the steep coordinate) or that coordinate will blow up, but convergence along the flat coordinate will then be very slow.
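
an illustrative stand-in for the example on this slide (the exact objective was not preserved): gradient descent on f(x, y) = x² + 50·y², for which the divergence threshold is exactly 0.02:

def grad_descent(step, iters=500):
    # minimize f(x, y) = x^2 + 50*y^2, whose gradient is (2x, 100y);
    # the y-coordinate diverges unless step < 2/100 = 0.02, yet with such a
    # small step x shrinks only by a factor (1 - 2*step) per iteration
    x, y = 1.0, 1.0
    for _ in range(iters):
        x, y = x - step * 2 * x, y - step * 100 * y
    return x, y

print(grad_descent(0.021))   # step too large: y blows up
print(grad_descent(0.001))   # stable, but x is still far from the optimum after 500 steps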

step size and speed
the eigenvalue-gap problem: the ratio between the largest and smallest curvatures dictates both the stable step size and the convergence speed.

second order methods
use the (estimated) Hessian to find a better direction: Newton-type methods, conjugate gradient, line search.

constraints
descend, then project each iterate back onto the feasible set.

duality
how to get rid of the constraints? Lagrange multipliers and duality.

duality
meditate on the following fact: the maximum margin is the same as the minimum distance between the two polytopes (the convex hulls of the two point sets).

Lagrange multipliers
if x* is the solution to the original constrained problem, then there is a multiplier λ such that x* is also the maximizer of the Lagrangian (the objective plus λ times the constraints); consider how the gradients of the objective and of the active constraints must line up.

dual of the program
minimize the distance between a point of one polytope and a point of the other, subject to: the combination coefficients α and β are non-negative, sum(α) = 1, and sum(β) = 1. these constraints are easier to handle.

coordinate descent
if all but two of the variables are fixed, the optimization problem becomes trivial, so optimize two variables at a time.
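
a simplified sketch of coordinate descent: the lecture's SMO-style solver fixes all but two dual variables, while this toy version fixes all but one coordinate of an unconstrained quadratic (an assumption made for brevity):

def coordinate_descent(A, b, iters=100):
    # minimize 0.5 * x'Ax - b'x by freezing all but one coordinate and
    # solving for that coordinate exactly, sweeping over coordinates repeatedly
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]          # exact one-dimensional optimum
    return x

# the minimizer of 0.5*x'Ax - b'x solves Ax = b, here [1/11, 7/11]
print(coordinate_descent([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))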

coordinate descent
active set

another issue
before applying coordinate descent to all optimization problems, be aware of its pitfall: it can get stuck in a local minimum.

neural networks
the most prominent showcase of gradient descent algorithms

what's a neural network
a non-linear parametrized function family with a layered structure

neuron
a non-linear computing unit: y = f(x1*w1 + x2*w2 + x3*w3), where x1, x2, x3 are the inputs, w1, w2, w3 the weights, and f a non-linear activation function.

training neural networks
given some input-output pairs, find proper values for the weights such that the network's output is as close as possible to the desired output.

a (numeric) optimization problem

very hard
full of local minima due to the non-linearity; NP-hard for a wide range of networks; virtually impossible to compute the optimal solution.

gradient descent
the way neural networks are trained in practice: randomly choose the initial weights, then run gradient descent.

back propagation
use the chain rule to compute the derivatives, propagating ∂L/∂y backwards into ∂L/∂w and ∂L/∂x, layer by layer.
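
a tiny sketch of the chain rule on the single neuron above, assuming a sigmoid activation and a squared loss (both illustrative choices):

import math

def neuron_forward_backward(x, w, target):
    # forward pass of y = f(w1*x1 + w2*x2 + w3*x3) with f = sigmoid,
    # squared loss L = 0.5*(y - target)^2, then the chain rule backwards
    s = sum(wi * xi for wi, xi in zip(w, x))        # weighted sum
    y = 1.0 / (1.0 + math.exp(-s))                  # activation
    L = 0.5 * (y - target) ** 2                     # loss

    dL_dy = y - target                              # dL/dy
    dy_ds = y * (1.0 - y)                           # derivative of the sigmoid
    dL_dw = [dL_dy * dy_ds * xi for xi in x]        # dL/dw_i, used to update the weights
    dL_dx = [dL_dy * dy_ds * wi for wi in w]        # dL/dx_i, passed to the layer below
    return L, dL_dw, dL_dx

print(neuron_forward_backward([1.0, 2.0, 3.0], [0.1, -0.2, 0.05], target=1.0))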

very hard
(hopefully) it converges to a local minimum; good enough in practice; the computation is massive.

google's style
a neural network with millions of neurons and one billion trainable parameters; optimization is done on a 1000-machine cluster (16000 cores in total); it would have run for 10 years on one PC.

optimization
(parallelized) gradient descent. the result is surely not (or even close to) a (local) minimum, but it has been good enough to give surprising results.

the idea of machine learning
the traditional view of supervised learning: training samples (x, y), a model with parameters w, predictions (x', y') on new data.

example
MNIST (handwritten digit recognition)

the theorem
the most important result in statistical learning: given sufficient training samples, under some conditions (tons of technical details omitted), if the model fits the training samples well, it is mathematically guaranteed that the model will give good predictions on unseen samples.

a tangible example
SVM

support vector

theorem
let s be the number of support vectors; then (roughly) the expected error on an unseen sample is at most s divided by the number of training samples.

how to reason about unseen examples?

unseen example
assume the samples are drawn from a (fixed) distribution; only the support vectors mattered for the trained model.

support vector

leave one out
assume the model is trained on all but one (randomly chosen) training sample; its error on the held-out sample is an unbiased estimator of the true error. the model will not change so long as the left-out sample is not a support vector, so the leave-one-out error is at most (#support vectors) / (#samples).

reinforcement learning
a neural network that plays games: backgammon, on par with human champions.

learning algorithm
TD-lambda: the program plays against itself and learns from its own games.

game playing
estimate the probability of winning for each arrangement of the board with a parametrized function f(x, w); search one step ahead and greedily choose the best action.

TD-lambda
the loss function asks for two things: f should be in line with the final result of the game, and f should not change too rapidly from one position to the next during the game.
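
a very rough sketch of a TD(lambda) update, assuming a linear value function for brevity (TD-Gammon actually used a neural network); all constants and names here are illustrative:

def td_lambda_update(w, trace, features, next_features, reward,
                     alpha=0.01, gamma=1.0, lam=0.7):
    # one TD(lambda) step for a linear value function V(s) = sum_i w[i]*features[i];
    # `trace` is the eligibility trace carried from move to move
    v = sum(wi * fi for wi, fi in zip(w, features))
    v_next = sum(wi * fi for wi, fi in zip(w, next_features))
    delta = reward + gamma * v_next - v                      # temporal-difference error
    trace = [gamma * lam * e + f for e, f in zip(trace, features)]
    w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return w, trace

w, trace = [0.0] * 3, [0.0] * 3
w, trace = td_lambda_update(w, trace, [1.0, 0.5, 0.0], [0.8, 0.2, 0.1], reward=1.0)
print(w)   # after a win (reward 1), weights move towards the features that were active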

reinforcement learning
play the game against itself, and run gradient descent on the parameters of f. Gerald Tesauro, 1994, "TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play": human master performance after 1,500,000 rounds of training.

neural networks at work
facial keypoint detection trained on tens of thousands of images

neural networks at work
used in Face++, a leading face recognition system built by Megvii

email career@megvii.com if you are interested in an internship

