Incident summary
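Our services access Redis through JedisPool. When a Redis instance becomes unreachable, it is put on a blacklist. A background thread periodically scans the blacklist and restores any instance that has become reachable again. Each check creates a new JedisPool and tests reachability by calling jedisPool.getResource().close().
Because the check is periodic, every run creates another JedisPool, and these pools were configured with minIdle set to 1. That is the hidden hazard. If Redis stays unreachable for a long time, many JedisPool instances pile up. Each JedisPool has a periodic background eviction task: it destroys connections that have been idle for too long, and then creates new ones so that the pool still holds at least minIdle connections. Once Redis recovers, all of these pools start creating connections, and the total reaches Redis's maximum connection limit. Normal requests that open a connection then receive the error ERR max number of clients reached from the server and throw an exception.
Note that even though the client gets an error, from the client's point of view the connection was established: the client sent its request to the server, and only when reading the server's reply did it receive ERR max number of clients reached. On the server side, Redis directly closes any connection that would push it over the maximum connection limit.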
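To get a feel for the scale, the pile-up described above can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope model: the class name `LeakEstimate`, its parameters, and the numbers used afterwards are hypothetical, and it assumes that every probe leaves one pool behind that is never closed and that each such pool ends up holding exactly minIdle connections.

```java
/** Rough estimate of the connections held by probe pools that were never closed. */
final class LeakEstimate {
    private LeakEstimate() {
    }

    /**
     * @param outageSeconds        how long one instance stayed unreachable (hypothetical input)
     * @param probeIntervalSeconds period of the blacklist scan (hypothetical input)
     * @param minIdle              minIdle configured on each probe pool (1 in this incident)
     * @return connections the leftover pools try to keep open once Redis is back
     */
    static long leakedConnections(long outageSeconds, long probeIntervalSeconds, int minIdle) {
        if (probeIntervalSeconds <= 0) {
            throw new IllegalArgumentException("probeIntervalSeconds must be positive");
        }
        // Assumption: one new pool per probe, and none of them is ever closed.
        long leakedPools = outageSeconds / probeIntervalSeconds;
        return leakedPools * minIdle;
    }
}
```

With made-up inputs such as a one-hour outage probed every 5 seconds, `LeakEstimate.leakedConnections(3600, 5, 1)` gives 720 connections for a single blacklisted instance on a single client process. Multiplied by the number of client processes, a figure like that can plausibly reach the server's connection limit.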
Caused by: redis.clients.jedis.exceptions.JedisDataException: ERR max number of clients reached
at redis.clients.jedis.Protocol.processError(Protocol.java:130)
at redis.clients.jedis.Protocol.process(Protocol.java:164)
at redis.clients.jedis.Protocol.read(Protocol.java:218)
at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:341)
at redis.clients.jedis.Connection.getBinaryMultiBulkReply(Connection.java:277)
at redis.clients.jedis.BinaryJedis.mget(BinaryJedis.java:606)
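One open question:
Why does the log also contain requests that failed on write? Shouldn't the connections that were established normally still be able to write data? The connections that hit the max-clients error have already been reclaimed, so the client cannot be using them again. Does the server have some logic that cleans up connections?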
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at redis.clients.util.RedisOutputStream.flushBuffer(RedisOutputStream.java:52)
at redis.clients.util.RedisOutputStream.flush(RedisOutputStream.java:216)
at redis.clients.jedis.Connection.flush(Connection.java:332)
... 30 more
The eviction thread
Creating the eviction thread
/**
 * Create a new <code>GenericObjectPool</code> using a specific
 * configuration.
 *
 * @param factory The object factory to be used to create object instances
 *                used by this pool
 * @param config  The configuration to use for this pool instance. The
 *                configuration is used by value. Subsequent changes to
 *                the configuration object will not be reflected in the
 *                pool.
 */
public GenericObjectPool(PooledObjectFactory<T> factory,
        GenericObjectPoolConfig config) {
    // Remember the earlier JMX problem?
    super(config, ONAME_BASE, config.getJmxNamePrefix());
    if (factory == null) {
        jmxUnregister(); // tidy up
        throw new IllegalArgumentException("factory may not be null");
    }
    this.factory = factory;
    idleObjects = new LinkedBlockingDeque<PooledObject<T>>(config.getFairness());
    setConfig(config);
    // The eviction thread is started here
    startEvictor(getTimeBetweenEvictionRunsMillis());
}
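As you can see, the eviction thread is created and started in the constructor. In other words, every new JedisPool comes with its own eviction task that runs periodically.
Recall that this same constructor also registers the pool with JMX, which caused a separate problem: new JedisPool can be very slow.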
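Since every pool brings its own eviction task, one way to sidestep the problem is a reachability probe that does not create a pool at all. The following is a minimal sketch using only the JDK. The class `TcpProbe` and its parameters are hypothetical and are not the check used in the incident. It only verifies that a TCP connection can be opened, so it cannot tell whether Redis would accept commands (for example, it would still succeed when the server is at its client limit).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical pool-free reachability probe: opens one TCP connection and closes it. */
final class TcpProbe {
    private TcpProbe() {
    }

    static boolean isReachable(String host, int port, int timeoutMillis) {
        // try-with-resources closes the socket whether or not the connect succeeds.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, unknown host: treat all as unreachable.
            return false;
        }
    }
}
```

If the pool-based check is kept instead, the same idea applies: the temporary pool would need to be closed as soon as the probe finishes, so that its eviction task does not outlive the check.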
How the eviction thread is implemented
/**
 * <p>Starts the evictor with the given delay. If there is an evictor
 * running when this method is called, it is stopped and replaced with a
 * new evictor with the specified delay.</p>
 *
 * <p>This method needs to be final, since it is called from a constructor.
 * See POOL-195.</p>
 *
 * @param delay time in milliseconds before start and between eviction runs
 */
final void startEvictor(long delay) {
    synchronized (evictionLock) {
        if (null != evictor) {
            EvictionTimer.cancel(evictor);
            evictor = null;
            evictionIterator = null;
        }
        if (delay > 0) {
            evictor = new Evictor();
            EvictionTimer.schedule(evictor, delay, delay);
        }
    }
}
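The comment is clear. Two points:
- If an eviction task already exists, it is cancelled.
  - In that case the delay argument is usually -1, which only cancels the eviction task without starting a new one.
  - Think about it: have you ever cancelled it in your own code? If not, what goes wrong?
- If delay is positive, a new eviction task is scheduled at that period.
  - With JedisPoolConfig, the period defaults to 30s.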
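The cancel-or-schedule behaviour in the list above, and what happens when nobody cancels, can be reproduced with a toy model built on the JDK scheduler. This is a sketch rather than the commons-pool implementation: `ModelPool`, `TIMER`, and `scheduledTasks()` are hypothetical names, the eviction task is an empty lambda, and `close()` is assumed to cancel by calling `startEvictor(-1)` as the note above suggests.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Toy model of a pool that schedules a periodic task from its constructor. */
final class ModelPool implements AutoCloseable {
    // Shared timer, standing in for the EvictionTimer used by the real pool.
    static final ScheduledThreadPoolExecutor TIMER = newTimer();

    private final Object evictionLock = new Object();
    private ScheduledFuture<?> evictor;

    ModelPool(long delayMillis) {
        startEvictor(delayMillis); // same pattern as the constructor shown earlier
    }

    void startEvictor(long delayMillis) {
        synchronized (evictionLock) {
            if (evictor != null) {
                evictor.cancel(false); // an existing task is always cancelled first
                evictor = null;
            }
            if (delayMillis > 0) {     // a non-positive delay means "cancel only"
                evictor = TIMER.scheduleWithFixedDelay(
                        () -> { }, delayMillis, delayMillis, TimeUnit.MILLISECONDS);
            }
        }
    }

    @Override
    public void close() {
        startEvictor(-1L); // assumption: closing the pool is what cancels the task
    }

    /** Number of periodic tasks still registered with the shared timer. */
    static int scheduledTasks() {
        return TIMER.getQueue().size();
    }

    private static ScheduledThreadPoolExecutor newTimer() {
        ScheduledThreadPoolExecutor timer = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "model-evictor");
            t.setDaemon(true); // do not keep the JVM alive for the model
            return t;
        });
        timer.setRemoveOnCancelPolicy(true); // cancelled tasks leave the queue at once
        return timer;
    }
}
```

In this model, creating five pools with a 30000 ms delay leaves five tasks registered with the timer, and they stay there until each pool is closed. A probe that creates a pool per check and never closes it accumulates tasks in the same way.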
About the eviction period
public static final long DEFAULT_TIME_BETWEEN_EVICTION_RUNS_MILLIS = -1L;

private volatile long timeBetweenEvictionRunsMillis =
        BaseObjectPoolConfig.DEFAULT_TIME_BETWEEN_EVICTION_RUNS_MILLIS;

/**
 * Returns the number of milliseconds to sleep between runs of the idle
 * object evictor thread. When non-positive, no idle object evictor thread
 * will be run.
 *
 * @return number of milliseconds to sleep between evictor runs
 *
 * @see #setTimeBetweenEvictionRunsMillis
 */
public final long getTimeBetweenEvictionRunsMillis() {
    return timeBetweenEvictionRunsMillis;
}
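The comment here is also clear: if the value is non-positive (negative or 0), no idle-object eviction thread is created.
The default shown above is -1, so the eviction thread is off by default. JedisPoolConfig, however, supplies its own defaults for JedisPool: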
public class JedisPoolConfig extends GenericObjectPoolConfig {
    public JedisPoolConfig() {
        // defaults to make your life with connection pool easier :)
        setTestWhileIdle(true);
        setMinEvictableIdleTimeMillis(60000);
        setTimeBetweenEvictionRunsMillis(30000);
        setNumTestsPerEvictionRun(-1);
    }
}
The comment above says these defaults make your life with the connection pool easier. But whose life is that, the connection pool's or the coder's?
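To see what these defaults mean for a pool that nobody uses, here is a simplified simulation of the eviction cycle: each run first destroys connections that have been idle longer than the threshold, then creates connections until minIdle are idle again. `EvictionRunModel` and its methods are hypothetical names. The model ignores testWhileIdle validation and everything else the real evictor does, so treat it as an approximation of the mechanism rather than the library's behaviour.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of one eviction run: evict idle connections, then top up to minIdle. */
final class EvictionRunModel {
    private final long minEvictableIdleMillis;
    private final int minIdle;
    // Time at which each currently idle connection was created.
    private final Deque<Long> idleSince = new ArrayDeque<>();
    private int created;

    EvictionRunModel(long minEvictableIdleMillis, int minIdle) {
        this.minEvictableIdleMillis = minEvictableIdleMillis;
        this.minIdle = minIdle;
    }

    /** One evictor run at time nowMillis. */
    void run(long nowMillis) {
        // Step 1: destroy connections that have been idle longer than the threshold.
        idleSince.removeIf(since -> nowMillis - since > minEvictableIdleMillis);
        // Step 2: create connections until minIdle of them are idle again.
        while (idleSince.size() < minIdle) {
            idleSince.add(nowMillis);
            created++;
        }
    }

    int created() {
        return created;
    }

    /** Runs the evictor every periodMillis up to and including untilMillis. */
    static int simulate(long periodMillis, long minEvictableIdleMillis,
                        int minIdle, long untilMillis) {
        EvictionRunModel pool = new EvictionRunModel(minEvictableIdleMillis, minIdle);
        for (long t = periodMillis; t <= untilMillis; t += periodMillis) {
            pool.run(t);
        }
        return pool.created();
    }
}
```

In this model, `EvictionRunModel.simulate(30_000, 60_000, 1, 600_000)` creates 7 connections over ten minutes for a single unused pool: the first at the 30-second run, then one replacement every 90 seconds. With minIdle set to 0 the same pool would create none, which is why the minIdle setting on the probe pools matters so much here.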