Jim Walker [Tue, 17 May 2016 16:41:10 +0000 (17:41 +0100)]
MB-19635: Initialise failovers correctly from 2.5.x vbstate
When loading a vb file, don't force the failover table data
to be ("[{\"id\":0,\"seq\":0}]"); if the file doesn't contain
any data.
Change-Id: I41673bf848fcbab9b616edec5c7fd2ab9a3ddd6b
Reviewed-on: http://review.couchbase.org/64155
Well-Formed: buildbot <build@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Jim Walker [Mon, 16 May 2016 15:24:35 +0000 (16:24 +0100)]
MB-19503: Fix ConnMap so notifications don't go missing [2]
Previous patch[1] cleared the isNotificationScheduled flag
at the wrong place and meant things could then never
again get scheduled.
This is because we only cleared the flag if tp->isPaused()
yet we still pop the notification from the queue, so we
left tp->isNotificationScheduled yet the queue is empty.
Now no more notifications will ever get scheduled!
So we need to clear the notification scheduled boolean
unconditionally of the other flags on tp.
[1] - Commit
0856e0b3d3fc6
Change-Id: I11c9fd72f4b35102328022bd4c334a9e09a61cd0
Reviewed-on: http://review.couchbase.org/64072
Well-Formed: buildbot <build@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Reviewed-by: Will Gardner <will.gardner@couchbase.com>
Jim Walker [Wed, 11 May 2016 15:26:47 +0000 (16:26 +0100)]
MB-19503: Fix ConnMap so notifications don't go missing.
There's a reliance on an atomic bool and cmpxchg to
prevent the producer of notification from queueing
himself if he's already got a notification scheduled.
There's an ordering issue though where the producers code
can execute, see the flag is true and not bother queueing
a notification, yet the consumer side is about to clear the
flag and finish. The notification thus never gets queued
and the producer side thinks he will get a notification.
In my terminology:
producer is ConnMap::notifyPausedConnection
consumer is ConnMap::notifyAllPausedConnections
Change-Id: Id324b6369c5ee3a6b6758a7a93e017a4ff7c4a78
Reviewed-on: http://review.couchbase.org/64025
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Manu Dhundi [Fri, 15 Apr 2016 20:54:42 +0000 (13:54 -0700)]
MB-16656: Send snapshotEnd as highSeqno for replica vb in GET_ALL_VB_SEQNOS call
For replica vbucket we must send snapshotEnd received in the last snapshotMarker
as the high seqno. Sending lastClosedChkSeqno can cause problems for view engine
which builds an index from replica vbucket.
Previously this was sent correctly in seqno stats, now adding it for
GET_ALL_VB_SEQNOS as well.
Change-Id: Ifad267521184c4976e1cb194e6814b56963298b0
Reviewed-on: http://review.couchbase.org/62925
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Fri, 8 Apr 2016 03:03:22 +0000 (20:03 -0700)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
|\
| * 771859f MB-19093 [BP]: [ActiveStream] Address potential lock-inversion scenarios
| * 6811030 MB-19075: Remove printing of empty string in CouchKVStore::getMulti()
Change-Id: I06717dcaad2b764582ebe378a6f9023f7ca76d88
abhinavdangeti [Fri, 15 Jan 2016 16:36:52 +0000 (08:36 -0800)]
MB-19093 [BP]: [ActiveStream] Address potential lock-inversion scenarios
Acquire vbucket state lock only when really necessary
in the ActiveStream context. Also avoid acquiring one
lock within the other wherever possible in the ActiveStream
context again.
This change is to avert potential deadlocks due to
lock inversion that will be induced by upcoming changes,
here are the scenarios:
(i) Locking between streamsMutex, streamMutex and
vb_stateLock in the set operation - handle
response scenario.
(http://factory.couchbase.com/job/ep-engine-threadsanitizer-master/1225/console)
(ii) In case of a set operation, vb_stateLock is
acquired and then streamMutex is acquired for
notification. During markDiskSnapshot, the
streamMutex is acquired before the vb_stateLock
lock is acquired.
(http://factory.couchbase.com/job/ep-engine-threadsanitizer-master/1268/console)
(Already reviewed at: http://review.couchbase.org/58557)
Change-Id: I5e5a3e2cc5ba9ae17090e1a3ee4bde100d305f1c
Reviewed-on: http://review.couchbase.org/62498
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Sriram Ganesan [Fri, 1 Apr 2016 23:12:10 +0000 (16:12 -0700)]
MB-19075: Remove printing of empty string in CouchKVStore::getMulti()
In case of an error in opening a file, an error message is logged.
But the string that is supposed to hold the name of the file is
not populated, thus resulting in an empty string getting printed.
Remove the string from printed as openDB already prints the name
of the file in case of an open failure.
Change-Id: Ife3aec8381ead4f2e0b84c921a3781efa39a2126
Reviewed-on: http://review.couchbase.org/62325
Well-Formed: buildbot <build@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Sriram Ganesan [Tue, 8 Mar 2016 22:08:50 +0000 (14:08 -0800)]
MB-18476: Handle write failures more gracefully in the mutation log
Log and error message in case of a write failure and remove any unnecessary
asserts in that code path
Change-Id: I50b7e4de4d414e21bf00404a22863baff06c0f4f
Reviewed-on: http://review.couchbase.org/61116
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Dave Rigby [Thu, 3 Mar 2016 15:03:27 +0000 (15:03 +0000)]
MB-17517: Testcase for invalid max_cas in _local/vbstate
Unit test for fe67145 - verify that if a vbucket is created with a max
CAS of -1, upon re-reading from disk it is detected and fixed.
Change-Id: Ifdf3e642589f93191786ac55c1cf02207159f657
Reviewed-on: http://review.couchbase.org/60858
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Jim Walker <jim@couchbase.com>
Dave Rigby [Wed, 2 Mar 2016 16:56:40 +0000 (16:56 +0000)]
MB-17517: Don't create TAP_MUTATIONS with CAS of -1
In TapProducer::getNextItem(), a TAP mutation is generated by asking
EPStore to get() an Item for the given key. However, if the key in
question is locked (for example, if there was a SET followed by a LOCK
operation), then get() returns an Item with a CAS of -1. This is
normally only intended for end-users (we 'hide' the CAS of locked
items so only the client which locked it, and has the correct CAS can
mutate it). However we incorrectly end up creating a TAP_MUTATION
packet with a CAS of -1; corrupting the data sent to the TAP consumer
(replica, backup, etc).
Fix by adding a new option to get() to make the CAS 'hiding' optional,
and choose to not hide it for the TapProducer.
Change-Id: I4e9f4963e77437c9dc5a9ebb482c727e8ef4beb7
Reviewed-on: http://review.couchbase.org/60801
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Dave Rigby [Wed, 2 Mar 2016 16:43:41 +0000 (16:43 +0000)]
[BP] MB-17517: Check for and disard an invalid max_cas in _local/vbstate
If there is an invalid max CAS value on disk (in _local/vbstate),
discard it (resetting it to zero). Instead will need to rebuild the
max CAS value from the items we load from the file.
Change-Id: Ibe283160ab0b477b1a2a91985036269ce1d590af
Reviewed-on: http://review.couchbase.org/60795
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Chiyoung Seo [Tue, 1 Mar 2016 17:59:44 +0000 (17:59 +0000)]
Merge "Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'" into sherlock
Sriram Ganesan [Sat, 27 Feb 2016 02:30:02 +0000 (18:30 -0800)]
MB-17024: Add more logs during bucket deletion
Add more logs and change certain logs to WARNING level while these
tasks are stopped during bucket deletion
Change-Id: Ia2ada9d8525cec4c3e50ed087d2d052a05663873
Reviewed-on: http://review.couchbase.org/60596
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Jim Walker [Tue, 1 Mar 2016 12:34:11 +0000 (12:34 +0000)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
* couchbase/3.0.x:
MB-17889: No notify from streamRequest
MB-17889: Address view_query latency regression
Change-Id: Ib1a5c25fccdfddc85e6072ec1d6831569c84128f
Jim Walker [Mon, 22 Feb 2016 21:01:09 +0000 (21:01 +0000)]
MB-17889: No notify from streamRequest
The 3.1.3 (i.e. before the DCP churn) didn't ever notify from
streamRequest, so that is reverted and helps to bring
view_query latency down (view engine is constantly creating
streams).
A second tweak is to not call stream->next whilst holding
the streamMutex. This can block streamRequest again
affecting the view-engine's DCP stream requests.
Change-Id: I5b57fd7998003251fb32897f37c8a2f15f687a13
Reviewed-on: http://review.couchbase.org/60352
Reviewed-by: Manu Dhundi <manu@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Jim Walker [Wed, 10 Feb 2016 16:33:10 +0000 (16:33 +0000)]
MB-17889: Address view_query latency regression
Some of the stale=false query latency tests show a 2x increase
in the 80 and 95 percentile. This is observed when moving from
3.1.4 1835 to 1836.
The prime suspect is that the snapshot yield parameter is now
stalling the view engine's DCP data.
In the change from 1835 to 1836 this value was tuned based upon
queue lengths observed during rebalance, moving from 10 to 256.
It could be that the task now runs for longer periods blocking
DCP backfill from running, and view engine drives many backfills
due to the way it frequently closes and opens streams.
Note that DCP Backfill and the snapshot task both share the same
task type, so can block each other.
This patch moves this config value back to 10 (as it was in 1835).
The view-query latency is very difficult to relibably reproduce, observe
and tune, but there is evidence (a trend) that with this config value
at 10, the performance (view latency) is improved.
Some small scale rebalance tests (3 node cluster, swap 1 node for 1)
showed that with 10 rebalance was not adversly affected, but it's a risk.
A latency comparison is attached to the MB which hints that
the latency is better.
https://issues.couchbase.com/secure/attachment/29513/29513_benchmark.png
Change-Id: I6ecf8ff950f77638eb03e4fedaefb700cf945d54
Reviewed-on: http://review.couchbase.org/59750
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Manu Dhundi [Thu, 18 Feb 2016 20:04:11 +0000 (12:04 -0800)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
*
|\
| * 61ab952 2016-02-18 | MB-17889: DCP producer can leave an operation behind
Change-Id: Iad7e13aef50018432bedc17f776984c7a21a4cb4
Jim Walker [Thu, 18 Feb 2016 15:14:28 +0000 (15:14 +0000)]
MB-17889: DCP producer can leave an operation behind
In low-traffic setups there's a case in DcpProducer::getNextItem
where the function exits and pauses the task, yet an item was
waiting to be sent.
getNextItem() looked like
1. setpaused=false;
2. while(ready.pop(vbucket)) {
3. process(vbucket);
4. }
5. setpaused=true;
6. return NULL;
The notifier
a. if(ready.pushUnique(vbucket))
b. wakeupIfPaused;
If a and b execute between 4 and 5, then the producer will sleep
and not process the vbucket until the next wakeup (which maybe never).
This is not good, the first operation will have a long latency before
it can be seen on DCP. As long as it takes for the second operation.
If a and b occur between 5 and 6, that's fine, wakupIfPaused will re-wake
the producer.
a,b,5,6 is bad
5,a,6,b is ok
5,6,a,b is ok
5,a,b,6 is ok
5,6,a,b is ok
...
The fix.
getNextItem()
0. do {
1. setpaused=false;
2. while(ready.pop(vbucket)) {
3. process(vbucket);
4. }
5. setpaused=true;
6. } while(!ready.empty());
7. return NULL;
Now if ab occurs after 4, but before 5, it's ok as 6 will now consume
the vbucket.
5,a,b,6 is ok, as 6 will loop and consume
5,a,6,b is ok, " " " " "
6,a,b,7 is ok, paused is true (5), b will wake the task
Change-Id: Ib412a85ee10de0e2a2ca4116d0cc85bbad538da2
Reviewed-on: http://review.couchbase.org/60196
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Wed, 17 Feb 2016 18:10:54 +0000 (10:10 -0800)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
|\
| * 37d53b1 MB-18171: Break cyclic reference between ActiveStream & ChkptProcesser
| * 928ba39 MB-17766: Set maxNumAuxIO in stream_test to zero
| * 9ac7fd0 [BP] MB-17766: Fix intermittant stream_test failure
Change-Id: I5a01bbf360bd5e59a91ae8400f3de14cc50f5911
abhinavdangeti [Tue, 16 Feb 2016 18:49:09 +0000 (10:49 -0800)]
MB-18171: Break cyclic reference between ActiveStream & ChkptProcesser
Removing circular dependency between ActiveStream and
ActiveStreamCheckpointProcesserTask where each holds a reference
to the other causing a memory leak during shutdown.
Also explicitly clear the queues of checkpointProcessor task upon
disconnection of the DcpProducer, so as to remove a cyclic reference
between DcpProducer, ActiveStream, and ActiveStreamCheckpointProcesserTask.
Change-Id: Ifac03a40132431476a6b5000725ce972068b47f4
Reviewed-on: http://review.couchbase.org/60060
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Tue, 16 Feb 2016 19:12:53 +0000 (11:12 -0800)]
MB-17766: Set maxNumAuxIO in stream_test to zero
Setting maxNumAuxIO to zero will ensure that the producer's
ActiveStreamCheckpointProcesserTask will never run causing
unexpected results in the test context.
Change-Id: I5e7f4b18b1b72af1f99e83cadc5ee979dbcd4cae
Reviewed-on: http://review.couchbase.org/60059
Tested-by: buildbot <build@couchbase.com>
Well-Formed: Hari Kodungallur <hari.kodungallur@couchbase.com>
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Dave Rigby [Tue, 16 Feb 2016 12:54:38 +0000 (12:54 +0000)]
[BP] MB-17766: Fix intermittant stream_test failure
Address two issues:
1) end sequence numbers were incorrect, which could result in not
having any items in our cursor.
2) Don't check CheckpointMamager::registerCursor() return falue, we
don't actually care if any other cursors are already registered for
a given checkpoint (persistence cursor sometimes registers before
us).
Change-Id: I1145d5fb61c0c12f019154c979afdd50b4060509
Reviewed-on: http://review.couchbase.org/60058
Tested-by: buildbot <build@couchbase.com>
Well-Formed: Hari Kodungallur <hari.kodungallur@couchbase.com>
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Fri, 12 Feb 2016 17:13:58 +0000 (09:13 -0800)]
Merge branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
|\
| * b84d09d MB-17766: Regression test that checks for race during takeover
| * ba305c4 MB-17766: Incorrect ordering of messages during ActiveStream's takeover-send phase
| * 4f39683 MB-17766: Avoid copy overhead of std::deque in getOutstandingItems
| * e3f4855 MB-17766: Refactor nextCheckpointItemTask to allow testing
Change-Id: I57a2aa37abc4ab60f09648bd7b02740b4bf933e6
abhinavdangeti [Fri, 12 Feb 2016 16:54:56 +0000 (08:54 -0800)]
Merge branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
|\
| * 0da7d42 MB-17885: Address compilation errors in ep_testsuite.cc
| * b7ee24c MB-17885: Update flow control bytesSent correctly on DCP producer
Change-Id: I70cda64395781a433a8e40720bdc5c75f5d0e3c2
Dave Rigby [Tue, 9 Feb 2016 18:21:25 +0000 (18:21 +0000)]
MB-17766: Regression test that checks for race during takeover
Module test: ep-engine_stream_test
Change-Id: I8e11722b1ed1029c8b969dcb88000c5903fbb0ca
Reviewed-on: http://review.couchbase.org/59666
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Tue, 9 Feb 2016 20:18:01 +0000 (12:18 -0800)]
MB-17766: Incorrect ordering of messages during ActiveStream's takeover-send phase
A race between the step() and ActiveStreamCheckpointProcessorTask
can cause mutations to be queued into the readyQ after the
setVBucketState(active) message.
Here's the scenario in chronology (T: Front-end thread, BT: IO thread):
1. T1: ActiveStream::setVBucketStateAckRecieved()
2. T1: transitionState(takeoverSend) => schedules ActiveStreamCheckpointProcessorTask
3. BT1: manageConnections() notifies memcached about specific conn (max idle time: 5s)
4. BT2: ActiveStreamCheckpointProcessorTask runs, gets all Items For Cursor
5. T1: step() -> takeoverSendPhase() -> readyQ is empty
=> nextCheckpointItem() return false, as getNumItemsForCursor returns 0
=> setVbucketState(active) added to readyQ
6. BT2: ActiveStreamCheckpointProcessorTask continues, adds mutations acquired into readyQ
-> Note the mutations were acquired in step 4
=> Notified memcached connections
7. T1: step() .. ships messages in incorrect order
On the new master, the vbucket is promoted to active state and then more
mutations are received from the old master. If there were front end ops
at this time, there could be an inconsistency in highSeqno or in worst
cases crashes in checkpoint manager due to highSeqno not belonging in
the designated range.
The fix: Add an atomic flag that is also checked for along with
getNumItemsForCursor in nextCheckpointItem(). This flag is set before
retrieving all items for a cursor (getAllItemsForCursor) and unset after
all the retrieved items have been added to the ready queue of the stream.
Change-Id: I5c04d47cc99c7dd3b2d87cb68dd30d36473226e5
Reviewed-on: http://review.couchbase.org/59627
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Wed, 10 Feb 2016 18:03:03 +0000 (10:03 -0800)]
MB-17766: Avoid copy overhead of std::deque in getOutstandingItems
Change-Id: I771182bd54a0f702f70287ff4728d26b7ffaa323
Reviewed-on: http://review.couchbase.org/59754
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Dave Rigby [Tue, 9 Feb 2016 13:33:38 +0000 (13:33 +0000)]
MB-17766: Refactor nextCheckpointItemTask to allow testing
Split nextCheckpointItemTask() into two inner (protected) functions,
to allow testing of the fix for MB-17766.
Change-Id: I9d441d873cf7f727f90a966d4dda03043c7f6480
Reviewed-on: http://review.couchbase.org/59664
Well-Formed: buildbot <build@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Tue, 9 Feb 2016 22:27:10 +0000 (14:27 -0800)]
MB-17885: Address compilation errors in ep_testsuite.cc
3.0.x don't support C++11!
Change-Id: Ia3adf0c7ace9b771b999427811f0872774740386
Reviewed-on: http://review.couchbase.org/59678
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Well-Formed: buildbot <build@couchbase.com>
Dave Rigby [Fri, 5 Feb 2016 11:47:27 +0000 (11:47 +0000)]
MB-17517: Regenerate CAS for TAP/DCP mutations with invalid CAS
When receiving mutations via TAP or DCP, validate the CAS if invalid
generate a new one.
This is instead of the simply dropping the mutaiton (returning an
error to the producer), as by dropping the mutation we could cause
data loss.
Change-Id: I3183aab7c5eecb090ccc560319a7aac915cb35b8
Reviewed-on: http://review.couchbase.org/59597
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Manu Dhundi [Mon, 8 Feb 2016 17:42:25 +0000 (09:42 -0800)]
MB-17885: Update flow control bytesSent correctly on DCP producer
This is a fix for a regression introduced recently. Also this adds
a DCP test case to test flow control behavior of DCP producer.
Change-Id: Ia56858cb9e687a0a045b582c18e4b68948cb460c
Reviewed-on: http://review.couchbase.org/59575
Reviewed-by: Jim Walker <jim@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
Norair Khachiyan [Wed, 27 Jan 2016 23:54:30 +0000 (15:54 -0800)]
MB-16910: Stop logging multiple 'warmup is complete' messages
Fix prevents logging more than one 'Engine warmup is complete' message
for each bucket. Fix resolves problem of overloading log file with the
message mentioned above.
Change-Id: Icfdeb5def6e3f055159e1a57430c6dc661382e30
Reviewed-on: http://review.couchbase.org/58898
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Reviewed-by: Manu Dhundi <manu@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Wed, 27 Jan 2016 22:40:20 +0000 (14:40 -0800)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
|\
| * ae39d7f MB-17502: DCP performance regression fixed.
Change-Id: Id3262008479ffe7e350a9a4d88438ed541ac242a
Jim Walker [Wed, 6 Jan 2016 15:40:57 +0000 (15:40 +0000)]
MB-17502: DCP performance regression fixed.
Many patches were added to speed up DCP, however some of that
performance was lost when doing some code tidying without
re-profiling.
With all the DCP performance patches (particuarly)
87869fd39 straight
DCP performance is a touch slower. This is because DCP used to take
one lock and then do work. The new code has more locks, but holds
them for fewer lines of code. This means that DCP is friendlier/fairer
to the other threads interacting with a DCP producer.
The front-end operation threads are no longer stalled for long periods
whilst DCP holds the one lock.
Frontend latency before locking changes:
=== Latency [With background DCP] - 100000 items
Percentile
Median 95th 99th Std Dev
Add 16.337 34.894 45.241 25.627
Get 1.226 1.524 1.745 0.435
Replace 16.311 34.386 42.097 8.435
Delete 15.636 32.915 41.999 7.408
Frontend latency after locking changes:
=== Latency [With background DCP] - 100000 items
Percentile
Median 95th 99th Std Dev
Add 3.996 12.159 20.724 11.376
Get 1.299 1.629 1.730 0.634
Replace 4.274 12.831 22.988 4.523
Delete 3.142 10.302 14.292 3.350
The average and 95th/99th are all improved.
Fix details:
The roundRobin/vbReady code has a bufferLog.pauseIfFull call on the
"hot" part of the loop, this is the main cause of the regression.
With that fixed CPU profiling and benchmarking shows that DCP is back
to 3.1.3 levels but highlighted that:
1. DcpProducer::getNextItem was hot (5% of a DcpProducer thread).
2. DcpConsumer::processBufferedItems was hitting SpinLock hard.
20 to 30% at times was consumed by SpinLock code.
3. snapshot creation was frequently yielding even though it had work todo.
So to address 1. the fix is actually to remove the roundRobin/vbReady
code. It is actually no better and in some cases a little slower than
the orginal. This code is replaces with std:: structures *but* the
Mutex used has a much smaller scope.
Note the DcpProducerReadyQueue has been profiled and proven that having
the std::map powering find() is much faster than searching the list.
This is important because the find method is part of the front-end
operation thread.
To address 2. it was observed that the consumer code is constructing
a passive_stream_t frequently, then testing if there is a pointer.
The construction uses the SpinLock code and can be avoided just by
testing the streams[vb] directly and only then do we construct
a copy of the passive_stream_t. This avoids the SpinLock code on
every iteration of the for loop in the affected function.
To address 3. ensure that the snapshot tasks work queue doesn't have
duplicates, there's no need. Then raise the number of snapshots before
yield. Various rebalances showed that around 250 was enough, so let's go
with 256.
Change-Id: I8fb0bd30f8e07d000192675de425726ad26e403a
Reviewed-on: http://review.couchbase.org/58886
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Sriram Ganesan [Thu, 21 Jan 2016 20:00:36 +0000 (12:00 -0800)]
MB-17517: check for invalid CAS value in [set/delete]WithMeta
In a [set/delete]WithMeta call from either XDCR or from DCP,
a corrupt CAS value for the incoming item should return an
error
Change-Id: Id87627ae35ef711de4704a960b3aa7d1e9b9a48b
Reviewed-on: http://review.couchbase.org/58910
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Sriram Ganesan [Thu, 21 Jan 2016 02:15:56 +0000 (18:15 -0800)]
MB-17517: return EINVAL instead of assert in arithmetic
If a get performed on an item returns a CAS value of zero, then
return EINVAL as opposed to asserting
Change-Id: If3d43c270bcc627029d0954dab0e570c83ddca74
Reviewed-on: http://review.couchbase.org/58868
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Jim Walker [Tue, 12 Jan 2016 14:18:50 +0000 (14:18 +0000)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
* couchbase/3.0.x:
MB-17086: Fix to performance regression.
MB-16632: Use a background task to handle snapshot creation
Change-Id: Ib018040e29c699316b5519a1325fff1476b803eb
Sriram Ganesan [Thu, 7 Jan 2016 18:38:38 +0000 (10:38 -0800)]
MB-17289: Update existing temp item CAS to incoming CAS in deleteWithMeta
If the incoming CAS is non-zero, a comparison is made with an
existing item CAS to check if the delete can be done or not.
But in case a temp existing item has to be persisted, then
update the existing temp item CAS to the incoming CAS in order
to bypass the above-mentioned comparison only for forced
deletions
Change-Id: Idca2615a7e3166e98b26c9d447b1161742226436
Reviewed-on: http://review.couchbase.org/58385
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Jim Walker [Fri, 8 Jan 2016 14:02:43 +0000 (14:02 +0000)]
MB-17086: Fix to performance regression.
Revert "MB-16632: As part of queueDirty schedule a DCP connections notifier task"
This reverts commit
fa17728e7ca0c637c84a2208b5decfe7ba7e54f1.
Performance testing showed that a regression has been introduced and that
fa17728 was the cause.
The regression was introduced by some fixes made during review that weren't
re-profiled.
Performance can be improved by making some further changes but the investigation
revealed that performance is actually at its best without fa17728.
Change-Id: I7ac3ff49d0b9ce8563f3a932dd337a58d03a0153
Reviewed-on: http://review.couchbase.org/58380
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Jim Walker [Thu, 12 Nov 2015 11:14:11 +0000 (11:14 +0000)]
MB-16632: Use a background task to handle snapshot creation
Frontend threads are delayed by large snaphots due to the time taken
in processing the items into the readyQ.
Moving this work to a background task frees frontend threads to
do other work.
Change-Id: Ic399ef06be996b7b7e179c4c8934a0f5a74cb8f7
Reviewed-on: http://review.couchbase.org/57148
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Sriram Ganesan [Tue, 5 Jan 2016 23:26:59 +0000 (15:26 -0800)]
MB-17231: conditionally delete temp items in get API
There are some conditions in which there might be a need to retain
a temporary non-existent item. For example, in the context of an
arithmetic API call, a get call in full eviction results in a
temp non-existent item that needs to be retained in order for a
subsequent ADD operation to not perform another background fetch.
Change-Id: I357d28a91b1e7ce56006accf919e48e99a643a2e
Reviewed-on: http://review.couchbase.org/58294
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Mike Wiederhold [Mon, 2 Nov 2015 23:27:48 +0000 (15:27 -0800)]
Log DCP stream creation before closing the stream
Th endStream() function emits a log message saying the stream was
closed. This log message should always come after the one that says
the stream was created.
Change-Id: Ibcd259373f5675cf49ea7a95ba3f2276507abe7a
Reviewed-on: http://review.couchbase.org/56615
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Tue, 5 Jan 2016 02:20:06 +0000 (18:20 -0800)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
|\
| * e338e1b MB-17220: [BP] Seperate logs for notifying seqno/checkpoint persistence
| * 02064cd MB-17168: Log lastSentSeqno during takeover state change
| * 380307d MB-16656: Set the open chkpt id on replica to 0 when disk snapshot is recvd.
| * 6f869d1 MB-16656: Stream a full (disk+mem) snapshot from DCP producer on replica vb
Change-Id: I92464ddfa7a20e1db12d2ab432b4072f6f57c8d8
abhinavdangeti [Tue, 1 Dec 2015 19:08:41 +0000 (11:08 -0800)]
MB-17220: [BP] Seperate logs for notifying seqno/checkpoint persistence
- Print different logs while notifying completion or timeouts
during seqno persistence and checkpoint persistence.
- Also adding additional information to the logs.
Change-Id: Idf29cab2197f37b180b0295b19f6b46542bdc6b6
Reviewed-on: http://review.couchbase.org/58233
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Manu Dhundi [Mon, 4 Jan 2016 19:43:32 +0000 (11:43 -0800)]
MB-17088: Fix the underflow in certain vb stats.
Fix the potential underflow bug in dirtyQueueMem, dirtyQueueAge and
dirtyQueuePendingWrites stats.
Change-Id: If5df08e709404399b4e4595f150f5c4616897278
Reviewed-on: http://review.couchbase.org/58067
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Mon, 4 Jan 2016 17:43:01 +0000 (09:43 -0800)]
MB-17168: Log lastSentSeqno during takeover state change
When an active vbucket state is changed to dead as part
of takeover, log a message that would indicate the last
sent seqno for the vbucket on the particular stream and
the vbucket's high seqno.
Change-Id: I7097b79cf41b2c62688ddb9345bc529ac08b2223
Reviewed-on: http://review.couchbase.org/58217
Reviewed-by: Sriram Ganesan <sriram@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Manu Dhundi [Sat, 19 Dec 2015 02:15:16 +0000 (18:15 -0800)]
MB-16656: Set the open chkpt id on replica to 0 when disk snapshot is recvd.
Currently due to a bug in 3.0.x the open checkpoint id is not set to 0
when replica receives a disk snapshot from active.
Change-Id: Iffda89b8da713539a52d50aa4acc33458ae7150e
Reviewed-on: http://review.couchbase.org/57834
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Manu Dhundi [Mon, 21 Dec 2015 21:42:07 +0000 (13:42 -0800)]
MB-16656: Stream a full (disk+mem) snapshot from DCP producer on replica vb
A replica vbucket receives items from an active vbucket, and until a full
snapshot is received the data on the replica vbucket is not consistent due
to de-duplication and other reasons. Hence while streaming items to a DCP
client from a replica vbucket we need to combine backfill and in memory
snapshots and send items in one snapshot. A caveat here is the replica vb
might not have received all the items in the latest (memory) snapshot, so the
DCP client streaming from replica will have to wait till the replica gets
all the items in the latest snapshot from the active.
Change-Id: I4db622f967316d120506dc9b125211578194bb60
Reviewed-on: http://review.couchbase.org/57816
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Manu Dhundi <manu@couchbase.com>
abhinavdangeti [Mon, 21 Dec 2015 23:34:48 +0000 (15:34 -0800)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
* 9b12d0f [DcpStream] Removing extra exception/abort that was added recently
Change-Id: Ied21f9fb093c403cbcef44fcc6d2c9229030a2ad
abhinavdangeti [Mon, 21 Dec 2015 20:03:40 +0000 (12:03 -0800)]
[DcpStream] Removing extra exception/abort that was added recently
Exceptions in 3.0.x are unhandled which makes them pretty
much the same as aborts/asserts.
Although it is impossible for the event where an active stream
enters STREAM_READING state to occur , it may be the better
thing to do - to have the risk of hitting this assertion be ZERO
for the maintainance releases only.
Change-Id: I0a1eff5ab6c8cec8ad6d97e9a1c2201844c25fbd
Reviewed-on: http://review.couchbase.org/57962
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Sriram Ganesan <sriram@couchbase.com>
Reviewed-by: Manu Dhundi <manu@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Wed, 16 Dec 2015 17:26:09 +0000 (09:26 -0800)]
Log destruction of Passive stream if it were live
Transition state of passive stream to dead in its
destructor, and log a message if we're destroying
a live stream.
Change-Id: I39651ee022a321048409345b9d987dc7c15003cf
Reviewed-on: http://review.couchbase.org/57861
Reviewed-by: Manu Dhundi <manu@couchbase.com>
Reviewed-by: Sundararaman Sridharan <sundar@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Fri, 18 Dec 2015 17:22:45 +0000 (09:22 -0800)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
* 5ffc61f MB-17051: [DcpProducer] Ensure no un-notified streams are left behind
* 5107095 [DcpProducer] Refactor function name to indicate intent
Change-Id: I9c6d50e478846aab1719575e97efe3106cbe8425
abhinavdangeti [Thu, 17 Dec 2015 18:57:56 +0000 (10:57 -0800)]
MB-17051: [DcpProducer] Ensure no un-notified streams are left behind
Reiterate vbReady list at the end of a DcpProducer step to
ensure un-notified vbuckets are not left unprocessed.
Change-Id: I21065cf99f8be0af6dedf506237ce3dbe683387d
Reviewed-on: http://review.couchbase.org/57867
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Thu, 17 Dec 2015 03:56:44 +0000 (19:56 -0800)]
[DcpProducer] Refactor function name to indicate intent
unpauseIfSpace --> unpauseIfSpaceAvailable
Change-Id: Ifb4ec181e2228d819ab460bd03eccfefd75c48d6
Reviewed-on: http://review.couchbase.org/57872
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Dave Rigby [Wed, 2 Dec 2015 14:55:32 +0000 (14:55 +0000)]
MB-16949: BackfillManager: Fix leak due to circular reference
There is a circular reference between BackfillManager and DcpProducer,
which means that BackfillManager is never deleted. This is caused by
BackfillManager holding an (unused) SingleThreadedRCPtr to
DcpProducer.
Fix this by Removing this unused smart pointer. Unfortunately exposes
a latent deadlock when destructing BackfillManager - ~BackfillManager
(indirectly) calls ~ActiveStream, which in turn calls
BackfillManager::bytesSent() which attempts to acquire the Backfill
lock; however the lock is already acquired in ~BackfillManager.
Fix this by removing the call to bytesSent() in ~ActiveStream - as
it's already been updated in other methods.
The now correct deletion of DcpProducer and the smart-pointer
triggered deletion of ActiveStream leads to a lock-inversion
problem in that different locks are being acquired in different
orders on the possible deletion paths.
Fix this by removing the lock acquisition from destructors.
The locks in destructors are providing no protection, no
code should be refering to the deleting object, they risk
accessing freed memory with or without the locks.
Change-Id: Id79fbf8bc39463d4874c2f61378b7f110345631b
Reviewed-on: http://review.couchbase.org/57417
Reviewed-by: Jim Walker <jim@couchbase.com>
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Well-Formed: buildbot <build@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Fri, 11 Dec 2015 19:36:51 +0000 (11:36 -0800)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
* c8ce24b Fix heap-use-after-free issue detected by thread sanitizer
Change-Id: I00298e6a311e87d11fb8dd15adace2ddbfdfcc3f
abhinavdangeti [Thu, 10 Dec 2015 22:23:44 +0000 (14:23 -0800)]
Fix heap-use-after-free issue detected by thread sanitizer
No need to stop Producer Notififer in the destructor of
dcpConnMap. This is already taken care of when the executor
pool is unregistered.
WARNING: ThreadSanitizer: heap-use-after-free (pid=158780)
Read of size 8 at 0x7d180000c1a0 by main thread:
#0 DcpConnMap::~DcpConnMap() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/tasks.h:103 (ep.so+0x0000000453e1)
#1 DcpConnMap::~DcpConnMap() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/connmap.cc:954 (ep.so+0x0000000456f5)
#2 EventuallyPersistentEngine::~EventuallyPersistentEngine() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/ep_engine.cc:6410 (ep.so+0x0000000d0e5c)
#3 EvpDestroy(engine_interface*, bool) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/ep_engine.cc:147 (ep.so+0x0000000b27f7)
#4 mock_destroy(engine_interface*, bool) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/memcached/programs/engine_testapp/engine_testapp.cc:99 (engine_testapp+0x0000004cbd97)
#5 destroy_bucket(engine_interface*, engine_interface_v1*, bool) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/memcached/programs/engine_testapp/engine_testapp.cc:996 (engine_testapp+0x0000004cbc19)
#6 perf_latency_baseline_multi_thread_bucket(test*, int, int, int) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/tests/ep_perfsuite.cc:386 (ep_perfsuite.so+0x00000000dfc4)
#7 perf_latency_baseline_multi_bucket_4(test*) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/tests/ep_perfsuite.cc:429 (ep_perfsuite.so+0x0000000091ef)
#8 execute_test(test, char const*, char const*) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/memcached/programs/engine_testapp/engine_testapp.cc:1104 (engine_testapp+0x0000004cb21c)
#9 __libc_start_main /build/buildd/eglibc-2.15/csu/libc-start.c:226 (libc.so.6+0x00000002176c)
Previous write of size 8 at 0x7d180000c1a0 by thread T15 (mutexes: write M11751):
#0 operator delete(void*) <null> (engine_testapp+0x0000004641db)
#1 DcpConnMap::DcpProducerNotifier::~DcpProducerNotifier() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/connmap.h:530 (ep.so+0x00000004ab85)
#2 ExecutorThread::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/atomic.h:325 (ep.so+0x0000000f17cb)
#3 launch_executor_thread(void*) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:33 (ep.so+0x0000000f15f5)
#4 platform_thread_wrap(void*) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/platform/src/cb_pthreads.cc:54 (libplatform.so.0.1.0+0x000000004e7b)
Change-Id: Ib458d0826cc33b4b233da5a422b90bcf08d408bb
Reviewed-on: http://review.couchbase.org/57699
Well-Formed: buildbot <build@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Dave Rigby [Thu, 10 Dec 2015 09:02:23 +0000 (09:02 +0000)]
Merge remote-tracking branch 'couchbase/3.0.x' into sherlock
* couchbase/3.0.x:
MB-17006: [BP] DCP Producer could miss fetching items from a stream
Fix test case test_dcp_early_termination
Fix compilation issue on windows
[BP] MB-16915: Remove cyclic reference between DcpConsumer and PassiveStream.
[BP] MB-16915: RollbackTask to hold ref count ptr for DCP consumer instead of raw ptr
[BP] MB-16915: Use refcounted pointers on producer/consumer
MB-16915: Remove cyclic reference between DcpConsumer and PassiveStream.
MB-16915: RollbackTask to hold ref count ptr for DCP consumer instead of raw ptr
MB-16632: As part of queueDirty schedule a DCP connections notifier task
Change-Id: Iaf2ae12b97c92fd99b3b967cc47f8654b44dfc28
Dave Rigby [Thu, 10 Dec 2015 08:48:03 +0000 (08:48 +0000)]
Merge branch 'couchbase/3.0.x' into couchbase/sherlock
* 3.0.x:
MB-16915: Use refcounted pointers on producer/consumer
MB-16632: Replace std::list with std:deque in DCP checkpoint processing
Change-Id: Ic754e821b486361640a5f409d589c0e61cf93847
abhinavdangeti [Mon, 7 Dec 2015 21:02:55 +0000 (13:02 -0800)]
MB-17006: [BP] DCP Producer could miss fetching items from a stream
Here's the scenario:
1. Stream currently in backfill phase
2. When backfill is received, 1 item added to readyQ
a. itemsReady of stream set to true (Producer notified)
3. Front end op comes in
a. item added to checkpoint queue
b. itemsReady not set to true, as it already is (Producer not
notified)
4. Producer calls stream::next()
a. stream in backfillPhase(): 1 item popped from readyQ
b. backfill task still running => no state transition to IN_MEMORY
c. 1 op returned to producer, producer re-adds vbucket to ready list
5. Backfill completes
6. Producer calls stream::next()
a. stream in backfillPhase(): no items in readyQ
b. As backfill task has completed, state transitions to IN_MEMORY
c. no items in readyQ => NULL returned
d. As no op obtained, producer doesn't re-add vbucket to ready list
=> Front end item remains stuck in checkpoint queue, until more front
end ops come in - which would notify the producer
The proposed fix here is: In step 6b, when the producer sees the
backfill task has completed, and the state for the stream transitions
to IN_MEMORY, move checkpoint items into readyQ. This way the readyQ
will not be empty, and the producer would re-add the vbucket back into
the ready list.
Change-Id: I3403d3926f97788074990ef0e4c69cac902b2a93
Reviewed-on: http://review.couchbase.org/57516
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Change-Id: I0821402fefd01f0851572d7c22ccee5fc065778d
Reviewed-on: http://review.couchbase.org/57641
Well-Formed: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Wed, 9 Dec 2015 19:42:43 +0000 (11:42 -0800)]
Fix test case test_dcp_early_termination
Account for tasks that are already in the future queue of the
auxIO dispatcher to ensure all DCP backfill tasks (auxIO) have
completed.
Change-Id: I9544a79436193f3ef42b08a2b6615eb4be4792ce
Reviewed-on: http://review.couchbase.org/57642
Well-Formed: buildbot <build@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Dave Rigby [Wed, 9 Dec 2015 11:19:26 +0000 (11:19 +0000)]
Merge branch 'couchbase/3.0.x' into couchbase/sherlock
* 3.0.x:
MB-16632: Reducing locking contention in DCP-Producer/Stream
Incorrect log paramenters while logging backfill completion
Change-Id: I29a3aa4b0a2a8d24b9fe94a566354da9a4304f5c
Dave Rigby [Wed, 9 Dec 2015 10:12:25 +0000 (10:12 +0000)]
Merge remote-tracking branch 'couchbase/3.1.3' into 'couchbase/3.0.x'
* couchbase/3.1.3:
Fix compilation issue on windows
[BP] MB-16915: Remove cyclic reference between DcpConsumer and PassiveStream.
[BP] MB-16915: RollbackTask to hold ref count ptr for DCP consumer instead of raw ptr
[BP] MB-16915: Use refcounted pointers on producer/consumer
Change-Id: Ied8b262fef0eb06671277524e17f0b6cbf7acbeb
abhinavdangeti [Fri, 4 Dec 2015 00:15:47 +0000 (16:15 -0800)]
Fix compilation issue on windows
<http://factory.couchbase.com/job/win_cs_build/ws/couchbase\ep-engine\test
s\ep_testsuite.cc(4958)> : error C2782: 'void checkeqfn(T,T,const char
,const char ,const int)' : template parameter 'T' is ambiguous
<http://factory.couchbase.com/job/win_cs_build/ws/couchbase\ep-engine\test
s\ep_testsuite.cc(74)> : see declaration of 'checkeqfn'
could be 'unsigned __int64'
or 'unsigned long'
NMAKE : fatal error U1077:
'C:\PROGRA~2\MICROS~2.0\VC\bin\amd64\cl.exe' :
return code '0x2'
Change-Id: I9a0bf5bd74276ebe9ac6a709302704a2bab06c25
Reviewed-on: http://review.couchbase.org/57454
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: abhinav dangeti <abhinav@couchbase.com>
Manu Dhundi [Thu, 3 Dec 2015 21:32:35 +0000 (13:32 -0800)]
[BP] MB-16915: Remove cyclic reference between DcpConsumer and PassiveStream.
DcpConsumer holds a reference to PassiveStream and vice versa. We must
make sure that one of them (DcpConsumer here) releases the reference
to another in a function other than the object destructor.
Change-Id: I8e5c262bc5ac50342f85ba80d481987a26a7a21d
Reviewed-on: http://review.couchbase.org/57429
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-on: http://review.couchbase.org/57449
Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
Manu Dhundi [Thu, 3 Dec 2015 01:19:51 +0000 (17:19 -0800)]
[BP] MB-16915: RollbackTask to hold ref count ptr for DCP consumer instead of raw ptr
Rollback task is spawned when a DCP consumer is asked to rollback by a DCP
producer. Rollback runs in background and there is a possibility that the DCP
consumer object gets deleted before rollback task completes. We can avoid this
if RollbackTask holds a ref counted ptr of DCP consumer instead of a raw ptr.
Change-Id: I00c1bced0ec445226e64e6f7647a3bfbfb063f94
Reviewed-on: http://review.couchbase.org/57427
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-on: http://review.couchbase.org/57448
Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
Jim Walker [Mon, 30 Nov 2015 13:31:59 +0000 (13:31 +0000)]
[BP] MB-16915: Use refcounted pointers on producer/consumer
Prevents a race/crash occuring when the DcpProducer is destroyed
and there are backfill tasks running/pending.
The test case reveals the probem when run under valgrind as
a series of invalid reads of freed memory. E.g.
==40673== Thread 17:
==40673== Invalid read of size 8
==40673== at 0x71A3CEE: DCPBackfill::run() (dcp-stream.cc:175)
==40673== by 0x717215C: ExecutorThread::run() (executorthread.cc:110)
==40673== by 0x7172868: launch_executor_thread (executorthread.cc:34)
==40673== by 0x503EC67: platform_thread_wrap (cb_pthreads.c:24)
==40673== by 0x524A181: start_thread (pthread_create.c:312)
==40673== by 0x555A47C: clone (clone.S:111)
==40673== Address 0x64c2380 is 48 bytes inside a block of size 384 free'd
==40673== at 0x4C2C2BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==40673== by 0x718C4ED: DcpConnMap::manageConnections() (atomic.h:430)
==40673== by 0x71906A5: ConnManager::run() (connmap.cc:151)
==40673== by 0x717215C: ExecutorThread::run() (executorthread.cc:110)
==40673== by 0x7172868: launch_executor_thread (executorthread.cc:34)
==40673== by 0x503EC67: platform_thread_wrap (cb_pthreads.c:24)
==40673== by 0x524A181: start_thread (pthread_create.c:312)
==40673== by 0x555A47C: clone (clone.S:111)
Change-Id: I32a7dfd10daa4565b9cbb4c8142ed8f71c13ca31
Reviewed-on: http://review.couchbase.org/57296
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-on: http://review.couchbase.org/57447
Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
Manu Dhundi [Thu, 3 Dec 2015 21:32:35 +0000 (13:32 -0800)]
MB-16915: Remove cyclic reference between DcpConsumer and PassiveStream.
DcpConsumer holds a reference to PassiveStream and vice versa. We must
make sure that one of them (DcpConsumer here) releases the reference
to another in a function other than the object destructor.
Change-Id: I8e5c262bc5ac50342f85ba80d481987a26a7a21d
Reviewed-on: http://review.couchbase.org/57429
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Manu Dhundi [Thu, 3 Dec 2015 01:19:51 +0000 (17:19 -0800)]
MB-16915: RollbackTask to hold ref count ptr for DCP consumer instead of raw ptr
Rollback task is spawned when a DCP consumer is asked to rollback by a DCP
producer. Rollback runs in background and there is a possibility that the DCP
consumer object gets deleted before rollback task completes. We can avoid this
if RollbackTask holds a ref counted ptr of DCP consumer instead of a raw ptr.
Change-Id: I00c1bced0ec445226e64e6f7647a3bfbfb063f94
Reviewed-on: http://review.couchbase.org/57427
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Thu, 22 Oct 2015 21:54:28 +0000 (14:54 -0700)]
MB-16632: As part of queueDirty schedule a DCP connections notifier task
This is how things are done for TAP.
This pretty much removed the notifications' lock overhead on
store/delete/(front-end) OP latencies.
Change-Id: I32c3c26daf6ea8cebeecc2a81fb1f0e957ba3e3d
Reviewed-on: http://review.couchbase.org/56301
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Jim Walker [Mon, 30 Nov 2015 13:31:59 +0000 (13:31 +0000)]
MB-16915: Use refcounted pointers on producer/consumer
Prevents a race/crash occuring when the DcpProducer is destroyed
and there are backfill tasks running/pending.
The test case reveals the probem when run under valgrind as
a series of invalid reads of freed memory. E.g.
==40673== Thread 17:
==40673== Invalid read of size 8
==40673== at 0x71A3CEE: DCPBackfill::run() (dcp-stream.cc:175)
==40673== by 0x717215C: ExecutorThread::run() (executorthread.cc:110)
==40673== by 0x7172868: launch_executor_thread (executorthread.cc:34)
==40673== by 0x503EC67: platform_thread_wrap (cb_pthreads.c:24)
==40673== by 0x524A181: start_thread (pthread_create.c:312)
==40673== by 0x555A47C: clone (clone.S:111)
==40673== Address 0x64c2380 is 48 bytes inside a block of size 384 free'd
==40673== at 0x4C2C2BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==40673== by 0x718C4ED: DcpConnMap::manageConnections() (atomic.h:430)
==40673== by 0x71906A5: ConnManager::run() (connmap.cc:151)
==40673== by 0x717215C: ExecutorThread::run() (executorthread.cc:110)
==40673== by 0x7172868: launch_executor_thread (executorthread.cc:34)
==40673== by 0x503EC67: platform_thread_wrap (cb_pthreads.c:24)
==40673== by 0x524A181: start_thread (pthread_create.c:312)
==40673== by 0x555A47C: clone (clone.S:111)
Change-Id: I32a7dfd10daa4565b9cbb4c8142ed8f71c13ca31
Reviewed-on: http://review.couchbase.org/57296
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Jim Walker [Mon, 30 Nov 2015 13:31:59 +0000 (13:31 +0000)]
MB-16915: Use refcounted pointers on producer/consumer
Prevents a race/crash occuring when the DcpProducer is destroyed
and there are backfill tasks running/pending.
The test case reveals the probem when run under valgrind as
a series of invalid reads of freed memory. E.g.
==40673== Thread 17:
==40673== Invalid read of size 8
==40673== at 0x71A3CEE: DCPBackfill::run() (dcp-stream.cc:175)
==40673== by 0x717215C: ExecutorThread::run() (executorthread.cc:110)
==40673== by 0x7172868: launch_executor_thread (executorthread.cc:34)
==40673== by 0x503EC67: platform_thread_wrap (cb_pthreads.c:24)
==40673== by 0x524A181: start_thread (pthread_create.c:312)
==40673== by 0x555A47C: clone (clone.S:111)
==40673== Address 0x64c2380 is 48 bytes inside a block of size 384 free'd
==40673== at 0x4C2C2BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==40673== by 0x718C4ED: DcpConnMap::manageConnections() (atomic.h:430)
==40673== by 0x71906A5: ConnManager::run() (connmap.cc:151)
==40673== by 0x717215C: ExecutorThread::run() (executorthread.cc:110)
==40673== by 0x7172868: launch_executor_thread (executorthread.cc:34)
==40673== by 0x503EC67: platform_thread_wrap (cb_pthreads.c:24)
==40673== by 0x524A181: start_thread (pthread_create.c:312)
==40673== by 0x555A47C: clone (clone.S:111)
Change-Id: I32a7dfd10daa4565b9cbb4c8142ed8f71c13ca31
Reviewed-on: http://review.couchbase.org/57377
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Jim Walker [Wed, 4 Nov 2015 12:17:57 +0000 (12:17 +0000)]
MB-16632: Replace std::list with std:deque in DCP checkpoint processing
The algorithm does not need a std::list when it is implementing
nothing more than an queue.
This change brings some performance improvement to snaphot marker
creation.
Change-Id: I2f1ac82364737e9f56ff9c0c11b3cc1775b3f0d2
Reviewed-on: http://review.couchbase.org/57147
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Thu, 22 Oct 2015 22:36:12 +0000 (15:36 -0700)]
MB-16632: Reducing locking contention in DCP-Producer/Stream
- Adding a new RWLock for streams in Producer and avoid queueLock
- Improving BufferLog and remove need for queueLock on access
- Adding an array of atomic bool for lockless vbucket ready notification
- Changing some ActiveStream variables to be atomic to allow for lockless
updates.
Change-Id: I11c54f1058c4c8a3f013dfc858a39d17362c9531
Reviewed-on: http://review.couchbase.org/56300
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Mon, 23 Nov 2015 17:05:01 +0000 (09:05 -0800)]
Incorrect log paramenters while logging backfill completion
Change-Id: I877fd7067862f09801ffd16e7014a0c952e8c559
Reviewed-on: http://review.couchbase.org/56417
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Daniel Owen <owend@couchbase.com>
Reviewed-by: Manu Dhundi <manu@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Manu Dhundi [Wed, 18 Nov 2015 01:45:17 +0000 (17:45 -0800)]
Merge remote-tracking branch 'couchbase/3.0.x'
couchbase/3.0.x:
MB-16836: Reset the stat 'ep_bg_fetched' to 0 on 'cbstats reset' command
Change-Id: I0ad9b7fdcd4b37cbffa38361ec33ccc4f3eec78b
Manu Dhundi [Tue, 17 Nov 2015 22:55:18 +0000 (14:55 -0800)]
MB-16836: Reset the stat 'ep_bg_fetched' to 0 on 'cbstats reset' command
Change-Id: I444bd6c76265788d4366061d4d2b25e3c5e60518
Reviewed-on: http://review.couchbase.org/57129
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Jim Walker [Thu, 12 Nov 2015 12:21:41 +0000 (12:21 +0000)]
MB-16357: Create a variable to get correct locking scope
A mistake in
495e00acc24 means that no variable is
created for the ReaderLockHolder, the compiler either
optimises away the lock constructor/destructor or the lock
scope is wrong.
Either way we need to create a variable.
Change-Id: I156fcf6b0df7aee99b2c33c0e6e80d3c6ef7e546
Reviewed-on: http://review.couchbase.org/56977
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Jim Walker [Thu, 12 Nov 2015 12:04:56 +0000 (12:04 +0000)]
MB-16357: Create a variable to get correct locking scope
A mistake in
495e00acc24 means that no variable is
created for the ReaderLockHolder, the compiler either
optimises away the lock constructor/destructor or the lock
scope is wrong.
Either way we need to create a variable.
Change-Id: I642ac64d71b73d3d78207ff50d33539a06ce0e7e
Reviewed-on: http://review.couchbase.org/56976
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Fri, 30 Oct 2015 19:01:06 +0000 (12:01 -0700)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'couchbase/sherlock'
couchbase/3.0.x:
* 8939f83 MB-16686: Remove sanity check while adding TAP over DCP
* 0f1ff18 MB-15171: [BP] Initialize dcpConnMap_ to NULL in engine constructor
* 44bd072 MB-14825: [BP] While trying to stream next checkpoint item, check if vbucket is valid
Change-Id: I9ec7c046aa16aeeb0999438ef0186109828f3044
abhinavdangeti [Fri, 30 Oct 2015 17:11:46 +0000 (10:11 -0700)]
MB-16686: Remove sanity check while adding TAP over DCP
This check isn't accurate as certain TAP messages from
the producer carry no vbucket information - initialized to
zero (expected), as they aren't vbucket specific operations.
In such a scenario, if the TAP consumer needs to be created,
it wouldn't be allowed to if a DCP passive stream exists
for vbucket 0. This would break an online upgrade.
Change-Id: I310b9cf4dbaf652c233cba02de7ca72469efa89d
Reviewed-on: http://review.couchbase.org/56564
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Dave Rigby [Mon, 19 Oct 2015 11:41:49 +0000 (12:41 +0100)]
MB-16574: Log when cluster config is changed
Change-Id: I0b234e6330ab3e5c0cde84c14977aff8034000d6
Reviewed-on: http://review.couchbase.org/56219
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Trond Norbye <trond.norbye@gmail.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Sriram Ganesan [Thu, 28 May 2015 01:51:23 +0000 (18:51 -0700)]
MB-15171: [BP] Initialize dcpConnMap_ to NULL in engine constructor
Not initializing this variable to NULL can cause access to an
invalid pointer during engine destroy.
Change-Id: Icc5d848f7826bb6331deb40b4832efcf64622dea
Reviewed-on: http://review.couchbase.org/51492
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-on: http://review.couchbase.org/56171
Manu Dhundi [Thu, 7 May 2015 01:30:24 +0000 (18:30 -0700)]
MB-14825: [BP] While trying to stream next checkpoint item, check if vbucket is valid
If a vbucket is deleted in middle of a DCP connection streaming a checkpoint
item, we should handle such a scenario in a graceful manner.
Change-Id: I24fe52adc572f504f492f015f82fc8d5e0325925
Reviewed-on: http://review.couchbase.org/50674
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-on: http://review.couchbase.org/56170
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Wed, 14 Oct 2015 18:29:15 +0000 (11:29 -0700)]
Merge remote-tracking branch 'couchbase/3.0.x' into sherlock
couchbase/3.0.x:
* dec472a MB-16500 [BP]: Address data race in DcpConsumer, by acquiring readyMutex
* 6abf179 MB-16500 [BP]: Removing unnecessary locking in consumer code
* d388995 MB-16500 [BP]: MB-16496 Fix the race on vbucket state between persistVBState() and compactVB()
* 8828d4d MB-16500 [BP]: Address possible data race in checkpoint remover
* 52e40ed MB-16500 [BP]: Address possible data race in item/expiry pagers
* 5305a71 MB-16500 [BP]: Fix potenial deadlock around Connmap::relaseLock / connLock
* b155309 MB-16500 [BP]: Address possible data race in Notifiable: setSuspended
* 3b8a7be MB-16500 [BP]: Address possible data race in ConnHandler: lastWalkTime
Change-Id: Ie72bf67c10b0bed125a89534815474b1309763bb
abhinavdangeti [Wed, 7 Oct 2015 21:49:41 +0000 (14:49 -0700)]
MB-16500 [BP]: Address data race in DcpConsumer, by acquiring readyMutex
WARNING: ThreadSanitizer: data race (pid=27652)
Write of size 8 at 0x7d08000443c0 by main thread (mutexes: write M57876):
#0 operator delete(void*) <null>:0 (engine_testapp+0x000000050e7b)
#1 __gnu_cxx::new_allocator<std::_List_node<unsigned short> >::deallocate(std::_List_node<unsigned short>*, unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/ext/new_allocator.h:110 (ep.so+0x00000005d69a)
#2 DcpConsumer::step(dcp_message_producers*) /home/abhinav/couchbase/ep-engine/src/dcp/consumer.cc:516 (ep.so+0x00000005c5cc)
#3 EvpDcpStep(engine_interface*, void const*, dcp_message_producers*) /home/abhinav/couchbase/ep-engine/src/ep_engine.cc:1479 (ep.so+0x0000000b480b)
#4 mock_dcp_step(engine_interface*, void const*, dcp_message_producers*) /home/abhinav/couchbase/memcached/programs/engine_testapp/engine_testapp.cc:476 (engine_testapp+0x0000000bb055)
#5 dcp_step(engine_interface*, engine_interface_v1*, void const*) /home/abhinav/couchbase/ep-engine/tests/ep_test_apis.cc:1219 (ep_testsuite.so+0x0000000b61bd)
#6 test_chk_manager_rollback(engine_interface*, engine_interface_v1*) /home/abhinav/couchbase/ep-engine/tests/ep_testsuite.cc:5526 (ep_testsuite.so+0x0000000809b4)
#7 execute_test(test, char const*, char const*) /home/abhinav/couchbase/memcached/programs/engine_testapp/engine_testapp.cc:1090 (engine_testapp+0x0000000b952c)
#8 __libc_start_main /build/buildd/eglibc-2.19/csu/libc-start.c:287 (libc.so.6+0x000000021ec4)
Previous write of size 8 at 0x7d08000443c0 by thread T16:
#0 operator new(unsigned long) <null>:0 (engine_testapp+0x00000005090d)
#1 __gnu_cxx::new_allocator<std::_List_node<unsigned short> >::allocate(unsigned long, void const*) /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/ext/new_allocator.h:104 (ep.so+0x00000005f265)
#2 PassiveStream::reconnectStream(RCPtr<VBucket>&, unsigned int, unsigned long) /home/abhinav/couchbase/ep-engine/src/dcp/stream.cc:1104 (ep.so+0x000000076f5f)
#3 DcpConsumer::doRollback(unsigned int, unsigned short, unsigned long) /home/abhinav/couchbase/ep-engine/src/dcp/consumer.cc:676 (ep.so+0x00000005db67)
#4 RollbackTask::run() /home/abhinav/couchbase/ep-engine/src/dcp/consumer.cc:574 (ep.so+0x00000005d9d4)
#5 ExecutorThread::run() /home/abhinav/couchbase/ep-engine/src/executorthread.cc:115 (ep.so+0x0000000f834c)
#6 launch_executor_thread(void*) /home/abhinav/couchbase/ep-engine/src/executorthread.cc:33 (ep.so+0x0000000f7eb5)
#7 platform_thread_wrap /home/abhinav/couchbase/platform/src/cb_pthreads.c:23 (libplatform.so.0.1.0+0x000000003d71)
Change-Id: I196a78e54bf8014967a51cdb081126597153f77b
Reviewed-on: http://review.couchbase.org/55881
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-on: http://review.couchbase.org/56065
Reviewed-by: Dave Rigby <daver@couchbase.com>
abhinavdangeti [Thu, 13 Aug 2015 18:46:35 +0000 (11:46 -0700)]
MB-16500 [BP]: Removing unnecessary locking in consumer code
streamMutex is to protect the ready list, but not the streams list.
The front end operations: addStream, closeStream, handleResponse, step
- wouldn't race with each other over the streams list, as multiple
memcached threads will not serve a single cookie.
The back end operations: processBufferedMessages (doesn't grab lock any
way), doRollback just read from streams list.
An addstream (front end op) is the only one that updates streams, and
this wouldn't update when a rollback is in progress.
Therefore, renaming the streamMutex lock in DCPConsumer to readyMutex
which is more apt for its operation - guarding the ready list.
Change-Id: Ia342d7243fef4b97b729aa94fdc64ad020711589
Reviewed-on: http://review.couchbase.org/54406
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Manu Dhundi <manu@couchbase.com>
Reviewed-on: http://review.couchbase.org/56080
Reviewed-by: Dave Rigby <daver@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: abhinav dangeti <abhinav@couchbase.com>
Chiyoung Seo [Fri, 9 Oct 2015 18:39:10 +0000 (11:39 -0700)]
MB-16500 [BP]: MB-16496 Fix the race on vbucket state between persistVBState() and compactVB()
The following data race is reported by thread sanitizer:
WARNING: ThreadSanitizer: data race (pid=29921)
Write of size 8 at 0x7d680001f580 by thread T5 (mutexes: write M12734):
#0 VBucket::setPurgeSeqno() ep-engine/src/vbucket.h:215:9 (ep.so+0x000000086204)
#1 EventuallyPersistentStore::compactVBucket() ep-engine/src/ep.cc:1584 (ep.so+0x000000086204)
#2 CompactVBucketTask::run() ep-engine/src/tasks.cc:94:12 (ep.so+0x00000012971e)
#3 ExecutorThread::run() ep-engine/src/executorthread.cc:115:26 (ep.so+0x0000000ea41d)
#4 launch_executor_thread() ep-engine/src/executorthread.cc:33:9 (ep.so+0x0000000e9fe5)
#5 platform_thread_wrap platform/src/cb_pthreads.c:23:5 (libplatform.so.0.1.0+0x000000004161)
Previous read of size 8 at 0x7d680001f580 by thread T7:
#0 VBucket::getPurgeSeqno() ep-engine/src/vbucket.h:211:16 (ep.so+0x0000000821d3)
#1 EventuallyPersistentStore::persistVBState() ep-engine/src/ep.cc:1217 (ep.so+0x0000000821d3)
#2 VBStatePersistTask::run() ep-engine/src/tasks.cc:86:12 (ep.so+0x000000129636)
#3 ExecutorThread::run() ep-engine/src/executorthread.cc:115:26 (ep.so+0x0000000ea41d)
#4 launch_executor_thread() ep-engine/src/executorthread.cc:33:9 (ep.so+0x0000000e9fe5)
#5 platform_thread_wrap platform/src/cb_pthreads.c:23:5 (libplatform.so.0.1.0+0x000000004161)
Location is heap block of size 1392 at 0x7d680001f200 allocated by main thread:
#0 operator new() <null> (engine_testapp+0x00000045cded)
#1 EventuallyPersistentStore::setVBucketState() ep-engine/src/ep.cc:1300:30 (ep.so+0x000000082b1a)
#2 EventuallyPersistentEngine::setVBucketState() ep-engine/src/ep_engine.h:718:16 (ep.so+0x0000000ca308)
#3 setVBucket()) ep-engine/src/ep_engine.cc:884 (ep.so+0x0000000ca308)
#4 processUnknownCommand()) ep-engine/src/ep_engine.cc:1178 (ep.so+0x0000000ca308)
#5 EvpUnknownCommand()) ep-engine/src/ep_engine.cc:1389:38 (ep.so+0x0000000aafc8)
#6 mock_unknown_command()) memcached/programs/engine_testapp/engine_testapp.cc:380:19 (engine_testapp+0x0000004c56b9)
#7 set_vbucket_state() ep-engine/tests/ep_test_apis.cc:607:9 (ep_testsuite.so+0x0000000a3a4b)
#8 test_setup() ep-engine/tests/ep_testsuite_common.cc:146:28 (ep_testsuite.so+0x00000009cdda)
#9 execute_test() memcached/programs/engine_testapp/engine_testapp.cc:1085:47 (engine_testapp+0x0000004c4103)
#10 main memcached/programs/engine_testapp/engine_testapp.cc:1439 (engine_testapp+0x0000004c4103)
To address the above issue, vbucket states should be read after grabbing
the vbucket writer lock in EPStore::persistVBState().
Change-Id: I5a42b3e15a1cf5c941d399897bc68d6f35746ff3
Reviewed-on: http://review.couchbase.org/55972
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: abhinav dangeti <abhinav@couchbase.com>
Reviewed-on: http://review.couchbase.org/56068
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Thu, 8 Oct 2015 22:14:25 +0000 (15:14 -0700)]
MB-16500 [BP]: Address possible data race in checkpoint remover
WARNING: ThreadSanitizer: data race (pid=102986)
Read of size 1 at 0x7d180000c298 by thread T17:
#0 ClosedUnrefCheckpointRemoverTask::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/checkpoint_remover.cc:139 (ep.so+0x00000003d4a6)
#1 ExecutorThread::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:115 (ep.so+0x0000000f9c4c)
#2 launch_executor_thread(void*) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:33 (ep.so+0x0000000f9815)
#3 platform_thread_wrap /home/couchbase/.ccache/tmp/cb_pthread.tmp.
9515862b2292.83537.i:0 (libplatform.so.0.1.0+0x0000000041b1)
Previous write of size 1 at 0x7d180000c298 by thread T18:
#0 CheckpointVisitor::complete() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/checkpoint_remover.cc:69 (ep.so+0x00000003e0a1)
#1 VBCBAdaptor::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/ep.cc:3789 (ep.so+0x00000009d64a)
#2 ExecutorThread::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:115 (ep.so+0x0000000f9c4c)
#3 launch_executor_thread(void*) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:33 (ep.so+0x0000000f9815)
#4 platform_thread_wrap /home/couchbase/.ccache/tmp/cb_pthread.tmp.
9515862b2292.83537.i:0 (libplatform.so.0.1.0+0x0000000041b1)
Change-Id: I8579d5af259490e41028f57f302e547b0826fa61
Reviewed-on: http://review.couchbase.org/55937
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-on: http://review.couchbase.org/56067
abhinavdangeti [Thu, 8 Oct 2015 22:05:58 +0000 (15:05 -0700)]
MB-16500 [BP]: Address possible data race in item/expiry pagers
WARNING: ThreadSanitizer: data race (pid=102450)
Write of size 1 at 0x7d180000c2f8 by thread T17:
#0 PagingVisitor::complete() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/item_pager.cc:175 (ep.so+0x000000106c57)
#1 VBCBAdaptor::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/ep.cc:3789 (ep.so+0x00000009d64a)
#2 ExecutorThread::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:115 (ep.so+0x0000000f9c4c)
#3 launch_executor_thread(void*) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:33 (ep.so+0x0000000f9815)
#4 platform_thread_wrap /home/couchbase/.ccache/tmp/cb_pthread.tmp.
9515862b2292.83537.i:0 (libplatform.so.0.1.0+0x0000000041b1)
Previous read of size 1 at 0x7d180000c2f8 by thread T18:
#0 ExpiredItemPager::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/item_pager.cc:334 (ep.so+0x0000001053c6)
#1 ExecutorThread::run() /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:115 (ep.so+0x0000000f9c4c)
#2 launch_executor_thread(void*) /home/couchbase/jenkins/workspace/ep-engine-threadsanitizer-master/ep-engine/src/executorthread.cc:33 (ep.so+0x0000000f9815)
#3 platform_thread_wrap /home/couchbase/.ccache/tmp/cb_pthread.tmp.
9515862b2292.83537.i:0 (libplatform.so.0.1.0+0x0000000041b1)
Change-Id: Iebfe280c95847ee80b2d80d08b0eb340f40663d9
Reviewed-on: http://review.couchbase.org/55935
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
Reviewed-on: http://review.couchbase.org/56066
Dave Rigby [Wed, 7 Oct 2015 11:23:39 +0000 (11:23 +0000)]
MB-16500 [BP]: Fix potenial deadlock around Connmap::relaseLock / connLock
As reported by ThreadSanitizer (see below), we have a lock inversion
creating a potential deadlock in Connmap, related to how we shutdown
connections:
There exists a cycle in lock order graph:
M2176 (Connmap::releaseLock in ConnMap::notifyPausedConnection() connmap.cc:235) =>
M128093 (mock_server::conn_struct::mutex) =>
M2177 (Connmap::connsLock in TapConnMap::shutdownAllConnections(), connmap.cc:770) =>
M2176 (Connmap::releaseLock in TapConnMap::shutdownAllConnections(), connmap.cc:777) DEADLOCK!
The problem appears to be that in TapConnMap::shutdownAllConnections()
we first acquire {connsLock}, then acquire {releaseLock}; all while
holding the cookie lock.
Fix is to drop releaseLock in shutdownAllConnections() once we've
released any references, *then* acquire the connLock to actually clear
out the connection map.
ThreadSanitizer report follows (irrelevent parts removed):
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=1087)
Cycle in lock order graph: M2176 (0x7d840001b810) => M128093 (0x7d280000efa0) => M2177 (0x7d840001b850) => M2176
Mutex M128093 acquired here while holding mutex M2176 in thread T10:
<cut>
Mutex M2176 previously acquired by the same thread here:
#0 pthread_mutex_lock <null> ()
...
#5 ConnMap::notifyPausedConnection() ep-engine/src/connmap.cc:235 ()
<cut>
Mutex M2177 acquired here while holding mutex M128093 in main thread:
#0 pthread_mutex_lock <null> ()
...
#5 TapConnMap::newProducer() ep-engine/src/connmap.cc:378 ()
#6 EventuallyPersistentEngine::createTapQueue() ep-engine/src/ep_engine.cc:2663:23 ()
<cut>
Mutex M128093 previously acquired by the same thread here:
#0 pthread_mutex_lock <null> ()
#1 cb_mutex_enter platform/src/cb_pthreads.c:115:14 ()
#2 lock_mock_cookie memcached/programs/engine_testapp/mock_server.c:436:4 ()
#3 test_tap_stream() ep-engine/tests/ep_testsuite.cc:6751:5 ()
#4 execute_test() memcached/programs/engine_testapp/engine_testapp.cc:1090:19 ()
#5 main memcached/programs/engine_testapp/engine_testapp.cc:1439 ()
Mutex M2176 acquired here while holding mutex M2177 in main thread:
#0 pthread_mutex_lock <null> ()
...
#5 TapConnMap::shutdownAllConnections() ep-engine/src/connmap.cc:777 ()
#6 EventuallyPersistentEngine::destroy() ep-engine/src/ep_engine.cc:2190:9 ()
<cut>
Mutex M2177 previously acquired by the same thread here:
#0 pthread_mutex_lock <null> ()
...
#5 TapConnMap::shutdownAllConnections() ep-engine/src/connmap.cc:770 ()
#6 EventuallyPersistentEngine::destroy() ep-engine/src/ep_engine.cc:2190:9 ()
<cut>
Change-Id: Ic9a4f028b202277729df025333ce630be056903d
Reviewed-on: http://review.couchbase.org/55863
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-on: http://review.couchbase.org/56064
abhinavdangeti [Tue, 6 Oct 2015 00:47:23 +0000 (17:47 -0700)]
MB-16500 [BP]: Address possible data race in Notifiable: setSuspended
WARNING: ThreadSanitizer: data race (pid=19078)
Write of size 1 at 0x7d600000f078 by thread T10 (mutexes: write M21717):
#0 Notifiable::setSuspended(bool) /home/abhinav/couchbase/ep-engine/src/tapconnection.h:764 (ep.so+0x00000005fe6a)
#1 TapProducer::suspendedConnection_UNLOCKED(bool) /home/abhinav/couchbase/ep-engine/src/tapconnection.cc:717 (ep.so+0x00000013b24e)
#2 ExecutorThread::run() /home/abhinav/couchbase/ep-engine/src/executorthread.cc:112 (ep.so+0x0000000f89f6)
#3 launch_executor_thread(void*) /home/abhinav/couchbase/ep-engine/src/executorthread.cc:33 (ep.so+0x0000000f8595)
#4 platform_thread_wrap /home/abhinav/couchbase/platform/src/cb_pthreads.c:23 (libplatform.so.0.1.0+0x000000003d31)
Previous read of size 1 at 0x7d600000f078 by main thread (mutexes: write M21715):
#0 Notifiable::isSuspended() /home/abhinav/couchbase/ep-engine/src/tapconnection.h:768 (ep.so+0x0000000dfdf2)
#1 EventuallyPersistentEngine::walkTapQueue(void const*, void**, void**, unsigned short*, unsigned char*, unsigned short*, unsigned int*, unsigned short*) /home/abhinav/couchbase/ep-engine/src/ep_engine.cc:2531 (ep.so+0x0000000b7ecc)
#2 EvpTapIterator(engine_interface*, void const*, void**, void**, unsigned short*, unsigned char*, unsigned short*, unsigned int*, unsigned short*) /home/abhinav/couchbase/ep-engine/src/ep_engine.cc:1440 (ep.so+0x0000000d5ad7)
#3 mock_tap_iterator(engine_interface*, void const*, void**, void**, unsigned short*, unsigned char*, unsigned short*, unsigned int*, unsigned short*) /home/abhinav/couchbase/memcached/programs/engine_testapp/engine_testapp.cc:81 (engine_testapp+0x0000000bbda2)
#4 test_tap_ack_stream(engine_interface*, engine_interface_v1*) /home/abhinav/couchbase/ep-engine/tests/ep_testsuite.cc:7353 (ep_testsuite.so+0x0000000504a7)
#5 execute_test(test, char const*, char const*) /home/abhinav/couchbase/memcached/programs/engine_testapp/engine_testapp.cc:1090 (engine_testapp+0x0000000b946c)
#6 __libc_start_main /build/buildd/eglibc-2.19/csu/libc-start.c:287 (libc.so.6+0x000000021ec4)
Change-Id: I596c93c048767911b052861193822ca17270a5dd
Reviewed-on: http://review.couchbase.org/55792
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-on: http://review.couchbase.org/56063
abhinavdangeti [Mon, 5 Oct 2015 22:11:24 +0000 (15:11 -0700)]
MB-16500 [BP]: Address possible data race in ConnHandler: lastWalkTime
WARNING: ThreadSanitizer: data race (pid=26185)
Write of size 4 at 0x7d4c0000a154 by main thread:
#0 ConnHandler::setLastWalkTime() /home/abhinav/couchbase/ep-engine/src/tapconnection.h:356 (ep.so+0x000000065641)
#1 EvpDcpStep(engine_interface*, void const*, dcp_message_producers*) /home/abhinav/couchbase/ep-engine/src/ep_engine.cc:1478 (ep.so+0x0000000b4d5b)
#2 mock_dcp_step(engine_interface*, void const*, dcp_message_producers*) /home/abhinav/couchbase/memcached/programs/engine_testapp/engine_testapp.cc:476 (engine_testapp+0x0000000baf95)
#3 dcp_stream(engine_interface*, engine_interface_v1*, char const*, void const*, unsigned short, unsigned int, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, int, int, int, int, bool, bool, unsigned char, bool, unsigned long*, bool) /home/abhinav/couchbase/ep-engine/tests/ep_testsuite.cc:4164 (ep_testsuite.so+0x0000000990df)
#4 test_dcp_producer_stream_req_dgm(engine_interface*, engine_interface_v1*) /home/abhinav/couchbase/ep-engine/tests/ep_testsuite.cc:4564 (ep_testsuite.so+0x000000077634)
#5 execute_test(test, char const*, char const*) /home/abhinav/couchbase/memcached/programs/engine_testapp/engine_testapp.cc:1090 (engine_testapp+0x0000000b946c)
#6 __libc_start_main /build/buildd/eglibc-2.19/csu/libc-start.c:287 (libc.so.6+0x000000021ec4)
Previous read of size 4 at 0x7d4c0000a154 by thread T10 (mutexes: write M1367):
#0 ConnHandler::getLastWalkTime() /home/abhinav/couchbase/ep-engine/src/tapconnection.h:360 (ep.so+0x000000049cbe)
#1 ConnManager::run() /home/abhinav/couchbase/ep-engine/src/connmap.cc:150 (ep.so+0x00000005031e)
#2 ExecutorThread::run() /home/abhinav/couchbase/ep-engine/src/executorthread.cc:112 (ep.so+0x0000000f8746)
#3 launch_executor_thread(void*) /home/abhinav/couchbase/ep-engine/src/executorthread.cc:33 (ep.so+0x0000000f82e5)
#4 platform_thread_wrap /home/abhinav/couchbase/platform/src/cb_pthreads.c:23 (libplatform.so.0.1.0+0x000000003d31)
Change-Id: I2c5024afde6cb749aad901572bfd68734f6d7d5d
Reviewed-on: http://review.couchbase.org/55780
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-on: http://review.couchbase.org/56062
abhinavdangeti [Thu, 8 Oct 2015 21:10:14 +0000 (14:10 -0700)]
Merge remote-tracking branch 'couchbase/3.0.x' into 'sherlock'
couchbase/3.0.x:
* MB-16357: Interlock expiry and vbucket state changes
Change-Id: Ie4ba4138d23a4b811bd43ee182a641c99b5896ad
Jim Walker [Thu, 1 Oct 2015 16:12:19 +0000 (17:12 +0100)]
MB-16357: Interlock expiry and vbucket state changes
Expiry should only occur whilst the vbucket is active.
Background tasks performing expiry deletion must stop
driving deletions when the vb changes status to !active.
Using a reader/writer lock the core deleteExpiredItem
function which is used by both compaction driven expiry
and the item pager are now interlocked with vbucket::setState()
Change-Id: I19d30c3d7855778613ccb4534a042c0daf627b8c
Reviewed-on: http://review.couchbase.org/55646
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: buildbot <build@couchbase.com>
abhinavdangeti [Fri, 2 Oct 2015 20:36:21 +0000 (13:36 -0700)]
Merge remote-tracking branch 'couchbase/3.0.x' into sherlock
couchbase/3.0.x:
* MB-16434: In consumer stream get bytes cleared atomically.
* MB-16434: In setDead release streamMutex before cleaning up stream buffer
* [BP] MB-14382 Increase the default number of ht_locks by a factor of approx. 10
* MB-16421: BGFetch to restore an item that is non-resident
* MB-16402: Ensure objectregistry has an engine when deleting the VBucketMemoryDeletionTask.
Change-Id: I56bf61ddb324f2472682f3b3ac7f86e81e3f6636
Manu Dhundi [Thu, 1 Oct 2015 22:38:14 +0000 (15:38 -0700)]
MB-16434: In consumer stream get bytes cleared atomically.
When a comsumer stream is set to dead we clear the consumer buffer and
notify the producer of the number of bytes cleared. Clearing the
consumer buffer and the accounting of the bytes cleared should be done
atomically
Change-Id: I602d5307c6c2bbd3dc7f03f1d6c43cbe294389ac
Reviewed-on: http://review.couchbase.org/55708
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Manu Dhundi [Thu, 1 Oct 2015 22:35:47 +0000 (15:35 -0700)]
MB-16434: In setDead release streamMutex before cleaning up stream buffer
To avoid lock order inversion in DCP passive stream we must release streamMutex
before acquiring bufMutex. This is because while processing incoming mutations
on dcp consumer we acquire bufMutex first and then streamMutex.
Change-Id: I129d014dc3a7ec91416af04851419782b89cea23
Reviewed-on: http://review.couchbase.org/55707
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Chiyoung Seo [Fri, 27 Mar 2015 21:58:27 +0000 (14:58 -0700)]
[BP] MB-14382 Increase the default number of ht_locks by a factor of approx. 10
This change is backported from sherlock.
From the recent perf test results, we observed the lock contention on hash table
buckets when a hash table scanning task (i.e., expiry pager, item pager,
accecss scanner, etc.) is running.
As a long term solution, we may need to implement resizing the number of
hash table locks dynamically at runtime.
Change-Id: Ic7b9f951f58fe7190d8d0d23fb62979057e545ac
Reviewed-on: http://review.couchbase.org/48869
Reviewed-by: Michael Wiederhold <mike@couchbase.com>
Tested-by: Chiyoung Seo <chiyoung@couchbase.com>
Reviewed-on: http://review.couchbase.org/55705
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
abhinavdangeti [Thu, 1 Oct 2015 00:40:14 +0000 (17:40 -0700)]
MB-16421: BGFetch to restore an item that is non-resident
In a full eviction scenario, a bgfetch is to restore an item
not just if the hash table item is a temp-initial item, but
if the hash table item is non-resident as well.
Change-Id: I127b0cbe7034133a656b046cd4022635be23aedd
Reviewed-on: http://review.couchbase.org/55680
Tested-by: buildbot <build@couchbase.com>
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>