Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On POST from Mayfly occasionally loosing data #483

Closed
neilh10 opened this issue Jul 23, 2020 · 12 comments
Closed

On POST from Mayfly occasionally loosing data #483

neilh10 opened this issue Jul 23, 2020 · 12 comments

Comments

@neilh10
Copy link

neilh10 commented Jul 23, 2020

While doing some new Mayfly development, extending the functions so that there is a sendEveryX readings, some POSTED individual readings appears to be lost, while a record with timestamp is still accepted.
This sendEveryX feature allows the Mayfly data logger to collect samples, and then periodically setup a connection to the internet, and send all the previously connected data samples on one TCP/IP connection.
edit :
Looking at the code the algorithm is

//establish a connection to internet over modem
repeat while there are SensorsReadings Queued
{
_outClient->connect(enviroDIYHost, enviroDIYPort // establish a TCP/IP connection,
then POST it
_outClient->readBytes(tempBuffer, REQUIRED_MIN_RSP_SZ); // read any response
_outClient->stop(); //tear down TCP/IP connection
}
// Turn off modem
edit_end

delete , only one POST was made per TCP/IP connection. end_delete

From the Mayflys functionality each POST is a unique event.

I would expect that either the whole record is accepted, or the whole record is lost.
At this point, I've not checked what HTTP Status code the Mayfly received for the partial record.

Just wondering if there is some thing I should be doing differently. Thanks.

Test Detail

With this test run the Mayfly, it is setup to take a minimal set of readings every 2minutes, and sendEveryX=2.
So does two POSTS, in quick succession, every 4 minutes, with either 5secs between POSTS (timeout no response from server), or if an ack was received, HTTP Status 201, an immediate 2nd post.

For an overnight test run from 2020-07-23 0:26 (line 4732) to 2020-07-23 19:12 (line 5292) - attached XLS
there are 12 data losses at

2020-07-23 1:04 2020-07-23 2:04 2020-07-23 2:08 ..

A standard post with data sequence is SampleNumber, Vbat, RSSI
and Mayfly debug lists this on a sample POST

POST /api/data-stream/ HTTP/1.1
Host: data.envirodiy.org
TOKEN: e47f83b1-51c6-4be9-9951-c0029849a39b
Content-Length: 228
Content-Type: application/json

{"sampling_feature":"bdcb95ef-097e-4013-8c5f-73810e4bba55","timestamp":"2020-07-23T11:50:00-08:00","199c4833-482b-43af-a0da-8db5fdc64871":4,"2ab799ea-0ff5-4798-b9a6-9549345f7310":4.685,"ac3c5798-2004-4fc3-9679-111977c6b0a8":-30}

and for the above sequence it received a
HTTP/1.1 2014-4fc3-9679-111977c6b0a8

which is take to be a successful POST. Sometimes there is no response from the POST and times out after 5seconds.

The target is https://monitormywatershed.org/sites/tu_rc_test05/ and downloaded data is

tu_rc_test05_TimeSeriesResults 200723_1215b.xlsx

@SRGDamia1
Copy link
Member

I would expect that either the whole record is accepted, or the whole record is lost.

That's what I would have expected as well.

@tslawecki
Copy link
Collaborator

@neillh10, we are currently having some difficulties with server resources that might be causing unexpected drops. I'd like to check the server logs - could you share the IP address from which the data are POST'd?

@neilh10
Copy link
Author

neilh10 commented Jul 23, 2020

The setup is a Digi WiFi module, requesting off a local network. that then connects to a DSL. The ISP is sonic.net.
This is what I get :
WhatIsMyIp 2020-07-23 161913

@neilh10
Copy link
Author

neilh10 commented Jul 23, 2020

Hi @tslawecki thanks for the quick response.
Here is a more recent POSTSLOG recorded from the Mayfly - I had a bug in last nights recording so couldn't post it.
The format is ,,<Internal #of Post attempts>, < Posted Record essential time variant >
The plan is as per
EnviroDIY/ModularSensors#194
and also to implement an OffsetTime.

Right now to facilitate Mayfly integration testing its on a simple accelerated 2min between readings, and then accumulate 2readings and post them. So its on a fairly async 4min schedule.
If it would help on the server insights I could attempt to shift it to a time frame when there perhaps aren't as many hits on the server. Just let me now

A POSTLOG record is
2020-07-23T21:32:10-08:00,201,0,1595511000,1,4.685,-31

so the core Time above the record 1595511000,1,4.685,-31 would have been formated on the outgoing post as , sequenceNumber, Vbat, RSSI
this POST was at 2020-07-23T21:32:10-08:00
and received a Http Status 201 (good)
and it would be the first attempt '0'

So would be no further attempts made.

Currently there are NO repeat attempts, though that is what I am now working on, and was hoping to run it over the weekend. I'm away as from tomorrow for a few days.

My algorithm, (unless you suggest otherwise) is if I don't get a Http Status 201, is to store the record, for a later repeat attempt.
Then the Mayfly on its regular internet access schedule, after attempting a POST on the most recent data, if it gets a 201, it will then run through all the previous unsuccessful attempts. That is they will be out of time order.

From past experience with wireless systems, there can be times when there is a fog or other atmospheric conditions where there is no wireless connection to the cell tower. Then of course when the weather clears up, the connection is established.

POSTLOG_200723_1557.TXT

@neilh10
Copy link
Author

neilh10 commented Jul 24, 2020

I've added the following comment to the description
edit :
Looking at the code the algorithm is

//establish a connection to internet over modem
repeat while there are SensorsReadings Queued
{
_outClient->connect(enviroDIYHost, enviroDIYPort // establish a TCP/IP connection,
then POST it
_outClient->readBytes(tempBuffer, REQUIRED_MIN_RSP_SZ); // read any response
_outClient->stop(); //tear down TCP/IP connection
}
// Turn off modem
edit_end

@neilh10
Copy link
Author

neilh10 commented Jul 24, 2020

I had some overnight tests running - POSTing every 14minutes with a 3min delay. That is 2min readings, and collecting 7samples before attempting the POST.
The test results show the POSTs all went through, but there was some individual data losses. See attached tu_rc_test05_TimeSeriesResults 200724_0703.xlsx
All posts appear to be partially completed

tu_rc_test05_TimeSeriesResults 200724_0703.xlsx

Debug log of the start of the test run.
tty200723-2228quedOvernight.txt

I'm planning on putting on a test that is the target for the hydrologists - samples every 15minutes, take 8 samples, then upload (every 2hours). I'll try an offset of 7minutes to shift it off the the 15minute interval.

@neilh10
Copy link
Author

neilh10 commented Jul 28, 2020

In a test run last night 404 POST attempts, spaced as 2 POSTS every 4minutes, 399 successfully pushed ALL the records. 4 of the POSTS failed to be recorded at ALL (success, not a partial record).
The Mayfly was set up to take a reading every two minutes, post after 2 readings, and offset the server connections by 1minute. The mayfly is using a Digi WiFi S6 modem, with a Lithium battery, and 5V charging.

The testing time was from UTC 2020-07-28 2:54 to 2020-07-28 16:20 ~ my IP still shows as the same above

However starting from 2020-07-28 6:54, none of the POSTs received a response HTTP Status 201. But of these 205 POST attempts, all 201 were actually fully inserted into the database, with 4 not being there. I can tell from the EnviroDIY_Mayfly_SampleNum

The 4 that never made it into the server database, where all at the same time between UTC (server time) 2020-07-28 6:50 and 2020-07-28 7:02

The none HTTP Status 201, were largely timeouts ( no response received from server).
Two responses were recorded by the Mayfly as 502, and where never inserted into server database at (this is UTC Mayfly) time.
That is EnviroDIY_Mayfly_SampleNum 123 and 123
2020-07-28T07:01:10-08:00,502,0,1595890680,123,4.730,0
2020-07-28T07:01:11-08:00,502,0,1595890800,124,4.730,0

Taken from the Mayfly POSTLOG (Mayfly POST time, followed by HTTP status)
2020-07-28T06:45:11-08:00,201,0,1595889840,116,4.730,-32
2020-07-28T06:49:14-08:00,201,0,1595889960,117,4.730,-31
2020-07-28T06:49:16-08:00,201,0,1595890080,118,4.730,-31
2020-07-28T06:53:15-08:00,504,0,1595890200,119,4.730,-33
2020-07-28T06:53:22-08:00,504,0,1595890320,120,4.730,-33 <--- Missing from server download
2020-07-28T06:57:15-08:00,504,0,1595890440,121,4.730,0
2020-07-28T06:57:22-08:00,504,0,1595890560,122,4.730,0 <--- Missing from server download
2020-07-28T07:01:10-08:00,502,0,1595890680,123,4.730,0 <--- Missing from server download
2020-07-28T07:01:11-08:00,502,0,1595890800,124,4.730,0 <--- Missing from server download
2020-07-28T07:05:15-08:00,504,0,1595890920,125,4.730,-32
2020-07-28T07:05:22-08:00,504,0,1595891040,126,4.730,-32

In case its useful, I restarted the test this morning UTC Mayfly 2020-07-28T17:19:00-08:00 , all readings with retrys successfully delivered.

For testing purposes, I inspect my private test log, and of course I love recording successes!!!.

The new test format is take readings 2minutes apart, post after 8 readings, with an offset of 1minute.
So it will attempt to POST 8 readings every 16minutes, on an odd minute boundary.

The first 8 POSTs are attempted whatever the http response is.
If the last POST receives a 201, then it attempts to repeat all the previous POST attempts that did not receive a 201, until a non-201 is received.

This test record is
2020-07-28T17:19:00-08:00 all POSTED with status 201
2020-07-28T17:35:00-08:00 all status 201

2020-07-28T17:51:00-08:00 first 5 POSTs, 504, but then received 201s. Then immediately repeated the 5 failed POSTs with success.
(hurrah another new functionality code test case proven!!)

Then all 8 POSTs status 201 on the following attempts
2020-07-28T18:07:00-08:00
2020-07-28T18:39:00-08:00
2020-07-28T18:55:00-08:00
2020-07-28T19:11:00-08:00

@aufdenkampe
Copy link
Member

@neilh10, this detailed info is very useful!

@tslawecki, any thoughts?

@neilh10
Copy link
Author

neilh10 commented Jul 29, 2020

Oops just did POSTS from a Mayfly I'm preparing for the field with a Xbee3 LTE Verizon, and it inserted a partial dataload.

Its target is https://monitormywatershed.org/sites/tu_rc_test04/
The system is setup for taking a reading every 15min, collecting 8 readings, then delaying 7min to POST.

Unfortunately I can't see what the IP number (I'll output it next build) - and the RevX Systems interface is very simple.

Here is the listing with Mayfly_SampleNum at end

2020-07-29 9:45 | -8:00 | 2020-07-29 17:45 | 0.16198 | 0.0257 | 18.35 | 15.42 |   |   |   |   | -81 | 5.14 | 4
2020-07-29 10:00 | -8:00 | 2020-07-29 18:00 | 0.16166 | 0.0249 | 19.03 | 15.83 |   |   |   |   | -81 | 5.109 | 5
2020-07-29 10:15 | -8:00 | 2020-07-29 18:15 | 0.16244 |   |   |   |   |   |   |   |   |   |   <<<<< partial
2020-07-29 10:30 | -8:00 | 2020-07-29 18:30 | 0.15887 | 0.0212 | 20.74 | 17.19 |   |   |   |   | -81 | 5.049 | 7
2020-07-29 10:45 | -8:00 | 2020-07-29 18:45 | 0.16115 | 0.0244 | 23.1 | 18.53 |   |   |   |   | -81 | 5.064 | 8

tu_rc_test04_TimeSeriesResults_200729_1353.xlsx

This is the series of 8 posts as seen from the Mayfly debug log, Field EnviroDIY_Mayfly_SampleNum (Seq), 1->8

... zzzZZ Processor awake @ 1596019920 <--LoggerBase
2020-07-29T18:52:00-08:00 log&Pub V_batt 4.97 Lbatt= 4
Current Unix Timestamp: 1596048720 -> 2020-07-29T18:52:00-08:00 <--LoggerBase
sendOffset Post Readings <--LoggerBase

Waking up Digi XBee3 Cellular LTE-M ... <--LoggerBase
Connecting to the Internet... <--LoggerBase

Connecting client <--EnviroDIYPublisher
Client connected after 1673 ms <--EnviroDIYPublisher
Filled from SD QUE in 16 ms <--EnviroDIYPublisher
POST /api/data-stream/ HTTP/1.1
Host: data.envirodiy.org
TOKEN: 30f562fc-f9bd-4884-97b7-3c9f7b942087
Content-Length: 411
Content-Type: application/json

{"sampling_feature":"61055766-be1e-4f49-a509-bcdf28ec183e","timestamp":"2020-07-29T09:00:00-08:00","1da9af8c-168b-42e1-a0f7-e405bf1a307c":1,"49c86561-d83e-454b-928a-cc961f445083":5.125,"93eb7647-14c3-40a7-8248-06e84680ff6c":0.16139,"e7cd9faa-398b-4c0d-bcf5-ef6bf371c159":16.87,"081a8bad-1876-4dc5-aeb6-6ef1883fd410":14.28,"193286f0-7ee8-44f4-9488-cb8ddc2cf54f":0.0254,"381bdd6a-f39f-465a-9f64-ddfcbc0c3e7b":-81}

Client waited 2470 ms. Stopped after 316 ms <--EnviroDIYPublisher
Rsp:' HTTP/1.1 201f-465a-9f64-ddfcbc0c3e7b ' <--EnviroDIYPublisher
-- Response Code -- 201

POST seq2 reading & -- Response Code -- 504
POST seq3 reading & -- Response Code -- 504
POST seq4 reading & -- Response Code -- 504
POST seq5 reading & -- Response Code -- 504

POST seq6 that is partial insertion
Connecting client <--EnviroDIYPublisher
Client connected after 247 ms <--EnviroDIYPublisher
Filled from SD QUE in 17 ms <--EnviroDIYPublisher
POST /api/data-stream/ HTTP/1.1
Host: data.envirodiy.org
TOKEN: 30f562fc-f9bd-4884-97b7-3c9f7b942087
Content-Length: 411
Content-Type: application/json

{"sampling_feature":"61055766-be1e-4f49-a509-bcdf28ec183e","timestamp":"2020-07-29T10:15:00-08:00","1da9af8c-168b-42e1-a0f7-e405bf1a307c":6,"49c86561-d83e-454b-928a-cc961f445083":5.109,"93eb7647-14c3-40a7-8248-06e84680ff6c":0.16244,"e7cd9faa-398b-4c0d-bcf5-ef6bf371c159":19.88,"081a8bad-1876-4dc5-aeb6-6ef1883fd410":16.38,"193286f0-7ee8-44f4-9488-cb8ddc2cf54f":0.0249,"381bdd6a-f39f-465a-9f64-ddfcbc0c3e7b":-81}

Client waited 5002 ms. Stopped after 314 ms <--EnviroDIYPublisher
-- Response Code -- 504

@neilh10
Copy link
Author

neilh10 commented Jul 30, 2020

I've put in some code that lists the Xbee LTE modem IP.
The above LTE modem was subsequently was assigned an IP [ 100.102.218.146 ] twice subsequently. Hopefully it was also assigned that number above.

@aufdenkampe
Copy link
Member

@neilh10, it would be very helpful to hear what you are experiencing since we found a fix for the intermittent Gateway server errors, as described in #484 (comment).

@neilh10
Copy link
Author

neilh10 commented Aug 18, 2020

After some extensive testing this last week, I'm seeing complete delivery of the whole records. No Partial records. Hurrah.
In a series of tests had 7000+ POSTS (with retrys hard to count) that ran from 2020-08-07 11:46 to 2020-08-18 17:02

@neilh10 neilh10 closed this as completed Aug 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants