This is a case where I got actively involved only on the second day to complete my investigation, after recognizing my own negligence and assumptions. The incident stopped the testing for a day.
A day with no testing and no programming is a cost that brings no benefit to the business! I heard the same from the business team.
What happened in the case
One of my fellow testers had been given a project assignment, and the expectation was to carry out a performance test. The latest release had been deployed and was used for the testing. While the tester started the task from the tester's desk, the business team started using the product from their place.
The tester noticed the product crashing after a few actions. There was no clue as to why; the only message available said, "Something went wrong!" Seeing this, the tester rolled back the latest installation and did a fresh installation. Yet the same behavior and message appeared. This made the tester switch to a different test setup, which showed the same behavior and message.
By this time, I had been told about this behavior.
Observing what happened
The fellow tester walked me through the context of the task, its priority and the expected time to finish it. Following that walkthrough, the tester described what had happened in the test environment and the observations recorded for the crash.
The tester then said that the business team was not facing this behavior, but in the test environment the product crashed after a few actions following launch.
Hearing this, we wanted to make sure the product version and hardware setup were similar, or close enough for a fair comparison. Everything looked the same in terms of product version and setup.
I asked for the log and noticed parsing failing at the client end. Running through the stack trace in the log, it was evident to me that there was a problem in processing the data. Which data? That was the question. The stack trace said the data was wrong. But the tester's claim was that the same data was being used by the testing team and the business team. Then why did only the tester experience the crash?
A product crashing is a common sight. The interesting aspect of a crash is knowing its root cause. I said this to the tester and asked the tester to observe the data transmitted and update me on it.
How negligent I was here!
I knew, from the investigation so far, that we were very close to knowing the root cause of the problem. But from there, after advising what to do next in the investigation and analysis, I moved on to my own practice. This is the mistake I made. I should have joined the tester in completing it.
But that next part of the investigation did not happen for the whole day, and it never came back to me. I heard the business team was using the product without any problem. At the least, I should have inquired with the engineering team, which I did not.
Next day's fresh listening
I noticed the discussions between the business and engineering teams. The calculation was about the time lost with no testing and no clue about the crash. After a few minutes, I heard from the tester, "It works if I choose a different network." It had been tried in the morning, and since it worked, the other network line was being used.
That shook me, and I looked straight at the engineering team. I asked, "How? From the network information I see here, that should not be a problem at all. It should be something else, not the network."
I could not convince myself that changing the network would solve the crash. The product simply did not exhibit the behavior, for some reason. What was the reason? I started to converse with the product again.
Going back to the undone
I requested the data captured by monitoring the traffic between the client and server. This was the part of the investigation which was supposed to continue. My bad: I assumed this had been done and that the engineering team saw no problem there. I did not think of asking for it and looking at it on the previous day, because I assumed.
This time I was very keen to look, and requested that an environment be set up for monitoring the data. We used different networks and noted the data sent, received and parsed.
What's the problem?
From the previous day's investigation it was evident to me that the client could not parse the data and assign it to the object. The consequence was a Null Pointer Exception and the product crashing.
Now, with the data, I started analyzing every bit of it and reading the log line by line. The problem is that the product is unable to handle the data at this transition point if it is not from the server. But why? Isn't that an expectation? I left the question with the engineering team for their discussion.
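To make the failure mode concrete, here is a minimal sketch in Python of a client that checks a response before assigning it to an object, so that an unexpected body becomes a handled case and not a null dereference. The `Session` type, the `user` and `token` fields and the JSON format are assumptions for illustration only; the actual product's payload and language are not described here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    # Hypothetical object the client fills from the server's reply.
    user: str
    token: str


def parse_session(body: str) -> Optional[Session]:
    """Return a Session, or None when the body is not the expected payload."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return None  # not JSON at all, e.g. a page sent by an intermediary
    if not isinstance(data, dict):
        return None
    user, token = data.get("user"), data.get("token")
    if not isinstance(user, str) or not isinstance(token, str):
        return None  # JSON, but missing the fields the client depends on
    return Session(user=user, token=token)


def greeting(body: str) -> str:
    session = parse_session(body)
    if session is None:
        # A specific, recoverable message; no attribute access on None.
        return "Unexpected response; please check the network and retry."
    return f"Welcome, {session.user}"
```

With a guard like this, a body that did not come from the server would produce a message that points at the response, which is more useful to a tester than "Something went wrong!".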
The request from the client did not reach the server at all. The web monitoring and filtering system (WMFS) in the network did not let the request reach the server. The response came from the WMFS to the client.
The other mystery to learn was how it could work on one network and not on another while the service line is the same. On one network, all of the first three requests got responses from the WMFS. On the other network, only the first request got a response from the WMFS, while the second and third responses came from the server. This showed that, though the first request fails, the second and third requests are crucial.
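The comparison between the two networks can be written down as a small tally. The sketch below, in Python, encodes only what was observed for the first three requests; the names `network_a` and `network_b` and the `"wmfs"`/`"server"` labels are hypothetical placeholders, not values taken from the monitoring tool.

```python
from collections import Counter

# Origin of the response to each of the first three requests, per network,
# as observed during monitoring. Network names are placeholders.
captures = {
    "network_a": ["wmfs", "wmfs", "wmfs"],
    "network_b": ["wmfs", "server", "server"],
}


def summarize(origins):
    """Count how many responses came from the server and from the WMFS."""
    counts = Counter(origins)
    return {"from_server": counts["server"], "from_wmfs": counts["wmfs"]}


def first_server_response(origins):
    """1-based index of the first request answered by the server, or None."""
    for index, origin in enumerate(origins, start=1):
        if origin == "server":
            return index
    return None


for name, origins in captures.items():
    print(name, summarize(origins), first_server_response(origins))
```

Laying the captures side by side like this makes the difference visible at a glance: on one network no request is ever answered by the server, while on the other the server answers from the second request onward.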
Beyond the corners of the problem
I see no harm in having a WMFS in place. A customer who procures the product can have a WMFS in place. The product cannot tell its users not to use a WMFS.
The product (client module) should be able to parse the response and have an identity mechanism, with a server token, to verify the data it receives, whether it comes from the server or from other sources.
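One possible shape for such an identity mechanism is a signature that only the server can produce, which the client checks before parsing. The Python sketch below uses an HMAC over the response body. This is my assumption of how a "server token" could work, not the product's design, and the shared key and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret provisioned to both client and server (an assumption).
SHARED_KEY = b"example-key-not-for-production"


def sign(body: bytes, key: bytes = SHARED_KEY) -> str:
    """Signature the server would attach to each response body."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def is_from_server(body: bytes, signature: str, key: bytes = SHARED_KEY) -> bool:
    """True only when the signature matches the body under the shared key."""
    if not signature:
        return False  # an intermediary's response carries no signature
    expected = sign(body, key)
    # Constant-time comparison on bytes, so any input string is accepted.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

A client using a check like this would parse only the bodies that pass `is_from_server`, and treat everything else as a response from some other source to be reported or retried.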
Learning
For the engineering team, it is an alert to handle data in every corner that can be identified. For me, the learning was not to be negligent: do not hand over an investigation unless you follow up and ask for an update soon after.
For the business team, the learning is to follow up on an investigation once it has started and see it through to closure.